very fast processing of large XML files
The <xml>cmp tools are optimized for processing large XML files. They are very fast and need little memory.
Performance depends on the hardware configuration.
Several performance measurements follow.
Comparing two 1,400 MB XML files takes only 6 minutes.
(Configuration: AMD Opteron Processor 244, 2 processors, 7 GB memory, 2 RAID-10 disk arrays)
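As a rough reading of this figure, the following sketch converts it into an average input throughput. It reads the file size as 1,400 MB and counts only the two input files, not the work files written during processing, so it understates the real disk traffic.

```python
# Rough input throughput implied by the measurement above.
# Assumption: only the two input files are counted, not the work files.
FILE_SIZE_MB = 1400
FILE_COUNT = 2
ELAPSED_MINUTES = 6


def input_throughput_mb_per_s(file_size_mb, file_count, elapsed_minutes):
    """Average input volume processed per second of elapsed time."""
    total_mb = file_size_mb * file_count
    return total_mb / (elapsed_minutes * 60)


rate = input_throughput_mb_per_s(FILE_SIZE_MB, FILE_COUNT, ELAPSED_MINUTES)
print(f"{rate:.1f} MB/s")  # prints "7.8 MB/s"
```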
The following parameters are important for good performance:
- number of CPUs
- performance of the CPUs
- performance of the disks
- amount of memory
<xml>cmp works with several threads, so more processors improve performance.
Disk performance is very important, because the files are written and sorted several times in work files.
A large memory is not necessary to process large files. However, a large memory allows large parts of the file system to be cached, which has a very positive effect on performance.
Increasing the Java heap size parameter may improve performance, but only by a few percent. (See the metrics below.)
The tests were run on the following hardware configurations:
|system||Linux 2.6.9-18.104.22.168 unsupportedsmp #1 SMP||Linux 2.4.21-27.0.2.ELsmp #1 SMP||Linux 2.4.20-4GB under Vmware|
|cpu||AMD Opteron(tm) Processor 248||AMD Opteron(tm) Processor 244||Intel(R) Pentium(R) 4 CPU 2.66GHz|
|cpu cache-size KB||1024||1024||512|
To give an impression of the general performance of these hardware configurations, here are some elapsed times for copying and sorting files:
copying of files:
sorting of files:
Sorting times are relatively high. The reason: the files are XML files, and every line contains only one element. The lines therefore carry little distinguishing information, so the sorting effort is relatively high.
The metrics are all based on this basic control file:
|example-content of test-files:|
|file-size||lines||count of xml-
The test files are sorted by "person@id". They are now to be sorted by "name", "firstname" and "residence".
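The re-sort on a composite key can be sketched as follows. This is a minimal in-memory illustration, not the tool's method (which sorts via work files on disk). The record layout and sample values are assumptions, since the source names only the key fields.

```python
from operator import itemgetter

# Hypothetical records; only the key names come from the text above.
persons = [
    {"id": "1", "name": "Smith", "firstname": "Zoe", "residence": "Berlin"},
    {"id": "2", "name": "Adams", "firstname": "Tom", "residence": "Munich"},
    {"id": "3", "name": "Smith", "firstname": "Amy", "residence": "Hamburg"},
]

# itemgetter with several keys returns a tuple, so records are compared
# by name first, then firstname, then residence.
by_new_key = sorted(persons, key=itemgetter("name", "firstname", "residence"))

print([p["id"] for p in by_new_key])  # prints ['2', '3', '1']
```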
The test files mentioned above differ in only a few places. They are now merged according to these rules: the result should contain all rows of both files. If elements of a row differ, the value from file2 should overwrite the value from file1.
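The merge rules above can be sketched as a small function. This is an illustration under assumptions: rows are modelled as dictionaries keyed by a row id, and the function name `merge_rows` and the sample data are hypothetical, not part of <xml>cmp.

```python
def merge_rows(rows1, rows2):
    """Keep all rows of both inputs; where elements differ, file2 wins."""
    # Copy file1 so the inputs are not modified.
    merged = {row_id: dict(row) for row_id, row in rows1.items()}
    for row_id, row in rows2.items():
        # Rows only in file2 are added; shared rows are updated per element.
        merged.setdefault(row_id, {}).update(row)
    return merged


# Hypothetical sample data: row "1" differs, "2" is only in file1,
# "3" is only in file2.
file1 = {
    "1": {"name": "Smith", "residence": "Berlin"},
    "2": {"name": "Adams", "residence": "Munich"},
}
file2 = {
    "1": {"name": "Smith", "residence": "Hamburg"},
    "3": {"name": "Jones", "residence": "Bonn"},
}

result = merge_rows(file1, file2)
print(sorted(result))             # prints ['1', '2', '3']
print(result["1"]["residence"])   # prints Hamburg
```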
For the performance measurements, the test files are regrouped into an XML structure that contains all residences, for every residence its streets, for every street its house numbers, and finally for every house number the persons living there.
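The regrouping described above amounts to building a four-level hierarchy from flat records. The sketch below shows the idea with nested dictionaries. The field names `street` and `housenumber` and the sample records are assumptions, and the real tool produces XML rather than Python objects.

```python
from collections import defaultdict


def regroup(persons):
    """Group flat records as residence -> street -> house number -> persons."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for p in persons:
        tree[p["residence"]][p["street"]][p["housenumber"]].append(p["name"])
    return tree


# Hypothetical flat input records.
persons = [
    {"name": "Smith", "residence": "Berlin", "street": "Main St", "housenumber": "1"},
    {"name": "Adams", "residence": "Berlin", "street": "Main St", "housenumber": "1"},
    {"name": "Jones", "residence": "Berlin", "street": "Park Rd", "housenumber": "7"},
    {"name": "Brown", "residence": "Munich", "street": "Main St", "housenumber": "2"},
]

tree = regroup(persons)
print(tree["Berlin"]["Main St"]["1"])  # prints ['Smith', 'Adams']
```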
|Example with converted data:|
fast disks improve performance
Disk performance is decisive, because the files are written several times to temp files. You can define up to four directories for these temp files via the shell variables TMPDIR, TMPDIR1, TMPDIR2 and TMPDIR3. You get the best performance if these temp directories are on file systems that reside on disks with their own disk controllers.
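One way to prepare these variables from a script is sketched below. Only the four variable names come from the text above. The helper name `temp_dir_environment` and the directory paths are hypothetical, and the command that launches the tool is deliberately omitted because it is not given here.

```python
import os

# Variable names as listed above, in the order they are assigned.
TEMP_VARS = ("TMPDIR", "TMPDIR1", "TMPDIR2", "TMPDIR3")


def temp_dir_environment(directories, base_env=None):
    """Return a copy of the environment with up to four temp directories set."""
    if len(directories) > len(TEMP_VARS):
        raise ValueError("at most four temp directories are supported")
    env = dict(os.environ if base_env is None else base_env)
    for var, directory in zip(TEMP_VARS, directories):
        env[var] = directory
    return env


# Hypothetical paths, ideally on disks with separate controllers.
env = temp_dir_environment(["/disk1/tmp", "/disk2/tmp"], base_env={})
print(env)  # prints {'TMPDIR': '/disk1/tmp', 'TMPDIR1': '/disk2/tmp'}
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting the tool.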
low influence of the Java heap size parameter
Increasing the Java heap size may yield small performance improvements. But even with a small Java heap, the achieved performance is good.
In the following measurements, two files of 700 MB each were compared. The two files have two differences. The maximum Java heap size was set explicitly with the "-Xmx" parameter.