performance

very fast processing of large xml-files

The <xml>cmp tools are optimized for processing large XML files. They are very fast and need little memory.

Performance depends on the hardware configuration.

The measurements below cover several operations on three hardware configurations.

Example:
Comparing two XML files of 1400 MB each takes only 6 minutes.
(Configuration: AMD Opteron Processor 244, 2 processors, 7 GB memory, 2 RAID-10-Diskarrays)

parameter

The following parameters are important for good performance:

  • number of CPUs
  • performance of the CPUs
  • performance of the disks
  • size of memory

Explanation:
<xml>cmp works with several threads, so more processors improve performance.
Disk performance is very important, because the files are written and sorted several times in work files.
A large memory is not necessary to process large files. However, a large memory can cache large parts of the file system, which has a very positive influence on performance.

Increasing the Java heap size may improve performance, but only by a few percent. (See the heap-size measurements at the end of this page.)

hardware-configurations

The tests were run on the following hardware configurations:

                    A                                             B                                 C
system              Linux 2.6.9-22.0.2.106 unsupportedsmp #1 SMP  Linux 2.4.21-27.0.2.ELsmp #1 SMP  Linux 2.4.20-4GB under VMware
cpu                 AMD Opteron(tm) Processor 248                 AMD Opteron(tm) Processor 244     Intel(R) Pentium(R) 4 CPU 2.66GHz
cpu MHz             2190.584                                      1792.254                          2659.303
cpu cache-size KB   1024                                          1024                              512
processors          2                                             2                                 1
memory KB           8,030,056                                     7,956,884                         514,952


To give an impression of the general performance of these hardware configurations, here are some elapsed times for copying and sorting files:

copying of files:

file-size   A        B        C
100 MB      1s       2s       3s
300 MB      5s       5s       24s
700 MB      11s      22s      48s
1400 MB     23s      1m40s    1m56s

sorting of files:

Sorting times are relatively high. The reason: the files are XML files in which every line contains only one element. Individual lines therefore carry little distinguishing information, so the sorting effort is relatively high.

file-size   rows         A        B        C
100 MB      2,752,921    2m22s    2m23s    18s
300 MB      8,260,808    7m51s    8m01s    1m23s
700 MB      19,929,075   19m10s   21m14s   3m30s
1400 MB     39,858,146   40m12s   44m37s   1h40m47s


performance-metrics

The metrics are all based on this basic control file:

basic-control-file:
<?xml version="1.0" encoding="UTF-8"?>
<delivery>
    <list_person>
        <person ident_att_id="true">
            <name cmp_text="true" />
            <firstname cmp_text="true" />
            <typ cmp_text="true" />
            <output cmp_text="true" />
            <id cmp_text="true" />
            <valid_since cmp_text="true" />
            <valid_until cmp_text="true" />
            <timestamp cmp_text="true" />
            <delivery_id cmp_text="true" />
            <status cmp_text="true" />
            <list_address>
                <adress ident_att_id="true">
                    <residence cmp_text="true" />
                    <street cmp_text="true" />
                    <hausno cmp_text="true" />
                </adress>
            </list_address>
        </person>
    </list_person>
</delivery>

example-content of test-files:
<?xml version="1.0" encoding="UTF-8"?>
<delivery>
    <list_person>
        <person id="1">
            <name>Baker</name>
            <firstname>John</firstname>
            <typ>P</typ>
            <output>Dr. John Baker</output>
            <id>1</id>
            <valid_since>2000-12-01</valid_since>
            <valid_until>2003-01-01</valid_until>
            <timestamp>2001-01-01-00.00.00.000000</timestamp>
            <delivery_id>345</delivery_id>
            <status>U</status>
            <list_address>
                <adress id="1">
                    <residence>New York</residence>
                    <street>Dr. Smith Street</street>
                    <hausno>23</hausno>
                </adress>
                <adress id="2">
                    <residence>Detroit</residence>
                    <street>Michigan Street</street>
                    <hausno>43</hausno>
                </adress>
            </list_address>
        </person>
    </list_person>
</delivery>
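
For readers who want to produce a small file with the same structure, the following is a minimal sketch in Python using the standard library. The function name `build_delivery` and the dictionary layout of its input are assumptions made for this illustration; they are not part of the <xml>cmp toolbox.

```python
import xml.etree.ElementTree as ET

# Child elements of <person>, in the order used by the example test file.
PERSON_FIELDS = ("name", "firstname", "typ", "output", "id", "valid_since",
                 "valid_until", "timestamp", "delivery_id", "status")


def build_delivery(persons):
    """Build a <delivery> tree from a list of person dictionaries.

    Each dictionary holds the PERSON_FIELDS values plus an "addresses" list
    of (residence, street, hausno) tuples. This input layout is hypothetical.
    """
    delivery = ET.Element("delivery")
    list_person = ET.SubElement(delivery, "list_person")
    for fields in persons:
        person = ET.SubElement(list_person, "person", id=fields["id"])
        for tag in PERSON_FIELDS:
            ET.SubElement(person, tag).text = fields[tag]
        list_address = ET.SubElement(person, "list_address")
        for number, (residence, street, hausno) in enumerate(fields["addresses"], start=1):
            adress = ET.SubElement(list_address, "adress", id=str(number))
            ET.SubElement(adress, "residence").text = residence
            ET.SubElement(adress, "street").text = street
            ET.SubElement(adress, "hausno").text = hausno
    return delivery


# The person from the example content above.
baker = {
    "name": "Baker", "firstname": "John", "typ": "P",
    "output": "Dr. John Baker", "id": "1",
    "valid_since": "2000-12-01", "valid_until": "2003-01-01",
    "timestamp": "2001-01-01-00.00.00.000000",
    "delivery_id": "345", "status": "U",
    "addresses": [("New York", "Dr. Smith Street", "23"),
                  ("Detroit", "Michigan Street", "43")],
}

xml_text = ET.tostring(build_delivery([baker]), encoding="unicode")
print(xml_text)
```

Repeating the person entry with increasing ids would give larger files; the files used for the measurements themselves are not reproduced here.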


comparing xml-files

file-size   lines        count of xml-differences   A        B        C
100 MB      2,752,921    2                          28s      35s      1m49s
300 MB      8,260,808    3                          44s      1m14s    7m09s
700 MB      19,929,075   5                          3m01s    6m20s    19m47s
1400 MB     39,858,146   2                          5m56s    13m40s   48m22s
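
To make the comparison times easier to relate to the file sizes, the following sketch converts them into throughput figures. It assumes that the file-size column gives the size of each of the two compared files in MB, so that one comparison reads twice that amount; the helper names `to_seconds` and `throughput_mb_per_s` are made up for this illustration.

```python
import re


def to_seconds(text):
    """Convert a duration such as '1h15m10s', '5m56s' or '28s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"not a duration: {text!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Comparison times from the table above: file size in MB -> (A, B, C).
COMPARE_TIMES = {
    100: ("28s", "35s", "1m49s"),
    300: ("44s", "1m14s", "7m09s"),
    700: ("3m01s", "6m20s", "19m47s"),
    1400: ("5m56s", "13m40s", "48m22s"),
}


def throughput_mb_per_s(size_mb, duration):
    """MB of input per second, assuming two files of size_mb are read."""
    return 2 * size_mb / to_seconds(duration)


for size_mb, times in COMPARE_TIMES.items():
    rates = ", ".join(f"{throughput_mb_per_s(size_mb, t):.1f}" for t in times)
    print(f"{size_mb} MB: {rates} MB/s (A, B, C)")
```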


sorting xml-files

The test files are sorted by "person@id". They are now re-sorted by "name", "firstname" and "residence".


sort-control-file:
<?xml version="1.0" encoding="UTF-8"?>
<sort>
    <identity path='/delivery/list_person/person' >
        <identityfield path='/name' sort='+' />
        <identityfield path='/firstname' sort='+' />
        <identity path='/list_address/adress' >
            <identityfield path='/residence' sort='+' />
        </identity>
    </identity>
</sort>

file-size   A        B        C
100 MB      1m15s    1m34s    3m32s
300 MB      3m39s    4m31s    10m36s
700 MB      8m59s    11m08s   26m42s
1400 MB     18m11s   21m40s   55m11s
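
The intended ordering can be illustrated on a tiny in-memory document. This is only a sketch of the resulting order, under the assumption that the nested identity means "persons by name and firstname, and within each person the addresses by residence"; it does not show how <xml>cmp sorts, which works with work files on disk. The second person (Adams) is invented sample data.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<delivery><list_person>
<person id="1"><name>Baker</name><firstname>John</firstname>
  <list_address>
    <adress id="1"><residence>New York</residence></adress>
    <adress id="2"><residence>Detroit</residence></adress>
  </list_address>
</person>
<person id="2"><name>Adams</name><firstname>Mary</firstname><list_address/></person>
</list_person></delivery>"""


def sort_delivery(root):
    """Sort persons by (name, firstname) and their addresses by residence."""
    list_person = root.find("list_person")
    for person in list_person:
        addresses = person.find("list_address")
        if addresses is not None:
            # Slice assignment replaces the children with the sorted sequence.
            addresses[:] = sorted(addresses, key=lambda a: a.findtext("residence", ""))
    list_person[:] = sorted(
        list_person,
        key=lambda p: (p.findtext("name", ""), p.findtext("firstname", "")),
    )
    return root


sorted_root = sort_delivery(ET.fromstring(SAMPLE))
for person in sorted_root.find("list_person"):
    residences = [a.findtext("residence") for a in person.iterfind("list_address/adress")]
    print(person.findtext("name"), person.findtext("firstname"), residences)
```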


merging xml-files

The test files mentioned above have only a few differences. These test files are now merged. The merge rules are: the result should contain all rows of both files. If elements of a row differ, the value from file2 should overwrite the value from file1.

merge-control-file:
<?xml version="1.0" encoding="UTF-8"?>
<merge>
    <identity path='/lieferung/liste_kv_person/kv_person' merge='333' >
        <value path='/id' merge='3332' />
        <value path='/vorname' merge='3332' />
        <value path='/kz_typ_person' merge='3332' />
        <value path='/gueltig_bis' merge='3332' />
        <value path='/ausgabe' merge='3332' />
        <value path='/gueltig_von' merge='3332' />
        <value path='/aend_zeitstempel' merge='3332' />
        <value path='/name' merge='3332' />
        <value path='/vs_lieferant_id' merge='3332' />
        <value path='/kz_aktiv' merge='3332' />
        <identity path='/liste_kv_pers_adresse/kv_pers_adresse' merge='333' >
            <value path='/wohnort_name' merge='3332' />
            <value path='/hsnr' merge='3332' />
            <value path='/strassenab_name' merge='3332' />
        </identity>
    </identity>
</merge>

file-size   A        B        C
100 MB      1m11s    1m47s    3m56s
300 MB      3m27s    5m13s    13m52s
700 MB      8m25s    14m15s   33m33s
1400 MB     17m30s   31m51s   59m40s
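
The two merge rules stated above can be sketched on plain dictionaries. This is an illustration of the rules only, not of the merge codes in the control file or of the tool's implementation; the function name `merge_rows` and the row layout (row id mapped to element values) are assumptions.

```python
def merge_rows(file1, file2):
    """Merge two row collections.

    file1 and file2 map a row id to a dictionary of element name -> value.
    The result contains every row of both inputs; where both inputs have a
    value for the same element, the value from file2 wins.
    """
    merged = {}
    for row_id in sorted(file1.keys() | file2.keys()):
        row = dict(file1.get(row_id, {}))   # start with the file1 values
        row.update(file2.get(row_id, {}))   # file2 overwrites differing values
        merged[row_id] = row
    return merged


# Hypothetical sample rows for illustration.
file1 = {"1": {"name": "Baker", "status": "U"}, "2": {"name": "Adams", "status": "U"}}
file2 = {"1": {"name": "Baker", "status": "D"}, "3": {"name": "Clark", "status": "U"}}

for row_id, row in merge_rows(file1, file2).items():
    print(row_id, row)
```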


regrouping xml-files

For the performance measurements, the test files are regrouped into an XML structure that contains all residences, for every residence its streets, for every street its house numbers and finally, for every house number, the persons living there.

toxml-control-file:
<?xml version="1.0" encoding="UTF-8"?>
<delivery>
    <list_residence>
        <residence ident_att_name="true" path_text="/delivery/list_person/person/list_address/adress/residence">
            <list_street>
                <street ident_att_name="true" path_att_name="/delivery/list_person/person/list_address/adress/street">
                    <list_hsnr>
                        <hsnr ident_att_nr="true" path_att_nr="/delivery/list_person/person/list_address/adress/hausno">
                            <list_person>
                                <person ident_att_id="true" path_att_id="/delivery/list_person/person/@id">
                                    <name ident_text="true" />
                                    <firstname ident_text="true" />
                                    <typ ident_text="true" />
                                    <output ident_text="true" />
                                    <id ident_text="true" />
                                    <valid_since ident_text="true" />
                                    <valid_until ident_text="true" />
                                    <timestamp ident_text="true" />
                                    <delivery_id ident_text="true" />
                                    <status ident_text="true" />
                                </person>
                            </list_person>
                        </hsnr>
                    </list_hsnr>
                </street>
            </list_street>
        </residence>
    </list_residence>
</delivery>

Example with converted data:
<?xml version="1.0" encoding="UTF-8"?>
<delivery>
    <list_residence>
        <residence name="New York">
            <list_street>
                <street name="Dr. Smith Street">
                    <list_hsnr>
                        <hsnr nr="23">
                            <list_person>
                                <person id="1">
                                    <name>Baker</name>
                                    <firstname>John</firstname>
                                    <typ>P</typ>
                                    <output>Dr. John Baker</output>
                                    <id>1</id>
                                    <valid_since>2000-12-01</valid_since>
                                    <valid_until>2003-01-01</valid_until>
                                    <timestamp>2001-01-01-00.00.00.000000</timestamp>
                                    <delivery_id>345</delivery_id>
                                    <status>U</status>
                                </person>
                            </list_person>
                        </hsnr>
                    </list_hsnr>
                </street>
            </list_street>
        </residence>
        <residence name="Detroit">
            <list_street>
                <street name="Michigan Street">
                    <list_hsnr>
                        <hsnr nr="43">
                            <list_person>
                                <person id="1">
                                    <name>Baker</name>
                                    <firstname>John</firstname>
                                    <typ>P</typ>
                                    <output>Dr. John Baker</output>
                                    <id>1</id>
                                    <valid_since>2000-12-01</valid_since>
                                    <valid_until>2003-01-01</valid_until>
                                    <timestamp>2001-01-01-00.00.00.000000</timestamp>
                                    <delivery_id>345</delivery_id>
                                    <status>U</status>
                                </person>
                            </list_person>
                        </hsnr>
                    </list_hsnr>
                </street>
            </list_street>
        </residence>
    </list_residence>
</delivery>

file-size   A        B        C
100 MB      1m40s    1m50s    5m01s
300 MB      5m00s    6m30s    15m33s
700 MB      12m38s   15m50s   38m22s
1400 MB     24m05s   28m56s   1h15m10s
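
The regrouping itself can be sketched in a few lines of Python on an abbreviated version of the example file. The sketch builds a nested dictionary (residence, street, house number, person ids) instead of XML output and holds everything in memory, so it only illustrates the target structure and is not how <xml>cmp processes large files. The function name `regroup` is an assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the example test file (only the fields used here).
SOURCE = """<delivery><list_person>
<person id="1"><name>Baker</name>
  <list_address>
    <adress id="1"><residence>New York</residence><street>Dr. Smith Street</street><hausno>23</hausno></adress>
    <adress id="2"><residence>Detroit</residence><street>Michigan Street</street><hausno>43</hausno></adress>
  </list_address>
</person>
</list_person></delivery>"""


def regroup(root):
    """Group person ids by residence, then street, then house number."""
    grouped = {}
    for person in root.iterfind("list_person/person"):
        for adress in person.iterfind("list_address/adress"):
            residence = adress.findtext("residence")
            street = adress.findtext("street")
            number = adress.findtext("hausno")
            persons = (grouped.setdefault(residence, {})
                              .setdefault(street, {})
                              .setdefault(number, []))
            persons.append(person.get("id"))
    return grouped


grouped = regroup(ET.fromstring(SOURCE))
for residence, streets in grouped.items():
    for street, numbers in streets.items():
        for number, person_ids in numbers.items():
            print(residence, "/", street, "/", number, "->", person_ids)
```

A person with several addresses appears once under each of them, as in the converted data above.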


fast disks improve performance

Disk performance is decisive, because the files are written several times to temp files. You can define up to four directories for these temp files via the shell variables TMPDIR, TMPDIR1, TMPDIR2 and TMPDIR3. Performance is best when these temp directories are on file systems whose disks each have their own disk controller.
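
When the tools are started from a script, the four variables can be prepared as an environment mapping. The sketch below only builds that mapping; the variable names come from the text above, while the function name `temp_environment` and the directory paths are hypothetical. The command line of the tools is not shown here, so the sketch stops before launching anything.

```python
import os

# Variable names as listed in the text above.
TEMP_VARIABLES = ("TMPDIR", "TMPDIR1", "TMPDIR2", "TMPDIR3")


def temp_environment(directories, base_env=None):
    """Return a copy of the environment with up to four temp directories set."""
    if len(directories) > len(TEMP_VARIABLES):
        raise ValueError("at most four temp directories can be defined")
    env = dict(os.environ if base_env is None else base_env)
    env.update(zip(TEMP_VARIABLES, directories))
    return env


# Hypothetical mount points, ideally on disks with separate controllers.
env = temp_environment(["/mnt/disk1/tmp", "/mnt/disk2/tmp"], base_env={})
print(env)
```

The resulting mapping could be passed as the `env` argument of `subprocess.run` when the tool is started.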



low influence of parameter java-heap-size

Increasing the Java heap size may yield small performance improvements. However, even with a small Java heap the performance is good.

In the following measurements, two files of 700 MB each were compared. The two files have two differences. The maximum Java heap size was set explicitly with the "-Xmx" parameter.

Java-Heap-Size   A
50 MB            3m01s
100 MB           2m55s
250 MB           3m01s
500 MB           2m55s
1000 MB          3m00s
2000 MB          2m49s
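
How small the effect is can be checked with a short calculation over the measured times. The sketch below uses only the values from the table above; the variable names are made up for this illustration.

```python
# Heap size in MB -> elapsed time as (minutes, seconds), from the table above.
HEAP_TIMES = {
    50: (3, 1),
    100: (2, 55),
    250: (3, 1),
    500: (2, 55),
    1000: (3, 0),
    2000: (2, 49),
}

seconds = {heap: m * 60 + s for heap, (m, s) in HEAP_TIMES.items()}
fastest = min(seconds.values())
slowest = max(seconds.values())

# Relative gain of the fastest run over the slowest run, in percent.
spread_percent = (slowest - fastest) / slowest * 100
print(f"fastest {fastest}s, slowest {slowest}s, spread {spread_percent:.1f}%")
```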

<xml>cmp-toolbox

  • comparing xml-files
  • merging xml-files
  • regrouping xml-files
  • sorting xml-files

<xml>cmp and large xml-files

  • designed for large xml-files
  • low memory consumption
  • very good performance

<xml>cmp-interfaces

  • command line interface (unix/dos)
  • java-api

differences are shown in the context of the xml-files:

  • all data + differences
  • only differences
  • output: xml and pdf
Software Fischer SOFIKA GmbH
Freseniusstr. 65
D-81247 Munich
Germany
Tel: +49 (0)89 / 81 00 90 15
Fax: +49 (0)89 / 81 00 90 16
Email: info@sofika.de