• << Return Home
  • XSLT:Blog[@author = 'M. David Peterson']/Main: Non-Muenchian Methods Archives
              • December 21, 2004

                The results are in...

                In a recent post to XSL-List Dimitre Novatchev helps us all understand the benefits of using time tested XML data grouping methodologies in XSLT by pushing the envelope one level deeper into the recursive nature of such activity. In this post he showcases the source for two different methodologies used in grouping similar XML elements, attributes, and/or the data they contain and brings out the following results:

                The results in milliseconds, using Saxon 6.5.3 on a 3GHz 2GB RAM PC are as follows: Muenchian code: 1265 Sergio recursive: 38188

                He then follows with:

                I will be glad to provide the source xml document used in the test to any interested person (and why not put them somewhere on the web?).

                Ummmm… might I suggest <XSLT:Blog/> as an appropriate publishing platform and download location for these results? :D

                Dimitre, Thanks for showcasing the need to push the envelope further when it comes to making an attempt to understand the complex-nature of XML grouping methodologies. For those interested in looking at the details of his post but don’t subscribe to XSL-List I’ve added the content from the original post to the extended portion of this posting.

                [POST UPDATED: I’ve added Dimitre’s second and most recent post to XSL-List in regards to his Muenchian Method findings. In this post he has suggested plans to take me up on my offer to host the download of these stylesheets. I’m watching for his email now and will update this entry with the download links as soon as they are available.]

                [UPDATED: Download file Content of this zip file is explained in the extended portion of this entry]

                Hi Sergio,

                I performed some tests and here are the results on a source xml document containing 20516 elements that had to be grouped in two levels. The document has the same structure as the one in your examples, but more “instances”.

                I had to upgrade your code to make it general, so that it can now handle any number of levels of grouping. The names of the attributes on which to group consecutively are given as a parameter.

                Here’s your modified code (50 lines):

                <xsl:stylesheet version="1.0"
                   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                
                 <xsl:output omit-xml-declaration="yes" indent="yes"/>
                
                 <xsl:template match="/booklist">
                   <authors>
                     <xsl:call-template name="RecursiveGrouping">
                       <xsl:with-param name="list" select="book"/>
                       <xsl:with-param name="attrList" select="/*/book[1]/@*"/>
                     </xsl:call-template>
                   </authors>
                 </xsl:template>
                
                 <xsl:template name="RecursiveGrouping">
                   <xsl:param name="list"/>
                   <xsl:param name="attrList"/>
                
                   <xsl:if test="$attrList">
                      <!-- Selecting the first attribute name as group identifier and the group
                     itself-->
                     <xsl:variable name="attrName" select="name($attrList[1])"/>
                
                     <xsl:variable name="group-identifier" select="$list[1]/@*[name()
                = $attrName]"/>
                     <xsl:variable name="group"
                select="$list[@*[name()=$attrName]=$group-identifier]"/>
                
                      <!-- Do some work for the group -->
                      <xsl:element name="{$attrName}">
                        <xsl:attribute name="value">
                          <xsl:value-of select="$group-identifier"/>
                        </xsl:attribute>
                
                         <xsl:call-template name="RecursiveGrouping">
                           <xsl:with-param name="list" select="$group"/>
                           <xsl:with-param name="attrList" select="$attrList[position() > 1]"/>
                         </xsl:call-template>
                      </xsl:element>
                
                      <!-- If there are other groups left, calls itself -->
                     <xsl:if test="count($list)>count($group)">
                       <xsl:call-template name="RecursiveGrouping">
                         <xsl:with-param name="list"
                             select="$list[not(@*[name()=$attrName]=$group-identifier)]"/>
                         <xsl:with-param name="attrList" select="$attrList"/>
                       </xsl:call-template>
                     </xsl:if>
                   </xsl:if>
                 </xsl:template>
                </xsl:stylesheet>
                

                Here’s the code using the Muenchian method:

                <xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                
                <xsl:output omit-xml-declaration="yes" indent="yes"/>
                
                 <xsl:key name="kAuth" match="book" use="@author"/>
                 <xsl:key name="kAuthTitle" match="book"
                  use="concat(@author, '+', @title)"/>
                
                 <xsl:template match="/">
                   <xsl:for-each select=
                      "/*/book[generate-id()
                              =
                               generate-id(key('kAuth', @author)[1])
                              ]">
                     <xsl:element name="author">
                       <xsl:attribute name="value">
                         <xsl:value-of select="@author"/>
                       </xsl:attribute>
                       <xsl:for-each select=
                         "key('kAuth', @author)
                             [generate-id()
                             =
                              generate-id(key('kAuthTitle',
                                              concat(@author, '+', @title)
                                              )[1]
                                          )
                             ]">
                         <xsl:element name="title">
                           <xsl:attribute name="value">
                             <xsl:value-of select="@title"/>
                           </xsl:attribute>
                         </xsl:element>
                       </xsl:for-each>
                       </xsl:element>
                   </xsl:for-each>
                 </xsl:template>
                </xsl:stylesheet>
                

                The results in milliseconds, using Saxon 6.5.3 on a 3GHz 2GB RAM PC are as follows:

                Muenchian code: 1265

                Sergio recursive: 38188

                I will be glad to provide the source xml document used in the test to any interested person (and why not put them somewhere on the web?).

                Cheers,

                Dimitre


                Follow-up post by Dimitre:

                On Tue, 21 Dec 2004 20:09:22 +1100, Dimtre Novatchev dnovatchev@gmail.com wrote:

                The results in milliseconds, using Saxon 6.5.3 on a 3GHz 2GB RAM PC are as follows:

                Muenchian code: 1265

                Sergio recursive: 38188

                More data — this time testing the originally-provided one-level grouping:

                Muenchian code: 9796

                Sergiu recursive: 16016

                However, the original code produces also the count of each group. This is not a requirement for grouping and is not part of the Muenchian method at all.

                With counting removed, the results are:

                Muenchian code: 953

                Sergiu recursive: 15922

                >

                I will be glad to provide the source xml document used in the test to any interested person (and why not put them somewhere on the web?).

                Thanks to M.David Peterson, who offered to host the test data on www.XSLTBLOG.com

                I’m sending the data to him in a moment.

                The data I’m using is artificial but not at all unusual. It contains 183 different authors. In real cases data extracted from a database will contain thousands of different objects (persons, authors, employees etc.)

                Cheers,

                Dimitre.

                Test files can be found Here.

                An explantion can be found in Dimitre’s email pdated with explanation of the test files used and location to download these files.

                The test xml file and the different stylesheet modules used in the benchmark:

                my25600.xml — the source xml file

                test-groupSergiuOriginal.xsl — Sirgiu Ignat’s original code as downloaded from your site.

                test-groupSergiuOriginal.xsl — as above, but with counting of the groups removed

                test-group.xsl — Sergiu’s code updated by me so that it can handle any level of nested grouping

                test-group-m.xsl — the Muenchian code as downloaded from your site

                test-group-mNoCount.xsl — as above, but with counting of the groups removed

                Posted by m.david at 02:13 AM | Comments (0) | TrackBack

                December 17, 2004

                New Alternative to Muenchian Method of Grouping?

                In a post to XSL-List yesterday Sergiu Ignat states:

                I would like to present you a simple, XSLT 1.0, fast grouping method with a O(N*log(N)) complexity, the same as sorting. The only grouping method I knew so far is Muenchian that has O(N^2) complexity.

                As Jay Bryant states in a follow up at first look this method seems interesting but as Michael Kay points out things may not be exactly as Sergiu suggests. The best way to find out of course is to write the code using the Muenchian method that uses the same criteria to produce the same output and then transform the test XML with both stylesheets using Saxon and the -t option to determine the time each method takes to finish. So I’ve done just that. The code and the results of the transformations are as follows:

                Suggested alternative to Muenchian Method:

                <xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                <xsl:template match="/booklist">
                 <authors>
                  <xsl:call-template name="RecursiveGrouping">
                   <xsl:with-param name="list" select="book"/>
                  </xsl:call-template>
                 </authors>
                </xsl:template>
                
                <xsl:template name="RecursiveGrouping">
                 <xsl:param name="list"/>
                
                 <!-- Selecting the first author name as group identifier and the group
                itself-->
                 <xsl:variable name="group-identifier" select="$list[1]/@author"/>
                 <xsl:variable name="group" select="$list[@author=$group-identifier]"/>
                
                 <!-- Do some work for the group -->
                 <author name="{$group-identifier}" number-of-books="{count($group)}"/>
                
                 <!-- If there are other groups left, calls itself -->
                 <xsl:if test="count($list)>count($group)">
                 <xsl:call-template name="RecursiveGrouping">
                   <xsl:with-param name="list"
                select="$list[not(@author=$group-identifier)]"/>
                 </xsl:call-template>
                 </xsl:if>
                </xsl:template>
                </xsl:stylesheet>
                

                Solution using Muenchian Method:

                While I technically wrote this code this is a technique I learned from Jeni Tennisons online tutorial a while back so credit must be given to her and Steve Muench, the originally developer in whom first suggested this method and as such the method was named after:

                <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
                  <xsl:key name="author" match="@author" use="."/>
                  <xsl:template match="/booklist">
                    <authors>
                      <xsl:apply-templates 
                      select="book[generate-id(@author) =   generate-id(key('author', @author))]"/>
                    </authors>
                  </xsl:template>
                  <xsl:template match="book">
                    <xsl:variable name="count" select="count(following-sibling::book[@author = current()/@author])"/>
                    <author name="{@author}" number-of-books="{$count + 1}"/>
                  </xsl:template>
                </xsl:stylesheet>
                

                —

                The XML file presented for use is as follows:

                <booklist>
                  <book author="Frank Herbert" title="Dune"/>
                  <book author="Roberto Quaglia" title="Bread, Butter and Paradoxine"/>
                  <book author="Kate Wilhelm" title="Where Late the Sweet Bird Sang"/>
                  <book author="Anthony Burgess" title="A Clockwork Orange"/>
                  <book author="Frank Herbert" title="Dragon in the Sea"/>
                  <book author="Anthony Burgess" title="Earthly Powers"/>
                  <book author="Isaak Asimov" title="The Foundation Trilogy"/>
                  <book author="Frank Herbert" title="Children of Dune"/>
                  <book author="Isaak Asimov" title="The Caves of Steel"/>
                </booklist>
                

                —

                Rather a small sampling so this isn’t going to give us any sort of realistic feedback as to the actual processing speed of either method. I don’t plan to draw any conclusions from this, just simply present the code and data and let you decide from there if any of it is useful.

                The description of the testing process used is as follows:

                The first run was used to determine the initial compilation of the stylesheet + transformation.

                I took note of the compilation time on the second, third, fourth, and fifth runs but generally speaking this is going to be smaller as the compiled stylesheet will be stored in memory from this first go round. However this is not always the case. See the manual for the processor you are using to determine the specifics in how a compiled stylesheet is handled while it is still the primary stylesheet of record.

                With this in mind in all 5 cases I took note of the compile time and process time.

                For this test I used Saxon 8.1.1 from Saxonica. The test was run on a Pentium4 2.8 Ghz processor with 1 GByte of 400Mhz Front-side Bus Ram.

                In all cases (totalling 10 between the two) the following XML was returned by Saxon:

                The output was as follows:

                <?xml version="1.0" encoding="UTF-8"?>
                <authors>
                  <author number-of-books="3" name="Frank Herbert"/>
                  <author number-of-books="1" name="Roberto Quaglia"/>
                  <author number-of-books="1" name="Kate Wilhelm"/>
                  <author number-of-books="2" name="Anthony Burgess"/>
                  <author number-of-books="2" name="Isaak Asimov"/>
                </authors>
                

                Sergiu’s Method

                Run One: Compile Time: 938 Milliseconds Execution Time: 78 Milliseconds

                Run Two: Compile Time: 672 Milliseconds Execution Time: 109 Milliseconds

                Run Three: Compile Time: 438 Milliseconds Execution Time: 62 Milliseconds

                Run Four: Compile Time: 453 Milliseconds Execution Time: 63 Milliseconds

                Run Five: Compile Time: 437 Milliseconds Execution Time: 63 Milliseconds

                It seems the execution time of run two was out of sorts in comparison. My guess is that another process on the computer caused the processing to slow and has nothing to do with Sergiu’s stylesheet.

                Meunchian Method

                Run One: Compile Time: 438 Milliseconds Execution Time: 78 Milliseconds

                Run Two: Compile Time: 453 Milliseconds Execution Time: 62 Milliseconds

                Run Three: Compile Time: 437 Milliseconds Execution Time: 63 Milliseconds

                Run Four: Compile Time: 453 Milliseconds Execution Time: 63 Milliseconds

                Run Five: Compile Time: 488 Milliseconds Execution Time: 62 Milliseconds

                It seems that except for Run Two of Sergiu’s code the execution time for each method was near exact.

                Oddly, the compile time for the Meunchian method on average increased through the completion of the 5 runs. In my opinion however the increase was not of significant proportion to let this anamoly raise any sort of concern.

                The explanation as to the methode behind Sergiu’s code is as follows:

                The main idea is to have a named template that takes as a parameter the node list that should be grouped, processes the group defined by the first element and recursively calls itself for the rest of the list excluding that group. The example below will group books in a book list and will compute how many books each author has (input XML is shown after it).

                Sergiu Ignat Dynamic Ventures www.dyve.com

                Again, I have no intention of drawing any conclusions. I will bring out the fact that the Meuncian Method allows <xsl:sort ../> to be run against the transformation where as Sergiu’s method can not use this built in method for sorting; at least not in its current state as the sort element can only be a child element of xsl:for-each and xsl:apply-templates.

                Posted by m.david at 10:15 AM | Comments (7) | TrackBack

          • © 2005 :: <XSLT:Blog/> (xsltblog.com) is a product of M. David Peterson and FunctionalX Consulting. See Licensing Info Below.
          • Except where otherwise noted, this sites content and source code is licensed under the Attribution License from Creative Commons.