            • April 04, 2005

              My reaction to a recent response from Dimitre Novatchev to my post regarding experience with text processing in XSLT 2.0

              [Image: stunned-by-dimitre-fxsl.jpg]

              [NOTE: See extended portion of entry for my post and his response]
              [UPDATE: Added additional follow-up by Dimitre to the end of this post]

              My post:

              == Text processing on XSLT 2.0 ==

              Working on projects such as XBiblio/Citeproc, led by Bruce D'Arcus,
              I have realized that, however far the XSLT 2.0 working draft goes
              toward bringing Perl-esque text processing to the XML developer,
              it is still up to the developer to fine-tune these capabilities
              to cover their specific needs. A spell checker is one example.

              Can anyone with extensive experience developing such capabilities
              in XSLT share that experience with the rest of us?

              == Re: [xsl] Text processing on XSLT 2.0 ==

              Hi Mark,

              These days I had fun with an f:binSearch() function and then,
              logically, with f:spell().

              I have a dictionary of about 47000 English wordforms, on which I
              search with f:binSearch()

              I had to produce a faster function than the current quadratic
              str-split-to-words template -- this is the f:getWords() function.

              All these functions can be downloaded from the FXSL CVS (just let me
              know if you'd want me to send you the zip archive).

              The combination of these functions works quite well.

              This transformation (test-FuncSpell.xsl):


              <xsl:stylesheet version="2.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:xs="http://www.w3.org/2001/XMLSchema"
                  xmlns:f="http://fxsl.sf.net/"
                  exclude-result-prefixes="f xs"
              >
                <xsl:import href="../f/func-getWords.xsl"/>
                <xsl:import href="../f/func-spell.xsl"/>

                <xsl:output omit-xml-declaration="yes"/>

                <!-- Characters treated as word delimiters -->
                <xsl:variable name="vDelim" as="xs:string">
                ,—:.-&#9;&#10;&#13;'!?;</xsl:variable>

                <!-- To be applied on ../data/othello.xml -->
                <xsl:template match="/">
                  <!-- Split every lower-cased text node into words -->
                  <xsl:variable name="vwordNodes" as="element()*">
                    <xsl:for-each select="//text()/lower-case(.)">
                      <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
                    </xsl:for-each>
                  </xsl:variable>

                  <!-- Distinct words, sorted -->
                  <xsl:variable name="vUnique" as="xs:string+">
                    <xsl:perform-sort select="distinct-values($vwordNodes)">
                      <xsl:sort select="."/>
                    </xsl:perform-sort>
                  </xsl:variable>

                  <!-- Words for which f:spell() returns false -->
                  <xsl:variable name="vnotFound" as="xs:string*"
                      select="$vUnique[not(f:spell(.))]"/>

                  <xsl:value-of separator="&#xA;"
                      select="$vnotFound"/>

                  A total of <xsl:value-of select="count($vwordNodes)"/> words
                  were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.

                  <xsl:value-of select="count($vnotFound)"/> not found.
                </xsl:template>
              </xsl:stylesheet>

              when applied on othello.xml (around 29000 words)

              produces this result:

              Saxon 8.3 from Saxonica
              Java version 1.5.0_01
              Stylesheet compilation time: 1140 milliseconds
              Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
              Building tree for
              file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
              class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 94 milliseconds
              Tree size: 18539 nodes, 154557 characters, 0 attributes
              Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
              using class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 0 milliseconds
              Tree size: 43 nodes, 143 characters, 22 attributes
              Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
              using class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 188 milliseconds
              Tree size: 139140 nodes, 528397 characters, 0 attributes
              Execution time: 7015 milliseconds


              A total of 28622 words
              were spelt, (3669) distinct.

              567 not found.


              So, checking 3669 distinct words in 7015 milliseconds makes

              523.02 words/sec.

              The actual speed is faster, as the total time includes splitting up
              the words and finding the distinct words.

              Among the unknown words are such nice words as:

              affordeth
              affrighted
              ariseth
              arithmetician
              arrivance
              bethink
              betimes
              bewhored

              :o)

              Cheers,

              Dimitre
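
              For readers who have not looked at FXSL, the general idea of
              checking words against a sorted dictionary by binary search can be
              sketched in plain XSLT 2.0. This is not Dimitre's code and makes no
              claim about how f:binSearch(), f:spell() or f:getWords() are
              implemented: the my: namespace, the function names, the four-word
              list and the sample sentence are all hypothetical, and tokenize()
              stands in for the word splitter. It assumes a processor that
              implements the final XSLT 2.0 Recommendation.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/sketch"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Codepoint collation: the word list below must be sorted in this order. -->
  <xsl:variable name="vColl" as="xs:string"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Hypothetical stand-in for a real dictionary, already sorted. -->
  <xsl:variable name="vDict" as="xs:string+"
      select="('cheers', 'fun', 'speed', 'word')"/>

  <!-- Recursive binary search for $pWord in $pDict[$pLo to $pHi]. -->
  <xsl:function name="my:binSearch" as="xs:boolean">
    <xsl:param name="pDict" as="xs:string*"/>
    <xsl:param name="pWord" as="xs:string"/>
    <xsl:param name="pLo" as="xs:integer"/>
    <xsl:param name="pHi" as="xs:integer"/>
    <xsl:choose>
      <!-- Empty range: the word is not in the list. -->
      <xsl:when test="$pLo gt $pHi">
        <xsl:sequence select="false()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vMid" as="xs:integer"
            select="($pLo + $pHi) idiv 2"/>
        <!-- compare() yields -1, 0 or 1 -->
        <xsl:variable name="vCmp" as="xs:integer"
            select="compare($pWord, $pDict[$vMid], $vColl)"/>
        <xsl:sequence select="
            if ($vCmp eq 0) then true()
            else if ($vCmp lt 0)
              then my:binSearch($pDict, $pWord, $pLo, $vMid - 1)
              else my:binSearch($pDict, $pWord, $vMid + 1, $pHi)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Hypothetical wrapper: is the word in the sample list? -->
  <xsl:function name="my:spell" as="xs:boolean">
    <xsl:param name="pWord" as="xs:string"/>
    <xsl:sequence select="my:binSearch($vDict, $pWord, 1, count($vDict))"/>
  </xsl:function>

  <xsl:template match="/" name="main">
    <xsl:variable name="vText" as="xs:string"
        select="'Fun with speed, and betimes a word'"/>
    <!-- Split on anything that is not a lower-case ASCII letter. -->
    <xsl:variable name="vWords" as="xs:string*"
        select="tokenize(lower-case($vText), '[^a-z]+')[. ne '']"/>
    <xsl:variable name="vUnique" as="xs:string*"
        select="distinct-values($vWords)"/>
    <!-- Print the distinct words that the lookup does not find. -->
    <xsl:value-of select="$vUnique[not(my:spell(.))]" separator="&#xA;"/>
  </xsl:template>
</xsl:stylesheet>
```

              Applied to any XML document, this should list the sample words
              that are missing from the four-word list, one per line. Each lookup
              halves the remaining range, which is why searching only the
              distinct words of a text against a large sorted dictionary stays
              cheap compared with splitting the text in the first place.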


              == Follow-up from Dimitre ==

              === Update One ===

              I didn't mention that the text I was spelling was the play:

              "Othello"

              by William Shakespeare


              === Update Two ===

              On Apr 5, 2005 7:10 AM, M. David Peterson wrote:
              > Well, I think that about covers it... FXSL it is then :)
              >
              > Please see http://www.xsltblog.com/archives/2005/04/my_reaction_a_r.html
              > for a slightly extended reaction...
              >
              > Thank you Dimitre!!! As always the capabilities of FXSL have proven to
              > be flat out amazing.

              Actually, the *great praise* here goes to Saxon 8.3 and Mike Kay.

              On this transform Saxon 8.3 is about 20 times faster than Saxon 8.2,
              with which I get:

              Saxon 8.2 from Saxonica
              Java version 1.5.0_01
              Stylesheet compilation time: 969 milliseconds
              Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
              Building tree for
              file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
              class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 156 milliseconds
              Tree size: 18539 nodes, 154557 characters, 0 attributes
              Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
              using class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 0 milliseconds
              Tree size: 43 nodes, 143 characters, 22 attributes
              Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
              using class net.sf.saxon.tinytree.TinyBuilder
              Tree built in 187 milliseconds
              Tree size: 139140 nodes, 528397 characters, 0 attributes
              Execution time: 135921 milliseconds

              So,

              Saxon 8.3:    7 sec.
              Saxon 8.2:  136 sec.

              I strongly hope that this great achievement is not reverted in future
              versions of Saxon.

              There are some other extremely nice features of Saxon, which I've been using.

              For example, can someone guess what would be the time if I didn't look
              up in the dictionary just the distinct words, but all words as they
              come in the text?

              Cheers,
              Dimitre Novatchev.
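
              As a quick check of the figures quoted in the two messages, using
              only the numbers reported in the Saxon output above (3669 distinct
              words, 7015 ms under Saxon 8.3, 135921 ms under Saxon 8.2):

```latex
\[
\frac{3669\ \text{words}}{7.015\ \text{s}} \approx 523.02\ \text{words/s},
\qquad
\frac{3669\ \text{words}}{135.921\ \text{s}} \approx 27.0\ \text{words/s},
\qquad
\frac{135921\ \text{ms}}{7015\ \text{ms}} \approx 19.4
\]
```

              So "about 20 times faster" is a fair rounding of the measured
              ratio. Both rates include the time spent splitting the text and
              finding the distinct words, not just the dictionary lookups.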

            • Posted by m.david : April 4, 2005 03:58 PM GMT


          • © 2005 :: <XSLT:Blog/> (xsltblog.com) is a product of M. David Peterson and FunctionalX Consulting. See Licensing Info Below.
          • Except where otherwise noted, this site's content and source code are licensed under the Attribution License from Creative Commons.