Ambiguity in iXML

and how to control it

Norm Tovey-Walsh

Saxonica

Balisage 2023, 31 July - 4 August

What is iXML?

We tend to be good at picking out implicit structure because it’s embedded in the conventions of how we communicate; telling computers how to do is often difficult.

  • It’s a declarative way of describing implicit structure.

  • Consider “07/31/2023 Balisage presentation at 16:00UTC”

  • Consider “07/31/2023 Balisage presentation at 16:00UTC” →

    <diary>
       <appointment>
          <date>
             <month>07</month>
             <day>31</day>
             <year>2023</year>
          </date>
          <text>Balisage presentation at 
             <time tz='UTC'>16:00</time>
          </text>
       </appointment>
    </diary>

Structure and ambiguity

  • Implicit structures aren’t always unambiguous

    • One of iXML’s strengths is that it supports ambiguous parses

      Unlike, lex/yacc LR(n)/LL(n) parsers.

  • This is good because:

    • Some things really are ambiguous.

    • Ambiguous grammars are often easier to write, even if your goal is to refine the grammar until it isn’t ambiguous.

  • (Pro tip: refine your grammars until they aren’t ambiguous.)

A quick example

  • Suppose you wanted to parse a list of numbers: “1”, “1,2”, “1,2,3”, etc.

  • You could use this simple grammar:

    list-of-numbers = (number, sep)+ .
             number = ["0"-"9"]+ .
               -sep = -','* .
    
  • To turn “1,2,3” into:

    <list-of-numbers>
       <number>1</number>
       <number>2</number>
       <number>3</number>
    </list-of-numbers>
  • Alternatively, you might express that more concisely as:

    list-of-numbers = number++sep .
             number = ["0"-"9"]+ .
               -sep = -','* .
    
  • (Though that does accept a very slightly different set of strings.)

Terminology

  • Invisible XML is a way of defining a grammar.

    list-of-numbers = (number, sep)+ .
             number = ["0"-"9"]+ .
               -sep = -','* .
    
  • A grammar is a set of rules. Rules define nonterminal symbols

    list-of-numbers = …
  • in terms of other symbols

    list-of-numbers = (number, sep)+ .
  • Symbols that directly match characters in your input are called terminal symbols:

             number = ["0"-"9"]+ .

What is ambiguity?

Here are two possible definitions:

  • A context-free grammar G such that some word has two parse trees is said to be ambiguous—Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, p87.

  • A grammar is ambiguous if it can produce two different production trees with the same leaves in the same order—Grune et. al., Modern Compiler Design, p38.

What is ambiguity?

What does iXML say about ambiguity?

Nothing nearly as formal:

  • iXML doesn’t try to define ambigity because different parsing algorithms have different characteristics.

    It describes what to do with the results of ambiguity

    …if more than one parse results, one is chosen; …, but the resulting serialization should include the attribute ixml:state on the document element with a value that includes the word ambiguous.

An example of ambiguity

product         =  dessert-topping | floor-wax .
dessert-topping = "Cool Whip" | "Marshmallow Fluff" | "Shimmer" .
floor-wax       = "Osmo" | "Briwax" | "Shimmer" .
  • A still from the SNL parody commercial for Shimmer
  • For the input “Osmo”, there’s no ambiguity: it’s a floor wax.

    <product>
       <floor-wax>Osmo</floor-wax>
    </product>
  • For the input “Shimmer”, we have to select one of two equally valid trees:

    <product>
       <floor-wax>Shimmer</floor-wax>
    </product>

    or

    <product>
       <dessert-topping>Shimmer</dessert-topping>
    </product>

    So how does this work? With forests.

An ambiguous forest

Here we’ve picked dessert-topping

A graph of the Shimmer parse forest

How can things be ambiguous?

Ambiguity can arise in a grammar in one of two ways. Taxonomically, these can be described as either “vertical” or “horizontal”.

Brabrand, Claus, Robert Giegerich, and Anders Möller. “Analyzing Ambiguity of Context-Free Grammars.” Science of Computer Programming, volume 75, number 3. Elsevier. (2010). https://cs.au.dk/~amoeller/papers/ambiguity/. doi:10.1016/j.scico.2009.11.002.

Vertical ambiguity

  1. If there’s a choice between two nonterminals that both match part of the input, that’s “vertical ambiguity”.

    character = letter 
              | numeral .
       letter = ['A'-'Z'] .
      numeral = arabic | roman .
       arabic = ['0'-'9'] .
        roman = ["C" | "D" | "I" | "L" | "M" | "V" | "X"] .
    

    There are two parses for V (for example).

Vertical ambiguity (graph)

A graph of the vertically ambiguous parse forest

<character><letter><V</letter></character>

character = letter 
          | numeral .
   letter = ['A'-'Z'] .
  numeral = arabic | roman .
   arabic = ['0'-'9'] .
    roman = ["C" | "D" | "I" | "L" | "M" | "V" | "X"] .

Horizontal ambiguity

  1. If there’s a sequence of nonterminals where the end of one subsequence “overlaps” with the beginning of the next, that’s “horizontal ambiguity”.

 S = Xa, aY .
Xa = 'x', 'a'? .
aY =      'a'?, 'y' .

There are two parses for xay.

Why is this formatted this way

Horizontal ambiguity (graph)

A graph of the horizontally ambiguous parse forest
Another graph of the horizontally ambiguous parse forest
<S><Xa>x</Xa><aY>ay</aY></S><S><Xa>xa</Xa><aY>y</aY></S>

How does ambiguity arise?

Nonterminals that can match nothing are especially susceptible to ambiguity.

Suppose we want to match strings like “1,2,a”; “12a”; 12,34a,bcd”; etc.

           list = list-of-numbers, sep, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
         letter = ["a"-"z"] .
         number = ["0"-"9"] .
           -sep = -','* .

That looks perfectly reasonable.

Let’s parse 1,2,a

But it’s ambiguous! There are two ways to parse “1,2,a”.

<list xmlns:ixml='http://invisiblexml.org/NS'
      ixml:state='ambiguous'>
   <list-of-numbers>
      <number>1</number>
      <number>2</number>
   </list-of-numbers>
   <list-of-letters>
      <letter>a</letter>
   </list-of-letters>
</list>

Ambiguous?

  • How is this ambiguous?

               list = list-of-numbers, sep, list-of-letters .
    list-of-letters = (letter, sep)+ .
    list-of-numbers = (number, sep)+ .
             letter = ["a"-"z"] .
             number = ["0"-"9"] .
               -sep = -','* .
    
  • Ask the processor for a hint!

  • My implementation, NineML, supports an option which runs Möller’s ambiguity analyzer.

    The grammar is ambiguous:
    *** horizontal ambiguity: list[#1]: \
      list-of-numbers <--> sep list-of-letters
        ambiguous string: "0,a"
        matched as "0" <--> ",a" or "0," <--> "a"
    
    Found 2 possible parses.

(This is an especially frustrating case because both parses look the same!)

1,2,a (graph)

A graph of the ambiguous list parse 1,2,a

Avoid ambiguity: remove the optionality!

           list = list-of-numbers, sep, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
         letter = ["a"-"z"] .
         number = ["0"-"9"] .
           -sep = -',' .

But now we can only match “1,2,,a,”, we can’t match things without commas and we need two commas between the lists!

Avoid ambiguity: optional last

One strategy is to place optional elements at the end.

           list = list-of-numbers, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
         letter = ["a"-"z"] .
         number = ["0"-"9"] .
           -sep = -','* .

Avoid ambiguity: optional between

Explain what ++ means

Another strategy is to put optional elements between non-optional ones.

           list = list-of-numbers, sep, list-of-letters .
list-of-letters = letter++sep .
list-of-numbers = number++sep .
         letter = ["a"-"z"] .
         number = ["0"-"9"] .
           -sep = -','* .

Don’t do both!

Other strategies may be available.

Aside: infinite ambiguity

  • A graph with a loop is infinitely ambiguous.

      name = letter | symbol .
    letter = symbol | ["a"-"z"] .
    symbol = letter | ["α"-"ω"] .
    
  • If we parse “β”, we get this graph:

    A graph of an infinitely ambiguous parse
  • For which this is a valid parse:

    <name xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <symbol>β</symbol>
    </name>
    
  • So is this:

    <name xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <letter>
          <symbol>β</symbol>
       </letter>
    </name>
    
  • And this:

    <name xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <letter>
          <symbol>
             <letter>
                <symbol>β</symbol>
             </letter>
          </symbol>
       </letter>
    </name>
    
  • And this:

    <name xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <letter>
          <symbol>
             <letter>
                <symbol>
                   <letter>
                      <symbol>
                         <letter>
                            <symbol>
                               <letter>
                                  <symbol>
                                     <letter>
                                        <symbol>
                                           <letter>
                                              <symbol>
                                                 <letter>
                                                    <symbol>β</symbol>
                                                 </letter>
                                              </symbol>
                                           </letter>
                                        </symbol>
                                     </letter>
                                  </symbol>
                               </letter>
                            </symbol>
                         </letter>
                      </symbol>
                   </letter>
                </symbol>
             </letter>
          </symbol>
       </letter>
    </name>
    
  • And so on…

Inherent ambiguity

Suppose that we get documents from both sides of the Atlantic, so we have to manage dates in both “US style” and “European style”:

    date = usdate | eudate .
 -usdate = month, -'/', day, -'/', year .
 -eudate = day, -'/', month, -'/', year .
   month = d, d? .
     day = d, d? .
    year = d, d, d, d .
      -d = ['0'-'9'] .
  • What is 07/31/2023?

  • It’s either a US date:

    <date xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <month>07</month>
       <day>31</day>
       <year>2023</year>
    </date>

    or an EU date:

    <date xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <day>07</day>
       <month>31</month>
       <year>2023</year>
    </date>

A better dates grammar

    date = usdate | eudate .
 -usdate = month, -'/', day, -'/', year .
 -eudate = day, -'/', month, -'/', year .
   month = '0'?, d | '10' | '11' | '12' .
     day = ['0'-'2']?, d | '30' | '31' .
    year = d, d, d, d .
      -d = ['0'-'9'] .
  • What is 08/01/2023?

  • It’s either a US date:

    <date xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <month>08</month>
       <day>01</day>
       <year>2023</year>
    </date>

    or an EU date:

    <date xmlns:ixml='http://invisiblexml.org/NS'
          ixml:state='ambiguous'>
       <day>08</day>
       <month>01</month>
       <year>2023</year>
    </date>

We can’t fix this with the grammar

🙁

Extra-grammatical information

Pragmas in iXML were introduced last year at Balisage. A quick recap:

  • To a processor that doesn’t support pragmas, they look like comments.

  • You declare them at the top of your grammar with the “pragma” pragma:

    {[+pragma n "https://nineml.org/ns/pragma/"]}
  • You use them to annotate rules or symbols:

    {[n data]} A = 'a'.

NineML pragmas

The NineML family of tools recognize several pragmas:

  • ns to declare a default namespace for the output.

  • regex for matching with regular expressions.

  • discard empty to discard elements that turn out to be empty.

  • Pragmas for record-oriented processing.

  • And several others.

Priority pragma

  • The priority pragma allows you to assign a priority to a nonterminal.

    {[n priority 1]} -eudate = day, -'/', month, -'/', year .
  • The priority pragma has no effect on parsing!

  • The priority pragma has no effect on parsing!

  • However, when choosing trees, if there’s a choice between symbols (vertical ambiguities) and one has a higher priority than the others, it will be selected first.

  • (The default priority is 0; priorities are positive integers.)

A prioritized dates grammar

{[+pragma n "https://nineml.org/ns/pragma/"]}

                    date = usdate | eudate .
                 -usdate = month, -'/', day, -'/', year .
{[n priority 1]} -eudate = day, -'/', month, -'/', year .
                   month = '0'?, d | '10' | '11' | '12' .
                     day = ['0'-'2']?, d | '30' | '31' .
                    year = d, d, d, d .
                      -d = ['0'-'9'] .

If you know most of your data comes from the EU, you might wish to always treat ambiguous dates as if they came from the EU

  • What is 08/01/2023?

  • Now the processor can disambiguate the forest and prefer the 8 January 2023 tree.

    <date>
       <day>08</day>
       <month>01</month>
       <year>2023</year>
    </date>

Yeah, but…

  • What if you’re parsing something that has multiple dates.

  • Like a diary:

    31/07/2023 Balisage begins!
    11/08/2023 Shave yak.
    18/08/2023 Dentist at 13:00.
  • Isn’t it obvious then which date format to use?

Dynamic selection

  • NineML can use XPath expressions or functions to make the choice.

  • A choice function looks like this:

    <xsl:function name="cp:choose-alternative" as="map(*)">
      <xsl:param name="context" as="element()"/>
      <xsl:param name="options" as="map(*)"/>
      …
    </xsl:function>

Date selection

If the graph contains an unambiguous use of date, prefer that option when the date is ambiguous.

<xsl:function name="cp:choose-alternative" as="map(*)">
  <xsl:param name="context" as="element()"/>
  <xsl:param name="options" as="map(*)"/>

  <!-- Look for the first unambiguous use of the symbol -->
  <xsl:variable name="symbols" as="element()+">
    <xsl:for-each select="$context/root()/graph
        /symbol[@name=$context/@name and count(children) = 1]">
      <xsl:sort select="xs:integer(@start)"/>
      <xsl:sequence select="children"/>
    </xsl:for-each>
  </xsl:variable>

  <!-- Did we find one? -->
  <xsl:variable name="ambiguous" select="empty($symbols)"/>

  <xsl:variable name="choice" as="xs:string">
    <xsl:choose>
      <xsl:when test="$ambiguous">
        <!-- If not, just use take the first choice -->
        <xsl:sequence select="$options?available-choices[1]"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- Otherwise, make the same choice here -->
        <xsl:variable name="name"
            select="$symbols[1]/symbol/@name/string()"/>
        <xsl:sequence
            select="$context/children[symbol/@name = $name]
                      /@id/string()"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <!-- Return the selection and if it was ambiguous -->
  <xsl:sequence select="map{'selection': $choice,
                            'ambiguous-choice': $ambiguous}"/>
</xsl:function>

Unambiguous EU dates

Parsing:

31/07/2023 Balisage begins!
11/08/2023 Shave yak.
18/08/2023 Dentist at 13:00.

Produces:

<diary>
   <appointment>
      <date>
         <day>31</day>
         <month>07</month>
         <year>2023</year>
      </date>
      <text>Balisage begins!</text>
   </appointment>
   <appointment>
      <date>
         <day>11</day>
         <month>08</month>
         <year>2023</year>
      </date>
      <text>Shave yak.</text>
   </appointment>
   <appointment>
      <date>
         <day>18</day>
         <month>08</month>
         <year>2023</year>
      </date>
      <text>Dentist at 13:00.</text>
   </appointment>
</diary>

Unambiguous US dates

Parsing:

07/31/2023 Balisage begins!
08/11/2023 Shave yak.
08/18/2023 Dentist at 13:00.

Produces:

<diary>
   <appointment>
      <date>
         <month>07</month>
         <day>31</day>
         <year>2023</year>
      </date>
      <text>Balisage begins!</text>
   </appointment>
   <appointment>
      <date>
         <month>08</month>
         <day>11</day>
         <year>2023</year>
      </date>
      <text>Shave yak.</text>
   </appointment>
   <appointment>
      <date>
         <month>08</month>
         <day>18</day>
         <year>2023</year>
      </date>
      <text>Dentist at 13:00.</text>
   </appointment>
</diary>

Thank you!

Questions?