Ambiguity in iXML
and how to control it
- What is iXML?
- Structure and ambiguity
- A quick example
- Terminology
- What is ambiguity?
- What does iXML say about ambiguity?
- An example of ambiguity
- An ambiguous forest
- How can things be ambiguous?
- Vertical ambiguity
- Vertical ambiguity (graph)
- Horizontal ambiguity
- Horizontal ambiguity (graph)
- How does ambiguity arise?
- Let’s parse 1,2,a
- Ambiguous?
- 1,2,a (graph)
- Avoid ambiguity: remove the optionality!
- Avoid ambiguity: optional last
- Avoid ambiguity: optional between
- Aside: infinite ambiguity
- Inherent ambiguity
- A better dates grammar
- We can’t fix this with the grammar
- Extra-grammatical information
- NineML pragmas
- Priority pragma
- A prioritized dates grammar
- Yeah, but…
- Dynamic selection
- Date selection
- Unambiguous EU dates
- Unambiguous US dates
- Thank you!
What is iXML?
We tend to be good at picking out implicit structure because it’s embedded in the conventions of how we communicate; telling computers how to do is often difficult.
It’s a declarative way of describing implicit structure.
Consider “07/31/2023 Balisage presentation at 16:00UTC”
Consider “07/31/2023 Balisage presentation at 16:00UTC” →
<diary> <appointment> <date> <month>07</month> <day>31</day> <year>2023</year> </date> <text>Balisage presentation at <time tz='UTC'>16:00</time> </text> </appointment> </diary>
Structure and ambiguity
Implicit structures aren’t always unambiguous
One of iXML’s strengths is that it supports ambiguous parses
Unlike, lex/yacc LR(n)/LL(n) parsers.
This is good because:
Some things really are ambiguous.
Ambiguous grammars are often easier to write, even if your goal is to refine the grammar until it isn’t ambiguous.
(Pro tip: refine your grammars until they aren’t ambiguous.)
A quick example
Suppose you wanted to parse a list of numbers: “1”, “1,2”, “1,2,3”, etc.
You could use this simple grammar:
list-of-numbers = (number, sep)+ . number = ["0"-"9"]+ . -sep = -','* .
To turn “1,2,3” into:
<list-of-numbers> <number>1</number> <number>2</number> <number>3</number> </list-of-numbers>
Alternatively, you might express that more concisely as:
list-of-numbers = number++sep . number = ["0"-"9"]+ . -sep = -','* .
(Though that does accept a very slightly different set of strings.)
Terminology
Invisible XML is a way of defining a grammar.
list-of-numbers = (number, sep)+ . number = ["0"-"9"]+ . -sep = -','* .
A grammar is a set of rules. Rules define nonterminal symbols
list-of-numbers = …
in terms of other symbols
list-of-numbers = (number, sep)+ .
Symbols that directly match characters in your input are called terminal symbols:
number = ["0"-"9"]+ .
What is ambiguity?
Here are two possible definitions:
A context-free grammar G such that some word has two parse trees is said to be ambiguous—Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, p87.
A grammar is ambiguous if it can produce two different production trees with the same leaves in the same order—Grune et. al., Modern Compiler Design, p38.
What is ambiguity?
What does iXML say about ambiguity?
Nothing nearly as formal:
iXML doesn’t try to define ambigity because different parsing algorithms have different characteristics.
It describes what to do with the results of ambiguity
…if more than one parse results, one is chosen; …, but the resulting serialization should include the attribute
ixml:state
on the document element with a value that includes the wordambiguous
.
An example of ambiguity
product = dessert-topping | floor-wax .
dessert-topping = "Cool Whip" | "Marshmallow Fluff" | "Shimmer" .
floor-wax = "Osmo" | "Briwax" | "Shimmer" .
For the input “Osmo”, there’s no ambiguity: it’s a floor wax.
<product> <floor-wax>Osmo</floor-wax> </product>
For the input “Shimmer”, we have to select one of two equally valid trees:
<product> <floor-wax>Shimmer</floor-wax> </product>
or
<product> <dessert-topping>Shimmer</dessert-topping> </product>
So how does this work? With forests.
An ambiguous forest
Here we’ve picked dessert-topping
How can things be ambiguous?
Ambiguity can arise in a grammar in one of two ways. Taxonomically, these can be described as either “vertical” or “horizontal”.
Brabrand, Claus, Robert Giegerich, and Anders Möller. “Analyzing Ambiguity of Context-Free Grammars.” Science of Computer Programming, volume 75, number 3. Elsevier. (2010). https://cs.au.dk/~amoeller/papers/ambiguity/. doi:10.1016/j.scico.2009.11.002.
Vertical ambiguity
If there’s a choice between two nonterminals that both match part of the input, that’s “vertical ambiguity”.
character = letter | numeral . letter = ['A'-'Z'] . numeral = arabic | roman . arabic = ['0'-'9'] . roman = ["C" | "D" | "I" | "L" | "M" | "V" | "X"] .
There are two parses for
V
(for example).
Vertical ambiguity (graph)
<character><letter><V</letter></character>
character = letter
| numeral .
letter = ['A'-'Z'] .
numeral = arabic | roman .
arabic = ['0'-'9'] .
roman = ["C" | "D" | "I" | "L" | "M" | "V" | "X"] .
Horizontal ambiguity
If there’s a sequence of nonterminals where the end of one subsequence “overlaps” with the beginning of the next, that’s “horizontal ambiguity”.
S = Xa, aY .
Xa = 'x', 'a'? .
aY = 'a'?, 'y' .
There are two parses for xay
.
Why is this formatted this way
Horizontal ambiguity (graph)
|
|
<S><Xa>x</Xa><aY>ay</aY></S> | <S><Xa>xa</Xa><aY>y</aY></S> |
How does ambiguity arise?
Nonterminals that can match nothing are especially susceptible to ambiguity.
Suppose we want to match strings like “1,2,a”; “12a”; 12,34a,bcd”; etc.
list = list-of-numbers, sep, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
letter = ["a"-"z"] .
number = ["0"-"9"] .
-sep = -','* .
That looks perfectly reasonable.
Let’s parse 1,2,a
But it’s ambiguous! There are two ways to parse “1,2,a”.
<list xmlns:ixml='http://invisiblexml.org/NS'
ixml:state='ambiguous'>
<list-of-numbers>
<number>1</number>
<number>2</number>
</list-of-numbers>
<list-of-letters>
<letter>a</letter>
</list-of-letters>
</list>
Ambiguous?
How is this ambiguous?
list = list-of-numbers, sep, list-of-letters . list-of-letters = (letter, sep)+ . list-of-numbers = (number, sep)+ . letter = ["a"-"z"] . number = ["0"-"9"] . -sep = -','* .
Ask the processor for a hint!
My implementation, NineML, supports an option which runs Möller’s ambiguity analyzer.
The grammar is ambiguous: *** horizontal ambiguity: list[#1]: \ list-of-numbers <--> sep list-of-letters ambiguous string: "0,a" matched as "0" <--> ",a" or "0," <--> "a" Found 2 possible parses.
(This is an especially frustrating case because both parses look the same!)
1,2,a (graph)
Avoid ambiguity: remove the optionality!
list = list-of-numbers, sep, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
letter = ["a"-"z"] .
number = ["0"-"9"] .
-sep = -',' .
But now we can only match “1,2,,a,”, we can’t match things without commas and we need two commas between the lists!
Avoid ambiguity: optional last
One strategy is to place optional elements at the end.
list = list-of-numbers, list-of-letters .
list-of-letters = (letter, sep)+ .
list-of-numbers = (number, sep)+ .
letter = ["a"-"z"] .
number = ["0"-"9"] .
-sep = -','* .
Avoid ambiguity: optional between
Explain what ++ means
Another strategy is to put optional elements between non-optional ones.
list = list-of-numbers, sep, list-of-letters .
list-of-letters = letter++sep .
list-of-numbers = number++sep .
letter = ["a"-"z"] .
number = ["0"-"9"] .
-sep = -','* .
Don’t do both!
Other strategies may be available.
Aside: infinite ambiguity
A graph with a loop is infinitely ambiguous.
name = letter | symbol . letter = symbol | ["a"-"z"] . symbol = letter | ["α"-"ω"] .
If we parse “β”, we get this graph:
For which this is a valid parse:
<name xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <symbol>β</symbol> </name>
So is this:
<name xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <letter> <symbol>β</symbol> </letter> </name>
And this:
<name xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <letter> <symbol> <letter> <symbol>β</symbol> </letter> </symbol> </letter> </name>
And this:
<name xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <letter> <symbol> <letter> <symbol> <letter> <symbol> <letter> <symbol> <letter> <symbol> <letter> <symbol> <letter> <symbol> <letter> <symbol>β</symbol> </letter> </symbol> </letter> </symbol> </letter> </symbol> </letter> </symbol> </letter> </symbol> </letter> </symbol> </letter> </symbol> </letter> </name>
And so on…
Inherent ambiguity
Suppose that we get documents from both sides of the Atlantic, so we have to manage dates in both “US style” and “European style”:
date = usdate | eudate .
-usdate = month, -'/', day, -'/', year .
-eudate = day, -'/', month, -'/', year .
month = d, d? .
day = d, d? .
year = d, d, d, d .
-d = ['0'-'9'] .
What is 07/31/2023?
It’s either a US date:
<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <month>07</month> <day>31</day> <year>2023</year> </date>
or an EU date:
<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <day>07</day> <month>31</month> <year>2023</year> </date>
A better dates grammar
date = usdate | eudate .
-usdate = month, -'/', day, -'/', year .
-eudate = day, -'/', month, -'/', year .
month = '0'?, d | '10' | '11' | '12' .
day = ['0'-'2']?, d | '30' | '31' .
year = d, d, d, d .
-d = ['0'-'9'] .
What is 08/01/2023?
It’s either a US date:
<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <month>08</month> <day>01</day> <year>2023</year> </date>
or an EU date:
<date xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'> <day>08</day> <month>01</month> <year>2023</year> </date>
We can’t fix this with the grammar
🙁
Extra-grammatical information
Pragmas in iXML were introduced last year at Balisage. A quick recap:
To a processor that doesn’t support pragmas, they look like comments.
You declare them at the top of your grammar with the “pragma” pragma:
{[+pragma n "https://nineml.org/ns/pragma/"]}
You use them to annotate rules or symbols:
{[n data]} A = 'a'.
NineML pragmas
The NineML family of tools recognize several pragmas:
ns
to declare a default namespace for the output.regex
for matching with regular expressions.discard empty
to discard elements that turn out to be empty.Pragmas for record-oriented processing.
And several others.
Priority pragma
The
priority
pragma allows you to assign a priority to a nonterminal.{[n priority 1]} -eudate = day, -'/', month, -'/', year .
The priority pragma has no effect on parsing!
The priority pragma has no effect on parsing!
However, when choosing trees, if there’s a choice between symbols (vertical ambiguities) and one has a higher priority than the others, it will be selected first.
(The default priority is 0; priorities are positive integers.)
A prioritized dates grammar
{[+pragma n "https://nineml.org/ns/pragma/"]}
date = usdate | eudate .
-usdate = month, -'/', day, -'/', year .
{[n priority 1]} -eudate = day, -'/', month, -'/', year .
month = '0'?, d | '10' | '11' | '12' .
day = ['0'-'2']?, d | '30' | '31' .
year = d, d, d, d .
-d = ['0'-'9'] .
If you know most of your data comes from the EU, you might wish to always treat ambiguous dates as if they came from the EU
What is 08/01/2023?
Now the processor can disambiguate the forest and prefer the 8 January 2023 tree.
<date> <day>08</day> <month>01</month> <year>2023</year> </date>
Yeah, but…
What if you’re parsing something that has multiple dates.
Like a diary:
31/07/2023 Balisage begins! 11/08/2023 Shave yak. 18/08/2023 Dentist at 13:00.
Isn’t it obvious then which date format to use?
Dynamic selection
NineML can use XPath expressions or functions to make the choice.
A choice function looks like this:
<xsl:function name="cp:choose-alternative" as="map(*)"> <xsl:param name="context" as="element()"/> <xsl:param name="options" as="map(*)"/> … </xsl:function>
Date selection
If the graph contains an unambiguous use of date
,
prefer that option when the date is ambiguous.
<xsl:function name="cp:choose-alternative" as="map(*)">
<xsl:param name="context" as="element()"/>
<xsl:param name="options" as="map(*)"/>
<!-- Look for the first unambiguous use of the symbol -->
<xsl:variable name="symbols" as="element()+">
<xsl:for-each select="$context/root()/graph
/symbol[@name=$context/@name and count(children) = 1]">
<xsl:sort select="xs:integer(@start)"/>
<xsl:sequence select="children"/>
</xsl:for-each>
</xsl:variable>
<!-- Did we find one? -->
<xsl:variable name="ambiguous" select="empty($symbols)"/>
<xsl:variable name="choice" as="xs:string">
<xsl:choose>
<xsl:when test="$ambiguous">
<!-- If not, just use take the first choice -->
<xsl:sequence select="$options?available-choices[1]"/>
</xsl:when>
<xsl:otherwise>
<!-- Otherwise, make the same choice here -->
<xsl:variable name="name"
select="$symbols[1]/symbol/@name/string()"/>
<xsl:sequence
select="$context/children[symbol/@name = $name]
/@id/string()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<!-- Return the selection and if it was ambiguous -->
<xsl:sequence select="map{'selection': $choice,
'ambiguous-choice': $ambiguous}"/>
</xsl:function>
Unambiguous EU dates
Parsing:
31/07/2023 Balisage begins!
11/08/2023 Shave yak.
18/08/2023 Dentist at 13:00.
Produces:
<diary>
<appointment>
<date>
<day>31</day>
<month>07</month>
<year>2023</year>
</date>
<text>Balisage begins!</text>
</appointment>
<appointment>
<date>
<day>11</day>
<month>08</month>
<year>2023</year>
</date>
<text>Shave yak.</text>
</appointment>
<appointment>
<date>
<day>18</day>
<month>08</month>
<year>2023</year>
</date>
<text>Dentist at 13:00.</text>
</appointment>
</diary>
Unambiguous US dates
Parsing:
07/31/2023 Balisage begins!
08/11/2023 Shave yak.
08/18/2023 Dentist at 13:00.
Produces:
<diary>
<appointment>
<date>
<month>07</month>
<day>31</day>
<year>2023</year>
</date>
<text>Balisage begins!</text>
</appointment>
<appointment>
<date>
<month>08</month>
<day>11</day>
<year>2023</year>
</date>
<text>Shave yak.</text>
</appointment>
<appointment>
<date>
<month>08</month>
<day>18</day>
<year>2023</year>
</date>
<text>Dentist at 13:00.</text>
</appointment>
</diary>
Thank you!
Questions?
InvisibleXML: https://invisiblexml.org/
NineML: https://nineml.org/
These slides: https://norm.tovey-walsh.com/talks/2023/balisage/
Syntax highlighting for Invisible XML uses a (crude) language definition in my fork of PrismJS at https://github.com/ndw/prism/ (in the
ixml
branch)