This lecture note shows how to describe structured data in a finite way no matter their size, using a notion of formal grammar. A grammar provides a finite recipe to construct arbitrarily big data, inductively, by enumerating constructors in the form of templates. (And symmetrically, the notion of structural recursion makes it possible to write finite programs to process data that were constructed inductively. We will abundantly come back to structural recursion in the coming weeks.) Formal grammars are a topic of study in themselves; here, we simply use them as a stepping stone.
A formal grammar works in much the same way as an informal grammar for a natural language. Consider, for example, a language that consists of the following sentences:
Checking whether a given sentence is grammatically correct amounts to checking whether this given sentence is one of the sentences above. The language is a set of sentences, and checking whether a given sentence is grammatically correct is implemented as set membership. Here, this set is finite (since it contains finitely many sentences), and so the language is finite.
Alternatively to representing a language as a set of sentences, we could structure this language with a grammar. We could say, for example, that in this given language,
This grammar is sound because, by concatenating a subject, a verb, a complement, and a period, we construct one of the sentences in the given language.
This grammar is also complete because any of the sentences in the given language can be obtained as the concatenation of a subject, a verb, a complement, and a period.
Checking whether a given sentence is grammatically correct is done based on the grammar: a sentence is well formed with respect to a given grammar if it can be derived / generated from this grammar.
Formally, such a grammar is specified either logically or as a rewriting system.
We can specify the grammar of the language above using relations and proof rules.
the relations:
the proof rules:
The relations about subjects, verbs, and complements are defined with unconditional rules, i.e., axioms, and the relation about sentences is defined with conditional rules.
  ----------- I               ------------- YOU
   subject I                   subject You

  ----------- LIKE            ------------------ DO-NOT-LIKE
   verb like                   verb do not like

  --------------------- ICE-CREAM
   complement ice-cream

  -------------------------------- CARBONATED-BEVERAGES
   complement carbonated beverages

  subject s     verb v     complement c
  ------------------------------------- SUBJECT-VERB-COMPLEMENT-DOT
             sentence s v c
A sentence is grammatically correct if we can construct a proof tree for it, using the rules above.
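These proof rules can be sketched in OCaml (the programming language we are about to adopt), with each relation becoming a predicate over strings. The function names below are my own choice for this sketch, not part of the note:

```ocaml
(* Each unconditional rule (axiom) becomes a clause of a predicate. *)
let subject s = s = "I" || s = "You"

let verb v = v = "like" || v = "do not like"

let complement c = c = "ice-cream" || c = "carbonated beverages"

(* SUBJECT-VERB-COMPLEMENT-DOT: the three premises of the conditional
   rule become three conjuncts. *)
let sentence s v c = subject s && verb v && complement c

let () =
  assert (sentence "I" "like" "ice-cream");
  assert (not (sentence "I" "like" "you"))
```

Checking grammatical correctness then amounts to checking that each premise holds, just as in a proof tree.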
(Note: this example is only indicative, in that it is too simple to be illustrative. Check out A BNF for regular expressions, logically for a more telling illustration.)
Let us use what is called the “BNF” notation:
What is between angle brackets is called a “non-terminal”, and what is not is called a “terminal”. The symbol ::= is read “can give rise to” or “rewrites to”, and the vertical bar is read “or”.
In words, this grammar features
For example, I like ice-cream. is well formed according to this grammar, but I like you. is not: we can construct a derivation for I like ice-cream., using a production rule at every step, but we cannot for I like you.
Here is how to derive I like ice-cream. as a sentence, using --> to mean “rewrites to”:
<sentence>
-->
<subject> <verb> <complement>.
-->
I <verb> <complement>.
-->
I like <complement>.
-->
I like ice-cream.
In each line, we have replaced the left-most non-terminal by what it gives rise to according to its production. And since the result of the derivation does not contain any non-terminal, we conclude that the sentence I like ice-cream. is grammatically correct.
Conversely, we cannot derive I like you. as a sentence:
<sentence>
-->
<subject> <verb> <complement>.
-->
I <verb> <complement>.
-->
I like <complement>.
The derivation is stuck here because <complement> cannot give rise to you, since there is no production rule <complement> ::= you. Since the result of the derivation contains a non-terminal, we conclude that the sentence I like you. is grammatically incorrect.
In practice, the proof that, e.g., I like ice-cream. is a grammatically correct sentence is represented as a derivation tree (a.k.a. abstract-syntax tree):
          ___________<sentence>___________
         /           |            \       \
    <subject>     <verb>     <complement>  .
        |            |             |
        I          like        ice-cream
The root of the tree (i.e., its top-most component) is tagged with the non-terminal <sentence>. Corresponding to the production rule for <sentence>, it has 4 sub-trees: one that is tagged with <subject>, one that is tagged with <verb>, one that is tagged with <complement>, and one that is tagged with .. Corresponding to the production rule for <subject>, the sub-tree tagged with <subject> has one sub-tree, which is tagged with I. Corresponding to the production rule for <verb>, the sub-tree tagged with <verb> has one sub-tree, which is tagged with like. And corresponding to the production rule for <complement>, the sub-tree tagged with <complement> has one sub-tree, which is tagged with ice-cream.
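Such an abstract-syntax tree can be sketched as OCaml data types, one type per non-terminal and one constructor per production. The constructor names below are my own choice; only the structure matters:

```ocaml
(* One type per non-terminal of the grammar. *)
type subject = I | You

type verb = Like | Do_not_like

type complement = Ice_cream | Carbonated_beverages

(* A sentence node has one sub-tree per non-terminal in its production. *)
type sentence = Sentence of subject * verb * complement

(* The abstract-syntax tree of "I like ice-cream." *)
let example = Sentence (I, Like, Ice_cream)
```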
Some terminology:
(And yes, in Computer Science, trees are drawn upside down.)
The concrete syntax of a program is what we write (typically in a file) and supply to a language processor. The spacings in the file, the line breaks, and the comments we have annotated our program with are all grammatically irrelevant.
The abstract syntax of a program is its grammatical view, i.e., how it is structured according to the grammar rules, disregarding spacings, line breaks, and comments.
Food for thought:
Besides verifying whether a given sentence is grammatically correct, we can use the grammar to generate sentences that are grammatically correct. In the present case, we merely need to pick either I or You as subject, either like or do not like as verb, and either ice-cream or carbonated beverages as complement, and then presto: concatenating these picked subject, verb, and complement gives a grammatically correct sentence.
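This generation can be sketched in OCaml by enumerating every combination of picks; the names below are my own for this sketch:

```ocaml
(* The terminals each non-terminal can give rise to. *)
let subjects = ["I"; "You"]
let verbs = ["like"; "do not like"]
let complements = ["ice-cream"; "carbonated beverages"]

(* Concatenate every pick of subject, verb, and complement,
   followed by a period: 2 * 2 * 2 = 8 sentences in all. *)
let all_sentences =
  List.concat_map
    (fun s ->
       List.concat_map
         (fun v ->
            List.map (fun c -> s ^ " " ^ v ^ " " ^ c ^ ".") complements)
         verbs)
    subjects

let () = assert (List.length all_sentences = 8)
```

Every string in all_sentences is grammatically correct by construction, which illustrates the soundness of the grammar.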
In the early days of computing, Computer Science pioneer Christopher Strachey specified a grammar for love letters, and implemented a generator of such letters. (This web page needs to be reloaded a few times to be appreciated.) This outside web page does the same for mathematical proofs. (Make sure to reload it a few times.)
Rhetorical question: how could we account for infinite languages?
Answer: by using self-reference, namely by allowing the derivation induced by a non-terminal to give rise to this non-terminal.
For example, let us move the period from the sentence production to the end of each complement production, and extend the complement rule with an occurrence of <sentence>:
The sentence I like that you like that I like ice-cream., for example, is well formed according to this grammar since it gives rise to the following abstract-syntax tree:
      <sentence>
     /    |     \
<subject> <verb> <complement>
    |       |       |
    I     like    that <sentence>
                       /    |     \
                  <subject> <verb> <complement>
                      |       |       |
                     you    like    that <sentence>
                                         /    |     \
                                    <subject> <verb> <complement>
                                        |       |       |
                                        I     like    ice-cream.
Thanks to self-reference, the language specified by this grammar is now infinite, in that it contains arbitrarily long sentences of the type “blah likes that blah likes that blah likes ...”.
A recursive program following the structure of a self-referential grammar can generate random sentences that are guaranteed to be grammatically correct. Such a generator is useful to stress-test a parser, i.e., to test it on input of an arbitrarily large size.
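Such a generator can be sketched in OCaml for the self-referential grammar above. The depth parameter, my own addition, bounds the recursion so that generation terminates:

```ocaml
let () = Random.self_init ()

(* Pick one element of a non-empty list at random. *)
let pick xs = List.nth xs (Random.int (List.length xs))

(* One generating function per non-terminal; the self-reference in the
   grammar becomes mutual recursion.  The period now ends each
   complement, as in the grammar above. *)
let rec sentence depth =
  pick ["I"; "You"] ^ " " ^ pick ["like"; "do not like"] ^ " "
  ^ complement depth
and complement depth =
  if depth = 0 || Random.bool ()
  then pick ["ice-cream."; "carbonated beverages."]
  else "that " ^ sentence (depth - 1)

let () = print_endline (sentence 3)
```

Every generated string is grammatically correct by construction, whatever the random picks.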
One can also implement a generator of arbitrarily large sentences with exactly one syntax error, to test that a parser does yield the error message corresponding to this syntax error.
Likewise, we can implement a generator of arbitrarily large sentences with more than one syntax error, to detect which error the parser will detect first, or whether it will detect all errors.
Consider the following grammar of sentences:
Does this grammar account for a finite number of sentences or for an infinite one? Why?
Are the following sentences well formed with respect to this grammar? Justify your answer either with a derivation or with an abstract-syntax tree if the answer is positive, or with the beginning of a derivation if the answer is negative.
(Note: half of the positive answers should feature a derivation, and the other half should feature an abstract-syntax tree.)
Consider the following grammar of words:
Consider the following grammar of digits:
Does this grammar account for a finite number of digits or for an infinite one? Why?
A recipe for representing something is sound whenever, using this recipe, one can only obtain proper representations. It is complete whenever any of the things we want to represent can be represented using the recipe.
The following grammar provides a sound representation of natural numbers, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:
n ::= O | S n
Indeed, the sentences that are well formed according to this grammar are interpreted as adding 1 arbitrarily many times to 0. And adding 1 arbitrarily many times to 0 yields a natural number – it cannot yield something else.
The following grammar provides an unsound representation of natural numbers, if we interpret O as the natural number 0, S n as adding 1 to the interpretation of n, and P n as subtracting 1 from the interpretation of n:
n ::= O | S n | P n
Indeed, subtracting 1 from 0 would yield a negative integer, and natural numbers are not negative – we should not be able to construct the representation of a negative number.
The following grammar provides a complete representation of natural numbers, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:
n ::= O | S n
Indeed, any natural number can be obtained by adding 1 enough times to 0, and therefore it can be represented by a well-formed sentence according to this grammar.
The following grammar provides an incomplete representation of natural numbers, if we interpret O as the natural number 0 and SS n as adding 2 to the interpretation of n:
n ::= O | SS n
Indeed, any even number can be represented by adding 2 enough times to 0, but no odd number can be represented.
As stated above, the following grammar provides a representation of natural numbers that is both sound and complete, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:
n ::= O | S n
This syntactic representation of natural numbers is due to Giuseppe Peano. Concretely, it says that O represents the natural number 0 and that, given any representation n of a natural number, S n also represents a natural number. Natural numbers can therefore be constructed inductively, following Peano’s construction. This mathematical construction gives rise to mathematical induction, and the corresponding representational construction gives rise to structural induction over natural numbers.
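Peano’s representation transcribes directly into an OCaml data type, together with a structurally recursive interpretation function mapping it to OCaml’s built-in integers (the function name is my own choice):

```ocaml
(* One constructor per production of the grammar n ::= O | S n. *)
type nat = O | S of nat

(* Structural recursion: one clause per constructor. *)
let rec to_int n =
  match n with
  | O -> 0
  | S n' -> 1 + to_int n'

let () = assert (to_int (S (S (S O))) = 3)
```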
The associated induction principle reads as follows, given a predicate P indexed by a natural number:
  P(O)     for any natural number k, P(k) => P(S k)
  ------------------------------------------------- MATHEMATICAL_INDUCTION
           for any natural number n, P(n)
In words: if
then P holds for all natural numbers.
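This induction principle is mirrored by structural recursion: to define a function over all natural numbers, it suffices to handle O (the base case) and to handle S k using the result for k (the induction step). As an illustrative sketch, here is addition over Peano numbers in OCaml:

```ocaml
type nat = O | S of nat

(* Base case: adding 0 to n yields n.
   Induction step: the clause for S m' uses the result for m'. *)
let rec add m n =
  match m with
  | O -> n
  | S m' -> S (add m' n)

(* 2 + 1 = 3 *)
let () = assert (add (S (S O)) (S O) = S (S (S O)))
```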
Data, in our computers, are representations of information, and our computations are operations over these representations. The best we can hope for is that our data properly represent whatever information they are supposed to represent, that our programs operate over data in a way that is consistent with what these data represent, and that the resulting data represent meaningful information. The soundness and completeness of data representations are part of the job of programming-language implementers – a job that matters because failing to do it properly opens the door to insecurities.
In retrospect, the light-hearted definition of an onion as either nothing or as a layer with an onion inside can be firmed up as a grammar for onion bulbs specifying that a bud is an onion bulb and that a layer over an onion bulb is also an onion bulb:
<onion-bulb> ::= bud | (<onion-bulb>)
So bud, (bud), ((bud)), (((bud))), etc. are all valid representations of onion bulbs, respectively with 0, 1, 2, and 3 layers. To wit, using derivations:
* <onion-bulb> -->
bud
* <onion-bulb> -->
(<onion-bulb>) -->
(bud)
* <onion-bulb> -->
(<onion-bulb>) -->
((<onion-bulb>)) -->
((bud))
Or also, using an abstract-syntax tree:
<onion-bulb>
|
(<onion-bulb>)
|
(<onion-bulb>)
|
(<onion-bulb>)
|
bud
This grammar is unlikely to be sound, as onions only contain a bounded number of layers in the real world, and so adding as many matching parentheses as we like around bud is unlikely to give the representation of an onion in real life. This grammar is only complete if onion bulbs in the real world only contain one onion bulb inside every layer.
The point of interest here is that this grammar of onion bulbs is isomorphic to the grammar of Peano numbers:
The function mapping onion bulbs to Peano numbers (resp. Peano numbers to onion bulbs) is defined recursively over the structure of onion bulbs (resp. Peano numbers), i.e., it is structurally recursive. The inverseness properties are proved inductively over the structure of onion bulbs (resp. Peano numbers), i.e., by structural induction. These recursive functions can be programmed in OCaml, which is about to become our programming language of discourse here, and these properties can be formalized, e.g., using the Coq proof assistant, in a later course.
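As a sketch of what these OCaml programs look like, here are the two types and the two structurally recursive functions mapping one to the other (the names are my own choice):

```ocaml
(* Two data types with the same structure: *)
type nat = O | S of nat

type onion_bulb = Bud | Layer of onion_bulb

(* Structurally recursive in the onion bulb: *)
let rec nat_of_bulb b =
  match b with
  | Bud -> O
  | Layer b' -> S (nat_of_bulb b')

(* Structurally recursive in the Peano number: *)
let rec bulb_of_nat n =
  match n with
  | O -> Bud
  | S n' -> Layer (bulb_of_nat n')

(* The two functions are inverses of each other: *)
let () =
  assert (bulb_of_nat (nat_of_bulb (Layer (Layer Bud))) = Layer (Layer Bud));
  assert (nat_of_bulb (bulb_of_nat (S (S O))) = S (S O))
```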
The isomorphism between the grammar of onion bulbs and the grammar of Peano numbers has a key consequence: what matters is the structure of our data, not the name of their constructors. In the present case, it doesn’t matter that the terminal, in the base case, is called nothing, empty, bud, or O, and that we use matching parentheses, the name “layer”, or S in the induction step. Either way, the resulting data have the same structure.
Shallots are a type of onion that can contain more than one bulb inside a layer. So compared to that of onion bulbs, their grammar needs to be refined,
either to allow arbitrarily many bulbs, where <shallot-bulb> is the root non-terminal:
<shallot-bulbs> ::= <shallot-bulb> | <shallot-bulb> <shallot-bulbs>
<shallot-bulb> ::= bud | (<shallot-bulbs>)
or to allow, say, up to 3 bulbs inside a layer:
<shallot-bulb> ::= bud
| (<shallot-bulb>)
| (<shallot-bulb> <shallot-bulb>)
| (<shallot-bulb> <shallot-bulb> <shallot-bulb>)
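Both refinements can be sketched as OCaml data types (the names are my own choice). The first uses a list for the arbitrarily many bulbs; the second has one constructor per production:

```ocaml
(* Arbitrarily many bulbs inside a layer.  (A well-formed layer should
   contain at least one bulb, which this type does not enforce.) *)
type shallot_bulb = Bud | Layer of shallot_bulb list

(* Up to 3 bulbs inside a layer: one constructor per production. *)
type shallot_bulb3 =
  | Bud3
  | Layer1 of shallot_bulb3
  | Layer2 of shallot_bulb3 * shallot_bulb3
  | Layer3 of shallot_bulb3 * shallot_bulb3 * shallot_bulb3

(* (((bud)) (bud)) in each representation: *)
let example = Layer [Layer [Layer [Bud]]; Layer [Bud]]
let example3 = Layer2 (Layer1 (Layer1 Bud3), Layer1 Bud3)
```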
So bud, (((bud)) (bud)), etc. are all valid representations of shallot bulbs. To wit, using the first grammar and proceeding from left to right:
* <shallot-bulb> -->
bud
* <shallot-bulb> -->
(<shallot-bulbs>) -->
(<shallot-bulb> <shallot-bulbs>) -->
((<shallot-bulbs>) <shallot-bulbs>) -->
((<shallot-bulb>) <shallot-bulbs>) -->
(((<shallot-bulbs>)) <shallot-bulbs>) -->
(((<shallot-bulb>)) <shallot-bulbs>) -->
(((bud)) <shallot-bulbs>) -->
(((bud)) <shallot-bulb>) -->
(((bud)) (<shallot-bulbs>)) -->
(((bud)) (<shallot-bulb>)) -->
(((bud)) (bud))
And using the second grammar:
* <shallot-bulb> -->
bud
* <shallot-bulb> -->
(<shallot-bulb> <shallot-bulb>) -->
((<shallot-bulb>) <shallot-bulb>) -->
(((<shallot-bulb>)) <shallot-bulb>) -->
(((bud)) <shallot-bulb>) -->
(((bud)) (<shallot-bulb>)) -->
(((bud)) (bud))
The grammar allowing arbitrarily many bulbs inside a layer is unlikely to be sound, as shallots only contain a bounded number of bulbs in the real world. The grammar allowing up to 3 bulbs inside a layer is unlikely to be complete, as shallots containing, e.g., 4 bulbs might well exist in the real world, to say nothing about axillary buds and offsets. Modeling, i.e., representing real-world information as data, is a non-trivial affair.
Added the part about isomorphisms in the section about onions [27 Jan 2020]
Added the pointers to axillary buds and offsets, thanks to Youngbean Kim’s active attention [27 Jan 2020]
Added the derivations in the sections about onions and shallots [27 Jan 2020]
Added the sections about onions and shallots [26 Jan 2020]
Reshaped this note according to the presentation at the lecture [24 Jan 2020]
Changed the arrow for grammatical derivation from -> to --> [11 Feb 2019]
Created [21 Jan 2019]