Finite description of arbitrarily large data

Goal

This lecture note shows how to describe structured data in a finite way no matter their size, using a notion of formal grammar. A grammar provides a finite recipe to construct arbitrarily big data, inductively, by enumerating constructors in the form of templates. (And symmetrically, the notion of structural recursion makes it possible to write finite programs to process data that were constructed inductively. We will abundantly come back to structural recursion in the coming weeks.) Formal grammars are a topic of study in themselves; here, we simply use them as a stepping stone.

Grammars for finite languages

A formal grammar works much in the same way as an informal grammar in natural languages. Consider, for example, a language that consists of the following sentences:

  • I like ice-cream.
  • I like carbonated beverages.
  • I do not like ice-cream.
  • I do not like carbonated beverages.
  • You like ice-cream.
  • You like carbonated beverages.
  • You do not like ice-cream.
  • You do not like carbonated beverages.

Checking whether a given sentence is grammatically correct amounts to checking whether this given sentence is one of the sentences above. The language is a set of sentences, and checking whether a given sentence is grammatically correct is implemented as set membership. Here, this set contains finitely many sentences, and so the language is finite.
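Looking ahead to OCaml, our upcoming programming language of discourse, this extensional view of a finite language can be sketched directly: the language is a list of strings, and the grammaticality check is set membership. (The names below are our own, for illustration only.)

```ocaml
(* The finite language, represented extensionally as a list of sentences. *)
let language =
  ["I like ice-cream.";
   "I like carbonated beverages.";
   "I do not like ice-cream.";
   "I do not like carbonated beverages.";
   "You like ice-cream.";
   "You like carbonated beverages.";
   "You do not like ice-cream.";
   "You do not like carbonated beverages."]

(* A sentence is grammatically correct iff it belongs to the language. *)
let is_grammatically_correct sentence =
  List.mem sentence language
```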

As an alternative to representing a language as a set of sentences, we could structure this language with a grammar. We could say, for example, that in this given language,

  • a sentence can be a subject followed by a verb and a complement, and punctuated with a period, where
    • a subject can be I or you,
    • a verb can be like or do not like, and
    • a complement can be ice-cream or carbonated beverages.

This grammar is sound because

  • if we construct a sentence as a subject followed by a verb and a complement, and punctuated with a period,
  • if we use either I or you as the subject,
  • if we use either like or do not like as the verb, and
  • if we use either ice-cream or carbonated beverages as the complement,

we construct one of the sentences in the given language.

This grammar is also complete because any of the sentences in the given language can be obtained as the concatenation of a subject, a verb, a complement, and a period.

Checking whether a given sentence is grammatically correct is done based on the grammar: a sentence is well formed with respect to a given grammar if it can be derived / generated from this grammar.

Formally, such a grammar is specified either logically or as a rewriting system.

Grammars as a logical system

We can specify the grammar of the language above using relations and proof rules.

  • the relations:

    • the relation subject s, where s denotes a terminal,
    • the relation verb v, where v denotes a terminal,
    • the relation complement c, where c denotes a terminal, and
    • the relation sentence s v c, where s, v, and c denote terminals
  • the proof rules:

    The relations about subjects, verbs, and complements are defined with unconditional rules, i.e., axioms, and the relation about sentences is defined with conditional rules.

    ----------- I
     subject I

    ------------ YOU
     subject You

    ---------- LIKE
     verb like

    ----------------- DO-NOT-LIKE
     verb do not like

    --------------------- ICE-CREAM
     complement ice-cream

    -------------------------------- CARBONATED-BEVERAGES
     complement carbonated beverages

    subject s   verb v   complement c
    ---------------------------------- SUBJECT-VERB-COMPLEMENT-DOT
     sentence s v c

A sentence is grammatically correct if we can construct a proof tree for it, using the rules above.

(Note: this example is only indicative, in that it is too simple to be illustrative. Check out A BNF for regular expressions, logically for a more telling illustration.)

Grammars as a rewriting system

Let us use what is called the “BNF” notation:

<sentence> ::= <subject> <verb> <complement>.
<subject> ::= I | You
<verb> ::= like | do not like
<complement> ::= ice-cream | carbonated beverages

What is between angle brackets is called a “non-terminal”, and what is not is called a “terminal”. The symbol ::= is read “can give rise to” or “rewrites to”, and the vertical bar is read “or”.

In words, this grammar features

  • 4 non-terminals, <sentence>, <subject>, <verb>, and <complement>, the first of which is said to be the root non-terminal;
  • 7 terminals, ., I, You, like, do not like, ice-cream, and carbonated beverages; and
  • 7 production rules, namely
    • the non-terminal <sentence> gives rise to <subject> <verb> <complement>.,
    • the non-terminal <subject> gives rise to I,
    • the non-terminal <subject> gives rise to You,
    • the non-terminal <verb> gives rise to like,
    • the non-terminal <verb> gives rise to do not like,
    • the non-terminal <complement> gives rise to ice-cream, and
    • the non-terminal <complement> gives rise to carbonated beverages.

For example, I like ice-cream. is well formed according to this grammar, but I like you. is not: using a production rule at every step, we can construct a derivation for I like ice-cream., but we cannot for I like you.:

  • Here is how to derive I like ice-cream. as a sentence, using --> to mean “rewrites to”:

    <sentence>
    -->
    <subject> <verb> <complement>.
    -->
    I <verb> <complement>.
    -->
    I like <complement>.
    -->
    I like ice-cream.
    

    In each line, we have replaced the left-most non-terminal by what it gives rise to according to its production. And since the result of the derivation does not contain any non-terminal, we conclude that the sentence I like ice-cream. is grammatically correct.

  • Conversely, we cannot derive I like you. as a sentence:

    <sentence>
    -->
    <subject> <verb> <complement>.
    -->
    I <verb> <complement>.
    -->
    I like <complement>.
    

    The derivation is stuck here because <complement> cannot give rise to you, since there is no production rule <complement> ::= you. Since the result of the derivation contains a non-terminal, we conclude that the sentence I like you. is grammatically incorrect.
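Checking well-formedness for this finite grammar can also be sketched in OCaml without any derivation machinery: a string is a sentence exactly when it decomposes into a subject, a verb, a complement, and a period. (This brute-force recognizer is our own illustrative sketch, not a general parsing technique — it only works because the language is finite.)

```ocaml
(* The constituents of the grammar, listed extensionally. *)
let subjects = ["I"; "You"]
let verbs = ["like"; "do not like"]
let complements = ["ice-cream"; "carbonated beverages"]

(* A string is well formed iff it is the concatenation of a subject,
   a verb, a complement, and a period, with separating spaces. *)
let is_sentence s =
  List.exists
    (fun subject ->
      List.exists
        (fun verb ->
          List.exists
            (fun complement ->
              s = subject ^ " " ^ verb ^ " " ^ complement ^ ".")
            complements)
        verbs)
    subjects
```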

Abstract-syntax trees

In practice, the proof that, e.g., I like ice-cream. is a grammatically correct sentence is represented as a derivation tree (a.k.a. abstract-syntax tree):

        <sentence>___________
        /   |    \           \
<subject> <verb> <complement> .
    |       |         |
    I      like   ice-cream

The root of the tree (i.e., its top-most component) is tagged with the non-terminal <sentence>. Corresponding to the production rule for <sentence>, it has 4 sub-trees: one that is tagged with <subject>, one that is tagged with <verb>, one that is tagged with <complement>, and one that is tagged with .. Corresponding to the production rule for <subject>, the sub-tree tagged with <subject> has one sub-tree, which is tagged with I. Corresponding to the production rule for <verb>, the sub-tree tagged with <verb> has one sub-tree, which is tagged with like. And corresponding to the production rule for <complement>, the sub-tree tagged with <complement> has one sub-tree, which is tagged with ice-cream.

Some terminology:

  • the top-most component of the tree is said to be the “root” of the tree;
  • the components tagged with a non-terminal are said to be the “nodes” of the tree; and
  • the components tagged with a terminal are said to be the “leaves” of the tree.

(And yes, in Computer Science, trees are drawn upside down.)
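In OCaml, such abstract-syntax trees are naturally represented with variant types: one type per non-terminal, one constructor per production. Here is a sketch, with constructor names of our own choosing; the terminal period needs no constructor of its own since every sentence ends with it.

```ocaml
type subject = I | You

type verb = Like | Do_not_like

type complement = Ice_cream | Carbonated_beverages

(* A sentence node has one sub-tree per non-terminal of its production. *)
type sentence = Sentence of subject * verb * complement

(* The abstract-syntax tree of "I like ice-cream." drawn above. *)
let example = Sentence (I, Like, Ice_cream)
```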

Concrete and abstract syntax

The concrete syntax of a program is what we write (typically in a file) and supply to a language processor. The spacings in the file, the line breaks, and the comments with which we have annotated our program are all grammatically irrelevant.

The abstract syntax of a program is its grammatical view, i.e., how it is structured according to the grammar rules, disregarding spacings, line breaks, and comments.

Parsing and unparsing

  • Parsing from concrete syntax to abstract syntax: if a concrete-syntax representation (typically a string of characters) is syntactically correct, then it can be represented as an abstract-syntax tree. Parsing is implemented by a parser.
  • Unparsing from abstract syntax to concrete syntax: if an abstract-syntax tree is constructed correctly, then it can be represented as a concrete-syntax string. Unparsing is implemented by an unparser.

Food for thought:

  • A parser implements a compiler from the language of concrete syntax to the language of abstract syntax.
  • An unparser implements a compiler from the language of abstract syntax to the language of concrete syntax.
  • Parsing the result of unparsing gives the same abstract-syntax tree.
  • If the spacings, the line breaks, and the comments are lost during parsing, then unparsing the result of parsing a program is unlikely to give the same program.
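An unparser for this little language can be sketched in OCaml as a structurally recursive function from abstract-syntax trees to strings. (The types are repeated here so that the sketch is self-contained; the names are our own.)

```ocaml
type subject = I | You
type verb = Like | Do_not_like
type complement = Ice_cream | Carbonated_beverages
type sentence = Sentence of subject * verb * complement

(* Unparsing: from an abstract-syntax tree back to concrete syntax. *)
let unparse (Sentence (s, v, c)) =
  let subject = match s with I -> "I" | You -> "You" in
  let verb = match v with Like -> "like" | Do_not_like -> "do not like" in
  let complement =
    match c with
    | Ice_cream -> "ice-cream"
    | Carbonated_beverages -> "carbonated beverages"
  in
  subject ^ " " ^ verb ^ " " ^ complement ^ "."
```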

Using a grammar to generate grammatically correct sentences

Besides verifying whether a given sentence is grammatically correct, we can use the grammar to generate sentences that are grammatically correct. In the present case, we merely need to pick either I or You as subject, either like or do not like as verb, and either ice-cream or carbonated beverages as complement, and then presto: concatenating these picked subject, verb, and complement gives a grammatically correct sentence.
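This generation procedure is simple to sketch in OCaml: pick each constituent at random and concatenate. By construction, the output is always one of the eight sentences of the language.

```ocaml
(* Pick each constituent at random, then concatenate: by construction,
   the result is grammatically correct. *)
let generate () =
  let subject = if Random.bool () then "I" else "You" in
  let verb = if Random.bool () then "like" else "do not like" in
  let complement =
    if Random.bool () then "ice-cream" else "carbonated beverages"
  in
  subject ^ " " ^ verb ^ " " ^ complement ^ "."
```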

In the early days of computing, Computer Science pioneer Christopher Strachey specified a grammar for love letters, and implemented a generator of such letters. (This web page needs to be reloaded a few times to be appreciated.) This outside web page does the same for mathematical proofs. (Make sure to reload it a few times.)

Grammars for infinite languages

Rhetorical question: how could we account for infinite languages?

Answer: by using self-reference, namely by allowing the derivation induced by a non-terminal to give rise to this non-terminal.

For example, let us move the period from the sentence production to the end of each complement production, and extend the complement rule with an occurrence of <sentence>:

<sentence> ::= <subject> <verb> <complement>
<subject> ::= I | You
<verb> ::= like | do not like
<complement> ::= ice-cream. | carbonated beverages. | that <sentence>

The sentence I like that you like that I like ice-cream., for example, is well formed according to this grammar since it gives rise to the following abstract-syntax tree:

        <sentence>
        /   |    \
<subject> <verb> <complement>
    |       |         |
    I      like  that <sentence>
                      /   |    \
              <subject> <verb> <complement>
                  |       |         |
                 you     like  that <sentence>
                                    /   |    \
                            <subject> <verb> <complement>
                                |       |         |
                                I      like   ice-cream.

Thanks to self-reference, the language specified by this grammar is now infinite, in that it contains arbitrarily long sentences of the type “blah likes that blah likes that blah likes ...”.

Using a grammar to generate grammatically correct sentences, continued

A recursive program following the structure of a self-referential grammar can generate random sentences that are guaranteed to be grammatically correct. Such a generator is useful to stress-test a parser, i.e., to test it on input of an arbitrarily large size.

One can also implement a generator of arbitrarily large sentences with exactly one syntax error, to test that a parser does yield the error message corresponding to this syntax error.

Likewise, we can implement a generator of arbitrarily large sentences with more than one syntax error, to detect which error the parser will detect first, or whether it will detect all errors.
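Such a generator can be sketched in OCaml as a recursive program that follows the structure of the self-referential grammar. The depth bound is our own addition: it forces the self-reference to bottom out, so that generation is guaranteed to terminate.

```ocaml
(* A generator for the self-referential grammar.  When [depth] reaches 0,
   the complement is forced to be one of the two base complements. *)
let rec generate_sentence depth =
  generate_subject () ^ " " ^ generate_verb () ^ " "
  ^ generate_complement depth
and generate_subject () =
  if Random.bool () then "I" else "You"
and generate_verb () =
  if Random.bool () then "like" else "do not like"
and generate_complement depth =
  if depth > 0 && Random.bool ()
  then "that " ^ generate_sentence (depth - 1)
  else if Random.bool () then "ice-cream." else "carbonated beverages."
```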

Exercise 3

Consider the following grammar of sentences:

<sentence> ::= <subject> <verb> <complement>
<subject> ::= I | You | you
<verb> ::= see | do not see
<complement> ::= me. | myself. | you. | yourself. | the mirror. | that <sentence>

  1. Does this grammar account for a finite number of sentences or for an infinite one? Why?

  2. Are the following sentences well formed with respect to this grammar? Justify your answer either with a derivation or with an abstract-syntax tree if the answer is positive, or with the beginning of a derivation if the answer is negative.

    1. I see myself.
    2. I do not see myself.
    3. I see you.
    4. I do not see you.
    5. You see me.
    6. I see the mirror.
    7. I do not see the mirror.
    8. I see that I see the mirror.
    9. I see that you do not see yourself.
    10. I see that you see that I do not see the mirror.
    11. I see that I do not see that I see the mirror.
    12. I see.
    13. You see yourself in the mirror.
    14. The mirror sees you.
    15. You see that I see that you see that I see the mirror.
    16. Mirror, mirror.

    (Note: half of the positive answers should feature a derivation, and the other half should feature an abstract-syntax tree.)

Exercise 4

Consider the following grammar of words:

<word> ::= 0 | 1 | 0 <word> 0 | 1 <word> 1

  1. Does this grammar account for a finite number of words or for an infinite one?
  2. Are the following words well formed with respect to this grammar? Justify your answer.
    1. 0
    2. 1
    3. 1 0
    4. 1 1
    5. 1 0 0
    6. 1 0 1
    7. 1 1 0
    8. 1 1 1
    9. 1 0 0 0
    10. 1 0 0 1
    11. 1 0 1 0
    12. 1 0 1 1
    13. 1 1 0 0
    14. 1 1 0 1
    15. 1 1 1 0
    16. 1 1 1 1
    17. 1 0 0 0 0
    18. 1 0 0 0 1

Exercise 5

Consider the following grammar of digits:

<digit> ::= 0 | 1
<number> ::= <digit> | <digit> <number>
<magic-word> ::= abracadabra

Does this grammar account for a finite number of digits or for an infinite one? Why?

Soundness and completeness

A recipe for representing something is sound whenever, using this recipe, one can only obtain proper representations. It is complete whenever any of the things we want to represent can be represented using the recipe.

Example of a sound representation

The following grammar provides a sound representation of natural numbers, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:

n ::= O | S n

Indeed, the sentences that are well formed according to this grammar are interpreted as adding 1 arbitrarily many times to 0. And adding 1 arbitrarily many times to 0 yields a natural number – it cannot yield something else.

Example of an unsound representation

The following grammar provides an unsound representation of natural numbers, if we interpret O as the natural number 0, S n as adding 1 to the interpretation of n, and P n as subtracting 1 from the interpretation of n:

n ::= O | S n | P n

Indeed, subtracting 1 from 0 would yield a negative integer, and natural numbers are not negative – we should not be able to construct the representation of a negative number.

Example of a complete representation

The following grammar provides a complete representation of natural numbers, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:

n ::= O | S n

Indeed, any natural number can be obtained by adding 1 enough times to 0, and therefore it can be represented by a well-formed sentence according to this grammar.

Example of an incomplete representation

The following grammar provides an incomplete representation of natural numbers, if we interpret O as the natural number 0 and SS n as adding 2 to the interpretation of n:

n ::= O | SS n

Indeed, any even number can be represented by adding 2 enough times to 0, but odd numbers cannot be represented.

A sound and complete representation of natural numbers

As stated above, the following grammar provides a representation of natural numbers that is both sound and complete, if we interpret O as the natural number 0 and S n as adding 1 to the interpretation of n:

n ::= O | S n

This syntactic representation of natural numbers is due to Giuseppe Peano. Concretely, it says that O represents the natural number 0, and that given any representation of a natural number n, S n also represents a natural number. This representation of natural numbers can therefore be constructed inductively with Peano’s construction. This mathematical construction gives rise to mathematical induction and this representational construction gives rise to structural induction over natural numbers.

This associated induction principle reads as follows, given a predicate P indexed by a natural number:

P(O)    for any natural number k, P(k) => P(S k)
------------------------------------------------ MATHEMATICAL_INDUCTION
for any natural number n, P(n)

In words: if

  • P holds for O (base case), and
  • for all natural numbers k such that P holds for k (which is the induction hypothesis), P holds for S k (induction step),

then P holds for all natural numbers.
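In OCaml, Peano's construction is captured by a variant type with two constructors, and functions over it are defined by structural recursion, mirroring the induction principle: one clause for O (the base case) and one clause for S n (the induction step). A sketch, with names of our own choosing:

```ocaml
(* Peano numbers: O is a natural number, and S n is a natural number
   whenever n is. *)
type nat = O | S of nat

(* Interpretation: O denotes 0, and S n denotes 1 plus the denotation
   of n.  Structurally recursive: one clause per constructor. *)
let rec to_int n =
  match n with
  | O -> 0
  | S n' -> 1 + to_int n'

(* Conversely, any non-negative int can be represented (completeness). *)
let rec of_int i =
  if i = 0 then O else S (of_int (i - 1))
```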

Computer data

Data, in our computers, are representations of information, and our computations are operations over these representations. The best we can hope for is that our data properly represent whatever information they are supposed to represent, that our programs operate over data in a way that is consistent with what these data represent, and that the resulting data represent meaningful information. The soundness and completeness of data representations are part of the job of programming-languages implementers – a job that matters because failing to do it properly opens the door to insecurities.

The onion, revisited

In retrospect, the light-hearted definition of an onion as either nothing or as a layer with an onion inside can be firmed up as a grammar for onion bulbs specifying that a bud is an onion bulb and that a layer over an onion bulb is also an onion bulb:

<onion-bulb> ::= bud | (<onion-bulb>)

where

  • a bud is represented with the terminal bud, and
  • a layer is represented with two matching parentheses.

So bud, (bud), ((bud)), (((bud))), etc. are all valid representations of onion bulbs, respectively with 0, 1, 2, and 3 layers. To wit, using derivations:

* <onion-bulb> -->
  bud

* <onion-bulb> -->
  (<onion-bulb>) -->
  (bud)

* <onion-bulb> -->
  (<onion-bulb>) -->
  ((<onion-bulb>)) -->
  ((bud))

Or also, using an abstract-syntax tree:

 <onion-bulb>
       |
(<onion-bulb>)
       |
(<onion-bulb>)
       |
(<onion-bulb>)
       |
      bud

This grammar is unlikely to be sound, as onions only contain a bounded number of layers in the real world, and so adding as many matching parentheses as we like around bud is unlikely to give the representation of an onion in real life. This grammar is only complete if onion bulbs in the real world only contain one onion bulb inside every layer.

The point of interest here is that this grammar of onion bulbs is isomorphic to the grammar of Peano numbers:

  • given an onion bulb, if we replace bud by O and each matching pair of parentheses by S, we obtain a Peano number;
  • given a Peano number, if we replace O by bud and each occurrence of S by a matching pair of parentheses, we obtain an onion bulb;
  • given any onion bulb, if we map it to a Peano number and then map this Peano number back to an onion bulb, we obtain the same onion bulb, i.e., an onion bulb that has the same structure as the given onion bulb; and
  • given any Peano number, if we map it to an onion bulb and then map this onion bulb back to a Peano number, we obtain the same Peano number, i.e., a Peano number that has the same structure as the given Peano number.

The function mapping onion bulbs to Peano numbers (resp. Peano numbers to onion bulbs) is defined recursively over the structure of onion bulbs (resp. Peano numbers), i.e., it is structurally recursive. The inverseness properties are proved inductively over the structure of onion bulbs (resp. Peano numbers), i.e., by structural induction. These recursive functions can be programmed in OCaml, which is about to become our programming language of discourse here, and these properties can be formalized, e.g., using the Coq proof assistant, in a later course.
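For instance, the two mapping functions can be sketched in OCaml as follows (the constructor names are our own); each is structurally recursive, with one clause per constructor:

```ocaml
type nat = O | S of nat

type onion_bulb = Bud | Layer of onion_bulb

(* Mapping onion bulbs to Peano numbers: Bud becomes O, and each
   Layer becomes an S. *)
let rec nat_of_onion ob =
  match ob with
  | Bud -> O
  | Layer ob' -> S (nat_of_onion ob')

(* ... and back: O becomes Bud, and each S becomes a Layer. *)
let rec onion_of_nat n =
  match n with
  | O -> Bud
  | S n' -> Layer (onion_of_nat n')
```

The inverseness properties stated above then amount to the round trips nat_of_onion (onion_of_nat n) = n and onion_of_nat (nat_of_onion ob) = ob.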

The isomorphism between the grammar of onion bulbs and the grammar of Peano numbers has a key consequence: what matters is the structure of our data, not the name of their constructors. In the present case, it doesn’t matter that the terminal, in the base case, is called nothing, empty, bud, or O, and that we use matching parentheses, the name “layer”, or S in the induction step. Either way, the resulting data have the same structure.

More agricultural vagaries: the case of shallots

Shallots are a type of onion that can contain more than one bulb inside a layer. So compared to that of onion bulbs, their grammar needs to be refined,

  • either to allow arbitrarily many bulbs, where <shallot-bulb> is the root non-terminal:

    <shallot-bulbs> ::= <shallot-bulb> | <shallot-bulb> <shallot-bulbs>
    
    <shallot-bulb> ::= bud | (<shallot-bulbs>)
    
  • or to allow, say, up to 3 bulbs inside a layer:

    <shallot-bulb> ::= bud
                     | (<shallot-bulb>)
                     | (<shallot-bulb> <shallot-bulb>)
                     | (<shallot-bulb> <shallot-bulb> <shallot-bulb>)
    

So bud, (((bud)) (bud)), etc. are all valid representations of shallot bulbs. To wit, using the first grammar and proceeding from left to right:

* <shallot-bulb> -->
  bud

* <shallot-bulb> -->
  (<shallot-bulbs>) -->
  (<shallot-bulb> <shallot-bulbs>) -->
  ((<shallot-bulbs>) <shallot-bulbs>) -->
  ((<shallot-bulb>) <shallot-bulbs>) -->
  (((<shallot-bulbs>)) <shallot-bulbs>) -->
  (((<shallot-bulb>)) <shallot-bulbs>) -->
  (((bud)) <shallot-bulbs>) -->
  (((bud)) <shallot-bulb>) -->
  (((bud)) (<shallot-bulbs>)) -->
  (((bud)) (<shallot-bulb>)) -->
  (((bud)) (bud))

And using the second grammar:

* <shallot-bulb> -->
  bud

* <shallot-bulb> -->
  (<shallot-bulb> <shallot-bulb>) -->
  ((<shallot-bulb>) <shallot-bulb>) -->
  (((<shallot-bulb>)) <shallot-bulb>) -->
  (((bud)) <shallot-bulb>) -->
  (((bud)) (<shallot-bulb>)) -->
  (((bud)) (bud))

The grammar allowing arbitrarily many bulbs inside a layer is unlikely to be sound, as shallots only contain a bounded number of bulbs in the real world. The grammar allowing up to 3 bulbs inside a layer is unlikely to be complete, as shallots containing, e.g., 4 bulbs might well exist in the real world, to say nothing about axillary buds and offsets. Modeling, i.e., representing real-world information as data, is a non-trivial affair.
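The first shallot grammar can be sketched in OCaml too. As an approximation of our own, the non-empty sequence <shallot-bulbs> is rendered as an OCaml list, even though lists may be empty while the grammar requires at least one bulb inside each layer:

```ocaml
(* Shallot bulbs: a layer contains a sequence of bulbs, rendered here
   as an OCaml list (which, unlike the grammar, also allows emptiness). *)
type shallot_bulb = Bud | Layer of shallot_bulb list

(* The representation of (((bud)) (bud)) derived above. *)
let example = Layer [Layer [Layer [Bud]]; Layer [Bud]]

(* Counting the buds requires recursing over every bulb inside a layer. *)
let rec count_buds sb =
  match sb with
  | Bud -> 1
  | Layer sbs -> List.fold_left (fun acc sb' -> acc + count_buds sb') 0 sbs
```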

Version

Added the part about isomorphisms in the section about onions [27 Jan 2020]

Added the pointers to axillary buds and offsets, thanks to Youngbean Kim’s active attention [27 Jan 2020]

Added the derivations in the sections about onions and shallots [27 Jan 2020]

Added the sections about onions and shallots [26 Jan 2020]

Reshaped this note according to the presentation at the lecture [24 Jan 2020]

Changed the arrow for grammatical derivation from -> to --> [11 Feb 2019]

Created [21 Jan 2019]