Syntax of regular expressions

Goal

This lecture note illustrates how to specify the syntax of regular expressions using a BNF and how to represent a regular expression top down using an abstract-syntax tree and, equivalently, bottom up using a proof tree.

A BNF for regular expressions

Here is a BNF for fully parenthesized regular expressions:

<regexp> ::= (empty) | (atom <atom>) | (any) | (seq <regexp> <regexp>) | (disj <regexp> <regexp>) | (star <regexp>) | (plus <regexp>)
<atom> ::= ...any integer...
Harald: Baby steps?
Alfrothul: Baby steps.

NB: The meaning of regular expressions (i.e., their semantics) is irrelevant here. All we care about is how to write them (i.e., their syntax).

Let us look at a variety of regular expressions:

  • The abstract-syntax tree of the regular expression (atom 10) reads as follows:

    <regexp>
       |
       |
     (atom <atom>)
              |
              10
    

    And indeed (atom 10) can be derived as follows:

    <regexp> -->
    (atom <atom>) -->
    (atom 10)
    
  • The abstract-syntax tree of the regular expression (any) reads as follows:

    <regexp>
       |
       |
     (any)
    

    And indeed (any) can be derived as follows:

    <regexp> -->
    (any)
    
  • The abstract-syntax tree of the regular expression (seq (any) (any)) reads as follows:

    <regexp>
       |
       |
     (seq <regexp> <regexp>)
             |        |
           (any)    (any)
    

    And indeed (seq (any) (any)) can be derived as follows, from left to right:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq (any) <regexp>) -->
    (seq (any) (any))
    

    It can also be derived from right to left as follows:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq <regexp> (any)) -->
    (seq (any) (any))
    

    The resulting abstract-syntax tree is unique because this grammar of regular expressions is unambiguous.

  • The abstract-syntax tree of the regular expression (seq (atom 10) (any)) reads as follows:

    <regexp>
       |
       |
     (seq <regexp> <regexp>)
            /         |
           /          |
     (atom <atom>)  (any)
             |
             10
    

    And indeed (seq (atom 10) (any)) can be derived from left to right as follows:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq (atom <atom>) <regexp>) -->
    (seq (atom 10) <regexp>) -->
    (seq (atom 10) (any))
    

    It can also be derived from right to left as follows:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq <regexp> (any)) -->
    (seq (atom <atom>) (any)) -->
    (seq (atom 10) (any))
    
  • The abstract-syntax tree of the regular expression (seq (any) (atom 20)) reads as follows:

    <regexp>
       |
       |
     (seq <regexp> <regexp>)
             |         \
             |          \
           (any)    (atom <atom>)
                            |
                            20
    

    And indeed (seq (any) (atom 20)) can be derived from left to right as follows:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq (any) <regexp>) -->
    (seq (any) (atom <atom>)) -->
    (seq (any) (atom 20))
    

    It can also be derived from right to left.

  • The abstract-syntax tree of the regular expression (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4))) reads as follows:

                      <regexp>
                         |
                         |
                         |
                      (seq <regexp> <regexp>)
                              /          \
                             /            \
                            /              \
     (seq <regexp> <regexp>)                (seq <regexp> <regexp>)
            /          \                           /          \
           /            \                         /            \
          /              \                       /              \
    (atom <atom>)   (atom <atom>)          (atom <atom>)    (atom <atom>)
            |               |                      |                |
            1               2                      3                4
    

    And indeed (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4))) can be derived from left to right as follows:

    <regexp> -->
    (seq <regexp> <regexp>) -->
    (seq (seq <regexp> <regexp>) <regexp>) -->
    (seq (seq (atom <atom>) <regexp>) <regexp>) -->
    (seq (seq (atom 1) <regexp>) <regexp>) -->
    (seq (seq (atom 1) (atom <atom>)) <regexp>) -->
    (seq (seq (atom 1) (atom 2)) <regexp>) -->
    (seq (seq (atom 1) (atom 2)) (seq <regexp> <regexp>)) -->
    (seq (seq (atom 1) (atom 2)) (seq (atom <atom>) <regexp>)) -->
    (seq (seq (atom 1) (atom 2)) (seq (atom 3) <regexp>)) -->
    (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom <atom>))) -->
    (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4)))
    

    It can also be derived from right to left.

  • The abstract-syntax tree of the regular expression (disj (atom 10) (atom 20)) reads as follows:

    <regexp>
       |
       |
    (disj <regexp> <regexp>)
            /          \
           /            \
     (atom <atom>)  (atom <atom>)
             |              |
             10             20
    

    And indeed (disj (atom 10) (atom 20)) can be derived from left to right as follows:

    <regexp> -->
    (disj <regexp> <regexp>) -->
    (disj (atom <atom>) <regexp>) -->
    (disj (atom 10) <regexp>) -->
    (disj (atom 10) (atom <atom>)) -->
    (disj (atom 10) (atom 20))
    

    It can also be derived from right to left.

  • The abstract-syntax tree of the regular expression (star (atom 10)) reads as follows:

    <regexp>
       |
       |
    (star <regexp>)
             |
             |
          (atom <atom>)
                  |
                  10
    

    And indeed (star (atom 10)) can be derived from left to right as follows:

    <regexp> -->
    (star <regexp>) -->
    (star (atom <atom>)) -->
    (star (atom 10))
    
  • The abstract-syntax tree of the regular expression (plus (atom 10)) reads as follows:

    <regexp>
       |
       |
    (plus <regexp>)
             |
             |
           (atom <atom>)
                   |
                   10
    

    And indeed (plus (atom 10)) can be derived from left to right as follows:

    <regexp> -->
    (plus <regexp>) -->
    (plus (atom <atom>)) -->
    (plus (atom 10))
    
  • The abstract-syntax tree of the regular expression (disj (empty) (atom 10)) reads as follows:

    <regexp>
       |
       |
    (disj <regexp> <regexp>)
             |        |
             |        |
          (empty)  (atom <atom>)
                           |
                           10
    

    And indeed (disj (empty) (atom 10)) can be derived from left to right as follows:

    <regexp> -->
    (disj <regexp> <regexp>) -->
    (disj (empty) <regexp>) -->
    (disj (empty) (atom <atom>)) -->
    (disj (empty) (atom 10))
    

    It can also be derived from right to left.

Exercise 6

If possible,

  1. derive, and
  2. draw abstract-syntax trees

for each the following regular expressions:

  1. empty
  2. any
  3. (seq empty empty)
  4. (seq (atom 1) (seq (atom 2) (seq (atom 3) (seq (atom 4) (empty)))))
  5. (seq (atom 1) (seq (atom 2) (seq (atom 3) (atom 4))))
  6. (seq (seq (empty) (seq (atom 1) (atom 2))) (seq (empty) (seq (atom 3) (atom 4))))
  7. (seq (seq (seq (atom 1) (atom 2)) (atom 3)) (atom 4))
  8. (seq (seq (seq (seq (empty) (atom 1)) (atom 2)) (atom 3)) (atom 4))

Exercise 7

If possible, write the regular expression corresponding to each of the following abstract-syntax trees:

a.
   <regexp>
     |
   (seq <regexp> <regexp>)
          /        \
    (atom <atom>) (seq <regexp> <regexp>)
           |             /        \
           1       (atom <atom>) (seq <regexp> <regexp>)
                          |             /        \
                          2     (atom <atom>) (seq <regexp> <regexp>)
                                       |             /        \
                                       3       (atom <atom>)  (empty)
                                                      |
                                                      4


b.
   <regexp>
     |
   (seq <regexp> <regexp>)
          /        \
    (atom <atom>) (seq <regexp> <regexp>)
           |             /        \
           1      (atom <atom>) (seq <regexp> <regexp>)
                         |             /        \
                         2       (atom <atom>) (atom <atom>)
                                        |           |
                                        3           4


c.
                 <regexp>
                   |
                 (seq <regexp> <regexp>)
                        /           \
                       /             \
   (seq <regexp> <regexp>)          (seq <regexp> <regexp>)
          /        \                         |         \
      (empty)    (seq <regexp> <regexp>)    empty    (seq <regexp> <regexp>)
                       /         \                          /        \
             (atom <atom>)     (atom <atom>)      (atom <atom>)    (atom <atom>)
                    |                 |                  |                |
                    1                 2                  3                4


d.
                           <regexp>
                             |
                           (seq <regexp> <regexp>)
                                  /        \
               (seq <regexp> <regexp>)    (atom <atom>)
                     /         \                 |
   (seq <regexp> <regexp>)    (atom <atom>)      4
          /        \                 |
   (atom <atom>)  (atom <atom>)      3
          |              |
          1              2


e.
                                        <regexp>
                                          |
                                        (seq <regexp> <regexp>)
                                               /           \
                           (seq <regexp> <regexp>)       (atom <atom>)
                                  /           \                 |
               (seq <regexp> <regexp>)      (atom <atom>)       4
                      /           \                |
   (seq <regexp> <regexp>)   (atom <atom>)         3
          /        \                |
        any   (atom <atom>)         2
                     |
                     1


f.
                  <regexp>
                    |
                  (seq <regexp> <regexp>)
                         /           \
     (seq <regexp> <regexp>)       (seq <regexp> <regexp>)
            /        \                    /        \
   (atom <atom>)  (atom <atom>)  (atom <atom>)  (atom <atom>)
          |              |              |              |
          1              2              3              4

A BNF for regular expressions, logically

Here is an alternative rendering for the BNF for fully parenthesized regular expressions: using a relation and proof rules.

We say that e is a valid regular expression whenever the judgment RE e holds. This judgment is defined with the following proof rules:

EMPTY 
RE (empty)
ATOMn is an integer
RE (atom n)
ANY 
RE (any)
SEQRE e1RE e2
RE (seq e1 e2)
DISJRE e1RE e2
RE (disj e1 e2)
STARRE e
RE (star e)
PLUSRE e
RE (plus e)

In words:

  • The first rule, EMPTY, is unconditional, i.e., it is an axiom. This axiom says that (empty) is a valid regular expression.

  • The second rule, ATOM, is conditional. This conditional rule says that if n is an integer, then (atom n) is a valid regular expression.

    (Reminder: the parts above the horizontal bar are called the premises (or again assumptions (or again hypotheses)), and the part below the horizontal bar is called the conclusion. Here, there is only one premise.)

  • The third rule, ANY, is an axiom (i.e., an unconditional rule).

  • The fourth rule, SEQ, is conditional. It says that if e1 and e2 are valid regular expressions, then (seq e1 e2) is also a valid regular expression.

  • etc.

To verify the validity of a regular expression, we attempt to construct a proof tree, i.e., a tree that uses proof rules as constructors:

  • Example 1: (any) is a valid regular expression because we can construct the following proof tree, using the axiom ANY:

    ANY --------
        RE (any)
    
  • Example 2: (star (any)) is a valid regular expression because we can construct the following proof tree:

     ANY --------
         RE (any)
    STAR ---------------
         RE (star (any))
    

    Here is how to construct this proof tree: start from the goal (word for the wise: at the bottom of a piece of paper), and then look for rules whose conclusion match the goal (see below for an example). If there are no such rules, the proof tree cannot be constructed. If there are several such rules, try them one after the other until you can construct a complete proof tree. For a rule whose conclusion matches the goal, instantiate its premises (see below for an example). The result is a collection of subgoals, one for each premise. Proceed with each subgoal as you did with the goal.

    • So let’s start from the goal:

      RE (star (any))
      
    • Does there exist rules whose conclusion matches RE (star (any))? There is one (and only one), namely STAR, that matches RE (star (any)) for any variable e that denotes (any). The result of matching is therefore a conditional yes, with the following constraint:

      e = (any)
      
    • We then instantiate the premise of Rule STAR, replacing the occurrence of e by (any), and we write the result on top of the horizontal bar:

           RE (any)
      STAR ---------------
           RE (star (any))
      
    • We then look for any rule whose conclusion matches RE (any). There is one (and only one), namely Rule ANY, that matches RE (any). The result of matching is therefore an unconditional yes.

    • We then instantiate the premise of Rule ANY (there isn’t any, since it is an axiom):

       ANY --------
           RE (any)
      STAR ---------------
           RE (star (any))
      

    The proof tree is complete, and therefore yes, (star (any)) is a valid regular expression.

  • Example 3: Is (seq (atom 10) (any)) a valid regular expression?

    • Let’s start from the goal:

      RE (seq (atom 10) (any))
      
    • Does there exist rules whose conclusion matches RE (seq (atom 10) (any))? There is one (and only one), namely Rule SEQ, that matches RE (seq (atom 10) (any)) for any variable e1 that denotes (atom 10) and for any variable e2 that denotes (any). The result of matching is therefore a conditional yes, with the following constraint:

      e1 = (atom 10)
      e2 = (any)
      
    • Let us then instantiate the premise of Rule SEQ, replacing the occurrence of e1 by (atom 10) and the occurrence of e2 by (any). We write the result on top of the horizontal bar:

          RE (atom 10)    RE (any)
      SEQ ------------------------
          RE (seq (atom 10) (any))
      

      We now have two new subgoals. Let us treat them in turn.

    • Does there exist rules whose conclusion matches RE (atom 10)? There is one (and only one), namely Rule ATOM, that matches RE (atom 10) for any variable n that denotes 10. The result of matching is therefore a conditional yes, with the following constraint:

      n = 10
      
    • Let us instantiate the premise of Rule ATOM, replacing the occurrence of n by 10. We write the result on top of the horizontal bar:

           10 is an integer
      ATOM ----------------
           RE (atom 10)        RE (any)
       SEQ ----------------------------
           RE (seq (atom 10) (any))
      

      We have one new subgoal, which is trivially satisfied since 10 is an integer. So let us turn to the other subgoal.

    • Does there exist rules whose conclusion matches RE (any)? There is one (and only one), namely Rule ANY. The result of matching is therefore an unconditional yes.

    • We then instantiate the premise of Rule ANY (there isn’t any, since it is an axiom):

           10 is an integer
      ATOM ----------------    ANY --------
           RE (atom 10)            RE (any)
       SEQ --------------------------------
           RE (seq (atom 10) (any))
      

    The proof tree is complete, and therefore yes, (seq (atom 10) (any)) is a valid regular expression.

  • Likewise, the reader can check that (seq (any) (atom 20)) is a valid regular expression since the following proof tree can be constructed:

                         20 is an integer
    ANY --------    ATOM ----------------
        RE (any)         RE (atom 20)
    SEQ -----------------------------
        RE (seq (any) (atom 20))
    
  • Question: is (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4))) a valid regular expression?

    Hint: to answer this question, attempt to draw a proof tree.

This principled method of constructing proof trees yields trees that are upside down compared to the abstract-syntax trees drawn above. Otherwise, given any valid regular expression, its abstract-syntax tree and its proof tree have the same shape, i.e., they are isomorphic.

Exercise 8

Draw proof trees for the same regular expressions as in Exercise 6 and verify that they have the same shape as the abstract-syntax trees you drew in your solution of Exercise 6.

Exercise 9

By the way, roots of proof trees are at their bottom.

  1. Here is an incomplete proof tree:

         42 is an integer
    ATOM ----------------    EMPTY ---
         ...                       ...
     SEQ -----------------------------
         ...
    

    What should be at the root of this proof tree for it to be well formed?

  2. Here is an incomplete proof tree:

               ANY ---    ANY ---
                   ...        ...
    ANY ---    SEQ --------------
        ...        ...
    SEQ -------------------------
        ...
    

    What should be at the root of this proof tree for it to be well formed?

  3. Here is an incomplete proof tree:

    EMPTY ---
          ...
      ANY ---
          ...
      ANY ---
          ...
    

    What should be at the root of this proof tree for it to be well formed?

Syntax errors

If something can go wrong,
at some point it will go wrong.

Murphy’s law

In principle, some terms, e.g., (plus empty), (seq (any)), and (disj (empty) (any) (atom 100)), cannot be derived from <regexp>, and correspondingly no abstract-syntax tree and no proof tree can be constructed for them. These terms are syntactically incorrect with respect to this grammar of regular expressions.

In practice, it is not good enough to report that a term is syntactically incorrect: users expect more information. Manufacturing and issuing useful error messages is an endeavor in itself.

Version

Fixed a typo in the statement of Exercise 7, thanks to Yunjeong Lee’s eagle eye [30 Mar 2019]

Changed the arrow for grammatical derivation from -> to --> [11 Feb 2019]

Created [21 Jan 2019]