Semantics of regular expressions

Goal

This note outlines the meaning of regular expressions.

Resources

Reminder: abstract syntax of regular expressions

Here is again the BNF for fully parenthesized regular expressions:

<regexp> ::= (empty) | (atom <atom>) | (any) | (seq <regexp> <regexp>) | (disj <regexp> <regexp>) | (star <regexp>) | (plus <regexp>)
<atom> ::= ...any integer...

Semantics of regular expressions

A regular expression represents a set of lists of integers. Matching a regular expression against a given list amounts to checking whether this list belongs to this set:

  • the constructor empty declares the empty regular expression, which matches the empty list of integers;

    for example, matching (empty) and () succeeds, whereas matching (empty) and, e.g., (1 2 3) fails;

    indeed, the regular expression (empty) represents the set {()};

  • the constructor atom declares a specific integer, and this atomic regular expression matches a singleton list that contains this integer;

    for example, matching (atom 10) and (10) succeeds, whereas matching (atom 10) and () fails, and so does matching (atom 10) and (10 20);

    indeed, the regular expression (atom 10) represents the set {(10)};

  • the constructor any specifies any integer, and this regular expression matches any singleton list containing an integer;

    for example, matching (any) and (10) succeeds, and so does matching (any) and (20), but matching (any) and () fails, and so does matching (any) and (10 20);

    indeed, the regular expression (any) represents the set {(n) | n is an integer}, i.e., the set of singleton lists of integers;

  • the constructor seq declares a sequence of two regular expressions, and matching this sequence matches a list

    • that starts with a prefix that is matched by the first regular expression, and
    • that ends with a suffix that is matched by the second regular expression;

    for example, matching (seq (atom 10) (atom 20)) and (10 20) succeeds because matching (atom 10) and (10) succeeds and matching (atom 20) and (20) also succeeds;

  • the constructor disj declares the disjunction (sometime called an alternation) of two regular expressions, and this disjunction matches any list that is matched by either regular expression;

    for example, matching (disj (atom 10) (atom 20)) and (10) succeeds, and so does matching (disj (atom 10) (atom 20)) and (20);

  • the constructor star declares the repetition (zero or more times) of a regular expression, and matches a list that contains a (possibly empty) sequence of sublists, each of which is matched by the regular expression;

    for example, matching (star (atom 10)) and () succeeds, and so do matching (star (atom 10)) and (10), matching (star (atom 10)) and (10 10), matching (star (atom 10)) and (10 10 10), etc.

  • the constructor plus declares the repetition (one or more times) of a regular expression, and matches a list that contains a (non-empty) sequence of sublists, each of which is matched by the regular expression;

    for example, matching (plus (atom 10)) and () fails, but matching (plus (atom 10)) and (10) succeeds, and so do matching (plus (atom 10)) and (10 10), matching (plus (atom 10)) and (10 10 10), etc.

You can play online with this regular-expression matcher here (ignore the var constructor).

Examples:

  • Matching the regular expression (atom 10) against the list (10) succeeds, and matching it against any other list (i.e., shorter, longer, or different) fails.
  • Matching the regular expression (any) against any one-element list succeeds, and matching it against any other list (i.e., shorter or longer) fails.
  • Matching the regular expression (seq (any) (any)) against any two-elements list succeeds, and matching it against any other list fails.
  • Matching the regular expression (seq (atom 10) (any)) against any two-elements list whose first element is 10 succeeds, and matching it against any other list fails.
  • Matching the regular expression (seq (any) (atom 20)) against any two-elements list whose second element is 20 succeeds, and matching it against any other list fails.
  • Matching the regular expression (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4))) against the list (1 2 3 4) succeeds, and matching it against any other list fails.
  • Matching the regular expression (disj (atom 10) (atom 20)) against the one-element list whose element is 10 succeeds, matching it against the one-element list whose element is 20 succeeds, and matching it against any other list fails.
  • Matching the regular expression (star (atom 10)) against any list whose only elements are 10, including the empty list, succeeds, and matching it against any other list fails.
  • Matching the regular expression (plus (atom 10)) against any list whose only elements are 10, excluding the empty list, succeeds, and matching it against any other list fails.
  • Matching the regular expression (disj (empty) (atom 10)) against the empty list succeeds, matching it against the one-element list containing 10 succeeds, and matching it against any other list fails.

Food for thought:

  • does matching (star (disj (atom 0) (atom 1))) and () succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (0) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (1) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (0 0) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (0 1) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (1 0) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (1 1) succeed?
  • does matching (star (disj (atom 0) (atom 1))) and (0 0 0 1 1 1) succeed?

More examples:

  • Matching the regular expression (seq (atom 10) (disj (atom 100) (atom 200))) succeeds against either of the lists (10 100) and (10 200), and fails against any other list.
  • Matching the regular expression (seq (disj (atom 10) (atom 20)) (atom 100)) succeeds against either of the lists (10 100) and (20 100), and fails against any other list.
  • Matching the regular expression (seq (disj (atom 10) (atom 20)) (disj (atom 100) (atom 200))) succeeds against either of the lists (10 100), (10 200), (20 100) and (20 200), and fails against any other list.
  • Matching the regular expression (seq (plus (atom 10)) (seq (star (any)) (plus (atom 20)))) succeeds against any list starting with one or more occurrences of 10 and ending with one or more occurrences of 20, and fails against any other list.

Exercise 10

Which lists does (seq (disj (atom 1) (empty)) (seq (disj (atom 2) (empty)) (disj (atom 3) (empty)))) match?

Exercise 11

In how many ways does (seq (star (atom 10)) (star (atom 10))) match the following lists?

  1. ()
  2. (10)
  3. (10 10)
  4. (10 10 10)
  5. (10 10 10 10)

Why?

Exercise 12

Which lists do the following regular expressions match?

  1. (seq (empty) (atom 33))
  2. (atom 33)
  3. (seq (atom 33) (empty))
  4. (seq (empty) (seq (empty) (atom 33)))
  5. (seq (empty) (seq (atom 33) (empty)))
  6. (seq (seq (empty) (atom 33)) (empty))
  7. (seq (seq (atom 33) (empty)) (empty))

What does that say about the relation between empty and seq?

Hint: to solve this exercise, use the traditional scientific method, namely

  1. answer the question about which lists do the regular expressions match, using the online matcher, and observe the answers;
  2. formulate a hypothesis about the relation; and
  3. put your hypothesis to the test with new examples, again using the online matcher.

Exercise 13

Which lists do the following regular expressions match?

  1. (seq (atom 1) (seq (atom 2) (seq (atom 3) (atom 4))))
  2. (seq (seq (atom 1) (atom 2)) (seq (atom 3) (atom 4)))
  3. (seq (seq (seq (atom 1) (atom 2)) (atom 3)) (atom 4))
  4. (seq (atom 1) (seq (seq (atom 2) (atom 3)) (atom 4)))
  5. (seq (seq (seq (atom 1) (atom 2)) (atom 3)) (atom 4))

What does that say about seq?

Hint: to solve this exercise, use the traditional scientific method, as in Exercise 12.

Exercise 14

Which lists do the following regular expressions match?

  1. (seq (disj (atom 10) (atom 20)) (seq (atom 100) (atom 200)))
  2. (disj (seq (atom 10) (seq (atom 100) (atom 200))) (seq (atom 20) (seq (atom 100) (atom 200))))
  3. (seq (seq (atom 100) (atom 200)) (disj (atom 1000) (atom 2000)))
  4. (disj (seq (seq (atom 100) (atom 200)) (atom 1000)) (seq (seq (atom 100) (atom 200)) (atom 2000)))

What does that say about disj and seq?

Hint: to solve this exercise, use the traditional scientific method, as in Exercise 12 and Exercise 13.

Resources

Version

Fixed a typo in the narrative, thanks to Yunjeong Lee’s eagle eye [30 Mar 2019]

Created [21 Jan 2019]