Compiled regular expressions

Week 02 featured the following tagged and fully parenthesized syntactic representation of regular expressions:

<regexp> ::= (empty) | (atom <integer>) | (any) | (seq <regexp> <regexp>) | (disj <regexp> <regexp>) | (star <regexp>) | (plus <regexp>)

Week 11 featured an interpreter for this representation of regular expressions: a Scheme procedure that, given a regular expression and a proper list of integers, tests whether this list belongs to the set represented by this regular expression.

Week 12, in this chapter, features compiled regular expressions: Scheme procedures that, given a proper list of integers, test whether this list belongs to the regular expression that was compiled.

So for example, given a regular expression re, the challenge is to implement a Scheme procedure c_re such that, for any proper list of integers ns applying the interpreter to re and ns gives the same result as applying c_re to ns. Here is the relation between re and c_re:

for all ns, (match re ns) == (c_re ns)

where match denotes the interpreter for regular expressions features in Week 11. In words, matching re and ns and applying c_re to ns should give the same result.

Interlude

Pablito: Well, given re, c_re is not difficult to implement, is it?

Bong-sun: Pray elaborate, Pablito.

Pablito: It can just call match, look:

(define c_re
  (lambda (ns)
    (match re ns)))

Bong-sun: This implementation is definitely correct, but then c_re is not a compiled version of re, is it?

Pablito: Right. Well, it can be useful in the unit tests.

Bong-sun: That is can.

Exercise 03

Given the following regular expressions, implement their compiled counterpart:

  1. (mt)
  2. (atom 10)
  3. (seq (atom 10) (atom 20))
  4. (seq (atom 10) (seq (atom 20) (atom 30)))
  5. (seq (atom 10) (seq (atom 20) (seq (atom 30) (atom 40))))
  6. (disj (atom 10) (atom 11))
  7. (seq (atom 10) (seq (disj (atom 20) (atom 21)) (seq (atom 30) (atom 40))))
  8. (seq (atom 10) (seq (disj (atom 20) (seq (atom 21) (atom 22))) (seq (atom 30) (atom 40))))

Here is a template for your solutions:

(define re1
  '(mt))

(define test_c_re1
  (lambda (candidate)
    (and (candidate '())
         (not (candidate '(10)))
         (not (candidate '(10 20)))
         (not (candidate '(10 20 30))))))

(define c_re1
  (lambda (ns)
    (null? ns)))

(unless (test_c_re1 c_re1)
  (printf "c_re1 failed its unit test"))

;;;;;

(define re2
  '(atom 10))

(define test_c_re2
  (lambda (candidate)
    (and (candidate '(10))
         (not (candidate '()))
         (not (candidate '(10 20)))
         (not (candidate '(10 20 30))))))

(define c_re2
  (lambda (ns)
    ...))

(unless (test_c_re2 c_re2)
  (printf "c_re2 failed its unit test"))

;;;;;

(define re3
  '(seq (atom 10) (atom 20)))

;;;;;

(define re4
  '(seq (atom 10) (seq (atom 20) (atom 30))))

Food for thought:

  1. A regular expression re represents a set of lists {n1s, n2s, n3s, ...}. So a compiled version of re could legitimately be (lambda (ns) (or (equal? ns n1s) (equal? ns n2s) (equal? ns n3s) ...)). The size of such a compiled version would be proportional to the size of the set of lists represented by re, which is impractical. Instead, we want a compiled version whose size is proportional to the size of re (and whose structure reflects the structure of the regular expression).
  2. In Week 11, we observed that (mt) is neutral on the left and on the right of seq in that (seq (mt) re) and (seq re (mt)) match the same lists as re. Question: What are the compiled counterparts of (seq (mt) (atom 10)) and of (seq (mt) (atom 10)), and how do they compare to c_re2?
  3. In Week 11, we observed that seq is associative in that for all regular expressions x, y, and z, (seq x (seq y z)) and (seq (seq x y) z) match the same lists. Question: What is the compiled counterpart of (seq (seq (atom 10) (atom 20)) (atom 30)) and how do they compare to c_re4?

Exercise 04

The goal of this exercise is to compile regular expressions that involve star.

So given the following regular expressions, implement their compiled counterpart:

  1. (star (mt))
  2. (star (atom 20))
  3. (seq (atom 10) (seq (star (atom 20)) (seq (atom 30) (seq (atom 40) (atom 50)))))
  4. (seq (atom 10) (seq (star (disj (atom 20) (atom 21))) (atom 30)))
  5. (seq (atom 10) (seq (star (disj (atom 20) (seq (atom 21) (atom 22)))) (atom 30)))
  6. (seq (star (atom 10)) (star (atom 20)))
  7. (star (disj (atom 10) (atom 20)))

Hint: Use letrec.

Bonus if you can compile a regular expression that contains two nested occurrences of star.

Exercise 05

The goal of this exercise is to compile regular expressions that involve star.

So given the following regular expressions, implement their compiled counterpart:

  1. (plus (mt))
  2. (plus (atom 20))
  3. (seq (atom 10) (seq (plus (atom 20)) (seq (atom 30) (seq (atom 40) (atom 50)))))
  4. (seq (atom 10) (seq (plus (disj (atom 20) (atom 21))) (atom 30)))
  5. (seq (atom 10) (seq (plus (disj (atom 20) (seq (atom 21) (atom 22)))) (atom 30)))
  6. (seq (plus (atom 10)) (plus (atom 20)))
  7. (plus (disj (atom 10) (atom 20)))

Hint: Use letrec.

Bonus if you can compile a regular expression that contains two nested occurrences of star.

Resources

Version

Created [08 Nov 2025]

Table Of Contents

Previous topic

Evaluation order

Next topic

Exercises for Week 12