Towards a syntax checker for an extended subset of Scheme

This chapter sets the stage for the midterm project.

Canonical BNF of the extended subset of Scheme

The following BNF recapitulates what we have seen of Scheme so far, where braces {...} are a meta-notation to group what is between “{” and “}”: when followed by a “+”, it means “once or more”, and when followed by a “*”, it means “zero times or more”.

<program> ::= ({<toplevel-form>}*)
<toplevel-form> ::= <definition> | <expression>
<definition> ::= (define <variable> <expression>)
<expression> ::= <number> | <boolean> | <character> | <string> | <variable> | <time-expression> | <if-expression> | <and-expression> | <or-expression> | <cond-expression> | <case-expression> | <let-expression> | <letstar-expression> | <letrec-expression> | <begin-expression> | <unless-expression> | <quote-expression> | <lambda-abstraction> | <application>
<time-expression> ::= (time <expression>)
<if-expression> ::= (if <expression> <expression> <expression>)
<and-expression> ::= (and {<expression>}*)
<or-expression> ::= (or {<expression>}*)
<cond-expression> ::= (cond {<cond-clause>}* [else <expression>])
<cond-clause> ::= [<expression>] | [<expression> <expression>] | [<expression> => <expression>]
<case-expression> ::= (case <expression> {[({<quotation>}*) <expression>]}* [else <expression>])
<let-expression> ::= (let ({[<variable> <expression>]}*) <expression>) ;;; where all the variables are distinct
<letstar-expression> ::= (let* ({[<variable> <expression>]}*) <expression>)
<letrec-expression> ::= (letrec ({[<variable> <lambda-abstraction>]}*) <expression>) ;;; where all the variables are distinct
<begin-expression> ::= (begin {<expression>}+)
<unless-expression> ::= (unless <expression> <expression>)
<quote-expression> ::= (quote <quotation>)
<quotation> ::= <number> | <boolean> | <character> | <string> | <symbol> | () | (<quotation> . <quotation>)
<lambda-abstraction> ::= (lambda <lambda-formals> <expression>) | (trace-lambda <variable> <lambda-formals> <expression>)
<lambda-formals> ::= <variable> | ({<variable>}*) ;;; where all the variables are distinct | ({<variable>}+ . <variable>) ;;; where all the variables are distinct
<application> ::= ({<expression>}+)

where:

  • a number is a value that answers #t to the predicate number?;
  • a Boolean is a value that answers #t to the predicate boolean?;
  • a character is a value that answers #t to the predicate char?;
  • a string is a value that answers #t to the predicate string?;
  • a symbol is a value that answers #t to the predicate symbol?;
  • a variable is a value that answers #t to the predicate symbol? and that is not a keyword (as defined just below); and
  • keywords are the symbols define, time, if, cond, else, case, and, or, let, let* (the “*” in let* is part of the keyword, not a meta-notation), letrec, begin, unless, quote, lambda, and trace-lambda.

The syntax checker

The syntax checker for this extended subset of Scheme is a procedure that decides whether a given value is well formed according to this given BNF. It should return:

  • #t when it is given a syntactically correct Scheme program, i.e., a list of toplevel forms, and
  • #f when it is given a syntactically incorrect Scheme program.

Nice syntax checkers issue a meaningful error message (using printf) before returning #f.

Useful syntax checkers don’t just stop after issuing an error message, but keep on syntax-checking the rest of their input.

Self-applicable syntax checkers conform to the given BNF.

Operationalizing the BNF of Scheme

The input of the syntax checker is a Scheme value, and the goal of the syntax checker is to determine whether this value conforms to the BNF. At the top level, a program is a proper list of toplevel forms, possibly empty:

<program> ::= ({<toplevel-form>}*)

Operationally, we can replace the meta-notation with an explicit non-terminal <toplevel-forms> that captures the structure of a proper list of toplevel forms:

<program> ::= <toplevel-forms>

<toplevel-forms> ::= ()
                   | (<toplevel-form> . <toplevel-forms>)

Now all the syntax checker needs to do is to follow this inductive structure.

Likewise, we can replace the occurrences of {<expression>}* with an explicit non-terminal <expressions> that captures the structure of a proper list of expressions:

<expressions> ::= ()
                | (<expression> . <expressions>)

Now all the syntax checker needs to do is to follow this inductive structure, e.g., for and-expressions and or-expressions:

<and-expression> ::= (and . <expressions>)

<or-expression> ::= (or . <expressions>)

In words: if es denotes a proper list of conformant expressions, then (and . es) is a conformant and-expression and (or . es) is a conformant or-expression.

Likewise for let-expressions and friends:

<let-expression> ::= (let ({[<variable> <expression>]}*) <expression>)
                     ;;; where all the variables are distinct

We can replace the occurrence of {[<variable> <expression>]}* with an explicit non-terminal <let-binding> that captures the structure of a proper list of bindings:

<let-expression> ::= (let <let-bindings> <expression>)
                     ;;; where all the variables declared in the let-bindings are distinct

<let-bindings> ::= ()
                 | (<let-binding> . <let-bindings>)

<let-binding> ::= [<variable> <expression>]

Now all the syntax checker needs to do is to follow this inductive structure.

For cond-expressions, the occurrence of {<cond-clause>}* does not end a list, it is followed by an else clause:

<cond-expression> ::= (cond
                        {<cond-clause>}*
                        [else <expression>])

We can replace both with an explicit non-terminal <cond-clauses> that captures the structure of a proper list of cond-clauses:

<cond-expression> ::= (cond . <cond-clauses>)

<cond-clauses> ::= (<else-clause>)
                 | (<cond-clause> . <cond-clauses>)

<else-clause> ::= [else <expression>]

<cond-clause> ::= [<expression>]
                | [<expression> <expression>]
                | [<expression> => <expression>]

Now all the syntax checker needs to do is to follow this inductive structure, and likewise for case-expressions.

A similar story can be told for the formal parameters of lambda-abstractions:

<lambda-abstraction> ::= (lambda <lambda-formals> <expression>)
                       | (trace-lambda <variable> <lambda-formals> <expression>)

<lambda-formals> ::= <variable>
                   | ({<variable>}*)
                     ;;; where all the variables are distinct
                   | ({<variable>}+ . <variable>)
                     ;;; where all the variables are distinct

In words, formal parameters are valid if they are either a variable, a proper list of (distinct) variables, or an improper list of (distinct) variables ending with a variable. Said differently but equivalently, formal parameters are valid if they are a list of distinct variables that ends with the empty list or with a variable:

<lambda-formals> ::= ()
                   | <variable>
                   | (<variable> . <lambda-formals>)

Now all the syntax checker needs to do is to follow this inductive structure.

Finally, since the meta-notation {e}+ is equivalent to e {e}*, begin-expressions and applications can be restated to use the non-terminal <expressions>:

<begin-expression> ::= (begin <expression> . <expressions>)

<application> ::= (<expression> . <expressions>)

And now all the syntax checker needs to do is to follow this inductive structure.

For ground values, the syntax checker can simply use type predicates (number?, boolean?, etc.). Otherwise the value must be a pair to have a chance to be a definition or an expression – if it isn’t one, it does not conform. If it is a pair, then the situation is similar to that of The syntax checker for arithmetic expressions and A syntax checker for Boolean expressions encountered earlier on: Follow the inductive structure!

Version

Created [14 Oct 2025]