## PLI Lecture 3

Context-free grammars and languages, derivations, parse trees, ambiguity, first and follow sets, LL(1) grammars, recursive descent parsing, examples.

```
Context-free grammars

Originally proposed for describing natural language grammars
by Noam Chomsky.

First grammar for palindromes:
S -> e | a | b | a S a | b S b
Here, S is a nonterminal symbol, a and b are terminal symbols,
e denotes the empty string.
Derivation: S -> a S a => a b S b a => a b b a

First grammar for arithmetic expressions:
exp -> exp op exp | ( exp ) | number
op  -> + | - | * | /
Here, exp and op are nonterminal symbols, "(", ")", number,
"+", "-", "*" and "/" are terminal symbols.
Derivation:
exp ->       exp     op exp
    => (     exp    ) *  4
    => ( exp op exp ) *  4
    => ( 2   +  3   ) *  4

First grammar for statements:
stmt -> identifier = exp ;
     |  identifier (arg-part) ;
     |  { stmt-list }
     |  if (exp) stmt
     |  if (exp) stmt else stmt
     |  while (exp) stmt
exp  -> ...
arg-part -> e | arg-list
arg-list -> arg | arg , arg-list
arg  -> exp
stmt-list -> stmt | stmt stmt-list

First grammar for Lisp expressions:
lexp -> atom | list
atom -> number | identifier
list -> ( lexp-seq )
lexp-seq -> lexp | lexp-seq lexp
Derivation:
lexp ->  list => ( lexp-seq ) => ( lexp-seq lexp )
     =>* ( lexp-seq lexp list ) =>* ( lexp 3 ( lexp-seq ) )
     =>* ( * 3 ( lexp-seq lexp ) )
     =>  ...
     =>  ( * 3 ( + 4 5 ) )
We say: lexp =>* ( * 3 ( + 4 5 ) )

In practice we need good conventions to distinguish between
nonterminal symbols, terminal symbols (language tokens), and
symbols used in grammar (e.g., "|").

General definition of a context-free grammar (CFG):
A set of nonterminal symbols, including a start symbol S,
a set of terminal symbols, and a set of rules of the form
A -> a1 ... am                                    (m >= 0)
where A is a nonterminal and a1, ..., am are nonterminals
or terminals.
A -> s1 | s2 | ... | sn                           (n >= 1)
is simply an abbreviation for
A -> s1
...
A -> sn
L(G), the language defined by the CFG G, is the set of
sentences t1 t2 ... tu s.t. S =>* t1 ... tu, where t1, ..., tu
are terminal symbols (u >= 0).  I.e., L(G) is the set of
terminal sequences that can be derived from the start symbol
of G.  (Clearly, L(G) may be infinite.)

Leftmost and rightmost derivations:
A derivation is called leftmost (resp., rightmost) if,
at every step, we replace the leftmost (resp., rightmost)
nonterminal symbol by one of its right-hand sides.

Leftmost derivation:
exp -> exp op exp => 2 op exp => 2 + exp => 2 + 3

Rightmost derivation:
exp -> exp op exp => exp op 3 => exp + 3 => 2 + 3

Derivation trees provide an order-independent description of
how a sentence is derived from the start symbol of a grammar.

(Examples of derivation trees for 2+3, 2+3*4, (2+3)*4, etc.)

So a sentence is in the language defined by a grammar if
it is the sequence of leaves, read left to right, of a
derivation tree whose root is the start symbol of the grammar.

Context-free languages

A language is context-free if it is the language defined
by some context-free grammar.

Theorem: Every regular language is context-free.

In general, recognising a context-free language requires an
automaton with unbounded space, e.g., for { a^m b^m | m >= 0 },
balanced parenthesis strings, arithmetic expressions.

But this unbounded space only needs to be organised as a stack.

Indeed, a language is context-free (i.e., is defined by some
context-free grammar) if and only if it is recognised by some
nondeterministic pushdown (stack) automaton (cf. Kleene's theorem).

Note that checking whether a given language is the language
generated by a given context-free grammar is an undecidable
problem.

Ambiguous grammars

Some grammars are better than others.
One bad property a grammar may have is ambiguity.
A grammar G is ambiguous if there exists a sentence in L(G)
with two distinct derivation trees.
E.g., under the first grammar for arithmetic expressions,
2 + 3 * 4 has two different derivation trees, and so does
2 - 3 - 4.
Distinct derivation trees may have distinct "meanings";
in these examples, distinct values: (2 - 3) - 4 = -5, but
2 - (3 - 4) = 3.  So ambiguous grammars are undesirable.

Second grammar for arithmetic expressions:
exp    ->  term  |  exp addop term
addop  ->  +  |  -
term   ->  factor  |  term mulop factor
mulop  ->  *  |  /
factor ->  number  |  ( exp )

This grammar is (much) better than the previous one:
* It is unambiguous (requires proof).
* It captures operator precedence (mulops bind stronger
  than addops).
* It captures operator associativity (operators are left
  associative, i.e., 2 - 3 - 4 is interpreted as (2 - 3) - 4).

Exercise: Modify this grammar to make mulops right associative.

This is an important example.  It is not trivial or mechanical
to transform the first expression grammar to the second one.

"Dangling else" problem:

The grammar for statements above is ambiguous:
there are two distinct derivation trees for the statement
if (exp1) if (exp2) id1 = exp3; else id2 = exp4;

Exercise: Construct these two trees.

Transforming this grammar to an unambiguous grammar in which
each "else part" is matched with the closest unmatched "if part"
is nontrivial.

Exercise: Construct such a grammar.

Note that checking whether a given grammar is ambiguous or not
is an undecidable problem.

Limitations of context-free languages

Not every language is context-free.  The following languages
are not context-free:
* { a^m b^m c^m | m >= 0 }
* The set of nested compound statements (with declarations)
in which every variable is declared before it is used.
* The set of well-typed Java programs.
* The set of always-terminating Java programs.  (This language
is not even computable.  See 3130CIT TOC for details.)

Parsing

Parsing (syntactic analysis) takes a sequence of tokens
(i.e., terminal symbols) as input, and
1. determines whether the given string is a sentence of the
language or not, and
2. (re)constructs a derivation tree (also called a parse tree)
for the sentence in the language.

The resulting parse tree describes the structure and, to some
extent, the meaning of the sentence.

In practice, parsers normally construct abstract syntax trees
or structure trees (which omit semantically meaningless
derivation steps) rather than parse trees.

There are two main types of parsers:
* Top-down parsers build the parse tree top-down (and left-right).
Top-down parsers may be built by hand or by tools.
* Bottom-up parsers build the parse tree bottom-up (and left-right).
Bottom-up parsers (in practice) may only be built by tools.

The sequence of rule applications in the tree construction normally
corresponds to a leftmost derivation in a top-down parser and
to a rightmost derivation, traced in reverse, in a bottom-up parser.

We describe one form of top-down parsers below, and one form
of bottom-up parsers in the next lecture.

Recursive descent parsers

* Construct parse trees from top, i.e., from start symbol
at root, down to the terminal symbols at the leaves.
* Expand nonterminal symbols in the same order as a leftmost
derivation.
* Code mirrors the grammar rules, with one procedure for each
nonterminal symbol.
* But may require grammar transformation.

Example:
* factor -> number | "(" exp ")"
* void factor(void) {
      if (token == number) {
          match(number);
      } else {
          match("(");
          exp();
          match(")");
      }
  }
* void match(TokenType expected) {
      if (token == expected)
          token = getToken();
      else
          error(token, expected);
  }

Note that we are now identifying (programming) language tokens
with the terminal symbols of the context-free language.

Rules for writing a recursive descent parser for a CFG G:
* Use a global variable, token, which stores the first,
as-yet unconsidered, token in the input.  Initially token
stores the first token in the input.
* Write one parsing procedure A() for each nonterminal
symbol A in G.
* Each such procedure A() recognises a prefix of the input
derived from the nonterminal symbol A.
* Each such procedure A() must satisfy the following invariant:
- On entry, token stores the token that starts the string
derived from A.
- On exit, token stores the first token that follows the string
derived from A.
* Observe how the above definitions of factor() and match()
ensure these invariants are satisfied.

Problems:
* We can't always decide which right-hand side to use.
Consider:
exp -> term  |  term addop exp
if the token that starts the string derived from exp is number
(or "("), we can't tell whether to use the first alternative
or the second alternative.
* One right-hand side may derive empty.  Consider:
S -> e  | a S b  (e denotes the empty string)
What does it mean to say token may start the empty string?
(The above description is somewhat incomplete.)
* Left-recursive rules cause nonterminating computations.
Consider:
exp -> term  |  exp addop term
  void exp(void) {
      if (token == ???) { // problem???
          term();
      } else {
          exp();     // uh, oh!!!
          addop();
          term();
      }
  }

LL(1) grammars

Not every grammar has a recursive descent parser.

Consider a CFG with start symbol S and terminals T.
Extend the grammar with a new start symbol S' and an additional
rule S' -> S $, where $ is a special "end of input" symbol.
Then we can define the sets:
  first(s) =
      { t in T | s =>* t s' }, for all sentential forms s
  follow(A) =
      { t in T or t = $ | S' =>* s' A t s'' }, for all nonterminals A
In particular, first(S') = first(S) and $ is in follow(S).

I.e., for a sequence s of terminal and nonterminal symbols,
first(s) is the set of all terminals that can occur as the
first terminal in a sentence derived from s.  For a nonterminal
symbol A, follow(A) is the set of all terminals that can follow
a sentence derived from A in the context of a longer sentence
derived from S'.

A CFG G is LL(1) if, for every rule in G
A -> s1 | s2 | ... | sn,
a) first(si) is disjoint from first(sj), for all i != j, and
b) if A =>* e (the empty string), then first(A) is disjoint
from follow(A).

If a grammar is LL(1), then it can be parsed Left to right,
generating a Leftmost derivation, with 1 token lookahead.

In particular, every LL(1) grammar can be parsed by a recursive
descent parser.

Exercises:
* What are the first and follow sets for S in the grammar
for palindromes above?
* What are the first and follow sets for each nonterminal
symbol in the second grammar above for arithmetic expressions?
* What are the first and follow sets for each nonterminal
symbol in the above grammar for Lisp expressions?

Note that the second grammar above for arithmetic expressions
is not LL(1): its rules for exp and term are left recursive,
so the first sets of their alternatives overlap.

We can sometimes transform a grammar that is not LL(1)
into an equivalent grammar that is LL(1) by applying the
following transformations:
* Left factoring:
      exp -> term  |  term addop exp
  =>
      exp -> term exp-tail
      exp-tail -> e  |  addop exp
* Left recursion elimination:
      exp -> term  | exp addop term
  =>
      exp -> term exp-tail
      exp-tail -> e  | addop term exp-tail

These rules aren't complete, and are difficult to formalise.

It often simplifies the task to write the transformed grammars
in extended Backus-Naur form (EBNF).  In this form, each right
hand side of a rule may use parentheses (for grouping),
alternation ("|"), repetition ("*") and optionality ("?").
I.e., each nonterminal has a single right hand side which is an
arbitrary regular expression, not just an alternation of
sequences.

Every grammar expressed in EBNF may be rewritten as an equivalent
context-free grammar.

Third grammar for arithmetic expressions:
exp    ->  term (addop term)*
addop  ->  "+" | "-"
term   ->  factor (mulop factor)*
mulop  ->  "*" | "/"
factor ->  number  |  "(" exp ")"

This grammar is LL(1).  (It's also clear and compact.)

Recursive descent parser for this grammar:

void exp(void) {
    term();
    while (token == "+" || token == "-") {
        addop();
        term();
    }
}

void addop(void) {
    if (token == "+") match("+");
    else match("-");    // token == "-"
}

Exercise: Complete this parser.  Note that what's shown
is pseudocode, not C code.   (A Java solution is available,
but you really should attempt the problem yourself before
looking at our solution.)

Check out the grammar, language and automata tools at
http://www.jflap.org.

Tree construction during parsing

In almost all cases, such parsing procedures have to be augmented
with code to construct the abstract syntax tree (or structure tree)
corresponding to the derivation tree for the sentence being parsed.

Representing derivation/parse/structure trees in C:

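typedef enum {Plus,Minus,Times,Divides} OpKind;
typedef enum {OpK,ConstK} ExpKind;
typedef struct streenode {
    ExpKind kind;                       /* operator node or constant leaf */
    OpKind op;                          /* used when kind == OpK */
    struct streenode *lchild, *rchild;  /* used when kind == OpK */
    int val;                            /* used when kind == ConstK */
} STreeNode;
typedef STreeNode *STreePtr;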

* In general, a node may have a variable number of children.
* It's better to use a union type to save space.
* It's desirable to declare all variables (and parameters, etc.)
to be of pointer types (e.g., STreePtr).
* More abstract tree declarations are possible in C++ and Java.

Exercise: Extend the above parser to construct the corresponding
abstract syntax tree.

Study the recursive descent parser parse.c for the TINY language.

```
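
As one possible way to approach the two exercises, here is a minimal C sketch of a recursive descent parser for the EBNF expression grammar (exp is a term followed by zero or more addop term pairs, term is a factor followed by zero or more mulop factor pairs) that builds a tree of STreeNode values. The one-character scanner, the NUMBER and ENDINPUT token codes, newNode(), parse(), eval(), and the parseExp/parseTerm/parseFactor names (chosen to avoid clashing with the C library's exp) are assumptions made for illustration; they are not taken from parse.c or from the course solution.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { Plus, Minus, Times, Divides } OpKind;
typedef enum { OpK, ConstK } ExpKind;
typedef struct streenode {
    ExpKind kind;
    OpKind op;                          /* meaningful when kind == OpK */
    struct streenode *lchild, *rchild;  /* meaningful when kind == OpK */
    int val;                            /* meaningful when kind == ConstK */
} STreeNode;
typedef STreeNode *STreePtr;

/* Hypothetical token codes; every other token is a single character. */
enum { NUMBER = 256, ENDINPUT = 257 };
static const char *input;    /* unread part of the input string */
static int token, tokenval;  /* lookahead token; value of the last NUMBER */

static void getToken(void) {
    while (isspace((unsigned char)*input)) input++;
    if (*input == '\0') { token = ENDINPUT; return; }
    if (!isdigit((unsigned char)*input)) { token = (unsigned char)*input++; return; }
    for (tokenval = 0; isdigit((unsigned char)*input); input++)
        tokenval = tokenval * 10 + (*input - '0');
    token = NUMBER;
}

static void match(int expected) {
    if (token != expected) {
        fprintf(stderr, "syntax error: expected token %d, found %d\n", expected, token);
        exit(EXIT_FAILURE);
    }
    getToken();
}

static STreePtr newNode(ExpKind kind, OpKind op, STreePtr l, STreePtr r, int val) {
    STreePtr t = malloc(sizeof *t);
    if (t == NULL) { perror("malloc"); exit(EXIT_FAILURE); }
    t->kind = kind; t->op = op; t->lchild = l; t->rchild = r; t->val = val;
    return t;
}

static STreePtr parseExp(void);
/* factor -> number | "(" exp ")" */
static STreePtr parseFactor(void) {
    STreePtr t;
    if (token == NUMBER) {
        t = newNode(ConstK, Plus, NULL, NULL, tokenval);  /* op unused for ConstK */
        match(NUMBER);
    } else {
        match('(');
        t = parseExp();
        match(')');
    }
    return t;
}

/* term -> factor (mulop factor)* */
static STreePtr parseTerm(void) {
    STreePtr t = parseFactor();
    while (token == '*' || token == '/') {
        OpKind op = (token == '*') ? Times : Divides;
        match(token);
        t = newNode(OpK, op, t, parseFactor(), 0);  /* tree grows to the left */
    }
    return t;
}

/* exp -> term (addop term)* */
static STreePtr parseExp(void) {
    STreePtr t = parseTerm();
    while (token == '+' || token == '-') {
        OpKind op = (token == '+') ? Plus : Minus;
        match(token);
        t = newNode(OpK, op, t, parseTerm(), 0);
    }
    return t;
}

STreePtr parse(const char *s) {
    input = s;
    getToken();  /* invariant: token holds the first unconsidered token */
    STreePtr t = parseExp();
    match(ENDINPUT);
    return t;
}

/* Evaluate a tree with integer arithmetic (nodes are never freed here). */
int eval(STreePtr t) {
    if (t->kind == ConstK) return t->val;
    int l = eval(t->lchild), r = eval(t->rchild);
    if (t->op == Plus)  return l + r;
    if (t->op == Minus) return l - r;
    if (t->op == Times) return l * r;
    assert(r != 0);  /* Divides */
    return l / r;
}
```

For example, eval(parse("2 - 3 - 4")) gives -5: each pass through the loop in parseExp makes the tree built so far the left child of the new operator node, so the result is the tree for (2 - 3) - 4. The repetition in the EBNF rule becomes a while loop, which is why no left-recursive call is needed.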