Context-free grammars and languages, derivations, parse trees, ambiguity, first and follow sets, LL(1) grammars, recursive descent parsing, examples.
Context-free grammars

Originally proposed by Noam Chomsky for describing natural language grammars.

First grammar, for palindromes:

  S -> e | a | b | a S a | b S b

Here, S is a nonterminal symbol, a and b are terminal symbols, and e denotes the empty string.

Derivation: S => a S a => a b S b a => a b b a

First grammar for arithmetic expressions:

  exp -> exp op exp | ( exp ) | number
  op -> + | - | * | /

Here, exp and op are nonterminal symbols; "(", ")", number, "+", "-", "*" and "/" are terminal symbols.

Derivation: exp => exp op exp => ( exp ) * 4 => ( exp op exp ) * 4 => ( 2 + 3 ) * 4

First grammar for statements:

  stmt -> identifier = exp ;
        | identifier ( arg-part ) ;
        | { stmt-list }
        | if ( exp ) stmt
        | if ( exp ) stmt else stmt
        | while ( exp ) stmt
  exp -> ...
  arg-part -> e | arg-list
  arg-list -> arg | arg , arg-list
  arg -> exp
  stmt-list -> stmt | stmt stmt-list

First grammar for Lisp expressions:

  lexp -> atom | list
  atom -> number | identifier
  list -> ( lexp-seq )
  lexp-seq -> lexp | lexp-seq lexp

Derivation:

  lexp => list => ( lexp-seq ) => ( lexp-seq lexp ) =>* ( lexp-seq lexp list )
       =>* ( lexp 3 ( lexp-seq ) ) =>* ( * 3 ( lexp-seq lexp ) ) => ... => ( * 3 ( + 4 5 ) )

We say: lexp =>* ( * 3 ( + 4 5 ) )

In practice we need good conventions to distinguish nonterminal symbols, terminal symbols (language tokens), and the symbols of the grammar notation itself (e.g., "|").

General definition of a context-free grammar (CFG): a set of nonterminal symbols, including a start symbol S; a set of terminal symbols; and a set of rules of the form

  A -> a1 ... am (m >= 0)

where A is a nonterminal and a1, ..., am are nonterminals or terminals.

  A -> s1 | s2 | ... | sn (n >= 1)

is simply an abbreviation for

  A -> s1
  ...
  A -> sn

L(G), the language defined by the CFG G, is the set of sentences t1 t2 ... tu such that S =>* t1 ... tu, where t1, ..., tu are terminal symbols (u >= 0). I.e., L(G) is the set of terminal sequences that can be derived from the start symbol of G. (Clearly, L(G) may be infinite.)
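The palindrome grammar can be recognised directly by mirroring its rules in code. Here is a minimal sketch in C (the names pal and is_palindrome are ours): pal(s, i, j) answers whether the substring s[i..j-1] can be derived from S.

```c
#include <string.h>

/* Recognise the language of the palindrome grammar
   S -> e | a | b | a S a | b S b over the alphabet {a, b},
   by mirroring the grammar rules in the recursion. */
static int pal(const char *s, int i, int j) {
    if (j - i <= 1)                  /* S -> e, S -> a, or S -> b */
        return 1;
    return s[i] == s[j - 1] &&       /* S -> a S a or S -> b S b */
           pal(s, i + 1, j - 1);
}

int is_palindrome(const char *s) {
    return pal(s, 0, (int)strlen(s));
}
```

Note how each case of the if mirrors one group of grammar rules; this is the same idea that recursive descent parsing develops systematically later on.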
Leftmost and rightmost derivations

A derivation is called leftmost (resp., rightmost) if, at every step, we replace the leftmost (resp., rightmost) nonterminal symbol by one of its right-hand sides.

Leftmost derivation: exp => exp op exp => 2 op exp => 2 + exp => 2 + 3
Rightmost derivation: exp => exp op exp => exp op 3 => exp + 3 => 2 + 3

Derivation trees provide an order-independent description of how a sentence is derived from the start symbol of a grammar. (Examples of derivation trees for 2+3, 2+3*4, (2+3)*4, etc.) So a sentence is in the language defined by a grammar if and only if it is the sequence of leaves of a derivation tree for that grammar.

Context-free languages

A language is context-free if it is the language defined by some context-free grammar.

Theorem: Every regular language is context-free.

In general, context-free languages require automata with unbounded space to recognise them, e.g., { a^m b^m | m >= 0 }, balanced parenthesis strings, arithmetic expressions. But this unbounded space only needs to be organised as a stack. Indeed, a language is context-free (i.e., is defined by some context-free grammar) if and only if it is recognised by some nondeterministic pushdown (stack) automaton (cf. Kleene's theorem for regular languages).

Note that checking whether a given language is the language generated by a given context-free grammar is an undecidable problem.

Ambiguous grammars

Some grammars are better than others. One bad property a grammar may have is ambiguity. A grammar G is ambiguous if there exists a sentence in L(G) with two distinct derivation trees.

E.g., 2 + 3 * 4 has two different derivation trees.
E.g., 2 - 3 - 4 has two different derivation trees.

Distinct derivation trees may have distinct "meanings": in these examples, distinct values. So ambiguous grammars are undesirable.
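To illustrate the stack discipline of pushdown automata, here is a sketch in C of a deterministic pushdown-style recogniser for balanced bracket strings over three bracket pairs, a context-free language that no finite automaton can recognise. The function name balanced and the fixed stack bound are our choices; a real PDA has an unbounded stack.

```c
#include <string.h>

/* A pushdown-style recogniser for balanced strings over
   ( ) [ ] { }. The explicit stack plays the role of the PDA's stack:
   push on an opening bracket, pop and check on a closing one. */
int balanced(const char *s) {
    char stack[256];
    int top = 0;                        /* stack pointer */
    for (; *s; s++) {
        switch (*s) {
        case '(': case '[': case '{':
            if (top == 256) return 0;   /* bounded stack: sketch only */
            stack[top++] = *s;          /* push */
            break;
        case ')':
            if (top == 0 || stack[--top] != '(') return 0;
            break;
        case ']':
            if (top == 0 || stack[--top] != '[') return 0;
            break;
        case '}':
            if (top == 0 || stack[--top] != '{') return 0;
            break;
        default:
            return 0;                   /* only brackets allowed */
        }
    }
    return top == 0;                    /* accept iff the stack is empty */
}
```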
Second grammar for arithmetic expressions:

  exp -> term | exp addop term
  addop -> + | -
  term -> factor | term mulop factor
  mulop -> * | /
  factor -> number | ( exp )

This grammar is (much) better than the previous one:

* It is unambiguous (this requires proof).
* It captures operator precedence (mulops bind more tightly than addops).
* It captures operator associativity (operators are left associative, i.e., 2 - 3 - 4 is interpreted as (2 - 3) - 4).

Exercise: Modify this grammar to make mulops right associative.

This is an important example. It is not trivial or mechanical to transform the first expression grammar into the second one.

"Dangling else" problem: The grammar for statements above is ambiguous: there are two distinct derivation trees for the statement

  if (exp1) if (exp2) id1 = exp3; else id2 = exp4;

Exercise: Construct these two trees.

Transforming this grammar into an unambiguous grammar in which each "else part" is matched with the closest unmatched "if part" is nontrivial. Exercise: Construct such a grammar.

Note that checking whether a given grammar is ambiguous is an undecidable problem.

Limitations of context-free languages

Not every language is context-free. The following languages are not context-free:

* { a^m b^m c^m | m >= 0 }
* The set of nested compound statements (with declarations) in which every variable is declared before it is used.
* The set of well-typed Java programs.
* The set of always-terminating Java programs. (This language is not even computable. See 3130CIT TOC for details.)

Parsing

Parsing (syntactic analysis) takes a sequence of tokens (i.e., terminal symbols) as input, and

1. determines whether the given string is a sentence of the language or not, and
2. (re)constructs a derivation tree (also called a parse tree) for the sentence in the language.

The resulting parse tree describes the structure and, to some extent, the meaning of the sentence.
In practice, parsers normally construct abstract syntax trees or structure trees (which omit semantically meaningless derivation steps) rather than parse trees.

There are two main types of parsers:

* Top-down parsers build the parse tree top-down (and left to right). Top-down parsers may be built by hand or by tools.
* Bottom-up parsers build the parse tree bottom-up (and left to right). In practice, bottom-up parsers may only be built by tools.

The sequence of rule applications in the tree construction normally corresponds to a leftmost derivation in a top-down parser and to a rightmost derivation in a bottom-up parser. We describe one form of top-down parser below, and one form of bottom-up parser in the next lecture.

Recursive descent parsers

* Construct parse trees from the top, i.e., from the start symbol at the root, down to the terminal symbols at the leaves.
* Expand nonterminal symbols in the same order as a leftmost derivation.
* The code mirrors the grammar rules: one procedure for each nonterminal symbol.
* But they may require grammar transformation.

Example, for the rule factor -> number | "(" exp ")":

  void factor(void)
  {
      if (token == number) {
          match(number);
      } else {
          match("(");
          exp();
          match(")");
      }
  }

  void match(TokenType expected)
  {
      if (token == expected)
          token = getToken();
      else
          error(token, expected);
  }

Note that we are now identifying (programming) language tokens with the terminal symbols of the context-free grammar.

Rules for writing a recursive descent parser for a CFG G:

* Use a global variable, token, which stores the first as-yet-unconsidered token in the input. Initially, token stores the first token in the input.
* Write one parsing procedure A() for each nonterminal symbol A in G.
* Each such procedure A() recognises a prefix of the input derived from the nonterminal symbol A.
* Each such procedure A() must satisfy the following invariant:
  - On entry, token stores the token that starts the string derived from A.
  - On exit, token stores the first token that follows the string derived from A.

* Observe how the above definitions of factor() and match() ensure that these invariants are satisfied.

Problems:

* We can't always decide which right-hand side to use. Consider:

    exp -> term | term addop exp

  If the token that starts the string derived from exp is number (or "("), we can't tell whether to use the first alternative or the second.

* One right-hand side may derive the empty string. Consider:

    S -> e | a S b    (e denotes the empty string)

  What does it mean to say token may start the empty string? (The above description is somehow incomplete.)

* Left-recursive rules cause nonterminating computations. Consider:

    exp -> term | exp addop term

    void exp(void)
    {
        if (token == ???) {  /* problem??? */
            term();
        } else {
            exp();           /* uh, oh!!! */
            addop();
            term();
        }
    }

LL(1) grammars

Not every grammar has a recursive descent parser. Consider a CFG with start symbol S and terminals T. Extend the grammar with a new start symbol S' and an additional rule S' -> S $, where $ is a special "end of input" symbol. Then we can define the sets:

  first(s) = { t in T | s =>* t s' }, for all sentential forms s
  follow(A) = { t in T | S' =>* s' A t s'' }, for all nonterminals A

In particular, first(S') = first(S), and $ is in follow(S).

I.e., for a sequence s of terminal and nonterminal symbols, first(s) is the set of all terminals that can occur as the first terminal in a sentence derived from s. For a nonterminal symbol A, follow(A) is the set of all terminals that can follow a sentence derived from A in the context of a longer sentence derived from S'.

A CFG G is LL(1) if, for every rule A -> s1 | s2 | ... | sn in G,

a) first(si) is disjoint from first(sj), for all i != j, and
b) if A =>* e (the empty string), then first(A) is disjoint from follow(A).

If a grammar is LL(1), then it can be parsed Left to right, generating a Leftmost derivation, with 1 token of lookahead.
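The first and follow sets resolve the "empty right-hand side" problem noted above. As a sketch, here is a C recogniser for S -> e | a S b: choose S -> a S b when token is in first(a S b) = {a}; take S -> e exactly when token is in follow(S) = {b, $}. The names is_sentence, S and match are ours, the "token" is just the next character of a string, and '\0' plays the role of the end marker $.

```c
/* Recursive descent recogniser for S -> e | a S b,
   using first and follow sets to choose the right-hand side. */
static const char *input;   /* rest of the input string */
static char token;          /* current lookahead token */
static int ok;              /* set to 0 on any parse error */

static void next(void) { token = *input ? *input++ : '\0'; }

static void match(char expected) {
    if (token == expected) next(); else ok = 0;
}

static void S(void) {
    if (token == 'a') {                       /* token in first(aSb) */
        match('a'); S(); match('b');
    } else if (token != 'b' && token != '\0') {
        ok = 0;                               /* not in follow(S) either */
    }                                         /* else take S -> e */
}

int is_sentence(const char *s) {
    input = s; ok = 1;
    next();                                   /* load the first token */
    S();
    return ok && token == '\0';               /* all input consumed */
}
```

The grammar satisfies LL(1) condition b): S derives the empty string, and first(S) = {a} is disjoint from follow(S) = {b, $}, which is exactly why the if-test above is unambiguous.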
In particular, every LL(1) grammar can be parsed by a recursive descent parser.

Exercises:

* What are the first and follow sets for S in the above grammar for palindromes?
* What are the first and follow sets for each nonterminal symbol in the second grammar for arithmetic expressions above?
* What are the first and follow sets for each nonterminal symbol in the above grammar for Lisp expressions?

Note that the second grammar for arithmetic expressions above is not LL(1). We can sometimes transform a grammar that is not LL(1) into an equivalent grammar that is LL(1) by applying the following transformations:

* Left factoring:

    exp -> term | term addop exp
  becomes
    exp -> term exp-tail
    exp-tail -> e | addop exp

* Left recursion elimination:

    exp -> term | exp addop term
  becomes
    exp -> term exp-tail
    exp-tail -> e | addop term exp-tail

These rules aren't complete, and they are difficult to formalise.

It often simplifies the task to write the transformed grammars in extended Backus-Naur form (EBNF). In this form, each right-hand side of a rule may use parentheses (for grouping), alternation ("|"), repetition ("*") and optionality ("?"). I.e., each nonterminal has a single right-hand side which is an arbitrary regular expression, not just an alternation of sequences. Every grammar expressed in EBNF may be rewritten as an equivalent context-free grammar.

Third grammar for arithmetic expressions:

  exp -> term (addop term)*
  addop -> "+" | "-"
  term -> factor (mulop factor)*
  mulop -> "*" | "/"
  factor -> number | "(" exp ")"

This grammar is LL(1). (It's also clear and compact.)

Recursive descent parser for this grammar:

  void exp(void)
  {
      term();
      while (token in first(addop)) {
          addop();
          term();
      }
  }

  void addop(void)
  {
      if (token == "+")
          match("+");
      else
          match("-");  /* token == "-" */
  }

Exercise: Complete this parser. Note that what's shown is pseudocode, not C code. (A Java solution is available, but you really should attempt the problem yourself before looking at our solution.)
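To show how the EBNF repetitions become while loops in real code (this is not the exercise solution, which recognises token streams), here is one possible C realisation of the third grammar as an evaluator over single-character tokens: digits, operators and parentheses read directly from a string. All names are ours; exp_ avoids the C library name exp.

```c
#include <ctype.h>

/* Evaluator for  exp -> term (addop term)*,  term -> factor (mulop factor)*,
   factor -> number | "(" exp ")", with characters as tokens. */
static const char *src;               /* rest of the input string */

static int exp_(void);                /* forward declaration */

static int factor(void) {             /* factor -> number | "(" exp ")" */
    int v = 0;
    if (*src == '(') {
        src++;                        /* match "(" */
        v = exp_();
        if (*src == ')') src++;       /* match ")" */
    } else {
        while (isdigit((unsigned char)*src))
            v = v * 10 + (*src++ - '0');
    }
    return v;
}

static int term(void) {               /* term -> factor (mulop factor)* */
    int v = factor();
    while (*src == '*' || *src == '/') {
        char op = *src++;             /* match mulop */
        int w = factor();
        v = (op == '*') ? v * w : v / w;
    }
    return v;
}

static int exp_(void) {               /* exp -> term (addop term)* */
    int v = term();
    while (*src == '+' || *src == '-') {
        char op = *src++;             /* match addop */
        int w = term();
        v = (op == '+') ? v + w : v - w;
    }
    return v;
}

int eval(const char *s) { src = s; return exp_(); }
```

Notice how the loop structure makes both precedence (term() is called from inside exp_()) and left associativity (the accumulator v grows to the left) fall out automatically.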
Check out the grammar, language and automata tools at http://www.jflap.org.

Tree construction during parsing

In almost all cases, such parsing procedures have to be augmented with code to construct the abstract syntax tree (or structure tree) corresponding to the derivation tree for the sentence being parsed.

Representing derivation/parse/structure trees in C:

  typedef enum { Plus, Minus, Times, Divides } OpKind;
  typedef enum { OpK, ConstK } ExpKind;

  typedef struct streenode {
      ExpKind kind;
      OpKind op;
      struct streenode *lchild, *rchild;
      int val;
  } STreeNode;

  typedef STreeNode *STreePtr;

* In general, a node may have a variable number of children.
* It's better to use a union type to save space.
* It's desirable to declare all variables (and parameters, etc.) to be of pointer types (e.g., STreePtr).
* More abstract tree declarations are possible in C++ and Java.

Exercise: Extend the above parser to construct the corresponding abstract syntax tree.

Study the recursive descent parser parse.c for the TINY language.
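As a starting point for the tree-construction exercise, the tree type can be paired with constructor helpers that the parsing procedures call as the recursion unwinds. The helper names newConstNode and newOpNode are ours; the declarations are repeated so the sketch is self-contained.

```c
#include <stdlib.h>

typedef enum { Plus, Minus, Times, Divides } OpKind;
typedef enum { OpK, ConstK } ExpKind;

typedef struct streenode {
    ExpKind kind;
    OpKind op;                          /* meaningful only when kind == OpK */
    struct streenode *lchild, *rchild;  /* NULL for constants */
    int val;                            /* meaningful only when kind == ConstK */
} STreeNode;
typedef STreeNode *STreePtr;

/* Build a leaf node for a number. */
STreePtr newConstNode(int val) {
    STreePtr p = malloc(sizeof *p);
    p->kind = ConstK;
    p->val = val;
    p->lchild = p->rchild = NULL;
    return p;
}

/* Build an interior node for an operator and its two operands. */
STreePtr newOpNode(OpKind op, STreePtr l, STreePtr r) {
    STreePtr p = malloc(sizeof *p);
    p->kind = OpK;
    p->op = op;
    p->lchild = l;
    p->rchild = r;
    return p;
}
```

For example, the left-associative reading of 2 - 3 - 4 is built as newOpNode(Minus, newOpNode(Minus, newConstNode(2), newConstNode(3)), newConstNode(4)). (A production parser would also check malloc for NULL.)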