PLI Tutorial 3
Lexical analysis and context-free grammars

(Lm.n refers to Louden, Exercise m.n.)

Complete Tutorial 2.

Part A.  Scheme programming.

1. Given a list L, write a function that returns the sublist of L from 
   position m of length n, where 0 <= m, n and m+n <= length(L).  For
   example, (sublist '(2 3 5 7 11 13 17 19) 2 4) ==> (5 7 11 13).
   
2. Write a function to sort a list of integers using the mergesort
   algorithm.  (Hint. Function sublist might be useful.)
   
3. Run-length encoding of a list.
   Implement the so-called run-length encoding data compression method. 
   Consecutive copies of an element in a list are encoded as lists (<n> x),
   where <n> is the number of copies of the element x.  For example,
   (encode '(a a a a b c c a a d e e e e)) ==>
   ((4 a) (1 b) (2 c) (2 a) (1 d) (4 e))

4. Modified run-length encoding.
   Modify the solution to the previous question so that if an element 
   has no following copies it is transferred unchanged into the result list. 
   Only elements with copies are transferred as (<n> x) lists. For example,
   (encode-modified '(a a a a b c c a a d e e e e)) ==>
   ((4 a) b (2 c) (2 a) d (4 e))

5. Modified run-length decoding.
   Given a run-length code list generated as specified in the previous question,
   construct its uncompressed version.

(Solutions to questions 3 and 5 are provided on the Resources page but 
please try to solve the questions yourself before reading the solutions.)

Part B.  Lexical analysis.
   
1, Modify the lex specification lex/tiny.l (and related files)
   in the Tiny compiler so that the function getToken() returns 
   a structure containing a token type and a token string instead 
   of returning a token type and assigning the token string to a
   global variable.

   Check that flex appears to accept your changes by executing flex
   on your modified specification.

2. Modify the Lex specification lex/tiny.l (and related files) 
   in the Tiny compiler as follows:
   
   a. Add new reserved words "while" and "do".
   b. Allow digits and underscores in identifiers, but not at the start.
   c. Change the comment convention to the C convention /* ... */.
   
   Check that flex appears to accept your changes by executing flex
   on your modified specification.

3. Test that your specification is correct by writing and testing
   a driver program that repeatedly calls your generated scanner, 
   reading from standard input, and writing the type and value of
   each successive token on a separate line.
   
Part C.  Context-free grammars.

1. Give a CFG for the language { s in (a|b)* | s = reverse(s) } 
   of palindromes over terminal symbols a and b.

2. Give a CFG for the language {a^i b^j c^k | i=j or j=k}, where a, b
   and c are terminal symbols, and a^i denotes a sequence of i a's.
   Give two different leftmost derivations for the string a^2 b^2 c^2.

3. Argue informally that the CFG
      S -> e | a S b S | b S a S
   (where e denotes the empty sequence) generates all strings of a's
   and b's with an equal number of a's and b's.

4. Given the second CFG for expressions in the Lecture Notes (the one
   below that defines precedences and associativities of operators),

      exp -> exp addop term | term
      addop -> + | -
      term -> term mulop factor | factor
      mulop -> * | /
      factor -> ( exp ) | number

   give a leftmost derivation, a derivation tree and an abstract syntax 
   tree for the expression 3-(4+5*6).
   
5. (L3.4) The following grammar generates all regular expressions
   over the alphabet of letters ("|" and "*" are terminal symbols
   in the language of regular expressions):

      R -> R "|" R | R R | R "*" | "(" R ")" | letter

   a. Give a leftmost derivation for the regular expression (aa | b)*
      using this grammar.
   b. Show that the grammar is unambiguous.
   c. Rewrite the grammar to establish the correct precedences
      between "*" (highest), concatenation (next highest), and "|" 
      (lowest).

6. (L3.6 and L3.10a) Consider the following grammar for simplified
   Scheme expressions:
   
      lexp -> atom | list
      atom -> number | identifier
      list -> "(" lexp-seq ")"
      lexp-seq -> lexp | lexp-seq
      
   a. Write a leftmost and a rightmost derivation for the string 
      (a 23 (m x y).  Draw a derivation tree for the same string.
   b. Rewrite the grammar in extended Backus-Naur form (EBNF).
   
7. (L3.13) Consider the following simplification of the above grammar, 
   which describes simple arithmetic expressions in Scheme-like 
   prefix notation:
   
      lexp -> number | "(" op lexp-seq ")"
      op -> "+" | "-" | "*" | "/"
      lexp-seq -> lexp | lexp-seq lexp
      
   a. What meanings should be assigned to the expressions (+ 2), 
      (- 2), (+ 2 3 4), (- 2 3 4)?
   b. Does this grammar capture the normal precedence and associativity
      of the arithmetic operators?
   c. Is the grammar ambiguous or unambiguous?
   
8. Consider the following grammar for balanced parenthesis strings:
       S -> e | ( S ) S 
    a. What are first(S) and follow(S) in ths grammar?  Is first(S)
       disjoint from follow(S)?  Is the grammar LL(1)?
    b. Write a recursive descent parser for this grammar.  Your parser
       need only recognise whether an input string is a balanced 
       parenthesis string or not.

9. Consider the second CFG for expressions (the one above).  
    What are first(A) and follow(A) for each nonterminal symbol A 
    in the grammar?   Show that the grammar is not LL(1).

10. Write an EBNF grammar for this language that is LL(1).  (An EBNF
    grammar is LL(1) if its equivalent BNF grammar is LL(1).)  Show
    that the grammar is LL(1).