PLI Lecture 2
Lexical analysis: specification and implementation

* Lexical analysis (scanning): Tasks
  - Eliminate comments
  - Transform program text to a stream of basic symbols 
    (called tokens by Louden)
  - What are basic symbols?
    . Reserved words
    . Identifiers
    . Constants (strings, numbers, ...)
    . Operators
    . Punctuation symbols
  - Enter (new) constants into a constant table (optional)
  - Enter (new) names into a name table (optional)
  - Report lexical errors

* Representing tokens internally:
  - Token type (reserved word, identifier, constant,
    operator, punctuation symbol, ...)
  - Value, or reference to value in a constant (or name) table
  - Location in program text

* Representing tokens internally (in C): 

  typedef enum {
      IF, THEN, ELSE, ..., ID, INT, DOUBLE, 
      ASSIGN, PLUS, MINUS, ..., LPAREN, RPAREN, COMMA, ... 
  } TokenType;

  typedef struct {
      TokenType tokenval;   /* which kind of token this is */
      char *stringval;      /* spelling, e.g. for identifiers and strings */
      int numval;           /* value, e.g. for integer constants */
      int line;             /* source line, for error reporting */
  } TokenRecord;

  Note 1. Could also use a union type for stringval/numval.
  Note 2. These definitions do not use a constant (or name) table.
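
  The union variant mentioned in Note 1 might look like this (a sketch;
  the enum is abbreviated and the field names are illustrative):

```c
typedef enum { IF, THEN, ELSE, ID, NUM } TokenType;  /* abbreviated */

typedef struct {
    TokenType tokenval;
    union {                /* only one member is valid at a time, */
        char *stringval;   /* chosen according to tokenval: */
        int numval;        /* stringval for ID, numval for NUM */
    } attr;
    int line;              /* source line, for error reporting */
} TokenRecord;
```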

* Interface to lexical analyser
  - TokenType getToken(void);
    (reads and returns next token from input)
    (called repeatedly from parser)
  - Internally, the lexical analyser may call:
    void lexError(char *errorMsg);
    char getChar(void);
    constantTabEntry *findConstant(char *stringval);
    nameTabEntry *findName(char *stringval);
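
  The parser's calling pattern can be sketched as follows; getToken here
  is a stub returning a fixed token sequence, purely to make the loop
  concrete (the real scanner reads program text):

```c
typedef enum { ID, PLUS, NUM, ENDFILE } TokenType;  /* abbreviated */

/* Stub standing in for the real scanner: yields a fixed
   sequence of tokens, then ENDFILE forever. */
TokenType getToken(void) {
    static TokenType script[] = { ID, PLUS, NUM, ENDFILE };
    static int i = 0;
    TokenType t = script[i];
    if (t != ENDFILE) i++;
    return t;
}

/* A parser calls getToken() repeatedly until end of input;
   here we simply count the tokens before ENDFILE. */
int countTokens(void) {
    int n = 0;
    while (getToken() != ENDFILE)
        n++;
    return n;
}
```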

* Implementation principles
  - Lexical analysis touches every character of the source program,
    so it can dominate compilation time; optimisation is important.
  - Read each line of source program into an internal buffer.
  - getToken() advances a pointer in this buffer.
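
  A minimal sketch of this buffering scheme (loadLine stands in for
  reading a line of source; getChar follows the interface above, and
  ungetChar is a hypothetical helper for backing up one character):

```c
#include <string.h>

static char lineBuf[256];  /* current source line */
static int bufPos = 0;     /* index of next character to hand out */
static int bufLen = 0;

/* Stand-in for reading the next source line into the buffer. */
void loadLine(const char *line) {
    strncpy(lineBuf, line, sizeof lineBuf - 1);
    lineBuf[sizeof lineBuf - 1] = '\0';
    bufLen = (int) strlen(lineBuf);
    bufPos = 0;
}

char getChar(void) {
    if (bufPos < bufLen) return lineBuf[bufPos++];
    return '\0';  /* end of line; a real scanner refills the buffer */
}

void ungetChar(void) {
    if (bufPos > 0) bufPos--;  /* back up by one character */
}
```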

* Implementation techniques
  - Manual implementation, see Section 2.5 (pp.75-80)
  - Requires hand-coding a state transition machine
  - See files globals.h, util.h, scan.h, scan.c (pp.511-516)
  - Use Java library classes (java.util.Scanner, J2SE 5.0)
    See Sun's "Core Java Technologies Tech Tips" for
    December 1, 2004, and March 8, 2005.  (Previous classes
    were inadequate.)
  - Use tools: Lex/Flex (C), JavaCC (Java), JLex (Java), Eli (C)

* Regular expressions 
  - Used to specify form of basic symbols
    (Also used in shells, editors, Web programming, ...)
  - Examples:
    . Reserved words: if, then, else, ...
    . Identifiers: [a-zA-Z][a-zA-Z0-9]*
    . Integers: (+|-)?[0-9][0-9]*
    . Doubles: (+|-)?[0-9][0-9]*(\.[0-9][0-9]*)?(e(+|-)?[0-9][0-9]*)?
    . And so on
    . (It's difficult to define comments with regular expressions.)
  - In general, regular expressions are defined recursively from:
    . Empty string (epsilon)
    . Empty set (phi)
    . Characters ('a') and character classes ([a-z])
    by composition using the following operators:
    . concatenation: abc, [a-z][0-9]z, <e> <f>
    . alternation: a|b|c, [a-z] | [0-9], <e> | <f>
    . Optional: (+|-)?, <e>?
    . Repetition: [a-z][a-z]*, <e>*
    . Positive repetition: [a-z]+, <e>+
    also using:
    . Parentheses (<e>), for grouping
    . Escape character (\), to match metacharacters literally:
      e.g. \+ | - | \* | / matches the four arithmetic operators
  - Examples:
    . (1*01*0)*1* denotes the set of binary strings with
      an even number of 0s.
    . 1(01|1)*0? denotes the set of binary strings starting
      with 1 and not containing 00.
  - Many variations of syntax and expressive power.
  - See "Mastering Regular Expressions, Third Edition", by J. Friedl
    (O'Reilly, 2006).

* Tokens in most programming languages can be described by 
  regular expressions.
  - Examples follow in tiny.l below.

* Regular languages
  - Each regular expression (RE) defines a set of strings 
    (a language) over its alphabet.
  - The languages (sets of strings) that can be defined 
    by regular expressions form the class of regular languages.
  - Regular languages can be recognised by computers (automata)
    with only finite memory.  
  - (But how?)
 
* Finite automata I
  - Models of simple, finite-memory computers used to recognise 
    regular expressions (and many other phenomena)
  - A deterministic finite automaton (DFA) consists of:
    . a finite set of states
    . an initial state
    . a transition function: state, char -> new state
    . a set of accepting states
  - A DFA accepts an input string if, starting from the initial
    state, it is in an accepting state after reading the whole string
  - Examples: 
    . identifiers (Figs 2.1 and 2.2), 1/letter->2, 2/letter|digit->2
    . unsigned integers, 1/digit->2, 2/digit->2
    . floating point numbers (Fig. 2.3),
      (+|-)? digit digit* (. digit digit*)? (E (+|-)? digit digit*)?
    . binary numbers divisible by 3
  - DFAs can be represented as a 2-dimensional array.
  - DFAs can process input strings very efficiently:

    state := INITIAL_STATE;
    while (more input) {
      ch := getChar();
      state := table[state][ch]; // transition function
    }
    if (state is an accepting state) accept;
    else reject;
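
  The loop above, filled in for the example "binary numbers divisible
  by 3" (the function name and table layout are illustrative):

```c
/* State i means: the bits read so far, interpreted as a binary
   number, leave remainder i when divided by 3.  Reading bit b
   takes value v to 2v+b, so the new state is (2*state + b) mod 3. */
int divisibleBy3(const char *input) {
    static const int table[3][2] = {
        /* '0' '1' */
        {   0,  1  },   /* from state 0 */
        {   2,  0  },   /* from state 1 */
        {   1,  2  }    /* from state 2 */
    };
    int state = 0;                       /* initial state */
    for (int i = 0; input[i] != '\0'; i++) {
        if (input[i] != '0' && input[i] != '1')
            return 0;                    /* not a binary string: reject */
        state = table[state][input[i] - '0'];
    }
    return state == 0;                   /* state 0 is the accepting state */
}
```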

* Main theorem (Kleene)
  A language is regular (i.e., is defined by a RE)
  if and only if it is accepted by some DFA. 

* Finite automata II
  - So we can implement a lexical analyser by defining 
    basic symbols using REs, translating the REs 
    into equivalent DFAs, then using the DFAs to 
    recognise the basic symbols.
  - But, to automatically translate an RE into a DFA,
    we need to go via an NFA...
  - A nondeterministic finite automaton (NFA) is like a DFA,
    except that its transition function maps a state and a
    character into a SET of states, and it may also make empty
    (epsilon) transitions that consume no input.
  - An NFA accepts an input string if SOME sequence of choices
    leads to an accepting state.
  - Examples: Integers, doubles, binary strings with 0 as
    third last symbol.
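
  An NFA can also be run directly by tracking the SET of states it could
  be in; here is a sketch for the last example, with the state set kept
  as a bitmask (the function name and state numbering are illustrative):

```c
/* NFA for "binary strings with 0 as the third-last symbol".
   States: 0 = start (loops on 0 and 1);
           1 = nondeterministic guess: this 0 is third-last;
           2 = one symbol after that 0;
           3 = two symbols after it (accepting). */
int zeroThirdLast(const char *input) {
    unsigned states = 1u << 0;            /* current state SET: {0} */
    for (int i = 0; input[i] != '\0'; i++) {
        unsigned next = 0;
        if (states & (1u << 0)) {
            next |= 1u << 0;              /* stay in the start state */
            if (input[i] == '0')
                next |= 1u << 1;          /* ...or guess: third-last 0 */
        }
        if (states & (1u << 1)) next |= 1u << 2;  /* any symbol */
        if (states & (1u << 2)) next |= 1u << 3;  /* any symbol */
        states = next;
    }
    return (states & (1u << 3)) != 0;     /* accept if state 3 is reachable */
}
```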

* Translation of REs into equivalent NFAs
  - Simple recursive construction (Section 2.4.1, not required)

* Translation of NFAs into equivalent DFAs
  - Subset construction (Section 2.4.2, not required)

* Manual scanner implementation
  - See Fig. 2.10
  - See TinyCompiler: scan.c

* Automated scanner implementation 
  - Use tools such as lex/flex, JavaCC, JLex, Eli
  - See TinyCompiler scanner specification: lex/tiny.l
  - Invoke lex as follows:

    $ lex tiny.l

  - Generates a C file lex.yy.c that defines the following function:
    TokenType getToken(void)
  - Or: invoke flex as follows:

    $ flex -oscanner.c tiny.l

    which generates a C file scanner.c
  - Use the function: TokenType getToken(void).
  - Do not look at the generated scanner.
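
  For orientation, a flex specification has the following shape (an
  illustrative sketch only, not the actual lex/tiny.l; the token names
  and the globals.h header are assumptions):

```lex
%{
/* Illustrative sketch only -- not the actual lex/tiny.l. */
#include "globals.h"   /* assumed to define TokenType and token names */
%}

digit   [0-9]
letter  [a-zA-Z]

%%
"if"                        { return IF; }
"then"                      { return THEN; }
"else"                      { return ELSE; }
{digit}+                    { return NUM; }
{letter}({letter}|{digit})* { return ID; }
[ \t\n]+                    { /* skip whitespace */ }
.                           { return ERROR; }
%%
```

  Flex picks the longest match, and among equal-length matches the
  earliest rule, so the reserved-word rules are listed before the
  identifier rule.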

* Some Scheme examples
  - Scheme examples