PLI Lecture 2
Lexical analysis: specification and implementation
* Lexical analysis (scanning): Tasks
- Eliminate comments
- Transform program text to a stream of basic symbols
(called tokens by Louden)
- What are basic symbols?
. Reserved words
. Identifiers
. Constants (strings, numbers, ...)
. Operators
. Punctuation symbols
- Enter (new) constants into a constant table (optional)
- Enter (new) names into a name table (optional)
- Report lexical errors
* Representing tokens internally:
- Token type (reserved word, identifier, constant,
operator, punctuation symbol, ...)
- Value, or reference to value in a constant (or name) table
- Location in program text
* Representing tokens internally (in C):
typedef enum {
IF, THEN, ELSE, ..., ID, INT, DOUBLE,
ASSIGN, PLUS, MINUS, ..., LPAREN, RPAREN, COMMA, ...
} TokenType;
typedef struct {
TokenType tokenval;
char *stringval;
int numval;
int line;
} TokenRecord;
Note 1. Could also use a union type for stringval/numval.
Note 2. These definitions do not use a constant (or name) table.
* Interface to lexical analyser
- TokenType getToken(void);
(reads and returns next token from input)
(called repeatedly from parser)
- Internally, the lexical analyser may call:
void lexError(char *errorMsg);
char getChar(void);
constantTabEntry *findConstant(char *stringval);
nameTabEntry *findName(char *stringval);
* Implementation principles
- Lexical analysis touches every character of the source program,
so it can dominate compile time; optimisation is important.
- Read each line of source program into an internal buffer.
- getToken() advances a pointer in this buffer.
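The buffering principle above can be sketched as follows; the names (lineBuf, linepos, bufsize) are illustrative, not taken from the lecture's scan.c:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of line-buffered input: each source line is read into lineBuf,
   and getChar() just advances an index, so the scanner never performs
   per-character I/O. */
#define BUFLEN 256

static char lineBuf[BUFLEN];   /* current source line */
static int  linepos = 0;       /* next character in lineBuf */
static int  bufsize = 0;       /* number of valid characters in lineBuf */
static int  lineno  = 0;       /* current line number, for error messages */

int getChar(FILE *source) {
    if (linepos >= bufsize) {                      /* refill the buffer */
        if (fgets(lineBuf, BUFLEN, source) == NULL)
            return EOF;                            /* end of source program */
        lineno++;
        bufsize = (int)strlen(lineBuf);
        linepos = 0;
    }
    return lineBuf[linepos++];                     /* advance the pointer */
}
```

Keeping the line number updated here is also what lets the scanner stamp each TokenRecord with its location in the program text.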
* Implementation techniques
- Manual implementation, see Section 2.5 (pp.75-80)
- Requires hand-coding a state transition machine
- See files globals.h, util.h, scan.h, scan.c (pp.511-516)
- Use Java library classes (java.util.Scanner, J2SE 5.0)
See Sun's "Core Java Technologies Tech Tips" for
December 1, 2004, and March 8, 2005. (Previous classes
were inadequate.)
- Use tools: Lex/Flex (C), JavaCC (Java), JLex (Java), Eli (C)
* Regular expressions
- Used to specify form of basic symbols
(Also used in shells, editors, Web programming, ...)
- Examples:
. Reserved words: if, then, else, ...
. Identifiers: [a-zA-Z][a-zA-Z0-9]*
. Integers: (\+|-)?[0-9][0-9]*
. Doubles: (\+|-)?[0-9][0-9]*(\.[0-9][0-9]*)?(e(\+|-)?[0-9][0-9]*)?
. And so on
. (It's difficult to define comments with regular expressions.)
- In general, regular expressions are defined recursively from:
. Empty string (epsilon)
. Empty set (phi)
. Characters ('a') and character classes ([a-z])
by composition using the following operators:
. concatenation: abc, [a-z][0-9]z, <e> <f>
. alternation: a|b|c, [a-z] | [0-9], <e> | <f>
. Optional: (+|-)?, <e>?
. Repetition: [a-z][a-z]*, <e>*
. Positive repetition: [a-z]+, <e>+
also using:
. Parentheses (<e>), for grouping
. Escape character (\), e.g. \+ | - | \* | / for the arithmetic operators
- Examples:
. (1*01*0)*1* denotes the set of binary strings with
an even number of 0s.
. 1(01|1)*0? denotes the set of binary strings starting
with 1 and not containing 00.
- Many variations of syntax and expressive power.
- See "Mastering Regular Expressions, Third Edition", by J. Friedl
(O'Reilly, 2006).
* Tokens in most programming languages can be described by
regular expressions.
- Examples follow in tiny.l below.
* Regular languages
- Each regular expression (RE) defines a set of strings
(a language) over its alphabet.
- The languages (sets of strings) that can be defined
by regular expressions are exactly the class of regular languages.
- Regular languages can be recognised by computers (automata)
with only finite memory.
- (But how?)
* Finite automata I
- Models of simple, finite-memory computers used to recognise
regular expressions (and many other phenomena)
- A deterministic finite automaton (DFA) consists of:
. a finite set of states
. an initial state
. a transition function: state, char -> new state
. a set of accepting states
- A DFA accepts an input string if, from the initial state,
it eventually reaches an accepting state
- Examples:
. identifiers (Figs 2.1 and 2.2), 1/letter->2, 2/letter|digit->2
. unsigned integers, 1/digit->2, 2/digit->2
. floating point numbers (Fig. 2.3),
(\+|-)? digit+ (\. digit+)? (E (\+|-)? digit+)?
. binary numbers divisible by 3
- DFAs can be represented as a 2-dimensional array.
- DFAs can process input strings very efficiently:
state := INITIAL_STATE;
while (more input) {
ch := getChar();
state := table[state][ch];   // transition function
}
if (state is an accepting state) accept;
else reject;
* Main theorem (Kleene)
A language is regular (i.e., is defined by a RE)
if and only if it is accepted by some DFA.
* Finite automata II
- So we can implement a lexical analyser by defining
basic symbols using REs, translating the REs
into equivalent DFAs, then using the DFAs to
recognise the basic symbols.
- But, to automatically translate an RE into a DFA,
we need to go via an NFA...
- A nondeterministic finite automaton (NFA) is like a DFA
except for the transition function which maps a state and
a character into a SET of states, and for the possibility
of empty transitions.
- An NFA accepts an input string if SOME sequence of choices
leads to an accepting state.
- Examples: Integers, doubles, binary strings with 0 as
third last symbol.
* Translation of REs into equivalent NFAs
- Simple recursive construction (Section 2.4.1, not required)
* Translation of NFAs into equivalent DFAs
- Subset construction (Section 2.4.2, not required)
* Manual scanner implementation
- See Fig. 2.10
- See TinyCompiler: scan.c
* Automated scanner implementation
- Use tools such as lex/flex, JavaCC, JLex, Eli
- See TinyCompiler scanner specification: lex/tiny.l
- Invoke lex as follows:
$ lex tiny.l
- Generates a C file lex.yy.c that defines the following function:
TokenType getToken(void)
- Or: invoke flex as follows:
$ flex -oscanner.c tiny.l
which generates a C file scanner.c
- Use the function: TokenType getToken(void).
- Do not look at the generated scanner.
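To give the flavour of such a specification, here is an illustrative fragment in the style of tiny.l (not the actual file; the token names are assumed to come from a header such as globals.h, as in TinyCompiler):

```lex
%{
#include "globals.h"   /* assumed to define IF, THEN, ELSE, ID, NUM, ... */
%}

digit    [0-9]
letter   [a-zA-Z]

%%
"if"                          { return IF; }
"then"                        { return THEN; }
"else"                        { return ELSE; }
{letter}({letter}|{digit})*   { return ID; }
{digit}+                      { return NUM; }
":="                          { return ASSIGN; }
"+"                           { return PLUS; }
[ \t\n]+                      { /* skip whitespace */ }
%%
```

Each rule pairs an RE with a C action; lex/flex compiles the REs into a DFA and generates the table-driven recognition loop for you.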
* Some Scheme examples