PLI Lecture 6
Semantic analysis II: Data structures and algorithms
Symbol tables, scope rules, name analysis, type inference
and checking.
1. THE SYMBOL TABLE (Louden, Section 6.3)
The symbol table is a typical dictionary data structure
that stores information about the identifiers declared
and used in the program.
The principal operations are:
* insert()
* lookup()
* delete()
These are sometimes called from the lexical analyser
and the parser, or perhaps only from the semantic
analyser (as in the TINY compiler).
For efficiency, a typical representation of the symbol
table is a hash table, normally of prime size. Because
the number of entries may grow indefinitely, the table
must use separate chaining (see Louden, Fig. 6.12).
The hash function should use all characters of the
identifier and weight each one differently. (Why?)
A possible implementation (Louden, Fig. 6.13) is as follows.
#define SIZE ...
#define SHIFT 4
int hash(char* key) {
    int t = 0, i = 0;
    while (key[i] != '\0') {
        t = ((t << SHIFT) + key[i]) % SIZE;
        i++;
    }
    return t;
}
The information stored with each identifier depends on
the entity associated with (or bound to) the identifier.
Typical entities are constants, types or classes, fields,
functions or methods, parameters, and variables.
This information may be stored in either the syntax
tree as attributes or in the symbol table directly.
Examples of information to be stored include:
* type and value of a constant
* type of a variable
* number of dimensions (and bounds for each dimension)
of an array
* field names and types of a struct,
* members (fields and methods) and constructors of a class,
and their properties,
* parameter names and types, and result type, of
a function.
Important concepts in associating information with
identifiers are:
* scope: the region of the program where the binding
(of an identifier to an entity) applies,
* range: the construct (block) containing the binding,
* extent: the lifetime of the binding, and
* block: a construct that may contain identifier
bindings.
Languages differ in subtle ways with respect to
their scope rules.
When blocks can be nested, the binding associated
with an identifier is normally the binding in the
innermost block containing a binding for that
identifier (static scope rule).
See Louden, Figure 6.14:
int i, j;
int f(int size) {
char i, tmp;
...
{ double i, j; ... }
...
{ char* j; ... }
...
}
See Louden, Figure 6.15, for a more complex example
of nested procedures in Pascal. Java has nested classes
and compound statements; C has nested compound statements;
Haskell and Lisp have nested procedures.
To represent a symbol table for a language with nested
blocks, a more complex structure than a simple hash
table with separate chaining is required.
One method is to treat each bucket list as a stack,
inserting and deleting name-entity pairs at the front
of each list, so that the innermost binding is always
found first.
See Louden, Figure 6.16.
Another method is to build a separate symbol table for
each scope. Then maintain a stack of symbol tables,
pushing when you enter a new scope, popping when you
leave the current scope.
See Louden, Figure 6.17.
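The second method (a stack of per-scope tables) can be sketched as follows. This is a minimal illustration, not Louden's code; the names Scope, enterScope, and leaveScope are assumptions:

```c
#include <stdlib.h>
#include <string.h>

#define SIZE 211
#define SHIFT 4

typedef struct Entry {            /* one binding, chained within a bucket */
    const char *name;
    int level;
    struct Entry *next;
} Entry;

typedef struct Scope {            /* one hash table per scope */
    Entry *bucket[SIZE];
    struct Scope *enclosing;      /* the "stack" is a linked list of scopes */
} Scope;

static int hash(const char *key) {
    int t = 0;
    for (int i = 0; key[i] != '\0'; i++)
        t = ((t << SHIFT) + key[i]) % SIZE;
    return t;
}

Scope *enterScope(Scope *top) {   /* push an empty table */
    Scope *s = calloc(1, sizeof *s);
    s->enclosing = top;
    return s;
}

Scope *leaveScope(Scope *top) {   /* pop (entries leaked for brevity) */
    return top->enclosing;
}

void insert(Scope *s, const char *name, int level) {
    int h = hash(name);
    Entry *e = malloc(sizeof *e);
    e->name = name;
    e->level = level;
    e->next = s->bucket[h];       /* newest binding first */
    s->bucket[h] = e;
}

/* search from the innermost scope outwards (static scoping) */
Entry *lookup(Scope *s, const char *name) {
    for (; s != NULL; s = s->enclosing)
        for (Entry *e = s->bucket[hash(name)]; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;
    return NULL;
}
```

A lookup that fails in the current scope automatically falls through to the enclosing one, which is exactly the static scope rule stated above.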
Languages differ subtly with respect to identifier
bindings (scope rules). Consider the C fragment:
int i = 1;
void f(void) {
int i = 2, j = i+1;
...
}
Is j initialised to 3 (sequential declaration)
or to 2 (collateral declaration)?
(In C, declaration is sequential, so j is 3.)
What about mutually recursive declarations (of
functions) in the same scope?
void f(void) { ... g() ... }
void g(void) { ... f() ... }
How does the C compiler handle this case?
(It requires a preceding declaration of one
of the two functions.)
2. EXAMPLE (Louden, Section 6.3.5)
Consider the following grammar for nested let-
expressions (cf., Haskell, Lisp):
S -> exp
exp -> ID | NUM | ( exp ) | exp + exp
| LET dec-list IN exp
dec-list -> dec-list , dec | dec
dec -> ID = exp
An example expression is:
let x = 2, y = 3 in
(let x = x+1, y = (let z=3 in x+y+z)
in (x+y)
)
Scope rules:
* Each name may be declared only once in each
let-expression (block).
* Each name used must be declared in some
surrounding let-expression.
* Static scope rules are used.
* Sequential declaration (in a single block)
is used.
Thus, the value of the expression
let x=2, y=x+1 in (let x=x+y, y=x+y in y)
is 8.
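The value 8 follows from a step-by-step application of the sequential rule:

```
outer block:  x = 2
              y = x+1 = 2+1 = 3   (x is already visible: sequential)
inner block:  x = x+y = 2+3 = 5   (RHS uses the outer x and y)
              y = x+y = 5+3 = 8   (RHS x is now the new inner x)
body:         y = 8
```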
Exercise: What is the value of the larger
example expression above?
Here is an attribute grammar to determine
whether or not an expression is legal (i.e.,
whether its err attribute is false):
S --> exp
exp.tab = tab()
exp.level = 0
S.err = exp.err
exp --> ID
exp.err = ! isin(ID.name, exp.tab)
exp --> NUM
exp.err = false
exp --> ( exp )
exp_2.tab = exp_1.tab
exp_2.level = exp_1.level
exp_1.err = exp_2.err
exp --> exp + exp
exp_2.tab = exp_1.tab
exp_3.tab = exp_1.tab
exp_2.level = exp_1.level
exp_3.level = exp_1.level
exp_1.err = exp_2.err or exp_3.err
exp --> LET dec-list IN exp
dec-list.intab = exp_1.tab
dec-list.level = exp_1.level + 1
exp_2.tab = dec-list.outtab
exp_2.level = dec-list.level
exp_1.err = (dec-list.outtab = errtab)
or exp_2.err
dec-list --> dec
dec.intab = dec-list.intab
dec.level = dec-list.level
dec-list.outtab = dec.outtab
dec-list --> dec-list , dec
dec-list_2.intab = dec-list_1.intab
dec-list_2.level = dec-list_1.level
dec.intab = dec-list_2.outtab
dec.level = dec-list_2.level
dec-list_1.outtab = dec.outtab
dec --> ID = exp
exp.tab = dec.intab
exp.level = dec.level
dec.outtab =
if (dec.intab = errtab or exp.err)
errtab
else if (lookup(ID.name, dec.intab) =
dec.level)
errtab
else
insert(dec.intab, ID.name, dec.level)
This attribute grammar enforces the above
scope rules. Study it to see how it all works.
E.g., draw dependency graphs for the two let-
expressions given above. Identify which
attributes are synthesized and which are
inherited.
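Note that the attribute grammar treats tables as values: dec.outtab extends dec.intab without modifying it, and sibling subtrees may hold different tables. A persistent (immutable) linked list models the operations insert, isin, and lookup naturally; this is a sketch under that assumption, not Louden's implementation:

```c
#include <stdlib.h>
#include <string.h>

typedef struct Tab {
    const char *name;
    int level;
    struct Tab *rest;           /* tail is shared, never modified */
} Tab;

/* returns an extended table; the original tab is unchanged */
Tab *insert(Tab *tab, const char *name, int level) {
    Tab *t = malloc(sizeof *t);
    t->name = name;
    t->level = level;
    t->rest = tab;
    return t;
}

int isin(const char *name, Tab *tab) {
    for (; tab != NULL; tab = tab->rest)
        if (strcmp(tab->name, name) == 0)
            return 1;
    return 0;
}

/* level of the innermost binding of name, or -1 if absent */
int lookup(const char *name, Tab *tab) {
    for (; tab != NULL; tab = tab->rest)
        if (strcmp(tab->name, name) == 0)
            return tab->level;
    return -1;
}
```

Because insertion never mutates the input table, the inherited tab attribute of one subtree is unaffected by declarations processed in another, which is exactly what the grammar's equations require.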
Exercise: Extend this attribute grammar to
compute the value of the top-level expression.
Exercise: Modify this attribute grammar to use
collateral declaration instead of sequential
declaration.
3. TYPE ANALYSIS (Louden, Section 6.4)
The semantic analyser must compute and maintain
information that allows it to infer the type of
every entity, particularly every expression in
the program (type inference) and to ensure that
all the type rules of the language are satisfied
(type checking).
Data type information may be static (types are bound
to variables at compile-time) as in Haskell, C and
Java, or dynamic (types are bound to variables at
run-time) as in Python and Lisp.
Languages differ in subtle ways with respect to
their type checking rules.
E.g., types may be defined in different ways in
different languages (see Louden, 6.4.1-2). Most
languages have basic or primitive types (e.g.,
bool, int, float, char) from which compound
types may be constructed. Examples of compound
types are:
* enumerated types
* sequence types, e.g., arrays
* product types, e.g., structs,
* union types
* pointer types
* function types
* recursive types
Exercise: Which of these types may be constructed
in C, Java and Haskell, respectively? How?
Recursive types may be constructed directly in
Haskell and Java, e.g.,
class ListNode {
int value;
ListNode next;
}
but may only be constructed indirectly using pointers
in C, e.g.,
typedef struct node {
int value;
struct node *next;
} list;
3.1 Type equivalence
If two types are equivalent, values of either type
may be used wherever a value of the other type is
allowed.
Type equivalence (a true equivalence relation) is
typically defined using one of three different rules:
* Structural equivalence: two types are equivalent
if they have the same structure.
* Name equivalence: two types are equivalent if
they have the same name (obviously only available
in languages where types can be named).
* Declaration equivalence: two types are equivalent
if they have the same "base type names" (after
expanding type aliases, etc.).
Different languages may use different rules, and
even the same language may use different rules for
different compound types, e.g., C uses declaration
equivalence for structs and unions, but structural
equivalence for pointers and arrays.
Structural equivalence is typically tested by a
recursive function of two arguments, each of which
is a structure representing a type, which traverses
the two structures in parallel (see Louden, Fig. 6.20).
When the structures describe recursive types, types
must be stored in the symbol table (or elsewhere)
to avoid indefinite recursion.
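Following the idea of Louden's Fig. 6.20, such a test can be sketched as below. The type representation (TypeKind, size, child) is an assumption chosen to match the array types used later in this lecture; as noted above, this naive version would loop forever on cyclic (recursive) type structures, so a real checker must remember which pairs it has already visited:

```c
#include <stddef.h>

typedef enum { T_INT, T_BOOL, T_ARRAY } TypeKind;

typedef struct Type {
    TypeKind kind;
    int size;                  /* number of elements, for arrays */
    struct Type *child;        /* element type, for arrays */
} Type;

/* structural equivalence: same kind, and components equivalent */
int typeEqual(const Type *a, const Type *b) {
    if (a->kind != b->kind)
        return 0;
    switch (a->kind) {
    case T_ARRAY:
        return a->size == b->size && typeEqual(a->child, b->child);
    default:                   /* basic types: the kind alone decides */
        return 1;
    }
}
```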
3.2 Type inference and type checking
This is normally done by attribute computation
and tests.
Consider the simple language defined in Fig. 6.22.
program --> var-decs ; stmts
var-decs --> var-dec | var-decs ; var-dec
var-dec --> ID : type-exp
type-exp --> INT | BOOL | ARRAY [NUM] OF type-exp
stmts --> stmt | stmts ; stmt
stmt --> ID := exp | IF exp THEN stmt
exp --> NUM | TRUE | FALSE | ID
| exp[exp] | exp OR exp | exp + exp
Type checking of this language is specified by the
corresponding attribute grammar in Table 6.10.
Note that this language does not have nested blocks,
so a single, simple symbol table suffices.
var-dec --> ID : type-exp
insert(ID.name, type-exp.type)
type-exp --> INT
type-exp.type = integer
type-exp --> BOOL
type-exp.type = boolean
type-exp --> ARRAY [NUM] OF type-exp
type-exp_1.type =
makeTypeNode(array,
NUM.value,
type-exp_2.type)
stmt --> ID := exp
if (! typeEqual(lookup(ID.name),
exp.type))
type-error(stmt)
stmt --> IF exp THEN stmt
if (! typeEqual(exp.type,boolean))
type-error(stmt_1)
exp --> NUM
exp.type = integer
exp --> TRUE
exp.type = boolean
exp --> FALSE
exp.type = boolean
exp --> ID
exp.type = lookup(ID.name)
exp --> exp[exp]
if (isArrayType(exp_2.type) &&
typeEqual(exp_3.type,integer))
exp_1.type = exp_2.type.child
else
type-error(exp_1)
exp --> exp OR exp
if (typeEqual(exp_2.type,boolean) &&
typeEqual(exp_3.type,boolean))
exp_1.type = boolean
else
type-error(exp_1)
exp --> exp + exp
if (typeEqual(exp_2.type,integer) &&
typeEqual(exp_3.type,integer))
exp_1.type = integer
else
type-error(exp_1)
Note that this language has no notion of type
conversion or coercion; only type equivalence is
checked.
3.3 Other topics in type analysis
Overloading is the assignment of different
operations to the same operator (e.g., + may denote
integer addition, floating point addition, or, in
Java, string concatenation). Various implementation
techniques are possible.
Type conversion and coercion (implicit conversion)
may be required in compound expressions, in assignment
statements, and in parameter passing. A key question
is which types may be (a) explicitly converted and
(b) implicitly coerced to which other types. In
object-oriented languages, subtypes may normally be
coerced to supertypes. Languages differ greatly in
this area.
For example, suppose the above language also had a
type float. Then some of the above attribute grammar
rules would have to be changed, e.g., as follows:
stmt --> ID := exp
if (! typeCompatible(exp.type,
lookup(ID.name)))
type-error(stmt)
(Here, every type is compatible with itself,
and type int is also compatible with type
float.)
exp --> exp + exp
if (typeEqual(exp_2.type,integer) &&
typeEqual(exp_3.type,integer))
exp_1.type = integer
else if (typeCompatible(exp_2.type,float) &&
typeCompatible(exp_3.type,float))
exp_1.type = float
else
type-error(exp_1)
(Note this rule should also use an op_type attribute
of the operator symbol: in the first case, it would
have value integer-addition, in the second case
value float-addition.)
4. A SEMANTIC ANALYSER FOR TINY (Louden, Section 6.5)
Study the description and the code.
Files: symtab.h, symtab.c, analyze.h, analyze.c