PLI Lecture 6
Semantic analysis II: Data structures and algorithms

Symbol tables, scope rules, name analysis, type inference
and checking.

1. THE SYMBOL TABLE (Louden, Section 6.3)

The symbol table is a typical dictionary data structure 
that stores information about the identifiers declared 
and used in the program.  

The principal operations are:
* insert()
* lookup()
* delete()

These are sometimes called from the lexical analyser 
and the parser, or perhaps only from the semantic 
analyser (as in the TINY compiler).

For efficiency, a typical representation of the symbol 
table is a hash table, normally of prime size.  Because 
the number of entries cannot be bounded in advance, 
separate chaining is normally used (see Louden, Fig. 6.12).

The hash algorithm used must use all characters of the
identifier and must use them all differently.  (Why?)
A possible implementation (Louden, Fig. 6.13) is as follows.

#define SIZE ...
#define SHIFT 4

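int hash(char* key) {
    int t = 0, i = 0;
    /* use key[i] before advancing i, so that the first
       character is hashed and the final '\0' is not */
    while (key[i] != '\0') {
        t = ((t << SHIFT) + key[i]) % SIZE;
        i++;
    }
    return t;
}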
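
As a rough illustration of how insert(), lookup() and
delete() might sit on top of such a hash function, here
is a minimal sketch of a table with separate chaining.
The names Entry, st_insert, st_lookup and st_delete, the
info field and the value 211 for SIZE are assumptions
made for this example; they are not taken from Louden's
code.

```c
#include <stdlib.h>
#include <string.h>

#define SIZE 211   /* assumed prime table size; the notes leave SIZE open */
#define SHIFT 4

typedef struct Entry {
    char *name;            /* the identifier */
    int info;              /* placeholder for the stored attributes */
    struct Entry *next;    /* next entry in the same bucket */
} Entry;

static Entry *table[SIZE]; /* every bucket starts out empty (NULL) */

static int hash(const char *key) {
    int t = 0;
    for (int i = 0; key[i] != '\0'; i++)
        t = ((t << SHIFT) + (unsigned char)key[i]) % SIZE;
    return t;
}

/* Return the most recent entry for name, or NULL if there is none. */
Entry *st_lookup(const char *name) {
    for (Entry *e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Insert at the head of the bucket; 0 on success, -1 if out of memory. */
int st_insert(const char *name, int info) {
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->name, name);
    e->info = info;
    int h = hash(name);
    e->next = table[h];
    table[h] = e;
    return 0;
}

/* Remove the most recent entry for name; 1 if removed, 0 if absent. */
int st_delete(const char *name) {
    Entry **p = &table[hash(name)];
    while (*p != NULL && strcmp((*p)->name, name) != 0)
        p = &(*p)->next;
    if (*p == NULL)
        return 0;
    Entry *dead = *p;
    *p = dead->next;
    free(dead->name);
    free(dead);
    return 1;
}
```

Because entries are added at the head of a bucket, the
table never fills up; a long chain only makes lookup
slower.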

The information stored with each identifier depends on 
the entity associated with (or bound to) the identifier.  
Typical entities are constants, types or classes, fields,
functions or methods, parameters, and variables.

This information may be stored in either the syntax
tree as attributes or in the symbol table directly.
Examples of information to be stored include:
* type and value of a constant
* type of a variable
* number of dimensions (and bounds for each dimension)
  of an array
* field names and types of a struct
* members (fields and methods) and constructors of a class, 
  and their properties
* parameter names and types, and result type, of
  a function

Important concepts in associating information with
identifiers are:
* scope: the region of the program where the binding
  (of an identifier to an entity) applies,
* range: the construct (block) containing the binding,
* extent: the lifetime of the binding, and
* block: a construct that may contain identifier
  bindings.

Languages differ in subtle ways with respect to
their scope rules.

When blocks can be nested, the binding associated
with an identifier is normally the binding in the
innermost block containing a binding for that 
identifier (static scope rule). 

See Louden, Figure 6.14:

int i, j;

int f(int size) {
    char i, tmp;
    ...
    { double i, j; ... }
    ...
    { char* j; ... }
    ...
}

See Louden, Figure 6.15, for a more complex example
of nested procedures in Pascal.  Java has nested classes
and compound statements; C has nested compound statements;
Haskell and Lisp have nested procedures.

To represent a symbol table for a language with nested
blocks, a more complex structure than a simple hash 
table with separate chaining is required.

One method is to treat each bucket list of the hash
table as a stack, inserting and deleting
identifier-entity pairs at the start of each list.

See Louden, Figure 6.16.
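
A minimal sketch of this first method follows.  It
assumes that each binding records the nesting level of
its block, so that leaving a block can pop exactly the
bindings made in it.  The names Binding, declare,
lookup_level and leave_block, and the tiny bucket count,
are hypothetical and chosen only for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 13   /* assumed small prime, for illustration only */

typedef struct Binding {
    const char *name;      /* identifier (not copied; caller keeps it alive) */
    int level;             /* nesting level of the declaring block */
    struct Binding *next;  /* older binding in the same bucket */
} Binding;

static Binding *bucket[NBUCKETS];

static unsigned bucket_of(const char *s) {
    unsigned h = 0;
    while (*s != '\0')
        h = (h * 16 + (unsigned char)*s++) % NBUCKETS;
    return h;
}

/* Push a new binding on the front of its bucket list. */
void declare(const char *name, int level) {
    Binding *b = malloc(sizeof *b);
    if (b == NULL)
        abort();
    unsigned h = bucket_of(name);
    b->name = name;
    b->level = level;
    b->next = bucket[h];
    bucket[h] = b;
}

/* Level of the innermost visible binding of name, or -1 if none. */
int lookup_level(const char *name) {
    for (Binding *b = bucket[bucket_of(name)]; b != NULL; b = b->next)
        if (strcmp(b->name, name) == 0)
            return b->level;
    return -1;
}

/* On leaving a block, pop every binding declared at that level.
   Inner blocks are left first, so these bindings are always at
   the front of their lists. */
void leave_block(int level) {
    for (int i = 0; i < NBUCKETS; i++)
        while (bucket[i] != NULL && bucket[i]->level == level) {
            Binding *dead = bucket[i];
            bucket[i] = dead->next;
            free(dead);
        }
}
```

The first binding found for a name is always the
innermost one, which is exactly the static scope rule.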

Another method is to build a separate symbol table for 
each scope.  Then maintain a stack of symbol tables,
pushing when you enter a new scope, popping when you
leave the current scope.

See Louden, Figure 6.17.
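
The second method might be sketched as below.  To keep
the example short, each per-scope table is just a linked
list (a real compiler would more likely use a hash table
per scope), and the type of each name is stored as a
string.  Sym, Scope, push_scope, pop_scope, scope_insert
and scope_lookup are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Sym {
    const char *name;
    const char *type;      /* type kept as a string, for illustration */
    struct Sym *next;
} Sym;

typedef struct Scope {
    Sym *syms;             /* bindings declared in this scope */
    struct Scope *outer;   /* enclosing scope: the next table down the stack */
} Scope;

static Scope *top = NULL;  /* innermost (current) scope */

/* Enter a new scope: push an empty table. */
void push_scope(void) {
    Scope *s = malloc(sizeof *s);
    if (s == NULL)
        abort();
    s->syms = NULL;
    s->outer = top;
    top = s;
}

/* Leave the current scope: pop its table and free its bindings. */
void pop_scope(void) {
    Scope *s = top;
    if (s == NULL)
        return;
    top = s->outer;
    while (s->syms != NULL) {
        Sym *dead = s->syms;
        s->syms = dead->next;
        free(dead);
    }
    free(s);
}

/* Declare name in the current scope (a scope must have been pushed). */
void scope_insert(const char *name, const char *type) {
    if (top == NULL)
        abort();
    Sym *b = malloc(sizeof *b);
    if (b == NULL)
        abort();
    b->name = name;
    b->type = type;
    b->next = top->syms;
    top->syms = b;
}

/* Search from the innermost scope outwards; NULL if undeclared. */
const char *scope_lookup(const char *name) {
    for (Scope *s = top; s != NULL; s = s->outer)
        for (Sym *b = s->syms; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b->type;
    return NULL;
}
```

Here leaving a scope costs no search at all, at the
price of a lookup that may have to visit several tables.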

Languages differ subtly with respect to identifier
bindings (scope rules).  Consider the C fragment:

int i = 1;

void f(void) {
    int i = 2, j = i+1;
    ...
}

Is j initialised to 3 (sequential declaration)
or to 2 (collateral declaration)?
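
For C itself the question can be settled by experiment.
In C an identifier becomes visible as soon as its own
declarator is complete, so the i in i+1 is the inner i.
The sketch below wraps the fragment in a hypothetical
helper, initial_j, that returns j so it can be inspected.

```c
int i = 1;                    /* outer (file-scope) i */

/* Hypothetical helper: returns the value that j receives. */
int initial_j(void) {
    int i = 2, j = i + 1;     /* this i is already the inner i */
    return j;
}
```

A call to initial_j() returns 3, i.e., C uses sequential
declaration here.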

What about mutually recursive declarations (of
functions) in the same scope?

void f(void) { ... g() ... }
void g(void) { ... f() ... }

How does the C compiler handle this case? 
(It requires a preceding declaration of one 
of the two functions.)
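
A small sketch of the preceding declaration (prototype)
in use.  The functions is_even and is_odd are
hypothetical examples, not taken from Louden.

```c
/* Prototype: makes is_odd known before its definition appears. */
int is_odd(unsigned n);

int is_even(unsigned n) {
    return n == 0 ? 1 : is_odd(n - 1);
}

int is_odd(unsigned n) {
    return n == 0 ? 0 : is_even(n - 1);
}
```

Only is_odd needs the prototype, because is_even is
already defined by the time is_odd refers to it.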

2. EXAMPLE (Louden, Section 6.3.5)

Consider the following grammar for nested let-
expressions (cf., Haskell, Lisp):

S -> exp
exp -> ID | NUM | ( exp ) | exp + exp 
     | LET dec-list IN exp
dec-list -> dec-list , dec | dec
dec -> ID = exp

An example expression is:

let x = 2, y = 3 in
  (let x = x+1, y = (let z=3 in x+y+z)
   in (x+y)
  )

Scope rules:
* Each name may be declared only once in each
  let-expression (block).
* Each name used must be declared in some
  surrounding let-expression.
* Static scope rules are used.
* Sequential declaration (in a single block)
  is used.

Thus, the value of the expression

let x=2, y=x+1 in (let x=x+y, y=x+y in y)

is 8.  

Exercise: What is the value of the larger
example expression above?

Here is an attribute grammar to determine 
whether or not an expression is legal (i.e.,
whether its err attribute is false):

S --> exp	
	exp.tab = tab()
	exp.level = 0
	S.err = exp.err

exp --> ID	
	exp.err = ! isin(ID.name, exp.tab)

exp --> NUM	
	exp.err = false

exp --> ( exp )	
	exp_2.tab = exp_1.tab
	exp_2.level = exp_1.level
	exp_1.err = exp_2.err

exp --> exp + exp
	exp_2.tab = exp_1.tab
	exp_3.tab = exp_1.tab
	exp_2.level = exp_1.level
	exp_3.level = exp_1.level
	exp_1.err = exp_2.err or exp_3.err

exp --> LET dec-list IN exp
	dec-list.intab = exp_1.tab
	dec-list.level = exp_1.level + 1
	exp_2.tab = dec-list.outtab
	exp_2.level = dec-list.level
	exp_1.err = (dec-list.outtab = errtab) 
		    or exp_2.err

dec-list --> dec
	dec.intab = dec-list.intab
	dec.level = dec-list.level
	dec-list.outtab = dec.outtab

dec-list --> dec-list , dec
	dec-list_2.intab = dec-list_1.intab
	dec-list_2.level = dec-list_1.level
	dec.intab = dec-list_2.outtab
	dec.level = dec-list_2.level
	dec-list_1.outtab = dec.outtab

dec --> ID = exp
	exp.tab = dec.intab
	exp.level = dec.level
	dec.outtab =
	    if (dec.intab = errtab or exp.err)
	        errtab
	    else if (lookup(ID.name, dec.intab) =
	             dec.level)
	        errtab
	    else 
	        insert(dec.intab,ID.name,dec.level)

This attribute grammar enforces the above
scope rules.  Study it to see how it all works.
E.g., draw dependency graphs for the two let-
expressions given above.  Identify which 
attributes are synthesized and which are
inherited.
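
One possible way to turn these rules into code is
sketched below.  It assumes a hand-built syntax tree in
which parentheses have already been dropped and a
declaration list is a linked list in source order.  The
symbol table is an immutable linked list, so insert
returns a new table and never changes an old one (nodes
are never freed in this sketch).  All type and function
names (Exp, Dec, Tab, check_exp, check_decs) are
assumptions of the example.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { ID, NUM, PLUS, LET } Kind;

typedef struct Dec Dec;
typedef struct Exp {
    Kind kind;
    const char *name;          /* ID: the identifier */
    struct Exp *left, *right;  /* PLUS: the two operands */
    Dec *decs;                 /* LET: declarations, in source order */
    struct Exp *body;          /* LET: the expression after IN */
} Exp;
struct Dec { const char *name; Exp *init; Dec *next; };

typedef struct Tab {           /* NULL plays the role of tab() */
    const char *name;
    int level;
    const struct Tab *rest;
} Tab;

static const Tab ERRTAB_NODE = { NULL, -1, NULL };
#define ERRTAB (&ERRTAB_NODE)  /* the distinguished error table */

/* Level of the innermost binding of name, or -1 if there is none. */
static int lookup(const char *name, const Tab *t) {
    for (; t != NULL; t = t->rest)
        if (strcmp(t->name, name) == 0)
            return t->level;
    return -1;
}

static const Tab *insert(const Tab *t, const char *name, int level) {
    Tab *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->name = name;
    n->level = level;
    n->rest = t;
    return n;
}

static const Tab *check_decs(const Dec *d, const Tab *tab, int level);

/* Returns exp.err; tab and level are passed down as parameters. */
int check_exp(const Exp *e, const Tab *tab, int level) {
    switch (e->kind) {
    case ID:
        return lookup(e->name, tab) < 0;
    case NUM:
        return 0;
    case PLUS:
        return check_exp(e->left, tab, level)
            || check_exp(e->right, tab, level);
    case LET: {
        const Tab *out = check_decs(e->decs, tab, level + 1);
        return out == ERRTAB || check_exp(e->body, out, level + 1);
    }
    }
    return 1;
}

/* Returns dec-list.outtab, threading the table left to right. */
static const Tab *check_decs(const Dec *d, const Tab *tab, int level) {
    for (; d != NULL; d = d->next) {
        if (check_exp(d->init, tab, level))
            return ERRTAB;               /* error in the initialiser */
        if (lookup(d->name, tab) == level)
            return ERRTAB;               /* redeclared in this block */
        tab = insert(tab, d->name, level);
    }
    return tab;
}
```

A whole expression e is checked with check_exp(e, NULL, 0),
matching the rule for S; the result 0 means that the
expression is legal.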

Exercise: Extend this attribute grammar to
compute the value of the top-level expression.

Exercise: Modify this attribute grammar to use
collateral declaration instead of sequential
declaration.

3. TYPE ANALYSIS (Louden, Section 6.4)

The semantic analyser must compute and maintain
information that allows it to infer the type of
every entity, particularly every expression in
the program (type inference) and to ensure that
all the type rules of the language are satisfied
(type checking).

Data type information may be static (types are bound
to variables at compile-time) as in Haskell, C and
Java, or dynamic (types are bound to variables at 
run-time) as in Python and Lisp.

Languages differ in subtle ways with respect to
their type checking rules.

E.g., types may be defined in different ways in 
different languages (see Louden, 6.4.1-2).  Most
languages have basic or primitive types (e.g.,
bool, int, float, char) from which compound
types may be constructed.  Examples of compound
types are:
* enumerated types
* sequence types, e.g., arrays
* product types, e.g., structs
* union types
* pointer types
* function types
* recursive types

Exercise: Which of these types may be constructed
in C, Java and Haskell, respectively?  How?

Recursive types may be constructed directly in
Haskell and Java, e.g., 

class ListNode {
    int value;
    ListNode next;
}

but may only be constructed indirectly using pointers
in C, e.g.,

typedef struct node {
    int value;
    struct node *next;
} list;

3.1 Type equivalence

If two types are equivalent, values of either type
may be used wherever a value of the other type is 
allowed.

Type equivalence (a true equivalence relation) is 
typically defined using one of three different rules:
* Structural equivalence: two types are equivalent
  if they have the same structure.
* Name equivalence: two types are equivalent if
  they have the same name (obviously only available
  in languages where types can be named).
* Declaration equivalence: two types are equivalent
  if they have the same "base type names" (after
  expanding type aliases, etc.).

Different languages may use different rules, and
even the same language may use different rules for
different compound types, e.g., C uses declaration
equivalence for structures and unions, but structural
equivalence for pointers and arrays.

Structural equivalence is typically tested by a 
recursive function of two arguments, each of which
is a structure representing a type, which traverses 
the two structures in parallel (see Louden, Fig. 6.20).
When the structures describe recursive types, types
must be stored in the symbol table (or elsewhere) 
to avoid indefinite recursion.
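
A minimal sketch of such a parallel traversal, for a
cut-down set of types (int, bool, arrays and pointers).
The Type representation and its field names are
assumptions of this example and are simpler than those
of Louden's Fig. 6.20.  The sketch handles only acyclic
type structures: two distinct cyclic structures would
make it recurse forever, which is the problem noted
above.

```c
#include <stddef.h>

typedef enum { T_INT, T_BOOL, T_ARRAY, T_POINTER } TypeKind;

typedef struct Type {
    TypeKind kind;
    int size;                  /* T_ARRAY: number of elements */
    const struct Type *child;  /* T_ARRAY: element type; T_POINTER: target */
} Type;

/* Structural equivalence: walk both type structures in parallel. */
int typeEqual(const Type *a, const Type *b) {
    if (a == b)
        return 1;              /* the very same node */
    if (a == NULL || b == NULL)
        return 0;
    if (a->kind != b->kind)
        return 0;
    switch (a->kind) {
    case T_INT:
    case T_BOOL:
        return 1;
    case T_ARRAY:
        return a->size == b->size && typeEqual(a->child, b->child);
    case T_POINTER:
        return typeEqual(a->child, b->child);
    }
    return 0;
}
```

Whether the array size takes part in the comparison is
a language design decision; here it does.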

3.2 Type inference and type checking

This is normally done by attribute computation
and tests.

Consider the simple language defined in Fig. 6.22.

program --> var-decs ; stmts
var-decs --> var-dec | var-decs ; var-dec 
var-dec --> ID : type-exp
type-exp --> INT | BOOL | ARRAY [NUM] OF type-exp
stmts --> stmt | stmts ; stmt
stmt --> ID := exp | IF exp THEN stmt
exp --> NUM | TRUE | FALSE | ID
      | exp[exp] | exp OR exp | exp + exp

Type checking of this language is specified by the 
corresponding attribute grammar in Table 6.10.

Note that this language does not have nested blocks,
so a single, simple symbol table suffices.

var-dec --> ID : type-exp
	insert(ID.name, type-exp.type)

type-exp --> INT
	type-exp.type = integer
type-exp --> BOOL
	type-exp.type = boolean
type-exp --> ARRAY [NUM] OF type-exp
	type-exp_1.type = 
		makeTypeNode(array,
		             NUM.value,
		             type-exp_2.type)

stmt --> ID := exp 
	if (! typeEqual(lookup(ID.name), 
	                exp.type))
	    type-error(stmt)
stmt --> IF exp THEN stmt
	if (! typeEqual(exp.type,boolean))
	    type-error(stmt_1)

exp --> NUM 
	exp.type = integer
exp --> TRUE 
	exp.type = boolean
exp --> FALSE 
	exp.type = boolean
exp --> ID 
	exp.type = lookup(ID.name)
exp --> exp[exp] 
	if (isArrayType(exp_2.type) &&
	    typeEqual(exp_3.type,integer))
	    exp_1.type = exp_2.type.child
	else
	    type-error(exp_1)
exp --> exp OR exp 
	if (typeEqual(exp_2.type,boolean) &&
	    typeEqual(exp_3.type,boolean))
	    exp_1.type = boolean
	else
	    type-error(exp_1)
exp --> exp + exp
	if (typeEqual(exp_2.type,integer) &&
	    typeEqual(exp_3.type,integer))
	    exp_1.type = integer
	else
	    type-error(exp_1)

Note that these rules make no provision for type
conversion or type coercion; they test only type equality.
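
The expression rules might be coded roughly as follows.
This is a sketch under several assumptions: the syntax
tree is built by hand, the symbol table is a flat array
passed in as a parameter, and a type error is reported
by returning a special error type rather than by calling
type-error.  The names Ty, Ex, VarDec, typeOf and binary
are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef enum { TY_ERROR, TY_INT, TY_BOOL, TY_ARRAY } TyKind;
typedef struct Ty {
    TyKind kind;
    int size;                  /* TY_ARRAY: number of elements */
    const struct Ty *child;    /* TY_ARRAY: element type */
} Ty;

static const Ty INTEGER  = { TY_INT,   0, NULL };
static const Ty BOOLEAN  = { TY_BOOL,  0, NULL };
static const Ty TYPE_ERR = { TY_ERROR, 0, NULL };  /* stands in for type-error */

typedef enum { E_NUM, E_TRUE, E_FALSE, E_ID, E_INDEX, E_OR, E_PLUS } ExpKind;
typedef struct Ex {
    ExpKind kind;
    const char *name;          /* E_ID only */
    const struct Ex *l, *r;    /* operands of E_INDEX, E_OR and E_PLUS */
} Ex;

/* Hypothetical flat symbol table: an array of (name, type) pairs. */
typedef struct { const char *name; const Ty *type; } VarDec;

static int typeEqual(const Ty *a, const Ty *b) {
    if (a->kind == TY_ERROR || a->kind != b->kind)
        return 0;
    if (a->kind == TY_ARRAY)
        return a->size == b->size && typeEqual(a->child, b->child);
    return 1;
}

static const Ty *lookup(const char *name, const VarDec *tab, int n) {
    for (int i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].type;
    return &TYPE_ERR;          /* undeclared name: treated as a type error */
}

/* Both operands must have type want; the result then has type want. */
static const Ty *binary(const Ty *l, const Ty *r, const Ty *want) {
    return (typeEqual(l, want) && typeEqual(r, want)) ? want : &TYPE_ERR;
}

/* Computes exp.type bottom-up, i.e., as a synthesized attribute. */
const Ty *typeOf(const Ex *e, const VarDec *tab, int n) {
    switch (e->kind) {
    case E_NUM:
        return &INTEGER;
    case E_TRUE:
    case E_FALSE:
        return &BOOLEAN;
    case E_ID:
        return lookup(e->name, tab, n);
    case E_INDEX: {
        const Ty *arr = typeOf(e->l, tab, n);
        const Ty *idx = typeOf(e->r, tab, n);
        if (arr->kind == TY_ARRAY && typeEqual(idx, &INTEGER))
            return arr->child;
        return &TYPE_ERR;
    }
    case E_OR:
        return binary(typeOf(e->l, tab, n), typeOf(e->r, tab, n), &BOOLEAN);
    case E_PLUS:
        return binary(typeOf(e->l, tab, n), typeOf(e->r, tab, n), &INTEGER);
    }
    return &TYPE_ERR;
}
```

Since the error type is equal to nothing, an error in a
subexpression propagates to every enclosing expression.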

3.3 Other topics in type analysis

Overloading is the assignment of different 
operations to the same operator (e.g., + may denote
integer addition, floating point addition, or, in
Java, string concatenation).   Various implementation
techniques are possible.  

Type conversion and coercion (implicit conversion) 
may be required in compound expressions, in assignment 
statements, and in parameter passing.  A key question 
is which types may be (a) converted, (b) coerced to which
other types.  In object-oriented languages, subtypes may 
normally be coerced to supertypes.  Languages differ 
greatly in this area.

For example, suppose the above language also had a
type float.  Then some of the above attribute grammar
rules would have to be changed, e.g., as follows:

stmt --> ID := exp 
	if (! typeCompatible(exp.type,
	                     lookup(ID.name)))
	    type-error(stmt)

(Here, every type is compatible with itself,
and type integer is also compatible with type
float.)

exp --> exp + exp
	if (typeEqual(exp_2.type,integer) &&
	    typeEqual(exp_3.type,integer))
	    exp_1.type = integer
	else if (typeCompatible(exp_2.type,float) &&
	         typeCompatible(exp_3.type,float))
	    exp_1.type = float
	else
	    type-error(exp_1)

(Note this rule should also use an op_type attribute 
of the operator symbol: in the first case, it would 
have value integer-addition, in the second case 
value float-addition.)
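
The compatibility test and the revised rule for + might
be sketched as follows, restricted to the basic types.
BasicType and plusType are hypothetical names, and a
type error is signalled here simply by returning -1.

```c
typedef enum { INTEGER, FLOAT, BOOLEAN } BasicType;

/* May a value of type from be used where type to is expected? */
int typeCompatible(BasicType from, BasicType to) {
    return from == to || (from == INTEGER && to == FLOAT);
}

/* Result type of exp + exp, or -1 for a type error. */
int plusType(BasicType l, BasicType r) {
    if (l == INTEGER && r == INTEGER)
        return INTEGER;        /* integer addition */
    if (typeCompatible(l, FLOAT) && typeCompatible(r, FLOAT))
        return FLOAT;          /* float addition, after any coercion */
    return -1;
}
```

Note that typeCompatible is not symmetric, so unlike
type equivalence it is not an equivalence relation.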

4. A SEMANTIC ANALYSER FOR TINY (Louden, Section 6.5)

Study the description and the code.

Files: symtab.h, symtab.c, analyze.h, analyze.c