PLI Lecture 6
Semantic analysis II: Data structures and algorithms
Symbol tables, scope rules, name analysis, type inference and checking.

1. THE SYMBOL TABLE (Louden, Section 6.3)

The symbol table is a typical dictionary data structure that stores information about the identifiers declared and used in the program. The principal operations are:

* insert()
* lookup()
* delete()

These are sometimes called from the lexical analyser and the parser, or perhaps only from the semantic analyser (as in the TINY compiler).

For efficiency, a symbol table is typically represented as a hash table, normally of prime size. Because the number of entries may grow indefinitely, the table must use separate chaining (see Louden, Fig. 6.12). The hash function must use all characters of the identifier and must weight them differently. (Why?)

A possible implementation (Louden, Fig. 6.13) is as follows.

#define SIZE ...
#define SHIFT 4

int hash(char* key)
{
    int t = 0, i = 0;
    while (key[i] != '\0') {
        t = ((t << SHIFT) + key[i]) % SIZE;
        ++i;
    }
    return t;
}

The information stored with each identifier depends on the entity associated with (or bound to) the identifier. Typical entities are constants, types or classes, fields, functions or methods, parameters, and variables. This information may be stored either in the syntax tree as attributes or in the symbol table directly.
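The hash table with separate chaining described above can be completed into a small symbol table along the following lines. This is a minimal sketch, not the TINY compiler's actual interface: the names st_insert and st_lookup and the single info field are illustrative placeholders for whatever entity information a real table would store.

```c
#include <stdlib.h>
#include <string.h>

#define SIZE 211   /* prime table size */
#define SHIFT 4

/* One chain node per binding; info stands in for real entity data. */
typedef struct Bucket {
    char *name;
    int info;
    struct Bucket *next;
} Bucket;

static Bucket *table[SIZE];

static int hash(const char *key)
{
    int t = 0;
    for (int i = 0; key[i] != '\0'; i++)
        t = ((t << SHIFT) + key[i]) % SIZE;
    return t;
}

void st_insert(const char *name, int info)
{
    int h = hash(name);
    Bucket *b = malloc(sizeof *b);
    b->name = malloc(strlen(name) + 1);
    strcpy(b->name, name);
    b->info = info;
    b->next = table[h];      /* prepend to the chain */
    table[h] = b;
}

/* Returns the most recent binding of name, or NULL if absent. */
Bucket *st_lookup(const char *name)
{
    for (Bucket *b = table[hash(name)]; b != NULL; b = b->next)
        if (strcmp(b->name, name) == 0)
            return b;
    return NULL;
}
```

Prepending to the chain means a later st_insert for the same name shadows the earlier one, which is exactly the behaviour exploited for nested scopes below (Fig. 6.16).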
Examples of information to be stored include:

* type and value of a constant,
* type of a variable,
* number of dimensions (and bounds for each dimension) of an array,
* field names and types of a struct,
* members (fields and methods) and constructors of a class, and their properties,
* parameter names and types, and result type, of a function.

Important concepts in associating information with identifiers are:

* scope: the region of the program where the binding (of an identifier to an entity) applies,
* range: the construct (block) containing the binding,
* extent: the lifetime of the binding, and
* block: a construct that may contain identifier bindings.

Languages differ in subtle ways with respect to their scope rules. When blocks can be nested, the binding associated with an identifier is normally the binding in the innermost block containing a binding for that identifier (the static scope rule). See Louden, Figure 6.14:

int i, j;
int f(int size)
{ char i, tmp;
  ...
  { double i, j;
    ...
  }
  ...
  { char* j;
    ...
  }
  ...
}

See Louden, Figure 6.15, for a more complex example of nested procedures in Pascal. Java has nested classes and compound statements; C has nested compound statements; Haskell and Lisp have nested procedures.

To represent a symbol table for a language with nested blocks, a more complex structure than a simple hash table with separate chaining is required. One method is to treat each chain as a stack, inserting and deleting identifier-entity pairs at the start of each chain. See Louden, Figure 6.16. Another method is to build a separate symbol table for each scope, and to maintain a stack of symbol tables, pushing a new table on entering a scope and popping it on leaving. See Louden, Figure 6.17.

Languages differ subtly with respect to identifier bindings (scope rules). Consider the C fragment:

int i = 1;
void f(void)
{ int i = 2, j = i+1;
  ...
}

Is j initialised to 3 (sequential declaration) or to 2 (collateral declaration)? (C uses sequential declaration, so j is initialised to 3.)
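The second method above (a separate table per scope, kept on a stack, Fig. 6.17) can be sketched in C as follows. For brevity each per-scope table is a linear list rather than a hash table; the names Scope, enter_scope, exit_scope, declare and resolve are illustrative, not Louden's.

```c
#include <stdlib.h>
#include <string.h>

/* One binding within a scope; info stands in for real entity data. */
typedef struct Binding {
    char *name;
    int info;
    struct Binding *next;
} Binding;

/* A scope is a list of bindings plus a link to the enclosing scope. */
typedef struct Scope {
    Binding *bindings;
    struct Scope *enclosing;
} Scope;

static Scope *current = NULL;

void enter_scope(void)                 /* push on entering a block */
{
    Scope *s = malloc(sizeof *s);
    s->bindings = NULL;
    s->enclosing = current;
    current = s;
}

void exit_scope(void)                  /* pop on leaving a block */
{
    current = current->enclosing;      /* (bindings leaked, for brevity) */
}

void declare(const char *name, int info)
{
    Binding *b = malloc(sizeof *b);
    b->name = malloc(strlen(name) + 1);
    strcpy(b->name, name);
    b->info = info;
    b->next = current->bindings;
    current->bindings = b;
}

/* Static scope rule: search from the innermost scope outwards. */
Binding *resolve(const char *name)
{
    for (Scope *s = current; s != NULL; s = s->enclosing)
        for (Binding *b = s->bindings; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b;
    return NULL;
}
```

With the nesting of Figure 6.14, resolve("i") inside the inner block finds the inner double i; after exit_scope it again finds the enclosing char i.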
What about mutually recursive declarations (of functions) in the same scope?

void f(void) { ... g() ... }
void g(void) { ... f() ... }

How does the C compiler handle this case? (It requires a preceding declaration of one of the two functions.)

2. EXAMPLE (Louden, Section 6.3.5)

Consider the following grammar for nested let-expressions (cf. Haskell, Lisp):

S --> exp
exp --> ID | NUM | ( exp ) | exp + exp | LET dec-list IN exp
dec-list --> dec-list , dec | dec
dec --> ID = exp

An example expression is:

let x = 2, y = 3
in (let x = x+1,
        y = (let z = 3 in x+y+z)
    in (x+y)
   )

Scope rules:

* Each name may be declared only once in each let-expression (block).
* Each name used must be declared in some surrounding let-expression.
* Static scope rules are used.
* Sequential declaration (in a single block) is used.

Thus, the value of the expression

let x=2, y=x+1 in (let x=x+y, y=x+y in y)

is 8.

Exercise: What is the value of the larger example expression above?

Here is an attribute grammar to determine whether or not an expression is legal (i.e., whether its err attribute is false):

S --> exp
    exp.tab = tab()
    exp.level = 0
    S.err = exp.err

exp --> ID
    exp.err = ! isin(ID.name, exp.tab)

exp --> NUM
    exp.err = false

exp --> ( exp )
    exp_2.tab = exp_1.tab
    exp_2.level = exp_1.level
    exp_1.err = exp_2.err

exp --> exp + exp
    exp_2.tab = exp_1.tab
    exp_3.tab = exp_1.tab
    exp_2.level = exp_1.level
    exp_3.level = exp_1.level
    exp_1.err = exp_2.err or exp_3.err

exp --> LET dec-list IN exp
    dec-list.intab = exp_1.tab
    dec-list.level = exp_1.level + 1
    exp_2.tab = dec-list.outtab
    exp_2.level = dec-list.level
    exp_1.err = (dec-list.outtab = errtab) or exp_2.err

dec-list --> dec
    dec.intab = dec-list.intab
    dec.level = dec-list.level
    dec-list.outtab = dec.outtab

dec-list --> dec-list , dec
    dec-list_2.intab = dec-list_1.intab
    dec-list_2.level = dec-list_1.level
    dec.intab = dec-list_2.outtab
    dec.level = dec-list_2.level
    dec-list_1.outtab = dec.outtab

dec --> ID = exp
    exp.tab = dec.intab
    exp.level = dec.level
    dec.outtab = if (dec.intab = errtab or exp.err) errtab
                 else if (lookup(ID.name, dec.intab) = dec.level) errtab
                 else insert(dec.intab, ID.name, dec.level)

This attribute grammar enforces the above scope rules. Study it to see how it all works. E.g., draw dependency graphs for the two let-expressions given above. Identify which attributes are synthesized and which are inherited.

Exercise: Extend this attribute grammar to compute the value of the top-level expression.

Exercise: Modify this attribute grammar to use collateral declaration instead of sequential declaration.

3. TYPE ANALYSIS (Louden, Section 6.4)

The semantic analyser must compute and maintain information that allows it to infer the type of every entity, particularly every expression in the program (type inference), and to ensure that all the type rules of the language are satisfied (type checking). Type information may be static (types are bound to variables at compile time), as in Haskell, C and Java, or dynamic (types are bound to variables at run time), as in Python and Lisp. Languages differ in subtle ways with respect to their type-checking rules.
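The key semantic rule above, for dec --> ID = exp, can be sketched in C by representing a tab as a list of (name, level) pairs. This is a minimal sketch under that representation; the names Tab, ERRTAB and declare are illustrative, not from Louden.

```c
#include <stdlib.h>
#include <string.h>

/* A tab is a list of (name, level) pairs; NULL is the empty table,
   and a distinguished ERRTAB plays the role of errtab above. */
typedef struct Tab {
    const char *name;
    int level;
    struct Tab *rest;
} Tab;

static Tab errtab_node;
#define ERRTAB (&errtab_node)

/* lookup: level of the most recent binding of name, or -1 if absent. */
int lookup(const char *name, Tab *t)
{
    for (; t != NULL && t != ERRTAB; t = t->rest)
        if (strcmp(t->name, name) == 0)
            return t->level;
    return -1;
}

Tab *insert(Tab *t, const char *name, int level)
{
    Tab *n = malloc(sizeof *n);
    n->name = name;
    n->level = level;
    n->rest = t;
    return n;
}

/* The rule for dec --> ID = exp: redeclaration at the same level
   (i.e. in the same let) is an error; a binding at an outer (lower)
   level is merely shadowed. */
Tab *declare(Tab *intab, const char *name, int level, int exp_err)
{
    if (intab == ERRTAB || exp_err)
        return ERRTAB;
    if (lookup(name, intab) == level)
        return ERRTAB;
    return insert(intab, name, level);
}
```

Note how the level attribute does the real work: it distinguishes an illegal redeclaration within one let-expression from legal shadowing by a nested one.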
E.g., types may be defined in different ways in different languages (see Louden, 6.4.1-2). Most languages have basic or primitive types (e.g., bool, int, float, char) from which compound types may be constructed. Examples of compound types are:

* enumerated types,
* sequence types, e.g., arrays,
* product types, e.g., structs,
* union types,
* pointer types,
* function types,
* recursive types.

Exercise: Which of these types may be constructed in C, Java and Haskell, respectively? How?

Recursive types may be constructed directly in Haskell and Java, e.g.,

class ListNode {
    int value;
    ListNode next;
}

but may only be constructed indirectly, using pointers, in C, e.g.,

typedef struct node {
    int value;
    struct node *next;
} list;

3.1 Type equivalence

If two types are equivalent, values of either type may be used wherever a value of the other type is allowed. Type equivalence (a true equivalence relation) is typically defined using one of three different rules:

* Structural equivalence: two types are equivalent if they have the same structure.
* Name equivalence: two types are equivalent if they have the same name (obviously only available in languages where types can be named).
* Declaration equivalence: two types are equivalent if they have the same "base type names" (after expanding type aliases, etc.).

Different languages may use different rules, and even the same language may use different rules for different compound types; e.g., C uses declaration equivalence for structs and unions, but structural equivalence for pointers and arrays.

Structural equivalence is typically tested by a recursive function of two arguments, each a structure representing a type, which traverses the two structures in parallel (see Louden, Fig. 6.20). When the structures describe recursive types, the types being compared must be recorded in the symbol table (or elsewhere) to avoid indefinite recursion.

3.2 Type inference and type checking

This is normally done by attribute computation and tests.
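The parallel traversal just described can be sketched in C as follows. This is a minimal sketch in the spirit of Louden, Fig. 6.20, not the book's exact code: the TypeKind values and field names are illustrative, and the guard against recursive types is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Type expressions as trees; child1 is the element, target or
   argument type, child2 the result type of a function. */
typedef enum { T_INT, T_BOOL, T_ARRAY, T_POINTER, T_FUNC } TypeKind;

typedef struct Type {
    TypeKind kind;
    int size;                 /* number of elements, for arrays */
    struct Type *child1;
    struct Type *child2;
} Type;

/* Structural equivalence: traverse the two trees in parallel.
   (On recursive types this would recurse forever; a real checker
   records the pairs of types already being compared.) */
bool typeEqual(Type *a, Type *b)
{
    if (a == b)
        return true;
    if (a == NULL || b == NULL || a->kind != b->kind)
        return false;
    switch (a->kind) {
    case T_INT:
    case T_BOOL:
        return true;
    case T_ARRAY:
        return a->size == b->size && typeEqual(a->child1, b->child1);
    case T_POINTER:
        return typeEqual(a->child1, b->child1);
    case T_FUNC:
        return typeEqual(a->child1, b->child1)
            && typeEqual(a->child2, b->child2);
    }
    return false;
}
```

Under this rule two separately declared array-of-10-int types compare equal, whereas name equivalence would distinguish them.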
Consider the simple language defined in Fig. 6.22.

program --> var-decs ; stmts
var-decs --> var-dec | var-decs ; var-dec
var-dec --> ID : type-exp
type-exp --> INT | BOOL | ARRAY [NUM] OF type-exp
stmts --> stmt | stmts ; stmt
stmt --> ID := exp | IF exp THEN stmt
exp --> NUM | TRUE | FALSE | ID | exp[exp] | exp OR exp | exp + exp

Type checking of this language is specified by the corresponding attribute grammar in Table 6.10. Note that this language does not have nested blocks, so a single, simple symbol table suffices.

var-dec --> ID : type-exp
    insert(ID.name, type-exp.type)

type-exp --> INT
    type-exp.type = integer

type-exp --> BOOL
    type-exp.type = boolean

type-exp --> ARRAY [NUM] OF type-exp
    type-exp_1.type = makeTypeNode(array, NUM.value, type-exp_2.type)

stmt --> ID := exp
    if (! typeEqual(lookup(ID.name), exp.type)) type-error(stmt)

stmt --> IF exp THEN stmt
    if (! typeEqual(exp.type, boolean)) type-error(stmt_1)

exp --> NUM
    exp.type = integer

exp --> TRUE
    exp.type = boolean

exp --> FALSE
    exp.type = boolean

exp --> ID
    exp.type = lookup(ID.name)

exp --> exp [ exp ]
    if (isArrayType(exp_2.type) && typeEqual(exp_3.type, integer))
        exp_1.type = exp_2.type.child
    else type-error(exp_1)

exp --> exp OR exp
    if (typeEqual(exp_2.type, boolean) && typeEqual(exp_3.type, boolean))
        exp_1.type = boolean
    else type-error(exp_1)

exp --> exp + exp
    if (typeEqual(exp_2.type, integer) && typeEqual(exp_3.type, integer))
        exp_1.type = integer
    else type-error(exp_1)

Note that no concept of type conversion or type coercion is possible here, only type equality.

3.3 Other topics in type analysis

Overloading is the assignment of different operations to the same operator (e.g., + may denote integer addition, floating-point addition, or, in Java, string concatenation). Various implementation techniques are possible.

Type conversion and coercion (implicit conversion) may be required in compound expressions, in assignment statements, and in parameter passing.
A key question is which types may be (a) converted, (b) coerced to which other types. In object-oriented languages, subtypes may normally be coerced to supertypes. Languages differ greatly in this area.

For example, suppose the above language also had a type float. Then some of the above attribute grammar rules would have to be changed, e.g., as follows:

stmt --> ID := exp
    if (! typeCompatible(exp.type, lookup(ID.name))) type-error(stmt)

(Here, every type is compatible with itself, and type int is also compatible with type float.)

exp --> exp + exp
    if (typeEqual(exp_2.type, integer) && typeEqual(exp_3.type, integer))
        exp_1.type = integer
    else if (typeCompatible(exp_2.type, float) && typeCompatible(exp_3.type, float))
        exp_1.type = float
    else type-error(exp_1)

(Note this rule should also set an op_type attribute of the operator symbol: in the first case it would have the value integer-addition, in the second the value float-addition.)

4. A SEMANTIC ANALYSER FOR TINY (Louden, Section 6.5)

Study the description and the code. Files: symtab.h, symtab.c, analyze.h, analyze.c
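Finally, the typeCompatible test and the extended rule for + discussed above can be sketched in C as follows. This is a minimal sketch under the stated assumptions (int may be coerced to float, everything else needs exact equality); the Ty enum and the function name plusType are illustrative.

```c
#include <stdbool.h>

/* Every type is compatible with itself; additionally, an integer
   value may be coerced to float. TY_BOOL doubles as the "no valid
   result" value here, purely for illustration. */
typedef enum { TY_INT, TY_FLOAT, TY_BOOL } Ty;

bool typeCompatible(Ty from, Ty to)
{
    return from == to || (from == TY_INT && to == TY_FLOAT);
}

/* Result type of e1 + e2, mirroring the extended rule above:
   int+int -> int; otherwise float if both operands are compatible
   with float; otherwise a type error is flagged. */
Ty plusType(Ty t2, Ty t3, bool *err)
{
    *err = false;
    if (t2 == TY_INT && t3 == TY_INT)
        return TY_INT;       /* op_type would be integer-addition */
    if (typeCompatible(t2, TY_FLOAT) && typeCompatible(t3, TY_FLOAT))
        return TY_FLOAT;     /* op_type would be float-addition */
    *err = true;
    return TY_BOOL;
}
```

Mixing an int and a float operand thus yields float (the int is coerced), while involving a boolean operand is a type error.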