PLI Lecture 12
Code optimisation (actually, code improvement)

1. Optimisation opportunities

* Register allocation
  - with small number of registers (register spilling)
  - with large number of registers (register allocation)
  - register storage with procedure calls (hardware support?)

* Unnecessary operations
  - common subexpression elimination
    (programmer/compiler generated, e.g., array references)
  - assignments to subsequently unused variables
  - dead (i.e., unreachable) code elimination
  - jump optimisation
  
* Expensive operations
  - strength reduction
  - constant folding
  - constant propagation
  - procedure inlining (can be specified in C)
  - tail recursion elimination (required in Scheme)
  
* Machine-specific opportunities
  - instruction selection
  - machine idioms
  - peephole optimisations
  
* Predicting program behaviour (statistically)
  - frequent paths, procedures, code blocks
  - very important in modern pipelined processors
  
2. Classification of optimisations

Optimisation opportunities arise at every stage of compilation.

* Source-level vs target-level optimisations

* Ordering of optimisations
  - E.g., perform constant folding & propagation 
    before dead code elimination,
    x = 1; ... y = 0; ... if (y) x = 0; ... if (x) y = 1; ...
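
  The ordering example above can be sketched in code. This is a
  minimal illustration only, assuming a made-up statement format
  (tuples tagged "assign" or "if") and assuming every "if" body is
  a single assignment; the function name simplify() is hypothetical.
  It propagates known constants and drops any branch whose condition
  is a known constant.

```python
def simplify(stmts):
    """Propagate constants through straight-line code and remove
    'if' statements whose condition is a known constant.

    Assumed statement forms (hypothetical, for illustration only):
      ("assign", var, value)   value is an int or a variable name
      ("if", cond_var, body)   body is a single "assign" statement
    """
    env = {}              # variable -> known constant value
    out = []
    work = list(stmts)
    while work:
        s = work.pop(0)
        if s[0] == "assign":
            _, var, val = s
            if isinstance(val, str) and val in env:
                val = env[val]            # constant propagation
            if isinstance(val, int):
                env[var] = val
            else:
                env.pop(var, None)        # value no longer known
            out.append(("assign", var, val))
        else:
            _, cond, body = s
            if cond in env:
                if env[cond]:
                    work.insert(0, body)  # branch always taken
                # otherwise the body can never run: drop it
            else:
                out.append(s)             # cannot decide: keep the test
                env.pop(body[1], None)    # assigned variable is now unknown
    return out


program = [("assign", "x", 1), ("assign", "y", 0),
           ("if", "y", ("assign", "x", 0)),
           ("if", "x", ("assign", "y", 1))]
result = simplify(program)
```

  On this input the first "if" disappears and the second collapses
  to its body. Without the propagation step neither test could be
  removed, since both conditions are variables, not constants.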

* Local (basic block) vs global (intraprocedural)
  vs interprocedural optimisations
  - A basic block is a maximal unbranching code
    sequence (normally including procedure calls).
  
* Example of local optimisation
  - Don't reload value known to be in register.

* Example of global optimisation
  - Store "induction" variables in registers.
  - Move "constant" code out of loops.
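
  As a hand-worked illustration of moving loop-invariant code out
  of a loop, the two functions below compute the same result. The
  names before() and after() are hypothetical, and the
  transformation is applied manually here; a compiler would do it
  on intermediate code. The second version also replaces the
  multiplication i * 4 by a running sum on the induction variable
  (strength reduction).

```python
def before(a, x, y):
    for i in range(len(a)):
        a[i] = i * 4 + x * y   # x * y is recomputed on every iteration
    return a


def after(a, x, y):
    t = x * y                  # invariant expression moved out of the loop
    k = 0                      # k always equals i * 4
    for i in range(len(a)):
        a[i] = k + t
        k += 4                 # an addition replaces the multiplication
    return a
```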

3. Implementation techniques

* Syntax tree transformation is sometimes useful,
  - e.g., for constant folding and dead code elimination. 
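
  A minimal sketch of constant folding as a syntax tree
  transformation. The tree encoding is an assumption made for this
  example: an int is a constant, a string is a variable name, and
  a tuple (op, left, right) is a binary operation. The name fold()
  is hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(t):
    """Return a tree in which every constant subexpression
    has been replaced by its value."""
    if not isinstance(t, tuple):
        return t                      # constant or variable: nothing to do
    op, left, right = t
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)   # both operands known: evaluate now
    return (op, left, right)
```

  For example, fold(("+", ("*", 2, 3), "x")) yields ("+", 6, "x").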
  
* Syntax tree attribution is often useful,
  - e.g., for variable usage counts (which variables
    to store in registers).
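
  A minimal sketch of attribution by usage count, assuming a tree
  encoding chosen for this example only: a string is a variable
  name, an int is a constant, and a tuple (op, child, ...) is an
  operation. The name usage_counts() is hypothetical. The most
  frequently used variables are the natural candidates for
  registers.

```python
from collections import Counter


def usage_counts(t, counts=None):
    """Count how often each variable occurs in the tree t."""
    if counts is None:
        counts = Counter()
    if isinstance(t, str):
        counts[t] += 1                  # a variable occurrence
    elif isinstance(t, tuple):
        for child in t[1:]:             # t[0] is the operator, not a variable
            usage_counts(child, counts)
    return counts


counts = usage_counts(("+", ("*", "x", "y"), ("-", "x", 1)))
```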
  
* Flow graphs for global optimisation
  - A flow graph is a directed graph whose nodes
    correspond to basic blocks and whose edges 
    correspond to (conditional) jumps (to the 
    start of basic blocks).
  - E.g., see Figure 8.18.
  - A flow graph can be constructed in a single pass
    over the intermediate code.
  - Exercise: Describe how to do this.
  
* Data flow analysis is the process of collecting
  information useful for code optimisation.
  
* Example of data flow analysis
  - Reaching definitions
  - A definition (of a variable) is an instruction
    (e.g., assignment, read) that sets the value of
    the variable.
  - A definition reaches a basic block if, along some
    path to the start of the block, the variable may
    still have the value set by the definition.
  - Knowing which definitions reach a given basic 
    block is useful for many optimisations,
    e.g., for constant propagation.
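
  Reaching definitions can be computed by iterating over the flow
  graph until nothing changes. The sketch below is one common
  formulation, not necessarily the one used in the course text. It
  assumes each block is given as a list of (definition id,
  variable) pairs and the flow graph as a predecessor map; all
  names are hypothetical.

```python
def reaching_definitions(blocks, preds):
    """blocks: name -> list of (def_id, var) in program order.
    preds:  name -> list of predecessor block names.
    Returns a map from block name to the set of definitions
    that may reach the start of that block."""
    var_of = {d: v for defs in blocks.values() for d, v in defs}
    gen, kill = {}, {}
    for b, defs in blocks.items():
        last = {}
        for d, v in defs:
            last[v] = d                  # a later definition overrides an earlier one
        gen[b] = set(last.values())
        kill[b] = {d for d, v in var_of.items() if v in last} - gen[b]

    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds.get(b, [])))
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN


blocks = {"B1": [("d1", "x"), ("d2", "y")], "B2": [("d3", "x")], "B3": []}
preds = {"B2": ["B1"], "B3": ["B1", "B2"]}
reach = reaching_definitions(blocks, preds)
```

  In this example only one definition of y (d2) reaches B3, so if
  d2 assigns a constant, uses of y in B3 can be replaced by that
  constant. Two definitions of x (d1 and d3) reach B3, so x cannot
  be treated the same way.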
    
* To generate code for a basic block, it is useful
  to first construct a DAG for the block.
  
* A DAG for a basic block is a directed acyclic graph
  that describes how intermediate and output variables
  in a block depend on its input and intermediate 
  variables (and constants).
  - E.g., see Figures 8.19 and 8.20.
  - Exercise: Describe how to construct a DAG for a 
    basic block.
    
* A DAG for a basic block can be transformed into
  target code by "topologically sorting" the DAG.
  - A topological sort of a DAG is an ordering of
    the nonleaf nodes in the DAG such that no target 
    follows one of its sources in the ordering, so operands
    are computed before the operations that use them.
  - (Generating topological orderings is a standard
    algorithm.)
  - Some topological sorts are better than others.
  - E.g., topological sorts of Figure 8.19.
  - E.g., topological sorts of Figure 8.20.
  - Each element of the topological sort corresponds
    to an instruction in the target code.
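
  A minimal sketch of generating one such ordering with a
  depth-first traversal. It assumes the DAG is given as a map from
  each nonleaf node to the list of its operand nodes; that
  encoding and the name topo_order() are hypothetical, and the
  example DAG is invented, not taken from the figures.

```python
def topo_order(dag, roots):
    """Return the nonleaf nodes reachable from roots, ordered so
    that every node appears after all of its operands."""
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for child in dag.get(n, []):
            visit(child)              # operands first
        if dag.get(n):
            order.append(n)           # leaves produce no instruction

    for r in roots:
        visit(r)
    return order


# t1 = a + b;  t2 = t1 * c;  t3 = t1 - t2   (t1 is a shared node)
dag = {"t3": ["t1", "t2"], "t2": ["t1", "c"], "t1": ["a", "b"]}
order = topo_order(dag, ["t3"])
```

  A depth-first traversal produces only one of the possible
  orderings; choosing among them (e.g., to reduce register
  pressure) is a separate decision.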

* Using DAGs for basic blocks in this way:
  - automatically does common subexpression elimination
  - eliminates redundant assignments
  - enables good register allocation 
  
* Maintenance of register descriptors and address
  descriptors
  - A register descriptor stores the set of variables
    whose values are currently held in a given register.
  - An address descriptor stores the set of locations
    where a given variable is currently stored.
  - E.g., descriptor maintenance during execution
    of the basic block DAG in Figure 8.19 (pp.479-480).
  - This information can be used for register allocation
    and dead code elimination.
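
  A minimal sketch of descriptor maintenance, assuming two
  dictionaries and a single location "mem" standing for a
  variable's home in memory. The class and method names are
  hypothetical. It shows how the descriptors let the code
  generator avoid reloading a value already held in a register.

```python
class Descriptors:
    def __init__(self, registers, variables):
        self.reg = {r: set() for r in registers}     # register -> variables it holds
        self.addr = {v: {"mem"} for v in variables}  # variable -> current locations

    def _evict(self, r):
        # Forget what r held. A real generator would first store any
        # variable whose only current location is r.
        for w in self.reg[r]:
            self.addr[w].discard(r)
        self.reg[r] = set()

    def load(self, r, v):
        """Return (register holding v, code emitted to get it there)."""
        for q, held in self.reg.items():
            if v in held:
                return q, []                         # already in a register
        self._evict(r)
        self.reg[r] = {v}
        self.addr[v].add(r)
        return r, [f"{r} = {v}"]

    def assign(self, r, v):
        """Record that r now holds a newly computed value of v."""
        self._evict(r)
        for held in self.reg.values():
            held.discard(v)                          # other copies are stale
        self.reg[r] = {v}
        self.addr[v] = {r}                           # the memory copy is stale too


d = Descriptors(["r0", "r1"], ["x", "y"])
first = d.load("r0", "x")
second = d.load("r1", "x")
d.assign("r0", "y")
```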
    
4. Other techniques

* Register usage
  - Allocate specified registers for program counter,
    global pointer, frame pointer, stack pointer, etc.
  - With a large number of registers, allocate registers
    for parameters and local variables of the current 
    procedure call.  This may require saving register
    values in the stack frame on subsequent procedure calls
    and restoring them on procedure return.  Some instruction
    sets support this.
  - Use remaining registers for temporaries in expression
    evaluation.  This may require "spilling" registers
    to temporary locations in the current frame when
    more registers are required.
    
* Example: Expression evaluation with register spilling
  - First compute the number of registers required
    to evaluate each node of an expression tree:
    
    void expRegs(SyntaxTree t) 
      if (t is a constant) 
        t.nregs = 0
      else if (t is a variable) 
        t.nregs = 1
      else if (t has one child, u) 
        t.nregs = max(1, u.nregs)
      else if (t has two children, u and v)
        if (u.nregs == v.nregs) 
          t.nregs = 1 + u.nregs
        else 
          t.nregs = max(u.nregs, v.nregs)
         
  - Then, generate code to evaluate an expression tree
    using registers r to K-1 (say):
    
    void genCode(SyntaxTree t, int r) 
      if (t is a constant)		// e.g., 1
        emitCode("r = t.value")
      else if (t is a variable)		// e.g., x
        emitCode("r = t.name")
      else if (t has one child, u)	// e.g., -2
        genCode(u, r)
        emitCode("r = t.op r")
      else if (t has two children, u and v)
        if (u.nregs < v.nregs)		// e.g., (u+1) * ((x+y)*(2+z))
          Exchange u and v
          // evaluate more complex subexp first
        genCode(u, r)
        if (v is a constant or variable) // e.g., (2*(x+y)) + z
          emitCode("r = r t.op v")
        else if (v.nregs == K-r)
          // no more registers to evaluate v
          addr tmp = spill(r)
          genCode(v, r)
          emitCode("r+1 = tmp")
          emitCode("r = r+1 t.op r")
        else
          genCode(v, r+1)		// e.g., (2*(x+y)) * (3*(u-v))
          emitCode("r = r t.op r+1")
          
    - Here, spill() emits code to push its argument
      onto the stack and returns the offset from the
      frame pointer.

    - Exercise: Trace the behaviour of these functions
      on some examples, e.g., x*(y+z*2*(x-y)-u)+3 with
      only 2 registers available.
        
    - Exercise: Prove the two functions are correct
      or find a counterexample which demonstrates they
      are wrong!
    
    - Additional complications arise if we consider whether
      operators are commutative (compare "+" and "-"),
      whether "reverse operations" exist, and so on.
      
    - Exercise: Modify function genCode() to work 
      correctly on noncommutative operations such as
      subtraction and division.
      
    - Exercise: Suppose a node in the expression tree
      is a function call.  How do the two functions need
      to be extended/modified to handle this?

    - Hint: Either spill all registers on function entry,
      or assume each function call requires all K registers
      (and let genCode() spill registers normally).
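
    - To help with the tracing exercise, the two functions can be
      transcribed into runnable form. The sketch below is a literal
      transcription of the pseudocode above and makes no claim that
      the algorithm is correct. The Node class, the register names
      r0, r1, ..., the extra parameters K and code, and modelling
      spill() as a copy into a named temporary are all assumptions
      made for this sketch. The recursion in exp_regs() is also
      added here; the pseudocode leaves the traversal implicit.

```python
class Node:
    def __init__(self, op, *kids, value=None):
        self.op, self.kids, self.value = op, list(kids), value
        self.nregs = 0


def const(c):
    return Node("const", value=c)


def var(x):
    return Node("var", value=x)


def exp_regs(t):
    for k in t.kids:
        exp_regs(k)                       # children first
    if t.op == "const":
        t.nregs = 0
    elif t.op == "var":
        t.nregs = 1
    elif len(t.kids) == 1:
        t.nregs = max(1, t.kids[0].nregs)
    else:
        u, v = t.kids
        t.nregs = 1 + u.nregs if u.nregs == v.nregs else max(u.nregs, v.nregs)


def gen_code(t, r, K, code):
    if t.op in ("const", "var"):
        code.append(f"r{r} = {t.value}")
    elif len(t.kids) == 1:
        gen_code(t.kids[0], r, K, code)
        code.append(f"r{r} = {t.op} r{r}")
    else:
        u, v = t.kids
        if u.nregs < v.nregs:
            u, v = v, u                   # evaluate more complex subexp first
        gen_code(u, r, K, code)
        if v.op in ("const", "var"):
            code.append(f"r{r} = r{r} {t.op} {v.value}")
        elif v.nregs == K - r:
            tmp = f"tmp{len(code)}"       # stands in for spill(r)
            code.append(f"{tmp} = r{r}")
            gen_code(v, r, K, code)
            code.append(f"r{r + 1} = {tmp}")
            code.append(f"r{r} = r{r + 1} {t.op} r{r}")
        else:
            gen_code(v, r + 1, K, code)
            code.append(f"r{r} = r{r} {t.op} r{r + 1}")


# (x + y) * (2 + z) with K = 2 registers, starting at r0
tree = Node("*", Node("+", var("x"), var("y")), Node("+", const(2), var("z")))
exp_regs(tree)
code = []
gen_code(tree, 0, 2, code)
```

      Running this with other trees and other values of K, and
      checking each emitted instruction by hand, is one way to
      look for the counterexamples the exercise asks about.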