Notes based on F29LP course by Jamie Gabbay and Gudmund Grov

Compiler

  • A program that converts instructions into a machine-code or lower-level form so that they can be read and executed by a computer.

Interpreter

  • A program that can analyse and execute a program line by line.

Phases of a compiler

  • There are two parts to a compiler:
    • The front end: analyses the source program.
    • The back end: synthesises (constructs) the target program.
  • These can be split up into a sequence of phases.
  • Front end includes:
    • Lexical analysis: source code to token sequence.
    • Syntax analysis: token sequence to syntax tree.
    • Semantic analysis: checks and annotates the syntax tree.
    • Intermediate code generator.
  • Back end includes:
    • Code generator - for target machine code.
    • Machine-dependent code optimiser.

Phases of an interpreter

  • Front end is often the same for an interpreter.
  • One may interpret the syntax tree.
  • Lexical analysis > syntax analysis > semantic analysis > intermediate code generator > evaluator.

Passes of language processors

  • A pass is a complete traversal of the source program.
  • This can be over the code, the syntax tree or the intermediate representation.
  • Several phases are usually combined into a single pass.
    • E.g lexical and syntax analysis.
  • A single-pass compiler performs all phases in a single pass.
  • A multi-pass compiler will pass over the source program (or a representation of it) multiple times.

Lexical Analysis

  • First step of most language processors.
  • Turns the source program into a stream of tokens.
  • Some symbols are discarded because they serve no purpose for the language processor, such as comments, blank spaces and newline symbols.

Lexeme

  • A sequence of characters in the source program that matches the pattern for a token.

Pattern

  • Description of the form that the lexemes may take.
  • Often described as a regular expression.

Token

  • An abstract symbol representing a lexical unit.
  • The grammar can then use a token as a unit.
  • May have an attribute value connected.
  • Uses regular expression (type 3 grammar) rules for describing the patterns of lexemes.
  • Examples of tokens in Java:
    • Reserved words: boolean, break, char, else, if, int, while etc.
    • Punctuation: ; , . ( ) = == <= >= != { } + ++ - -- * / etc.
  • Each lexeme pattern can be assigned a token name.
    • E.g := ASSIGN, <= LE, ; SEMI.
  • Variables: x, y, test represented by ID.
  • Digits: 0,1,2… represented by INT.

Example of Lexical Analysis

  • Lexical analysis turns the source program into a stream of tokens.
    • E.g VAR x ; x := 0 becomes VAR ID SEMI ID ASSIGN INT.
  • Variables become ID and numbers become INT.

Implementing A Lexical Analyser

  • Regular expressions are convenient for specifying lexical tokens.
  • Recognition is achieved with deterministic finite automata (DFAs).
  • We can convert regular expressions into a nondeterministic finite automaton (NFA), which can then be converted into a deterministic finite automaton (DFA).
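  • As a minimal sketch (assumed C code, not the course's actual implementation), a DFA for the pattern [0-9]+ can be run directly with a transition function; state 0 is the start state and state 1 is accepting:

      #include <stdio.h>

      /* Tiny DFA for [0-9]+ : state 0 = start, state 1 = accepting, -1 = reject. */
      static int transition(int state, int c) {
          if (c >= '0' && c <= '9') return 1;  /* any digit moves to / stays in state 1 */
          return -1;                           /* any other character rejects */
      }

      int matches_int(const char *s) {
          int state = 0;
          for (; *s; s++) {
              state = transition(state, *s);   /* one step per input character */
              if (state < 0) return 0;
          }
          return state == 1;                   /* accept only in the accepting state */
      }

      int main(void) {
          printf("%d %d\n", matches_int("1234"), matches_int("12a4"));  /* prints 1 0 */
          return 0;
      }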

Lexical Analyser Generators

  • Use a lexical analyser generator, such as Flex, on your source file.
  • Most common languages have lexical analyser generators.

Flex (Lex)

  • Generates a lexical analyser from a user-provided specification.
  • Specification file (source file) > flex (generator) > lexical analyser (output .c file).
  • A flex specification file rule has two parts:
    • A pattern, represented as a regexp, e.g [0-9]+.
    • An action: C code to be executed when the match occurs, e.g {return INT;}
  • No backtracking:
    • The longest match for an input is generated.
    • If equal length then first rule applies.
  • The string variable yytext contains the value of the current match.
  • Takes regular expressions as input and generates a C program as output.
  • Essentially turns regexps into an NFA, which is then turned into a DFA.
  • The DFA is implemented as a transition table, a representation of the transition matrix.
  • The generated C program then uses the transition table in an automaton simulator, which works on the input file for the lexical analyser: newstate := transition_table[oldstate][input].
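  • A minimal Flex specification in this style might look as follows (a sketch; the token codes are assumptions and would normally come from the parser):

      %{
      /* hypothetical token codes; in a real compiler these come from the parser */
      #define INT 1
      #define ID 2
      #define ASSIGN 3
      #define SEMI 4
      %}
      %option noyywrap
      %%
      [0-9]+      { return INT; }     /* longest match wins */
      [a-zA-Z]+   { return ID; }      /* yytext holds the matched lexeme */
      ":="        { return ASSIGN; }
      ";"         { return SEMI; }
      [ \t\n]+    { /* discard whitespace and newlines */ }
      %%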

Syntax Analysis

  • Syntax analysis has two roles:
    • Check that the given program (string) is well formed (is a valid sentence).
    • Build a syntax tree of the program for use by later phases of the compiler/interpreter.

Why separate this from the lexical analysis phase?

  • Simplifies the design and overall process of parsing the source file by allowing the parser to work with a more abstract notion of tokens.
  • Speeds up the process by applying specialised techniques to each particular task.
  • Portability (flexibility): one can be changed without impacting the other.

Backus-Naur Form (BNF)

  • The syntax of most programming languages is described by a type 2 (context free) grammar.
  • BNF is widely used to describe programming languages.
  • Nonterminal symbols are enclosed by <…>.
  • Productions of the same nonterminal are separated by |
    • E.g <exp> ::= <term> + <exp> | <term>

LL(1) Parsers

  • Simple and efficient parsers are desirable. LL(1):
  • Left to right scanning, leftmost derivation, 1 token look ahead.
  • One symbol lookahead without backtracking.
  • Make every choice point based upon current state and next input.
  • No step in the analysis will need to be undone.

Recursive Descent Parsing

  • Parser for LL(1) grammars.
  • Also called predictive parsing.
  • Amounts to writing a potentially recursive function A() for each nonterminal A (see the sketch after this list).
  • With a clause for each production.
  • Tries to uniquely match the rule’s right hand side.
  • To parse A where A ::= α, match its right hand side α: match(α).
  • If α is a nonterminal then match(α) amounts to calling α().
  • Only works when the first terminal symbol is sufficient to decide which production rule applies.
  • First(α):
    • Set of all symbols that can start α.
    • Not straightforward to compute: α can be a nonterminal with multiple productions, and for a sequence αβ where α may derive the empty string, First(αβ) also includes all symbols that can start β.
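  • A minimal recursive descent parser in C might look as follows (a sketch under assumed single-character tokens, with digits standing for INT, for the LL(1) grammar E ::= T E', E' ::= + T E' | empty, T ::= digit | ( E )):

      #include <stdio.h>
      #include <stdlib.h>

      static const char *input;               /* next unread token (one char each) */

      static void error(void) { printf("parse error at '%c'\n", *input); exit(1); }
      static void match(char expected) {      /* consume one terminal symbol */
          if (*input == expected) input++; else error();
      }

      static void E(void);
      static void T(void) {                   /* T ::= digit | ( E ) */
          if (*input >= '0' && *input <= '9') input++;
          else if (*input == '(') { match('('); E(); match(')'); }
          else error();
      }
      static void Eprime(void) {              /* E' ::= + T E' | empty */
          if (*input == '+') { match('+'); T(); Eprime(); }
          /* else: the empty production - decided on one token of lookahead */
      }
      static void E(void) { T(); Eprime(); }  /* E ::= T E' */

      int main(void) {
          input = "(1+2)+3";
          E();
          puts(*input == '\0' ? "accepted" : "rejected: trailing input");
          return 0;
      }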

Lookahead for LL(1)

  • LL(k) grammar - k is the maximum lookahead.
  • Lookahead is used to distinguish options/avoid backtracking.
  • Recursive descent is applicable for LL(1) grammars.
  • One-symbol lookahead works by starting every option with a unique initial terminal symbol.
    • E.g E ::= E + T | T will cause problems, as First(E + T) contains everything in First(T), so we cannot always decide which rule to apply.
  • To check if a grammar is LL(1), check that every rule option starts with a unique terminal.
  • Sometimes you have to factor a non-LL(1) grammar (e.g left-factor it or remove the left recursion) to make it LL(1), as shown below.
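  • For example, the left recursion in the rule above can be removed by introducing a new nonterminal (a standard transformation):

      E  ::= T E'
      E' ::= + T E' | ε

    Now every option of E' starts with a distinct terminal (+ or nothing), so one token of lookahead suffices.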

LR(K)

  • Left to right parse, rightmost derivation, k-token lookahead. More powerful than LL because it can delay making a choice until more input has been seen, but not as simple to implement.

Abstract Syntax Trees

Passes

  • Pass is a complete traversal of the source program.

Single pass compilers

  • Integrate all phases in a single pass.
  • Meaning code is generated directly while parsing.
  • Faster, uses less space.

Multipass compiler

  • More modular, flexible, enables more optimized code.
  • Several passes over source program.
  • A phase can now be a pass, often some phases are combined.
  • Need to create data structure for next pass.
  • Syntax tree generated from parsing for the next pass.

Syntax Trees

  • Data structures which later phases can use.
  • A leaf node for each token.
  • An internal node for each grammar rule reduced during parse phase.
  • Concrete syntax:
    • Structure of well formed symbol sequences in textual representation. E.g (1+2) * 3.

Abstract Syntax Tree

  • Only keep information relevant for semantic manipulation.
  • Structure of well formed symbol sequence relevant to meaning.
  • Abstract syntax describes the structure of the representation; it is not for parsing, so parsing-specific parts can be thrown away.
  • Find abstract syntax by:
    • Substituting to remove singleton chains.
    • Removing unnecessary structural symbols.
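  • A sketch of how an AST might be represented in C (node kinds and field names are assumptions, not the course's actual data structure):

      #include <stdlib.h>

      typedef enum { K_INT, K_ID, K_BINOP } Kind;

      typedef struct Ast {
          Kind kind;
          int value;                 /* K_INT: the literal's value  */
          char *name;                /* K_ID: the identifier's name */
          char op;                   /* K_BINOP: '+', '-', '*', '/' */
          struct Ast *left, *right;  /* K_BINOP: the two operands   */
      } Ast;

      /* constructor for constant leaves; newId() is sketched later */
      Ast *newInt(int v) {
          Ast *n = calloc(1, sizeof *n);
          n->kind = K_INT; n->value = v;
          return n;
      }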

Syntax Not Semantics

  • Syntax cannot capture all meaning.
  • Not all well formed symbol sequences are meaningful.
    • E.g 3 / (6 - (4 + 2)) -> 3 / 0.
    • E.g {int a; a++;} -> a has no value (used before it is assigned).
  • Semantic analysis:
    • Can find some such semantic problems.
    • Others can only be found at runtime.

Concrete To Abstract Syntax

  • Factor out singleton rules:
    • Replace the singleton rule by its right hand side.
    • Replace other references to that rule with the name of the rule it has been folded into.
  • Need to expand out all inline options.
  • Don’t just factor everything into a single abstract representation.
  • Distinguish between semantically different constructs.
  • For example, in the SIMP language:
    • Declaration -> introduces variables.
    • Command -> change variables.
    • Expression -> gets values from variables.

AST Construction

  • Modify parser to make each rule return the abstract syntax tree (AST) for the symbol sequence it recognises.

Fetching Token Attribute Value

  • Flex will return the token.
  • For ID/INT we need the actual value (the match) to be stored in the AST.
  • This is stored in the yytext variable but will be overwritten on the next call to yylex(), so we need to make a copy of the string. This is handled by strdup(i) in the newId() function (sketched below).
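  • A sketch of newId() under these assumptions (reusing the Ast type sketched earlier; yytext is supplied by the Flex-generated scanner):

      #include <stdlib.h>
      #include <string.h>

      extern char *yytext;          /* Flex's buffer: reused on every yylex() call */

      Ast *newId(char *i) {
          Ast *n = calloc(1, sizeof *n);
          n->kind = K_ID;
          n->name = strdup(i);      /* private copy: survives the next yylex() */
          return n;
      }

    A typical call site would be newId(yytext), made before the next token is read.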

Interpreters

  • We would like to be able to execute programs.
  • An interpreter takes the source program and executes it immediately. No translation into other (low level machine) code.
  • Interpreters are good for
    • Quickly testing a program to get feedback.
    • Prototyping a programming language.

Types of Interpreters

  • High level interpreters:
    • Execute/interpret the AST directly or the source code directly (single pass).
    • Often straightforward to implement but inefficient.
  • Bytecode/IR interpreters:
    • Translate into an intermediate (low level) representation and interpret this lower level representation.
    • Used in several programming languages such as Java, JavaScript and C#.
    • More efficient but harder to develop.
  • We developed a high level interpreter for SIMP.

Implementing an Interpreter

  • We have to give meaning to each construct. Using natural language may lead to mistakes, so some languages have a mathematical description of how the program behaves.
  • Recursive interpretation:
    • Introduce a store:
      • Values of variables.
      • Need to handle scoping.
  • Traverse the abstract syntax tree.
    • At each node:
      • Execute the construct.
      • Update the store.
  • The store is a stack of identifier/value pairs.
  • Scoping rules: keep track of the current frame.
  • Variables are declared at the current block level.
  • Frame base = sb.
  • Frame pointer = sp - the next free position.

Interpreter Store

  • At start, frame base (sb) == frame pointer (sp) == 0.
  • For block entry:
    • Remember the old frame base.
    • The frame base starts at the frame pointer.
  • For block exit, reset:
    • Frame pointer to frame base.
    • Frame base to the old frame base.
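  • A sketch of this store in C (sizes and names are illustrative assumptions):

      #include <stdio.h>
      #include <string.h>

      #define STORE_MAX 256
      static struct { const char *id; int value; } store[STORE_MAX];
      static int sb = 0, sp = 0;       /* frame base and next free position */

      void declare(const char *id, int v) { store[sp].id = id; store[sp].value = v; sp++; }

      int *lookup(const char *id) {    /* innermost declaration wins */
          for (int i = sp - 1; i >= 0; i--)
              if (strcmp(store[i].id, id) == 0) return &store[i].value;
          return NULL;                 /* undeclared variable */
      }

      int enter_block(void) { int old_sb = sb; sb = sp; return old_sb; }
      void exit_block(int old_sb) { sp = sb; sb = old_sb; }  /* pop the whole frame */

      int main(void) {
          declare("x", 1);
          int old = enter_block();          /* block entry */
          declare("x", 2);                  /* shadows the outer x */
          printf("%d\n", *lookup("x"));     /* prints 2 */
          exit_block(old);                  /* block exit */
          printf("%d\n", *lookup("x"));     /* prints 1 */
          return 0;
      }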

Interpreter Reflection

  • An interpreter provides a good way to:
    • Quickly check out a language design.
    • Construct a first implementation.
    • Enable platform independent portability.
  • But inefficient due to tree traversal which involves considerable pointer indirection.

Code Generation

  • We would like to avoid tree traversal.
  • Generate target code for potential machine level execution.
  • Turn the source code (AST) into instructions at the machine level.
  • Goal is to generate MIPS assembly code.

MIPS

  • RISC instruction set architecture.
  • 32 registers (each 32 bits).
  • MIPS memory:
    • Operations work on registers: fast but limited in number.
    • .data (static) area: global data/variables.
    • Stack: limited lifetime - local variables.
    • Heap: dynamically allocated at run time.

Correct Code Generation

  • The problem is that programming languages have high level concepts such as expressions, types, variables, procedures and functions etc., but target machines only support low level concepts such as bits, bytes, words, registers, stacks, addresses and subroutines.
  • This difference is called the semantic gap between high and low level concepts.
  • Ensuring fast and correct code is often hard; we can often see a spectrum of approaches:
    • From simple/direct:
      • Easier to see correctness.
      • But may not be optimal.
    • To fast/optimised:
      • Hard to relate to the source code.
  • Target machines often implement source constructs differently, so code generated for one machine may not adapt to a different machine.
  • Therefore back end does not have the same general principles as the front end.
  • We often split code generation into two parts:
    • Instruction selection.
    • Register allocation.

Register Allocation

  • Most instructions assume that data are in registers, these are very fast but typically there is more data than registers.
  • Programs will spend a lot of time moving data to and from registers.
  • Register allocation is the strategy of assigning data to registers.

Instruction Selection

  • Maps AST nodes to the appropriate machine instructions.
  • Typically precedes register allocation.
  • Assumes there are sufficient registers, often via a form of pseudo registers called temporaries.
  • Register allocation will then replace these by actual registers.
  • Instruction selection works by traversing the AST.
  • Match a node (possibly including its subtrees) with a suitable instruction.
  • Recursively apply instruction selection to the subtrees.
  • Generate code for the match (code for subtrees will have to be generated first).
  • This approach is called tiling the tree, where one match corresponds to a tile.
  • In general there could be multiple matches.
  • The maximal munch algorithm is used to resolve this: the largest matching tile should always be used.

Traversing The AST

  • Code is generated by traversing the AST.
  • Implement one rule for each type of node; each node represents a tile that may include sub nodes (see the sketch below).
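  • A sketch of such a traversal in C, emitting MIPS into numbered temporaries (the Exp type is a minimal stand-in for the real AST; register allocation would later map $t<n> to real registers):

      #include <stdio.h>

      typedef struct Exp { char op; int value; struct Exp *l, *r; } Exp;  /* op 0 = constant */
      static int next_temp = 0;

      int gen(Exp *e) {                  /* returns the temporary holding e's value */
          int t = next_temp++;
          if (e->op == 0) {
              printf("li $t%d, %d\n", t, e->value);          /* tile: constant leaf */
          } else {
              int a = gen(e->l), b = gen(e->r);              /* subtrees first */
              printf("%s $t%d, $t%d, $t%d\n",
                     e->op == '+' ? "add" : "sub", t, a, b); /* tile: binary op node */
          }
          return t;
      }

      int main(void) {
          Exp one = {0, 1}, two = {0, 2}, three = {0, 3};
          Exp sum = {'+', 0, &one, &two}, e = {'-', 0, &sum, &three};
          gen(&e);   /* emits code for (1 + 2) - 3 */
          return 0;
      }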

Register Operations

  • Saved registers: $s0 to $s7.
  • Temporary registers are used to store variables.
  • Temporary registers $t8 and $t9 are reserved for intermediate results during evaluation.
  • Use $a0 - $a3 for the first 4 arguments.
  • Use $v0 and $v1 for return values.
  • move Rm, Rn: moves the value in register Rn to Rm.
  • add Ra, Rb, Rc: Ra = Rb + Rc (addi takes an immediate: addi Ra, Rb, imm).
  • sub Ra, Rb, Rc: Ra = Rb - Rc.
  • div Ra, Rb, Rc: Ra = Rb / Rc.
  • mul Ra, Rb, Rc: Ra = Rb * Rc.
  • Multiplication can also be achieved by using loops (repeated addition).
  • MIPS instructions:
  • $sp keeps track of the stack pointer.
  • lw $t0, var1: loads the contents of RAM location var1 into reg $t0: $t0 = var1.
  • sw $t1, var1: stores the contents of reg $t1 into RAM: var1 = $t1.
  • li $t1, 5: load immediate: $t1 = 5.
  • Labels are used to jump to a new code sequence.
  • ble r1, r2, label: go to label if r1 <= r2.
  • bgt r1, r2, label: go to label if r1 > r2.
  • beq r1, r2, label: go to label if r1 == r2.
  • j label: unconditional jump to label.
  • Syscall can perform different operations by loading a value into $v0:
    • 1 to print integer.
    • 4 to print string.
    • 5 to read integer.
    • 10 to terminate a program.
  • Use registers $a0 - $a3 to load argument values.
  • Evaluating binary operations by using registers and stacks.
    • Use any free registers to store these values.
    • Or use a stack to push and pop values.
  • .data tells the assembler the upcoming section is to be considered data.
  • .text tells the assembler that the section contains language instructions.
  • main: the code to be executed starts after this label.
  • Terminate a program by setting register $v0 to 10 and making a syscall.
  • Lifetime of variable depends on scope:
    • Global: lifetime is until program terminates.
    • Local: scope of block/function/procedure, ends there too.
  • Functions are handled by low level subroutines.
    • Control is transferred by jal f (f = function name); jal saves the return address in $ra.
    • Control is returned by jr $ra ($ra = the address where the call was made).
  • Parameters can be passed to the routine and results returned. Parameter Passing Variants:
    • Call by value: pass copy of parameter.
    • Call by reference: pass reference to a place in memory.
    • Call by value-result: pass a copy, but copy the result back on return.
    • Call by name: the argument is not evaluated; the name/expression itself is passed and evaluated in the callee body when accessed.
  • Making a call: evaluate the arguments, call subroutine with jal, pop the stack and put value in correct register.
  • Arrays are declared as: ARRAY identifier OF integer, e.g ARRAY a OF 3 will declare a variable a to hold 3 ints (32 bit). They can be represented in the static area (global) or in separate registers (local).
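  • A minimal complete MIPS program in this style, as run under a simulator such as SPIM or MARS (a sketch; label and data names are illustrative):

              .data
      msg:    .asciiz "result: "
              .text
      main:   li   $v0, 4           # syscall 4: print string
              la   $a0, msg         # argument in $a0
              syscall
              li   $t8, 1           # evaluate 1 + 2 in the temporaries
              li   $t9, 2
              add  $a0, $t8, $t9    # result becomes the print argument
              li   $v0, 1           # syscall 1: print integer
              syscall
              li   $v0, 10          # syscall 10: terminate the program
              syscall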

Optimisation

  • Better to delay optimisation until software is correct.
  • Source code optimisation on AST before code generation.
  • Target code optimisation after code generation (on the generated target code).
  • Local vs global optimisation:
  • Local:
    • Peephole optimisation: consider instructions in their immediate context, removing any unnecessary moves/loads/stores.
    • Expression optimisations: make better use of registers, e.g maximising the use of free registers rather than relying on just two temporary registers.
    • Constant folding: perform numeric constant arithmetic at compile time (see the sketch below).
  • Simplify expressions by removing identity operations (e.g x + 0, x * 1).
  • Global:
    • Often requires global analysis of the program.
    • Eliminate function calls by replacing them with the inlined body (code).
    • Eliminate chains of branches, e.g in loop optimisations: remove unnecessary branches in nested IF constructs, particularly the branch to the end of the inner IF that is immediately followed by a branch to the end of the outer IF.
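  • A sketch of constant folding in C, reusing the Exp type from the instruction selection sketch (fold children first, then collapse operator nodes whose children are both constants):

      void fold(Exp *e) {
          if (e->op == 0) return;                /* constant leaf: nothing to do */
          fold(e->l);
          fold(e->r);
          if (e->l->op == 0 && e->r->op == 0) {  /* both children now constants */
              int a = e->l->value, b = e->r->value;
              e->value = (e->op == '+') ? a + b : a - b;
              e->op = 0;                         /* the node becomes a constant leaf */
          }
      }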

Register Stack

  • On block entry, remember rb (reg base) and set rb to the next free register; on block exit, reset next free to rb and restore the old rb. (It is worth making a walkthrough of how this stack works.)

Storage Organisation

  • Registers are considerably faster, but only a limited number of registers is available.
  • Main memory is used to load, store and update values that do not fit in registers.
  • Heap when allocated dynamically at run time (not used in this course).
  • Stack (local) when lifetime is limited.
  • Static (global) when lifetime is whole program.

Stack Vs Register Protocols

  • Register based protocol:
    • Faster approach.
    • Limited by number of free registers.
    • Caller and callee share a set of agreed registers.
    • To make a call with registers: evaluate and put the arguments in the correct registers, call the subroutine with jal, and move results from the return register to the correct register (see the sketch after this list).
    • Scoping:
      • Callee-save registers: the callee is responsible for preserving their values. The $s registers in MIPS are callee-save.
      • Caller-save registers: the caller is responsible for saving their values before the call. All except the $s registers in MIPS are caller-save.
  • Stack based protocol:
    • Slower but does not tie up registers.
    • Arguments pushed on stack before call.
    • Routine will pop the arguments, apply the body of routine, push the results on the stack.
    • Use $a0 - $a3 for the first 4 arguments; the rest go on the stack.
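  • A sketch of a register based call in MIPS (a leaf routine, so $ra need not be saved; names are illustrative):

      main:   li   $a0, 21          # argument in $a0
              jal  double           # jal saves the return address in $ra
              move $t0, $v0         # move result from the return register
              li   $v0, 10          # terminate
              syscall

      double: add  $v0, $a0, $a0    # result in $v0: double(x) = x + x
              jr   $ra              # return to the caller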

Stack Frame

  • Collection of all data on the stack associated with a subroutine call.
  • Organises local storage.
  • Typically consists of:
    • Local variables.
    • Temporary results.
    • Current arguments (required for nested calls).
    • Return address ($ra) (required for nested calls).
    • Caller-Save registers (required for nested calls).
    • Additional arguments stored in caller frame.
  • Nested calls can make their own frame, setting $sb and $sp to the new frame for the subroutine's local variables.
  • When control is returned to f, the registers can be restored by popping them from the stack; when f ends, the frame is removed and control is returned to its caller.

Register Allocation

  • In RISC all operations occur between registers, thus we move all variables into registers to perform operations on them.
  • Register allocation:
    • Determines which values should be placed in which register at a given time.
    • Most commonly performed within a function/procedure.
  • A temporary is a pseudo register to which we would like to allocate a real register.
  • A temporary is live if it holds a value that may be needed in the future; it can be discarded otherwise.
  • An assignment to temporary variable defines that variable.
  • Liveness analysis tells us which temporaries are live at a given time.

Liveness Analysis and Flow Graph

  • Compute a flow graph to perform a liveness analysis.
  • Control structures - loops, conditionals etc. - are replaced by the edges of the graph.
  • Nodes contain basic blocks of code such as list of assignments. No branching within the block.
  • In edges come from predecessor nodes.
  • Out edges go to successor nodes.
  • An assignment to temporary variable defines that variable.
  • An occurrence of a variable on the r.h.s of assignment uses that variable.
  • Uses and Defines
    • A variable is live on an edge if there is a directed path from that edge to a use of that variable.
    • The path must not go through a definition of that variable.
  • Liveness is computed backwards, using the sets of live variables on the out edges.
  • Compute the input set of an assignment from its output set (see the equations below):
    • If x is defined then remove it from the set.
    • If x is used then add it to the set.
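  • In standard dataflow notation this reads (computed backwards until nothing changes):

      in[n]  = use[n] ∪ (out[n] − def[n])
      out[n] = union of in[s] over all successors s of n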

Register Interference Graph

  • An undirected graph representing all the live variables.
  • Each temporary is a node in the graph.
  • An edge between two nodes means they are live at the same time and so cannot share the same register; two temporaries may share a register if no edge connects them.

Graph Colouring

  • Register allocation has a close correspondence to graph colouring.
  • Two adjacent nodes cannot have the same colour.
  • This is the classic map colouring problem: neighbouring regions must be given different colours.
  • For register allocation graph is the interference graph which is the result of the liveness analysis of variables.

Interference Graph

  • Gives a global and architecture independent view of register allocation.
  • Each node is a temporary; each colour represents a register.
  • No connected nodes can have the same colour.
  • A graph is K-colourable if it has a colouring with K colours.
  • K-colourable means no more than K registers are needed.

Graph Colouring

  • Computing a graph colouring is NP-hard, meaning there is no known efficient algorithm to compute it; moreover, a K-colouring may not exist at all.
  • We can use heuristics as a solution.

Heuristic for Graph Colouring

  • A method for finding which nodes can be coloured the same, using a stack.
  • Select a node n with fewer than K neighbours.
  • Remove n and its adjacent edges from the graph, and push n onto a stack.
  • Repeat until the graph is empty.
  • While the stack is not empty:
    • Pop the top node from the stack.
    • Add it back into the graph with its connected edges.
    • Assign the node a colour that differs from each of its neighbours.
  • The heuristic fails if we reach a point where there is no node with fewer than K neighbours (see the sketch below).
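  • A sketch of this heuristic in C on a small adjacency matrix graph (sizes and the example graph are illustrative assumptions):

      #include <stdio.h>

      #define N 4   /* number of temporaries (nodes) */
      #define K 2   /* number of available registers (colours) */

      /* illustrative interference graph: a path 0-1-2-3, which is 2-colourable */
      int adj[N][N] = { {0,1,0,0}, {1,0,1,0}, {0,1,0,1}, {0,0,1,0} };

      int main(void) {
          int removed[N] = {0}, stack[N], top = 0;

          /* simplify: repeatedly remove a node with fewer than K remaining neighbours */
          for (int pass = 0; pass < N; pass++)
              for (int n = 0; n < N; n++) {
                  int deg = 0;
                  if (removed[n]) continue;
                  for (int m = 0; m < N; m++)
                      if (!removed[m] && adj[n][m]) deg++;
                  if (deg < K) { removed[n] = 1; stack[top++] = n; break; }
              }
          if (top < N) { puts("stuck: no node with < K neighbours, spill needed"); return 1; }

          /* select: pop each node and give it a colour unused by coloured neighbours */
          int colour[N], coloured[N] = {0};
          while (top > 0) {
              int n = stack[--top], used[K] = {0};
              for (int m = 0; m < N; m++)
                  if (coloured[m] && adj[n][m]) used[colour[m]] = 1;
              for (int c = 0; c < K; c++)
                  if (!used[c]) { colour[n] = c; break; }
              coloured[n] = 1;
              printf("t%d -> register %d\n", n, colour[n]);
          }
          return 0;
      }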

Spilling

  • If a colouring does not exist, a temporary can be kept in memory instead (spilled to memory).
  • Basically, move a temporary to memory if no registers are left.
  • Spilling process:
    • Let fa be the memory address for f.
    • f := load fa before using f.
    • store f fa after assigning to f.
  • This will reduce the live range of f.
  • f is then only live between each load and the operation that uses it, and between each operation that defines it and the following store.
  • It is best to spill the temporary with the most conflicts or with few definitions and uses.
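  • A sketch of a spilled temporary in MIPS (fa is the memory home of f; names are illustrative):

              .data
      fa:     .word 0               # memory home for the spilled temporary f
              .text
      main:   lw   $t8, fa          # load f just before it is used
              add  $t8, $t8, $t8    # use (and here redefine) f
              sw   $t8, fa          # store f straight after it is assigned
              li   $v0, 10
              syscall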