Notes based on F29LP course by Jamie Gabbay and Gudmund Grov

Formal Language

  • A formal language is a set of strings of symbols together with a set of rules that are specific to it. The symbols and formulas of such languages stand in precisely specified syntactic and semantic relations to one another.
  • A formal language consists of:
    • A set of atomic symbols called tokens.
    • A set of strings of these tokens called sentences.

Tokens

  • The alphabet of a formal language is the set of symbols, letters, or tokens from which the strings of the language may be formed.

Token String

  • A sequence of tokens.

Regular Expressions

  • A way to specify formal languages, where tokens are taken to be ASCII characters.
  • A regexp determines the language which is the set of strings that match that regexp.
  • A sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text.

How Regexps can specify the language of a set

The regexp a* specifies the language that is the set:

  • language(a*) = {[], a, aa, aaa …}

The regexp a+ specifies the language that is the set:

  • language(a+) = {a, aa, aaa …} (no empty string)

The regexp abc$ specifies the language that is the set:

  • language(abc$) = {abc, aabc, afsfhs224abc} (Anything ending in abc)
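
As a quick check, these languages can be tested with Python's re module (re.fullmatch matches the whole string, re.search looks for the $-anchored pattern anywhere):

    import re

    # language(a*): zero or more a's, including the empty string
    assert re.fullmatch(r"a*", "") is not None
    assert re.fullmatch(r"a*", "aaa") is not None

    # language(a+): one or more a's, so the empty string is rejected
    assert re.fullmatch(r"a+", "") is None
    assert re.fullmatch(r"a+", "aa") is not None

    # abc$ : any string ending in abc (searched, not fully matched)
    for s in ["abc", "aabc", "afsfhs224abc"]:
        assert re.search(r"abc$", s) is not None
    assert re.search(r"abc$", "abcd") is None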

Regexp syntax

  .      Match any single character except newline
  "."    Match anything inside quotation marks literally
  A*     Zero or more occurrences of A
  A+     One or more occurrences of A
  A?     Zero or one occurrence of A
  A/B    Match A only if followed by B
  ()     A series of regexps grouped together
  []     Match any character inside the brackets
  [^]    Match any character not inside the brackets
  [-]    Define a range of characters to match
  ^      Match the beginning of a line
  $      Match the end of a line
  {}     Cardinality of the pattern match (repetition count)
  \      Escapes meta-characters
  |      Disjunction: match either alternative

Above is the syntax used within regular expressions.
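
A few of these metacharacters checked with Python's re module; note that the "." quoted-literal form and the A/B trailing-context form are lex-style notation and are not part of Python's re syntax:

    import re

    assert re.fullmatch(r".", "x")             # . matches any one character except newline
    assert re.fullmatch(r"[abc]+", "cab")      # [] matches any listed character
    assert not re.fullmatch(r"[^abc]", "a")    # [^] excludes the listed characters
    assert re.fullmatch(r"[a-z]{3}", "dog")    # [-] gives a range, {} a repetition count
    assert re.search(r"^start", "start here")  # ^ anchors at the beginning of a line
    assert re.fullmatch(r"cat|dog", "dog")     # | is the disjunction (or) operator
    assert re.fullmatch(r"\$", "$")            # \ escapes a meta-character
    assert re.fullmatch(r"(ab)*", "abab")      # () groups a series of regexps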

Formal Grammar

  • A formal grammar is a set of production rules for strings in a formal language. The rules describe how to form strings from the language’s alphabet that are valid according to the language’s syntax.
  • A formal grammar generates a language.
  • With formal grammar we can:
    • Verify whether a sentence is in our language.
    • Synthesise legal programs (make correct/valid programs).

Production Rule

  • A production rule is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences.

Formal Grammar Terminology

  • V for the set of symbols (V for vocabulary).
  • V can be split into two sets called terminal and non terminal symbols.
  • V* set of all strings of elements of V (including the empty string) - aka closure of V.
  • V+ set of non-empty strings of elements of V.
  • Write ε for the empty string.
  • Sentences in V are just elements of V*.

Grammars

  • A grammar is a four-tuple consisting of:
  • N a set of nonterminal symbols.
  • T a set of terminal symbols, disjoint from N.
  • A start symbol in N.
  • A set of productions α ::= β.
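
As a minimal illustrative sketch (the grammar itself is made up, not from the notes), such a four-tuple could be written down in Python as:

    # Hypothetical grammar (N, T, S, P) generating one or more a's.
    N = {"S"}                    # nonterminal symbols
    T = {"a"}                    # terminal symbols, disjoint from N
    S = "S"                      # start symbol, a member of N
    P = [("S", "aS"),            # production S ::= aS
         ("S", "a")]             # production S ::= a
    grammar = (N, T, S, P)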

Notational conventions

  • A, B, C, S, T, … range over nonterminals N.
  • a,b,c … range over terminals T.
  • N U T a vocabulary. X, Y, Z range over N U T.
  • Strings of terminals: x, y, z.
  • String of terminals and/or nonterminals α, β, γ…

Object Language

  • Set of strings of terminals that we can produce using the production rules.
  • A language into which a program is translated by means of a compiler or assembler.

Meta Language

  • Set of all strings of terminals or nonterminals that we can produce using the production rules.
  • A form of language or set of terms used for the description or analysis of another language.

Generative Grammar

  • A type of grammar which describes a language in terms of a set of logical rules formulated so as to be capable of generating the infinite number of possible sentences of that language.

Chomsky Classifications of Grammars

There are four types of grammars (Type 0 to Type 3), each with a different production form.

Type 0

  • Production form: α ::= β.
  • α is a non-empty string of terminal and/or nonterminal symbols.
  • Productions have no restrictions and can contain pretty much anything.

Examples

  • S → ACaB
  • Bc → acB
  • CB → DB
  • aD → Db

Type 1

  • Context sensitive grammars contain productions of the form: αAγ ::= αβγ.
  • A denotes a single nonterminal (N).
  • α, β and γ denote strings of terminal and/or nonterminal symbols (N U T), with β non-empty.

Examples

  • AB → AbBc
  • A → bcA
  • B → b

Type 2

  • Context free grammars contain productions of the form: A ::= γ.
  • A denotes a single nonterminal (N); rules of this form are what BNF notation expresses.
  • γ denotes a string of terminals and/or nonterminals (N U T).
  • A BNF specification is a set of derivation rules.
  • Languages generated by these grammars are recognized by nondeterministic pushdown automata.
  • Good for languages like “the language of arithmetic”.

Examples

  • S → Xa
  • X → a
  • X → aX
  • X → abc
  • X → ε
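
A minimal Python sketch that derives sentences from these example productions by repeatedly expanding the leftmost nonterminal; the encoding (one-character symbols, where a symbol is a nonterminal exactly when it has productions) is purely illustrative:

    import random

    # Productions from the example grammar above (X → ε written as "").
    PRODUCTIONS = {
        "S": ["Xa"],
        "X": ["a", "aX", "abc", ""],
    }

    def derive(symbol="S", max_steps=20):
        """Randomly expand the leftmost nonterminal until only terminals remain."""
        sentential = symbol
        for _ in range(max_steps):
            nonterminals = [c for c in sentential if c in PRODUCTIONS]
            if not nonterminals:
                return sentential                    # a sentence: terminals only
            i = sentential.index(nonterminals[0])    # leftmost derivation
            sentential = (sentential[:i]
                          + random.choice(PRODUCTIONS[nonterminals[0]])
                          + sentential[i + 1:])
        return sentential                            # may still contain nonterminals

    print(derive())   # e.g. "aa", "abca", "aabca", ...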

Type 3

  • Regular grammars (equivalent in power to regexps) contain productions of the form:
    • A ::= aB
    • A ::= b
    • A ::= ε
  • Must have a single non-terminal on the left hand side.
  • The right hand side consists of a single terminal, possibly followed by a single nonterminal.
  • Good for identifying lexical units such as words.

Examples

  • X → ε
  • X → a | aY
  • Y → b
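
Since Type 3 grammars describe regular languages, the example above generates the language {ε, a, ab}, which the regexp (ab?)? also describes; a quick check in Python:

    import re

    # X → ε | a | aY with Y → b generates {ε, a, ab},
    # which the regexp (ab?)? also describes.
    pattern = re.compile(r"(ab?)?")
    for w in ["", "a", "ab"]:
        assert pattern.fullmatch(w)
    for w in ["b", "aa", "abb"]:
        assert not pattern.fullmatch(w)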

Most of the computer languages you know are determined by type 2 grammars (if (bool) then (exp) else (exp)); the keywords of those languages are determined by type 3 grammars (if, then, and else).

Nondeterministic

  • Nondeterministic algorithm is an algorithm that, even for the same input, can exhibit different behaviors on different runs.

Deterministic

  • Deterministic system is a system in which no randomness is involved in the development of future states of the system. A deterministic model will thus always produce the same output from a given starting condition or initial state.

Phrase

  • String derived from a nonterminal other than the start symbol.

Sentential Form

  • String derived from the start symbol.

Sentence

  • Sentential form without nonterminals.

Leftmost Derivation

  • Leftmost nonterminal is replaced/expanded, gives rise to leftmost sentential form.

Rightmost Derivation

  • Rightmost nonterminal is replaced/expanded, gives rise to the rightmost sentential form.

Parse Tree

  • A diagrammatic representation of the parsed structure of a sentence or string.
  • Remembers how a sentence was produced.
  • Parse trees may induce different intuitive meanings (different outputs/results).
  • For example, parsing 1 + 2 ∗ 3 as (1 + 2) ∗ 3 gives 9, but parsing it as 1 + (2 ∗ 3) gives 7.
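
A minimal Python sketch of the two parse trees for 1 + 2 ∗ 3 and the value each one yields; the nested-tuple encoding of trees is just illustrative:

    # The two parse trees for the sentence 1 + 2 * 3, encoded as nested tuples.
    mul_first = ("+", 1, ("*", 2, 3))   # tree for 1 + (2 * 3)
    add_first = ("*", ("+", 1, 2), 3)   # tree for (1 + 2) * 3

    def evaluate(node):
        """Evaluate a parse tree bottom-up."""
        if isinstance(node, int):
            return node
        op, left, right = node
        l, r = evaluate(left), evaluate(right)
        return l + r if op == "+" else l * r

    print(evaluate(mul_first))   # 7
    print(evaluate(add_first))   # 9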

Grammar Equivalency

  • Two different grammars can define the same language L.
  • Two grammars are equivalent when they describe the same language.
  • Equivalent grammars can define different parse trees.

Ambiguous Grammars

  • A grammar is ambiguous when we can produce two different parse trees for the same sentence.
  • The question of whether a given grammar is ambiguous is undecidable.

Unambiguous Grammars

  • Grammars that produce one parse tree for every sentence.

Recursive Grammars

  • A grammar is recursive when it references itself on the right-hand side of a production.
  • Production form example: A ::= …A…
  • Left recursive grammars have the production form: A ::= …|Aα|…
    • Left recursion tends to lead to non-termination (the parser can loop forever).
    • Eliminate it by replacing the left-recursive alternative with a new nonterminal A’ (a worked instance follows this list).
  • Right recursive grammars have the production form: A ::= …|αA|…
  • A grammar can be both left and right recursive.
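
A worked instance of the elimination step, using an illustrative expression grammar (not from the notes):

    Left recursive:       E  ::= E + T | T
    After elimination:    E  ::= T E’
                          E’ ::= + T E’ | ε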

Nondeterminism

  • A nondeterministic algorithm can, for the same input, exhibit different behaviors on different runs (outputs/results).
  • Production form: A ::= abα | abβ - A can go down two possible productions.
  • Can lead to backtracking during recursive descent (backtracking is going back to previous parts of the production, or back up the AST).
  • Eliminate nondeterminism by left factoring:
  • Split a production into a new production rule, for example:
    • Given A ::= aB | aC we can create A ::= aA’ and A’ ::= B | C.
    • A’ is created to defer the choice between B and C (see the sketch after this list).
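
A minimal recursive-descent sketch of the factored form, assuming illustrative terminals b and c in place of B and C, showing that one token of lookahead now picks the branch:

    def parse_A(tokens, pos=0):
        """A ::= a A'  - consume 'a', then let A' decide with one lookahead token."""
        if pos < len(tokens) and tokens[pos] == "a":
            return parse_A_prime(tokens, pos + 1)
        raise SyntaxError("expected 'a'")

    def parse_A_prime(tokens, pos):
        """A' ::= b | c  - the single lookahead token picks the branch."""
        if pos < len(tokens) and tokens[pos] in ("b", "c"):
            return pos + 1
        raise SyntaxError("expected 'b' or 'c'")

    print(parse_A(["a", "b"]))   # 2: both tokens were consumed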

Parser

  • A parser takes input in the form of a sequence of tokens or program instructions and usually builds a data structure in the form of a parse tree or an abstract syntax tree.

Lookahead

  • Lookahead is the maximum number of incoming tokens that a parser can inspect to decide which rule to apply.
  • In effect the parser has a window of upcoming tokens and uses the tokens within that window to decide the best possible path to take.

Finite Automata

  • Plural of automaton is automata.
  • A machine which performs a range of functions according to a predetermined set of coded instructions.
  • A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken from some character set. The job of an FA is to accept or reject an input depending on whether the pattern defined by the FA occurs in the input.
  • Two automata are called equivalent when they recognise the same language.

Deterministic Finite Automata (DFA)

  • A finite-state machine that accepts and rejects strings of symbols and only produces a unique computation (or run) of the automaton for each input string.
  • Set of words accepted by the DFA determines the language.
  • Scans input strings from left to right, at each letter the DFA’s state changes.
  • A DFA is defined as a five-tuple A = (Q, Σ, δ, q0, F):
    • Q : finite set of states.
    • Σ : input alphabet.
    • δ : state transition function; given the current state q and an input character a, it gives the new state.
    • q0 : initial state.
    • F : set of final (accepting) states.
  • Transition Function: δ : Q × Σ → Q.
  • q0 ∈ Q : initial state.
  • Set F ⊆ Q : final states; if reading the input ends in a state in F, the word is accepted, otherwise it is rejected.
  • The language recognised (accepted) by a DFA A is the set of words that A accepts, denoted L(A).
  • Languages that can be recognised by DFA are called regular.
  • Formal definition of language L(A): L(A) = {w ∈ Σ* | δ(q0, w) ∈ F}.
  • This is the set of words w over the alphabet Σ such that, if the machine reads input w in the initial state q0, then it reaches a final state.
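
A minimal Python sketch of this definition; the automaton below (words over {a, b} that end in b) is illustrative, and accepts implements the check δ(q0, w) ∈ F:

    # Illustrative DFA A = (Q, Σ, δ, q0, F) accepting words over {a, b} ending in b.
    Q = {"q0", "q1"}                           # finite state set
    SIGMA = {"a", "b"}                         # input alphabet
    DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
             ("q1", "a"): "q0", ("q1", "b"): "q1"}
    START = "q0"
    FINAL = {"q1"}

    def accepts(word):
        """Scan the word left to right, changing state at each letter."""
        state = START
        for letter in word:
            state = DELTA[(state, letter)]
        return state in FINAL       # accepted iff the run ends in a final state

    print(accepts("aab"))   # True
    print(accepts("aba"))   # False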

Transition Diagram

  • A labeled directed graph: vertices = states and edges = transitions, labeled with input symbols.

Nondeterministic Finite Automata (NFA)

  • NFAs generalise deterministic finite automata (DFAs) by allowing several outgoing transitions with the same label (zero or more) from a state.
  • NFAs accept exactly the regular languages (the same languages as regexps and DFAs).
  • An NFA accepts a word if some choice of transitions leads to a final state; runs that do not reach a final state are ignored.
  • NFAs may have different computations for the same input word.
  • An NFA is defined as a five-tuple A = (Q, Σ, δ, q0, F):
    • Q = the finite state set.
    • Σ = the input alphabet.
    • δ = the transition function.
    • q0 = the initial state.
    • F = the set of final states.
  • Similar to the DFA but the NFA transition function differs.
  • The NFA transition function takes into account all possible state transitions for an input letter: δ(q, a) ⊆ Q is the set of possible next states.
  • Writing 2^Q = {S | S ⊆ Q} for the power set of Q, we can write δ : Q × Σ → 2^Q.
  • The language recognised by an NFA is L(A) = {w ∈ Σ* | δ(q0, w) ∩ F ≠ ∅}.
  • That is, w is accepted if some run on w starting from q0 reaches a final state.
  • ε-moves extend NFAs by allowing spontaneous transitions: the automaton changes state without reading any input letter.
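
A minimal Python sketch of the subset-style simulation of δ : Q × Σ → 2^Q; the automaton (words containing the substring ab) is illustrative and ε-moves are omitted for brevity:

    # Illustrative NFA accepting words over {a, b} containing the substring ab.
    DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
             ("q1", "b"): {"q2"},
             ("q2", "a"): {"q2"}, ("q2", "b"): {"q2"}}
    START = "q0"
    FINAL = {"q2"}

    def accepts(word):
        """Track the set of all states reachable after reading each letter."""
        states = {START}
        for letter in word:
            states = set().union(*(DELTA.get((q, letter), set()) for q in states))
        return bool(states & FINAL)   # accepted iff δ(q0, w) ∩ F ≠ ∅

    print(accepts("bbab"))   # True
    print(accepts("ba"))     # False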

Pushdown Automata (PDA)

  • A PDA is a modified NFA with a stack.
  • PDAs recognise context free languages.
  • The stack contains a string of symbols which the PDA can read from (pop) or write to (push). The leftmost symbol is considered to be the top of the stack and is the only symbol the PDA has access to.
  • Input tape (input string) contains the symbols to be scanned symbol-by-symbol, ε-moves are possible.
  • Finite state control unit: A nondeterministic finite state machine whose transitions depend on the input symbol and the topmost stack symbol.
  • PDA normal moves (transition function) depend on:
    • Current state of the control unit.
    • Next input letter.
    • Topmost symbol on the stack.
  • PDAs may also change the state, pop (topmost) and push elements to the stack and move to the next input symbol.
  • Spontaneous ε-moves change the state and manipulate the stack without reading an input symbol (the input head does not move to the next symbol on the input tape).
  • An input word is accepted if the PDA reaches the empty stack after reading all input symbols.
  • An instantaneous description (ID) is a snapshot (core dump) of the PDA, denoted as a triple (q, w, γ).
    • q = current state.
    • w = remaining input (unread characters from the original input).
    • γ = the content of the stack.
  • The ID contains all the information needed to determine the subsequent steps of the computation.
    • (q1, w1, γ1) |- (q2, w2, γ2) = a move exists from the first ID to the second.
    • (q1, w1, γ1) |-* (q2, w2, γ2) = a (possibly empty) sequence of moves exists from the first ID to the second.
    • (q1, w1, γ1) |-n (q2, w2, γ2) = the first ID becomes the second ID in exactly n moves.
  • There are two modes of acceptance that can be considered to be equivalent. If there is a PDA that recognises language L using empty stack then there (effectively) exists another PDA that recognises L using final state(s), and vice versa.
    • Acceptance by empty stack occurs after all input letters are consumed and the stack is empty.
    • Acceptance by final state occurs when all input letters are consumed and end in a final state (accepting).

Formal Definition of a PDA

  • A pushdown automaton M consists of (Q, Σ, Γ, δ, q0, Z0, F).
  • Q = finite state set.
  • Σ = input alphabet.
  • Γ = finite stack alphabet.
  • q0 ∈ Q = the initial state.
  • Z0 ∈ Γ = start symbol of the stack.
  • F ⊆ Q = final states.
  • δ = transition function from Q × (Σ ∪ {ε}) × Γ to finite subsets of Q × Γ*.
  • δ(q, a, Z) = a set of possible outcomes, each a pair of a next state and a string to push in place of Z.
  • PDAs are nondeterministic in general, but a PDA is deterministic when every ID has at most one possible move.
  • Languages that can be recognised by deterministic PDA (using final states) are called deterministic context free languages.
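
A minimal Python sketch, assuming an illustrative PDA for { a^n b^n | n ≥ 0 } that accepts by empty stack; it searches over instantaneous descriptions (q, w, γ) as described above, and the state names and stack symbols are made up for the example:

    # Illustrative PDA for { a^n b^n | n >= 0 }, accepting by empty stack.
    # DELTA maps (state, input letter or "" for an ε-move, top stack symbol)
    # to a set of (new state, string pushed in place of the popped symbol).
    DELTA = {
        ("q0", "a", "Z"): {("q0", "AZ")},   # push an A for each a read
        ("q0", "a", "A"): {("q0", "AA")},
        ("q0", "b", "A"): {("q1", "")},     # start matching b's: pop one A
        ("q1", "b", "A"): {("q1", "")},
        ("q0", "", "Z"): {("q0", "")},      # ε-move: pop Z to accept the empty word
        ("q1", "", "Z"): {("q1", "")},      # ε-move: pop Z once all b's are matched
    }
    START, STACK_START = "q0", "Z"

    def accepts(word):
        """Search over instantaneous descriptions (q, w, stack); accept on empty stack."""
        ids = [(START, word, STACK_START)]
        seen = set()
        while ids:
            q, w, stack = ids.pop()
            if (q, w, stack) in seen:
                continue
            seen.add((q, w, stack))
            if not w and not stack:
                return True                      # all input read and stack empty
            if not stack:
                continue                         # dead branch: stack empty, input left
            top, rest = stack[0], stack[1:]
            if w:                                # normal move: consume one input letter
                for q2, push in DELTA.get((q, w[0], top), set()):
                    ids.append((q2, w[1:], push + rest))
            for q2, push in DELTA.get((q, "", top), set()):   # spontaneous ε-move
                ids.append((q2, w, push + rest))
        return False

    print(accepts("aabb"))   # True
    print(accepts("aab"))    # False
    print(accepts(""))       # True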