### Compiler Passes: Single vs. Multi-Pass

- **Single-Pass Compiler:**
  - Processes the source code only once.
  - Generates target code directly.
  - Faster compilation, lower memory usage.
  - Limited in handling forward references (e.g., jump targets, variables used before their declaration).
  - Suitable for simpler languages (e.g., Pascal).
- **Multi-Pass Compiler:**
  - Processes the source code multiple times, each pass performing a specific task (e.g., lexical analysis, parsing, semantic analysis, code generation).
  - Intermediate representations (IR) are generated between passes.
  - Handles forward references effectively.
  - More complex design, slower compilation, higher memory usage.
  - Suitable for complex languages (e.g., C++, Java).

### Token, Lexeme, and Pattern

- **Token:** A category representing a logically cohesive sequence of characters in the source code. Examples: `IDENTIFIER`, `KEYWORD`, `OPERATOR`, `NUMBER`.
- **Lexeme:** The actual sequence of characters in the source code that matches a pattern and is classified under a token. Examples: `myVar`, `if`, `=`, `123`.
- **Pattern:** A rule (often a regular expression) that describes the set of lexemes that can form a specific token. Examples: `letter (letter | digit)*` for `IDENTIFIER`, `if` for `KEYWORD`.

### Regular Expressions

1. **Set of all strings over {a, b} such that the fifth symbol from the right is 'a':**
   $$(a+b)^* \, a \, (a+b)(a+b)(a+b)(a+b)$$
2. **Set of all strings over {a, b} such that every block of four consecutive symbols contains at least two 'a's** (the original phrasing used '0' and '1'; it is restated here over {a, b}):
   A direct regular expression for this constraint is very long, and the condition is often better handled by a finite automaton. A simplified sketch views the string as blocks of four,
   $$((a+b)^4)^*$$
   where each `(a+b)^4` block must contain at least two 'a's.
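The first expression translates directly into a pattern that can be sanity-checked with Python's `re` module (an illustrative sketch; the function name and test strings are my own):

```python
import re

# (a+b)* a (a+b)(a+b)(a+b)(a+b): the fifth symbol from the right must be 'a'.
# In Python regex syntax the alternation (a+b) becomes the character class [ab].
FIFTH_FROM_RIGHT_A = re.compile(r"[ab]*a[ab]{4}")

def accepts(w):
    """True iff w is a string over {a, b} whose fifth symbol from the right is 'a'."""
    return FIFTH_FROM_RIGHT_A.fullmatch(w) is not None

print(accepts("aabbb"))  # True: the fifth symbol from the right is 'a'
print(accepts("babbb"))  # False: the fifth symbol from the right is 'b'
```

Note that `fullmatch` anchors the pattern at both ends, so strings shorter than five symbols are rejected automatically.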
More precisely, a block of four symbols containing at least two 'a's matches one of the patterns `aa(a+b)(a+b)`, `a(a+b)a(a+b)`, `a(a+b)(a+b)a`, `(a+b)aa(a+b)`, `(a+b)a(a+b)a`, `(a+b)(a+b)aa`, and so on; enforcing this for *every* window of four makes the expression unwieldy. A more practical representation uses the states of an automaton. *(Note: this type of regular expression is often simplified or handled by a state machine in practice due to its complexity.)*

### Short Notes

#### Context-Free Grammar (CFG)

- A formal grammar used to define the syntax of programming languages.
- Consists of a finite set of terminal symbols (the alphabet), a finite set of non-terminal symbols (variables), a start symbol, and a finite set of production rules.
- Production rules have the form $A \rightarrow \beta$, where $A$ is a single non-terminal and $\beta$ is a string of terminals and/or non-terminals.
- Used in the parsing phase of a compiler to construct a parse tree.

#### Yacc Parser Generator

- **Yacc** (Yet Another Compiler Compiler) is a program that generates a parser (the syntax-analysis portion of a compiler) from an input specification written in a form of Backus-Naur Form (BNF).
- It takes a grammar specification, optionally with semantic actions (C code), and produces a C source file containing the parsing tables and a C function `yyparse()`.
- Yacc parsers are typically LALR(1) parsers, which are efficient and handle a large class of context-free grammars.

#### Process of Generating a Lexical Analyzer Using Lex

1. **Specification:** The user writes a `.l` file containing regular expressions for tokens and associated C code actions.
2. **Compilation (Lex):** The `lex` tool processes the `.l` file and generates a C source file named `lex.yy.c`, containing a finite automaton (DFA) that recognizes the specified patterns.
3. **Compilation (C compiler):** The `lex.yy.c` file is compiled by a C compiler (e.g., `gcc`) into an executable scanner.
4.
**Execution:** When the compiled lexical analyzer is run, it reads input characters, matches them against the patterns, and executes the corresponding C actions (e.g., returning the token type and lexeme value to the parser).

### Cross Compiler & Bootstrapping

- **Cross Compiler:** A compiler that runs on one machine architecture (the host) but generates executable code for a different machine architecture (the target).
  - Example: compiling code for an embedded ARM processor on an x86 desktop.
- **Bootstrapping a Compiler to a Second Machine:**
  1. **Initial compiler (C1):** A compiler for language L, written in language A, running on machine M1 and producing code for M1.
  2. **New compiler (C2):** Write a compiler for L in L itself (self-hosting).
  3. **Compile C2 with C1:** Use C1 (on M1) to compile C2's source, producing an executable for M1; call it C2_M1.
  4. **Retarget C2 to M2:** Modify C2's source (still written in L) so that it generates code for the new machine M2, while still being compilable on M1; call this C2_M2_source.
  5. **Compile C2_M2_source with C2_M1:** Use C2_M1 (running on M1, generating M1 code) to compile C2_M2_source. This yields an executable compiler, C2_M2_executable, that runs on M1 but produces code for M2 (i.e., a cross compiler).
  6. **Port to M2:** C2_M2_executable is M1 machine code, so it cannot itself run on M2. Instead, use it (still on M1) to compile C2_M2_source one more time: since it emits M2 code, the output is a version of the compiler in M2 machine code that also targets M2. Transfer that binary to M2, where it can compile programs written in L into M2 code. This "recompiling itself" process is what allows the compiler to be ported.

### Compiler Design Example: `position := initial + rate * 60`

Let's trace the compilation of the statement `position := initial + rate * 60`:

1. **Lexical Analysis (Scanning):**
   - Input: `position := initial + rate * 60`
   - Output (stream of tokens):
     - `IDENTIFIER` (`position`)
     - `ASSIGN_OP` (`:=`)
     - `IDENTIFIER` (`initial`)
     - `PLUS_OP` (`+`)
     - `IDENTIFIER` (`rate`)
     - `MULT_OP` (`*`)
     - `NUMBER` (`60`)
2.
**Syntax Analysis (Parsing):**
   - Input: token stream
   - Output: parse tree (or abstract syntax tree, AST)
   - The parser uses the grammar rules to verify the syntax and build a hierarchical representation:

   ```
               ASSIGN
              /      \
     IDENTIFIER      EXPRESSION
     (position)     /    |     \
           IDENTIFIER  PLUS    TERM
           (initial)          /  |  \
                    IDENTIFIER MULT FACTOR
                    (rate)            (60)
   ```

3. **Semantic Analysis:**
   - Input: parse tree/AST
   - Output: annotated parse tree/AST (with type information and error checks)
   - Checks type compatibility: `initial` (float), `rate` (float), `60` (int). The `int` is implicitly converted to `float` for the multiplication, and the resulting `float` value is assigned to `position` (float).
   - Symbol-table lookups for `position`, `initial`, `rate`.
4. **Intermediate Code Generation:**
   - Input: annotated parse tree/AST
   - Output: three-address code (or another IR)

   ```
   t1 = inttoreal(60)   // type conversion
   t2 = id3 * t1        // id3 = rate
   t3 = id2 + t2        // id2 = initial
   id1 = t3             // id1 = position
   ```

5. **Code Optimization (optional):**
   - Input: intermediate code
   - Output: optimized intermediate code
   - Examples: constant folding (`t1 = 60.0`), strength reduction, common-subexpression elimination.

   ```
   t1 = 60.0            // constant folding
   t2 = rate * t1
   t3 = initial + t2
   position = t3
   ```

6. **Code Generation (Target Code):**
   - Input: optimized intermediate code
   - Output: assembly code (or machine code)

   ```assembly
   LD   R1, rate      ; load 'rate' into register R1
   MUL  R1, #60.0     ; multiply R1 by 60.0 (float)
   LD   R2, initial   ; load 'initial' into register R2
   ADD  R2, R1        ; add R1 (rate*60.0) to R2 (initial)
   ST   position, R2  ; store result in 'position'
   ```

### Finite Automata for Lexical Analysis

- **Usefulness:** Finite automata (FA) are fundamental to lexical analysis because they efficiently recognize the patterns (regular expressions) that define tokens.
- Regular expressions are converted into non-deterministic finite automata (NFA).
- NFAs are then converted into deterministic finite automata (DFA).
- DFAs are used to build the scanner (lexical analyzer) because they are deterministic: for each state and input symbol there is exactly one next state, which makes pattern matching very efficient.
- They allow the scanner to consume input characters and transition between states, identifying the longest possible lexeme that matches a token pattern.

#### NFA and DFA for `(a+b)*abb`

**1. NFA for `(a+b)*abb`:**

Thompson's construction produces a correct but large NFA with many ε-transitions; after simplification, a compact NFA is usually drawn directly:

- **States:** Q1, Q2, Q3, Q4
- **Start state:** Q1
- **Accept state:** Q4

```mermaid
graph LR
    Q1(("Q1 (start)")) -- a --> Q1
    Q1 -- b --> Q1
    Q1 -- a --> Q2
    Q2 -- b --> Q3
    Q3 -- b --> Q4(("Q4 (accept)"))
```

Q1 handles `(a+b)*` with its two self-loops; the automaton then nondeterministically guesses where the trailing `abb` begins, moving to Q2 on `a`, Q3 on `b`, and the accept state Q4 on the final `b`.

**2.
DFA for `(a+b)*abb` from the NFA above (by subset construction):**

- **States:** A = {Q1}, B = {Q1, Q2}, C = {Q1, Q3}, D = {Q1, Q4} (accept)
- **Start state:** A
- **Accept state:** D

| State | Input 'a' | Input 'b' |
| :---- | :------------- | :------------- |
| A | {Q1, Q2} = B | {Q1} = A |
| B | {Q1, Q2} = B | {Q1, Q3} = C |
| C | {Q1, Q2} = B | {Q1, Q4} = D |
| D | {Q1, Q2} = B | {Q1} = A |

```mermaid
graph LR
    A(("A (start)")) -- a --> B
    A -- b --> A
    B -- a --> B
    B -- b --> C
    C -- a --> B
    C -- b --> D(("D (accept)"))
    D -- a --> B
    D -- b --> A
```

*(Note: the intermediate states depend on the specific NFA used as input; the result above is the minimal DFA for this language.)*

### BNF Notation

- **BNF (Backus-Naur Form):** A metasyntax used to express context-free grammars. It is a standard way to describe the syntax of programming languages.
- **Components:**
  - **Non-terminal symbols:** Enclosed in angle brackets, e.g., `<expr>`, `<digit>`. These represent syntactic categories that can be broken down further.
  - **Terminal symbols:** Literal characters or keywords of the language, e.g., `+`, `if`, `id`.
  - **Production rules:** Use the `::=` (or `->`) symbol to define how a non-terminal can be replaced by a sequence of terminals and/or non-terminals.
  - **Alternation:** The `|` symbol indicates a choice between different alternatives for a non-terminal.
- **Example:**

```bnf
<digit>  ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<number> ::= <digit> | <number> <digit>
<expr>   ::= <term> | <expr> "+" <term> | <expr> "-" <term>
<term>   ::= <factor> | <term> "*" <factor> | <term> "/" <factor>
<factor> ::= <number> | "(" <expr> ")"
```

#### How is it different from CFG?

- BNF **is a notation for** context-free grammars (CFGs). It is not fundamentally different, but rather a specific textual representation or syntax for writing CFGs down.
- All grammars written in BNF are CFGs.
- There are variations of BNF (e.g., EBNF, Extended BNF; ABNF, Augmented BNF) that add features such as optional elements `[ ]`, repetition `{ }`, and grouping `( )` for conciseness, but these features can always be translated back into pure CFG rules.
- So the relationship is that BNF is a way to *write* a CFG.

### Ambiguous Grammar

- **Definition:** A grammar is **ambiguous** if there exists at least one string that can be generated by the grammar in more than one way (i.e., it has more than one leftmost derivation, rightmost derivation, or parse tree).
- Ambiguity is problematic for compilers because it means the parser cannot uniquely determine the structure (and thus the meaning) of a program statement.

#### Showing `S → aSbS | bSaS | ϵ` is ambiguous

Consider the string `abab`. It has two distinct leftmost derivations.

**Derivation 1 (leftmost):**

1. `S` $\rightarrow$ `aSbS`
2. `aSbS` $\rightarrow$ `abS` (leftmost `S` $\rightarrow$ `ϵ`)
3. `abS` $\rightarrow$ `abaSbS` (`S` $\rightarrow$ `aSbS`)
4. `abaSbS` $\rightarrow$ `ababS` (leftmost `S` $\rightarrow$ `ϵ`)
5. `ababS` $\rightarrow$ `abab` (`S` $\rightarrow$ `ϵ`)

**Derivation 2 (leftmost):**

1. `S` $\rightarrow$ `aSbS`
2. `aSbS` $\rightarrow$ `abSaSbS` (leftmost `S` $\rightarrow$ `bSaS`)
3. `abSaSbS` $\rightarrow$ `abaSbS` (leftmost `S` $\rightarrow$ `ϵ`)
4. `abaSbS` $\rightarrow$ `ababS` (leftmost `S` $\rightarrow$ `ϵ`)
5. `ababS` $\rightarrow$ `abab` (`S` $\rightarrow$ `ϵ`)

The two derivations choose different productions for the first `S` of `aSbS`, yet both yield `abab`, so the grammar is ambiguous.
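The ambiguity can also be checked mechanically. The sketch below (illustrative only, not part of any compiler) counts the distinct parse trees that `S → aSbS | bSaS | ϵ` admits for a given string; a count greater than 1 for any string proves the grammar is ambiguous:

```python
from functools import lru_cache

def count_parse_trees(s):
    """Count distinct parse trees of s under S -> aSbS | bSaS | epsilon."""
    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving the substring s[i:j].
        total = 1 if i == j else 0                 # S -> epsilon
        for first, second in (("a", "b"), ("b", "a")):
            # S -> first S second S: s[i] is `first`, s[k] is `second`,
            # and the two inner S's derive s[i+1:k] and s[k+1:j].
            if i < j and s[i] == first:
                for k in range(i + 1, j):
                    if s[k] == second:
                        total += count(i + 1, k) * count(k + 1, j)
        return total
    return count(0, len(s))

print(count_parse_trees("abab"))  # 2: two distinct trees, so the grammar is ambiguous
print(count_parse_trees("ab"))    # 1: this string has only one tree
```

The memoized recursion is essentially a tiny CYK-style chart parser specialized to this grammar's two binary-split productions.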
The ambiguity stems from how the `ϵ` rule interacts with the two symmetric recursive productions: the same string can be bracketed in more than one way. This shows up directly as two different **parse trees** for `abab`.

**Parse Tree 1:**

```
         S
      /  |  \  \
     a   S   b   S
         |      /  |  \  \
         ϵ     a   S   b   S
                   |       |
                   ϵ       ϵ
```

**Parse Tree 2:**

```
         S
      /  |  \  \
     a   S   b   S
       / | \  \   |
      b  S  a  S  ϵ
         |     |
         ϵ     ϵ
```

Since there are two distinct parse trees for the string `abab`, the grammar is ambiguous.

### Finite Automaton for Binary Numbers Divisible by Three

We need a DFA that accepts binary strings `w` such that the value of `w` satisfies `w mod 3 = 0`. The states represent the remainder when the binary number processed so far is divided by 3.

- **States:** Q0 (remainder 0), Q1 (remainder 1), Q2 (remainder 2)
- **Start state:** Q0 (the empty string has value 0, and 0 mod 3 = 0)
- **Accept state:** Q0

**Transitions:** When we append a bit `b` to a number `N`, the new value is `2N + b`, so the new remainder is `(2 * current_remainder + b) mod 3`.

- **From Q0 (current_remainder = 0):**
  - Input '0': `(2*0 + 0) mod 3 = 0`. Transition to Q0.
  - Input '1': `(2*0 + 1) mod 3 = 1`. Transition to Q1.
- **From Q1 (current_remainder = 1):**
  - Input '0': `(2*1 + 0) mod 3 = 2`. Transition to Q2.
  - Input '1': `(2*1 + 1) mod 3 = 0`. Transition to Q0.
- **From Q2 (current_remainder = 2):**
  - Input '0': `(2*2 + 0) mod 3 = 4 mod 3 = 1`. Transition to Q1.
  - Input '1': `(2*2 + 1) mod 3 = 5 mod 3 = 2`. Transition to Q2.

**DFA Diagram:**

```mermaid
graph LR
    Q0(("Q0: remainder 0")) -- 0 --> Q0
    Q0 -- 1 --> Q1(("Q1: remainder 1"))
    Q1 -- 0 --> Q2(("Q2: remainder 2"))
    Q1 -- 1 --> Q0
    Q2 -- 0 --> Q1
    Q2 -- 1 --> Q2
```

- **Start state:** Q0
- **Accept state:** Q0

**Example traces:**

- `11` (binary 3):
  - Start at Q0.
  - Input '1': Q0 $\rightarrow$ Q1.
  - Input '1': Q1 $\rightarrow$ Q0.
  - End in Q0 (accept). Correct: 3 is divisible by 3.
- `10` (binary 2):
  - Start at Q0.
  - Input '1': Q0 $\rightarrow$ Q1.
  - Input '0': Q1 $\rightarrow$ Q2.
  - End in Q2 (reject). Correct: 2 is not divisible by 3.
- `1001` (binary 9):
  - Start at Q0.
  - Input '1': Q0 $\rightarrow$ Q1.
  - Input '0': Q1 $\rightarrow$ Q2.
  - Input '0': Q2 $\rightarrow$ Q1.
  - Input '1': Q1 $\rightarrow$ Q0.
  - End in Q0 (accept). Correct: 9 is divisible by 3.
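The traces above can be automated with a small table-driven simulation of the DFA (a sketch; the names are my own). The transition table simply encodes the `(2 * remainder + bit) mod 3` rule:

```python
# Transition table of the DFA: a state is the remainder mod 3 of the
# number read so far; reading bit b moves state q to (2*q + b) mod 3.
DELTA = {(q, b): (2 * q + b) % 3 for q in range(3) for b in (0, 1)}

def divisible_by_three(bits):
    """Run the DFA on a binary string; accept iff it ends in state Q0 (remainder 0)."""
    state = 0  # start state Q0
    for ch in bits:
        state = DELTA[(state, int(ch))]
    return state == 0  # accept state is Q0

for w in ("11", "10", "1001"):
    print(w, "accept" if divisible_by_three(w) else "reject")
```

As a cross-check, `divisible_by_three(w)` agrees with `int(w, 2) % 3 == 0` for every non-empty binary string.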