Unit 1: Short Questions (2 Marks Each) 1. What is the role of a lexical analyzer? Introduction: The lexical analyzer, also known as a scanner, is the first phase of a compiler. Explanation: Its primary role is to read the input source code character by character and group them into meaningful sequences called lexemes. These lexemes are then converted into tokens for the subsequent syntax analysis phase. 2. What is a token? Give an example. Introduction: A token is a fundamental building block recognized by the lexical analyzer. Explanation: It is a pair consisting of a token name (e.g., `id`, `keyword`, `operator`) and an optional attribute value (e.g., the actual identifier name or literal value). Tokens represent a sequence of characters having a collective meaning in the source code. Example: For the statement count = 10; , the tokens would be: <id, "count"> <assign_op, "="> <number, "10"> <semicolon, ";"> 3. What is the purpose of the error handler in a compiler? Introduction: The error handler is a crucial component of a compiler. Explanation: Its main purpose is to detect and report errors during various phases of compilation. Upon detection, it attempts to recover from the error to continue processing the rest of the source code, thereby providing more comprehensive error messages to the user. This involves mechanisms like error reporting, error recovery, and error correction. 4. Explain the difference between a source program and an object program. Introduction: Source and object programs represent different stages of a program's lifecycle. Explanation: Source Program: This is the program written by a developer in a high-level programming language (e.g., C++, Java, Python). It is human-readable and contains instructions in a syntax and semantics specified by the language. Object Program: This is the output of a compiler or assembler, typically in machine language or an intermediate representation. It is directly executable by the computer's CPU (after linking) and is generally not human-readable. 5. Difference between One Pass and Multi Pass Compiler? Introduction: Compilers can be categorized based on the number of passes they make over the source code. Explanation: One-Pass Compiler: Processes the entire source code in a single scan. It generates target code directly without needing to revisit earlier parts. This is simpler to implement but imposes restrictions on language features (e.g., forward references must be handled carefully). Multi-Pass Compiler: Processes the source code multiple times, with each pass performing a specific task (e.g., lexical analysis, syntax analysis, semantic analysis, code generation). This allows for greater optimization and supports more complex language features by gathering information in earlier passes for use in later ones. 6. Define front-end and back-end of a compiler. Introduction: A compiler's architecture is often divided into two main parts. Explanation: Front-End: This part is machine-independent and language-dependent. It includes phases like lexical analysis, syntax analysis, semantic analysis, and intermediate code generation. Its primary job is to understand the source code and convert it into an intermediate representation (IR) that is free from syntax and semantic errors. Back-End: This part is machine-dependent and language-independent. It includes phases like code optimization and target code generation. 
Its main task is to take the IR from the front-end and transform it into optimized machine code for a specific target architecture. 7. What are language processors? Introduction: Language processors are software systems that translate programs written in one computer language into another. Explanation: They facilitate communication between programmers and computers. Examples include compilers (translate high-level to machine code), interpreters (execute high-level code line by line), assemblers (translate assembly language to machine code), and preprocessors (perform macro expansion and file inclusion). They are essential for turning human-readable code into executable instructions. 8. What is an assembler? Introduction: An assembler is a type of language processor. Explanation: Its specific role is to translate programs written in assembly language into machine code. Assembly language is a low-level programming language that uses symbolic representations of machine instructions, registers, and memory locations. The assembler converts these mnemonics and symbols into their binary equivalents, creating an executable object file. 9. What are the main tasks of the code generator phase? Introduction: The code generator is the final phase of a compiler's back-end. Explanation: Its main tasks include: Instruction Selection: Choosing appropriate machine instructions for each IR operation. Register Allocation: Deciding which variables reside in registers (fast access) and which in memory (slow access) to minimize memory access. Instruction Ordering: Determining the order of instructions to avoid pipeline stalls and optimize execution. Address Allocation: Assigning memory locations for variables. 10. Difference between Compiler and Interpreter? Introduction: Compilers and interpreters are both language processors, but they operate differently. Explanation: Compiler: Translates the entire high-level source code into machine code (or an intermediate form) before execution. The compilation process happens once. The resulting object code can be executed multiple times without recompilation. Error detection occurs during compilation. Interpreter: Translates and executes the source code line by line. It does not produce an executable object file. Each time the program runs, it is interpreted. Error detection occurs during execution, stopping at the first error encountered. Unit 1: Long Questions (10 Marks Each) 1. Explain the phases of a compiler with a neat diagram and example. Introduction: A compiler is a complex software system that translates a program written in a high-level language into an equivalent program in a low-level language, typically machine code. This translation process is broken down into several distinct phases, each performing a specific task. Explanation: The compilation process involves sequential phases, often interacting with a Symbol Table and an Error Handler. The phases are: Lexical Analysis (Scanning): Role: Reads the stream of characters from the source program and groups them into meaningful sequences called lexemes. These lexemes are then converted into tokens. Output: A stream of tokens. Example: For position = initial + rate * 60 , it generates tokens like <id, "position"> , <assign_op, "="> , <id, "initial"> , etc. Syntax Analysis (Parsing): Role: Takes the token stream from the lexical analyzer and builds a hierarchical structure, typically a parse tree or syntax tree (abstract syntax tree - AST). 
It checks if the token stream conforms to the grammatical rules of the language. Output: Parse tree or AST. Example: For the above tokens, it verifies if id = id + id * num is a valid expression according to the language's grammar. Semantic Analysis: Role: Checks the source program for semantic consistency with the language definition. This includes type checking, ensuring variables are declared before use, and checking for compatible types in expressions. It gathers type information and saves it in the symbol table. Output: Annotated syntax tree (if no errors). Example: Checks if initial and rate are declared and compatible with arithmetic operations, and if 60 is a valid integer. Intermediate Code Generation: Role: Generates an explicit intermediate representation (IR) of the source program. This IR should be easy to produce and easy to translate into the target machine code. Common forms include three-address code, quadruples, triples, or postfix notation. Output: Intermediate code. Example:
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code Optimization: Role: Attempts to improve the intermediate code to produce a target code that runs faster, takes less space, or consumes less power. Optimizations can be machine-independent or machine-dependent. Output: Optimized intermediate code. Example: If 60 is a constant, inttoreal(60) can be replaced by 60.0 directly. Other examples are common sub-expression elimination and dead code elimination. Target Code Generation: Role: Takes the optimized intermediate code and maps it into actual machine instructions (relocatable machine code) for the target architecture. This phase involves instruction selection, register allocation, and instruction ordering. Output: Target machine code (e.g., assembly code or binary). Example (Assembly):
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
Diagram: Phases of a Compiler: Source Program → Lexical Analyzer → (tokens) → Syntax Analyzer → (parse tree / AST) → Semantic Analyzer → (annotated AST) → Intermediate Code Generator → (intermediate code) → Code Optimizer → (optimized IR) → Code Generator → Target Machine Code. The Symbol Table and the Error Handler interact with every phase.
2. Explain the role and structure of a lexical analyzer. Introduction: The lexical analyzer, also known as a scanner, is the first phase of a compiler, responsible for breaking down the source code into a stream of tokens. Explanation: Role of Lexical Analyzer: Streamlining Input: It reads the source code character by character, which is a low-level operation, and groups these characters into meaningful units called lexemes. Token Generation: For each lexeme, it generates a token, which is a pair <token-name, attribute-value>. The token name represents the syntactic category (e.g., `identifier`, `keyword`, `operator`), and the attribute value points to an entry in the symbol table for identifiers or constants. Removing Whitespace and Comments: It filters out non-essential characters like spaces, tabs, newlines, and comments from the source code, which are irrelevant for syntax and semantics. Error Detection: It detects and reports lexical errors, such as illegal characters or malformed numbers/identifiers. Symbol Table Creation/Update: When an identifier or literal is encountered for the first time, its information (e.g., name, type) is entered into the symbol table. Structure of Lexical Analyzer: The lexical analyzer is typically implemented using finite automata, which can be deterministic (DFA) or non-deterministic (NFA). Input Buffer: The lexical analyzer reads input from a buffer to improve efficiency.
Two-buffer input schemes are common to reduce overhead for character lookahead. It uses two pointers: `lexemeBegin` and `forward`. `lexemeBegin` marks the beginning of the current lexeme, and `forward` scans ahead to find the end of the lexeme. Lexical Specification: The patterns for tokens are defined using regular expressions. For example: Identifiers: `letter (letter | digit)*` Numbers: `digit+ (. digit+)? (E (+ | -)? digit+)?` Keywords: `if`, `else`, `while` (fixed strings) Finite Automata: These regular expressions are then converted into NFAs and subsequently into DFAs. The DFA is used to recognize valid lexemes in the input stream. When the DFA reaches an accepting state, a lexeme has been found. Token Output Mechanism: Once a lexeme is matched, the corresponding token <token-name, attribute-value> is generated and passed to the parser. The attribute value might be a pointer to the symbol table or the actual value for literals. Diagram: Interaction of Lexical Analyzer with other phases Source Program Lexical Analyzer Tokens Syntax Analyzer Symbol Table Error Handler 3. What are the different types of errors in a compiler? Explain error handling. Introduction: Errors are inevitable in programming, and a compiler must be robust enough to detect, report, and often recover from them to provide useful feedback to the programmer. Explanation: Types of Errors: Lexical Errors: Description: Errors detected during the scanning phase, usually involving illegal characters or malformed tokens. Example: An illegal character like # in C, or a number like 123.4.5 . Detection: When the lexical analyzer cannot match a sequence of characters to any valid regular expression pattern. Syntax Errors: Description: Errors detected during the parsing phase, where the token stream violates the grammatical rules (syntax) of the programming language. Example: Missing semicolon at the end of a statement, unmatched parentheses, or an `if` statement without a corresponding `then`. Detection: When the parser cannot construct a valid parse tree for the given token sequence according to the grammar rules. Semantic Errors: Description: Errors that are syntactically correct but violate the meaning (semantics) of the language. These are detected during the semantic analysis phase. Example: Type mismatches (e.g., adding an integer to a string), undeclared variables, functions called with incorrect number or types of arguments, division by zero. Detection: When the semantic analyzer checks for type compatibility, scope rules, and other logical constraints. Logical Errors (Runtime Errors): Description: These are not typically detected by the compiler. They represent flaws in the program's logic that lead to incorrect output or unexpected behavior during execution. Example: Infinite loops, incorrect algorithm implementation, off-by-one errors in loops. Detection: Primarily through testing and debugging of the compiled program. Error Handling: The compiler's error handler performs four main tasks: Error Detection: Identifying that an error has occurred. Error Reporting: Providing a clear and informative message to the user, including the type of error and its location (line number, column). Error Recovery: Attempting to repair the error or adjust the state of the compiler so that it can continue processing the remainder of the source code. This prevents a cascade of misleading error messages. 
Panic Mode Recovery: On finding an error, discard input symbols one at a time until a synchronizing token (e.g., semicolon, brace) is found. Simple but may discard too much input. Phrase-Level Recovery: Perform local correction on the input stream (e.g., inserting a missing semicolon, deleting an extraneous comma). More sophisticated but harder to implement. Error Productions: Add error productions to the grammar that specify common errors, allowing the parser to recognize them and issue specific messages. Global Correction: Find a parse tree for the erroneous input that requires the fewest changes. Computationally expensive and rarely used in practice. Error Correction: In rare cases, the compiler might attempt to automatically fix minor errors, though this is risky as the compiler might misinterpret the programmer's intent. 4. Explain the concept of bootstrapping and cross-compilers. Introduction: Bootstrapping and cross-compilation are two important concepts in compiler development, particularly when developing compilers for new languages or new hardware architectures. Explanation: Bootstrapping: Definition: Bootstrapping is the process of writing a compiler (or any complex system) in the very language it is intended to compile. It's a self-generating or self-sustaining process. Why it's used: To develop a compiler for a new language (e.g., C compiler written in C). To port a compiler to a new machine. To improve an existing compiler. Process: Start with a small subset of the language, for which a simple compiler can be written, perhaps in assembly language or an existing high-level language. Use this basic compiler to compile a more complete version of the compiler, written in the subset language. Iteratively use the more complete compiler to compile even more sophisticated versions of itself, until a full-featured compiler for the language is achieved, written entirely in that language. T-Diagrams: A common way to represent bootstrapping is using T-diagrams. A compiler written in language $L_i$ that translates source language $L_s$ to target language $L_t$ is represented as: L_s ----- L_i ----- L_t Bootstrapping involves using an existing compiler to compile a new compiler for the same language. For example, to build a C compiler (C to Machine M) written in C, on machine M: C C ----- ----- ASM C ----- ----- M M The C compiler written in ASM on M compiles the C compiler written in C, producing a C compiler in M. Cross-Compiler: Definition: A cross-compiler is a compiler that runs on one computer architecture (the host system) but generates executable code for a different computer architecture (the target system). Why it's used: Embedded Systems: Often, embedded systems have limited resources (memory, processing power) and cannot host a full-fledged compiler. Code is developed on a powerful workstation and cross-compiled for the embedded target. New Architectures: When a new CPU architecture is being developed, a cross-compiler is needed to generate code for it before native compilers are available. Resource Constraints: To compile programs for systems with different operating systems or environments. Example: A C compiler running on an x86-64 Linux machine that produces executable binaries for an ARM-based microcontroller. The host is x86-64 Linux, the target is ARM. T-Diagram Example: A C to ARM compiler running on an x86 machine: C ----- x86 ----- ARM Here, the compiler itself runs on x86, but its output is for ARM. 5. Explain Input Buffering in detail. 
Introduction: Input buffering is a technique used in the lexical analysis phase of a compiler to efficiently read and process the source program, minimizing disk I/O operations and speeding up character-by-character scanning. Explanation: When a lexical analyzer reads characters from the source file, it typically doesn't read one character at a time directly from disk. Disk access is slow. Instead, it reads a large block of characters into a buffer in main memory. This significantly reduces the number of slow I/O operations. Types of Input Buffering Schemes: Single Buffer Scheme: The entire input is loaded into one large buffer. Two pointers, `lexemeBegin` and `forward`, are used. `lexemeBegin` points to the start of the current lexeme, and `forward` scans ahead to find its end. When `forward` reaches the end of the buffer, the entire buffer needs to be refilled from the input file. Disadvantage: If a lexeme spans across the buffer boundary (i.e., `forward` is at the end, but the lexeme isn't complete), handling this requires special logic and might involve moving characters or performing extra I/O. This can be inefficient. Two-Buffer Scheme (Commonly Used): To mitigate the issues of the single buffer scheme, two alternating buffers are used, each capable of holding $N$ characters (where $N$ is typically a block size, e.g., 4096 bytes). When the `forward` pointer reaches the end of one buffer, the other buffer is filled with the next block of input from the source file, while the lexical analyzer continues scanning from the newly filled buffer. This allows for continuous scanning without interruption for I/O. Pointers: `lexemeBegin`: Points to the beginning of the current lexeme. `forward`: Scans ahead to locate the end of the current lexeme. Mechanism: Initially, buffer 1 is filled. The `forward` pointer scans through buffer 1. When `forward` reaches the end of buffer 1, buffer 2 is filled from the input. `forward` wraps around to the beginning of buffer 2. When `forward` reaches the end of buffer 2, buffer 1 is filled, and `forward` wraps to buffer 1. A special "sentinel" character (e.g., `EOF` or a non-source character like `\0`) is often placed at the end of each buffer half to simplify boundary checks. Instead of checking `if (forward == end_of_buffer)` for every character, the check is only done for the sentinel. If `forward` encounters the sentinel, it checks if it's the actual end of file or just a buffer boundary. Advantage: Reduces the number of checks for buffer boundaries on each character, as only the sentinel needs to be checked. This makes the inner loop of the lexical analyzer much faster. It handles lexemes spanning buffer boundaries smoothly. Diagram: Two-Buffer Input Scheme Buffer 1 (Filled) Buffer 2 (Filling/Filled) lexemeBegin forward Input Stream ... ... Unit 2: Short Questions (2 Marks Each) 1. Define ambiguous grammar. Introduction: An ambiguous grammar is a context-free grammar that allows a string to have more than one leftmost derivation or more than one rightmost derivation, and thus more than one parse tree. Explanation: If a grammar produces two or more distinct parse trees for the same sentence, it is ambiguous. This ambiguity makes it difficult for a parser to uniquely determine the structure of a program, leading to issues in semantic analysis and code generation. For example, $E \to E + E \mid E * E \mid id$ is ambiguous for the string $id + id * id$. 2. What are the inputs and outputs of the syntax analyzer? 
Introduction: The syntax analyzer, or parser, is the second phase of a compiler. Explanation: Input: It receives a stream of tokens from the lexical analyzer. Each token represents a lexeme from the source code. Output: It produces a parse tree or an Abstract Syntax Tree (AST) if the input program is syntactically correct. This hierarchical representation shows the grammatical structure of the source code. If there are syntax errors, it reports them.
3. What is the role of a parser in a compiler? Introduction: The parser is the syntax analysis phase of a compiler. Explanation: Its primary role is to determine if the sequence of tokens received from the lexical analyzer conforms to the grammatical rules (syntax) of the programming language. If it does, the parser constructs a parse tree or an Abstract Syntax Tree (AST) that represents the hierarchical structure of the input. If not, it detects and reports syntax errors. It ensures that the program is well-formed according to the language specification.
4. What is derivation in parsing? Introduction: Derivation is a fundamental concept in context-free grammars and parsing. Explanation: It refers to the process of starting from the grammar's start symbol and repeatedly applying production rules to replace a non-terminal with its right-hand side. This process continues until a string consisting only of terminal symbols is generated. If the leftmost non-terminal is replaced at each step, it is a leftmost derivation; if the rightmost non-terminal is replaced at each step, it is a rightmost derivation. Derivations demonstrate how a given string can be generated by a grammar. Example: For $S \to aS \mid b$, a derivation for $aab$ is $S \Rightarrow aS \Rightarrow aaS \Rightarrow aab$.
5. What is a recursive descent parser? Introduction: A recursive descent parser is a type of top-down parser. Explanation: It consists of a set of recursive procedures, one for each non-terminal in the grammar. Each procedure is responsible for parsing the language construct corresponding to its non-terminal. It attempts to match the input with the right-hand side of a production rule by calling other procedures (for non-terminals) and matching terminal symbols directly. It uses backtracking if multiple choices exist for a non-terminal, though predictive recursive descent avoids this by using lookahead.
6. What is shift and reduce operation in bottom-up parsing? Introduction: Shift-reduce parsing is a common form of bottom-up parsing. Explanation: It uses a stack and an input buffer. Shift: The parser moves the next input symbol onto the stack. This is analogous to "reading" the input. Reduce: When the symbols on top of the stack match the right-hand side of a grammar production (a handle), the parser replaces that sequence of symbols on the stack with the non-terminal on the left-hand side of the production. This signifies that a grammatical constituent has been recognized.
7. What are handle and handle pruning? Introduction: Handle and handle pruning are key concepts in bottom-up parsing. Explanation: Handle: A handle of a right-sentential form $\gamma$ is a production $A \to \beta$ such that: $\beta$ appears as a substring in $\gamma$. Replacing $\beta$ by $A$ in $\gamma$ yields the previous right-sentential form in a rightmost derivation. The position of $\beta$ in $\gamma$ is unique. Simply put, it's a substring that matches the RHS of a production and can be "reduced".
Handle Pruning: This is the process of repeatedly identifying a handle and replacing it with the corresponding non-terminal on the left-hand side of the production. This process continues until the start symbol is reached. It's how a bottom-up parser constructs the parse tree in reverse.
Unit 2: Long Questions (10 Marks Each)
1. Explain Recursive Descent Parsing with example. Introduction: Recursive Descent Parsing is a top-down parsing technique that constructs the parse tree from the root down to the leaves. It is a simple and intuitive method where a set of recursive procedures are used, one for each non-terminal in the grammar. Explanation: In a recursive descent parser, each non-terminal symbol in the grammar has a corresponding procedure. This procedure is responsible for matching the input string according to the production rules for that non-terminal. When a procedure for a non-terminal is called, it tries to match the current input symbols with one of the alternatives on the right-hand side (RHS) of its production rule. This involves: Calling procedures for non-terminals encountered on the RHS. Matching terminal symbols with the current input token. Advancing the input pointer (lookahead) upon successful matches. If a non-terminal has multiple productions, the parser must decide which production to use. This often requires looking ahead at the next input token. If the grammar is LL(1) (Left-to-right scan, Leftmost derivation, 1 lookahead token), this decision can be made deterministically without backtracking. Requirements for Recursive Descent Parsing: No Left Recursion: The grammar must not have left recursion (e.g., $A \to A\alpha$). Left recursion causes infinite loops in recursive descent parsers. It must be eliminated. No Ambiguity: The grammar should not be ambiguous. Left Factoring (for Predictive Parsers): If a non-terminal has multiple productions whose RHSs share a common prefix (e.g., $A \to \alpha\beta_1 \mid \alpha\beta_2$), left factoring must be applied to make the parsing decision deterministic using lookahead. Example: Arithmetic Expression Grammar. Consider the following grammar for simple arithmetic expressions (after eliminating left recursion and left factoring):
E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id
Let's consider the input string: id + id * id
Parser Procedures (Conceptual):
// Global variable to hold the current input token
token lookahead;

// Function to advance the input token
void match(token expected_token) {
    if (lookahead == expected_token) {
        lookahead = getNextToken();   // advance input
    } else {
        error("Syntax error: expected %s", expected_token);
    }
}

// Procedure for non-terminal E
void E() { T(); E_prime(); }

// Procedure for non-terminal E_prime
void E_prime() {
    if (lookahead == PLUS) {          // + T E'
        match(PLUS); T(); E_prime();
    } else {
        // ε: do nothing
    }
}

// Procedure for non-terminal T
void T() { F(); T_prime(); }

// Procedure for non-terminal T_prime
void T_prime() {
    if (lookahead == MULT) {          // * F T'
        match(MULT); F(); T_prime();
    } else {
        // ε: do nothing
    }
}

// Procedure for non-terminal F
void F() {
    if (lookahead == ID) {            // id
        match(ID);
    } else if (lookahead == LPAREN) { // ( E )
        match(LPAREN); E(); match(RPAREN);
    } else {
        error("Syntax error: expected id or (");
    }
}
Trace for input `id + id * id` (simplified): Call E() Call T() Call F() lookahead is `id`. Match `id`. lookahead becomes `+`. Call T_prime() lookahead is `+`. It's not `*`. Choose `ε`. Return. Call E_prime() lookahead is `+`. Match `+`.
lookahead becomes `id`. Call T() Call F() lookahead is `id`. Match `id`. lookahead becomes `*`. Call T_prime() lookahead is `*`. Match `*`. lookahead becomes `id`. Call F() lookahead is `id`. Match `id`. lookahead becomes `EOF`. Call T_prime() lookahead is `EOF`. It's not `*`. Choose `ε`. Return. Call E_prime() lookahead is `EOF`. It's not `+`. Choose `ε`. Return. The input is successfully parsed when all procedures return and the input stream is exhausted. 2. What is Left recursion & define how to remove left recursion in a given grammar? Introduction: Left recursion is a common issue in context-free grammars that can cause infinite loops in top-down parsers, such as recursive descent parsers. It occurs when a non-terminal directly or indirectly derives itself as its leftmost symbol. Explanation: Definition of Left Recursion: Direct Left Recursion: A production is directly left recursive if it is of the form $A \to A\alpha$, where $A$ is a non-terminal and $\alpha$ is any sequence of terminals and non-terminals. Indirect Left Recursion: A non-terminal $A$ is indirectly left recursive if there is a sequence of productions $A \to B\alpha_1$, $B \to C\alpha_2$, ..., $X \to A\alpha_k$ that eventually leads back to $A$ as the leftmost symbol. Problem with Left Recursion: In a top-down parser, if a procedure for non-terminal $A$ encounters a production $A \to A\alpha$, it will immediately call itself again with $A$, leading to an infinite loop without consuming any input. This prevents the parser from making progress. Algorithm for Eliminating Direct Left Recursion: For a non-terminal $A$ with productions: A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn where $\beta_i$ are productions that do not start with $A$ (non-left-recursive productions), and $\alpha_i$ are the left-recursive parts. This set of productions can be transformed into a non-left-recursive equivalent by introducing a new non-terminal, say $A'$, as follows: A → β1 A' | β2 A' | ... | βn A' A' → α1 A' | α2 A' | ... | αm A' | ε Here, $A'$ (pronounced "A-prime") is a new non-terminal. The $\beta$ productions are the "base cases" that allow $A$ to derive something without immediately recursing. The $A'$ productions handle the repetition of the $\alpha$ parts. Example: Eliminate Left Recursion for the given grammar: E→E+T / T T→T*F / F F→(E) / id Step 1: Eliminate left recursion for non-terminal E. For $E \to E + T \mid T$: Left-recursive productions: $E \to E + T$ (here $\alpha_1 = +T$) Non-left-recursive productions: $E \to T$ (here $\beta_1 = T$) Applying the transformation rule, we introduce a new non-terminal $E'$: E → T E' E' → + T E' | ε Step 2: Eliminate left recursion for non-terminal T. For $T \to T * F \mid F$: Left-recursive productions: $T \to T * F$ (here $\alpha_1 = *F$) Non-left-recursive productions: $T \to F$ (here $\beta_1 = F$) Applying the transformation rule, we introduce a new non-terminal $T'$: T → F T' T' → * F T' | ε Step 3: Non-terminal F has no left recursion. The productions for F remain unchanged: F → ( E ) | id Resulting Grammar (after eliminating all direct left recursion): E → T E' E' → + T E' | ε T → F T' T' → * F T' | ε F → ( E ) | id This transformed grammar is now suitable for top-down parsing methods like recursive descent, as it no longer contains direct left recursion. 
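As an illustration of how mechanical this transformation is, here is a small C sketch. It is an illustrative aid and not part of the standard answer: it assumes single-character non-terminal names, treats an alternative as left recursive exactly when it begins with that name, and simply prints the A → βA', A' → αA' | ε form for the E and T productions above.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: eliminate *direct* left recursion for one non-terminal
   whose alternatives are given as strings (single-character non-terminal names). */
static void eliminate_direct_left_recursion(const char *A, const char *alts[], int n) {
    printf("%s  -> ", A);
    int first = 1;
    for (int i = 0; i < n; i++) {                       /* beta alternatives: not starting with A */
        if (strncmp(alts[i], A, strlen(A)) != 0) {
            printf("%s%s %s'", first ? "" : " | ", alts[i], A);
            first = 0;
        }
    }
    printf("\n%s' -> ", A);
    first = 1;
    for (int i = 0; i < n; i++) {                       /* alpha alternatives: starting with A */
        if (strncmp(alts[i], A, strlen(A)) == 0) {
            printf("%s%s %s'", first ? "" : " | ", alts[i] + strlen(A), A);
            first = 0;
        }
    }
    printf("%s| epsilon\n", first ? "" : " ");
}

int main(void) {
    const char *e_alts[] = { "E+T", "T" };              /* E -> E+T | T */
    const char *t_alts[] = { "T*F", "F" };              /* T -> T*F | F */
    eliminate_direct_left_recursion("E", e_alts, 2);
    eliminate_direct_left_recursion("T", t_alts, 2);
    return 0;
}
```

Running the sketch prints the same transformed productions for E, E', T and T' as derived above (with "epsilon" written out).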
Indirect left recursion also needs to be handled, usually by a more general algorithm that first orders non-terminals and then substitutes productions to expose direct left recursion, which can then be eliminated using the above method.
3. Create CLR Parsing Table for S→(S)|a Grammar. Introduction: Canonical LR (CLR) parsing is a powerful bottom-up parsing technique that constructs the canonical collection of LR(1) items, which typically yields a large number of states. It is more precise than SLR and LALR parsing because the lookahead symbol is carried inside each item and is used to decide exactly when a reduction may be performed. Explanation: Given Grammar: 1. S → (S)  2. S → a
Step 1: Augment the Grammar. Add a new start symbol $S'$ and the production $S' \to S$:
0. S' → S
1. S → (S)
2. S → a
Step 2: Construct the Canonical Collection of LR(1) Items. An LR(1) item is of the form $[A \to \alpha.\beta, a]$, where $a$ is a lookahead terminal. A kernel item keeps its lookahead unchanged under GOTO; new lookaheads appear only through closure. For example, when closure expands $[S \to (.S), x]$, the added items $S \to .(S)$ and $S \to .a$ receive lookahead `)`, because inside $S \to (S)$ the inner $S$ is followed by `)`.
I0: [S' → .S, $], [S → .(S), $], [S → .a, $]
    GOTO(I0, S) = I1, GOTO(I0, '(') = I2, GOTO(I0, a) = I3
I1: [S' → S., $]   (accept on $)
I2: [S → (.S), $], [S → .(S), )], [S → .a, )]
    GOTO(I2, S) = I4, GOTO(I2, '(') = I5, GOTO(I2, a) = I6
I3: [S → a., $]    (reduce by S → a on $)
I4: [S → (S.), $]
    GOTO(I4, ')') = I7
I5: [S → (.S), )], [S → .(S), )], [S → .a, )]
    GOTO(I5, S) = I8, GOTO(I5, '(') = I5, GOTO(I5, a) = I6
I6: [S → a., )]    (reduce by S → a on ))
I7: [S → (S)., $]  (reduce by S → (S) on $)
I8: [S → (S.), )]
    GOTO(I8, ')') = I9
I9: [S → (S)., )]  (reduce by S → (S) on ))
Note that I2 and I5 have the same core but different lookaheads on their kernel items ($ versus )), so they are distinct CLR(1) states; the same holds for the pairs I3/I6, I4/I8 and I7/I9.
Step 3: Construct the CLR(1) Parsing Table. Production numbers: $S' \to S$ (0), $S \to (S)$ (1), $S \to a$ (2). Columns a, (, ), $ form the ACTION part; column S is the GOTO part; blank entries are errors.

| State | a | ( | ) | $ | S |
|---|---|---|---|---|---|
| 0 | s3 | s2 | | | 1 |
| 1 | | | | acc | |
| 2 | s6 | s5 | | | 4 |
| 3 | | | | r2 | |
| 4 | | | s7 | | |
| 5 | s6 | s5 | | | 8 |
| 6 | | | r2 | | |
| 7 | | | | r1 | |
| 8 | | | s9 | | |
| 9 | | | r1 | | |

Actions: `sX`: Shift to state X. `rY`: Reduce by production Y. `acc`: Accept. Notice that state 3 holds $S \to a.$ with lookahead `$` (it came from $I_0$), so it reduces only if `$` is the next input, while state 6 holds $S \to a.$ with lookahead `)` (it came from $I_2$), so it reduces only if `)` is the next input. This distinction in lookaheads for identical core items is what gives CLR its power and its larger number of states compared to LALR (an LALR parser would merge the state pairs 2/5, 3/6, 4/8 and 7/9, which share cores).
4. Explain Operator Precedence Grammar with example. Introduction: Operator precedence parsing is a simple and efficient bottom-up parsing technique that works for a class of grammars known as operator precedence grammars. It is suitable for expressions where the relative precedence of operators can determine the order of evaluation.
Explanation: Operator Precedence Grammar: A grammar is an operator precedence grammar if it satisfies two conditions: No production rule has two adjacent non-terminals on its right-hand side (RHS). That is, there are no productions of the form $A \to \alpha BC\beta$, where $B$ and $C$ are non-terminals. For any two terminal symbols $a$ and $b$, at most one of the following precedence relations holds: $a \lessdot b$ ($a$ yields precedence to $b$), $a \doteq b$ ($a$ has the same precedence as $b$), $a \gtrdot b$ ($a$ takes precedence over $b$). Precedence Relations: These relations define how operators interact. $a \lessdot b$: $a$ has lower precedence than $b$, so the handle containing $b$ should be reduced first (e.g., $+ \lessdot *$, $+ \lessdot id$). $a \doteq b$: $a$ has the same precedence as $b$. This usually applies to matching parentheses (e.g., $( \doteq )$). $a \gtrdot b$: $a$ has higher precedence than $b$. This implies that a handle ending in $a$ can be reduced (e.g., $id \gtrdot +$, $* \gtrdot +$). For terminals $a$ and $b$, $a \lessdot b$ if $a$ is immediately to the left of the "start" of $b$'s production in some sentential form. $a \gtrdot b$ if $a$ is immediately to the left of the "end" of $b$'s production. $a \doteq b$ if $a$ and $b$ are part of the same production. Parsing Algorithm: Operator precedence parsers use a stack and an input buffer. The parser compares the top terminal on the stack ($a$) with the current input symbol ($b$). Based on the precedence relation between $a$ and $b$, it performs one of the following actions: Shift ($a \lessdot b$ or $a \doteq b$): Shift $b$ onto the stack. This happens when the incoming terminal has higher or equal precedence, suggesting it should be processed next or is part of the same construct. Reduce ($a \gtrdot b$): Reduce a handle from the stack. The parser pops terminals from the stack until the terminal now on top is related by $\lessdot$ to the terminal most recently popped; the popped symbols form the handle, which is then reduced to a non-terminal. Accept: When the stack contains only the start symbol and the input is empty. Error: If no relation exists or an invalid sequence is encountered. A special symbol `$` is used to mark the bottom of the stack and the end of the input. It has the lowest precedence. Example: Consider the grammar for arithmetic expressions: E → E + T | T, T → T * F | F, F → (E) | id. This grammar has no two adjacent non-terminals, so it is in operator form and the precedence relations between its terminals can be extracted and tabulated. (The left-recursion-eliminated version E → T E' is not an operator grammar, because T E' places two non-terminals next to each other; operator precedence parsing therefore works directly on the original grammar.)
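Before working through the example by hand, the shift/reduce loop described above can be sketched in C. This is only an illustrative sketch under stated assumptions, not a prescribed implementation: it uses single-character tokens (`i` for id), keeps only terminals on the stack (as the note after the trace explains), hard-codes the precedence table for id, +, *, (, ) and $, and prints the actions instead of building a parse tree.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative operator-precedence parser for the terminals i (id), +, *, (, ), $. */
static const char *terms = "i+*()$";

/* rel[a][b]: relation between the stack-top terminal a and the input terminal b.
   '<' = a yields precedence (shift), '=' = same precedence (shift),
   '>' = a takes precedence (reduce), 'A' = accept, ' ' = error.                  */
static const char rel[6][6] = {
    /*          i    +    *    (    )    $   */
    /* i */ { ' ', '>', '>', ' ', '>', '>' },
    /* + */ { '<', '>', '<', '<', '>', '>' },
    /* * */ { '<', '>', '>', '<', '>', '>' },
    /* ( */ { '<', '<', '<', '<', '=', ' ' },
    /* ) */ { ' ', '>', '>', ' ', '>', '>' },
    /* $ */ { '<', '<', '<', '<', ' ', 'A' },
};

static int idx(char c) { const char *p = strchr(terms, c); return p ? (int)(p - terms) : -1; }

int main(void) {
    const char *input = "i+i*i$";    /* token stream; '$' marks the end of input   */
    char stack[64] = "$";            /* terminal stack with bottom marker '$'       */
    int top = 0, ip = 0;

    for (;;) {
        char a = stack[top], b = input[ip];
        if (idx(a) < 0 || idx(b) < 0) { puts("error: unknown symbol"); return 1; }
        char r = rel[idx(a)][idx(b)];
        if (r == 'A') {                       /* $ on the stack and $ in the input */
            puts("accept");
            return 0;
        } else if (r == '<' || r == '=') {    /* shift the input terminal */
            stack[++top] = b;
            ip++;
            printf("shift  %c\n", b);
        } else if (r == '>') {                /* reduce: pop until the new top yields
                                                 precedence to the last popped terminal */
            char last;
            do { last = stack[top--]; } while (rel[idx(stack[top])][idx(last)] != '<');
            printf("reduce handle whose leftmost terminal is %c\n", last);
        } else {
            puts("error: no precedence relation");
            return 1;
        }
    }
}
```

On the input `i+i*i$` this sketch should print the same shift/reduce sequence as the hand trace that follows.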
Let's consider the input string: id + id * id
Operator Precedence Table (simplified for common operators; a blank entry means no relation, i.e. an error). The row is the terminal on top of the stack, the column is the current input terminal:

| | id | + | * | ( | ) | $ |
|---|---|---|---|---|---|---|
| id | | $\gtrdot$ | $\gtrdot$ | | $\gtrdot$ | $\gtrdot$ |
| + | $\lessdot$ | $\gtrdot$ | $\lessdot$ | $\lessdot$ | $\gtrdot$ | $\gtrdot$ |
| * | $\lessdot$ | $\gtrdot$ | $\gtrdot$ | $\lessdot$ | $\gtrdot$ | $\gtrdot$ |
| ( | $\lessdot$ | $\lessdot$ | $\lessdot$ | $\lessdot$ | $\doteq$ | |
| ) | | $\gtrdot$ | $\gtrdot$ | | $\gtrdot$ | $\gtrdot$ |
| $ | $\lessdot$ | $\lessdot$ | $\lessdot$ | $\lessdot$ | | Accept |

Parsing Trace for id + id * id (with `$` markers):

| Stack | Input | Relation (top of stack, current input) | Action |
|---|---|---|---|
| $ | id + id * id $ | $\$ \lessdot id$ | Shift |
| $ id | + id * id $ | $id \gtrdot +$ | Reduce id $\to$ E (conceptually; non-terminals stay implicit) |
| $ | + id * id $ | $\$ \lessdot +$ | Shift |
| $ + | id * id $ | $+ \lessdot id$ | Shift |
| $ + id | * id $ | $id \gtrdot *$ | Reduce id $\to$ E |
| $ + | * id $ | $+ \lessdot *$ | Shift |
| $ + * | id $ | $* \lessdot id$ | Shift |
| $ + * id | $ | $id \gtrdot \$$ | Reduce id $\to$ E |
| $ + * | $ | $* \gtrdot \$$ | Reduce expression (id * id) |
| $ + | $ | $+ \gtrdot \$$ | Reduce expression (id + (id * id)) |
| $ | $ | accept | Accept |

Note: In actual operator precedence parsing, non-terminals are often implicitly accounted for or not explicitly pushed onto the stack. The stack primarily holds terminals.
5. Explain Parsing and its classification. Introduction: Parsing, also known as syntax analysis, is the second phase of a compiler. Its main task is to determine if the stream of tokens generated by the lexical analyzer conforms to the grammatical rules (syntax) of the programming language. If it does, the parser constructs a hierarchical representation of the input, typically a parse tree or an Abstract Syntax Tree (AST). Explanation: Role of Parsing: Syntax Validation: Ensures that the program's structure is grammatically correct according to the language specification. Structure Creation: Builds a structured representation (parse tree/AST) that highlights the relationships between language constructs. This structure is then used by subsequent compiler phases for semantic analysis and code generation. Error Reporting: Detects and reports syntax errors in the source code. Classification of Parsers: Parsers are broadly classified into two main categories based on the way they construct the parse tree: 1. Top-Down Parsing: Approach: Starts from the start symbol of the grammar and attempts to derive the input string by applying production rules. It tries to build the parse tree from the root down to the leaves. Mechanism: Predicts the next production rule to apply based on the current non-terminal and the lookahead symbol(s). Types: Recursive Descent Parsing: A collection of recursive procedures, one for each non-terminal. Each procedure tries to match its non-terminal with the input. It can involve backtracking if the grammar is not left-factored. Predictive Parsing (LL Parsers): A special form of recursive descent that doesn't require backtracking. It uses a parsing table to make deterministic decisions about which production to apply, based on the current non-terminal and the single lookahead symbol. Requires grammars to be LL(1) (no left recursion, no common prefixes). Advantages: Relatively easy to implement manually for simple grammars. Disadvantages: Cannot handle left-recursive grammars directly (requires elimination of left recursion).
Requires left factoring for predictive parsing. Limited class of grammars. 2. Bottom-Up Parsing: Approach: Starts from the input symbols (leaves of the parse tree) and tries to construct the parse tree upwards towards the start symbol (root). It performs "reductions" by identifying substrings that match the right-hand side of a production and replacing them with the left-hand side non-terminal. Mechanism: Uses a stack to store grammar symbols and decides whether to "shift" an input symbol onto the stack or "reduce" a handle on the stack. Types: Shift-Reduce Parsing: The general framework for bottom-up parsers. Operator Precedence Parsing: A simple shift-reduce parser primarily used for expression grammars. It relies on precedence relations between operators. LR Parsers (Left-to-right scan, Rightmost derivation in reverse): The most powerful class of shift-reduce parsers. They use a finite automaton and a parsing table to make shift/reduce decisions. SLR (Simple LR): Simplest LR parser, uses FOLLOW sets to resolve reduce conflicts. LALR (Lookahead LR): More powerful than SLR, merges states with identical core items, reducing table size while maintaining much of LR(1)'s power. Most common in practice (e.g., Yacc/Bison). CLR (Canonical LR): Most powerful LR parser, uses full LR(1) items (state + lookahead) to distinguish states, leading to larger tables but no conflicts if the grammar is LR(1). Advantages: Can handle a much larger class of grammars compared to top-down parsers. More powerful and flexible for complex languages. Disadvantages: More complex to implement manually; typically generated by tools (parser generators). 6. Write short notes on: 1. Left Factoring 2. Yacc compiler Introduction: Left factoring and Yacc are two distinct but related concepts in compiler design. Left factoring is a grammar transformation technique used to make grammars suitable for predictive parsing, while Yacc is a powerful parser generator tool. Explanation: 1. Left Factoring: Definition: Left factoring is a grammar transformation technique used to eliminate common prefixes in the right-hand sides (RHS) of production rules for a non-terminal. When a non-terminal has two or more productions whose RHSs begin with the same sequence of symbols, a top-down parser (especially a predictive parser) cannot deterministically choose which production to apply without backtracking. Purpose: To make a grammar suitable for predictive parsing (e.g., LL(1) parsers). Predictive parsers require that for each non-terminal $A$, and for each input symbol $a$, there is at most one production $A \to \alpha$ such that $a$ is in $FIRST(\alpha)$ (or if $\alpha$ can derive $\epsilon$, then $a$ is in $FOLLOW(A)$). Common prefixes violate this condition. Transformation: If a non-terminal $A$ has productions $A \to \alpha\beta_1 \mid \alpha\beta_2 \mid \dots \mid \alpha\beta_n \mid \gamma_1 \mid \dots \mid \gamma_m$, where $\alpha$ is the longest common prefix, it can be left-factored into: A → α A' | γ1 | ... | γm A' → β1 | β2 | ... | βn where $A'$ is a new non-terminal. Example: Stmt → if Expr then Stmt else Stmt | if Expr then Stmt Here, `if Expr then Stmt` is a common prefix. Left-factored grammar: Stmt → if Expr then Stmt Stmt' Stmt' → else Stmt | ε 2. Yacc compiler (Yet Another Compiler Compiler): Definition: Yacc (and its GNU counterpart, Bison) is a powerful parser generator tool that takes a context-free grammar specification as input and generates a LALR(1) parser in C (or other languages). 
It's widely used for building compilers, interpreters, and other language processing tools. Input: Yacc takes a grammar specification file (typically with a `.y` extension) which includes: Declarations: Token declarations, operator precedence and associativity rules. Grammar Rules: Production rules with associated semantic actions (C code snippets). Auxiliary C Code: Helper functions, error handling routines, and the `main` function. Output: Yacc generates a C source file (e.g., `y.tab.c`) containing the parsing tables and the parser function (`yyparse()`). This C file can then be compiled with a C compiler. Mechanism: Yacc generates a LALR(1) parser, which is a type of bottom-up shift-reduce parser. It uses a finite automaton and a parsing table to control the parsing process. When a reduction occurs, the associated semantic action (C code) is executed, typically to build an Abstract Syntax Tree, generate intermediate code, or perform semantic checks. Interaction with Lex: Yacc is often used in conjunction with Lex (or Flex), a lexical analyzer generator. Lex generates the `yylex()` function, which provides tokens to Yacc's `yyparse()` function. Example:
/* Yacc declarations */
%token ID NUMBER
%left '+' '-'
%left '*' '/'
%%
/* Grammar rules */
expr: expr '+' expr   { $$ = $1 + $3; }
    | expr '*' expr   { $$ = $1 * $3; }
    | ID              { $$ = lookup_symbol($1); }
    | NUMBER          { $$ = $1; }
    ;
%%
/* C code: main function, yylex, yyerror etc. */
Find first and follow in a given grammar (# = Epsilon)
1. E → T R
2. R → + T R | #
3. T → F Y
4. Y → * F Y | #
5. F → ( E ) | i
Introduction: FIRST and FOLLOW sets are crucial in compiler design, particularly for constructing predictive parsing tables (LL(1) parsers). They help determine which production to use for a non-terminal and when to apply an epsilon production. FIRST(A) includes all terminal symbols that can begin a string derived from A. FOLLOW(A) includes all terminal symbols that can appear immediately after A in any sentential form. Explanation: Rules for FIRST Sets: If $X$ is a terminal, then $FIRST(X) = \{X\}$. If $X$ is a non-terminal and $X \to \epsilon$ is a production, then $\epsilon \in FIRST(X)$. If $X$ is a non-terminal and $X \to Y_1 Y_2 \dots Y_k$ is a production: Add $FIRST(Y_1)$ (excluding $\epsilon$) to $FIRST(X)$. If $\epsilon \in FIRST(Y_1)$, add $FIRST(Y_2)$ to $FIRST(X)$, and continue this for each $Y_i$ as long as $\epsilon \in FIRST(Y_j)$ for all $j < i$. If $\epsilon \in FIRST(Y_j)$ for all $j = 1 \dots k$, then add $\epsilon$ to $FIRST(X)$. Rules for FOLLOW Sets: Place `$` (end-of-input marker) in $FOLLOW(S)$, where $S$ is the start symbol. If there is a production $A \to \alpha B \beta$: Add all symbols in $FIRST(\beta)$ (excluding $\epsilon$) to $FOLLOW(B)$. If there is a production $A \to \alpha B$, or a production $A \to \alpha B \beta$ where $\epsilon \in FIRST(\beta)$: Add all symbols in $FOLLOW(A)$ to $FOLLOW(B)$.
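Before applying these rules by hand, it can help to see that they amount to a fixed-point computation. The following small C sketch is an illustrative aid only (not part of the standard answer): it hard-codes the grammar of this question with single-character symbols and `#` for epsilon, and repeatedly applies the FIRST rules above until nothing changes; FOLLOW sets can be computed with the same pattern.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* The grammar of this question: '#' stands for epsilon,
   upper-case letters are non-terminals, everything else is a terminal. */
static const char *prods[] = { "E=TR", "R=+TR", "R=#", "T=FY", "Y=*FY", "Y=#", "F=(E)", "F=i" };
enum { NPROD = 8 };

static char first[26][16];        /* FIRST set of each non-terminal 'A'..'Z', stored as a string */

static int add(char *set, char c) {            /* add c to a set; report whether it was new */
    if (strchr(set, c)) return 0;
    size_t n = strlen(set);
    set[n] = c; set[n + 1] = '\0';
    return 1;
}

int main(void) {
    int changed = 1;
    while (changed) {                          /* fixed-point iteration over all productions */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            char A = prods[p][0];
            const char *rhs = prods[p] + 2;
            char *FA = first[A - 'A'];
            int all_nullable = 1;              /* stays 1 only while every symbol seen can vanish */
            for (const char *x = rhs; *x && all_nullable; x++) {
                if (*x == '#') break;                          /* epsilon production */
                if (!isupper((unsigned char)*x)) {             /* terminal: it begins the string */
                    changed |= add(FA, *x);
                    all_nullable = 0;
                } else {                                       /* non-terminal: add FIRST(X) \ {#} */
                    const char *FX = first[*x - 'A'];
                    for (const char *c = FX; *c; c++)
                        if (*c != '#') changed |= add(FA, *c);
                    all_nullable = (strchr(FX, '#') != NULL);  /* continue only if X can vanish */
                }
            }
            if (all_nullable) changed |= add(FA, '#');         /* the whole right-hand side can vanish */
        }
    }
    for (const char *nt = "ERTYF"; *nt; nt++) {
        printf("FIRST(%c) = {", *nt);
        for (const char *c = first[*nt - 'A']; *c; c++) printf(" %c", *c);
        printf(" }\n");
    }
    return 0;
}
```

Its output should agree with the hand calculation that follows.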
Calculating FIRST Sets:
$F \to (E) \mid i$: $FIRST(F) = \{ '(', 'i' \}$.
$Y \to * F Y \mid \#$: $FIRST(Y) = \{ '*', \epsilon \}$ ('*' from the first alternative, $\epsilon$ from $Y \to \#$).
$T \to F Y$: since $F$ does not derive $\epsilon$, $FIRST(T) = FIRST(F) = \{ '(', 'i' \}$.
$R \to + T R \mid \#$: $FIRST(R) = \{ '+', \epsilon \}$ ('+' from the first alternative, $\epsilon$ from $R \to \#$).
$E \to T R$: since $T$ does not derive $\epsilon$, $FIRST(E) = FIRST(T) = \{ '(', 'i' \}$.
Final FIRST Sets:
FIRST(E) = { '(', 'i' }
FIRST(R) = { '+', ε }
FIRST(T) = { '(', 'i' }
FIRST(Y) = { '*', ε }
FIRST(F) = { '(', 'i' }
Calculating FOLLOW Sets:
$FOLLOW(E)$: E is the start symbol: `$` is in $FOLLOW(E)$. From $F \to (E)$: `)` is in $FOLLOW(E)$. $FOLLOW(E) = \{ ')', '$' \}$
$FOLLOW(R)$: From $E \to T R$: $FOLLOW(R)$ includes $FOLLOW(E)$. $FOLLOW(R) = \{ ')', '$' \}$
$FOLLOW(T)$: From $E \to T R$: $FIRST(R)$ (excluding $\epsilon$) is in $FOLLOW(T)$. So `+` is in $FOLLOW(T)$. Since $\epsilon \in FIRST(R)$, $FOLLOW(E)$ is in $FOLLOW(T)$. So `)` and `$` are in $FOLLOW(T)$. From $R \to + T R$: $FIRST(R)$ (excluding $\epsilon$) is in $FOLLOW(T)$. So `+` is in $FOLLOW(T)$. Since $\epsilon \in FIRST(R)$, $FOLLOW(R)$ is in $FOLLOW(T)$. So `)` and `$` are in $FOLLOW(T)$. $FOLLOW(T) = \{ '+', ')', '$' \}$
$FOLLOW(Y)$: From $T \to F Y$: $FOLLOW(Y)$ includes $FOLLOW(T)$. $FOLLOW(Y) = \{ '+', ')', '$' \}$
$FOLLOW(F)$: From $T \to F Y$: $FIRST(Y)$ (excluding $\epsilon$) is in $FOLLOW(F)$. So `*` is in $FOLLOW(F)$. Since $\epsilon \in FIRST(Y)$, $FOLLOW(T)$ is in $FOLLOW(F)$. So `+`, `)`, `$` are in $FOLLOW(F)$. From $Y \to * F Y$: $FIRST(Y)$ (excluding $\epsilon$) is in $FOLLOW(F)$. So `*` is in $FOLLOW(F)$. Since $\epsilon \in FIRST(Y)$, $FOLLOW(Y)$ is in $FOLLOW(F)$. So `+`, `)`, `$` are in $FOLLOW(F)$. $FOLLOW(F) = \{ '*', '+', ')', '$' \}$
Final FOLLOW Sets:
FOLLOW(E) = { ')', '$' }
FOLLOW(R) = { ')', '$' }
FOLLOW(T) = { '+', ')', '$' }
FOLLOW(Y) = { '+', ')', '$' }
FOLLOW(F) = { '*', '+', ')', '$' }
Unit 3: Short Questions (2 Marks Each)
1. What is the difference between S-attributed and L-attributed grammars? Introduction: S-attributed and L-attributed grammars are two classes of attribute grammars used in semantic analysis to specify computations associated with productions. Explanation: S-attributed Grammar (Synthesized attributes only): Attributes are computed only from the values of the attributes of its children (and itself). The information flows purely bottom-up in the parse tree. Can be evaluated by a single bottom-up pass. L-attributed Grammar (Synthesized and Inherited attributes): Attributes can be computed from left-to-right. An attribute can depend on: Inherited attributes of its parent. Synthesized attributes of its left siblings. Inherited and synthesized attributes of itself. Information flows both bottom-up (synthesized) and left-to-right (inherited). Can be evaluated by a single depth-first, left-to-right traversal.
2. What is the role of semantic rules in translation? Introduction: Semantic rules are an integral part of Syntax-Directed Definitions (SDD) and play a crucial role in the semantic analysis phase of a compiler. Explanation: Semantic rules are associated with grammar productions and specify how to compute attribute values (e.g., type, value, memory location) for the symbols in the production. These rules define the meaning or semantics of the language constructs.
They perform tasks like type checking, scope resolution, building symbol tables, and generating intermediate code. Essentially, they bridge the gap between the syntactic structure and the actual meaning of the program. 3. Define intermediate forms using post-fix notation. Introduction: Postfix notation, also known as Reverse Polish Notation (RPN), is a common intermediate form used in compilers. Explanation: In postfix notation, operators follow their operands. This eliminates the need for parentheses and explicitly defines the order of operations, making it easy to evaluate using a stack. For example, the infix expression a + b * c would be represented as a b c * + in postfix. It's a linear, unambiguous representation of expressions that is straightforward to generate from a parse tree and to translate into machine code. 4. How to parse SDT in top-down Parsing? Introduction: Syntax-Directed Translation (SDT) can be implemented in top-down parsing by embedding semantic actions directly into parser procedures. Explanation: In a recursive-descent parser, semantic actions (C code snippets) are inserted into the body of the procedures corresponding to non-terminals. These actions are typically executed when a certain point in the production is reached (e.g., after matching a terminal or calling a sub-procedure for a non-terminal). For L-attributed SDDs, inherited attributes are passed as parameters to the procedure calls, and synthesized attributes are returned as results. For S-attributed SDDs, all attributes are synthesized and returned. The order of actions must respect attribute dependencies. 5. What is intermediate code generation? Introduction: Intermediate code generation is a phase in a compiler that translates the source program into an intermediate representation. Explanation: This intermediate code (IR) is typically machine-independent but closer to machine language than the source code. It serves as a bridge between the front-end (source language dependent) and the back-end (target machine dependent) of the compiler. Common IRs include three-address code (quadruples, triples), postfix notation, and syntax trees. The purpose is to simplify code optimization and target code generation by providing a more abstract and structured representation than the source code, yet less detailed than machine code. Unit 3: Long Questions (10 Marks Each) 1. Consider the following block and construct a DAG for it: (1) a = b x c (2) d = b (3) e = d x c (4) b = e (5) f = b + c (6) g = f + d Introduction: A Directed Acyclic Graph (DAG) is a graphical representation of basic blocks in intermediate code. It is a powerful tool for optimizing basic blocks by identifying common subexpressions, eliminating dead code, and reordering computations. Each node in a DAG represents an operation, and its children represent its operands. Leaf nodes are identifiers or constants, and interior nodes are operators. Explanation: To construct a DAG, we process each statement in the basic block sequentially, performing the following steps: For each statement $x = y \text{ op } z$: Create a node for `op` if it doesn't exist. The left child of `op` is the node for `y`. The right child of `op` is the node for `z`. Attach `x` as a label to the `op` node. For each statement $x = y$: Attach `x` as a label to the node for `y`. If a node for an operation already exists with the same children, reuse that node (common subexpression elimination). 
If a variable is redefined, remove its old label from any previous node and attach it to the new node.
Let's construct the DAG for the given block. Initial state: no nodes; leaf nodes for `b` and `c` are created on their first use, and the labels track which name currently denotes which node.
(1) a = b x c: Create leaf nodes for `b` and `c` and an interior node N1 = x(b, c). Attach the label `a` to N1.
(2) d = b: No new node is needed. Attach the label `d` to the leaf node `b` (the leaf is now labeled `b, d`).
(3) e = d x c: `d` currently denotes the leaf `b`, so this is again `b x c`. A node x(b, c) already exists (N1); it is a common subexpression, so no new node is created. Attach `e` to N1 (now labeled `a, e`).
(4) b = e: The name `b` is re-bound to the node for `e`, which is N1 (now labeled `a, e, b`). The leaf node keeps the label `d`, because `d` still holds the original value of `b`.
(5) f = b + c: `b` now denotes N1 and `c` denotes the leaf `c`. Create N2 = +(N1, c) and attach the label `f`.
(6) g = f + d: `f` denotes N2, and `d` still denotes the original leaf `b`. Create N3 = +(N2, b-leaf) and attach the label `g`.
Nodes created:
Leaf node `b` (finally labeled `d`, since the name `b` was re-bound in statement 4)
Leaf node `c`
N1 = x(b, c), labeled `a, e, b`
N2 = +(N1, c), labeled `f`
N3 = +(N2, b-leaf), labeled `g`
DAG Diagram (shared children are the same node, drawn once):
+ (g)
├── + (f)
│   ├── x (a, e, b)
│   │   ├── b (d)   [leaf]
│   │   └── c       [leaf]
│   └── c           (same leaf c as above)
└── b (d)           (same leaf b as above)
2. Define TAC and explain how to represent TAC using Quadruples and Triples. Introduction: Three-Address Code (TAC) is a form of intermediate code used in compilers.
It is a sequence of statements, each of which has at most three operands and one operator. TAC is linear and machine-independent, making it suitable for code optimization and subsequent target code generation. Explanation: Three-Address Code (TAC): Each instruction in TAC has the general form `x = y op z`, where `op` is an operator, `y` and `z` are operands, and `x` is the result. Operands `y` and `z` can be names, constants, or compiler-generated temporary variables. The "three-address" refers to the addresses (memory locations or registers) of these three components. TAC simplifies expressions by breaking them down into a sequence of elementary operations, each with a clear result. Example: `a = b + c * d` in TAC would be:

t1 = c * d
t2 = b + t1
a = t2

(where `t1`, `t2` are compiler-generated temporary variables). Representations of Three-Address Code: TAC can be represented in various data structures, with Quadruples and Triples being two common forms.

1. Quadruples: Structure: A quadruple is a record structure with four fields: `operator`, `operand1`, `operand2`, and `result`. Fields: `op`: The operator (e.g., `+`, `*`, `:=`, `if_goto`). `arg1`: The first operand. `arg2`: The second operand. `result`: The variable or temporary that stores the result of the operation. Advantages: Easy to move code during optimization, because results are referred to by explicit names rather than by instruction position. Each instruction is self-contained. Disadvantages: Temporary variables can consume a lot of space. Example: For `a = b + c * d`

  #    op    arg1   arg2   result
 (0)   *     c      d      t1
 (1)   +     b      t1     t2
 (2)   :=    t2            a

2. Triples: Structure: A triple is a record structure with three fields: `operator`, `operand1`, and `operand2`. Fields: `op`: The operator. `arg1`: The first operand. `arg2`: The second operand. Key Feature: Unlike quadruples, triples do not explicitly use a `result` field. Instead, the result of an operation is implicitly referred to by the index (position) of the triple itself. This eliminates the need for explicit temporary variables. Advantages: Saves space by not creating temporary variables. Disadvantages: Code movement during optimization is difficult because changing the order of triples changes their implicit results, requiring updates to all references. Requires an extra indirection if arguments are not computed in the immediately preceding triple. Example: For `a = b + c * d`

  #    op    arg1   arg2
 (0)   *     c      d
 (1)   +     b      (0)
 (2)   :=    a      (1)

Note: `(0)` refers to the result of the operation in triple 0.
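The two layouts above can be written down directly as plain records. The following C++ fragment is an illustrative sketch (not part of the original answer; the struct and variable names are assumptions made for this example). It builds the quadruple and triple sequences for `a = b + c * d` and prints them:

#include <iostream>
#include <string>
#include <vector>

// One quadruple: the result is named explicitly.
struct Quad { std::string op, arg1, arg2, result; };

// One triple: the result is the triple's own index, so there is no result field.
struct Triple { std::string op, arg1, arg2; };

int main() {
    // a = b + c * d broken into three-address code
    std::vector<Quad> quads = {
        {"*",  "c",  "d",  "t1"},
        {"+",  "b",  "t1", "t2"},
        {":=", "t2", "",   "a"}
    };
    std::vector<Triple> triples = {
        {"*",  "c", "d"  },
        {"+",  "b", "(0)"},   // (0) = result of triple 0
        {":=", "a", "(1)"}    // copy the result of triple 1 into a
    };

    std::cout << "Quadruples:\n";
    for (std::size_t i = 0; i < quads.size(); ++i)
        std::cout << "(" << i << ") " << quads[i].op << " " << quads[i].arg1
                  << " " << quads[i].arg2 << " " << quads[i].result << "\n";

    std::cout << "Triples:\n";
    for (std::size_t i = 0; i < triples.size(); ++i)
        std::cout << "(" << i << ") " << triples[i].op << " " << triples[i].arg1
                  << " " << triples[i].arg2 << "\n";
}

The only structural difference is the missing `result` field in `Triple`: a triple's position in the sequence serves as its result, which is exactly why reordering triples is harder than reordering quadruples.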
3. Define SDD. Explain different forms of SDD. Introduction: Syntax-Directed Definition (SDD) is a high-level specification for translations. It augments a context-free grammar with attributes for symbols and semantic rules for productions, providing a formal way to specify the semantics of a programming language. Explanation: Definition of Syntax-Directed Definition (SDD): An SDD is a context-free grammar where: Each grammar symbol $X$ (terminal or non-terminal) has a set of attributes associated with it. These attributes represent properties or information about the language construct represented by $X$ (e.g., type, value, memory location). Each production rule $A \to \alpha$ has a set of semantic rules associated with it. These rules specify how to compute the values of the attributes. Attributes can be of two types: Synthesized Attributes: Computed from the values of attributes of its children nodes in the parse tree. Information flows up the parse tree. Inherited Attributes: Computed from the values of attributes of its parent node, its left siblings, or itself. Information flows down and across the parse tree.

Different Forms of SDD: Based on how attributes are computed and their dependencies, SDDs are classified into different forms:

1. S-Attributed Definitions: Characteristic: SDDs that use only synthesized attributes are called S-attributed definitions. Evaluation: All attributes are computed by passing information up the parse tree. This means that attribute values can be computed during a single bottom-up (postorder) pass over the parse tree. Compatibility: S-attributed definitions are well-suited for bottom-up parsing techniques (e.g., LR parsers). Semantic actions can be embedded at the end of production rules, executed as soon as a reduction is performed. Example: Computing the value of an arithmetic expression:

Production      Semantic Rule
E → E1 + T      E.val = E1.val + T.val
E → T           E.val = T.val
T → T1 * F      T.val = T1.val * F.val
T → F           T.val = F.val
F → ( E )       F.val = E.val
F → digit       F.val = digit.lexval

Here, `val` is a synthesized attribute. Each `val` is computed from the `val` of its children.

2. L-Attributed Definitions: Characteristic: SDDs where an attribute can be either synthesized or inherited, but with a restriction on inherited attributes: an inherited attribute of a symbol $X_i$ on the right-hand side (RHS) of a production $A \to X_1 X_2 \dots X_n$ can only depend on: Inherited attributes of the non-terminal $A$. Synthesized or inherited attributes of symbols $X_j$ that appear to its left (i.e., $j < i$). Synthesized or inherited attributes of $X_i$ itself. Essentially, information flows from left to right across the RHS of a production. Evaluation: L-attributed definitions can be evaluated by a single depth-first, left-to-right (preorder-like) traversal of the parse tree. This makes them suitable for both top-down (predictive) and bottom-up parsing methods. Compatibility: They are more general than S-attributed definitions and are widely used in practical compilers, as many language constructs require left-to-right context (e.g., type checking, variable declarations). Example: Type checking in declarations:

Production      Semantic Rule
D → T id L      L.in = T.type                              // inherited attribute L.in gets its type from T
T → int         T.type = integer                           // synthesized attribute
T → real        T.type = real                              // synthesized attribute
L → id L1       add_type(id.entry, L.in); L1.in = L.in     // L.in passed to L1
L → ε           // no action

Here, `type` is synthesized, and `in` is inherited.

3. Syntax-Directed Translation Scheme (SDTS): Characteristic: An SDTS is an SDD embedded directly into the grammar productions by placing semantic actions (code fragments) within the RHS. Evaluation: The position of the semantic action determines when it is executed. Actions at the end of a production are equivalent to synthesized attributes (S-attributed). Actions in the middle of a production can implement inherited attributes (L-attributed), as long as attribute dependencies are respected. Compatibility: Can be used with both top-down and bottom-up parsers, but careful placement of actions is needed to ensure attributes are available when needed.

4. Explain S and L attributes in detail. Introduction: Attributes are properties associated with grammar symbols in a Syntax-Directed Definition (SDD). They represent information about the language constructs, such as type, value, memory location, or code fragments.
Attributes are classified into two main types: synthesized and inherited, based on how their values are computed within the parse tree. Explanation: 1. Synthesized Attributes: Definition: A synthesized attribute for a non-terminal $A$ at a parse tree node $N$ is computed from the values of attributes of its children nodes of $N$ and from the attributes of $N$ itself. The information effectively flows "up" the parse tree. Computation: If we have a production $A \to X_1 X_2 \dots X_n$, and $A.s$ is a synthesized attribute, its semantic rule would typically be of the form $A.s = f(X_1.a_1, X_2.a_2, \dots, X_n.a_n, A.a)$, where $f$ is some function and $a_i$ are attributes of the children or $A$ itself. Characteristics: Dependencies always point from children to parent. Can be computed during a single bottom-up (postorder) traversal of the parse tree. Grammars that use only synthesized attributes are called S-attributed grammars. Well-suited for tasks like calculating the value of an expression, constructing an abstract syntax tree, or type checking simple expressions. Example: Calculating the value of an arithmetic expression: Consider the production $E \to E_1 + T$. The value of $E$ (a synthesized attribute, say `val`) depends on the values of $E_1$ and $T$. So, $E.val = E_1.val + T.val$. Here, $E.val$ is synthesized from its children $E_1$ and $T$. Production Semantic Rule E → E1 + T E.val = E1.val + T.val T → F T.val = F.val F → digit F.val = digit.lexval 2. Inherited Attributes: Definition: An inherited attribute for a non-terminal $B$ at a parse tree node $N$ is computed from the values of attributes of its parent node, its left siblings, or itself. The information effectively flows "down" and "across" the parse tree. Computation: If we have a production $A \to X_1 X_2 \dots X_n$, and $X_i.i$ is an inherited attribute, its semantic rule could be of the form $X_i.i = f(A.a, X_1.a_1, \dots, X_{i-1}.a_{i-1}, X_i.a)$, where $f$ is some function. Characteristics: Dependencies point from parent to child or from left sibling to right sibling. Cannot be computed using a single bottom-up traversal. Requires more complex traversal orders or multiple passes. Used to pass contextual information from parent to child or from left to right. Essential for tasks like type checking in declarations (passing type information down), keeping track of symbol table entries, or determining the scope of declarations. Example: Passing type information in declarations: Consider the production $D \to T id L$. To properly declare `id` and subsequent identifiers in `L` with the type specified by `T`, the type information needs to be passed from `T` to `id` and then to `L`. Production Semantic Rule D → T id L L.in = T.type // L.in is inherited from T.type T → int T.type = integer // T.type is synthesized L → id L1 add_type(id.entry, L.in); L1.in = L.in // L.in passed to L1 Here, `L.in` is an inherited attribute that carries the type declared by `T` to the list of identifiers `L`. Relationship to L-attributed Grammars: L-attributed grammars are a powerful subclass of attribute grammars that accommodate both synthesized and inherited attributes. The key restriction is that inherited attributes must be computable in a single left-to-right pass. This means that an inherited attribute of a symbol $X_i$ on the RHS of a production can only depend on: Inherited attributes of the production's left-hand side non-terminal ($A$). Synthesized or inherited attributes of symbols to its left ($X_1, \dots, X_{i-1}$). 
Synthesized or inherited attributes of $X_i$ itself. This restriction ensures that attributes can be evaluated in a single depth-first traversal of the parse tree. 5. Define DAG with example and write algorithm also. Introduction: A Directed Acyclic Graph (DAG) is a graphical representation of a basic block in intermediate code. It is a powerful data structure used in the code optimization phase of a compiler to identify common subexpressions, eliminate redundant computations, and reorder operations within a basic block. Explanation: Definition of DAG: A DAG for a basic block is a directed graph where: Leaves: Represent identifiers (variables) or constants. Interior Nodes: Represent operators. Each interior node is labeled with an operator, and its children are its operands. Edges: Represent the flow of values from operands to operators. Since it's a DAG, there are no cycles. Labels for Nodes: Each node can have a list of identifiers attached to it, indicating that those identifiers hold the value computed at that node. The last assigned identifier for a node represents the current value of that variable. The construction of a DAG helps visualize the common subexpressions and the dependencies between operations, facilitating various optimizations. Algorithm for Constructing a DAG for a Basic Block: Assume the basic block consists of a sequence of three-address statements. We maintain two tables: `node_value` table: Maps (operator, left_operand_node, right_operand_node) to a node. Used to detect common subexpressions. `var_node` table: Maps variable names to the node that currently computes their value. Used to track current definitions. Algorithm Steps: Initialize `node_value` and `var_node` tables as empty. For each three-address statement `i: result = operand1 op operand2` (or `i: result = op operand1`, or `i: result = operand1`): Determine `node_operand1` and `node_operand2`: If `operand1` is a constant, create a new leaf node for it. If `operand1` is a variable, find `node_operand1 = var_node[operand1]`. If `operand1` is not in `var_node`, create a new leaf node for it and add it to `var_node`. Do similarly for `operand2`. Check for Common Subexpression: Look up `(op, node_operand1, node_operand2)` in `node_value`. If an existing node `N` is found, then `N` computes the same value. Reuse `N`. If no such node exists, create a new interior node `N` for `op` with children `node_operand1` and `node_operand2`. Add `(op, node_operand1, node_operand2)` to `node_value` mapping to `N`. Update Variable Mapping: Remove `result` from the labels of its old node (if `result` was previously defined). Add `result` as a label to the newly found or created node `N`. Update `var_node[result] = N`. For assignment statements `i: result = operand1`: Determine `node_operand1` as above. Remove `result` from labels of its old node. Add `result` as a label to `node_operand1`. Update `var_node[result] = node_operand1`. Example: Consider the basic block: 1. a = b + c 2. b = a - d 3. c = b + c 4. d = a - d Step-by-step DAG construction: Initial: `var_node = {}`, `node_value = {}` 1. `a = b + c` `node_b` = leaf('b'), `node_c` = leaf('c') Create `N1 = + (node_b, node_c)` Labels: `b`->`node_b`, `c`->`node_c`, `a`->`N1` `var_node = {b:node_b, c:node_c, a:N1}` `node_value = {(+,node_b,node_c):N1}` 2. `b = a - d` `node_a` = `var_node['a']` = `N1` `node_d` = leaf('d') Create `N2 = - (N1, node_d)` Remove `b` from `node_b`'s labels. 
Labels: `a`->`N1`, `c`->`node_c`, `d`->`node_d`, `b`->`N2`
`var_node = {b:N2, c:node_c, a:N1, d:node_d}`
`node_value = {(+,node_b,node_c):N1, (-,N1,node_d):N2}`
3. `c = b + c`
`node_b` = `var_node['b']` = `N2`, `node_c` = `var_node['c']` = `node_c` (the leaf)
Create `N3 = + (N2, node_c)`. Remove `c` from `node_c`'s labels.
Labels: `a`->`N1`, `d`->`node_d`, `b`->`N2`, `c`->`N3`
`var_node = {b:N2, c:N3, a:N1, d:node_d}`
`node_value = {(+,node_b,node_c):N1, (-,N1,node_d):N2, (+,N2,node_c):N3}`
4. `d = a - d`
`node_a` = `var_node['a']` = `N1`, `node_d_old` = `var_node['d']` = `node_d` (the leaf)
Check `node_value` for `(-, N1, node_d_old)`. It is `N2`: a common subexpression, so reuse `N2`. Remove `d` from `node_d`'s labels.
Labels: `a`->`N1`, `b`->`N2`, `c`->`N3`, `d`->`N2`
`var_node = {b:N2, c:N3, a:N1, d:N2}`
`node_value` remains the same.

Final DAG (nodes with children and attached identifiers):

N3:  +    children: N2, node_c       labels: c
N2:  -    children: N1, node_d       labels: b, d
N1:  +    children: node_b, node_c   labels: a
node_b:  leaf b
node_c:  leaf c
node_d:  leaf d
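As a concrete illustration of the construction algorithm above, here is a compact C++ sketch (an illustrative sketch only; the struct and function names are invented for this example and are not a prescribed implementation). It maintains the `var_node` and `node_value` tables and prints the nodes built for the block `a = b + c; b = a - d; c = b + c; d = a - d`:

#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One DAG node: either a leaf for an identifier/constant or an operator node.
struct Node {
    std::string op;                   // "" for a leaf
    std::string lexeme;               // identifier/constant name (leaves only)
    int left, right;                  // child indices, -1 for leaves
    std::vector<std::string> labels;  // identifiers currently attached here
};

std::vector<Node> nodes;
std::map<std::string, int> var_node;                          // name -> node
std::map<std::tuple<std::string, int, int>, int> node_value;  // (op, l, r) -> node

// Node currently holding the value of a name, creating a leaf on first use.
int node_for(const std::string& name) {
    auto it = var_node.find(name);
    if (it != var_node.end()) return it->second;
    nodes.push_back({"", name, -1, -1, {name}});
    return var_node[name] = static_cast<int>(nodes.size()) - 1;
}

// Move the label 'result' onto node n (dropping it from its old node).
void attach(const std::string& result, int n) {
    auto it = var_node.find(result);
    if (it != var_node.end()) {
        auto& ls = nodes[it->second].labels;
        for (auto j = ls.begin(); j != ls.end(); ++j)
            if (*j == result) { ls.erase(j); break; }
    }
    nodes[n].labels.push_back(result);
    var_node[result] = n;
}

// Process "result = a op b", reusing an existing node when possible.
void emit(const std::string& result, const std::string& a,
          const std::string& op, const std::string& b) {
    int l = node_for(a), r = node_for(b);
    auto key = std::make_tuple(op, l, r);
    auto it = node_value.find(key);
    int n;
    if (it != node_value.end()) {
        n = it->second;                        // common subexpression: reuse
    } else {
        nodes.push_back({op, "", l, r, {}});
        n = node_value[key] = static_cast<int>(nodes.size()) - 1;
    }
    attach(result, n);
}

int main() {
    emit("a", "b", "+", "c");
    emit("b", "a", "-", "d");
    emit("c", "b", "+", "c");
    emit("d", "a", "-", "d");   // reuses the node built for statement 2

    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const Node& n = nodes[i];
        std::cout << "N" << i << ": "
                  << (n.op.empty() ? "leaf " + n.lexeme : n.op);
        if (n.left >= 0) std::cout << "(N" << n.left << ", N" << n.right << ")";
        std::cout << "  labels:";
        for (const auto& s : n.labels) std::cout << " " << s;
        std::cout << "\n";
    }
}

Running it shows the fourth statement reusing the node created for `b = a - d`, which is exactly the common-subexpression reuse traced above.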
Unit 4: Short Questions (2 Marks Each) 1. What is Symbol table? Introduction: A symbol table is a fundamental data structure used by compilers. Explanation: It stores information about all identifiers (variables, functions, classes, etc.) encountered in the source program. For each identifier, it records attributes such as its name, type, scope, storage allocation, and potentially the number of arguments for functions. The symbol table is used throughout the compilation process, from lexical analysis (inserting new identifiers) to semantic analysis (type checking, scope resolution) and code generation (determining memory locations). 2. What is Activation record? Introduction: An activation record (or stack frame) is a data structure used to manage information about a single execution of a procedure (function or method). Explanation: When a procedure is called, an activation record is pushed onto the runtime stack. It typically contains: Return address (where to resume execution after the call). Actual parameters passed to the procedure. Local variables declared within the procedure. Saved machine registers. Control link (pointer to the caller's activation record). Access link (pointer to a non-local scope for accessing non-local variables). When the procedure returns, its activation record is popped off the stack. 3. Define Storage Allocation Strategies? Introduction: Storage allocation strategies determine how memory is managed for program variables during runtime. Explanation: Compilers employ different strategies: Static Allocation: Memory for variables is allocated at compile time and remains fixed throughout program execution. Suitable for global variables and constants.
Stack Allocation: Memory is allocated at runtime when a procedure is called (using activation records) and deallocated when it returns. Used for local variables and parameters, supporting recursion. Heap Allocation: Memory is allocated and deallocated dynamically by the program during execution (e.g., using `malloc`/`free` or `new`/`delete`). Used for data structures whose size or lifetime cannot be determined at compile time. 4. Define Boolean expression? Introduction: A Boolean expression is a fundamental concept in computer science and programming languages. Explanation: It is an expression that evaluates to one of two truth values: `true` or `false`. Boolean expressions are typically formed using relational operators (e.g., `==`, `!=`, `<`, `>`, `<=`, `>=`) to compare values, and logical operators (e.g., `AND`, `OR`, `NOT`) to combine other Boolean expressions. They are primarily used in control flow statements (like `if-else`, `while`, `for`) to make decisions and control the execution path of a program. 5. Define Accessing local and non-local names in a block structure. Introduction: In languages with block-structured scope, accessing variables requires mechanisms to distinguish between local and non-local names. Explanation: Local Names: Variables declared within the current block or procedure. They are typically accessed directly via an offset from the current activation record's frame pointer. Non-local Names: Variables declared in an enclosing (but not current) scope. Their access depends on the scoping rule: Static (Lexical) Scoping: Non-local names are accessed using an `access link` (or `static link`) in the activation record, which points to the activation record of the lexically enclosing scope. This forms a chain of pointers to outer scopes. Dynamic Scoping: Non-local names are accessed by searching down the dynamic chain of activation records (the `control link`) until the variable is found. Unit 4: Long Questions (10 Marks Each) 1. Explain Parameter Passing and its types. Introduction: Parameter passing refers to the mechanism by which arguments (actual parameters) from a calling function are made available to the called function's parameters (formal parameters). This mechanism is fundamental to how functions communicate and share data in programming languages. Explanation: The choice of parameter passing mechanism impacts how values are transferred, whether the original data can be modified, and the efficiency of the function call. Common Types of Parameter Passing Mechanisms: 1. Call-by-Value: Mechanism: The value of the actual parameter is copied into the formal parameter of the called function. Behavior: Any modifications made to the formal parameter within the called function do not affect the original actual parameter in the calling function. The formal parameter is essentially a local variable initialized with the actual parameter's value. Advantages: Simple to understand. Protects the actual parameter from unintended modification by the called function. Disadvantages: Copying large data structures (e.g., arrays, objects) can be inefficient in terms of time and memory. Languages: C, Java (for primitive types and object references), Pascal (default). Example:

void func(int x) {
    x = x + 1;   // x is a copy
}
int main() {
    int a = 5;
    func(a);     // a is still 5
}

2. Call-by-Reference (Call-by-Address): Mechanism: A reference (or memory address) to the actual parameter is passed to the formal parameter. The formal parameter becomes an alias for the actual parameter.
Behavior: Any modifications made to the formal parameter inside the called function directly affect the original actual parameter in the calling function. Advantages: Efficient for large data structures as no copying is involved. Allows functions to return multiple values (by modifying reference parameters). Disadvantages: Can lead to unintended side effects if the called function modifies a parameter when the caller expects it to remain unchanged. Languages: C++ (using `&`), Pascal (using `var`), Python (for mutable objects). Example: void func(int &x) { // x is a reference x = x + 1; } int main() { int a = 5; func(a); // a is now 6 } 3. Call-by-Result (Copy-Out): Mechanism: Similar to call-by-value, the formal parameter is initially a local variable. However, its final value is copied back to the actual parameter when the called function returns. Behavior: Modifications to the formal parameter affect the actual parameter, but only upon function exit. If there are multiple assignments to the same actual parameter from different formal parameters, the order of copying back matters and can lead to non-determinism. Advantages: Provides some protection during execution of the function (actual parameter is untouched until return). Disadvantages: Can be complex to implement efficiently. Order of evaluation for multiple result parameters can be an issue. Languages: Ada (for `out` parameters). Example: void func(int x) { // x is local copy initially x = x + 1; } // On return, x's value is copied back to actual parameter 4. Call-by-Value-Result (Copy-In/Copy-Out): Mechanism: A combination of call-by-value and call-by-result. The actual parameter's value is copied into the formal parameter at the start of the function, and the final value of the formal parameter is copied back to the actual parameter at the end. Behavior: Within the function, it behaves like call-by-value. Externally, it behaves like call-by-reference. Advantages: Mimics call-by-reference without the aliasing problems if the same actual parameter is passed multiple times. Disadvantages: Incurs the overhead of two copies. Still has potential for non-determinism with aliasing if an actual parameter is also a global variable or part of another parameter. Languages: Ada (for `in out` parameters), Fortran (sometimes). 5. Call-by-Name: Mechanism: The actual parameter is substituted textually (like a macro expansion) for every occurrence of the formal parameter in the function body. This is also called "lazy evaluation." Behavior: The actual parameter is re-evaluated each time it is used within the function. If the actual parameter is an expression, it might be evaluated multiple times, and its value could change if it involves global variables. Advantages: Highly flexible, allows for powerful programming constructs. Disadvantages: Very complex to implement. Can be very inefficient due to repeated evaluation of expressions. Can lead to unexpected results due to side effects. Languages: Algol 60 (historically significant, rarely used in modern languages). 2. Explain storage allocation strategies in detail. Introduction: Storage allocation strategies are fundamental mechanisms within a compiler's runtime environment that manage how memory is allocated and deallocated for program variables and data structures during program execution. The choice of strategy impacts efficiency, flexibility, and the ability to support various language features like recursion and nested scopes. 
Explanation: There are three primary storage allocation strategies: 1. Static Allocation: Mechanism: Memory for variables is allocated at compile time (or link time) and remains fixed throughout the entire execution of the program. Characteristics: Lifetime: The lifetime of statically allocated variables is the entire duration of the program. Scope: Typically used for global variables, static variables (in C/C++), and constants. Allocation: Memory addresses are fixed and known before runtime. Support for Recursion: Cannot support recursive procedures, as each recursive call would require a new, distinct set of local variables, which static allocation cannot provide. Advantages: Simple to implement. Extremely efficient as addresses are fixed and no runtime overhead for allocation/deallocation. Disadvantages: Lack of flexibility: Fixed size, cannot handle variable-sized data structures at runtime. No support for recursion or multiple instances of a procedure's local data. Memory can be wasted if variables are not always in use. Example: Global variables in C, `static` variables inside functions. 2. Stack Allocation: Mechanism: Memory for local variables and parameters of procedures is allocated and deallocated dynamically at runtime using a stack data structure. When a procedure is called, an activation record (stack frame) containing its local data is pushed onto the runtime stack. When the procedure returns, its activation record is popped. Characteristics: Lifetime: The lifetime of stack-allocated variables corresponds to the activation of the procedure in which they are declared. Scope: Primarily used for local variables and formal parameters. Allocation: Managed by a stack pointer. Allocation involves simply adjusting the stack pointer. Support for Recursion: Fully supports recursion, as each recursive call gets its own distinct activation record on the stack, containing its independent set of local variables. Advantages: Efficient allocation and deallocation (just pointer adjustments). Supports recursion and nested procedure calls. Memory is reused effectively (LIFO order). Disadvantages: Cannot allocate memory for data whose size is unknown at compile time (e.g., dynamic arrays). Cannot return pointers to local stack variables, as they become invalid once the function returns. Limited stack size can lead to stack overflow. Example: Local variables in C/C++, Java functions. 3. Heap Allocation: Mechanism: Memory is allocated and deallocated dynamically by the program during execution from a large pool of free memory called the "heap". This is typically done through explicit programmer requests (e.g., `malloc`/`free` in C, `new`/`delete` in C++, `new` in Java) or implicitly by a garbage collector. Characteristics: Lifetime: The lifetime of heap-allocated data is independent of procedure calls. It persists until explicitly deallocated or garbage collected. Scope: Used for dynamic data structures (e.g., linked lists, trees, objects) whose size or lifetime cannot be determined at compile time or whose lifetime extends beyond the scope of a single function call. Allocation: Complex; involves searching for available blocks of memory, managing free lists, and potentially garbage collection. Support for Recursion: Can be used within recursive functions for data that needs to persist across calls or be shared. Advantages: Highly flexible: Supports dynamic data structures and objects with arbitrary lifetimes. Allows data to outlive the function that created it. 
Disadvantages: Less efficient than stack allocation due to overhead of managing free memory. Prone to memory leaks (if not deallocated) or dangling pointers (if deallocated prematurely). Can suffer from fragmentation. Example: Objects in Java, dynamically allocated arrays in C++ (`new int[10]`). 3. Explain how to store name in a symbol table. Introduction: The symbol table is a crucial data structure in a compiler that stores information about identifiers (names) used in the source program. Efficient storage and retrieval of names are vital for the compiler's performance. Names are typically stored as strings, and various techniques are used to manage these strings and their associated attributes. Explanation: Storing names in a symbol table involves two main aspects: Storing the string representation of the name itself. Associating attributes with that name. 1. Storing the String Representation of Names: The actual lexemes (string form) of identifiers need to be stored. Common methods include: Fixed-Length Arrays: Each identifier string is stored in a fixed-size slot. Disadvantage: Wastes space for short names, truncates long names. Not practical. Variable-Length Arrays (String Table): All identifier strings are stored consecutively in a large character array (often called a "string table" or "lexical buffer"). The symbol table entry for an identifier then stores a pointer (or index) to the beginning of its string in this string table, along with its length. Advantages: Efficient use of space for strings. Disadvantages: Requires careful management of the string table to avoid fragmentation if strings are deleted/modified. Example: For `count`, `total`, `sum`: String Table: `c o u n t t o t a l s u m \0` Symbol Table entry for `count` might store `(index: 0, length: 5)`. Symbol Table entry for `total` might store `(index: 5, length: 5)`. 2. Associating Attributes with Names (Symbol Table Structure): The symbol table itself is typically a collection of entries, where each entry corresponds to an identifier and holds its attributes. Various data structures can be used to implement the symbol table, each with its own trade-offs in terms of search time, insertion time, and space usage. Linear List (Unordered/Ordered Array/Linked List): Mechanism: Entries are stored sequentially. Search: Linear search (O(n)). Slow for large tables. Insertion: O(1) for unordered, O(n) for ordered. Use: Suitable for very small symbol tables or during initial phases where efficiency isn't critical. Binary Search Tree (BST): Mechanism: Entries are stored in a tree structure, ordered by name. Search/Insertion: O(log n) on average. Worst case O(n) for skewed trees. Use: Better performance than linear lists, but can degrade. Hash Table: Mechanism: A hash function maps each name to an index in an array (hash table). Collisions (multiple names mapping to the same index) are handled using techniques like chaining (linked lists at each index) or open addressing. Search/Insertion: Average O(1), worst case O(n) (if many collisions and poor hash function). Use: Most common and efficient method for symbol tables in practical compilers due to its average-case constant time performance. Scoped Symbol Tables (for block-structured languages): In languages with nested scopes, a single flat symbol table is insufficient. A stack of symbol tables (one for each active scope) or a single symbol table with scope information is used. 
When an identifier is looked up, the current scope's table is checked first, then its enclosing scope, and so on. When entering a new scope, a new table is pushed or a new scope marker is added. When exiting, the table/marker is popped. Symbol Table Entry Structure: Each entry for an identifier typically contains: Name: A pointer/index to the string table or the string itself. Type: Data type (e.g., `int`, `float`, `array`, `function`). Scope: The block or procedure where the name is declared. Storage Information: Memory address, offset from frame pointer, register number. Kind: Variable, constant, function, array, struct, etc. Other Attributes: e.g., number of parameters for a function, dimension for an array, whether it's a pointer. 4. Define Hashing technique in symbol table organization. Introduction: Hashing is the most widely used and efficient technique for organizing and accessing entries in a symbol table. It provides nearly constant-time average performance for insertion, deletion, and lookup operations, making it ideal for compilers that frequently access symbol information. Explanation: Hashing Technique: The core idea of hashing is to map an identifier's name (a string) to an integer index (hash value) within a fixed-size array, known as the hash table. This index points to the location where the identifier's information (symbol table entry) is stored. Components of a Hashing Scheme: 1. Hash Function: Purpose: Takes the identifier's string as input and computes an integer hash value. The goal is to distribute names uniformly across the hash table to minimize collisions. Qualities of a Good Hash Function: Fast to compute. Minimizes collisions (maps different keys to different indices). Distributes keys evenly across the hash table. Common Hash Function Examples: Summation of ASCII values: Sum the ASCII values of characters in the string and take modulo table size. `hash(s) = (sum(s[i])) % table_size` Weighted Summation: Assign weights (e.g., powers of a prime number) to characters. `hash(s) = (sum(s[i] * p^i)) % table_size` Folding: Divide the key into parts, combine them (e.g., sum, XOR), then apply modulo. 2. Hash Table Structure: An array of pointers or records, where each element is a "bucket" or "slot". The size of the hash table is usually a prime number to help with even distribution. 3. Collision Resolution Strategies: A collision occurs when two different identifiers hash to the same index in the hash table. Effective collision resolution is crucial for maintaining performance. Chaining (Separate Chaining): Mechanism: Each slot in the hash table points to the head of a linked list. When a collision occurs, the new identifier's entry is added to the linked list at that hash table index. Advantages: Simple to implement. Table never "fills up" (can accommodate any number of entries). Deletion is relatively easy. Disadvantages: Requires extra space for pointers in linked lists. Search time can degrade to O(n) in the worst case (all items in one list). Open Addressing (Probing): Mechanism: When a collision occurs, the lookup/insertion routine searches for the next available empty slot in the hash table using a probing sequence. Types of Probing: Linear Probing: Search sequentially for the next empty slot: `(hash_value + i) % table_size`. Quadratic Probing: Use a quadratic function for the step size: `(hash_value + i^2) % table_size`. Double Hashing: Use a second hash function to determine the step size: `(hash_value + i * hash2(key)) % table_size`. Advantages: No extra space for pointers. Better cache performance due to locality. Disadvantages: Can suffer from primary clustering (linear probing) or secondary clustering (quadratic probing). Deletion is complex (requires "tombstone" markers). Table can fill up, requiring rehashing. Example (Chaining): Hash Table Size = 10. Hash Function: `sum_of_ascii_values % 10`. Identifiers: `sum`, `count`, `total`, `main`. Assume: `hash('sum') = 3`, `hash('count') = 3`, `hash('total') = 7`, `hash('main') = 3`.

Index | List of Entries
------+------------------------------------------------
  0   |
  1   |
  2   |
  3   | -> (sum_entry) -> (count_entry) -> (main_entry)
  4   |
  5   |
  6   |
  7   | -> (total_entry)
  8   |
  9   |
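The chaining scheme above can be sketched in a few lines of C++. This is an illustrative sketch only: the `Symbol` fields, the default table size of 211, and the multiplicative character hash are assumptions made for the example, not a required design.

#include <iostream>
#include <list>
#include <string>
#include <vector>

// One symbol-table entry; the fields here are illustrative.
struct Symbol {
    std::string name;
    std::string type;
    int offset;
};

class SymbolTable {
    std::vector<std::list<Symbol>> buckets;   // one chain per slot

    // Simple weighted character sum, reduced modulo the table size.
    std::size_t hash(const std::string& s) const {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * 31 + c;
        return h % buckets.size();
    }

public:
    explicit SymbolTable(std::size_t size = 211) : buckets(size) {}

    // Insert at the head of the chain for the hashed slot.
    void insert(const Symbol& sym) { buckets[hash(sym.name)].push_front(sym); }

    // Walk the chain for the hashed slot; nullptr means "not declared".
    const Symbol* lookup(const std::string& name) const {
        for (const Symbol& s : buckets[hash(name)])
            if (s.name == name) return &s;
        return nullptr;
    }
};

int main() {
    SymbolTable table;
    table.insert({"count", "int", 0});
    table.insert({"total", "float", 4});
    if (const Symbol* s = table.lookup("count"))
        std::cout << s->name << " : " << s->type
                  << " at offset " << s->offset << "\n";
}

A production compiler would layer per-scope tables (or a scope field on each entry) on top of this, as described above, but the hash-and-chain core stays the same.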
5. Explain Activation Records in detail. Introduction: An activation record, also known as a stack frame, is a data structure that contains essential information about a single activation (execution) of a procedure (function or method). It is pushed onto the runtime stack when a procedure is called and popped when the procedure returns, playing a central role in managing the runtime environment of programs, especially in languages that support recursion and nested scopes. Explanation: The structure of an activation record can vary slightly depending on the language and architecture, but it typically includes the following components: Components of an Activation Record: Return Value: Space for the value returned by the called procedure to the caller. Often, this is a dedicated register or a location known to the caller. Actual Parameters: The values or references of the arguments passed by the calling procedure to the called procedure. These are placed here by the caller. Control Link (Dynamic Link): A pointer to the activation record of the caller. This link forms a chain of activation records on the stack, representing the dynamic (calling) sequence of procedure calls. Used to return control to the caller upon procedure completion. Access Link (Static Link, Environment Pointer): A pointer to the activation record of the lexically (statically) enclosing scope. This link is used in languages with block-structured scope rules (e.g., Pascal, Ada, C-like languages) to access non-local variables. For nested procedures, it forms a chain that allows access to variables in outer scopes that are not the direct caller. Saved Machine Status: Includes the program counter (return address), which points to the instruction in the caller's code where execution should resume after the call. Also includes values of machine registers that must be restored for the caller to continue its execution. Local Data: Space for local variables and temporaries declared within the procedure. These are unique to each activation of the procedure. Temporaries: Storage for intermediate results of complex expressions that cannot be held in registers.

Diagram: Structure of an Activation Record (one frame on the runtime stack):

Return Value
Actual Parameters
Control Link        --> previous (caller's) activation record
Access Link         --> lexically enclosing scope's activation record
Saved Machine Status (return address, registers)
Local Data
Temporaries         <-- top of stack (stack pointer)

Runtime Stack and Activation Records: When a program starts, the `main` function (or equivalent) gets the first activation record on the stack. When `Procedure A` calls `Procedure B`, `B`'s activation record is pushed on top of `A`'s. When `B` returns, its record is popped, and control returns to `A`. This LIFO (Last-In, First-Out) behavior perfectly supports procedure calls and returns, including recursion.
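To make the frame layout concrete, the following C++ sketch models the runtime stack as an explicit vector of records and pushes one frame per activation of a recursive `fact` call. Everything here (the struct layout, the integer stand-in for the return address, the `fact` example, the simplified access link) is an illustration of the idea; a real compiler lays these fields out implicitly on the hardware stack.

#include <iostream>
#include <string>
#include <vector>

// Schematic activation record: one field per component listed above.
struct ActivationRecord {
    std::string procedure;    // whose activation this is
    int return_address;       // saved machine status, simplified to an int
    int control_link;         // index of the caller's record on the stack
    int access_link;          // index of the lexically enclosing record
    std::vector<int> params;  // actual parameters
    std::vector<int> locals;  // local data and temporaries
};

std::vector<ActivationRecord> runtime_stack;   // explicit model of the stack

// Recursive factorial whose calls are mirrored by explicit frames.
int fact(int n) {
    runtime_stack.push_back({"fact", 0,
                             static_cast<int>(runtime_stack.size()) - 1, // caller
                             0,                                          // main's frame (simplified)
                             {n}, {}});
    std::cout << "enter fact(" << n << "), stack depth = "
              << runtime_stack.size() << "\n";
    int result = (n <= 1) ? 1 : n * fact(n - 1);
    runtime_stack.pop_back();                   // frame discarded on return
    return result;
}

int main() {
    runtime_stack.push_back({"main", 0, -1, -1, {}, {}});
    std::cout << "fact(3) = " << fact(3) << "\n";
}

Running it prints the stack depth growing by one per recursive activation and shrinking as each call returns, mirroring the push/pop behaviour described above.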
Management of Activation Records: Caller's Responsibility: Typically, the caller (calling procedure) is responsible for pushing actual parameters, saving registers it needs, and setting up the control link (dynamic link) and possibly the access link (static link) for the callee. It also updates the stack pointer. Callee's Responsibility: The callee (called procedure) is responsible for saving any additional registers it modifies, allocating space for its local variables, and setting up its own stack frame. Before returning, it restores saved registers, places the return value, and restores the stack pointer to the caller's frame. Unit 5: Short Questions (2 Marks Each) 1. What is Basic Block? Introduction: A basic block is a fundamental concept in code optimization. Explanation: It is a sequence of consecutive three-address intermediate code statements where control enters only at the beginning of the sequence and leaves only at the end without any possibility of branching except at the end. In simpler terms, once the first statement in a basic block is executed, all subsequent statements in that block are executed sequentially without any jumps or calls until the very last statement, which might be a conditional or unconditional jump. 2. Define loop optimization and loop invariant computation? Introduction: Loop optimization is a critical aspect of code optimization, focusing on improving the performance of loops, which are often the most time-consuming parts of a program. Explanation: Loop Optimization: A set of techniques applied to loops to reduce their execution time. This includes transformations like code motion, induction variable elimination, strength reduction, and loop unrolling. Loop Invariant Computation: A specific loop optimization technique where expressions whose values do not change during the execution of a loop (loop-invariant computations) are identified. These computations are then moved outside the loop, typically to a pre-header block, so they are computed only once instead of repeatedly in each iteration. 3. Define issues in design code generator? Introduction: Designing an effective code generator, the final phase of a compiler, involves addressing several complex issues to produce high-quality target code. Explanation: Key issues include: Instruction Selection: Choosing the appropriate machine instructions for each intermediate code operation to achieve efficiency. Register Allocation: Deciding which variables should reside in CPU registers (fast access) and which in memory (slow access), and managing register spills/fills. Instruction Ordering: Sequencing instructions to avoid pipeline stalls and maximize parallelism. Target Machine Characteristics: Accounting for specific architectural features, instruction costs, and addressing modes of the target machine. Runtime Storage Management: Ensuring correct allocation and deallocation of memory for variables. Error Handling: Propagating information about errors to the user. 4. Define sources of optimization? Introduction: Code optimization aims to improve the efficiency (speed, memory, power) of the generated target code without changing its observable behavior. Optimizations can be applied at various levels and target different aspects of the program. Explanation: Key sources of optimization include: Local Optimization: Within a basic block (e.g., common subexpression elimination, dead code elimination). Loop Optimization: Specifically targeting loops (e.g., loop invariant code motion, strength reduction). 
Global Optimization: Across basic blocks within a single function (e.g., data-flow analysis, constant propagation). Interprocedural Optimization: Across multiple functions or the entire program. Machine-Dependent Optimization: Leveraging specific features of the target CPU (e.g., instruction scheduling, register allocation). Machine-Independent Optimization: Transformations that are not specific to any particular architecture (e.g., algebraic simplifications). 5. Dead code elimination? Introduction: Dead code elimination is a fundamental code optimization technique aimed at improving program efficiency and reducing executable size. Explanation: Dead code refers to any part of a program that is never executed or whose results are never used in any subsequent computation. The elimination technique identifies such unreachable or useless code segments and removes them from the intermediate or target code. This includes statements that assign values to variables that are never read, or entire blocks of code that are never branched to. Identifying dead code often requires data-flow analysis to track variable definitions and uses. Example: int x = 10; int y = 20; // y is dead if never used later x = x + 1; // if (false) { ... dead code ... } Unit 5: Long Questions (10 Marks Each) 1. Explain Peephole optimization in detail. Introduction: Peephole optimization is a simple yet effective technique for improving target code efficiency. It works by examining a small sliding window (a "peephole") of target instructions and replacing suboptimal instruction sequences with shorter or faster equivalents. Explanation: Mechanism: The code generator typically produces a sequence of assembly or machine instructions. Peephole optimization operates on this generated code. It involves a small, localized analysis of a few (typically 2-4) adjacent instructions. If a recognized pattern of inefficient instructions is found, it is replaced by a known sequence of more efficient instructions that achieve the same result. This process is repeated by sliding the peephole window across the entire code, potentially multiple times, as one optimization might open up opportunities for further optimizations. Characteristics: Local Optimization: It is a local optimization technique, operating on a very small segment of code. Machine Dependent: The specific patterns and replacements are highly dependent on the target machine's instruction set, addressing modes, and performance characteristics. Iterative: Often applied repeatedly because one transformation can create new opportunities for others. Common Types of Peephole Optimizations: 1. Redundant Instruction Elimination: Removing instructions that compute a value that is already available or whose result is never used. Example: MOV R0, a ; Load 'a' into R0 MOV R0, a ; Redundant, R0 already holds 'a' Can be optimized to just `MOV R0, a`. 2. Flow-of-Control Optimizations: Simplifying or eliminating redundant jumps. Example: GOTO L1 ... L1: GOTO L2 Can be optimized to `GOTO L2` directly. Example: IF condition GOTO L1 GOTO L2 L1: ... Can be optimized to: IF NOT condition GOTO L2 L1: ... 3. Algebraic Simplification and Strength Reduction: Replacing expensive operations with cheaper ones. Example: ADD R1, R1, #0 ; Adding zero is useless MUL R1, R1, #1 ; Multiplying by one is useless Eliminate these. 
Unit 5: Long Questions (10 Marks Each)

1. Explain Peephole optimization in detail.
Introduction: Peephole optimization is a simple yet effective technique for improving target code efficiency. It works by examining a small sliding window (a "peephole") of target instructions and replacing suboptimal instruction sequences with shorter or faster equivalents.
Explanation:
Mechanism: The code generator typically produces a sequence of assembly or machine instructions, and peephole optimization operates on this generated code. It involves a small, localized analysis of a few (typically 2-4) adjacent instructions. If a recognized pattern of inefficient instructions is found, it is replaced by a known sequence of more efficient instructions that achieves the same result. This process is repeated by sliding the peephole window across the entire code, potentially multiple times, as one optimization may open up opportunities for further optimizations.
Characteristics:
Local Optimization: It is a local technique, operating on a very small segment of code at a time.
Machine Dependent: The specific patterns and replacements depend heavily on the target machine's instruction set, addressing modes, and performance characteristics.
Iterative: Often applied repeatedly, because one transformation can create new opportunities for others.
Common Types of Peephole Optimizations:
1. Redundant Instruction Elimination: Removing instructions that compute a value that is already available or whose result is never used.
Example:
MOV R0, a   ; Load 'a' into R0
MOV R0, a   ; Redundant, R0 already holds 'a'
This can be optimized to a single `MOV R0, a`.
2. Flow-of-Control Optimizations: Simplifying or eliminating redundant jumps.
Example:
GOTO L1
...
L1: GOTO L2
can be optimized to `GOTO L2` directly.
Example:
IF condition GOTO L1
GOTO L2
L1: ...
can be optimized to:
IF NOT condition GOTO L2
L1: ...
3. Algebraic Simplification and Strength Reduction: Replacing useless or expensive operations with cheaper ones.
Example:
ADD R1, R1, #0   ; Adding zero is useless
MUL R1, R1, #1   ; Multiplying by one is useless
Both instructions can simply be eliminated.
Example (Strength Reduction):
MUL R1, R2, #2   ; Multiply by 2
can be replaced by
ADD R1, R2, R2   ; Add R2 to itself (faster on some architectures)
or
SHL R1, R2, #1   ; Left shift by 1 (often faster still)
4. Use of Machine Idioms: Replacing a sequence of instructions with a single, specialized machine instruction that performs the same task more efficiently.
Example: Incrementing a register by 1 might be expressed as a generic `ADD R1, R1, #1` or a dedicated `INC R1` instruction; the peephole optimizer would prefer `INC R1`.
5. Address Mode Optimization: Replacing complex address computations with simpler, more efficient addressing modes provided by the target architecture.
Example: A single displacement-mode load such as `MOV R1, offset(R0)` is better than first computing the address with `ADD R0, R0, #offset` and then loading with `MOV R1, (R0)`.
Implementation: Peephole optimizers are often implemented as a set of rules, where each rule specifies a pattern of instructions to match and its replacement. A typical implementation (sketched below) involves:
Loading a window of instructions.
Matching the instructions against known patterns.
If a match is found, replacing the sequence and re-evaluating the current window (or a slightly adjusted one, as the code might have changed).
Sliding the window to the next position.
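A minimal sketch of such a rule-driven, iterate-until-stable optimizer is shown below in Python; it implements only the redundant-load pattern from type 1, and the instruction strings and the `peephole` function name are assumptions for illustration.

```python
def peephole(code):
    """Repeatedly scan the instruction list with a two-instruction
    window and drop the second of two identical adjacent loads
    (the redundant-load pattern from type 1 above)."""
    changed = True
    while changed:                      # one rewrite may expose another
        changed = False
        out, i = [], 0
        while i < len(code):
            if (i + 1 < len(code)
                    and code[i].startswith("MOV")
                    and code[i + 1] == code[i]):
                out.append(code[i])     # keep one copy, skip the duplicate
                i += 2
                changed = True
            else:
                out.append(code[i])
                i += 1
        code = out
    return code

print(peephole(["MOV R0, a", "MOV R0, a", "ADD R1, R0, #1"]))
# -> ['MOV R0, a', 'ADD R1, R0, #1']
```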
2. Construct DAG for the following basic block:
B10:
S1 = 4 x I
S2 = addr(A) - 4
S3 = S2[S1]      // A[4*I]
S4 = 4 x I
S5 = addr(B) - 4
S6 = S5[S4]      // B[4*I]
S7 = S3 x S6
S8 = PROD + S7
PROD = S8
S9 = I + 1
I = S9
if I <= 20 goto B10
Introduction: A Directed Acyclic Graph (DAG) is a powerful intermediate representation for basic blocks, used to perform local optimizations such as common subexpression elimination, dead code elimination, and code reordering. It visualizes the flow of values and dependencies within a sequence of straight-line code.
Explanation: We construct the DAG by processing each statement sequentially, keeping track of the current node for each variable and identifying common subexpressions. We use the algorithm described in Unit 3, Q5 (a compact programmatic sketch of the same bookkeeping follows this answer).
Initial state: `var_node = {}`, `node_value = {}`. Assume `I`, `A`, `B`, `PROD` are initially defined (leaf nodes).
1. `S1 = 4 x I`
`node_4` = leaf('4'), `node_I` = leaf('I')
Create `N1 = x(node_4, node_I)`
Labels: `I`->`node_I`, `S1`->`N1`
`var_node = {I:node_I, S1:N1}`
`node_value = {(x,node_4,node_I):N1}`
2. `S2 = addr(A) - 4`
`node_addrA` = leaf('addr(A)'), `node_4` = leaf('4')
Create `N2 = -(node_addrA, node_4)`
Labels: `I`->`node_I`, `S1`->`N1`, `S2`->`N2`
`var_node = {I:node_I, S1:N1, S2:N2}`
`node_value = {..., (-,node_addrA,node_4):N2}`
3. `S3 = S2[S1]` (array access: base plus index)
`node_S2` = `var_node['S2']` = `N2`, `node_S1` = `var_node['S1']` = `N1`
Create `N3 = [](N2, N1)` (array access operator)
Labels: `...`, `S3`->`N3`
`var_node = {..., S3:N3}`
`node_value = {..., ([],N2,N1):N3}`
4. `S4 = 4 x I`
`node_4` = leaf('4'), `node_I` = `var_node['I']` = `node_I` (the leaf)
Check `node_value` for `(x, node_4, node_I)`. It is `N1`, so this is a common subexpression; reuse `N1`.
Labels: `...`, `S4`->`N1` (now `N1` is labeled `S1, S4`)
`var_node = {..., S4:N1}`
5. `S5 = addr(B) - 4`
`node_addrB` = leaf('addr(B)'), `node_4` = leaf('4')
Create `N4 = -(node_addrB, node_4)`
Labels: `...`, `S5`->`N4`
`var_node = {..., S5:N4}`
`node_value = {..., (-,node_addrB,node_4):N4}`
6. `S6 = S5[S4]`
`node_S5` = `var_node['S5']` = `N4`, `node_S4` = `var_node['S4']` = `N1`
Create `N5 = [](N4, N1)`
Labels: `...`, `S6`->`N5`
`var_node = {..., S6:N5}`
`node_value = {..., ([],N4,N1):N5}`
7. `S7 = S3 x S6`
`node_S3` = `var_node['S3']` = `N3`, `node_S6` = `var_node['S6']` = `N5`
Create `N6 = x(N3, N5)`
Labels: `...`, `S7`->`N6`
`var_node = {..., S7:N6}`
`node_value = {..., (x,N3,N5):N6}`
8. `S8 = PROD + S7`
`node_PROD` = leaf('PROD'), `node_S7` = `var_node['S7']` = `N6`
Create `N7 = +(node_PROD, N6)`
Labels: `...`, `S8`->`N7`
`var_node = {..., S8:N7}`
`node_value = {..., (+,node_PROD,N6):N7}`
9. `PROD = S8`
Remove `PROD` from the leaf `node_PROD`'s labels.
`node_S8` = `var_node['S8']` = `N7`
Labels: `...`, `PROD`->`N7` (now `N7` is labeled `S8, PROD`)
`var_node = {..., PROD:N7}`
10. `S9 = I + 1`
`node_I` = `var_node['I']` = `node_I` (the leaf), `node_1` = leaf('1')
Create `N8 = +(node_I, node_1)`
Labels: `...`, `S9`->`N8`
`var_node = {..., S9:N8}`
`node_value = {..., (+,node_I,node_1):N8}`
11. `I = S9`
Remove `I` from the leaf `node_I`'s labels.
`node_S9` = `var_node['S9']` = `N8`
Labels: `...`, `I`->`N8` (now `N8` is labeled `S9, I`)
`var_node = {..., I:N8}`
12. `if I <= 20 goto B10`
This is a control-flow statement; it does not assign a value that needs a node in the main DAG. Its operands (`I`, `20`) refer to existing nodes.
Final DAG (interior nodes with their children and labels):
Leaves: `4`, `I`, `addr(A)`, `addr(B)`, `PROD`, `1`
N1: x(4, I)            labels: S1, S4
N2: -(addr(A), 4)      labels: S2
N3: [](N2, N1)         labels: S3
N4: -(addr(B), 4)      labels: S5
N5: [](N4, N1)         labels: S6
N6: x(N3, N5)          labels: S7
N7: +(PROD, N6)        labels: S8, PROD
N8: +(I, 1)            labels: S9, I
Analysis and Optimizations from the DAG:
Common Subexpression Elimination: `4 x I` (computed for `S1`) is reused for `S4`; the DAG reflects this by `S1` and `S4` pointing to the same `x` node.
Dead Code Elimination: Any temporary (`S1` through `S9`) whose label is not attached to a final variable (like `I` or `PROD`) and that is not an operand of a subsequent live computation could be eliminated. For example, if `S3` were not used in `S7` and were not live on exit, its computation could be removed.
Variable Renaming / Copy Propagation: `S8 = PROD + S7` followed by `PROD = S8` simplifies to `PROD = PROD + S7`; the DAG shows `S8` and `PROD` pointing to the same `+` node. Likewise, `S9 = I + 1` followed by `I = S9` simplifies to `I = I + 1`, with `S9` and `I` pointing to the same `+` node.
Optimized Basic Block (derived from the DAG):
B10:
t1 = 4 x I            // N1, for S1 and S4
S2 = addr(A) - 4      // N2
S3 = S2[t1]           // N3
S5 = addr(B) - 4      // N4
S6 = S5[t1]           // N5
S7 = S3 x S6          // N6
PROD = PROD + S7      // N7 (S8, PROD)
I = I + 1             // N8 (S9, I)
if I <= 20 goto B10
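For reference, here is a compact Python sketch of the same bookkeeping, assuming statements of the form `x = a op b`; the names `nodes`, `node_of`, `var_node`, and `labels` are my own and are meant to mirror the `node_value`/`var_node` tables used in the walkthrough. It creates one node per distinct (operator, operand-nodes) triple, reuses a node when the triple recurs (common subexpression), and moves a variable's label when it is reassigned.

```python
def build_dag(block):
    nodes = []        # node id -> ("leaf", name) or (op, left_id, right_id)
    node_of = {}      # structural key -> node id (enables CSE)
    var_node = {}     # variable name -> id of the node currently holding its value
    labels = {}       # node id -> set of variable names attached to it

    def leaf(name):
        key = ("leaf", name)
        if key not in node_of:
            node_of[key] = len(nodes)
            nodes.append(key)
        return node_of[key]

    def operand(x):
        # a variable refers to its current node; constants/fresh names become leaves
        return var_node.get(x, leaf(x))

    for target, op, a, b in block:
        key = (op, operand(a), operand(b))
        if key not in node_of:            # an existing key means a common subexpression
            node_of[key] = len(nodes)
            nodes.append(key)
        n = node_of[key]
        for s in labels.values():         # a reassigned name leaves its old node
            s.discard(target)
        labels.setdefault(n, set()).add(target)
        var_node[target] = n
    return nodes, labels

# S1 = 4 x I followed by S4 = 4 x I: both names end up on the same node.
nodes, labels = build_dag([("S1", "x", "4", "I"), ("S4", "x", "4", "I")])
print(labels)   # e.g. {2: {'S1', 'S4'}}
```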
3. How to find leader, nodes, and edges in a basic block?
Introduction: Identifying leaders, nodes, and edges is a crucial step in dividing intermediate code into basic blocks and then constructing a control-flow graph (CFG). This partitioning is fundamental for many compiler optimizations, as optimizations are often applied within basic blocks or across basic blocks using the CFG.
Explanation:
1. Finding Leaders: A "leader" is the first statement of a basic block. The following rules identify leaders:
The first statement of the program (or function) is a leader.
Any statement that is the target of a conditional or unconditional jump (GOTO) is a leader.
Any statement that immediately follows a conditional or unconditional jump is a leader.
Once all leaders are identified, each basic block consists of a leader and all subsequent statements up to (but not including) the next leader, or the end of the program/function.
Example: Consider the following three-address code:
1. t1 = a + b
2. t2 = c * d
3. if t1 > t2 goto 6
4. t3 = x - y
5. goto 8
6. t4 = u / v
7. t5 = t1 + t4
8. print t5
Leaders identified:
Statement 1 is a leader (first statement).
Statement 4 is a leader (it immediately follows the conditional jump in statement 3).
Statement 6 is a leader (it is the target of `goto 6` in statement 3; it also immediately follows `goto 8` in statement 5).
Statement 8 is a leader (it is the target of `goto 8` in statement 5).
So the leaders are statements 1, 4, 6, and 8.
Basic Blocks:
B1: statements 1, 2, 3 (from leader 1 up to, but not including, leader 4)
B2: statements 4, 5 (from leader 4 up to, but not including, leader 6)
B3: statements 6, 7 (from leader 6 up to, but not including, leader 8)
B4: statement 8 (from leader 8 to the end)
2. Finding Nodes (Basic Blocks) for the Control-Flow Graph (CFG): Once leaders are identified, each basic block formed by a leader and its subsequent statements becomes a node in the CFG. Each basic block identified above (B1, B2, B3, B4) is therefore a node.
3. Finding Edges for the Control-Flow Graph (CFG): Edges in the CFG represent the possible flow of control between basic blocks. An edge exists from block $B_i$ to block $B_j$ if control can pass from the last statement of $B_i$ to the first statement of $B_j$:
If the last statement of $B_i$ is an unconditional jump `goto L`, and `L` is the first statement of $B_j$, there is an edge from $B_i$ to $B_j$.
If the last statement of $B_i$ is a conditional jump `if condition goto L`, and `L` is the first statement of $B_j$, there are two edges: one from $B_i$ to $B_j$ (taken when the condition is true), and one from $B_i$ to $B_k$, where $B_k$ is the block immediately following $B_i$ in the program text (taken when the condition is false).
If the last statement of $B_i$ is a non-jump statement (e.g., an assignment or a call), there is an edge from $B_i$ to the block immediately following $B_i$ in the program text.
Example (continued):
From B1 (ends with `if t1 > t2 goto 6`): an edge to B3 (statement 6 is the leader of B3) and an edge to B2 (B2 immediately follows B1).
From B2 (ends with `goto 8`): an edge to B4 (statement 8 is the leader of B4).
From B3 (ends with `t5 = t1 + t4`, a non-jump statement): an edge to B4 (B4 immediately follows B3).
From B4 (ends with `print t5`, the last statement): no outgoing edges within this function's CFG (control returns or exits).
Control-Flow Graph (conceptual):
      B1
     /  \
    v    v
   B2    B3
     \  /
      v
      B4
A small sketch of the leader-finding and block-partitioning rules follows this answer.
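As a concrete rendering of the leader rules, here is a short Python sketch (my own, not from the source) that finds leaders and partitions the numbered statements of the example into blocks; it assumes jump targets are written as 1-based statement numbers, exactly as in the example above.

```python
import re

def find_leaders(stmts):
    """Apply the three leader rules to a list of numbered statements."""
    leaders = {1}                              # rule 1: first statement
    for i, s in enumerate(stmts, start=1):
        m = re.search(r"goto (\d+)", s)
        if m:                                  # statement i is a jump
            leaders.add(int(m.group(1)))       # rule 2: the jump target
            if i < len(stmts):
                leaders.add(i + 1)             # rule 3: statement after the jump
    return sorted(leaders)

def basic_blocks(stmts):
    """Each block runs from a leader up to (not including) the next leader."""
    leaders = find_leaders(stmts)
    bounds = leaders + [len(stmts) + 1]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(len(leaders))]

stmts = [
    "t1 = a + b", "t2 = c * d", "if t1 > t2 goto 6",
    "t3 = x - y", "goto 8", "t4 = u / v",
    "t5 = t1 + t4", "print t5",
]
print(find_leaders(stmts))    # [1, 4, 6, 8]
print(basic_blocks(stmts))    # [[1, 2, 3], [4, 5], [6, 7], [8]]
```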
4. Explain Transformation in optimization?
Introduction: Code optimization involves applying various transformations to the intermediate code (or sometimes the target code) to improve its performance (speed, space, power consumption) without altering the observable behavior of the program. These transformations are the core mechanisms through which compilers make programs more efficient.
Explanation: A transformation is a rule or algorithm that systematically modifies a program segment. For an optimizing transformation to be valid, it must satisfy two key properties:
Preservation of Meaning (Correctness): The transformation must not change the output or the intended side effects of the program for any valid input. This is paramount; an optimization that breaks the program is useless.
Improvement in Performance: The transformation should, on average, lead to better performance (e.g., faster execution, smaller code size, less memory usage). Occasionally a transformation may make a particular code sequence slightly worse but enable larger, more beneficial transformations later.
Transformations can be categorized based on their scope and the type of analysis they require:
I. Local Transformations (within a basic block): These operate on a single basic block, where control flow is strictly sequential. (A small sketch of constant folding with constant propagation is given at the end of this answer.)
Common Subexpression Elimination (CSE): If an expression `x + y` is computed, and later `x + y` is computed again, the second computation can be replaced by reusing the result of the first, provided neither operand (`x` or `y`) has changed value in between.
Example:
a = b + c
d = b + c
can be optimized to
a = b + c
d = a
Dead Code Elimination: Remove statements whose computed values are never used subsequently.
Example: If `x = y + z` is followed by `x = a + b`, and `x` is not used between the two statements, the first statement is dead.
Copy Propagation: If `x = y`, subsequent uses of `x` can be replaced by `y`, provided neither `x` nor `y` is redefined in between.
Example:
x = y
z = x + 1
can be optimized to
x = y
z = y + 1
(If `x` then becomes dead, it can be eliminated.)
Constant Folding: Replace expressions whose operands are all constants with their computed value at compile time.
Example: `a = 5 + 3` becomes `a = 8`.
Algebraic Simplification: Apply algebraic identities to simplify expressions.
Example: `x = y * 1` becomes `x = y`; `x = y + 0` becomes `x = y`.
II. Global Transformations (across basic blocks within a procedure): These require data-flow analysis to track information across basic block boundaries.
Loop-Invariant Code Motion: Move computations whose operands do not change within a loop body outside the loop, computing them only once.
Example:
for (i = 0; i < n; i++) { x = y + z; /* ... uses x ... */ }
can be optimized to
x = y + z; for (i = 0; i < n; i++) { /* ... uses x ... */ }
Strength Reduction: Replace expensive operations (such as multiplication) with cheaper ones (such as addition or shifts) inside loops, especially for induction variables.
Example: In a loop, `i * 4` might be replaced by repeatedly computing `temp = temp + 4`, where `temp` is initialized appropriately before the loop.
Induction Variable Elimination: Replace or eliminate induction variables (variables whose values form an arithmetic progression within a loop), for example by expressing the loop control and related expressions in terms of a single surviving induction variable.
III. Interprocedural Transformations (across procedures): These require analyzing the entire program.
Inlining: Replace a function call with the body of the called function.
Advantages: Eliminates call overhead and enables further local optimizations within the inlined code.
Disadvantages: Can increase code size.
IV. Machine-Dependent Transformations: These optimizations leverage specific features of the target architecture.
Register Allocation: Assigning frequently used variables to CPU registers.
Instruction Scheduling: Reordering instructions to minimize pipeline stalls.
Peephole Optimization: Local, pattern-based optimization of a few adjacent instructions (as discussed in Q1).
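To ground the local transformations above, here is a minimal Python sketch (an illustration under assumed conventions, not a compiler's actual code) that combines constant folding with simple constant propagation over a basic block of `(target, op, arg1, arg2)` tuples; the `const` marker and the tuple layout are assumptions.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def constant_fold(block):
    """Fold x = c1 op c2 at compile time and propagate the resulting
    constants into later statements of the same basic block."""
    consts, out = {}, []
    for target, op, a, b in block:
        a = consts.get(a, a)               # constant propagation into operands
        b = consts.get(b, b)
        if op in OPS and isinstance(a, int) and isinstance(b, int):
            value = OPS[op](a, b)          # evaluated once, at "compile time"
            consts[target] = value
            out.append((target, "const", value, None))
        else:
            consts.pop(target, None)       # target is no longer a known constant
            out.append((target, op, a, b))
    return out

print(constant_fold([("a", "+", 5, 3), ("b", "*", "a", 2)]))
# -> [('a', 'const', 8, None), ('b', 'const', 16, None)]
```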