### UNIT 1 – INTRODUCTION TO COMPILERS This unit covers the fundamental concepts of compilers, including their structure, phases, and related language processing tools. It also delves into lexical analysis, regular expressions, finite automata, and methods for converting regular expressions to automata, along with DFA minimization. ### 1. Translators * **Definition:** A translator is a computer program that converts a program written in one programming language (source language) into a functionally equivalent program in another computer language (target language). The key is to preserve the functional or logical structure ("essence") of the original code. * **Purpose:** To make programs written in high-level languages executable by a machine, which understands only machine code. * **Analogy:** Similar to a human language translator converting a document from one language to another. ### 2. Types of Language Translators There are several types of widely used translators that convert computer program code into machine code. #### Assembler * **Definition:** An assembler translates assembly language programs into machine code. * **Input:** Assembly language (low-level, symbolic representation of machine code). * **Output:** Machine code (object file). * **Process:** 1. **Source Code:** Program written in assembly language. 2. **Assembly Process:** The assembler reads the assembly code. 3. **Executable Code:** Generates machine code that can be directly executed by the CPU. * **Diagram:** ``` Source Code --> Assembly Process --> Executable Code ``` * **Key Point:** Assembly language is a human-readable form of machine instructions, making assemblers simpler than compilers. #### Compiler * **Definition:** A compiler is a program that reads a program written in one language (source language) and translates it into an equivalent program in another language (target language). * **Input:** High-level programming language (e.g., C, C++, Java source code). * **Output:** Target language, often machine code or intermediate code. * **Process:** 1. **Source Program:** The entire program written in source language. 2. **Compilation:** The compiler analyzes and translates the entire program. 3. **Target Program:** An equivalent program in the target language (e.g., executable machine code). 4. **Error Messages:** Reports errors found during the translation process. * **Diagram:** ```mermaid graph TD A[source Program] --> C(Compiler) C --> B[target program] C --> D[error messages] ``` * **Key Point:** Translates the *entire* program before execution. #### Interpreter * **Definition:** An interpreter translates high-level source code into executable code, but unlike a compiler, it translates and executes one line at a time. * **Input:** High-level programming language. * **Output:** Immediate execution of instructions; generally, no separate object code is produced. * **Process:** 1. **Input:** Source code statement. 2. **Translation & Execution:** The interpreter translates the statement and immediately executes it. 3. **Next Instruction:** Moves to the next statement. * **Diagram:** ```mermaid graph TD A[Source Code] --> B(Interpreter) B --> C[Get next instruction] C --> D[Executable Code] ``` * **Key Point:** Translates and executes *line by line*. #### Hybrid Compiler / JIT Compiler * **Definition:** A hybrid compiler combines features of both compilers and interpreters. It translates human-readable source code into an intermediate bytecode, which is then interpreted or compiled just-in-time (JIT) for execution. * **Example:** Java is a prime example. 1. **Java Source Program:** Written in Java. 2. **Compilation to Bytecode:** A Java compiler translates the source code into an intermediate form called **byte codes**. 3. **Interpretation/JIT Compilation:** The byte codes are then interpreted by a **Virtual Machine (VM)**. Just-in-Time (JIT) compilers within the VM can translate frequently used bytecode sections into machine language immediately before execution for faster processing. * **Benefit:** Byte codes compiled on one machine can be interpreted on another, enabling platform independence ("write once, run anywhere"). * **Diagram:** ```mermaid graph TD A[Source program] --> B(Translator) B --> C[Intermediate Program] C --> D(Virtual Machine) D --> E[Output] D --> F[Input] ``` * **Key Point:** Achieves portability (intermediate code) and performance (JIT compilation). ### 3. Compilation and Interpretation * **Compilation:** The conceptual process of translating source code into CPU-executable binary target code *before* execution. The entire program is translated first. * **Interpretation:** The conceptual process of translating high-level source code into executable code *during* execution, one statement at a time. ### 4. Difference Between Compiler and Interpreter | Feature | Compiler | Interpreter | | :------------------ | :------------------------------------------------------ | :------------------------------------------------------ | | **Input** | Takes the entire program as input. | Takes one statement at a time as input. | | **Process** | Works on the complete program at once. | Works line by line. | | **Output** | Generates intermediate code (object code/machine code). | Does not generate intermediate object code. | | **Execution** | Executes compiled code directly. | Executes statements directly after translation. | | **Speed** | Faster in execution (once compiled). | Slower in execution. | | **Error Reporting** | Reports all errors after checking the entire program. | Reports errors line by line, stopping upon first error. | | **Debugging** | Debugging is harder. | Debugging is easier. | | **Presence** | Compiler not needed during execution. | Interpreter must be present during execution. | | **Memory** | Compiled programs take more memory (entire object code). | Interpreted programs are more memory efficient. | | **Recompilation** | Compile once, run any time; no recompilation needed. | Re-interpreted every time; recompilation on error. | | **Examples** | C, C++, COBOL | BASIC, Python, Ruby, Perl, MATLAB, Lisp | ### 5. History of Compiler * **Before 1952:** Most programs were written in assembly language. * **1952:** Grace Hopper wrote the first compiler for the A-0 programming language. * **1957-1958:** John Backus wrote the first Fortran compiler. Optimization of code was an integral component of this compiler. * **Evolution:** Compilers enabled the development and widespread use of high-level programming languages, abstracting away machine-specific details. ### 6. Applications of Compiler Technology Compiler technology is vital beyond just translating programming languages. * **Implementation of High-Level Programming Languages:** The primary application, enabling programmers to write code in more abstract and human-friendly languages. * **Optimizations for Computer Architectures:** Compilers harness the potential performance of modern architectures (e.g., parallelism, memory hierarchies) by optimizing code for these features. * **Design of a New Computer Architecture:** Compiler design influences hardware design, as new architectures often require new compiler techniques to fully utilize their capabilities. * **Program Translations (Binary Translation, Hardware Synthesis, Database Query Interpreters, Compiled Simulation):** Compilers are used for various forms of program translation, such as converting code between different machine instruction sets, generating hardware descriptions from high-level specifications, interpreting database queries, and simulating complex systems. * **Software Productivity Tools (Structure editors, type checking, bound checking, memory management tools):** Many development tools incorporate compiler-like analysis to enhance programmer productivity and code quality. ### 7. Language Processors A language processor is a program that handles programs written in a programming language (source language). A key component is a language translator, which converts the program into machine code, assembly language, or another target language. An integrated software development environment (IDE) typically includes various language processors: #### Preprocessor * **Definition:** System software that processes the source program *before* it is fed into the compiler. * **Functions:** 1. **Macro Processing:** Allows users to define macros (shorthand for longer constructs), which are expanded by the preprocessor. 2. **File Inclusion:** Inserts content of header files into the program text (e.g., `#include ` in C). 3. **Rational Preprocessors:** Provide built-in macros for common constructs like `while` or `if` statements. 4. **Language Extensions:** Add features similar to built-in macros (e.g., the Equel database query language embedded in C). #### Compiler * **Definition:** (As described above) Translates an entire high-level language program into a target language (e.g., machine code). * **Key Role:** The core translation engine. #### Assembler * **Definition:** (As described above) Translates assembly language into machine code. * **Output:** Object file (contains machine instructions and data). #### Linker * **Definition:** A computer program that combines various object files (generated by compilers/assemblers) into a single executable file. * **Tasks:** * Searches and locates referenced modules/routines (e.g., library functions) in a program. * Determines memory locations where these codes will be loaded. * Resolves absolute references for program instructions. #### Loader * **Definition:** A part of the operating system responsible for loading executable files into memory and initiating their execution. * **Tasks:** * Calculates the size of program instructions and data. * Creates memory space for the program. * Initializes various registers to begin execution. ### 8. Phases of Compiler A compiler operates in several phases, each transforming the source program from one representation to another, progressively moving closer to machine code. * **Diagram:** ```mermaid graph TD A[High Level Language] --> B(Lexical Analyzer) B --> C(Syntax Analyzer) C --> D(Semantic Analyzer) D --> E(Intermediate Code Generator) E --> F(Code Optimiser) F --> G(Target Code Generation) G --> H[Assembly Code] B --> I[Symbol Table] C --> I D --> I E --> I F --> I G --> I B --> J[Error Handling] C --> J D --> J E --> J F --> J G --> J ``` * **Key Concept:** The compiler is often described by the **Analysis-Synthesis Model**. ### 9. Analysis and Synthesis Model Compilation is divided into two main parts: #### Analysis Phase (Front End) * **Goal:** Breaks up the source program into constituent pieces and creates an intermediate representation. It determines the operations implied by the source program and records them in a hierarchical structure (e.g., a syntax tree). * **Phases Included:** * **Lexical Analysis:** Scans the source code to identify tokens. * **Syntax Analysis:** Groups tokens into grammatical phrases, creating a parse tree. * **Semantic Analysis:** Checks for meaning-related errors and gathers type information. * **Dependencies:** Primarily depends on the source language, largely independent of the target machine. * **Other Components:** Creation of the symbol table, part of code optimization, and error handling are associated with the front end. #### Synthesis Phase (Back End) * **Goal:** Constructs the desired target program from the intermediate representation. * **Phases Included:** * **Intermediate Code Generation:** Creates a machine-independent intermediate representation. * **Code Optimization:** Improves the intermediate code for better performance. * **Code Generation:** Produces the final target machine code. * **Dependencies:** Depends on the target machine, but is independent of the source language (it works on the intermediate language). * **Other Components:** Necessary symbol table and error handling operations. #### Categories of Compiler Design (Based on Grouping of Phases) 1. **Single Compiler for Different Machines:** By keeping the front end common and redoing only the back end, a single compiler can target multiple machines for the same source language. 2. **Several Compilers for One Machine:** By using a common back end for different front ends, multiple source languages can target the same machine. ### 10. Lexical Analysis * **Phase Name:** Lexical Analysis (or Scanning) * **Role:** The first phase of a compiler. It reads the raw source code characters and groups them into a stream of **tokens**. * **Input:** Stream of characters from the source program. * **Output:** Stream of tokens. #### Key Concepts * **Token:** A logically cohesive sequence of characters representing an atomic unit in the language. Examples: identifier, keyword, operator, constant, punctuation symbol. * **Example:** In `position := initial + rate * 60`, `position`, `:=`, `initial`, `+`, `rate`, `*`, `60` are tokens. * **Lexeme:** The actual sequence of characters in the source program that matches a pattern for a token. * **Example:** For the token `identifier`, `position`, `initial`, and `rate` are lexemes. For the token `number`, `60` is a lexeme. * **Pattern:** A rule that describes the set of lexemes that can represent a specific token. * **Example:** The pattern for an `identifier` might be "letter followed by letters and digits". The pattern for `number` might be "any numeric constant". #### Example: `position := initial + rate * 60` | Token | Sample Lexemes | Informal Description of Pattern | | :--------- | :-------------------------- | :------------------------------------------ | | `Const` | `Const` | `const` | | `If` | `If` | `if` | | `Relation` | ` `, `>`, `>=` | ` ` or `>` or `>=` | | `Id` | `pi`, `count`, `A2` | Letter followed by letters and digits | | `Num` | `3.1416`, `0`, `6.02E23` | Any numeric constant | | `Literal` | `"garbage collection"` | Any characters between `"` and `"` | #### Attributes for Tokens * When multiple patterns match a lexeme (e.g., ` ` * ` ` * ` ` * ` ` * ` ` * ` ` * ` ` * For some tokens (like `assign_op`), no attribute value might be needed. For others, the character string forms the value stored in the symbol table. #### Secondary Tasks of Lexical Analyzer 1. **Stripping Comments and Whitespace:** Removes non-essential characters (blanks, tabs, newlines, comments) from the source program. 2. **Error Reporting Correlation:** Keeps track of line numbers to associate error messages from the compiler with the correct location in the source program. #### Issues in Lexical Analysis (Reasons for Separation from Parsing) 1. **Simpler Design:** Separating lexical analysis simplifies both the lexical analyzer and the parser. Parsing with comments and whitespace would be more complex. 2. **Improved Efficiency:** A dedicated lexical analyzer can be highly optimized for reading characters and partitioning them into tokens, often using specialized buffering techniques. This improves overall compiler speed. 3. **Enhanced Portability:** Input alphabets and device-specific anomalies can be isolated to the lexical analyzer, making the rest of the compiler more portable. ### 11. Syntax Analysis * **Phase Name:** Syntax Analysis (or Parsing) * **Role:** The second phase of a compiler. It takes the stream of tokens produced by lexical analysis and groups them into grammatical phrases, verifying the syntactic structure of the program. * **Input:** Stream of tokens. * **Output:** A parse tree or syntax tree (a hierarchical representation of the program's structure). * **Process:** Checks if the sequence of tokens conforms to the grammar rules (syntax) of the programming language. * **Example: `position := initial + rate * 60`** * Lexical analyzer output: `Id1 := Id2 + Id3 * 60` * Syntax analyzer constructs a parse tree (or syntax tree) based on grammar rules. * **Syntax Tree Representation:** ``` := / \ position + / \ initial * / \ rate 60 ``` * **Key Point:** Focuses on the *grammatical structure* of the program. ### 12. Semantic Analysis * **Phase Name:** Semantic Analysis * **Role:** The third phase of a compiler. It checks the parse tree for semantic consistency with the language definition. It ensures that the program's components logically make sense and adhere to type rules. * **Input:** Parse tree (or syntax tree) from the syntax analyzer. * **Output:** An annotated syntax tree (with type information and other semantic checks). * **Checks Performed:** * **Type Checking:** Ensures operations are applied to compatible data types (e.g., cannot add a string to an integer). * **Declaration Checking:** Verifies that identifiers are declared before use. * **Scope Checking:** Ensures identifiers are used within their correct scope. * **Argument Matching:** Checks if function calls have the correct number and types of arguments. * **Example: `position := initial + rate * 60`** * If `rate` is an integer and `60` is an integer, the multiplication `rate * 60` results in an integer. * If `initial` is a real number, then `initial + (integer result)` would require a type conversion. The semantic analyzer might insert an `inttoreal` conversion. * **Annotated Syntax Tree:** ``` := / \ position + / \ initial * / \ rate inttoreal(60) ``` * **Key Point:** Focuses on the *meaning and logical consistency* of the program. ### 13. Intermediate Code Generation * **Phase Name:** Intermediate Code Generation * **Role:** The fourth phase of a compiler. It generates an intermediate representation of the source code for an abstract machine, simplifying the translation to target machine code. * **Input:** Annotated syntax tree from the semantic analyzer. * **Output:** Intermediate code (e.g., three-address code). * **Properties of Intermediate Code:** * **Easy to Produce:** Relatively straightforward to generate from the syntax tree. * **Easy to Translate:** Simple to convert into target machine code. * **Form:** Often uses **three-address code**, where each instruction has at most three operands (e.g., `result = operand1 op operand2`). Each memory location can act like a register. * **Example: `position := initial + rate * 60`** (after semantic analysis inserting `inttoreal`) ``` temp1 := inttoreal(60) temp2 := id3 * temp1 (where id3 refers to 'rate') temp3 := id2 + temp2 (where id2 refers to 'initial') id1 := temp3 (where id1 refers to 'position') ``` * **Key Point:** Bridges the gap between high-level and machine-level languages, making optimization and code generation easier. ### 14. Code Optimization * **Phase Name:** Code Optimization * **Role:** The fifth phase of a compiler. It transforms the intermediate code to improve its efficiency (e.g., faster execution, less memory usage) without changing its meaning. * **Input:** Intermediate code. * **Output:** Optimized intermediate code. * **Goal:** Remove unnecessary code, rearrange statements, and make better use of CPU and memory resources. * **Example: `position := initial + rate * 60`** (optimized from the intermediate code) * Original intermediate code: ``` temp1 := inttoreal(60) temp2 := id3 * temp1 temp3 := id2 + temp2 id1 := temp3 ``` * Optimized intermediate code (e.g., by constant folding and strength reduction): ``` temp1 := id3 * 60.0 (if 60 can be treated as float directly) id1 := id2 + temp1 ``` * **"Optimizing Compilers":** Compilers that perform extensive code optimization. * **Key Point:** Improves the performance and resource usage of the generated code. ### 15. Code Generation * **Phase Name:** Code Generation * **Role:** The final phase of a compiler. It translates the optimized intermediate code into the target machine code (e.g., assembly code or relocatable machine code). * **Input:** Optimized intermediate code. * **Output:** Target machine code. * **Tasks:** * **Register Allocation:** Variables are assigned to CPU registers. * **Instruction Selection:** Intermediate operations are mapped to specific machine instructions. * **Instruction Ordering:** Instructions are arranged for optimal performance. * **Example: `position := initial + rate * 60`** (after optimization, generating assembly code) * Target code for `temp1 := id3 * 60.0`: ```assembly MOVF id3, R2 ; Move value of id3 (rate) to Register R2 MULF #60.0, R2 ; Multiply R2 by constant 60.0, store in R2 ``` * Target code for `id1 := id2 + temp1`: ```assembly MOVF id2, R1 ; Move value of id2 (initial) to Register R1 ADDF R2, R1 ; Add R2 (result of multiplication) to R1, store in R1 MOVF R1, id1 ; Move final result from R1 to memory location id1 (position) ``` * **Notation:** * `MOVF`: Move Floating-point value. * `MULF`: Multiply Floating-point value. * `ADDF`: Add Floating-point value. * `R1`, `R2`: Registers. * `#`: Indicates a constant. * **Key Point:** Produces the final executable form of the program specific to the target hardware. ### 16. Symbol Table Manager * **Role:** An auxiliary component that interacts with all phases of the compiler. It is a data structure used to store information about identifiers encountered in the source program. * **Data Structure:** Contains a record for each identifier, with fields for its attributes. * **Attributes Stored:** * **Storage Allocation:** Memory location assigned to the identifier. * **Type:** Data type (e.g., integer, float, string). * **Scope:** The region of the program where the identifier is valid. * **Procedure Information (for function names):** Number and types of arguments, method of passing arguments (e.g., by reference, by value), and return type. * **Interaction with Phases:** * **Lexical Analysis:** Enters new identifiers into the table. * **Syntax Analysis:** Uses type and scope information for grammar checks. * **Semantic Analysis:** Crucial for type checking and ensuring correct usage of identifiers. * **Intermediate Code Generation & Code Generation:** Retrieves information about identifiers for generating appropriate code. * **Key Point:** Essential for managing and retrieving information about program entities throughout the compilation process. ### 17. Error Handler * **Role:** An auxiliary component that interacts with all phases of the compiler to detect and report errors. It also attempts to recover from errors to allow compilation to proceed and find further errors. * **Goal:** Detect as many errors as possible, provide helpful error messages, and enable recovery to continue scanning the program. #### Errors Encountered in Different Phases 1. **Errors during Lexical Analysis:** * Occur if characters in the input do not form any valid token of the language. * **Examples:** * **Strange characters:** Characters not allowed by the language's alphabet (unless in a quoted string). * **Long quoted strings:** Missing closing quote, especially if strings cannot span multiple lines. * **Invalid numbers:** Malformed numeric literals (e.g., `123.45.67`). * **Recovery Actions:** Deleting extraneous characters, inserting missing characters, replacing incorrect characters, transposing adjacent characters. 2. **Errors during Syntax Analysis:** * Occur if the token stream violates the structural rules (grammar) of the language. * **Examples:** * Misplaced semicolons. * Unbalanced parentheses or braces (e.g., `(` without a matching `)`). * `case` statement without an enclosing `switch`. * **Recovery Strategy:** Often, the parser can suggest acceptable tokens at the point of error. * **Example Error:** `Source: A + * B` -> `Error: Found '*', expect one of: Identifier, Constant, '('` 3. **Errors during Semantic Analysis:** * Occur if the program has correct syntactic structure but lacks meaning or violates type rules. * **Examples:** * **Type errors:** Applying an operator to incompatible types (e.g., adding an integer to a procedure name). `Ex: Adding two identifiers, one of which is the name of the array, and the other the name of a procedure.` * Undeclared variables. * Assignment of incompatible value types. * **Key Point:** Semantic errors are harder to detect and recover from than lexical or syntax errors. #### Error Recovery Strategies in Parser 1. **Panic Mode:** The parser discards input symbols one by one until a synchronizing token (e.g., semicolon, brace) is found. This is simple but may skip many errors. 2. **Statement Mode:** The parser attempts to make local corrections (e.g., inserting a missing semicolon, replacing a comma with a semicolon) to allow parsing to continue. Care must be taken to avoid infinite loops. 3. **Error Productions:** The grammar is augmented with productions that generate known erroneous constructs. When such a production is used, an error is reported. 4. **Global Correction:** The parser attempts to find a closest error-free program to the erroneous input. This is complex and rarely implemented in practice due to high time and space costs. ### 18. Structure of Compiler The structure of a compiler follows the analysis-synthesis model, divided into a front end and a back end. * **Front End:** * **Phases:** Lexical Analysis, Syntactic Analysis, Semantic Analysis, Intermediate Code Generation (partially), Symbol Table Creation, and associated Error Handling. * **Dependency:** Primarily source-language dependent, largely target-machine independent. * **Output:** Intermediate Representation. * **Diagram:** ```mermaid graph LR A[character stream] --> B(Lexical Analyzer) B --> C[token stream] C --> D(Syntax Directed Translator) D --> E[intermediate representation] ``` * **Back End:** * **Phases:** Code Optimization, Code Generation, and associated Symbol Table/Error Handling operations. * **Dependency:** Primarily target-machine dependent, source-language independent (operates on intermediate language). * **Output:** Target Machine Code (e.g., assembly code). ### 19. Role of Lexical Analyzer The lexical analyzer (scanner) is the first phase of the compiler. * **Primary Role:** * Read input characters from the source program. * Group them into meaningful sequences called **lexemes**. * Produce as output a stream of **tokens**, where each token consists of a token name and an optional attribute value. * **Interaction with Parser:** The lexical analyzer often acts as a subroutine or co-routine of the parser. When the parser needs the next token, it calls the lexical analyzer (e.g., `get next token` command). The lexical analyzer then reads characters until it identifies the next token. * **Secondary Roles:** * **Stripping comments and whitespace:** Removes non-essential parts of the source code. * **Correlating error messages:** Keeps track of line numbers to provide accurate error locations. * **Buffering input:** Uses techniques like input buffering to efficiently read characters and look ahead. ### 20. Input Buffering Input buffering is a technique used by the lexical analyzer to efficiently read the source program characters and identify tokens. It's designed to reduce the overhead of character processing. #### Buffer Pairs * **Concept:** The input buffer is divided into two halves, each capable of holding `N` characters (e.g., 1024 or 4096, corresponding to disk block size). * **Mechanism:** 1. **Reading Input:** `N` characters are read into one half of the buffer using a single system read command, reducing I/O operations. 2. **Pointers:** Two pointers are maintained: * `lexeme_beginning`: Marks the start of the current lexeme. * `forward`: Scans ahead to find the end of the current lexeme. 3. **Buffer Reloading:** * When `forward` reaches the end of the first half, the second half is filled with new input. * When `forward` reaches the end of the second half, the first half is filled with new input, and `forward` wraps around to the beginning of the first half. 4. **Token Identification:** Once a lexeme is identified, both `lexeme_beginning` and `forward` are advanced to the character immediately past the lexeme. * **Problem:** Checking for the end of a buffer and reloading it requires two tests for every character movement, which can be inefficient. #### Sentinels * **Concept:** A special character, `eof` (end-of-file marker), is placed at the end of each buffer half. This reduces the number of tests required for buffer boundary checks. * **Mechanism:** * Instead of checking if `forward` has reached the end of the buffer *and* if it's `eof`, we only check if `forward` points to `eof`. * If `forward` encounters `eof`: * If it's at the end of the first half, reload the second half and place `eof` at its end. * If it's at the end of the second half, reload the first half and place `eof` at its end (wrapping `forward`). * If it's the actual end of the input file (the `eof` character is the real end-of-file), terminate lexical analysis. * **Benefit:** Reduces the number of checks per character from two to one, improving efficiency. #### Code to Advance Forward Pointer (with Sentinels) ``` forward := forward + 1; if forward == eof then begin if forward at end of first half then begin reload second half; forward := forward + 1 end else if forward at end of second half then begin reload first half; move forward to beginning of first half end else /* eof within a buffer signifying end of input */ terminate lexical analysis end ``` ### 21. Specification of Tokens Tokens are specified using mathematical concepts like strings, languages, and regular expressions. #### Strings and Languages * **Alphabet ($\Sigma$):** Any finite set of symbols (e.g., letters, digits, ASCII characters). * **String:** A finite sequence of symbols drawn from an alphabet. * **Example:** `101011` is a string over `{0, 1}`. `$\epsilon$` (epsilon) is the empty string. * **Length of a String:** The number of occurrences of symbols in the string. * **Example:** `|101| = 3`. * **Language:** Any set of strings over some fixed alphabet $\Sigma$. #### Common String Terms | Term | Definition | Example | | :------------------- | :-------------------------------------------------------------------------------- | :------------------------------------ | | **Prefix of s** | String obtained by removing zero or more trailing symbols of `s`. | `ban` is a prefix of `banana`. | | **Suffix of s** | String obtained by deleting zero or more leading symbols of `s`. | `nana` is a suffix of `banana`. | | **Substring of s** | String obtained by deleting a prefix and a suffix from `s`. | `nan` is a substring of `banana`. | | **Proper prefix/suffix/substring** | A nonempty `x` that is a prefix/suffix/substring of `s`, where `s ≠ x`. | `ban` is a proper prefix of `banana`. | | **Subsequence of s** | String formed by deleting zero or more non-necessarily contiguous symbols from `s`. | `baaa` is a subsequence of `banana`. | #### Operations on Languages | Operation | Notation | Definition | | :-------------------------------------------- | :--------------- | :------------------------------------------------ | | **Union of L and M** | `L U M` | `{ s | s is in L or s is in M }` | | **Concatenation of L and M** | `LM` | `{ st | s is in L and t is in M }` | | **Kleene Closure of L** (zero or more) | `L*` | `$\bigcup_{i=0}^{\infty} L^i$` (L concatenated with itself i times) | | **Positive Closure of L** (one or more) | `L+` | `$\bigcup_{i=1}^{\infty} L^i$` (L concatenated with itself i times) | ### 22. Recognition of Tokens Tokens are recognized based on their grammatical specification, which often involves regular expressions. #### Example Grammar Fragment ``` stmt -> if expr then stmt | if expr then stmt else stmt | epsilon expr -> term relop term | term term -> id | num ``` #### Regular Definitions for Terminals * `if`: `if` * `then`: `then` * `else`: `else` * `relop`: ` | > | >=` * `id`: `letter (letter | digit)*` * `num`: `digit+ (. digit+)? (E (+ | -)? digit+)?` * `letter`: `A | B | ... | Z | a | b | ... | z` * `digit`: `0 | 1 | ... | 9` #### Regular Definition for White Space (`ws`) * `delim`: `blank | tab | newline` * `ws`: `delim+` (one or more delimiters) #### Token and Attribute Value Output The lexical analyzer's goal is to identify the lexeme for the next token and produce a pair: `(token, attribute_value)`. | Regular Expression | Token | Attribute - Value | | :----------------- | :--------- | :---------------- | | `ws` | | - | | `if` | `If` | - | | `then` | `Then` | - | | `else` | `Else` | - | | `id` | `Id` | Pointer to table entry | | `num` | `Num` | Pointer to table entry | | ` ` | `Relop` | `NE` | | `>` | `Relop` | `GT` | | `>=` | `Relop` | `GE` | ### 23. Lex Tool * **Definition:** Lex is a tool that generates lexical analyzers (scanners). It's commonly used with `yacc` (Yet Another Compiler Compiler), which generates parsers. * **Process of Creating a Lexical Analyzer with Lex:** 1. **Specification:** Write a lexical analyzer specification in the Lex language, typically in a file named `lex.l`. 2. **Lex Compilation:** Run `lex.l` through the Lex compiler (the `lex` command). This generates a C program file, usually `lex.yy.c`. 3. **C Compilation:** Compile `lex.yy.c` using a C compiler (e.g., `gcc`). This produces an executable object program, typically `a.out`. 4. **Execution:** The `a.out` program is the lexical analyzer, which takes an input stream and transforms it into a sequence of tokens. * **Diagram:** ```mermaid graph LR A[lex.l] --> B(Lex compiler) B --> C[lex.yy.c] C --> D(C compiler) D --> E[a.out] F[input stream] --> E E --> G[sequence of tokens] ``` #### Lex Specification Structure A Lex program consists of three sections, separated by `%%`: ``` { definitions } %% { rules } %% { user subroutines } ``` 1. **Definitions Section:** * Includes declarations of variables, constants, and regular definitions (e.g., `DIGIT [0-9]`, `ID [a-zA-Z][a-zA-Z0-9]*`). * **Example:** ```c %{ int v=0,c=0; %} ``` 2. **Rules Section:** * Contains statements of the form `pattern {action}`. * `pattern`: A regular expression that defines the lexeme. * `action`: C code executed when the pattern matches a lexeme. * **Example:** ``` [aeiouAEIOU] v++; // Count vowels [a-zA-Z] c++; // Count consonants ``` 3. **User Subroutines Section:** * Contains auxiliary C functions needed by the actions in the rules section (e.g., `main()` function, helper utilities). * These can be compiled separately and linked with the lexical analyzer. * **Example:** ```c %% main() { printf("ENTER INPUT : \n"); yylex(); // Lex-generated function to perform lexical analysis printf("VOWELS=%d\nCONSONANTS=%d\n",v,c); } ``` ### 24. Finite Automata * **Definition:** A finite automaton (FA) is a mathematical model for a recognizer. It takes a string as input and determines if the string belongs to a language (i.e., if it is a "sentence" of the language). * **Use in Compilers:** Regular expressions (used to specify tokens) are compiled into FAs, which then serve as the lexical analyzer. #### Components of a Finite Automaton A finite automaton `M` is formally defined by a 5-tuple `(Q, Σ, qo, F, δ)`: * `Q`: A finite set of states. * `Σ`: A finite set of input symbols (the alphabet). * `qo`: The initial (start) state, `qo ∈ Q`. * `F`: A set of final (accepting) states, `F ⊆ Q`. * `δ`: The transition function, which maps `(state, input_symbol)` to `state`. #### Types of Finite Automata 1. **Non-deterministic Finite Automata (NFA):** * Allows more than one transition for a given `(state, input_symbol)` pair. * Allows transitions on an empty string (`ε`-transitions). * A string is accepted if there is at least one path from the start state to an accepting state. 2. **Deterministic Finite Automata (DFA):** * For each `(state, input_symbol)` pair, there is exactly one transition to a next state. * No `ε`-transitions are allowed. * A string is accepted if there is exactly one path from the start state to an accepting state. #### Transition Diagrams * **Graphical Representation:** FAs are often represented using transition diagrams. * **States:** Drawn as circles. * **Start State:** Indicated by an arrow pointing to it. * **Final States:** Indicated by double circles. * **Transitions:** Arrows labeled with input symbols showing transitions between states. * `*` (asterisk) on a state: Indicates input retraction (lexeme ends before this character). #### Example Transition Diagram for Relational Operators ```mermaid graph TD start --> s0 s0 -- "=" --> s1(return (relop, EQ)) s0 -- " s2 s0 -- ">" --> s3 s2 -- "=" --> s4(return (relop, LE)) s2 -- ">" --> s5(return (relop, NE)) s2 -- other --> s6*(return (relop, LT)) s3 -- "=" --> s7(return (relop, GE)) s3 -- other --> s8*(return (relop, GT)) ``` *This diagram is a simplified representation of the original image, which is more complex and shows state numbers, but captures the essence.* ### 25. NFA and DFA #### NFA (Non-deterministic Finite Automata) * **Characteristics:** * A state can have multiple transitions for the same input symbol. * A state can have transitions on the empty string (`ε`). * A string is accepted if *any* sequence of choices leads to a final state. * **Power:** NFAs and DFAs have equivalent power; any language recognized by an NFA can also be recognized by a DFA. * **Construction:** Easier to construct from regular expressions. #### DFA (Deterministic Finite Automata) * **Characteristics:** * For every state and every input symbol, there is exactly one transition. * No `ε`-transitions. * A string is accepted if the *unique* path leads to a final state. * **Execution:** Easier and faster to simulate directly for recognizing strings. * **Construction:** Can be more complex to construct directly, often derived from NFAs. #### Epsilon-Closure ($\epsilon$-closure) * **Definition:** The $\epsilon$-closure of a state `s` is the set of all states reachable from `s` by following zero or more `ε`-transitions. * **Purpose:** Crucial for converting an NFA to a DFA using the subset construction algorithm. * **Example 1:** ```mermaid graph LR q0 -- "$\epsilon$" --> q1 q1 -- "$\epsilon$" --> q2 q0 -- "0" --> q1 q1 -- "1" --> q2 ``` * $\epsilon$-closure(q0) = {q0, q1, q2} * $\epsilon$-closure(q1) = {q1, q2} * $\epsilon$-closure(q2) = {q2} * **Example 2:** ```mermaid graph LR s1 -- "$\epsilon$" --> s2 s1 -- "$\epsilon$" --> s3 s2 -- "a" --> s4 s3 -- "b" --> s5 s3 -- "$\epsilon$" --> s6 s5 -- "$\epsilon$" --> s7 ``` * $\epsilon$-closure(1) = {1, 2, 3, 6} * $\epsilon$-closure(2) = {2, 3, 6} * $\epsilon$-closure(3) = {3, 6} * $\epsilon$-closure(4) = {4} * $\epsilon$-closure(5) = {5, 7} * $\epsilon$-closure(6) = {6} * $\epsilon$-closure(7) = {7} ### 26. Regular Expressions * **Definition:** A regular expression (RE) is a powerful notation for specifying patterns of strings, which are used to define tokens in lexical analysis. Each RE `r` denotes a language `L(r)`. * **Basis (Atomic Regular Expressions):** * `$\epsilon$`: Is a regular expression denoting the language `{ $\epsilon$ }` (the empty string). * `a`: If `a` is a symbol in `$\Sigma$`, then `a` is a regular expression denoting the language `{ a }`. * **Induction (Combining Regular Expressions):** * If `r` and `s` are regular expressions denoting `L(r)` and `L(s)` respectively, then: * **Union:** `(r) | (s)` is a regular expression denoting `L(r) U L(s)`. * **Concatenation:** `(r)(s)` is a regular expression denoting `L(r)L(s)`. * **Kleene Closure:** `(r)*` is a regular expression denoting `(L(r))*` (zero or more concatenations). * **Positive Closure:** `(r)+` is a regular expression denoting `(L(r))+` (one or more concatenations). * **Language denoted by RE:** A language denoted by a regular expression is called a **regular set**. #### Precedence and Associativity of Operators 1. **`*` (Kleene Closure):** Highest precedence, left associative. 2. **Concatenation:** Second highest precedence, left associative. 3. **`|` (Union):** Lowest precedence, left associative. * **Example:** `a | b*c` is equivalent to `a | ( (b)* c )`. #### Examples of Regular Expressions Let `$\Sigma$ = {a, b}`. * `a | b`: Denotes `{ a, b }`. * `(a | b)(a | b)`: Denotes `{ aa, ab, ba, bb }` (all strings of length two). Equivalent to `aa | ab | ba | bb`. * `a*`: Denotes `{ $\epsilon$, a, aa, aaa, ... }` (zero or more `a`'s). * `(a | b)*`: Denotes the set of all strings of `a`'s and `b`'s, including the empty string. * `a | a*b`: Denotes `{ a, $\epsilon$b, ab, aab, aaab, ... }` (the string `a` or zero or more `a`'s followed by a `b`). #### Regular Definitions * **Definition:** Assigning names to regular expressions. A regular definition is a sequence of definitions `d1 -> r1, d2 -> r2, ..., dn -> rn`, where each `di` is a distinct name and `ri` is a regular expression over `$\Sigma$` and previously defined names. * **Example: Regular Definition for Identifiers** ``` letter -> A | B | ... | Z | a | b | ... | z digit -> 0 | 1 | ... | 9 id -> letter (letter | digit)* ``` * **Example: Regular Definition for Numbers** ``` digit -> 0 | 1 | ... | 9 digits -> digit digit* optional_fraction -> . digits | $\epsilon$ optional_exponent -> ( E ( + | - | $\epsilon$ ) digits ) | $\epsilon$ num -> digits optional_fraction optional_exponent ``` #### Notational Shorthands 1. **One or more instances (`+`):** `r+` means "one or more occurrences of `r`". Equivalent to `rr*`. * **Example:** `digit+` is `digit digit*`. 2. **Zero or one instance (`?`):** `r?` means "zero or one occurrence of `r`". Equivalent to `r | $\epsilon$`. * **Example:** `(.digits)?` means `(.digits | $\epsilon$)`. 3. **Character Classes (`[]`):** `[a-z]` is a shorthand for `a | b | ... | z`. * **Example:** Identifiers using character classes: `[A-Za-z][A-Za-z0-9]*`. ### 27. Conversion of Regular Expression to Automata Regular expressions can be converted into finite automata, typically NFA first, then DFA. #### Methods 1. **Thompson's Subset Construction (RE to NFA to DFA):** * Convert the regular expression into an NFA. * Convert the NFA into an equivalent DFA using the subset construction algorithm. 2. **Direct Method (RE to DFA):** * Convert the regular expression directly into a DFA without an intermediate NFA step. This uses concepts like `nullable`, `firstpos`, `lastpos`, and `followpos`. #### Rules for Converting Regular Expression to NFA (Thompson's Construction) * **Base Cases:** * For `$\epsilon$`: ```mermaid graph LR start -- "$\epsilon$" --> final ``` * For `a` (single symbol): ```mermaid graph LR start -- "a" --> final ``` * **Union (`r1 | r2`):** ```mermaid graph TD start --> e1_r1(epsilon) start --> e2_r2(epsilon) e1_r1 --> r1_start e2_r2 --> r2_start r1_final --> e3_final(epsilon) r2_final --> e4_final(epsilon) e3_final --> final_state e4_final --> final_state ``` * **Concatenation (`r1 r2`):** ```mermaid graph TD start --> r1_start r1_final --> epsilon_transition(epsilon) epsilon_transition --> r2_start r2_final --> final_state ``` * **Kleene Closure (`r1*`):** ```mermaid graph TD start --> epsilon1(epsilon) epsilon1 --> r1_start r1_final --> epsilon2(epsilon) epsilon2 --> r1_start epsilon2 --> final_state start --> final_state ``` #### Subset Construction Algorithm (NFA to DFA) 1. **Compute $\epsilon$-closure of the NFA's start state:** This becomes the DFA's start state. 2. **Iteratively compute new DFA states:** For each existing DFA state `D` and each input symbol `a` in `$\Sigma$`: * Calculate `D'` = `$\epsilon$-closure(move(D, a))`. * If `D'` is a new state, add it to the set of DFA states. * Add a transition `D -- a --> D'`. 3. **Repeat:** Continue until no new DFA states are generated. 4. **Final States:** A DFA state is a final state if its $\epsilon$-closure contains at least one final state of the original NFA. * **`move(D, a)`:** The set of all NFA states reachable from any state in `D` by a single transition on input symbol `a`. * **`Dtran[D, a]`:** The DFA's transition function, defined as `$\epsilon$-closure(move(D, a))`. #### Example: NFA to DFA for `(a|b)*abb` (The provided image shows the final DFA, not the step-by-step construction. Here's a conceptual outline of the subset construction steps based on the image's states) 1. **Start State:** Let NFA start state be `S0`. Compute `$\epsilon$-closure(S0)`. Let this be DFA state `A`. 2. **Transitions from A:** * On `a`: Compute `$\epsilon$-closure(move(A, a))`. Let this be DFA state `B`. * On `b`: Compute `$\epsilon$-closure(move(A, b))`. Let this be DFA state `C`. 3. **Transitions from B:** * On `a`: Compute `$\epsilon$-closure(move(B, a))`. This might lead back to `B` or `A`. * On `b`: Compute `$\epsilon$-closure(move(B, b))`. Let this be DFA state `D`. 4. **Transitions from C:** * ...and so on, until all reachable DFA states and their transitions are found. 5. **Final States:** Any DFA state containing an NFA final state (e.g., the state corresponding to the NFA's final state for `abb`) becomes a DFA final state. #### Direct Method (RE to DFA) - Key Concepts This method builds a syntax tree for the regular expression and then uses properties of its nodes to construct the DFA. * **Augmented Regular Expression:** Append a unique end-marker `#` to the RE (e.g., `(a|b)*abb#`). * **Syntax Tree:** Build a syntax tree where interior nodes are operators and leaf nodes are input symbols or `$\epsilon$`. * **Functions for Nodes `n`:** * `nullable(n)`: True if the language `L(n)` can generate the empty string `$\epsilon$`. * `firstpos(n)`: Set of positions in the RE corresponding to the first symbol of string generated by `L(n)`. * `lastpos(n)`: Set of positions in the RE corresponding to the last symbol of string generated by `L(n)`. * **`followpos(i)`:** For a position `i` in the RE, `followpos(i)` is the set of positions `j` such that there is some string `...ij...` in `L(RE#)`. This determines DFA transitions. #### Rules for `nullable`, `firstpos`, `lastpos` | Node `n` | `nullable(n)` | `firstpos(n)` | `lastpos(n)` | | :------------- | :------------ | :--------------------------------------- | :--------------------------------------- | | `$\epsilon$` (leaf) | True | `$\emptyset$` | `$\emptyset$` | | `a` (leaf, pos `i`)| False | `{i}` | `{i}` | | `n1 | n2` | `nullable(n1) OR nullable(n2)` | `firstpos(n1) U firstpos(n2)` | `lastpos(n1) U lastpos(n2)` | | `n1 n2` | `nullable(n1) AND nullable(n2)` | `firstpos(n1) U firstpos(n2)` if `nullable(n1)` is True, else `firstpos(n1)` | `lastpos(n1) U lastpos(n2)` if `nullable(n2)` is True, else `lastpos(n2)` | | `n1*` | True | `firstpos(n1)` | `lastpos(n1)` | #### Rules for `followpos` 1. If `n` is a concatenation node `n1 n2` and `i` is a position in `lastpos(n1)`, then all positions in `firstpos(n2)` are in `followpos(i)`. 2. If `n` is a Kleene closure node `n1*` and `i` is a position in `lastpos(n1)`, then all positions in `firstpos(n1)` are in `followpos(i)`. #### Direct DFA Construction Steps (using `followpos`) 1. The start state of the DFA is `firstpos(root)` of the augmented RE syntax tree. 2. For each DFA state `D` (which is a set of positions) and each input symbol `a`: * Let `D_new` be the union of `followpos(p)` for all positions `p` in `D` where the symbol at `p` is `a`. * If `D_new` is not empty and is a new state, add it. * Add transition `D -- a --> D_new`. 3. A DFA state `D` is final if it contains the position of the end-marker `#`. ### 28. DFA Minimization * **Goal:** To reduce the number of states in a DFA while preserving the language it recognizes. A minimal DFA is unique for a given regular language. * **Algorithm (Partitioning Method / Hopcroft's Algorithm):** 1. **Initial Partition:** Divide the set of all states `Q` into two initial groups: * `P0`: Set of all final states `F`. * `P1`: Set of all non-final states `Q - F`. 2. **Iterative Refinement:** For each group `G` in the current partition `P` and for each input symbol `a`: * Split `G` into subgroups if necessary. A group `G` is split if there are two states `s1, s2 ∈ G` such that `move(s1, a)` and `move(s2, a)` go to states in *different* groups of the current partition `P`. * If `s1` and `s2` are in `G`, but `move(s1, a)` is in `G_X` and `move(s2, a)` is in `G_Y` (where `G_X ≠ G_Y`), then `s1` and `s2` must be separated. * Create a new partition `P_new` by splitting groups from `P`. 3. **Repeat:** Continue step 2 until no group can be further split for any input symbol. The resulting partition is the set of states for the minimal DFA. 4. **Construct Minimal DFA:** * Each group in the final partition becomes a state in the minimal DFA. * The start state is the group containing the original DFA's start state. * Final states are the groups containing original DFA's final states. * For a group `G` and input `a`, if `move(s, a)` for any `s ∈ G` goes to group `G'`, then there is a transition `G -- a --> G'`. * **Example (Conceptual):** * Suppose `Q = {A, B, C, D, E}`. * `F = {D, E}` (final states). * `Q - F = {A, B, C}` (non-final states). * **Initial Partition:** `P = {{A, B, C}, {D, E}}`. * **Refine (e.g., for symbol 'a'):** * If `move(A, a)` goes to `B` (in `{A,B,C}`) and `move(B, a)` goes to `D` (in `{D,E}`), then `{A, B, C}` must be split. * This process continues until no state within a group can be distinguished from another state in the same group by any input symbol. ### Most Expected Exam Questions * **2-Mark Questions:** * Define Translator. * What is the difference between a compiler and an interpreter? * List the phases of a compiler. * What is a token, lexeme, and pattern? Give an example. * What is the role of the symbol table? * What is epsilon-closure? * What are the two main parts of the Analysis-Synthesis model? * Mention two applications of compiler technology. * What are sentinels in input buffering? * Define regular expression. * **8-Mark Questions:** * Explain the different types of language translators with examples and diagrams. * Describe the various phases of a compiler with a neat diagram. * Explain the concepts of token, lexeme, and pattern with an example from a programming language statement. * Discuss the role of the lexical analyzer and its secondary tasks. * Explain input buffering techniques in lexical analysis, including sentinels. * What are regular expressions? Explain the operations on languages and their use in token specification. * **16-Mark Questions:** * Discuss the Analysis-Synthesis model of compilation in detail, explaining each phase with an example like `position := initial + rate * 60`. Include diagrams for syntax trees and intermediate code. * Explain the process of converting a regular expression into an NFA using Thompson's construction, and then converting the NFA to a DFA using subset construction. Provide a detailed example. * Describe the different types of errors encountered in various phases of a compiler and discuss error recovery strategies. * Explain the architecture of a Lex tool and how it is used to generate a lexical analyzer, including its input/output files and internal structure. * Compare and contrast Compiler, Interpreter, and Hybrid Compiler, highlighting their advantages and disadvantages. ### Important Definitions to Memorize * **Translator:** Program converting one language to another. * **Compiler:** Translates entire source code to target code. * **Interpreter:** Translates and executes source code line by line. * **Assembler:** Translates assembly language to machine code. * **Hybrid Compiler (JIT):** Compiles to bytecode, then interprets/optimizes during runtime. * **Lexical Analyzer (Scanner):** First phase, converts characters to tokens. * **Token:** Logical unit (e.g., keyword, identifier, operator). * **Lexeme:** Sequence of characters forming a token. * **Pattern:** Rule describing lexemes for a token. * **Syntax Analyzer (Parser):** Second phase, checks grammar and builds parse tree. * **Semantic Analyzer:** Third phase, checks meaning and type consistency. * **Intermediate Code:** Machine-independent representation between source and target code. * **Code Optimizer:** Improves intermediate code for efficiency. * **Code Generator:** Final phase, produces target machine code. * **Symbol Table:** Data structure storing identifier information. * **Error Handler:** Detects and reports errors, attempts recovery. * **Regular Expression:** Notation for specifying patterns of strings (tokens). * **Finite Automata (FA):** Mathematical model for recognizing languages. * **NFA:** Non-deterministic FA, allows multiple transitions for an input/epsilon. * **DFA:** Deterministic FA, one unique transition for each state/input. * **$\epsilon$-closure:** Set of states reachable via epsilon transitions. ### Frequently Confused Concepts * **Compiler vs. Interpreter:** The core difference lies in *when* the translation occurs (before execution for compiler, during execution for interpreter) and *what* is produced (executable file vs. direct execution). * **Token vs. Lexeme vs. Pattern:** * **Pattern** is the rule. * **Lexeme** is the actual text matching the rule. * **Token** is the category/type of the lexeme. * **Syntax Error vs. Semantic Error:** * **Syntax Error:** Grammatical mistake (e.g., missing semicolon). Program structure is incorrect. * **Semantic Error:** Meaningful mistake (e.g., type mismatch, undeclared variable). Program structure might be correct, but the logic/types are invalid. * **`L*` vs. `L+`:** * `L*` (Kleene closure) includes the empty string (zero or more occurrences). * `L+` (Positive closure) does NOT include the empty string (one or more occurrences). * **NFA vs. DFA:** NFA allows non-determinism and epsilon transitions, making it easier to construct but harder to execute. DFA is deterministic, easier to execute but harder to construct directly. They are equivalent in power. ### Final Rapid Revision Section * **Compiler Phases in Order:** Lexical -> Syntax -> Semantic -> Intermediate Code -> Optimization -> Code Generation. * **Front End (Analysis):** Source-dependent, output IR. * **Back End (Synthesis):** Target-dependent, input IR, output machine code. * **Lexical Analyzer's Job:** Characters to tokens. Strips whitespace/comments. Uses input buffering (sentinels). * **Parser's Job:** Tokens to parse tree. Checks grammar. * **Semantic Analyzer's Job:** Parse tree to annotated parse tree. Checks meaning, types. * **Intermediate Code:** Three-address code is common. * **Optimization:** Improves speed/memory. * **Code Generation:** IR to target machine code (assembly), register allocation. * **Symbol Table:** Central repository for identifier info across all phases. * **Error Handling:** Localized recovery at each phase. Panic mode, statement mode, error productions. * **Regular Expressions:** Used to define tokens. Operations: union, concatenation, closures. * **Finite Automata:** REs converted to NFA (Thompson's), then NFA to DFA (Subset Construction). * **DFA Minimization:** Reduce states using partitioning method.