Course Overview: CS210 - Database Management Systems This course provides a comprehensive understanding of database systems, from fundamental concepts to advanced topics like concurrency control and data mining. It aims to equip you with the skills to design, query, and manage robust database solutions. UNIT-I: Database Concepts and Data Model 1. Database System Fundamentals Definition & Purpose: A DBMS is software that allows users to define, create, maintain, and control access to the database. Its purpose is to store, manage, and retrieve data efficiently and securely. Applications: E-commerce, banking, healthcare, education, inventory management, etc. Data Abstraction: Hiding complex data storage details from users. Physical Level: How data is physically stored (e.g., files, disk blocks). Logical Level: What data is stored and relationships between data (e.g., tables, rows). View Level: Application-specific views of the database, customized for different user groups. Database Architecture: Typically 3-tier (client, application server, database server) or 2-tier (client, database server). Database Users & Administrators: Users: Interact with the database through applications or direct queries. DBA: Manages the database system, including schema definition, storage structure, access control, and backup/recovery. Instances & Schema: Schema: The logical design of the database (structure). Instance: The actual data stored in the database at a particular moment in time. 2. Data Models Entity-Relationship (ER) Model: A high-level conceptual data model. Overview & Definitions: Represents real-world entities and their relationships. Entity: A real-world object distinguishable from other objects (e.g., Student, Course). Attribute: Properties describing an entity (e.g., Student ID, Name). Relationship: Association between entities (e.g., Student ENROLLS in Course). ER Diagram: Graphical representation using rectangles (entities), ovals (attributes), diamonds (relationships). Mapping Cardinalities: Express the number of entities to which another entity can be associated via a relationship set. 1:1 (One-to-one) 1:N (One-to-many) N:1 (Many-to-one) M:N (Many-to-many) Reduction to Relational Schema: Converting an ER diagram into relational tables. Extended ER (EER) Features: Specialization, Generalization, Aggregation, Inheritance. Relational Model: The most widely used data model. Structure: Data is organized into two-dimensional tables (relations). Each table has a unique name. Relation (Table): A set of tuples (rows). Tuple (Row): A record representing a single entity or relationship instance. Attribute (Column): A named column in a table, representing a property. Domain: The set of permissible values for an attribute. Keys: Used to uniquely identify tuples and establish relationships. Primary Key (PK): Uniquely identifies each row in a table. Cannot be NULL. Foreign Key (FK): A column or set of columns that refers to the primary key of another table, establishing a link. Candidate Key: Any attribute or set of attributes that can uniquely identify a tuple. Super Key: Any set of attributes that includes a candidate key. 3. Relational Query Languages Relational Algebra: Procedural query language. Operations include: Selection ($\sigma$): Filters rows based on a condition. $\sigma_{condition}(R)$ Projection ($\pi$): Selects specific columns. $\pi_{attributes}(R)$ Union ($\cup$), Intersection ($\cap$), Difference ($-$): Set operations on relations with compatible schemas. Cartesian Product ($\times$): Combines every row from one relation with every row from another. $R \times S$ Join ($\bowtie$): Combines rows from two or more tables based on a related column between them. Natural Join ($\bowtie$): Joins on common attributes and removes duplicate columns. Theta Join ($\bowtie_{\theta}$): Joins based on an arbitrary condition $\theta$. Rename ($\rho$): Renames a relation or its attributes. Tuple Relational Calculus (TRC): Non-procedural query language, describes what to retrieve without specifying how. $\{t | P(t)\}$ (Set of all tuples $t$ such that predicate $P(t)$ is true). Domain Relational Calculus (DRC): Similar to TRC but uses domain variables. $\{ | P(x_1, x_2, ..., x_n)\}$ UNIT-II: Database Design and Querying 1. Relational Database Design Overview & Features: Focuses on creating a logical schema that avoids redundancy and ensures data integrity. Normalization: A process of organizing columns and tables to minimize data redundancy and improve data integrity. Functional Dependency (FD): $X \to Y$ means that the value of $X$ uniquely determines the value of $Y$. Normal Forms: 1NF (First Normal Form): All attributes must be atomic (indivisible) and each column must contain only one value. No repeating groups. 2NF (Second Normal Form): Is in 1NF and all non-key attributes are fully functionally dependent on the primary key (no partial dependencies). 3NF (Third Normal Form): Is in 2NF and has no transitive dependencies (non-key attributes do not depend on other non-key attributes). BCNF (Boyce-Codd Normal Form): A stricter form of 3NF. For every non-trivial FD $X \to Y$, $X$ must be a superkey. Decomposition: Breaking down a relation into smaller, well-structured relations to achieve higher normal forms. Lossless Join Decomposition: Original relation can be reconstructed from decomposed relations without spurious tuples. Dependency Preserving Decomposition: All functional dependencies can be enforced in the decomposed relations. Multi-Valued Dependencies (MVDs): Occur when an attribute can have multiple independent values in a single tuple, leading to 4NF. 2. Structured Query Language (SQL) Definition: Standard language for managing and manipulating relational databases. Basic Structure: SELECT column1, column2 FROM table_name WHERE condition ORDER BY column; Data Types: INT , VARCHAR , DATE , BOOLEAN , DECIMAL , etc. Basic Operations: DDL (Data Definition Language): Defines database schema. CREATE TABLE table_name (column_name data_type, ...); ALTER TABLE table_name ADD column_name data_type; DROP TABLE table_name; DML (Data Manipulation Language): Manages data within tables. INSERT INTO table_name (column1, ...) VALUES (value1, ...); UPDATE table_name SET column1 = value1 WHERE condition; DELETE FROM table_name WHERE condition; DCL (Data Control Language): Manages permissions and access. GRANT permission ON object TO user; REVOKE permission ON object FROM user; TCL (Transaction Control Language): Manages transactions. COMMIT; ROLLBACK; SAVEPOINT; Set Operations: UNION , INTERSECT , EXCEPT (or MINUS ). Aggregate Functions: Perform calculations on a set of rows. COUNT() , SUM() , AVG() , MIN() , MAX() . Used with GROUP BY to group rows sharing common values. HAVING clause filters groups after aggregation. Nested Sub-queries: A query embedded within another SQL query. SELECT * FROM Employees WHERE Salary > (SELECT AVG(Salary) FROM Employees); Join Expressions: Combining data from multiple tables. INNER JOIN : Returns rows when there is a match in both tables. LEFT (OUTER) JOIN : Returns all rows from the left table, and matching rows from the right. RIGHT (OUTER) JOIN : Returns all rows from the right table, and matching rows from the left. FULL (OUTER) JOIN : Returns all rows when there is a match in one of the tables. Views: Virtual tables based on the result-set of a SQL query. CREATE VIEW expensive_products AS SELECT * FROM Products WHERE Price > 100; Transactions: A single logical unit of work performed on a database. ACID Properties: Atomicity, Consistency, Isolation, Durability (covered in UNIT-III). Integrity Constraints: Rules to maintain data accuracy and consistency. PRIMARY KEY , FOREIGN KEY , UNIQUE , NOT NULL , CHECK , DEFAULT . Authorization: Granting/revoking privileges (e.g., SELECT, INSERT, UPDATE, DELETE) to users. 3. PL-SQL (Procedural Language/SQL) Definition: Oracle's procedural extension to SQL, combining SQL with procedural programming features. Basic Structure: DECLARE -- declarations BEGIN -- SQL statements -- PL/SQL statements EXCEPTION -- error handling END; Procedures: Named PL/SQL blocks stored in the database, can accept parameters. Functions: Similar to procedures, but always return a single value. Cursors: Pointers to the context area, used to process a set of rows returned by a query one by one. Implicit Cursors: Automatically created by Oracle for DML statements. Explicit Cursors: Declared and managed by the programmer for multi-row queries. Triggers: PL/SQL blocks that automatically execute in response to a specific database event (INSERT, UPDATE, DELETE) on a table. BEFORE / AFTER , FOR EACH ROW / FOR EACH STATEMENT . Packages: Encapsulate related procedures, functions, variables, and other PL/SQL constructs into a single named unit. UNIT-III: Query Processing and Fast Retrieval 1. Query Processing Basic Steps: Parsing and Translation: SQL query is parsed and translated into an equivalent relational algebra expression. Optimization: Query optimizer finds the most efficient execution plan (lowest cost). Evaluation: Query execution engine executes the chosen plan. Measures of Query Cost: Number of disk I/Os, CPU time, network communication. Disk I/O is usually the dominant factor. Query Optimization: Process of choosing the most efficient query execution plan. Heuristic Optimization: Applying rules of thumb (e.g., perform selections early). Cost-Based Optimization: Estimating the cost of different plans and choosing the cheapest. Equivalent Expression and Query Evaluation Plan: Different sequences of relational algebra operations can produce the same result but with varying costs. A query evaluation plan specifies the physical operators and their order. 2. Indexing Definition: A data structure technique used to quickly locate and access data in a database. Purpose: Speed up data retrieval operations (SELECT). Types of Indexing: Primary Index: Defined on the ordered field of the file. Secondary Index: Defined on a non-ordering field. Clustering Index: Defined on a non-key field, but records are physically ordered on this field. B-Tree and B+ Tree: B-Tree: Self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. Internal nodes can store data. B+ Tree: A variation of B-Tree where all data pointers are stored only at the leaf nodes, which are linked together to facilitate efficient range queries. Internal nodes only store keys. Widely used in databases. 3. Hashing Basic Concepts: A technique to directly calculate the disk address of a record rather than searching through an index structure. Hash Function: A function that maps search keys to disk block addresses. Static Hashing: The number of buckets remains fixed. Collisions are handled by chaining or open addressing. Dynamic Hashing: Allows the hash function and number of buckets to grow or shrink dynamically as the database changes (e.g., Extendable Hashing, Linear Hashing). Comparison of Indexing and Hashing: Indexing: Good for range queries, ordered retrieval. Hashing: Excellent for equality searches (exact match), but poor for range queries or ordered retrieval. 4. Transaction Management Overview: A logical unit of work that must be processed in its entirety or not at all. Transaction States: Active: The initial state; the transaction is executing. Partially Committed: After the last statement is executed. Failed: Transaction cannot proceed due to an error. Aborted: Transaction has been rolled back and database restored to previous state. Committed: Transaction completed successfully and changes are permanently recorded. ACID Properties: Atomicity: All operations within a transaction are completed successfully, or none are. "All or nothing." Consistency: A transaction must bring the database from one valid state to another valid state. Isolation: Concurrent transactions appear to execute serially; intermediate results of one transaction are not visible to others. Durability: Once a transaction is committed, its changes are permanent and survive system failures. Implementation of ACID Properties: Achieved through logging (for atomicity and durability) and concurrency control (for isolation). Serializability: A property of transaction schedules where the concurrent execution of multiple transactions produces the same result as some serial execution of the same transactions. It's the gold standard for isolation. UNIT-IV: Concurrency Control and DB Architecture 1. Concurrency Control Overview: Managing simultaneous execution of multiple transactions to ensure data consistency and integrity. Problems: Lost Update, Dirty Read, Non-repeatable Read, Phantom Read. Lock Types: Shared (S) Lock: Allows concurrent read access. Exclusive (X) Lock: Grants exclusive access for read and write operations. Lock-based Protocols: Two-Phase Locking (2PL): Guarantees serializability. Growing Phase: Transaction can acquire locks but not release any. Shrinking Phase: Transaction can release locks but not acquire any new ones. Strict 2PL: Holds all exclusive locks until commit/abort. Prevents cascading rollbacks. Deadlock Conditions and Handling: Deadlock: Two or more transactions are waiting indefinitely for each other to release locks. Conditions: Mutual Exclusion, Hold and Wait, No Preemption, Circular Wait. Handling: Prevention: Imposing an ordering on resources, timestamp ordering. Detection and Recovery: Building a wait-for graph, detecting cycles, choosing a victim to rollback. 2. Recovery Systems Failure Classification: Transaction Failure: Logical errors, deadlocks. System Crash: Hardware/software errors, loss of main memory. Disk Failure: Loss of non-volatile storage. Storage: Volatile Storage: Data lost on power failure (RAM). Non-Volatile Storage: Data persists on power failure (Disk, SSD). Stable Storage: Data guaranteed to survive failures (achieved through redundancy). Recovery Algorithms: Log-based Recovery: Uses a log file (write-ahead logging) to record all database modifications. UNDO: For aborted transactions or transactions not yet committed at crash time. REDO: For committed transactions whose changes were not yet written to disk at crash time. Shadow Paging: Maintains two page tables, one for current state and one for shadow (previous) state. 3. Parallel and Distributed Databases Parallel Databases: Utilize multiple CPUs and disks to execute queries in parallel, improving performance. I/O Parallelism: Distributing data across multiple disks. Inter-query Parallelism: Executing different queries in parallel. Intra-query Parallelism: Executing parts of a single query in parallel. Intra-operation Parallelism: Parallelizing individual operations (e.g., sort, join). Inter-operation Parallelism: Executing different operations of a query plan in parallel. Distributed Databases: Data is stored across multiple physically separate sites, connected by a network. Homogeneous vs. Heterogeneous: Homogeneous: All sites use the same DBMS software. Heterogeneous: Different sites may use different DBMS software. Transaction System Architecture: How components of a DBMS (query processor, storage manager, transaction manager, etc.) are structured and interact. Concurrency Control (in Distributed DBs): More complex due to network delays and potential for partial failures. Protocols like distributed 2PL, timestamp ordering. UNIT-V: Data Mining and Information Retrieval 1. Data Mining Definition: The process of discovering patterns and insights from large datasets. Association Rules: Discovering relationships between items in large datasets (e.g., "customers who buy diapers also buy baby food"). Support: Frequency of itemset occurrence. Confidence: Conditional probability of consequent given antecedent. Apriori Algorithm: A classic algorithm for finding frequent itemsets. Classification: Predicting a categorical class label for new data based on a training set (e.g., classifying emails as spam/not spam). Algorithms: Decision Trees, Naive Bayes, Support Vector Machines (SVMs). Clustering: Grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (e.g., customer segmentation). Algorithms: K-Means, Hierarchical Clustering. 2. Data Warehouse Definition: A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. Architecture and Schemes: Architecture: Data sources, ETL (Extract, Transform, Load) process, Data warehouse, OLAP (Online Analytical Processing) tools, Reporting tools. Schemes: Star Schema: A central fact table surrounded by dimension tables. Snowflake Schema: A star schema where dimension tables are normalized into multiple related tables. 3. Information Retrieval (IR) Definition: The activity of obtaining information resources relevant to an information need from a collection of information resources. Ranking (keyword-based, Relevance-based): Ordering search results by their relevance to the user's query. Keyword-based: Simple matching of keywords. Relevance-based: More sophisticated algorithms considering term frequency, inverse document frequency (TF-IDF), PageRank, etc. Retrieval Effectiveness Measures: Precision: Proportion of retrieved documents that are relevant. $Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$ Recall: Proportion of relevant documents that are retrieved. $Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$ F-measure: Harmonic mean of precision and recall. Web Crawling and Indexing: Web Crawling: Process of systematically browsing the World Wide Web, typically for the purpose of Web indexing. Indexing: Creating an index of web pages for fast searching. 4. Specialized Databases Introduction to Spatial Databases: Stores and queries data that represents objects in a geometric space (e.g., maps, GIS data). Temporal Databases: Stores data relating to time, often including valid-time and transaction-time to track historical data. Multimedia Databases: Manages various types of multimedia data (images, audio, video) with specialized indexing and retrieval techniques. Case Study: Oracle: A widely used commercial relational database management system. Reference Books Abraham Silberschatz, Henry F. Korth and S. Sudarshan, Database System Concepts , Sixth Edition, McGraw-Hill International, Inc., 2011. Elmasri and Navathe, Fundamentals of Database Systems , Seventh Edition, Addison-Wesley, 2012. Fred R McFadden, Jeffery A. Hoffer and Mary B. Prescott, Modern Database Management , Addison Wesley.