Types of Data

Quantitative Data: Numerical, measurable data.
  Discrete: Countable, distinct values (e.g., number of students).
  Continuous: Values that can fall anywhere within a range, limited only by measurement precision (e.g., height, temperature).
Qualitative Data: Descriptive, non-numerical data.
  Nominal: Categories without order (e.g., gender, color).
  Ordinal: Categories with a meaningful order (e.g., satisfaction rating: low, medium, high).
Structured Data: Organized in a defined format (e.g., databases, spreadsheets).
Unstructured Data: No predefined format (e.g., text documents, images, audio).
Semi-structured Data: Has some organizational properties, such as tags or keys, but is not fully structured (e.g., XML, JSON).

Data Characteristics

Volume: Amount of data.
Velocity: Speed at which data is generated and processed.
Variety: Different forms of data (structured, semi-structured, unstructured).
Veracity: Trustworthiness and accuracy of data.
Value: Potential for insights and decision-making.

Data Quality and Reliability

1. Dimensions of Data Quality
Accuracy: Data is correct and reflects reality.
Completeness: All required data is present.
Consistency: Data values agree across systems and over time.
Timeliness: Data is current and available when needed.
Validity: Data conforms to defined rules and formats.
Uniqueness: No duplicate records.

2. Techniques for Assessment
Profiling: Analyzing data to discover patterns, anomalies, and statistics.
Auditing: Tracking data changes and origins over time.
Rule-based Validation: Applying predefined business rules to check data.
Sampling: Examining a subset of data for quality issues.
Data Cleansing: Correcting or removing incorrect, incomplete, or inconsistent data once issues have been identified.
Metadata Analysis: Examining data about data to understand its structure and quality.
User Feedback: Collecting input from data users about perceived quality.

Sourcing and Gathering Data

1. Data Sources
Internal Sources: Data generated within an organization (e.g., sales records, CRM, ERP systems, logs).
External Sources: Data from outside the organization.
  Public Data: Government data, open-source projects, academic research.
  Commercial Data: Purchased data from vendors (e.g., market research, demographic data).
  Social Media: User-generated content, interactions.
  Sensors/IoT: Data from physical devices.
  Web Scraping: Data extracted from websites.

2. Data Gathering Methods
Surveys/Questionnaires: Direct collection of information from individuals.
Interviews: One-on-one discussions to gather qualitative insights.
Observations: Watching and recording behaviors or events.
Experiments: Controlled studies to test hypotheses.
APIs (Application Programming Interfaces): Programmatic access to data from other systems.
Database Queries: Extracting data from databases (SQL, NoSQL).
Web Crawling/Scraping: Automated extraction of data from web pages.
Sensor Data Collection: Automated collection from IoT devices.
Log File Analysis: Examining system-generated logs.

Big Data

1. Concept
Big Data refers to extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. It is commonly characterized by the "5 Vs".

2. Characteristics (The 5 Vs)
Volume: Very large amounts of data, typically ranging from terabytes to exabytes.
Velocity: Data streaming in at high speed, requiring real-time or near real-time processing.
Variety: Data in many forms: structured, semi-structured, and unstructured.
Veracity: The quality and trustworthiness of the data. Big data is often messy and unreliable.
Value: The ability to transform big data into business value and actionable insights.

3. Sources of Big Data
Social Media: Posts, likes, shares, comments, user profiles.
Machine-generated Data: Sensor data (IoT), log files, RFID, GPS data.
Transactional Data: E-commerce transactions, financial records, billing records.
Web Data: Clickstream data, web server logs, search engine queries.
Human-generated Data: Emails, documents, images, videos.
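
Example: Checking Data Quality in Code

Several of the quality dimensions listed under Data Quality and Reliability (completeness, uniqueness, validity) can be measured with simple profiling and rule-based validation. The following is a minimal sketch using only the Python standard library. The sample records, the field names (id, email, age), the email pattern, and the 0 to 120 age range are all assumptions made for illustration, not rules taken from any particular system.

```python
import re
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},               # missing email
    {"id": 3, "email": "not-an-email", "age": -5},   # breaks two rules
    {"id": 3, "email": "li@example.com", "age": 41}, # duplicate id
]

REQUIRED_FIELDS = ("id", "email", "age")
# Deliberately simple pattern (assumption); real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, fields=REQUIRED_FIELDS):
    """Completeness: share of required values that are present and non-empty."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for field in fields
        if row.get(field) not in (None, "")
    )
    return filled / total


def duplicate_keys(rows, key="id"):
    """Uniqueness: key values that occur in more than one record."""
    counts = Counter(row.get(key) for row in rows)
    return [value for value, n in counts.items() if n > 1]


def validity_violations(rows):
    """Rule-based validation: (id, rule) pairs for every broken rule."""
    problems = []
    for row in rows:
        email = row.get("email")
        # Empty emails are a completeness issue, so they are skipped here.
        if email and not EMAIL_PATTERN.match(email):
            problems.append((row.get("id"), "email format"))
        age = row.get("age")
        if age is not None and not (0 <= age <= 120):
            problems.append((row.get("id"), "age out of range"))
    return problems


print("completeness:", round(completeness(records), 3))
print("duplicate ids:", duplicate_keys(records))
print("violations:", validity_violations(records))
```

Each function maps to one quality dimension, so the checks can be run separately or combined into a profiling report. Accuracy and timeliness are not covered, since they generally need a trusted reference source or timestamps to compare against.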