Amazon Cloud (AWS)
Definition: Amazon Web Services (AWS) is a cloud computing platform offering on-demand IT resources.
Service Models:
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)

Advantages of AWS
Flexibility: Instant availability of new features, effortless hosting of legacy apps, choice in deployment.
Cost-effectiveness: No upfront investment, minimal expense, pay-as-you-go model.
Scalability/Elasticity: Auto-scaling, elastic load balancing, handles unpredictable loads.
Security: End-to-end security, virtual infrastructure for privacy and isolation.

Amazon Elastic Compute Cloud (EC2)
Virtual Server Platform: Create and run virtual machines (VMs) on Amazon's server infrastructure.
Amazon Machine Images (AMI): Templates used to launch server instances with various operating systems (Linux, Windows).

AWS Core Services
Amazon Simple Queue Service (SQS): Message queue for distributed, Internet-based applications.
Amazon Simple Notification Service (SNS): Publishes messages from an application and delivers them to subscribers.
Amazon CloudWatch: Monitors EC2 and other cloud resources.
Elastic Load Balancing: Distributes traffic across instances, performs health checks, and detects failing instances.

EC2 Pricing Options
On-Demand: Pay a fixed rate per hour/second with no commitment. Ideal for flexibility without upfront cost.
Reserved: Contract for 1 or 3 years; significant discount for reserved capacity.
Spot Instances: Bid for unused instance capacity; useful for workloads with flexible start/end times that need low compute prices.
Dedicated Hosts: Physical server fully dedicated to your use; can reduce costs by reusing existing server-bound software licenses.

Simple Storage Service (S3)
Definition: Stores data as objects in containers called buckets.
Buckets: Each has its own policies/configurations; names must be globally unique. Limit of 100 per account (can be increased).
Max Object Size: 5 TB per object (bucket capacity itself is unlimited).
Object Contents: Key, Version ID, Value, Metadata, Subresources, Access control information, Tags.
S3 Versioning: Keeps records of previously uploaded files, prevents accidental overwrites/deletions, adds storage cost.
Bucket Policy: JSON document defining access to an S3 bucket for identities within the AWS account.
Access Control Lists (ACLs): Grant access to S3 buckets from outside the owning AWS account; specific to each bucket.
Lifecycle Rules: Cost-saving practice that moves files to AWS Glacier or other S3 storage classes, or deletes data, on a schedule.
Key: Unique identifier for an object in a bucket.

Advantages of S3 Buckets
Scalability: Horizontally scalable, handles large data volumes, scales automatically.
High Availability: Data accessible from any region; 99.9% uptime SLA.
Data Lifecycle Management: Automates transition/expiration of objects based on rules.
Integration: Integrates with other AWS services.
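To make the bucket, key, versioning, and lifecycle ideas above concrete, here is a minimal sketch using boto3, the AWS SDK for Python. It assumes AWS credentials are already configured; the bucket name and object key are placeholders (the bucket name must be globally unique).

    import boto3

    # Client for the S3 API; assumes credentials via environment/config.
    s3 = boto3.client("s3", region_name="us-east-1")

    bucket = "example-notes-bucket-12345"  # placeholder; must be globally unique
    s3.create_bucket(Bucket=bucket)

    # Versioning keeps previous uploads instead of overwriting them.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # The key is the object's unique identifier within the bucket.
    s3.put_object(Bucket=bucket, Key="reports/summary.txt", Body=b"hello from S3")

    # Lifecycle rule: transition objects to Glacier after 90 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }]
        },
    )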
Amazon Relational Database Service (RDS)
Definition: Managed SQL database service for setting up, operating, and scaling relational databases in the cloud.
Supported Engines: Supports several database engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Amazon Aurora).
Management: Helps with migration, backup, recovery, and patching.

Amazon RDS Features
Replication: Creates read replicas (read-only copies) without changing the production database.
Monitoring: Amazon CloudWatch provides managed monitoring with capacity and I/O metrics.
Patching: Provides patches for the chosen database engine.
Backups: Managed instance backups with transaction logs enable point-in-time recovery.

Benefits of Amazon RDS
Easy to Administer: Simplifies deployment.
Highly Scalable: Scale compute/storage resources with minimal downtime.
Available and Durable: Runs on highly reliable AWS infrastructure.
Fast: Supports demanding database applications.
Secure: Easy control over network access.
Cost-effective: Low rates; pay only for the resources you consume.
Ease of Use: Admins can manage multiple instances from the console without special tooling.

Drawbacks of Amazon RDS
Lack of Root Access: Users do not have root access to the underlying server.
Downtime: Systems go offline for patching/scaling; timing varies.

AWS Cloud Development Kit (AWS CDK)
Definition: Framework for building reliable, scalable, cost-effective cloud applications using familiar programming languages.
Benefits:
Easier Cloud Onboarding: Use your preferred language and IDE.
Faster Development: The expressive power of programming languages accelerates development.
Customizable and Shareable: Extend components to meet security, compliance, and governance needs; components are easily shared.
No Context Switching: Write runtime code and define AWS resources in the same language and IDE.
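As a small illustration of defining AWS resources in a programming language, here is a minimal AWS CDK sketch in Python. It assumes the aws-cdk-lib package (CDK v2) is installed; the stack and bucket names are illustrative only.

    import aws_cdk as cdk
    from aws_cdk import aws_s3 as s3

    class NotesStack(cdk.Stack):
        def __init__(self, scope, construct_id, **kwargs):
            super().__init__(scope, construct_id, **kwargs)
            # Infrastructure is declared in ordinary Python instead of
            # hand-written CloudFormation templates.
            s3.Bucket(self, "NotesBucket", versioned=True)

    app = cdk.App()
    NotesStack(app, "NotesStack")
    app.synth()  # emits a CloudFormation template into cdk.out/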
Google Cloud Platform (GCP)
Definition: Cloud computing platform for building, deploying, and scaling applications using Google-managed data centers.
Services: Infrastructure, storage, data analytics, AI, networking, developer tools.
Service Models:
IaaS: Virtual machines, storage, networking.
PaaS: App development and deployment platforms.
SaaS: Ready-to-use software applications.

Key Services of Google Cloud Platform (GCP)
Compute Services:
Compute Engine: Virtual machines (VMs) for flexible computing.
App Engine: Platform for developing and hosting web apps.
Cloud Run: Deploy and run containerized apps without managing servers.
Storage Services:
Cloud Storage: Object storage for unstructured data (images, videos).
Persistent Disk: Block storage for VMs.
Filestore: Managed file storage for applications.
Cloud Storage Buckets: For data archiving and backup.
Database Services:
Cloud SQL: Managed relational databases (MySQL, PostgreSQL, SQL Server).
Cloud Firestore / Datastore: NoSQL document databases for mobile/web apps.
Bigtable: High-performance NoSQL for large analytical workloads.
Spanner: Global, horizontally scalable relational database.
Networking Services:
Virtual Private Cloud (VPC): Private, secure network.
Cloud Load Balancing: Distributes user traffic efficiently.
Cloud CDN: Delivers content quickly worldwide.
Cloud DNS: Scalable and reliable domain name system.
Big Data & Analytics Services:
BigQuery: Serverless data warehouse for fast SQL analytics.
Dataflow: Stream and batch data processing.
Pub/Sub: Messaging service for real-time event streaming.
AI & Machine Learning:
Vertex AI: Unified platform for ML model training and deployment.
TensorFlow: Open-source ML framework.
AI APIs: Pre-trained models for Vision, Speech, Natural Language, Translation.
Developer & Management Tools:
Cloud SDK: Command-line tools for managing GCP resources.
Cloud Build: For building and deploying code.
Cloud Source Repositories: Private Git repositories.
Cloud Monitoring & Logging: Observability tools for app health/performance.
Identity & Security Services:
Cloud IAM: Manage user access and roles.
Cloud Security Command Center: Centralized security management.
Cloud KMS: Manages encryption keys.
Shielded VMs & VPC Service Controls: Enhance system and data protection.

Advantages of GCP
Scalability: Automatically scales resources based on usage.
Global Infrastructure: Data centers worldwide.
Security: Built-in encryption, IAM, compliance standards.
Cost Efficiency: Pay-as-you-go pricing, sustained use discounts.
AI & ML Integration: Access to Google's advanced AI models.
Sustainability: Runs on 100% renewable energy.

GCP Security Features
IAM: Control who can access what.
VPC Service Controls: Prevent data exfiltration.
Encryption: Data encrypted at rest and in transit.
Cloud Armor: DDoS protection.

Google Compute Engine
IaaS: Run virtual machines (VMs) on Google's cloud infrastructure.
VMs: Run Linux or Windows OS.
Configurable: Number of CPUs, RAM, disk size/type, network/firewall settings.

Key Features of Compute Engine
Custom Machine Types: Choose exact vCPUs and memory.
Live Migration: VMs can move to another host without rebooting.
Snapshots & Images: Back up VM disks or create custom machine images.
Autoscaling: Automatically adds/removes VM instances based on traffic.
Load Balancing: Distributes incoming network traffic across multiple VMs.
GPU/TPU Support: Attach GPUs or TPUs for high-performance computing and AI/ML.

Components of Compute Engine
Instances: Virtual machines you create and run.
Machine Types: Define CPU, memory, and performance class.
Images: OS templates (e.g., Ubuntu, Windows Server, Debian).
Disks: Persistent or local storage.
Networks: Virtual Private Cloud (VPC) networks that connect VMs.
Firewalls: Rules that control inbound and outbound traffic.
Metadata: Key-value pairs that pass configuration data to instances.

Advantages of Compute Engine
Highly customizable machine types.
Scalable and flexible.
Global network performance.
Cost-effective pricing (with sustained use discounts).
Integration with other GCP services (BigQuery, Cloud Storage).

Google Developer Tools
Definition: Software, SDKs, APIs, and cloud services for developers.
Capabilities:
Build, test, and deploy web/mobile applications.
Manage code and collaboration.
Monitor performance.
Integrate Google services (Maps, Firebase, Ads, AI).
Cover everything from frontend web development to backend cloud infrastructure.

Popular Google Tools
Firebase: Fast app backend, database, hosting, analytics.
Android Studio: Building Android apps.
Flutter: Cross-platform app UI toolkit.
Chrome DevTools: Debugging and optimizing web apps.
TensorFlow / Vertex AI: Machine learning and AI.
Cloud SDK: Manage GCP via the command line.
BigQuery: Analyze huge datasets in seconds.

Benefits of Google Developer Tools
Scalable and reliable infrastructure.
Integration with Google services.
Security and privacy controls.
Global developer community and documentation.

Database
Relational Databases: Information stored in tables, rows, and columns; best for structured data.
ACID Properties:
Atomicity: The entire transaction takes place at once or not at all.
Consistency: The database must be consistent before and after the transaction.
Isolation: Multiple transactions occur independently.
Durability: Changes made by a successful transaction persist even if the system fails.

Cloud Storage
Google Cloud Storage (GCS): Stores and accesses data (files, backups, media) on Google's infrastructure.
Characteristics: Highly scalable, durable, secure object storage system.
Object Storage: Stores unstructured data (images, videos, documents, logs, backups).
Buckets: Data is stored in "buckets," which hold objects (files).

Key Features of Cloud Storage
Object Storage: Stores data as objects (not files or blocks); each object has data, metadata, and a unique ID.
Scalability: Automatically scales to handle any amount of data.
Durability: 99.999999999% (11 nines) annual durability due to multi-region replication.
Availability: High availability with automatic failover.
Security: Data encrypted at rest and in transit; supports IAM and ACLs.
Versioning: Keeps multiple versions of an object for backup and recovery.
Lifecycle Management: Automatically moves data between storage classes or deletes old data based on rules.
Public or Private Access: Make data public (for websites) or private (for backups).
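The bucket/object model above can be exercised with the google-cloud-storage client library for Python. A minimal sketch, assuming application default credentials are set up and the bucket already exists; the bucket and file names are placeholders.

    from google.cloud import storage

    client = storage.Client()  # uses application default credentials

    bucket = client.bucket("example-gcs-notes-bucket")  # placeholder name
    blob = bucket.blob("backups/db-dump.sql")           # object name (key)

    # Upload: the object stores the data plus metadata under a unique name.
    blob.upload_from_filename("db-dump.sql")

    # Download the same object elsewhere.
    blob.download_to_filename("restored-dump.sql")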
Cloud Tools for Eclipse
Plugin: Developed by Google for Java developers to build, deploy, and manage GCP applications directly from the Eclipse IDE.
Integration: Integrates Eclipse with GCP services such as App Engine, Cloud Storage, and Cloud Pub/Sub.

Supported Google Cloud Services (with Eclipse Tools)
App Engine: Build and deploy scalable web apps.
Compute Engine: Manage and deploy Java apps on virtual machines.
Cloud Storage: Connect applications to Google Cloud Storage for file storage.
Cloud SQL: Integrate with relational databases in GCP.
Pub/Sub: Add message-based communication between microservices.

Benefits of Using Cloud Tools for Eclipse
Simplifies cloud deployment (no need to leave Eclipse).
Provides local testing and debugging.
Offers code templates and wizards for cloud projects.
Integrates with the Google Cloud SDK for command-line compatibility.
Easy monitoring of logs and performance inside the IDE.
Reduces configuration effort for Java developers moving to the cloud.

MapReduce
Framework: Processes huge amounts of data in parallel on large clusters.
Core Tasks: Map and Reduce.
Map: Takes data and breaks it into key/value pairs.
Reduce: Takes map output and combines the data tuples into a smaller set.
Sequence: The reduce task is always performed after the map job.

MapReduce Architecture Components
Client: Brings the job for processing; multiple clients can send jobs.
Job: The actual work requested by the client, composed of smaller tasks.
Hadoop MapReduce Master: Divides the job into job-parts.
Job-Parts: Sub-jobs produced by dividing the main job; their results are combined into the final output.
Input Data: The dataset fed to MapReduce.
Output Data: The final result after processing.

How MapReduce Works (Word Count Example)
Input Splitting: A large input file is split into smaller chunks (blocks); each block is processed in parallel.
Mapping Phase: The map function runs on each chunk and produces key-value pairs.
Input: "cat dog cat"
Output: ("cat", 1), ("dog", 1), ("cat", 1)
Shuffling and Sorting: The system groups all values with the same key across mappers.
Example: ("cat", [1,1]), ("dog", [1])
Reducing Phase: Each reducer processes the values for a given key and aggregates them.
Example: Reduce("cat", [1,1]) → ("cat", 2); Reduce("dog", [1]) → ("dog", 1)
Output: Results are written to an output file:
cat 2
dog 1
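The three phases above can be mirrored in a few lines of ordinary Python. This is a single-machine sketch of the word-count flow, not Hadoop's actual API; real jobs implement mappers and reducers in Java or via Hadoop Streaming.

    from collections import defaultdict

    def map_phase(chunk):
        # Emit ("word", 1) for every word in one input split.
        return [(word, 1) for word in chunk.split()]

    def shuffle(pairs):
        # Group all values that share a key, across all mapper outputs.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Aggregate the grouped values for one key.
        return key, sum(values)

    chunks = ["cat dog cat"]                            # input splitting
    mapped = [p for c in chunks for p in map_phase(c)]  # mapping phase
    grouped = shuffle(mapped)                           # shuffling and sorting
    for key in sorted(grouped):                         # reducing phase
        word, count = reduce_phase(key, grouped[key])
        print(word, count)  # prints: cat 2, then dog 1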
Applications of MapReduce
Entertainment: Discover popular movies from logs and clicks.
E-commerce: Identify popular items from customer behavior, website records, and purchase history.
Data Warehouse: Analyze large data volumes and implement business logic for data insights.
Fraud Detection: Used by financial enterprises for fraud detection and transaction analysis.

Hadoop
Framework: Open-source framework by the Apache Software Foundation.
Purpose: Storing and processing large-scale data in a distributed computing environment.
Capabilities: Stores massive data across many computers and processes it in parallel, efficiently and reliably.

Main Components of the Hadoop Ecosystem
HDFS (Hadoop Distributed File System): Stores large files by splitting them into blocks and replicating the blocks across multiple nodes.
YARN (Yet Another Resource Negotiator): Manages and schedules system resources for applications running in the Hadoop cluster.
MapReduce: Data processing model that divides tasks into Map and Reduce phases.
Hadoop Common: Collection of utilities and libraries used by the other Hadoop modules.

3 Types of Schedulers in Hadoop
FIFO Scheduler: First In, First Out; tasks are served in submission order. Default in Hadoop. No intervention once scheduled, so high-priority tasks might wait.
Advantages: No configuration needed, first come first served, simple to execute.
Disadvantages: Task priority doesn't matter; not suitable for a shared cluster.
Capacity Scheduler: Introduced by Yahoo!. Divides the cluster into multiple queues, each with a configured capacity; each queue can run multiple jobs simultaneously.
Advantages: Good for multiple clients and priority jobs; maximizes throughput.
Disadvantages: More complex; not easy for everyone to configure.
Fair Scheduler: Developed by Facebook. Similar to the Capacity Scheduler, with priority considered. Ensures all jobs get an equal share of cluster resources over time; resources are redistributed when a job finishes.
Advantages: Resources assigned based on priority.
Disadvantages: Configuration is required.

Fault Tolerance in Hadoop
Definition: The ability of a system to recover from failures/errors without losing data or functionality.
Mechanism: Relies on data replication and checkpointing; duplicates data blocks across multiple nodes and periodically saves computation state to disk.
Benefits: High reliability and durability.
Trade-offs: Consumes more disk space and network bandwidth.
Best for: Low-cost, high-availability environments that can tolerate the extra disk I/O and network latency.

Example of HDFS Fault Tolerance
A user stores file XYZ. HDFS breaks the file into blocks (A, B, C). Assume four DataNodes (D1, D2, D3, D4). HDFS places copies of each block on different nodes for fault tolerance: three copies per block in total (the original plus two replicas), matching HDFS's default replication factor of 3.
Example: Block A on D1, D2, D4; Block B on D2, D3, D4; Block C on D1, D2, D3.
If D1 fails, blocks A and C are still available from other DataNodes (D2 and D4 for A; D2 and D3 for C).
Result: No data loss even in unfavorable conditions.
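The block placement in this example can be checked with a toy Python simulation. This is purely illustrative; real HDFS tracks replicas through the NameNode and re-replicates automatically after a failure.

    # Block placement from the example: three copies of each block.
    placement = {
        "A": {"D1", "D2", "D4"},
        "B": {"D2", "D3", "D4"},
        "C": {"D1", "D2", "D3"},
    }

    def surviving_copies(placement, failed_node):
        # Remove the failed DataNode from every block's replica set.
        return {block: nodes - {failed_node}
                for block, nodes in placement.items()}

    after_failure = surviving_copies(placement, "D1")
    for block, nodes in sorted(after_failure.items()):
        assert nodes, f"block {block} lost!"  # never fires: replication saves us
        print(block, "still available on", sorted(nodes))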