**Author:** Amol Sanjeev Patil, Consultant at Hoonartek

## Snowflake Architecture Overview (Deep Dive)

**Cloud Services Layer: the "brain" of Snowflake.**
- Authentication & Access: Secure login (SSO, MFA) and fine-grained access control (roles, privileges).
- Metadata Management: Stores all object definitions, data statistics (min/max values), and query history, enabling powerful optimization.
- Query Optimization: Analyzes SQL, rewrites it for efficiency, and generates optimal execution plans.
- Transaction Management: Ensures ACID compliance across all operations.

**Query Processing Layer (Virtual Warehouses): the "muscles" for computation.**
- Elastic Compute: Independent clusters of compute resources that dynamically scale up/down and suspend/resume.
- Workload Isolation: Different warehouses run concurrently, preventing resource contention.
- Local Caching: Fast SSDs cache frequently accessed data for blazing-fast query performance.

**Database Storage Layer: the "heart" of your data.**
- Micro-partitions: Data is automatically broken into immutable, compressed, columnar units (50-500 MB of uncompressed data each).
- Columnar Storage: Optimizes analytical queries by reading only the necessary columns.
- Automatic Optimization: Snowflake transparently handles indexing, partitioning, and data organization.
- Built-in Security: All data is encrypted at rest and in transit by default.
- Zero-Copy Cloning: Instantly create writable copies of data without physical duplication.
- Time Travel: Access historical data for recovery, analysis, and auditing.

## Key Concepts & Objects (Essential Building Blocks)

- Account: Your dedicated Snowflake instance.
- Database & Schema: Logical containers for organizing your data objects.
- Tables: Where your structured data lives.
  - Permanent: Full Time Travel and Fail-safe.
  - Transient: Limited Time Travel (0-1 day), no Fail-safe; cost-effective for intermediate data.
  - Temporary: Session-bound, dropped at session end.
- Views: Virtual tables based on SQL queries.
  - Standard: Executed on demand, no storage cost.
  - Materialized: Pre-computed and stored for performance, automatically maintained.
- Stages: Locations where data files live prior to loading.
  - External: Your cloud storage (S3, Azure Blob, GCS).
  - Internal: Snowflake-managed storage (User, Table, Named).
- File Format: Defines how Snowflake interprets staged data files (CSV, JSON, Parquet, etc.).
- Snowpipe: Serverless, continuous data ingestion from files in stages; ideal for micro-batches.
- Streams: Provide change data capture (CDC) on tables, enabling incremental processing.
- Tasks: Schedule SQL statements or stored procedures, forming powerful DAGs for orchestration.
- UDFs (User-Defined Functions): Extend SQL with custom logic (SQL, JavaScript, External).
- Stored Procedures: Execute complex procedural logic and multi-statement transactions (JavaScript, Snowflake Scripting).

## Data Loading & Ingestion (Bringing Data In)

### 1. COPY INTO

- Bulk Loading: Efficiently load large volumes of data from staged files.
- Flexible Error Handling: `ON_ERROR = 'CONTINUE'` to skip bad records, `VALIDATION_MODE` to preview errors (see the sketch below).
- In-flight Transformations: Apply simple SQL expressions directly during load.

```sql
COPY INTO target_table (col_a, col_b)
FROM (SELECT $1:id, $1:name FROM @my_json_stage)
FILE_FORMAT = (TYPE = JSON);
```
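A hedged sketch of the error-handling options above. `VALIDATION_MODE` and `ON_ERROR` are standard `COPY INTO` options; the stage `@my_csv_stage` and its CSV layout are assumptions for illustration.

```sql
-- Dry run: report parse errors without loading any rows
-- (@my_csv_stage and SKIP_HEADER = 1 are assumptions).
COPY INTO target_table
FROM @my_csv_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
VALIDATION_MODE = 'RETURN_ERRORS';

-- Real load: keep going past bad records instead of aborting.
COPY INTO target_table
FROM @my_csv_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';
```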
### 2. Snowpipe (Automated, Continuous)

- Event-Driven: Triggered by new files arriving in cloud storage via native cloud notifications.
- Serverless & Scalable: Snowflake manages all compute resources and scales them automatically.
- Cost-Effective: Pay-per-file processing, ideal for near real-time ingestion.

```sql
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_data
  FROM @external_s3_stage/data/
  FILE_FORMAT = (TYPE = CSV COMPRESSION = 'AUTO');
```

### 3. Data Sharing

- Secure & Zero-Copy: Share live data (tables, views) with other Snowflake accounts or external applications without moving or duplicating it.
- Provider/Consumer Model: Data providers grant access to consumers.

## Data Transformation (Shaping Your Data)

### 1. SQL (The Foundation)

Leverage Snowflake's powerful SQL extensions:
- Window Functions: For advanced analytics (`ROW_NUMBER()`, `LAG()`, `LEAD()`).
- CTEs: Enhance readability and modularity of complex queries.
- QUALIFY Clause: Simplifies filtering on the results of window functions.

### 2. Semi-Structured Data Handling

- Native Support: `VARIANT`, `ARRAY`, and `OBJECT` data types for JSON, XML, Avro.
- Intuitive Access: Use colon (`:`), dot (`.`), and bracket (`[]`) notation to navigate the data.
- FLATTEN: Unnest arrays or objects into relational rows for easier querying.

```sql
SELECT
  value:order_id::INT AS order_id,
  value:item_name::VARCHAR AS item_name
FROM my_json_table,
  LATERAL FLATTEN(INPUT => json_data:items);
```

### 3. Streams and Tasks (Orchestrating ELT)

- Streams (CDC): Track DML changes on source tables.
  - `METADATA$ACTION`: `INSERT` or `DELETE` (an update surfaces as a delete/insert pair).
  - `METADATA$ISUPDATE`: Flags rows that are part of an `UPDATE`.
- Tasks (Scheduling): Automate the execution of SQL or stored procedures.
  - Build complex data pipelines (DAGs) using `AFTER` dependencies.
  - Can run serverless (Snowflake-managed compute) or on a specific warehouse.

```sql
CREATE TASK refresh_agg_table
  WAREHOUSE = ETL_WH
  SCHEDULE = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
  INSERT INTO aggregated_sales
  SELECT product_id, SUM(amount)
  FROM my_stream
  WHERE METADATA$ACTION = 'INSERT'
  GROUP BY product_id;
```

## Performance & Optimization (Maximizing Efficiency)

- Warehouse Sizing: Match warehouse size (XS, S, M, L, etc.) to query complexity and concurrency needs. Scale up for speed, scale out (multi-cluster) for concurrency.
- Clustering Keys: Define them on large tables to improve partition pruning and reduce scan times (Snowflake micro-partitions data automatically).

```sql
ALTER TABLE large_fact_table CLUSTER BY (date_key, customer_id);
```

- Materialized Views: Pre-compute and store results of common, complex queries; Snowflake handles the automatic refresh.
- Search Optimization Service: Accelerates point lookups and substring searches on large tables.
- Query Profile: Analyze query execution plans in the UI to pinpoint and resolve performance bottlenecks.
- Caching:
  - Result Cache: Stores query results for 24 hours; identical queries return instantly.
  - Data Cache: Warehouse-local disk cache for frequently accessed data.
- Efficient SQL: Use `UNION ALL` instead of `UNION` when duplicate rows are acceptable. Filter early and often.

## Security & Access Control (Protecting Your Data)

- Role-Based Access Control (RBAC): Grant privileges to roles, then assign roles to users.
  - Role Hierarchy: Roles can inherit privileges from other roles (see the sketch after the grants below).
  - Best practice: Create custom roles (e.g., `DATA_ANALYST_ROLE`, `ETL_DEVELOPER_ROLE`).
- Privileges: Specific rights on database objects (e.g., `SELECT` on a table, `USAGE` on a warehouse).

```sql
GRANT SELECT ON TABLE sales_data TO ROLE DATA_ANALYST_ROLE;
GRANT USAGE ON WAREHOUSE ANALYTICS_WH TO ROLE DATA_ANALYST_ROLE;
```

- Future Grants: Automatically apply privileges to new objects created in a schema or database.

```sql
GRANT SELECT ON FUTURE TABLES IN SCHEMA raw_data TO ROLE ETL_DEVELOPER_ROLE;
```
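As referenced above, a minimal sketch of a role hierarchy. It assumes `DATA_ANALYST_ROLE` and `ETL_DEVELOPER_ROLE` already carry the privileges granted above; the user name is a hypothetical placeholder.

```sql
-- Analysts' privileges roll up into the ETL developer role.
GRANT ROLE DATA_ANALYST_ROLE TO ROLE ETL_DEVELOPER_ROLE;

-- Roll custom roles up to SYSADMIN so administrators keep full visibility.
GRANT ROLE ETL_DEVELOPER_ROLE TO ROLE SYSADMIN;

-- Finally, assign a role to a user (jane_doe is an assumption).
GRANT ROLE DATA_ANALYST_ROLE TO USER jane_doe;
```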
- Column-level Security (Masking Policies): Dynamically mask sensitive data based on user role or context.

```sql
CREATE MASKING POLICY email_mask AS (val VARCHAR)
  RETURNS VARCHAR ->
  CASE
    WHEN CURRENT_ROLE() IN ('SECURITY_ADMIN') THEN val
    ELSE '*****'
  END;

ALTER TABLE users MODIFY COLUMN email SET MASKING POLICY email_mask;
```

- Row-level Security (Row Access Policies): Filter the rows a query returns based on user identity or attributes.

```sql
CREATE ROW ACCESS POLICY region_access AS (region VARCHAR)
  RETURNS BOOLEAN ->
  CASE
    WHEN CURRENT_ROLE() = 'ADMIN' THEN TRUE
    WHEN CURRENT_ROLE() = 'EAST_SALES' THEN region = 'East'
    ELSE FALSE
  END;

ALTER TABLE orders ADD ROW ACCESS POLICY region_access ON (region);
```

- Network Policies: Restrict network access to your Snowflake account via allowed/blocked IP lists.

## Key Functions & Commands (Your Daily Toolkit)

- Time Travel: Query historical versions of your data (a runnable sketch closes this sheet).
  - `AT (TIMESTAMP => ...)` or `AT (OFFSET => ...)` for specific points in time.
  - `BEFORE (STATEMENT => ...)` to query data as it was before a specific transaction.
- Zero-Copy Cloning: Create instant, writable copies of databases, schemas, or tables; only metadata is copied initially.

```sql
CREATE DATABASE dev_db CLONE prod_db;
```

- Undrop: Recover accidentally dropped objects within their Time Travel retention period.

```sql
UNDROP TABLE accidentally_deleted_table;
```

- Information Schema: Query metadata about your database objects and account activity.
  - `SNOWFLAKE.ACCOUNT_USAGE` for account-level monitoring (billing, history, security).
  - `INFORMATION_SCHEMA` for database/schema-level metadata.
- Context Functions: `CURRENT_USER()`, `CURRENT_ROLE()`, `CURRENT_WAREHOUSE()`, etc. for dynamic logic.
- `SHOW` & `DESCRIBE` Commands: Essential for discovering and inspecting objects (`SHOW TABLES; DESCRIBE TABLE my_table;`).

## Cost Management (Optimizing Spend)

- Credit Consumption: Primarily driven by virtual warehouse usage (compute) and Snowpipe.
- Auto-Suspend/Resume: Configure warehouses to suspend automatically after inactivity and resume on the next query. This is critical for cost savings.
- Resource Monitors: Set credit limits at the account or warehouse level to prevent unexpected cost overruns. Define triggers for `NOTIFY`, `SUSPEND`, and `SUSPEND_IMMEDIATE`.

```sql
CREATE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 500
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_limit;
```

- Transient Tables: Use for intermediate data that doesn't need full retention or Fail-safe, reducing storage costs.
- Query Optimization: Efficient queries consume fewer warehouse credits. Monitor `QUERY_HISTORY` for long-running or expensive queries (see the sketch below).
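A minimal monitoring sketch for the point above. It assumes access to the `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` view; the 7-day window and 10-row limit are illustrative choices.

```sql
-- Ten most expensive queries of the past week, by elapsed time
-- (TOTAL_ELAPSED_TIME is reported in milliseconds).
SELECT query_id,
       user_name,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds,
       query_text
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 10;
```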
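Finally, the Time Travel clauses called out earlier, in runnable form. The table name, offset, timestamp, and query ID are all hypothetical placeholders.

```sql
-- The table as it looked one hour ago (OFFSET is in seconds).
SELECT * FROM sales_data AT (OFFSET => -3600);

-- The table as of an exact timestamp.
SELECT * FROM sales_data AT (TIMESTAMP => '2024-01-15 08:00:00'::TIMESTAMP_LTZ);

-- The table immediately before a specific statement ran.
SELECT * FROM sales_data BEFORE (STATEMENT => '01a2b3c4-0000-1111-2222-333344445555');
```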