1️⃣ Choosing the Perfect Size of Virtual Warehouse
To select the right size (X-Small to 6X-Large), consider:
- Data Volume & Complexity: Larger datasets or complex joins/aggregations may require Medium or Large warehouses.
- Concurrency Needs: More users or parallel queries? Scale up or use multi-cluster warehouses.
- Performance SLAs: If low latency is critical, opt for larger sizes or auto-scaling.
- Cost vs. Speed Trade-off: Start small, monitor query performance via Query Profile, and scale only if needed.
🧠 Tip: Use Query History and Warehouse Load Charts to analyze CPU usage and queue times before resizing.
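As a starting point, a warehouse can be created small with auto-suspend and multi-cluster scaling, then resized only if Query Profile shows real contention. A minimal sketch (warehouse name and settings are illustrative, not a recommendation):

```sql
-- Start small; resize later with ALTER WAREHOUSE if queries queue or spill.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  AUTO_SUSPEND      = 60      -- suspend after 60 s of inactivity
  AUTO_RESUME       = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3       -- multi-cluster scaling (Enterprise Edition) for concurrency spikes
  SCALING_POLICY    = 'STANDARD';

-- Scale up (or down) without recreating the warehouse.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```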
2️⃣ Optimizing Storage & Compute Cost in Snowflake
Storage Optimization:
- Use data compression (Snowflake does this automatically).
- Archive unused data to lower-cost storage tiers.
- Drop unused tables/stages, and use transient tables where Time Travel and Fail-safe retention aren't needed.
Compute Optimization:
- Use auto-suspend and auto-resume for warehouses.
- Schedule jobs during off-peak hours.
- Use result caching and materialized views for repetitive queries.
- Avoid over-provisioning; monitor warehouse usage and scale down if underutilized.
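A hedged sketch of the compute side (warehouse name is illustrative): tighten auto-suspend and check which warehouses actually consume credits before resizing anything.

```sql
-- Shorten idle time before suspension and allow automatic resume.
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60, AUTO_RESUME = TRUE;

-- Credits consumed per warehouse over the last 7 days
-- (ACCOUNT_USAGE views can lag by up to a few hours).
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_7_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;
```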
3️⃣ Checklist for On-Prem to Snowflake Migration
- ✅ Source System Analysis: Understand schema, data types, volume, and dependencies.
- ✅ Data Quality Checks: Nulls, duplicates, referential integrity.
- ✅ Transformation Logic: Map ETL logic to Snowflake-compatible SQL or ELT.
- ✅ Security & Compliance: Identify PII, encryption needs, access controls.
- ✅ Performance Benchmarking: Compare query performance pre/post migration.
- ✅ Tooling: Choose between ADF, Informatica, or custom scripts for ingestion.
- ✅ Validation Strategy: Row counts, checksums, sample data comparison.
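For the validation step, a minimal sketch (database, schema, and table names are assumptions) comparing row counts and an order-independent checksum between the source copy and the migrated table:

```sql
SELECT 'row_count' AS check_name,
       (SELECT COUNT(*)    FROM legacy_db.public.orders) AS source_value,
       (SELECT COUNT(*)    FROM analytics.public.orders) AS target_value
UNION ALL
SELECT 'hash_agg',
       (SELECT HASH_AGG(*) FROM legacy_db.public.orders),
       (SELECT HASH_AGG(*) FROM analytics.public.orders);
```

Matching values on both checks are a quick signal that the migrated table is intact; mismatches call for sample-level comparison.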
4️⃣ Clustering vs. Search Optimization Service

| Feature | Clustering | Search Optimization |
| --- | --- | --- |
| Purpose | Improve query performance on large tables | Accelerate point lookup queries |
| Use Case | Range scans, filtering on clustered columns | Fast retrieval on high-cardinality columns |
| Maintenance | Manual or automatic clustering | Fully managed by Snowflake |
| Cost | Compute credits for reclustering, plus some storage overhead | Additional cost for optimization service |

🧠 Use Clustering for large fact tables with predictable filters. Use Search Optimization for fast lookups on semi-structured or sparse data.
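Both features are enabled with plain DDL; a minimal sketch (table and column names are illustrative):

```sql
-- Clustering: large fact table filtered by date and region
ALTER TABLE sales_fact CLUSTER BY (sale_date, region);

-- Search Optimization: fast equality lookups on a high-cardinality column
ALTER TABLE customer_dim ADD SEARCH OPTIMIZATION ON EQUALITY(email);
```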
5️⃣ Query Performance Optimization Techniques
- 📌 Use Query Profile to identify bottlenecks.
- 📌 Avoid SELECT *; project only needed columns.
- 📌 Use materialized views for expensive aggregations.
- 📌 Partition large tables using clustering keys.
- 📌 Leverage result caching and CTAS for intermediate steps.
- 📌 Rewrite subqueries as joins or CTEs for better optimization.
- 📌 Ensure proper data types and avoid implicit conversions.
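As an example of the materialized-view point above, a small sketch (table and columns are assumptions) that precomputes an expensive aggregation for repeated dashboard queries:

```sql
CREATE OR REPLACE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM sales_fact
GROUP BY sale_date, region;
```

Queries that aggregate on the same keys can read the view instead of rescanning the base table.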
6️⃣ Query to Check Table Usage in Views
```sql
SELECT table_name, view_name
FROM information_schema.view_table_usage
WHERE table_name = 'YOUR_TABLE_NAME';
```
This checks if the table is referenced in any view definitions within the current database.
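For an account-wide check that goes beyond the current database, the ACCOUNT_USAGE share exposes object dependencies; a hedged sketch:

```sql
-- ACCOUNT_USAGE views may lag behind real time by up to a few hours.
SELECT referencing_database, referencing_schema, referencing_object_name
FROM snowflake.account_usage.object_dependencies
WHERE referenced_object_name   = 'YOUR_TABLE_NAME'
  AND referenced_object_domain = 'TABLE';
```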
7️⃣ Implementing SCD Type 2 in Snowflake
Use a combination of MERGE and metadata columns:
```sql
MERGE INTO target_table AS tgt
USING staging_table AS src
  ON tgt.business_key = src.business_key
 AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.hash_diff != src.hash_diff THEN
  UPDATE SET is_current = FALSE,
             end_date   = CURRENT_DATE
WHEN NOT MATCHED THEN
  INSERT (business_key, attribute1, attribute2, start_date, end_date, is_current)
  VALUES (src.business_key, src.attribute1, src.attribute2, CURRENT_DATE, NULL, TRUE);
```
Note that this MERGE only expires changed rows and inserts brand-new keys; a follow-up INSERT from the staging table adds the new current version for the keys that were just expired.
- is_current: Flag for active record
- start_date, end_date: Track validity
- hash_diff: Detect changes efficiently
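One common way to populate hash_diff when loading the staging table is to hash the tracked attributes; a sketch (the attribute list is an assumption):

```sql
CREATE OR REPLACE TABLE staging_table AS
SELECT business_key,
       attribute1,
       attribute2,
       -- COALESCE prevents a NULL attribute from collapsing the whole hash to NULL
       MD5(CONCAT_WS('||', COALESCE(attribute1, ''), COALESCE(attribute2, ''))) AS hash_diff
FROM raw_source_table;
```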
8️⃣ COPY from Date-wise Folder in S3
Assuming a folder structure like s3://bucket/yyyy-mm-dd/file.csv. A stage path cannot reference a session variable directly, so the COPY statement is built and executed dynamically with Snowflake Scripting:
```sql
EXECUTE IMMEDIATE
$$
DECLARE
  dt   STRING DEFAULT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD');
  stmt STRING;
BEGIN
  stmt := 'COPY INTO my_table ' ||
          'FROM @my_s3_stage/' || dt || '/file.csv ' ||
          'FILE_FORMAT = (TYPE = ''CSV'')';
  EXECUTE IMMEDIATE :stmt;
  RETURN stmt;
END;
$$;
```
You can also use external tables or ADF with dynamic path resolution.
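If the external-table route is preferred, the date folder can be exposed as a partition column derived from the file path; a hedged sketch (stage and table names are illustrative):

```sql
CREATE OR REPLACE EXTERNAL TABLE ext_daily_files (
  load_date DATE AS TO_DATE(SPLIT_PART(metadata$filename, '/', 1), 'YYYY-MM-DD')
)
PARTITION BY (load_date)
LOCATION = @my_s3_stage
FILE_FORMAT = (TYPE = CSV)
AUTO_REFRESH = FALSE;

-- Each file's rows are exposed through the VALUE variant column.
SELECT value, load_date
FROM ext_daily_files
WHERE load_date = CURRENT_DATE;
```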
9️⃣ Create Table from Parquet File in S3 (Without Stage Browsing)
```sql
-- INFER_SCHEMA needs a named file format
CREATE FILE FORMAT IF NOT EXISTS my_parquet_format TYPE = PARQUET;

CREATE OR REPLACE TABLE parquet_table
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION    => '@my_s3_stage/path/to/file.parquet',
        FILE_FORMAT => 'my_parquet_format'
      )
    )
  );
```
This uses INFER_SCHEMA to auto-generate the table structure from the Parquet metadata.
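Once the table exists, the same file can be loaded by column name; a hedged sketch reusing the stage path and file format above:

```sql
COPY INTO parquet_table
FROM @my_s3_stage/path/to/file.parquet
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```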
🔟 Identifying PII/Sensitive Data
- 📌 Use Snowflake's TAGS and CLASSIFY function (if enabled).
- 📌 Scan column names for patterns like email, ssn, dob, credit_card.
- 📌 Use data profiling tools (e.g., Great Expectations, Collibra).
- 📌 Implement column-level masking policies.
- 📌 Maintain a PII inventory and enforce RBAC.
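A minimal sketch of the tagging and masking points (role, table, and column names are assumptions):

```sql
-- Tag the column so PII can be inventoried centrally.
CREATE TAG IF NOT EXISTS pii_tag;
ALTER TABLE customers MODIFY COLUMN email SET TAG pii_tag = 'email';

-- Mask the value for everyone except a privileged role.
CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```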
Basic Questions
- What is Snowflake, and how does it differ from traditional databases? Snowflake is a cloud-native data platform that separates compute, storage, and services. Unlike traditional databases, it offers scalability, elasticity, and native support for semi-structured data without infrastructure management.
- What are the key features of Snowflake?
  - Separation of compute/storage/services
  - Time Travel & Fail-safe
  - Native support for JSON, Avro, Parquet
  - Secure data sharing
  - Multi-cloud support
  - Auto-scaling virtual warehouses
- How does Snowflake handle data storage? Data is stored in compressed, columnar micro-partitions. Storage is centralized and decoupled from compute.
- What is Snowflake’s architecture? Three layers:
  - Storage: Centralized, compressed micro-partitions
  - Compute: Virtual warehouses (MPP clusters)
  - Services: Metadata, optimization, security
- What are Snowflake’s advantages over other cloud data warehouses?
  - True separation of compute and storage
  - Native semi-structured data support
  - Zero management overhead
  - Cross-cloud replication and sharing
- What is the role of virtual warehouses in Snowflake? They provide isolated compute resources for queries and ETL, and can be scaled and run concurrently.
- How does Snowflake support multi-cloud environments? Snowflake runs on AWS, Azure, and GCP, allowing cross-cloud replication and failover.
- What is Snowflake’s pricing model? Pay-per-use model based on compute credits and storage. Editions include Standard, Enterprise, and Business Critical.
- What types of workloads can Snowflake handle?
  - Data warehousing
  - Data lake
  - Real-time analytics
  - Machine learning
  - Data sharing
- What is Snowflake’s approach to security?
  - End-to-end encryption
  - Role-based access control
  - Network policies
  - Data masking and governance
- How does Snowflake handle structured and semi-structured data? The VARIANT data type supports JSON, Avro, and Parquet, which can be queried and transformed using SQL (see the sketch after this list).
- What is Snowflake’s marketplace? A platform to discover and share datasets, applications, and services across Snowflake accounts.
- What is Snowflake’s data sharing feature? Enables secure, real-time sharing of data across accounts and clouds without data movement.
- How does Snowflake ensure high availability? Built-in redundancy across availability zones, with automatic failover and replication.
- What is Snowflake’s caching mechanism? Result caching, metadata caching, and warehouse-level (local disk) caching improve performance.
- How does Snowflake handle concurrency? Multi-cluster warehouses scale automatically to handle concurrent users.
- What is Snowflake’s role in data analytics? Acts as a central platform for BI, ML, and predictive analytics with high performance and scalability.
- What are Snowflake’s supported programming languages? SQL, Python (via Snowpark), Java, Scala, and JavaScript (for stored procedures).
- How does Snowflake integrate with third-party tools? Native connectors for Power BI, Tableau, Informatica, dbt, Airflow, and more.
- What is Snowflake’s approach to disaster recovery? Cross-region replication, failover groups, and Time Travel ensure resilience.
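A minimal sketch of querying semi-structured data through VARIANT, as referenced above (table and JSON fields are illustrative):

```sql
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

-- Path notation plus casts pull typed values out of the JSON;
-- FLATTEN expands nested arrays into rows.
SELECT payload:customer.id::NUMBER AS customer_id,
       payload:event_type::STRING  AS event_type,
       f.value:sku::STRING         AS sku
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f;
```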
Technical Questions
- What is Snowflake Time Travel, and how does it work? Allows querying historical data (up to 90 days, depending on edition and retention settings). Useful for recovery, auditing, and debugging (see the example after this list).
- How does Snowflake handle schema changes? Supports dynamic schema evolution and zero-downtime DDL operations.
- What is Snowflake’s clustering mechanism? Clustering keys define logical sort order to improve pruning and query performance.
- How does Snowflake optimize query performance?
  - Pruning micro-partitions
  - Result caching
  - Materialized views
  - Query profiling and tuning
- What is Snowflake’s micro-partitioning feature? Data is automatically divided into micro-partitions (roughly 16 MB compressed) for efficient scanning and pruning.
- How does Snowflake handle data compression? Columnar storage with automatic compression reduces storage and improves performance.
- What is Snowflake’s role in ETL processes? Supports ELT workflows with SQL, Snowpark, and integration with ETL tools.
- How does Snowflake handle JSON and XML data? The VARIANT data type allows storage and querying of semi-structured formats using SQL functions.
- What is Snowflake’s COPY command? Loads data from external stages (S3, Azure Blob, GCS) into Snowflake tables.
- How does Snowflake handle data ingestion?
  - Batch: COPY command
  - Streaming: Snowpipe
  - CDC: Streams and Tasks
- What is Snowflake’s fail-safe mechanism? Provides a 7-day recovery window after Time Travel expires, for disaster recovery.
- How does Snowflake handle role-based access control? Roles are assigned privileges; users inherit access via roles. Supports role hierarchy.
- What is Snowflake’s approach to indexing? No traditional indexes. Uses metadata and clustering for performance.
- How does Snowflake handle workload isolation? Separate virtual warehouses ensure isolated compute for different workloads.
- What is Snowflake’s approach to metadata management? Metadata is stored in the services layer and used for optimization, governance, and lineage.
- How does Snowflake handle stored procedures? Supports JavaScript and SQL (Snowflake Scripting) stored procedures with control flow logic, plus Snowpark languages.
- What is Snowflake’s approach to query optimization? Uses a cost-based optimizer, pruning, caching, and statistics for efficient execution.
- How does Snowflake handle materialized views? Stores precomputed results for faster querying; maintained automatically.
- What is Snowflake’s approach to data governance? Tags, masking policies, access history, and RBAC support compliance and control.
- How does Snowflake handle replication across regions? Replication groups and failover groups enable cross-region data availability.
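A minimal Time Travel sketch, as referenced above (table name and offset are illustrative):

```sql
-- Query the table as it looked 30 minutes ago
SELECT * FROM orders AT(OFFSET => -60*30);

-- Recover a dropped table within the retention window
UNDROP TABLE orders;
```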