Wednesday, August 6, 2025

SNOWFLAKE Interview Questions 2025

1️⃣ Choosing the Perfect Size of Virtual Warehouse

To select the right size (X-Small to 6X-Large), consider:

  • Data Volume & Complexity: Larger datasets or complex joins/aggregations may require Medium or Large warehouses.
  • Concurrency Needs: More users or parallel queries? Scale up or use multi-cluster warehouses.
  • Performance SLAs: If low latency is critical, opt for larger sizes or auto-scaling.
  • Cost vs. Speed Trade-off: Start small, monitor query performance via Query Profile, and scale only if needed.

🧠 Tip: Use Query History and Warehouse Load Charts to analyze CPU usage and queue times before resizing.
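
For example, a multi-cluster warehouse with auto-suspend/resume; the name and settings below are illustrative, not a recommendation (multi-cluster requires Enterprise edition or higher):

sql

CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3       -- scale out for concurrency spikes
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60           -- suspend after 60 s idle
  AUTO_RESUME = TRUE;

-- Scale up only after Query Profile shows spilling or long scan times
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';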

2️⃣ Optimizing Storage & Compute Cost in Snowflake

Storage Optimization:

  • Use data compression (Snowflake does this automatically).
  • Archive rarely used data to lower-cost external storage (e.g., unload to S3).
  • Drop unused tables/stages, and use transient or temporary tables where Fail-safe protection isn't needed (Fail-safe storage can't be purged manually).

Compute Optimization:

  • Use auto-suspend and auto-resume for warehouses.
  • Schedule jobs during off-peak hours.
  • Use result caching and materialized views for repetitive queries.
  • Avoid over-provisioning; monitor warehouse usage and scale down if underutilized.
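
A minimal sketch of the compute- and storage-side settings (warehouse and table names assumed):

sql

-- Suspend idle compute quickly; resume automatically on the next query
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Cut Time Travel storage on re-loadable staging data
ALTER TABLE staging_events SET DATA_RETENTION_TIME_IN_DAYS = 0;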

3️⃣ Checklist for On-Prem to Snowflake Migration

  • Source System Analysis: Understand schema, data types, volume, and dependencies.
  • Data Quality Checks: Nulls, duplicates, referential integrity.
  • Transformation Logic: Map ETL logic to Snowflake-compatible SQL or ELT.
  • Security & Compliance: Identify PII, encryption needs, access controls.
  • Performance Benchmarking: Compare query performance pre/post migration.
  • Tooling: Choose between ADF, Informatica, or custom scripts for ingestion.
  • Validation Strategy: Row counts, checksums, sample data comparison (see the sketch below).
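
For the validation step, compare row counts and aggregate checksums on both sides; on the Snowflake side (table name assumed):

sql

-- Compare against the equivalent count/checksum computed on the source system
SELECT COUNT(*) AS row_count,
       HASH_AGG(*) AS table_checksum
FROM migrated_table;
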
4️⃣ Clustering vs. Search Optimization Service

| Feature     | Clustering                                  | Search Optimization                        |
|-------------|---------------------------------------------|--------------------------------------------|
| Purpose     | Improve query performance on large tables   | Accelerate point lookup queries            |
| Use Case    | Range scans, filtering on clustered columns | Fast retrieval on high-cardinality columns |
| Maintenance | Automatic reclustering (background service) | Fully managed by Snowflake                 |
| Cost        | Compute credits consumed by reclustering    | Additional cost for optimization service   |

🧠 Use Clustering for large fact tables with predictable filters. Use Search Optimization for fast lookups on semi-structured or sparse data.
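
Illustrative DDL for each approach (table and column names assumed):

sql

-- Clustering: keep a large fact table pruned on its common filter columns
ALTER TABLE fact_sales CLUSTER BY (sale_date, region);

-- Search Optimization: accelerate point lookups on a high-cardinality column
ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(user_id);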

5️⃣ Query Performance Optimization Techniques

  • 📌 Use Query Profile to identify bottlenecks.
  • 📌 Avoid SELECT *; project only needed columns.
  • 📌 Use materialized views for expensive aggregations.
  • 📌 Partition large tables using clustering keys.
  • 📌 Leverage result caching and CTAS for intermediate steps.
  • 📌 Rewrite subqueries as joins or CTEs for better optimization.
  • 📌 Ensure proper data types and avoid implicit conversions.
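
For example, replacing SELECT * plus a per-row correlated subquery with an explicit projection and a CTE join:

sql

WITH order_totals AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
)
SELECT c.customer_id,
       c.customer_name,
       t.total_amount
FROM customers AS c
JOIN order_totals AS t
  ON t.customer_id = c.customer_id;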

6️⃣ Query to Check Table Usage in Views

Snowflake's INFORMATION_SCHEMA does not include a VIEW_TABLE_USAGE view, so search the view definitions instead:

sql

SELECT table_name AS view_name
FROM information_schema.views
WHERE view_definition ILIKE '%YOUR_TABLE_NAME%';

This checks whether the table is referenced in any view definition within the current database.
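
For exact, account-wide lineage, the SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES view can be queried instead (note it lags real time):

sql

SELECT referencing_database, referencing_schema, referencing_object_name AS view_name
FROM snowflake.account_usage.object_dependencies
WHERE referenced_object_name = 'YOUR_TABLE_NAME'
  AND referencing_object_domain = 'VIEW';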

7️⃣ Implementing SCD Type 2 in Snowflake

Use a combination of MERGE and metadata columns. Note that a single MERGE cannot both expire the old version of a changed record and insert its new version, so run two statements:

sql

-- Step 1: close out the current version of changed records
MERGE INTO target_table AS tgt
USING staging_table AS src
ON tgt.business_key = src.business_key AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.hash_diff != src.hash_diff THEN
  UPDATE SET is_current = FALSE, end_date = CURRENT_DATE;

-- Step 2: insert brand-new keys and the new versions of changed keys
INSERT INTO target_table (business_key, attribute1, attribute2, hash_diff, start_date, end_date, is_current)
SELECT src.business_key, src.attribute1, src.attribute2, src.hash_diff, CURRENT_DATE, NULL, TRUE
FROM staging_table AS src
LEFT JOIN target_table AS tgt
  ON tgt.business_key = src.business_key AND tgt.is_current = TRUE
WHERE tgt.business_key IS NULL;

  • is_current: Flag for active record
  • start_date, end_date: Track validity
  • hash_diff: Detect changes efficiently
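
One common way to populate hash_diff during the staging load (column list is illustrative):

sql

SELECT business_key,
       attribute1,
       attribute2,
       MD5(CONCAT_WS('|', COALESCE(attribute1, ''), COALESCE(attribute2, ''))) AS hash_diff
FROM raw_source;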

8️⃣ COPY from Date-wise Folder in S3

Assuming a folder structure like s3://bucket/yyyy-mm-dd/file.csv. A session variable cannot be spliced directly into a stage path, so build and run the COPY statement dynamically with Snowflake Scripting (avoid naming the variable current_date, which collides with the built-in function):

sql

DECLARE
  load_date STRING DEFAULT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD');
  copy_stmt STRING;
BEGIN
  -- Build the stage path for today's folder and execute the COPY
  copy_stmt := 'COPY INTO my_table FROM @my_s3_stage/' || load_date ||
               '/file.csv FILE_FORMAT = (TYPE = ''CSV'')';
  EXECUTE IMMEDIATE copy_stmt;
END;

You can also use external tables or ADF with dynamic path resolution.

9️⃣ Create Table from Parquet File in S3 (Without Stage Browsing)

sql

CREATE FILE FORMAT IF NOT EXISTS my_parquet_format TYPE = PARQUET;

CREATE OR REPLACE TABLE parquet_table
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION => '@my_s3_stage/path/to/file.parquet',
      FILE_FORMAT => 'my_parquet_format'
    )
  )
);

This uses INFER_SCHEMA to auto-generate the table structure. Note that FILE_FORMAT must reference a named file format object, and the template query aggregates the inferred columns with ARRAY_AGG(OBJECT_CONSTRUCT(*)).

🔟 Identifying PII/Sensitive Data

  • 📌 Use Snowflake's object tags and automated data classification (e.g., SYSTEM$CLASSIFY), if enabled.
  • 📌 Scan column names for patterns like email, ssn, dob, credit_card.
  • 📌 Use data profiling tools (e.g., Great Expectations, Collibra).
  • 📌 Implement column-level masking policies.
  • 📌 Maintain a PII inventory and enforce RBAC.
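
A masking policy sketch (role and object names are illustrative):

sql

CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '*** MASKED ***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;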


Basic Questions

  1. What is Snowflake, and how does it differ from traditional databases? Snowflake is a cloud-native data platform that separates compute, storage, and services. Unlike traditional databases, it offers scalability, elasticity, and native support for semi-structured data without infrastructure management.
  2. What are the key features of Snowflake?
    • Separation of compute/storage/services
    • Time Travel & Fail-safe
    • Native support for JSON, Avro, Parquet
    • Secure data sharing
    • Multi-cloud support
    • Auto-scaling virtual warehouses
  3. How does Snowflake handle data storage? Data is stored in compressed, columnar micro-partitions. Storage is centralized and decoupled from compute.
  4. What is Snowflake’s architecture? Three layers:
    • Storage: Centralized, compressed micro-partitions
    • Compute: Virtual warehouses (MPP clusters)
    • Services: Metadata, optimization, security
  5. What are Snowflake’s advantages over other cloud data warehouses?
    • True separation of compute and storage
    • Native semi-structured data support
    • Zero management overhead
    • Cross-cloud replication and sharing
  6. What is the role of virtual warehouses in Snowflake? They provide isolated compute resources for queries and ETL. Can be scaled and run concurrently.
  7. How does Snowflake support multi-cloud environments? Snowflake runs on AWS, Azure, and GCP, allowing cross-cloud replication and failover.
  8. What is Snowflake’s pricing model? Pay-per-use model based on compute credits and storage. Editions include Standard, Enterprise, Business Critical.
  9. What types of workloads can Snowflake handle?
    • Data warehousing
    • Data lake
    • Real-time analytics
    • Machine learning
    • Data sharing
  10. What is Snowflake’s approach to security?
    • End-to-end encryption
    • Role-based access control
    • Network policies
    • Data masking and governance
  11. How does Snowflake handle structured and semi-structured data? Supports the VARIANT data type for JSON, Avro, and Parquet; can query and transform using SQL (see the sketch after this list).
  12. What is Snowflake’s marketplace? A platform to discover and share datasets, applications, and services across Snowflake accounts.
  13. What is Snowflake’s data sharing feature? Enables secure, real-time sharing of data across accounts and clouds without data movement.
  14. How does Snowflake ensure high availability? Built-in redundancy across availability zones. Automatic failover and replication.
  15. What is Snowflake’s caching mechanism? Results caching, metadata caching, and warehouse-level caching improve performance.
  16. How does Snowflake handle concurrency? Multi-cluster warehouses scale automatically to handle concurrent users.
  17. What is Snowflake’s role in data analytics? Acts as a central platform for BI, ML, and predictive analytics with high performance and scalability.
  18. What are Snowflake’s supported programming languages? SQL, Python (via Snowpark), Java, Scala, JavaScript (for stored procedures)
  19. How does Snowflake integrate with third-party tools? Native connectors for Power BI, Tableau, Informatica, dbt, Airflow, and more.
  20. What is Snowflake’s approach to disaster recovery? Cross-region replication, failover groups, and Time Travel ensure resilience.
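
For Q11, a minimal sketch of storing and querying semi-structured JSON with VARIANT (table and JSON shape assumed):

sql

CREATE OR REPLACE TABLE raw_events (payload VARIANT);

-- Path notation with casts; LATERAL FLATTEN explodes arrays
SELECT payload:user.id::NUMBER     AS user_id,
       payload:event_type::STRING AS event_type,
       f.value::STRING            AS tag
FROM raw_events,
     LATERAL FLATTEN(input => payload:tags) AS f;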


Technical Questions

  1. What is Snowflake Time Travel, and how does it work? Allows querying historical data (1 day by default, up to 90 days on Enterprise edition and above). Useful for recovery, auditing, and debugging (see the sketch after this list).
  2. How does Snowflake handle schema changes? Supports dynamic schema evolution and zero-downtime DDL operations.
  3. What is Snowflake’s clustering mechanism? Clustering keys define logical sort order to improve pruning and query performance.
  4. How does Snowflake optimize query performance?
    • Pruning micro-partitions
    • Result caching
    • Materialized views
    • Query profiling and tuning
  5. What is Snowflake’s micro-partitioning feature? Data is automatically divided into micro-partitions (50–500 MB of uncompressed data, roughly 16 MB compressed) for efficient scanning and pruning.
  6. How does Snowflake handle data compression? Columnar storage with automatic compression reduces storage and improves performance.
  7. What is Snowflake’s role in ETL processes? Supports ELT workflows with SQL, Snowpark, and integration with ETL tools.
  8. How does Snowflake handle JSON and XML data? VARIANT data type allows storage and querying of semi-structured formats using SQL functions.
  9. What is Snowflake’s COPY command? Loads data from external stages (S3, Azure Blob, GCS) into Snowflake tables.
  10. How does Snowflake handle data ingestion?
    • Batch: COPY command
    • Streaming: Snowpipe
    • CDC: Streams and Tasks
  11. What is Snowflake’s fail-safe mechanism? Provides 7-day recovery window after Time Travel expires for disaster recovery.
  12. How does Snowflake handle role-based access control? Roles are assigned privileges; users inherit access via roles. Supports role hierarchy.
  13. What is Snowflake’s approach to indexing? No traditional indexes. Uses metadata and clustering for performance.
  14. How does Snowflake handle workload isolation? Separate virtual warehouses ensure isolated compute for different workloads.
  15. What is Snowflake’s approach to metadata management? Metadata is stored in the services layer and used for optimization, governance, and lineage.
  16. How does Snowflake handle stored procedures? Supports stored procedures in SQL (Snowflake Scripting) and JavaScript, plus Python, Java, and Scala via Snowpark, with control-flow logic.
  17. What is Snowflake’s approach to query optimization? Uses cost-based optimizer, pruning, caching, and statistics for efficient execution.
  18. How does Snowflake handle materialized views? Stores precomputed results for faster querying. Automatically refreshed.
  19. What is Snowflake’s approach to data governance? Tags, masking policies, access history, and RBAC support compliance and control.
  20. How does Snowflake handle replication across regions? Replication groups and failover groups enable cross-region data availability.
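
For Q1, Time Travel in practice (table name and offset illustrative):

sql

-- Query the table as it existed one hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Recover a dropped table while it is still within the retention window
UNDROP TABLE orders;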

 


Thursday, March 20, 2025

SQLSERVER and ORACLE IMPORTANT Queries

-- SQL Server: column details and full object metadata
EXEC sp_columns 'dbo.tablename';
EXEC sp_help 'dbo.tablename';

-- SQL Server: column names from the catalog views
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'tablename' AND table_schema = 'ETL';

-- SQL Server: current database name (two equivalent ways)
SELECT DB_NAME() AS CurrentDatabaseName;

SELECT name AS CurrentDatabaseName
FROM sys.databases
WHERE database_id = DB_ID();

-- Oracle: generate DDL for a table (object names are stored uppercase by default)
SELECT DBMS_METADATA.GET_DDL('TABLE', 'TABLENAME', 'HR') FROM dual;

