Snowflake Performance Tuning
Snowflake offers a small number of options for performance tuning. This article describes the techniques I have tried to improve query performance.
Before jumping into performance tuning, you need to understand the types of cache, micro-partitions, and clustering.
- Result cache
- Local disk cache
- Remote disk
Result Cache: Holds the results of every query executed in the past 24 hours. These results are available across virtual warehouses, so query results returned to one user are available to any other user on the system who executes the same query, provided the underlying data has not changed.
Local Disk Cache: Used to cache data read by SQL queries. Whenever data is needed for a given query, it is retrieved from Remote Disk storage and cached in SSD and memory on the warehouse nodes.
Remote Disk: Holds the long-term storage. This layer is responsible for data resilience, which in the case of Amazon Web Services means 99.999999999% durability, even in the event of an entire data center failure.
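One way to see how much the result cache contributes is to disable it for your session and compare timings. USE_CACHED_RESULT is a real Snowflake session parameter; the table and filter below are invented for illustration:

```sql
-- Run a query twice: the second run is normally served from the result cache.
SELECT COUNT(*) FROM sales WHERE invoice_date >= '2023-01-01';  -- hypothetical table

-- Disable the result cache for this session to measure "cold" performance.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

SELECT COUNT(*) FROM sales WHERE invoice_date >= '2023-01-01';  -- recomputed this time

-- Re-enable it when done.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```

Comparing the two timings in the query history gives a rough sense of how much your workload leans on the result cache versus the warehouse caches.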
Clustering & Partition
Snowflake tables are automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data.
Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allow for extremely granular pruning of very large tables, which can be comprised of millions, or even hundreds of millions, of micro-partitions. Snowflake stores metadata about all rows stored in a micro-partition, including:
- The range of values for each of the columns in the micro-partition.
- The number of distinct values.
- Additional properties used for both optimization and efficient query processing.
A clustering key is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table in the same micro-partitions. This is useful for very large tables where the ordering was not ideal (at the time the data was inserted/loaded) or extensive DML has caused the table’s natural clustering to degrade.
ALTER TABLE tbl CLUSTER BY (column1, column2);
Queries benefit from clustering when they filter or sort on the table's clustering key. Sorting is commonly done for ORDER BY and GROUP BY operations, and for some joins.
The more frequently a table is queried, the more benefit clustering provides.
Snowflake recommends prioritizing keys in the order below:
- Cluster columns that are most actively used in selective filters. For many fact tables involved in date-based queries (for example "WHERE invoice_date > x AND invoice_date <= y"), choosing the date column is a good idea. For event tables, event type might be a good choice, if there are a large number of different event types. (If your table has only a small number of different event types, then see the comments on cardinality below before choosing an event column as a clustering key.)
- If there is room for additional cluster keys, then consider columns frequently used in join predicates, for example “FROM table1 JOIN table2 ON table2.column_A = table1.column_B”.
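Putting those two guidelines together, a minimal sketch for a hypothetical fact table is to lead with the selective date filter column and follow with a join column (table and column names are invented):

```sql
-- Cluster on the date filter column first, then on a column
-- frequently used in join predicates.
ALTER TABLE invoices CLUSTER BY (invoice_date, customer_id);

-- Verify the clustering key now defined on the table;
-- it appears in the cluster_by column of the output.
SHOW TABLES LIKE 'invoices';
```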
The number of distinct values (i.e. cardinality) in a column/expression is a critical aspect of selecting it as a clustering key. It is important to choose a clustering key that has:
- A large enough number of distinct values to enable effective pruning on the table.
- A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
A column with very low cardinality (e.g. a column that indicates only whether a person is male or female) might yield only minimal pruning. At the other extreme, a column with very high cardinality (e.g. a column containing UUID or nanosecond timestamp values) is also typically not a good candidate to use as a clustering key directly.
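A common workaround for a high-cardinality column is to cluster on an expression that reduces it to a coarser grain. A sketch, with invented column names on the article's EVENTS_BIG table:

```sql
-- Clustering directly on a nanosecond timestamp would put almost every
-- value in its own group; truncate it to a date instead.
ALTER TABLE events_big CLUSTER BY (TO_DATE(event_timestamp));

-- Similarly, a UUID column can be reduced to a short prefix.
ALTER TABLE events_big CLUSTER BY (SUBSTRING(event_id, 1, 4));
```

Note that a table has only one clustering key at a time; the second statement here would replace the first.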
SELECT SYSTEM$CLUSTERING_DEPTH('EVENTS_BIG');
SELECT SYSTEM$CLUSTERING_RATIO('EVENTS_BIG');
SELECT SYSTEM$CLUSTERING_INFORMATION('EVENTS_BIG');
SYSTEM$CLUSTERING_DEPTH computes the average depth of the table according to the clustering keys defined for the table or the clustering keys specified in the function arguments. The average depth is a number greater than or equal to 1. The lower the average depth, the better clustered the table is with respect to the specified clustering keys.
SYSTEM$CLUSTERING_RATIO computes the clustering ratio. The clustering ratio is a number from 0 to 100. The higher the ratio, the better clustered the table is.
SYSTEM$CLUSTERING_INFORMATION function is used to determine data distribution on the clustering key columns for your table. It returns a JSON object that describes the current state of clustering for the table, including average depth, as well as a partition_depth_histogram that depicts the distribution of overlap depth for each micro-partition in the table. The value "00001" : 1337324 in the partition_depth_histogram element means the number of partitions with a depth of 1 is 1,337,324.
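For illustration, the function can also be pointed at a candidate key before you commit to it, and its output looks roughly like the sketch below (all values except the histogram entry from the text are invented):

```sql
-- Check clustering quality on a specific candidate key
-- (the second argument is optional).
SELECT SYSTEM$CLUSTERING_INFORMATION('EVENTS_BIG', '(event_date)');

/* Illustrative output shape:
{
  "cluster_by_keys" : "LINEAR(event_date)",
  "total_partition_count" : 2000000,
  "average_overlaps" : 3.5,
  "average_depth" : 2.8,
  "partition_depth_histogram" : {
    "00001" : 1337324,
    "00002" : 500000,
    ...
  }
}
*/
```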
Conclusions
Here are some tips for determining clustering keys and maintaining clustering for a table:
- Determine the most frequently executed and most important queries, and then analyze predicates in those queries to determine candidate clustering keys.
- Choose the order of clustering keys carefully. As shown above, clustering key order matters.
- Look at the data characteristics of candidate columns, such as the number of distinct values. The columns with fewer distinct values should come earlier in the clustering key order.
- ETL flow and DML operations can have a big impact on the cost of maintaining your table’s clustering order. Choose clustering key columns based on a balance between query workload and DML patterns.
- Consider date/time columns as good candidates for the leading clustering key column, if your data is loaded by order of time.
- Pick a clustering depth for a table that achieves good query performance, and recluster the table if it goes above the target depth.
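The tips above can be acted on directly: check the depth against your target, and pause or resume Snowflake's Automatic Clustering around heavy batch loads. The table name matches the article's example; the workflow itself is a sketch:

```sql
-- Compare the current average depth against your target depth.
SELECT SYSTEM$CLUSTERING_DEPTH('EVENTS_BIG');

-- Automatic Clustering maintains the order in the background; it can be
-- suspended during large batch loads and resumed afterwards.
ALTER TABLE events_big SUSPEND RECLUSTER;
-- ... run bulk DML here ...
ALTER TABLE events_big RESUME RECLUSTER;
```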
Clustering is very important for getting the best performance out of Snowflake, especially for queries on large fact tables.