Materialize your Amazon Redshift Views to Speed Up Query Execution
At AWS, we take pride in building state of the art virtualization technologies to simplify the management and access to cloud services such as networks, computing resources or object storage.
In a Relational Database Management Systems (RDBMS), a view is virtualization applied to tables : it is a virtual table representing the result of a database query. Views are frequently used when designing a schema, to present a subset of the data, summarized data (such as aggregated or transformed data) or to simplify data access across multiple tables. When using data warehouses, such as Amazon Redshift, a view simplifies access to aggregated data from multiple tables for Business Intelligence (BI) tools such as Amazon QuickSight or Tableau.
Views provide ease of use and flexibility but they are not speeding up data access. The database system must evaluate the underlying query representing the view each time your application accesses the view. When performance is key, data engineers use
create table as (CTAS) as an alternative. A CTAS is a table defined by a query. The query is executed at table creation time and your applications can use it like a normal table, with the downside that the CTAS data set is not refreshed when underlying data are updated. Furthermore, the CTAS definition is not stored in the database system. It is not possible to know if a table was created by a CTAS or not, making it difficult to track which CTAS needs to be refreshed and which is current.
Today, we are introducing materialized views for Amazon Redshift. A materialized view (MV) is a database object containing the data of a query. A materialized view is like a cache for your view. Instead of building and computing the data set at run-time, the materialized view pre-computes, stores and optimizes data access at the time you create it. Data are ready and available to your queries just like regular table data.
Using materialized views in your analytics queries can speed up the query execution time by orders of magnitude because the query defining the materialized view is already executed and the data is already available to the database system.
Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.
When the data in the base tables are changing, you refresh the materialized view by issuing a Redshift SQL statement “
refresh materialized view“. After issuing a refresh statement, your materialized view contains the same data as would have been returned by a regular view. Refreshes can be incremental or full refreshes (recompute). When possible, Redshift incrementally refreshes data that changed in the base tables since the materialized view was last refreshed.
Let’s see how it works. I create a sample schema to store sales information : each sales transaction and details about the store where the sales took place.
To view the total amount of sales per city, I create a materialized view with the
create materialized view SQL statement. I connect to the Redshift console, select the query Editor and type the following statement to create a materialized view (
city_sales) joining records from two tables and aggregating sales amount (
sum(sales.amount)) per city (
group by city):
CREATE MATERIALIZED VIEW city_sales AS ( SELECT st.city, SUM(sa.amount) as total_sales FROM sales sa, store st WHERE sa.store_id = st.id GROUP BY st.city );
The resulting schema is below:
Now I can query the materialized view just like a regular view or table and issue statements like “
SELECT city, total_sales FROM city_sales” to get the below results. The join between the two tables and the aggregate (
group by) are already computed, resulting to significantly less data to scan.
When the data in the underlying base tables change, the materialized view is not automatically reflecting those changes. The data stored in the materialized can be refreshed on demand with latest changes from base tables using the SQL
materialized view command. Let’s see a practical example:
!-- let's add a row in the sales base table INSERT INTO sales (id, item, store_id, customer_id, amount) VALUES(8, 'Gaming PC Super ProXXL', 1, 1, 3000); SELECT city, total_sales FROM city_sales WHERE city = 'Paris' city |total_sales| -----|-----------| Paris| 690| !-- the new sale is not taken into account ! !-- let's refresh the materialized view REFRESH MATERIALIZED VIEW city_sales; SELECT city, total_sales FROM city_sales WHERE city = 'Paris' city |total_sales| -----|-----------| Paris| 3690| !-- now the view has the latest sales data
The full code for this very simple demo is available as a gist.
You can start to use materialized views today in all AWS Regions.
There is nothing to change in your existing clusters to start to use materialized views, you can start to create them today at no additional cost.
Happy building !