Building a Data Warehouse on Amazon Redshift

December 20, 2019

As an organization grows, its data storage, monitoring, and analysis requirements grow exponentially with it. Traditional data warehouses don't always handle that kind of growth gracefully, which, starting in the mid-2000s, created a need for alternative solutions. One such solution is Amazon Redshift from Amazon Web Services.

What is Amazon Redshift?

Amazon Redshift is a cloud-based data warehouse designed for large-scale data sets. It was created to take on many of the issues earlier data warehouses had and drastically improve on them. It does this through specific design decisions: columnar storage, the ability to scale horizontally across multiple nodes, and massively parallel processing. Each of these decisions has allowed AWS to build a data storage system that operates very differently from a traditional RDBMS.

High-Level Architecture

Larger organizations bring large, complex data architectures. How does Redshift handle them?

Column-oriented database
As opposed to traditional row-oriented database management systems, Redshift stores data by column. Because a query reads only the columns it actually references instead of entire rows, column-wise storage reduces the load on the hard disk, resulting in faster speeds for massive amounts of data. In addition, with Redshift there's also the concept of compression.

Because every value in a column has the same data type, columns compress very well, which further improves I/O. It does require the developer to set the correct compression encoding for each column's data type, as in the sketch below.
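As a hedged sketch, here is what per-column encodings might look like in practice. The table and column names are hypothetical; the encodings (AZ64, ZSTD, BYTEDICT) are standard Redshift options, and ANALYZE COMPRESSION asks Redshift to recommend encodings based on a sample of the table's existing rows:

    -- Hypothetical table with a compression encoding chosen per column's data type
    CREATE TABLE page_views (
        view_id   BIGINT        ENCODE az64,     -- numeric data: AZ64
        page_url  VARCHAR(2048) ENCODE zstd,     -- free text: ZSTD
        browser   VARCHAR(64)   ENCODE bytedict, -- low-cardinality text: dictionary
        viewed_at TIMESTAMP     ENCODE az64
    );

    -- Ask Redshift to recommend encodings from a sample of the table's data
    ANALYZE COMPRESSION page_views;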

Redshift clustering
Every Redshift cluster comprises multiple machines, each storing a fraction of the data. These machines work in parallel so the data can be worked on efficiently. A leader node coordinates data distribution and query execution among the compute nodes. With only a fraction of the dataset assigned to each node, Redshift performs efficiently, and it scales horizontally: larger organizations use Redshift to operate on petabytes of data and beyond.
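You can see this layout for yourself: the STV_SLICES system view maps each data slice to its node. A quick sketch, runnable on any cluster you can query:

    -- List how the cluster's slices are spread across compute nodes
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;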

Massive Parallel Processing (MPP)
To process data faster, Redshift employs multiple compute nodes, which allows the work to be broken up across nodes (depending on distribution) and can improve performance. This is what lets the Redshift architecture offer maximum processing in minimum time.
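You can make this parallelism visible after the fact. As a rough sketch, the SVL_QUERY_REPORT system view records the per-slice work done for each query; the query id 12345 below is a placeholder you would look up in STL_QUERY:

    -- Rows processed by each slice for one query: roughly even numbers
    -- mean the work was well parallelized
    SELECT slice, SUM(rows) AS rows_processed
    FROM svl_query_report
    WHERE query = 12345  -- placeholder: take a real query id from stl_query
    GROUP BY slice
    ORDER BY slice;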

These are just a few of the architectural differences that are important to consider when deciding whether or not Redshift is right for you.
In addition, your team will need to think about how to design its tables. With that come the concepts of sort keys and distribution.

Sort Keys

Do you use indexes in your traditional database to improve your query runtime?

Redshift also offers methods for improving query runtime by sorting your data based on keys; here they're called sort keys. Redshift sort keys determine the order in which data rows are stored in a table. When correctly configured, they improve query performance because they reduce the amount of data that has to be scanned: queries can skip large chunks of data during processing, so Redshift takes less processing time.
There are two types of sort key:

Compound Sort Keys: These comprise all the columns listed in the sort key definition when the table is created, in the order they are listed - the data is sorted by the first column, then by the second within it, and so on. This generally means the first column should be the most frequently queried one. COMPOUND is the default sort key type, and it also speeds up joins and operations such as GROUP BY and ORDER BY.

Interleaved Sort Keys: These give equal weight to every column in the key, which improves query performance when restrictive predicates are used on secondary sort columns. If there's no dominant column in your queries, this key may be the right choice. Both types are sketched below.
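Here is a minimal sketch of both sort key styles; the tables and columns are hypothetical:

    -- Compound: queries usually filter on event_date first, then customer_id
    CREATE TABLE events_compound (
        event_date  DATE,
        customer_id INT,
        amount      DECIMAL(10,2)
    )
    COMPOUND SORTKEY (event_date, customer_id);

    -- Interleaved: no single dominant column; each key column gets equal weight
    CREATE TABLE events_interleaved (
        event_date  DATE,
        customer_id INT,
        region      VARCHAR(32)
    )
    INTERLEAVED SORTKEY (event_date, customer_id, region);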

Distribution
Another crucial factor is understanding where data is stored in Redshift. Amazon Redshift distribution keys answer this question by determining where the data is stored. Fundamentally, the cluster stores data across its compute nodes, since query performance can suffer greatly if a large amount of data sits on a single node. By default, rows are spread evenly across the nodes and joined back together at query time.

There are some interesting things that can happen here, which makes it important to think about how you are distributing. In particular, as we discuss below, key distribution can leave data distributed unevenly. There are three main distribution styles in Redshift, each sketched in the example after this list.

Even distribution: The default style, this approach distributes data evenly across all slices in the cluster; AWS's documentation refers to it as a round-robin approach. Distributing the data equally keeps processing simple for the system. However, it means the distribution is effectively random and isn't partitioned by any specific field. This is where the next distribution style comes in.

Key distribution: With this approach, data is distributed based on the values a chosen column holds. For example, say you store the sales made per day and set the key on the store id. The distribution key would place all the data for a specific store id on the same node, so a query for that store id wouldn't have to traverse multiple nodes for the data.

All distribution: In this approach, the leader node places a copy of the table on every compute node. This takes up much more space and is much slower to update and insert into, because every load has to be replicated to every node. However, it can make joins much faster, which makes it a common setting for dimension-style tables.
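As a sketch of all three styles (table and column names hypothetical):

    -- EVEN: round-robin; good when no clear join/filter key exists
    CREATE TABLE clickstream (
        event_id BIGINT,
        payload  VARCHAR(4096)
    )
    DISTSTYLE EVEN;

    -- KEY: co-locate all rows for a given store on the same slice
    CREATE TABLE daily_sales (
        store_id  INT,
        sale_date DATE,
        amount    DECIMAL(10,2)
    )
    DISTSTYLE KEY DISTKEY (store_id);

    -- ALL: full copy on every node; suits small dimension tables
    CREATE TABLE dim_store (
        store_id   INT,
        store_name VARCHAR(128)
    )
    DISTSTYLE ALL;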

Drawbacks of Amazon Redshift

Redshift offers many benefits for developing a data warehouse. However, it's not short of drawbacks.

Not a live app database
Sorry to burst the bubble, but while Redshift is excellent at running queries, handling analytics, and creating reports, it's not a solution for running live web applications. Redshift isn't designed to deal with hundreds of thousands of transactions inserting, updating, and selecting all at once. For that, you need a caching layer or a Postgres instance.

Developer skill sets
To work with Redshift, you must understand both sort keys and distribution keys, because they determine how your data is stored and accessed. So when you choose Redshift to cater to your data warehouse needs, you must also understand the concepts behind these keys to get the optimized performance your organization is aiming for. Since Redshift is still relatively new, not all BI and data professionals have experience building a data warehouse on it.

Support for parallel uploads only
When it comes to Redshift, parallel loading is supported only from Amazon S3, Amazon EMR, and DynamoDB. Those are the only sources that can use massively parallel processing to give you high load speeds. If your data source is different, this feature is not supported. A sketch of a parallel load from S3 follows.
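A hedged sketch of a parallel COPY from S3; the table name, bucket path, and IAM role ARN are placeholders. COPY parallelizes best when the prefix points at multiple files:

    -- Load gzipped CSV files in parallel from an S3 prefix
    COPY daily_sales
    FROM 's3://my-bucket/sales/2019/12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    GZIP;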

Final Verdict: Should You Pick it for Your Next Project?

If your organization is looking to create a data warehouse that can easily manage large amounts of data as well as analytical queries, then it might be worth looking into Redshift. While this service has its drawbacks, it still has many use cases worth investing in.

Even with the learning curve, it can be worth it if you're trying to manage large amounts of data. What we've found is that many companies use Redshift as a base and then apply business logic that inserts data into some sort of RDBMS. This allows data products to access the data easily without bumping into analytics workloads.

If you enjoyed this article about software engineering, then consider these as well!
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist