This website was created to help users of Azure SQL Databases get their data loaded into Amazon Redshift on an ongoing basis. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is Azure?
Microsoft Azure is a cloud services platform that developers can use to build, deploy, and manage applications. Azure offers fully managed relational databases. Users can then store and retrieve data using other software applications which may run either on the same computer or on another computer across a network (including the Internet).
Pulling Data Out of Azure
There are several methods for extracting data from Azure, and the one you use will probably be dependent upon your needs (and skill set).
Most people who interact with databases regularly will default to writing queries as the main method for extracting data. SELECT queries have quite a few benefits, most notably around flexibility. They allow you to pull exactly the data you want by specifying filters, ordering, and limiting results. If you have a specific subset of data in mind or are looking to continuously monitor a subset of a specific table, SELECT queries may be a good fit.
If you’re just looking to export data in bulk, however, you can use some of the tools included with Azure that run on the Windows Server operating system. SQL Server Server Management Studio allows you to export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL queries that would restore the database if run).
Preparing Azure Data for Redshift
Here’s the tricky part: for every table in your Azure Database, you need a matching table in Redshift to receive the data. Redshift’s syntax is based on the syntax of Postgres, another relational database built on the SQL standard. It’s not a perfect match, but for the most part you should be able to recreate schema’s and table structures from Azure relatively easily in Redshift.
It’s very important to understand the structure of your databases and tables. You can use GUIs like the one in SQL Server Management Studio, or you can familiarize yourself with the information_schema database that contains this information in a structured format. If you are going to attempt to automate the creation of these Redshift tables, using information_schema is a must.
When it comes to optimal configuration, Redshift is NOT the same as Azure. You’ll need to familiarize yourself with the Redshift CREATE TABLE statement and the implications of, for example, selecting certain sort keys at creation time. It’s critical that you appreciate the performance implications that your table structure can create.
Inserting Azure Data into Redshift
With a table built, it may seem like the easiest way to add your exported data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.
Keeping Data Up-To-Date
So, now what? You’ve built a script that pulls data from Azure and loads it into Redshift, but what happens tomorrow when you have new and updated records in your Azure database?
Depending on how you’ve built your script, you may be forced to load your entire database into Redshift every time new data appears. This might be slow and painful, or even have performance implications on your database instance.
The key is to build your script in such a way that it can also identify incremental updates to your data. If your Azure Database tables have fields like modified_at or auto-incrementing primary keys, you can build a script that can quickly identify records that are new or changed since your last update (or since the newest record you’ve copied into Redshift). You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.
Other Data Warehouse Options
Redshift is totally awesome, but sometimes you need to start smaller or optimize for different things. In this case, many people choose to get started with Postgres, which is an open source RDBMS that uses nearly identical SQL syntax to Redshift. If you’re interested in seeing the relevant steps for loading this data into Postgres, check out Azure to Postgres
Easier and Faster Alternatives
Chances are that if you have all the skills necessary to go through this process, building and maintaining scripts like these isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to solve this problem automatically. With just a few clicks, Stitch starts extracting your Azure data, structuring it in a way that is optimized for analysis, and inserting that data into your Amazon Redshift data warehouse.