April 16, 2021

Let's Try to Build an AWS Redshift Converged Data Warehouse

By: William Nee


Oracle and Amazon both agree the first step to effectively leverage massive amounts of disparate data is to consolidate that information into a data warehouse.  But the two companies drastically differ on how this should be achieved.  Oracle advocates a converged database that stores different data types in the same database, while Amazon argues that different data types require separate databases.

But Amazon’s approach is just not possible to implement cleanly.  Four AWS experts and I made an honest, good-faith attempt to implement it, and as the output of our work in the diagram below shows, the result is an incoherent mess.  Don’t just take my word for it; try it yourself, or better yet, ask an Amazon AWS Cloud Architect to design it for you. 

While Amazon’s databases may work individually (albeit with limitations), things rapidly start going downhill when one tries to unite their disparate databases into a cohesive data management architecture.  Amazon has to advocate a separate, specialized database approach because their database products simply can’t effectively work together.

As we can see from the diagram, Redshift (Amazon’s data warehousing product) can’t even connect with some Amazon databases.  Other integrations are crude and simplistic, while still others are overly complex and introduce unnecessary cost, latency and unreliability. 

Let’s take a look at each integration, starting from the top of the diagram and going clockwise.  Refer to the links to Amazon documentation and architecture examples for confirmation:

  1. Amazon Timestream (time series data) – There’s no mention in Timestream documentation on how to integrate Timestream with Redshift.  We assumed (perhaps wrongly?) custom integration might be possible.
  2. Amazon Keyspaces (wide-column data) – There’s no mention in Keyspaces documentation on how to integrate Keyspaces with Redshift.  We assumed (perhaps wrongly?) custom integration might be possible.
  3. Amazon RDS (Oracle, SQL Server, MariaDB) (relational data) – For these three RDS variants, Amazon recommends two batch (aka slow) data transfer approaches: AWS Glue and AWS Data Pipeline.  A minimal Glue job is sketched after this list.
  4. Amazon RDS for MySQL, PostgreSQL (relational data) – These RDS variants support federated queries with Redshift, although this capability does have significant limitations (see the federated-query sketch after this list).
  5. Amazon Aurora for MySQL, PostgreSQL (relational data) – Both Aurora variants support federated queries with Redshift, but as with RDS, significant restrictions exist.
  6. Amazon S3 (key-value data) – Although not officially classified by Amazon as a database, S3 is often used in that capacity.  Amazon says that S3 → Redshift data transfers should be performed in batch mode using the basic Redshift COPY command or Glue (see the COPY sketch after this list).
  7. Amazon DynamoDB (key-value data) – Yet again, Amazon says the basic COPY command should be used to transfer data from DynamoDB to Redshift.  We used DynamoDB Streams → Kinesis Data Firehose → Redshift in an attempt to minimize latency (even though Firehose still just uses COPY to perform insertions!).  We also included an optional AWS Lambda function since those are often needed for stream processing (as per this example; a sketch of such a function appears after this list).
  8. Amazon DocumentDB (document data) – Guidance on how to perform DocumentDB-Redshift data transfers is sparse, but this case study implies using Glue is the sanctioned approach.  To optimize performance, the AWS experts recommended using DocumentDB change streams → Kinesis Data Streams → Kinesis Data Firehose, even though there’s no confirmation this would work.
  9. Amazon Neptune (graph data) – There’s no mention in Neptune documentation on how to integrate Neptune with Redshift.  But since changes in Neptune can be captured by Neptune Streams, we thought that Neptune Streams → Kinesis Data Streams → Kinesis Data Firehose → Redshift might work, despite the lack of guidance from Amazon.
  10. Amazon QLDB (ledger data) – As in most of the other cases, there’s no mention in QLDB documentation on how to integrate QLDB with Redshift.  But again, we thought it might be possible to use QLDB Streams → Kinesis Data Streams → Kinesis Data Firehose → Redshift to transmit data.  We also added a Lambda function as an optional step since, per this example, stream processing is often needed.
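
To give a concrete flavor of what these integrations involve, here are a few rough sketches from our exercise.  First, the batch route for the RDS variants: a minimal AWS Glue job that copies a single RDS table (already crawled into the Glue Data Catalog) into Redshift.  The catalog database, connection, bucket and table names below are placeholders, not values from any real environment.

```python
# Sketch of an AWS Glue ETL job: batch-copy one RDS table into Redshift.
# All names (catalog database, table, connection, bucket) are illustrative.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: an RDS table previously registered in the Glue Data Catalog by a crawler.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",      # placeholder catalog database
    table_name="sales_orders",      # placeholder table
)

# Sink: Redshift, staged through S3 as Glue requires.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-connection",   # placeholder Glue connection
    connection_options={"dbtable": "analytics.sales_orders", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",
)

job.commit()
```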
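
For the RDS and Aurora federated-query cases, the setup is a one-time CREATE EXTERNAL SCHEMA statement, after which Redshift queries the remote PostgreSQL database live, within the documented limitations.  The sketch below issues the SQL through the Redshift Data API; the cluster, endpoint, role and secret identifiers are placeholders.

```python
# Sketch: set up a Redshift federated query to an Aurora/RDS PostgreSQL database
# and run a live join against it, via the Redshift Data API. Identifiers are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    # One-time external schema pointing at the remote PostgreSQL database.
    """CREATE EXTERNAL SCHEMA IF NOT EXISTS postgres_federated
       FROM POSTGRES
       DATABASE 'appdb' SCHEMA 'public'
       URI 'my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-federated-role'
       SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-creds';""",
    # A live query joining a local Redshift table with a remote PostgreSQL table.
    """SELECT w.order_id, c.customer_name
       FROM analytics.orders w
       JOIN postgres_federated.customers c ON c.customer_id = w.customer_id
       LIMIT 10;""",
]

for sql in statements:
    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder cluster
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
        Sql=sql,
    )
```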
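
The S3 and DynamoDB paths reduce to the basic COPY command.  Something along these lines, again with placeholder bucket, table, role and cluster names:

```python
# Sketch of the batch COPY loads Amazon recommends for S3 and DynamoDB.
# Bucket, table, role and cluster names are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

copy_statements = [
    # S3 -> Redshift: bulk load of staged files.
    """COPY analytics.page_views
       FROM 's3://my-landing-bucket/page_views/'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
       FORMAT AS JSON 'auto';""",
    # DynamoDB -> Redshift: a full-table scan, throttled by READRATIO.
    """COPY analytics.user_profiles
       FROM 'dynamodb://UserProfiles'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
       READRATIO 50;""",
]

for sql in copy_statements:
    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder cluster
        Database="dev",
        DbUser="awsuser",                          # placeholder database user
        Sql=sql,
    )
```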
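
Finally, the stream-based pipelines we tried for DynamoDB, DocumentDB, Neptune and QLDB generally need a Lambda function to reshape records before Kinesis Data Firehose writes them into Redshift.  The sketch below follows Firehose’s record-transformation contract (recordId / result / data); the payload fields are invented, since every source emits a different stream format, which is precisely the problem.

```python
# Sketch of a Kinesis Data Firehose transformation Lambda. The envelope
# (recordId/result/data) follows Firehose's transformation contract; the
# payload fields are hypothetical and would differ per source database.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Flatten the change event into the row shape the Redshift table expects.
        row = {
            "item_id": payload.get("id"),
            "event_type": payload.get("eventName"),
            "payload": json.dumps(payload),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(row) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Multiply that by each stream-producing source, each with its own payload shape, and the operational overhead adds up quickly.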

Oracle’s approach to data warehousing, shown in the diagram below, stands in stark contrast, and its clarity and simplicity speak volumes.  With its converged database, Oracle eliminates the need for multiple special-purpose databases, ensuring operations are performed quickly, reliably, securely and at lower cost.  See here for more details.
