©2019 by Raghavendra Kambhampati

AWS Redshift – Data Loading Strategies

Updated: Aug 16, 2019

Amazon Redshift Loading Data Overview

Loading Data Into Amazon S3


• RDS/EC2/on-premises database – AWS Database Migration Service

• RDS MySQL – AWS Data Pipeline: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dptemplate-copyrdstos3.html

• Data volume on EBS – copy the files to S3 (e.g., with the aws s3 cp command)

• Glacier – restore data back to S3 from archived backups

• Kinesis – Kinesis Streams (via the Kinesis Client Library and Kinesis Connector, or Lambda) or Kinesis Firehose – land the data in an S3 bucket, then load it with the Redshift COPY command


Uploading Files to Amazon S3

Loading Data From Amazon S3

Preparing Input Data Files Using Delimiters

Example of a pipe ('|') delimited file:

1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY

Preparing Input Data Files using Fixed-width

CREATE TABLE customer (
c_custkey integer not null,
c_name varchar(25) not null, 
c_address varchar(25) not null,
c_city varchar(10) not null, 
c_nation varchar(15) not null, 
c_region varchar(12) not null,
c_phone varchar(15) not null, 
c_mktsegment varchar(10) not null );
COPY customer
FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
FIXEDWIDTH '0:3, 1:25, 2:25, 3:10, 4:15, 5:12, 6:15, 7:10';

Splitting Data Files

COPY customer
FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';


Loading – Use multiple input files to maximize throughput

• Use the COPY command

• Each slice can load one file at a time

• A single input file means only one slice is ingesting data – instead of 100 MB/s, you're only getting 6.25 MB/s

• You need at least as many input files as you have slices

• With 16 input files, all slices are working, so you maximize throughput (see the sketch after this list)

• Get 100 MB/s per node; scale linearly as you add nodes
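A minimal sketch of this approach, assuming a customer table and split parts uploaded to s3://mydata/ as customer.txt.1 through customer.txt.16 (all names here are illustrative, not from the original):

-- Number of slices in the cluster; aim for a multiple of this many input files.
SELECT COUNT(*) AS slice_count FROM stv_slices;

-- The S3 path below acts as a key prefix, so a single COPY loads every
-- matching part (customer.txt.1 ... customer.txt.16) in parallel across slices.
COPY customer
FROM 's3://mydata/customer.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|';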


Loading Data with Manifest Files

• Use a manifest to load all required files

• Supply JSON-formatted text file that lists the files to be loaded

• Can load files from different buckets or with different prefixes

{
  "entries": [
    {"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-alpha/2013-10-05-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-05-custdata", "mandatory": true}
  ]
}
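A minimal sketch of using this manifest, assuming it is saved as s3://mybucket-alpha/cust.manifest and the target table is customer (both names are illustrative):

-- MANIFEST tells COPY to read the list of files from the manifest
-- instead of treating the FROM path as a key prefix.
COPY customer
FROM 's3://mybucket-alpha/cust.manifest'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
MANIFEST;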


Micro-Batch Loading

Loading Encrypted Data Files

Redshift COPY Command

• Loads data into a table from data files in S3 or from an Amazon DynamoDB table.

• The COPY command requires only three parameters:

– Table name

– Data Source

– Credentials

COPY table_name FROM data_source CREDENTIALS 'aws_access_credentials';

• Optional Parameters include:

– Column mapping options – map source data fields to target table columns

– Data Format Parameters – FORMAT, CSV, DELIMITER, FIXEDWIDTH, AVRO, JSON, BZIP2, GZIP, LZOP

– Data Conversion Parameters – Data type conversion between source and target

– Data Load Operations – troubleshoot or reduce load times with parameters like COMPROWS, COMPUPDATE, MAXERROR, NOLOAD, STATUPDATE
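As a hedged illustration of combining several of these optional parameters (the bucket, prefix, and table names are assumptions, not from the original):

-- Pipe-delimited, gzip-compressed input; tolerate up to 10 bad rows,
-- skip automatic compression analysis, and refresh table statistics after the load.
COPY customer
FROM 's3://mydata/customer/'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
DELIMITER '|'
GZIP
MAXERROR 10
COMPUPDATE OFF
STATUPDATE ON;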


Loading JSON Data

• COPY uses a JSONPaths text file to parse JSON data

• JSONPath expressions specify the path to JSON name elements

• Each JSONPath expression corresponds to a column in the Amazon Redshift target table

Suppose you want to load the VENUE table with the following content

{ "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 } { "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "MA" ], "seats": 63026 }

You would use the following jsonpaths file to parse the JSON data.

{
  "jsonpaths": [
    "$['id']",
    "$['name']",
    "$['location'][0]",
    "$['location'][1]",
    "$['seats']"
  ]
}
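A minimal sketch of the corresponding load, assuming the JSON records are stored at s3://mybucket/venue.json, the JSONPaths file at s3://mybucket/venue_jsonpaths.json, and a VENUE table with one column per JSONPath expression (the object names and column types here are illustrative):

-- One target column per JSONPath expression, in the same order.
CREATE TABLE venue (
  venueid integer,
  venuename varchar(100),
  venuecity varchar(30),
  venuestate char(2),
  venueseats integer );

COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
JSON 's3://mybucket/venue_jsonpaths.json';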


Loading Data from Amazon Elastic MapReduce

• Load data from Amazon EMR in parallel using COPY

• Specify Amazon EMR cluster ID and HDFS file path/name

• The Amazon EMR cluster must remain running until the COPY completes.

COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';