AWS Redshift – Data Loading Strategies

Amazon Redshift Loading Data Overview

Loading Data Into Amazon S3


• RDS/EC2/On-premises database – AWS Database Migration Service (DMS)

• RDS MySQL – AWS Data Pipeline (see http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dptemplate-copyrdstos3.html)

• Data volume on EBS – use the AWS CLI cp command to copy files to S3

• Glacier – Restore data back to S3 from archived backups

• Kinesis – Kinesis Streams (via the Kinesis Client Library, Kinesis Connector Library, or Lambda) or Kinesis Firehose – land the data in an S3 bucket and then use the Redshift COPY command


Uploading Files to Amazon S3

Loading Data From Amazon S3

Preparing Input Data Files Using Delimiters

Example of a pipe ('|')-delimited file:

1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY
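A minimal sketch of loading such a pipe-delimited file, assuming the customer table defined in the next section and the s3://mydata/client.txt object used in the later examples:

COPY customer
FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
DELIMITER '|';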

Preparing Input Data Files Using Fixed-Width

CREATE TABLE customer (
c_custkey integer not null,
c_name varchar(25) not null, 
c_address varchar(25) not null,
c_city varchar(10) not null, 
c_nation varchar(15) not null, 
c_region varchar(12) not null,
c_phone varchar(15) not null, 
c_mktsegment varchar(10) not null );
COPY customer
FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
FIXEDWIDTH '0:3,1:25,2:25,3:10,4:15,5:12,6:15,7:10';

Splitting Data Files

Split the input into multiple files that share a common key prefix (for example client.txt.1, client.txt.2, ...); COPY loads every S3 object whose key begins with the prefix given in the FROM clause.

COPY customer
FROM 's3://mydata/client.txt'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
DELIMITER '|';


Loading – Use multiple input files to maximize throughput

• Use the COPY command

• Each slice can load one file at a time

• A single input file means only one slice is ingesting data

• Instead of 100MB/s, you're only getting 6.25MB/s

• You need at least as many input files as you have slices

• With 16 input files, all slices are working, so you maximize throughput

• Get 100MB/s per node; scale linearly as you add nodes
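As a sketch of the multi-file pattern, assuming the split files share a hypothetical prefix such as s3://mydata/customer/part- (part-00 through part-15), a single COPY naming that prefix lets every slice pull its own file in parallel:

COPY customer
FROM 's3://mydata/customer/part-'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
DELIMITER '|';

Compressing each part file (and adding GZIP to the COPY) typically reduces transfer time further.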


Loading Data with Manifest Files

• Use a manifest file to load all required files

• Supply a JSON-formatted text file that lists the files to be loaded

• Files can come from different buckets or have different prefixes

{
  "entries": [
    {"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-alpha/2013-10-05-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-05-custdata", "mandatory": true}
  ]
}
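A hedged example of using this manifest, assuming it is saved as the hypothetical object s3://mybucket-alpha/cust.manifest and that the listed files are pipe-delimited:

COPY customer
FROM 's3://mybucket-alpha/cust.manifest'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
DELIMITER '|'
MANIFEST;

With MANIFEST, the FROM clause points at the manifest file itself rather than a key prefix, so only the listed files are loaded and any missing file marked "mandatory" fails the load.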


Micro-Batch Loading

Loading Encrypted Data Files

Redshift COPY Command

• Loads data into a table from data files in S3 or from an Amazon DynamoDB table.

• The COPY command requires only three parameters:

– Table name

– Data Source

– Credentials

COPY table_name FROM data_source CREDENTIALS 'aws_access_credentials';

• Optional Parameters include:

– Column mapping options – map source data fields to target table columns

– Data Format Parameters – FORMAT, CSV, DELIMITER, FIXEDWIDTH, AVRO, JSON, BZIP2, GZIP, LZOP

– Data Conversion Parameters – Data type conversion between source and target

– Data Load Operations – troubleshoot loads or reduce load times with parameters like COMPROWS, COMPUPDATE, MAXERROR, NOLOAD, STATUPDATE
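A sketch combining some of these optional parameters, assuming gzip-compressed, pipe-delimited part files under a hypothetical prefix; MAXERROR 10 tolerates up to 10 bad rows, while COMPUPDATE and STATUPDATE control automatic compression analysis and statistics updates after the load:

COPY customer
FROM 's3://mydata/customer/part-'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
DELIMITER '|'
GZIP
MAXERROR 10
COMPUPDATE ON
STATUPDATE ON;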


Loading JSON Data

• COPY uses a jsonpaths text file to parse JSON data

• JSONPath expressions specify the path to JSON name elements

• Each JSONPath expression corresponds to a column in the Amazon Redshift target table

Suppose you want to load the VENUE table with the following content

{ "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 } { "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "MA" ], "seats": 63026 }

You would use the following jsonpaths file to parse the JSON data.

{
  "jsonpaths": [
    "$['id']",
    "$['name']",
    "$['location'][0]",
    "$['location'][1]",
    "$['seats']"
  ]
}
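A sketch of the matching table and COPY, assuming a hypothetical VENUE definition and hypothetical object keys s3://mybucket/venue.json (the data) and s3://mybucket/venue_jsonpaths.json (the jsonpaths file); each JSONPath expression maps, in order, to a column of the target table:

CREATE TABLE venue (
  id integer,
  name varchar(100),
  city varchar(30),
  state char(2),
  seats integer );

COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
JSON 's3://mybucket/venue_jsonpaths.json';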


Loading Data from Amazon Elastic MapReduce

• Load data from Amazon EMR in parallel using COPY

• Specify Amazon EMR cluster ID and HDFS file path/name

• The Amazon EMR cluster must remain running until the COPY completes.

COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';


