How to Convert CSV/JSON to Apache Parquet using AWS Glue

Walkthrough


In this walkthrough, you define a database, configure a crawler to explore data in an Amazon S3 bucket, create a table, transform the CSV files into Parquet, and create a table for the Parquet data, all using AWS Glue.

Here we use TPC-H CSV data stored in an Amazon S3 bucket.

Discover the data

Sign in to the AWS Management Console and open the AWS Glue console. You can find AWS Glue in the Analytics section.

The first step to discovering the data is to add a database. A database is a collection of tables.

In the console, choose Add database. In Database name, type tpc-h, and choose Create.
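If you prefer to script this step, here is a minimal boto3 sketch that creates the same database (the database name comes from this walkthrough; AWS credentials and region are assumed to be configured already):

    import boto3

    glue = boto3.client('glue')

    # Create the Glue Data Catalog database that will hold the TPC-H tables
    glue.create_database(DatabaseInput={'Name': 'tpc-h'})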

Choose Tables in the navigation pane. A table consists of the names of columns, data type definitions, and other metadata about a dataset.

Add a table to the database tpc-h. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
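As an aside, custom classifiers can also be registered through the API. The sketch below uses a hypothetical classifier name, classification label, and grok pattern; the built-in CSV classifier already handles the TPC-H files, so this is only an illustration:

    import boto3

    glue = boto3.client('glue')

    # Hypothetical grok classifier for a made-up log format (illustration only)
    glue.create_classifier(GrokClassifier={
        'Name': 'my-app-log-classifier',       # assumed name
        'Classification': 'app-logs',          # label applied to matched data
        'GrokPattern': '%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}'
    })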

To add a crawler, enter the data source: an Amazon S3 bucket named s3://redshift-demokloud (the location where you have stored your TPC-H data or any sample data). This S3 bucket contains the data files for all 8 TPC-H tables.

Choose Next.

For IAM role, choose the existing IAM role AWSGlueServiceRoleDefault in the drop-down list. (This role needs to be created first; it then appears in the drop-down list.)
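If the role does not exist yet, a rough boto3 sketch of creating it follows. The role name comes from this walkthrough; the trust policy and the AWS-managed AWSGlueServiceRole policy are the standard ones for Glue, and access to your own S3 buckets is assumed to be granted separately:

    import json
    import boto3

    iam = boto3.client('iam')

    # Allow the AWS Glue service to assume this role
    trust_policy = {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'glue.amazonaws.com'},
            'Action': 'sts:AssumeRole'
        }]
    }

    iam.create_role(RoleName='AWSGlueServiceRoleDefault',
                    AssumeRolePolicyDocument=json.dumps(trust_policy))

    # Attach the AWS-managed Glue service policy; read/write access to the
    # source and target S3 buckets still has to be added (e.g., an inline policy).
    iam.attach_role_policy(
        RoleName='AWSGlueServiceRoleDefault',
        PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole')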

For Frequency, choose Run on demand. The crawler can be run on demand or set to run on a schedule.

For Database, choose tpc-h. It is important to understand how AWS Glue deals with schema changes so that you can select the appropriate behavior. In this example, the table is updated with any change.

Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now.
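The same crawler can be created and started with boto3. The sketch below mirrors the console steps above; the crawler name is an assumption, while the bucket, role, database, and schema-change behavior come from this walkthrough:

    import boto3

    glue = boto3.client('glue')

    # Crawl the CSV files and register one table per TPC-H table
    glue.create_crawler(
        Name='tpc-h-csv-crawler',              # assumed name
        Role='AWSGlueServiceRoleDefault',
        DatabaseName='tpc-h',
        Targets={'S3Targets': [{'Path': 's3://redshift-demokloud/'}]},
        # "the table is updated with any change"
        SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE',
                            'DeleteBehavior': 'LOG'})

    glue.start_crawler(Name='tpc-h-csv-crawler')

    # The crawler state moves RUNNING -> STOPPING -> READY when it is done
    print(glue.get_crawler(Name='tpc-h-csv-crawler')['Crawler']['State'])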

When the crawler has finished, 8 tables have been added.

Choose Tables in the left navigation pane, and then choose customer. This screen describes the table, including schema, properties, and other valuable information.
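The same schema can be read programmatically; a small sketch using the table the crawler just created:

    import boto3

    glue = boto3.client('glue')

    # Fetch the customer table definition from the tpc-h database
    table = glue.get_table(DatabaseName='tpc-h', Name='customer')['Table']

    # Print the column names and types the crawler inferred
    for col in table['StorageDescriptor']['Columns']:
        print(col['Name'], col['Type'])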

Transform the data from CSV to Parquet format


Now you can configure and run a job to transform the data from CSV to Parquet. Parquet is a columnar format that is well suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.

Under ETL in the left navigation pane, choose Jobs, and then choose Add job.

For Name, type customer-csv-parquet. For IAM role, choose AWSGlueServiceRoleDefault.

Select the customer table (one of the 8 TPC-H tables) as the data source.

Select Create tables in your data target. Choose Amazon S3 as the data store, choose Parquet as the format, and select the target S3 path where the data will be stored.

Verify the schema mapping, and choose Finish.

View the job. This screen provides a complete view of the job and allows you to edit, save, and run it. AWS Glue generated this script, but, if required, you can write your own.
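The generated code follows the usual Glue PySpark pattern. The sketch below is not the exact script AWS Glue produces, but a minimal equivalent assuming the catalog names from this walkthrough and a hypothetical target path:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Read the customer table that the crawler registered in the Data Catalog
    customer = glueContext.create_dynamic_frame.from_catalog(
        database='tpc-h', table_name='customer')

    # Write it back to S3 as Parquet (the target path here is an assumption)
    glueContext.write_dynamic_frame.from_options(
        frame=customer,
        connection_type='s3',
        connection_options={'path': 's3://your-target-bucket/tpc-h-parquet/customer/'},
        format='parquet')

    job.commit()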

Choose Save, and then choose Run job.
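Equivalently, the saved job can be started and monitored from boto3 (the job name comes from this walkthrough):

    import boto3

    glue = boto3.client('glue')

    # Start the ETL job created above and check its state
    run_id = glue.start_job_run(JobName='customer-csv-parquet')['JobRunId']
    run = glue.get_job_run(JobName='customer-csv-parquet', RunId=run_id)
    print(run['JobRun']['JobRunState'])  # e.g. RUNNING, SUCCEEDED, FAILED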

Add the Parquet table and crawler

When the job has finished, add a new table for the Parquet data using a crawler.

For Crawler name, type tpc-h parquet crawler.

Choose S3 as the data store, and select the S3 path where the Parquet data is stored.

When asked whether to add another data store, choose No.

For IAM role, choose AWSGlueServiceRoleDefault.

For Database, choose tpc-h.

For Frequency, choose Run on demand, and then choose Run it now.

After the crawler has finished, you can see 5 new tables created with the classification parquet under the tpc-h database.
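To confirm what the second crawler registered, a quick boto3 check is possible; this sketch assumes the crawler set the usual classification parameter on the tables it created:

    import boto3

    glue = boto3.client('glue')

    # List tables in tpc-h whose classification is parquet
    for t in glue.get_tables(DatabaseName='tpc-h')['TableList']:
        if t.get('Parameters', {}).get('classification') == 'parquet':
            print(t['Name'])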

Choose Tables, and then choose the customer table to view the data and schema details.

This post demonstrated how to use AWS Glue to crawl your data on Amazon S3 and build an Apache Hive-compatible metadata store. You can use the metadata across AWS analytics services and popular Hadoop ecosystem tools. This combination of AWS services is powerful and easy to use, allowing you to get to business insights faster.


©2019 by Raghavendra Kambhampati