What is AWS Glue? Its Features, History, Components, and Pricing

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.

AWS Glue generates code that is customizable, reusable, and portable. Once your ETL job is ready, you can schedule it to run on AWS Glue’s fully managed, scale-out Apache Spark environment. AWS Glue provides a flexible scheduler with dependency resolution, job monitoring, and alerting.

AWS Glue is serverless, so there is no infrastructure to buy, set up, or manage. It automatically provisions the environment needed to complete the job, and customers pay only for the compute resources consumed while running ETL jobs. With AWS Glue, data can be available for analytics in minutes.

Features:

- Integrated data catalog:

The AWS Glue Data Catalog is your persistent metadata store for all your data assets, regardless of where they are located. The Data Catalog contains table definitions, job definitions, and other control information to help you manage your AWS Glue environment. It automatically computes statistics and registers partitions to make queries against your data efficient and cost-effective. It also maintains a comprehensive schema version history so you can understand how your data has changed over time.
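
As a quick illustration, once crawlers have populated the catalog you can browse its metadata programmatically. Below is a minimal sketch using boto3; the database name sales_db is an assumed placeholder, not something from AWS documentation:

```python
# A minimal sketch of reading Data Catalog metadata with boto3.
# The database name "sales_db" is an assumed placeholder.
import boto3

glue = boto3.client("glue")

# Page through every table registered in one catalog database.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="sales_db"):
    for table in page["TableList"]:
        cols = [c["Name"] for c in table.get("StorageDescriptor", {}).get("Columns", [])]
        print(table["Name"], cols)
```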

- Automatic schema discovery:

AWS Glue crawlers connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata is stored in tables in your Data Catalog and used in the authoring process of your ETL jobs. You can run crawlers on a schedule, on demand, or triggered by an event to ensure that your metadata is up to date.
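
A minimal sketch of defining and starting a crawler with boto3 follows; the crawler name, IAM role ARN, database name, and S3 path are assumed placeholders:

```python
# A minimal sketch of creating and running a crawler with boto3.
# All names, the role ARN, and the S3 path are assumed placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    # Optional: also run every night at 01:00 UTC instead of only on demand.
    Schedule="cron(0 1 * * ? *)",
)

glue.start_crawler(Name="sales-crawler")
```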

- Code generation:

AWS Glue automatically generates the code to extract, transform, and load your data. Simply point AWS Glue to your data source and target, and AWS Glue creates ETL scripts to transform, flatten, and enrich your data. The code is generated in Scala or Python and written for Apache Spark.
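
The generated Python scripts follow a common skeleton. Here is a minimal sketch in that style using the awsglue PySpark library; the catalog database/table names, column mappings, and S3 output path are assumed placeholders, not code actually emitted by the service:

```python
# A minimal sketch in the style of a generated PySpark ETL script.
# Catalog names, mappings, and the output path are assumed placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table a crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Transform: rename and retype columns; anything not mapped is dropped.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)

job.commit()
```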

- Developer endpoints:

If you choose to interactively develop your ETL code, AWS Glue provides development endpoints for you to edit, debug, and test the code it generates for you. You can use your favorite IDE or notebook. You can write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries. You can also use and share code with other developers in the AWS Glue GitHub repository.
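
Development endpoints can also be provisioned from the API. A minimal boto3 sketch, where the endpoint name, role ARN, and SSH public key are assumed placeholders:

```python
# A minimal sketch of provisioning a development endpoint with boto3.
# The endpoint name, role ARN, and public key are assumed placeholders.
import boto3

glue = boto3.client("glue")

glue.create_dev_endpoint(
    EndpointName="interactive-etl-dev",
    RoleArn="arn:aws:iam::123456789012:role/GlueDevEndpointRole",
    NumberOfNodes=5,  # DPUs allocated to the endpoint
    PublicKey="ssh-rsa AAAA... user@host",  # lets you SSH in and attach a notebook or IDE
)
```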

- Flexible job scheduler:

AWS Glue jobs can be invoked on a schedule, on-demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue will handle all inter-job dependencies, filter bad data, and retry jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.
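
A minimal boto3 sketch of both trigger styles, a cron schedule and a job-dependency condition, with assumed job and trigger names:

```python
# A minimal sketch of scheduling and chaining jobs with Glue triggers.
# Job and trigger names are assumed placeholders.
import boto3

glue = boto3.client("glue")

# Run the extract job every hour on the hour.
glue.create_trigger(
    Name="hourly-extract",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "extract-orders"}],
    StartOnCreation=True,
)

# Run the load job only after the extract job succeeds.
glue.create_trigger(
    Name="load-after-extract",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "extract-orders",
            "State": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "load-orders"}],
    StartOnCreation=True,
)
```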

AWS Glue History

- October 15, 2018

Support for resource-level permissions and resource-based policies with AWS Glue.

- October 5, 2018

Support for Amazon SageMaker notebooks with AWS Glue development endpoints.

- August 24, 2018

Support for encryption.

- July 13, 2018

Support for Apache Spark job metrics for better debugging and profiling of ETL jobs. You can easily track runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors from the AWS Glue console.

- July 10, 2018

Support for Amazon DynamoDB as a data source for ETL jobs.

- July 9, 2018

Updated the procedure for creating a notebook server on an Amazon EC2 instance associated with a development endpoint.

- June 25, 2018

Updates to the AWS Glue Developer Guide are now announced over an RSS feed.

- May 25, 2018

Support for delay notifications for jobs.

- May 7, 2018

Support for configuring a crawler to append new columns.

- April 10, 2018

Support for setting a timeout threshold on job runs.

- January 12, 2018

Support for Scala ETL scripts and for triggering jobs based on additional run states. The trigger API now supports firing when any condition is met (in addition to all conditions), and jobs can be triggered based on a “failed” or “stopped” job run (in addition to a “succeeded” job run).

- November 16, 2017

Added information about classifying XML data sources and new crawler option for partition changes.

- September 29, 2017

Added information about the map and filter transforms, support for Amazon RDS Microsoft SQL Server and Amazon RDS Oracle, and new features for development endpoints.

- August 14, 2017

AWS Glue initial release.

AWS Glue Components:

Crawlers:

- Crawlers automatically build your Data Catalog and keep it in sync.

- Automatically discover new data and extract schema definitions.

- Detect schema changes and version tables.

- Detect Hive-style partitions on Amazon S3.

- Built-in classifiers for popular types; custom classifiers using Grok expressions.

- Run ad hoc or on a schedule; serverless, so you pay only while a crawler runs.

Data Catalog:

- Manage table metadata through a Hive metastore API or Hive SQL, supported by tools such as Hive, Presto, and Spark. The Data Catalog unifies metadata across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3 into a single categorized list that is searchable.

Hive Metastore compatibility with enhanced functionality:

- Crawlers automatically extract metadata and create tables

- Integrated with Amazon Athena and Amazon Redshift Spectrum

Job Authoring:

- Auto-generates ETL code

- Built on open frameworks: Python and Apache Spark

- Developer-centric: editing, debugging, and sharing

Job Execution:

- Run jobs on a serverless Apache Spark platform

- Provides flexible scheduling

- Handles dependency resolution, monitoring, and alerting (see the job-run sketch below)
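
A minimal boto3 sketch of starting a job run and polling its state until it reaches a terminal state; the job name is an assumed placeholder:

```python
# A minimal sketch of running a Glue job and polling its state with boto3.
# The job name is an assumed placeholder.
import time

import boto3

glue = boto3.client("glue")

run_id = glue.start_job_run(JobName="extract-orders")["JobRunId"]

while True:
    run = glue.get_job_run(JobName="extract-orders", RunId=run_id)["JobRun"]
    print("state:", run["JobRunState"])
    if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)  # poll every 30 seconds
```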

Pricing:

With AWS Glue, you pay an hourly rate, billed by the second, for crawlers (discovering data) and ETL jobs (processing and loading data). For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second.

ETL Jobs and Development Endpoints

With AWS Glue, you only pay for the time your ETL job takes to run. There are no resources to manage, no upfront costs, and you are not charged for startup or shutdown time. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your ETL job. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. An AWS Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each ETL job.

Development endpoints are optional, and billing applies only if you choose to interactively develop your ETL code. Development endpoints are charged based on the Data Processing Unit hours used for the time your development endpoints are provisioned. An AWS Glue development endpoint requires a minimum of 2 DPUs. By default, AWS Glue allocates 5 DPUs to each development endpoint. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each provisioned development endpoint.
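
The billing arithmetic above reduces to a simple formula: DPUs × billable hours × rate, where billable time is the greater of the actual run time and the 10-minute minimum. A small sketch that reproduces the worked examples later in this article:

```python
# A minimal sketch of the DPU-hour billing arithmetic described above.
# $0.44 per DPU-hour, per-second billing, 10-minute minimum per run.
def glue_cost(dpus, seconds, rate_per_dpu_hour=0.44, minimum_seconds=600):
    billable = max(seconds, minimum_seconds)
    return dpus * (billable / 3600) * rate_per_dpu_hour

print(glue_cost(6, 10 * 60))  # 10-minute ETL job on 6 DPUs -> 0.44
print(glue_cost(5, 24 * 60))  # 24-minute dev endpoint on 5 DPUs -> 0.88
```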

Pricing

For all AWS Regions where AWS Glue is available:

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each ETL job

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint

Additional charges

If your ETL jobs read from or write to data sources such as Amazon S3, Amazon RDS, or Amazon Redshift, you are charged standard request and data transfer rates. If you use Amazon CloudWatch, you are charged standard rates for CloudWatch Logs and CloudWatch Events.

Data Catalog and Storage Requests:

With the AWS Glue Data Catalog, you can store up to a million objects for free. If you store more than a million objects, you will be charged $1 per 100,000 objects over a million, per month. An object in the AWS Glue Data Catalog is a table, table version, partition, or database.

The first million access requests to the AWS Glue Data Catalog per month are free. If you exceed a million requests in a month, you will be charged $1 per million requests over the first million. Some of the common requests are CreateTable, CreatePartition, GetTable and GetPartitions.

Pricing

For all AWS Regions where AWS Glue is available:

Storage:

Free for the first million objects stored

$1 per 100,000 objects stored above 1M, per month

Requests:

Free for the first million requests per month

$1 per million requests above 1M in a month

Crawlers:

There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.

Pricing

For all AWS Regions where AWS Glue is available:

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run

Example:

ETL job example: Consider an ETL job that runs for 10 minutes and consumes 6 DPUs. The price of 1 DPU-Hour is $0.44. Since your job ran for 1/6th of an hour and consumed 6 DPUs, you will be billed 6 DPUs * 1/6 hour at $0.44 per DPU-Hour or $0.44.

Development endpoint example:

Now let’s consider that you provision a development endpoint to connect your notebook to interactively develop your ETL code. A development endpoint is provisioned with 5 DPUs. If you keep the development endpoint running for 24 minutes or 2/5th of an hour, you will be billed for 5 DPUs * 2/5 hour at $0.44 per DPU-Hour or $0.88.

AWS Glue Data Catalog free tier example:

Let’s consider that you store a million tables in your AWS Glue Data Catalog in a given month and make a million requests to access these tables. You pay $0 because your usage will be covered under the AWS Glue Data Catalog free tier. You can store the first million objects and make a million requests per month for free.

AWS Glue Data Catalog example: Now consider your storage usage remains the same at one million tables per month, but your requests double to two million requests per month. Let’s say you also use crawlers to find new tables and they run for 30 minutes and consume 2 DPUs. Your storage cost is still $0, as the storage for your first million tables is free. Your first million requests are also free. You will be billed for one million requests above the free tier, which is $1. Crawlers are billed at $0.44 per DPU-Hour, so you will pay for 2 DPUs * 1/2 hour at $0.44 per DPU-Hour or $0.44. This is a total monthly bill of $1.44.

©2019 by Raghavendra Kambhampati