Pricing Model for AWS Glue, AWS Athena & S3 for Datalake in AWS
Updated: Aug 16, 2019
S3 Cost Model:
When it comes to S3 pricing, three factors determine the total cost of using S3:
1. The amount of storage.
2. The amount of data transferred every month.
3. The number of requests made monthly.
In most cases, only the storage amount and the data transferred make much of a difference in cost. However, data transfer between S3 and AWS resources within the same region costs nothing.
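As a rough sketch, the three factors can be combined into one monthly estimate. This article does not quote S3 rates, so every rate below is a caller-supplied placeholder, and the function name s3_monthly_cost and its parameters are hypothetical. Real S3 pricing is tiered and charges different request types differently, which this simplification ignores.

```python
from decimal import Decimal


def s3_monthly_cost(storage_gb, transfer_gb, requests, *,
                    storage_rate_per_gb, transfer_rate_per_gb,
                    rate_per_1000_requests, same_region=False):
    """Estimate a monthly S3 bill from storage, data transfer, and requests.

    All rates are assumptions supplied by the caller, not AWS list prices.
    """
    def to_dec(value):
        # Go through str() so float inputs do not carry binary rounding noise.
        return Decimal(str(value))

    storage = to_dec(storage_gb) * to_dec(storage_rate_per_gb)
    # Transfer between S3 and AWS resources in the same region is free.
    if same_region:
        transfer = Decimal("0")
    else:
        transfer = to_dec(transfer_gb) * to_dec(transfer_rate_per_gb)
    request = to_dec(requests) / Decimal(1000) * to_dec(rate_per_1000_requests)
    return (storage + transfer + request).quantize(Decimal("0.01"))
```

With made-up rates, setting same_region=True drops the transfer term entirely, so only storage and requests remain in the total.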
Athena Cost Model:
Save More When You Use Columnar Data Formats, Partition, and Compress Your Data: You can save from 30% to 90% on your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats.
You are charged for the number of bytes scanned by Amazon Athena, rounded up to the nearest megabyte, with a 10MB minimum per query. There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries. Cancelled queries are charged based on the amount of data scanned.
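The rounding rule above can be sketched in a few lines. The function name athena_billed_mb and its flags are hypothetical, and treating 1 MB as 2**20 bytes is an assumption; the article does not say which definition Athena uses. The sketch returns billable megabytes only, because no per-byte price is quoted here.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes
MIN_BILLED_MB = 10          # 10 MB minimum per query


def athena_billed_mb(bytes_scanned, is_ddl=False, failed=False):
    """Return the number of megabytes a single query is billed for."""
    # DDL statements, partition management, and failed queries are free.
    if is_ddl or failed:
        return 0
    # Round up to the next whole megabyte using integer arithmetic.
    rounded_up = -(-bytes_scanned // BYTES_PER_MB)
    return max(rounded_up, MIN_BILLED_MB)
```

Under these assumptions, a query that scans a single byte is still billed for 10 MB, and a cancelled query is billed like any other query for whatever it scanned before cancellation.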
Compressing your data allows Athena to scan less data. Converting your data to columnar formats allows Athena to selectively read only required columns to process the data. Athena supports Apache ORC and Apache Parquet. Partitioning your data also allows Athena to restrict the amount of data scanned. This leads to cost savings and improved performance. You can see the amount of data scanned per query on the Athena console. For details, see the Athena pricing example.
Amazon Athena queries data directly from Amazon S3. There are no additional storage charges for querying your data with Athena. You are charged standard S3 rates for storage, requests, and data transfer. By default, query results are stored in an S3 bucket of your choice and are also billed at standard Amazon S3 rates.
Glue Cost Model:
With AWS Glue, you pay an hourly rate, billed by the second, for crawlers (discovering data) and ETL jobs (processing and loading data). For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second.
ETL jobs and development endpoints
With AWS Glue, you only pay for the time your ETL job takes to run. There are no resources to manage, no upfront costs, and you are not charged for startup or shutdown time. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your ETL job. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each ETL job.
Development endpoints are optional, and billing applies only if you choose to interactively develop your ETL code. Development endpoints are charged based on the Data Processing Unit hours used for the time your development endpoints are provisioned. A Glue development endpoint requires a minimum of 2 DPUs. By default, AWS Glue allocates 5 DPUs to each development endpoint. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each provisioned development endpoint.
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each ETL job
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum for each provisioned development endpoint
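The DPU-Hour billing rule is the same for ETL jobs, development endpoints, and crawler runs, so it can be sketched as one helper. The name glue_dpu_cost is hypothetical; the $0.44 rate and the 10-minute minimum are the figures quoted in this article and may differ by region or over time.

```python
import math
from decimal import Decimal

DPU_HOUR_RATE = Decimal("0.44")  # USD per DPU-Hour, as quoted in this article
MIN_BILLED_SECONDS = 10 * 60     # 10-minute minimum duration


def glue_dpu_cost(dpus, runtime_seconds, rate=DPU_HOUR_RATE):
    """Cost in USD of one ETL job, development endpoint, or crawler run."""
    # Round up to the nearest second, then apply the 10-minute minimum.
    billed_seconds = max(math.ceil(runtime_seconds), MIN_BILLED_SECONDS)
    dpu_hours = Decimal(dpus) * Decimal(billed_seconds) / Decimal(3600)
    return (dpu_hours * rate).quantize(Decimal("0.01"))


print(glue_dpu_cost(6, 10 * 60))   # 6 DPUs for 10 minutes -> 0.44
print(glue_dpu_cost(5, 24 * 60))   # 5 DPUs for 24 minutes -> 0.88
```

Because of the minimum, a job that finishes in 3 minutes costs the same as one that runs for the full 10 minutes on the same number of DPUs.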
If you ETL data from data sources such as Amazon S3, Amazon RDS, or Amazon Redshift, you are charged standard request and data transfer rates. If you use Amazon CloudWatch, you are charged standard rates for CloudWatch logs and CloudWatch events.
Data Catalog storage and requests
With the AWS Glue Data Catalog, you can store up to a million objects for free. If you store more than a million objects, you will be charged $1 per 100,000 objects over a million, per month. An object in the AWS Glue Data Catalog is a table, table version, partition, or database.
The first million access requests to the AWS Glue Data Catalog per month are free. If you exceed a million requests in a month, you will be charged $1 per million requests over the first million. Some of the common requests are CreateTable, CreatePartition, GetTable and GetPartitions.
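A minimal sketch of the Data Catalog charges, using the free tiers and $1 rates stated above. The function name catalog_monthly_cost is hypothetical, and the sketch assumes usage above the free tier is prorated linearly; whether partial blocks of 100,000 objects or partial millions of requests are rounded up is not stated in this article.

```python
from decimal import Decimal

FREE_OBJECTS = 1_000_000
FREE_REQUESTS = 1_000_000
STORAGE_RATE = Decimal("1.00")  # USD per 100,000 objects above the free tier
REQUEST_RATE = Decimal("1.00")  # USD per million requests above the free tier


def catalog_monthly_cost(objects_stored, requests):
    """Monthly Data Catalog cost in USD (linear proration is an assumption)."""
    extra_objects = max(objects_stored - FREE_OBJECTS, 0)
    extra_requests = max(requests - FREE_REQUESTS, 0)
    storage = STORAGE_RATE * Decimal(extra_objects) / Decimal(100_000)
    request = REQUEST_RATE * Decimal(extra_requests) / Decimal(1_000_000)
    return (storage + request).quantize(Decimal("0.01"))


print(catalog_monthly_cost(1_000_000, 2_000_000))  # 1M requests over -> 1.00
```

Note the different block sizes: storage is priced per 100,000 objects while requests are priced per million, so 200,000 extra objects cost twice as much as a million extra requests.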
Crawlers
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
ETL job example:
Consider an ETL job that runs for 10 minutes and consumes 6 DPUs. The price of 1 DPU-Hour is $0.44. Since your job ran for 1/6th of an hour and consumed 6 DPUs, you will be billed 6 DPUs * 1/6 hour at $0.44 per DPU-Hour or $0.44.
Development endpoint example:
Now let’s consider that you provision a development endpoint to connect your notebook to interactively develop your ETL code. A development endpoint is provisioned with 5 DPUs. If you keep the development endpoint running for 24 minutes or 2/5th of an hour, you will be billed for 5 DPUs * 2/5 hour at $0.44 per DPU-Hour or $0.88.
AWS Glue Data Catalog free tier example:
Let’s consider that you store a million tables in your AWS Glue Data Catalog in a given month and make a million requests to access these tables. You pay $0 because your usage will be covered under the AWS Glue Data Catalog free tier. You can store the first million objects and make a million requests per month for free.
AWS Glue Data Catalog example:
Now consider your storage usage remains the same at one million tables per month, but your requests double to two million requests per month. Let’s say you also use crawlers to find new tables and they run for 30 minutes and consume 2 DPUs.
Your storage cost is still $0, as the storage for your first million tables is free. Your first million requests are also free. You will be billed for one million requests above the free tier, which is $1. Crawlers are billed at $0.44 per DPU-Hour, so you will pay for 2 DPUs * 1/2 hour at $0.44 per DPU-Hour or $0.44. This is a total monthly bill of $1.44.