Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. Just point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there is no need for complex ETL jobs to prepare data for analysis, so anyone with SQL experience can analyze large-scale datasets quickly.
Athena integrates out of the box with the AWS Glue Data Catalog, which lets you create a unified metadata repository across multiple services, crawl data sources to discover schemas, populate the Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully managed ETL features to transform data or convert it to columnar formats to reduce costs and improve performance.
History of AWS Athena
- October 15, 2018
Added support for creating identity-based (IAM) policies that provide fine-grained access control to resources in the AWS Glue Data Catalog, such as databases and tables used in Athena.
Additionally, you can encrypt database and table metadata in the Data Catalog by adding specific policies to Athena.
- October 10, 2018
Added support for CREATE TABLE AS SELECT statements.
- September 6, 2018
Released the ODBC driver version 1.0.3, which streams results instead of fetching them in pages. This version also includes improvements, bug fixes, and updated documentation for “Using SSL with a Proxy Server”.
- August 16, 2018
Released the JDBC driver version 2.0.5 with default support for streaming results instead of fetching them in pages. For more information, see “Using Athena with the JDBC Driver”.
- August 7, 2018
Updated the documentation for querying Amazon Virtual Private Cloud flow logs, which can be stored directly in Amazon S3 in GZIP format.
Updated examples for querying ALB logs.
- June 5, 2018
Added support for views.
- May 17, 2018
Increased default query concurrency limits from five to twenty.
You can submit and run up to twenty DDL queries and twenty SELECT queries at a time.
- May 8, 2018
Added query tabs and the ability to configure auto-complete in the Query Editor.
- April 19, 2018
Released the JDBC driver version 2.0.2.
- April 6, 2018
Added auto-complete for typing queries in the Athena console.
- March 15, 2018
Added the ability to create Athena tables for CloudTrail log files directly from the CloudTrail console.
- February 2, 2018
Added the ability to securely offload intermediate data to disk for memory-intensive queries that use the GROUP BY clause.
This improves the reliability of such queries and prevents “Query resource exhausted” errors.
- January 19, 2018
Upgraded the underlying engine in Amazon Athena to a version based on Presto version 0.172.
- November 13, 2017
Added support for connecting Athena to the ODBC Driver.
- November 1, 2017
Added support for querying geospatial data, and for the Asia Pacific (Seoul), Asia Pacific (Mumbai), and EU (London) regions.
- October 19, 2017
Added support for EU (Frankfurt).
- October 3, 2017
Added support for creating named Athena queries with AWS CloudFormation.
- September 25, 2017
Added support for Asia Pacific (Sydney).
- September 5, 2017
Added querying of AWS service logs and of different types of data, including maps, arrays, nested data, and data containing JSON.
- August 14, 2017
Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena managed data catalog to the AWS Glue Data Catalog.
- August 4, 2017
Added support for Grok SerDe, which provides easier pattern matching for records in unstructured text files such as logs.
- June 22, 2017
Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore).
- June 8, 2017
Added support for EU (Ireland).
- May 19, 2017
Added an Amazon Athena API and AWS CLI support for Athena. Updated JDBC driver to version 1.1.0.
- April 4, 2017
Added support for Amazon S3 data encryption and released a JDBC driver update (version 1.0.1) with encryption support, improvements, and bug fixes.
- March 24, 2017
Added the AWS CloudTrail SerDe, improved performance, and fixed partition issues.
Improved performance when scanning a large number of partitions.
Improved performance of the MSCK REPAIR TABLE operation.
Added the ability to query Amazon S3 data stored in regions other than your primary region.
Standard inter-region data transfer rates for Amazon S3 apply in addition to standard Athena charges.
- February 20, 2017
Added support for the Avro SerDe and the OpenCSVSerDe for processing CSV, for the US East (Ohio) region, and for bulk editing of columns in the console wizard.
Improved performance on large Parquet tables.
- November, 2016
The initial release of the Amazon Athena User Guide.
Amazon Athena features
- Amazon Athena is serverless, so there is no infrastructure to manage. You don’t need to worry about configuration, software updates, failures or scaling your infrastructure as your datasets and number of users grow. Athena automatically takes care of all of this for you, so you can focus on the data, not the infrastructure.
To get started, log in to the Athena console, define your schema using the console wizard or by entering DDL statements, and immediately start querying using the built-in query editor. You can also use AWS Glue to automatically crawl data sources to discover data and populate your Data Catalog with new and modified table and partition definitions. Results are displayed in the console within seconds and are automatically written to a location of your choice in S3. You can also download them to your desktop. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
- Amazon Athena uses Presto, an open source, distributed SQL query engine optimized for low latency, ad hoc analysis of data. This means you can run queries against large datasets in Amazon S3 using ANSI SQL, with full support for large joins, window functions, and arrays. Athena supports a wide variety of data formats such as CSV, JSON, ORC, Avro, or Parquet. You can also connect to Athena from a wide variety of BI tools using Athena’s JDBC driver.
- With Amazon Athena, you pay only for the queries that you run. You are charged based on the amount of data scanned by each query. You can get significant cost savings and performance gains by compressing, partitioning, or converting your data to a columnar format, because each of those operations reduces the amount of data that Athena needs to scan to execute a query.
- With Amazon Athena, you don’t have to worry about managing or tuning clusters to get fast performance. Athena is optimized for fast performance with Amazon S3. Athena automatically executes queries in parallel, so that you get query results in seconds, even on large datasets.
- Amazon Athena is highly available and executes queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Athena uses Amazon S3 as its underlying data store, making your data highly available and durable. Amazon S3 provides durable infrastructure to store important data and is designed for durability of 99.999999999% of objects. Your data is redundantly stored across multiple facilities and multiple devices in each facility.
- Amazon Athena allows you to control access to your data by using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and Amazon S3 bucket policies. With IAM policies, you can grant IAM users fine-grained control over your S3 buckets. By controlling access to data in S3, you can restrict users from querying it using Athena. Athena also allows you to easily query encrypted data stored in Amazon S3 and write encrypted results back to your S3 bucket. Both server-side encryption and client-side encryption are supported.
- Amazon Athena integrates out-of-the-box with AWS Glue. With Glue Data Catalog, you will be able to create a unified metadata repository across various services, crawl data sources to discover data and populate your Data Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize query performance and reduce costs.
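The workflow described above (submit SQL, wait for it to finish, find the results in S3) can also be driven through the Athena API. The following is a minimal sketch, not a definitive implementation. It assumes a boto3 Athena client is passed in as `client`; the helper name `run_query` and the database, table, and bucket names in the usage comment are hypothetical.

```python
import time

# States after which Athena stops working on a query.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, poll until it finishes, and return (state, bytes_scanned).

    `client` is assumed to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files under this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            # Statistics may be missing for queries that never ran.
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   client = boto3.client("athena")
#   run_query(client, "SELECT COUNT(*) FROM my_table", "my_database",
#             "s3://my-results-bucket/athena/")
```

Passing the client in, rather than creating it inside the function, keeps the sketch free of credential handling and makes it easy to exercise with a stub.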
AWS Athena under the hood
- AWS Athena uses Presto, a SQL-on-Hadoop solution. Presto is a low-latency, distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
- Hive Metastore is a metadata and table management system designed for Hadoop and used for table abstraction (schema on read). Hive DDL functionality allows working with partitions, complex data types, and many data formats.
Pricing of AWS Athena
With Amazon Athena, you are charged only for the queries you run. Charges are based on the amount of data scanned by each query. You can achieve significant cost savings and performance gains by compressing, partitioning, or converting data to a columnar format, because each of these operations reduces the amount of data Athena needs to scan to execute a query.
$5 per TB of data scanned
You can save 30% to 90% on per-query costs and get better performance by compressing, partitioning, and converting data to columnar formats.
NOTE: You are charged for the number of bytes scanned by Amazon Athena, rounded up to the nearest megabyte, with a minimum of 10 MB per query. There are no charges for Data Definition Language (DDL) statements, such as CREATE / ALTER / DROP TABLE, partition management statements, or failed queries. Canceled queries are charged based on the amount of data scanned.
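As a rough illustration of the rounding rule in the note above, the sketch below estimates the charge for a single query at $5 per TB. It makes two assumptions that the text does not state: megabytes and terabytes are treated as binary units (1024-based), and a query that scans zero bytes is treated as free. The function names are illustrative only.

```python
PRICE_PER_TB_USD = 5.0      # rate quoted in the pricing section
BYTES_PER_MB = 1024 ** 2    # assumption: binary megabyte
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte
MIN_BILLED_MB = 10          # per-query minimum from the note


def billed_megabytes(bytes_scanned: int) -> int:
    """Round up to the next whole megabyte, then apply the 10 MB minimum."""
    if bytes_scanned <= 0:
        return 0  # assumption: nothing scanned (e.g. DDL) means nothing billed
    whole_mb = -(-bytes_scanned // BYTES_PER_MB)  # integer ceiling division
    return max(whole_mb, MIN_BILLED_MB)


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimated charge in US dollars for one query."""
    billed_bytes = billed_megabytes(bytes_scanned) * BYTES_PER_MB
    return billed_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for scanned in (1, 10 * BYTES_PER_MB + 1, BYTES_PER_TB):
        print(scanned, billed_megabytes(scanned), query_cost_usd(scanned))
```

Under these assumptions, a query that scans a single byte is billed as 10 MB, and a query that scans exactly one terabyte costs $5.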
Compressing your data means Athena has less data to scan. Converting your data into columnar formats allows Athena to selectively read only the columns needed to process the query.
Athena is compatible with Apache ORC and Apache Parquet. Partitioning your data also allows Athena to restrict the amount of data being scanned. This brings cost savings and improved performance. You can see the amount of data scanned per query on the Athena console.
There are no additional storage charges for querying data with Athena. You are charged S3’s standard rates for storage, requests, and data transfer. By default, query results are stored in an S3 bucket of your choice and are also billed at standard Amazon S3 rates.
If you use the AWS Glue Data Catalog with Athena, standard AWS Glue Data Catalog rates apply.