Introduction to Athena

Athena is a serverless, interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. It is used to query large amounts of data stored in S3 buckets and uses Presto as its query engine. Query results are written to an S3 results location, and Athena keeps a history of queries for 45 days. The data in S3 can be stored in CSV, JSON, or other formats; its metadata (databases, tables, partitions) lives in the AWS Glue Data Catalog, which acts as a Hive metastore. You pay per query: $5 per TB of data scanned.

To optimize Athena, we can use:

  1. Partitioning – Organize the data into folders (or subfolders), for example by date, so that queries only scan the partitions they need.
  2. Columnar storage – CSV and JSON are row-based. Columnar file formats such as Parquet or ORC let Athena read only the columns a query references (see the CTAS sketch after this list).
  3. Compression – Compress the data with formats like Zlib, GZIP, Snappy, etc., to reduce the amount of data scanned.
  4. File size – Use larger files, because Athena (Presto) works best with them. If your files are too small, the execution engine spends extra time on overhead: opening Amazon S3 files, listing directories, getting object metadata, setting up data transfers, reading file headers, and so on.
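For example, an existing row-based (CSV/JSON) table can be rewritten as a partitioned, Snappy-compressed Parquet copy with a CREATE TABLE AS SELECT (CTAS) statement. This is a minimal sketch: the table names mytable_json and mytable_parquet, the columns col1 and col2, and the output prefix under the bucket are hypothetical placeholders to adapt to your setup.

CREATE TABLE mytable_parquet
WITH (
  format = 'PARQUET',                              -- columnar storage
  parquet_compression = 'SNAPPY',                  -- compression
  external_location = 's3://myproject/parquet/',   -- hypothetical output location (must be empty)
  partitioned_by = ARRAY['partitiondate']          -- partition column(s) must come last in the SELECT
) AS
SELECT col1, col2, partitiondate
FROM mytable_json;

Besides switching the format and compression, a CTAS like this also consolidates many small source files into fewer, larger ones, which helps with the file-size point above.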

How does it work?

  1. Load the data into S3 -> define a schema pointing to the data stored in Amazon S3 -> add the partitions to the AWS Glue Data Catalog -> query the data
  2. Built on top of Apache Presto
  3. Works with AWS Glue
  4. ANSI SQL compliant

To query S3 data efficiently with Athena, the data should be partitioned using Hive-style paths (key=value).
Example: Suppose one partition is created in S3 per day, and each partition contains files/objects in JSON format.

partitiondate=2019-09-01
partitiondate=2019-09-02
partitiondate=2019-09-03
partitiondate=2019-09-04

Here, partitiondate is the partition key/column and 2019-09-01 is a partition value. Remember that the key/column name must be in lower case. Each partition contains several files/objects holding data in JSON format.
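In S3, the objects then sit under keys like the following (the myproject bucket name matches the table LOCATION used in the DDL below; the file names are just examples):

s3://myproject/partitiondate=2019-09-01/file1.json
s3://myproject/partitiondate=2019-09-02/file1.json
s3://myproject/partitiondate=2019-09-03/file1.json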
Example: file1.json

{
  "data": {
    "employees": [
      { "id": "empId_1", "name": "Ram", "salary": 200000, "subordinates": 10 },
      { "id": "empId_2", "name": "Shyam", "salary": 100000, "subordinates": 5 }
    ]
  }
}

Further, to query the data stored in Amazon S3, we need to create a table partitioned by 'partitiondate', as in the following DDL statement (run it in the Athena Query Editor):

CREATE EXTERNAL TABLE partitiondatetable (
  data struct<
    employees: array<
      struct<
        id: string,
        name: string,
        salary: int,
        subordinates: int
      >
    >
  >
)
PARTITIONED BY (partitiondate string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'data')
LOCATION 's3://myproject/';

This table uses Hive's native JSON serializer-deserializer (SerDe) to read the JSON data stored in Amazon S3.

The partitions under the table's S3 location need to be added to the catalog before we can query them. To add them, choose New Query and execute the following statement:

MSCK REPAIR TABLE partitiondatetable

The partitions are now registered in the Athena (Glue) catalog.
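Alternatively, if you only need to register specific partitions (MSCK REPAIR TABLE scans the whole location and can be slow when there are many partitions), you can add them explicitly. A sketch, using the table and bucket from the DDL above and one of the example dates:

ALTER TABLE partitiondatetable ADD IF NOT EXISTS
  PARTITION (partitiondate = '2019-09-01')
  LOCATION 's3://myproject/partitiondate=2019-09-01/';

You can check which partitions are registered with SHOW PARTITIONS partitiondatetable.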
Now, you can query the Amazon S3 data directly to get the results:

SELECT employees.name AS name,
       employees.salary AS salary,
       employees.subordinates AS subordinates
FROM partitiondatetable
CROSS JOIN UNNEST(data.employees) AS t(employees)
WHERE partitiondate > '2019-08-31'
  AND partitiondate < '2019-09-05'
GROUP BY employees.id, employees.name, employees.salary, employees.subordinates
LIMIT 2;

The output of the query executed above would be like this:

Name    Salary    Subordinates
Ram     200000    10
Shyam   100000    5
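To see how many employee records each daily partition contributes, you can also aggregate over the partition column itself; a simple sketch against the same table:

SELECT partitiondate,
       count(*) AS employee_rows
FROM partitiondatetable
CROSS JOIN UNNEST(data.employees) AS t(employees)
GROUP BY partitiondate
ORDER BY partitiondate;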