Querying S3 using AWS Athena

5.12.2019

Introduction to Athena

Athena is a serverless, interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. It is used to query large amounts of data stored in Amazon S3 buckets, and it uses Presto as its query engine. Query results are written to an S3 results location, and Athena keeps a history of queries for 45 days. Athena can query data stored in S3 in CSV, JSON, or other formats; the table metadata lives in the AWS Glue Data Catalog (a Hive-compatible metastore). You pay per query: $5 per TB of data scanned.

To optimize Athena we can use:

  1. Partitioning – create folders (or subfolders), for example date-wise, so queries scan only the partitions they need.
  2. Columnar storage – CSV and JSON are row-based; columnar file formats such as Parquet or ORC let Athena read only the columns a query touches.
  3. Compression – compress the data with codecs such as Snappy, Zlib, or GZIP to cut the bytes scanned.
  4. File size – use big files, because Athena (Presto) likes big files: if your files are too small, the execution engine spends extra time on the overhead of opening Amazon S3 files, listing directories, getting object metadata, setting up data transfers, reading file headers, and so on. (A CTAS sketch combining the first three points follows this list.)
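As a minimal sketch of the first three optimizations together, an Athena CREATE TABLE AS SELECT (CTAS) statement can rewrite an existing JSON-backed table into partitioned, Snappy-compressed Parquet. The source table name my_json_table and the output path are hypothetical:

CREATE TABLE employees_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    -- the output location must be an empty S3 prefix (hypothetical path)
    external_location = 's3://myproject/employees-parquet/',
    -- partition columns must come last in the SELECT below
    partitioned_by = ARRAY['partitiondate']
) AS
SELECT data, partitiondate
FROM my_json_table;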

How does it work?

  1. Load data into S3 -> define a schema pointing to the data stored in Amazon S3 -> add partitions to the AWS Glue Data Catalog -> query the data
  2. Built on top of Apache Presto
  3. Works with AWS Glue
  4. ANSI SQL compliant

To query S3 data using Athena, the data should be partitioned in Hive format.
Example: suppose one partition is created in S3 each day, and these partitions contain files/objects in JSON form.

      partitiondate=2019-09-01
      partitiondate=2019-09-02 
      partitiondate=2019-09-03
      partitiondate=2019-09-04

Here partitiondate is the partition key/column and 2019-09-01 is a partition value. Remember that the key/column must be in lower case. Each partition contains several JSON files/objects holding the data.
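On S3, the resulting object keys under the table location used below therefore look like this (the file names are hypothetical):

      s3://myproject/partitiondate=2019-09-01/file1.json
      s3://myproject/partitiondate=2019-09-01/file2.json
      s3://myproject/partitiondate=2019-09-02/file1.json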
Example: file1.json (pretty-printed here for readability)

{
    "data": {
      "employees": [
        {
          "id": "empId_1",
          "name": "Ram",
          "salary": 200000,
          "subordinates": 10
        },
        {
          "id": "empId_2",
          "name": "Shyam",
          "salary": 100000,
          "subordinates": 5
        }
      ]
    }
  }
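Note that the Hive JSON SerDe used in the DDL below expects each record to be serialized on a single line, so the object as actually stored in S3 should look like this:

{"data":{"employees":[{"id":"empId_1","name":"Ram","salary":200000,"subordinates":10},{"id":"empId_2","name":"Shyam","salary":100000,"subordinates":5}]}}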

Further, to query the data stored in Amazon S3, we need to create a table partitioned by ‘partitiondate’, as in the following DDL statement (run it in the Athena Query Editor):

CREATE EXTERNAL TABLE partitiondatetable (
    data struct<
        employees: array<struct<
            id: string,
            name: string,
            salary: int,
            subordinates: int
        >>
    >
)
PARTITIONED BY (partitiondate string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'data')
LOCATION 's3://myproject/';

This table uses the Hive JSON serializer/deserializer (SerDe) to read the JSON data stored in Amazon S3.

The partitions that exist in S3 are not yet known to the catalog, so they must be registered before we can query them. To load all partitions into the catalog, choose New Query and execute the following statement:

MSCK REPAIR TABLE partitiondatetable
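MSCK REPAIR TABLE scans the entire table location for Hive-style partitions. If only a single new partition has arrived (say, the next day's folder), you can also register it explicitly; a minimal sketch, using a hypothetical new day:

ALTER TABLE partitiondatetable ADD PARTITION (partitiondate = '2019-09-05')
LOCATION 's3://myproject/partitiondate=2019-09-05/';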

The partition metadata has now been loaded into the Athena catalog, and you can query the Amazon S3 data directly to get the results:

-- CROSS JOIN UNNEST flattens the employees array into one row per employee
SELECT
    employees.name AS name,
    employees.salary AS salary,
    employees.subordinates AS subordinates
FROM
    partitiondatetable
    CROSS JOIN
        UNNEST(data.employees) AS t(employees)
WHERE
    partitiondate > '2019-08-31'
    AND partitiondate < '2019-09-05'
LIMIT 2

The output of the query executed above would look like this:

Name     Salary    Subordinates
Ram      200000    10
Shyam    100000    5
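The same UNNEST pattern also supports aggregation. As a minimal sketch (the aliases employee_count and total_salary are my own), the following counts employees and sums salaries per partition day:

SELECT
    partitiondate,
    count(*) AS employee_count,
    sum(employees.salary) AS total_salary
FROM
    partitiondatetable
    CROSS JOIN
        UNNEST(data.employees) AS t(employees)
GROUP BY
    partitiondate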
