What is Amazon Athena and How It Works?

Analyzing data is a multi-step process, and many tools exist to simplify it. Amazon Athena is one of them. It is a serverless analytics service that lets users query data in Amazon S3 using standard SQL syntax.

As a leader in the world of cloud computing, AWS offers a wide range of services that aim to deliver competitive performance at a lower cost than running the same workloads on on-premises infrastructure.

Athena belongs to the AWS analytics domain and focuses on retrieving static data stored in S3 buckets using standard SQL statements. Because it is serverless, there is no infrastructure to manage, which makes it a robust tool for gaining insights from data stored in S3.

What is Amazon Athena?

Amazon launched Athena on 30th November 2016 as a serverless query service meant to simplify the analysis of data stored in Amazon S3 using standard SQL. With a few clicks in the AWS Management Console, customers can point Athena at their data in S3 and run standard SQL queries that return results in seconds. There is no infrastructure to set up or manage, and customers pay only for the queries they run. Athena scales automatically and executes queries in parallel, which gives quick results even with huge datasets and complex queries.

Athena uses a distributed SQL engine called Presto to run SQL queries. For defining tables it relies on Apache Hive, a popular open-source technology that can describe structured, semi-structured, and unstructured data. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets that reside in distributed storage using SQL.

A typical data pipeline is simple: data is fetched from different sources and dumped into S3 buckets. This is raw data, which means no transformations have been applied to it yet. At this stage, Athena can connect to the data in S3 and analyze it in place. You do not need to set up any database or external tools to query the raw data. Once the analysis has produced the desired results, an EMR cluster can run the more complex analytical transformations that clean, process, and store the data.
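Pointing Athena at S3 data from code rather than the console can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes an Athena client such as the one returned by boto3's `boto3.client("athena")`, whose `start_query_execution` call accepts the parameters shown. The function names, database name, and bucket are hypothetical.

```python
from typing import Any, Dict


def build_query_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build the keyword arguments for an Athena start_query_execution call."""
    # Athena writes query results to S3, so the output location must be an S3 URI.
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(athena_client: Any, sql: str, database: str, output_location: str) -> str:
    """Submit the query through the given client and return its execution ID."""
    # athena_client is assumed to behave like boto3.client("athena").
    response = athena_client.start_query_execution(
        **build_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

The client is passed in as an argument so that the request-building logic can be checked without AWS credentials. The query runs asynchronously, so the returned ID would then be used to poll for completion and fetch results.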

Why Should You Use Athena?

An Athena user can query data encrypted with keys managed by AWS Key Management Service (KMS) and can also encrypt the query results. Athena also allows cross-account access to S3 buckets owned by another account. It uses a managed data catalog to store the schemas and other metadata for the data it queries in Amazon S3.

All in all, this interactive query service is an analytical tool that helps organizations quickly analyze important data stored in Amazon S3. It can process unstructured, semi-structured, and structured data sets. With Athena, it is possible to create dynamic queries for data sets. It works with AWS Glue, which provides a better way to store metadata about the data in S3.

Using AWS CloudFormation with Athena, you can create named queries, which let you save a specific query under a name and then call it by that name. Data scientists and developers can use the service to take a quick look at a table by running a query against it. Through the Athena JDBC driver, data can also be fetched from S3 and loaded into other data stores for use cases such as log analysis and data warehousing.

Working of AWS Athena

Amazon Athena works directly against the data in S3. It uses a distributed SQL engine to run queries and Apache Hive to create and alter tables and partitions. Some of the important prerequisites for working with Athena include:

  1. You must have an AWS account.
  2. You should enable your account to export the cost and usage data into an S3 bucket.
  3. You can prepare buckets for Athena to connect to.
  4. AWS creates manifest files from metadata each time it writes to the bucket. It also creates a folder named Athena within the AWS billing data bucket that contains only the data.
  5. To simplify the setup, the US-West-2 region can be used.
  6. The final step is downloading the credentials for the new user, because these credentials map directly to the database credentials.
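Once the bucket is prepared, the Hive-style table definition mentioned above might look roughly like the statement generated below. This is only a sketch: the database, table, column names, and bucket are hypothetical, and `STORED AS PARQUET` assumes the exported files are in Parquet format.

```python
from typing import Dict


def build_external_table_ddl(
    database: str,
    table: str,
    columns: Dict[str, str],
    partitions: Dict[str, str],
    s3_location: str,
) -> str:
    """Return a Hive-style CREATE EXTERNAL TABLE statement for data in S3."""
    column_sql = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns.items())
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (",
        column_sql,
        ")",
    ]
    if partitions:
        partition_sql = ", ".join(
            f"`{name}` {dtype}" for name, dtype in partitions.items()
        )
        lines.append(f"PARTITIONED BY ({partition_sql})")
    # Assumption: the files under s3_location are Parquet.
    lines.append("STORED AS PARQUET")
    lines.append(f"LOCATION '{s3_location}'")
    return "\n".join(lines)


# Hypothetical names, for illustration only.
ddl = build_external_table_ddl(
    database="billing",
    table="cost_usage",
    columns={"service": "string", "operation": "string", "cost": "double"},
    partitions={"year": "string", "month": "string"},
    s3_location="s3://example-billing-bucket/athena/",
)
print(ddl)
```

The table is external, so dropping it later removes only the metadata and leaves the files in S3 untouched.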

Amazon also offers a drag-and-drop tool called Cost Explorer, which comes with a set of pre-built reports such as monthly service cost and reserved instance usage. If you are curious, you can try to recreate those reports, for example costs by service and operation, with your own queries. This is entirely possible. You can slice the raw data to compute growth rates, build histograms, compute scores, and so on.

Some of the additional considerations to note while working with Amazon Athena include:

Pricing Model

Athena is priced at $5 per terabyte of data scanned from S3, rounded up to the nearest megabyte, with a minimum of 10 MB per query.
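The pricing rule can be turned into a small estimator. This is a rough sketch under stated assumptions: it takes 1 MB as 1024 × 1024 bytes and 1 TB as 1024 × 1024 MB, rounds upward, and treats a query that scans nothing as free. The function name is hypothetical, and actual billing may differ.

```python
BYTES_PER_MB = 1024 ** 2      # assumption: binary megabytes
MB_PER_TB = 1024 ** 2         # assumption: binary terabytes
PRICE_PER_TB_USD = 5.0        # price per terabyte scanned, as quoted above
MINIMUM_BILLED_MB = 10        # per-query minimum, as quoted above


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scanned."""
    if bytes_scanned <= 0:
        # Assumption: a query that scans no data is not charged.
        return 0.0
    # Round up to the next whole megabyte using integer arithmetic.
    scanned_mb = -(-bytes_scanned // BYTES_PER_MB)
    billed_mb = max(scanned_mb, MINIMUM_BILLED_MB)
    return billed_mb / MB_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, a query that scans a single byte is billed the same as one that scans 10 MB, which is why many tiny queries can cost more than expected.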

Reducing Cost

The trick is to reduce the amount of data scanned, which can be done in three ways: compressing the data, using columnar data formats, and partitioning the data.
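Partitioning, the third of these, is commonly expressed as Hive-style `key=value` directories in the S3 object key. The helper below is a hypothetical illustration of such a layout. The prefix, the partition column names, and the file name are assumptions rather than anything Athena prescribes.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, region: str, filename: str) -> str:
    """Build an S3 object key that uses Hive-style key=value partition directories."""
    return (
        f"{prefix.strip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
        f"/region={region}/{filename}"
    )


# Hypothetical prefix and file name, for illustration only.
print(partitioned_key("logs", date(2024, 1, 15), "us-west-2", "part-0000.parquet"))
```

A query that filters on `year`, `month`, `day`, or `region` can then skip every directory that does not match, so less data is scanned.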

Features of Athena

Among the many services AWS provides, Athena stands out for data analysis. Some of its features include:

  • Quick Implementation

Amazon Athena does not need installation. It can be accessed directly from the AWS Console or through the AWS CLI.

  • Serverless

It is serverless, so the end user does not have to worry about configuration, infrastructure, scaling, or failure. Athena takes care of all of it.

  • Pay Per Query

Athena charges only for the queries you run, based on the amount of data scanned per query. You can save a lot by compressing the data and storing it in a suitable format.

  • Secure

Using IAM policies and AWS identities, Amazon Athena offers complete control over access to the data set. Because the data is stored in S3 buckets, IAM policies can also manage which users have access to it.

  • Available

Amazon Athena is highly available and the users can execute queries round the clock.

  • Quick

Amazon Athena is a quick analytics tool. It handles complex queries in less time by breaking them into simpler parts, running those parts in parallel, and combining the results into the desired output.

  • Integration

One of the best features of Athena is its easy integration with AWS Glue, which helps users create a unified metadata repository. This also enables better versioning of data, along with better tables, views, etc.

  • Federated Queries

Athena Federated Query allows Athena to run SQL queries across relational, non-relational, object, and custom data sources.

  • Machine Learning

Developers can use Amazon SageMaker machine learning models from within Amazon Athena, invoking a deployed model inside a SQL query to run inference.

Optimizing Techniques for AWS Athena

While working with cloud services, one needs to make sure that the services used consume the fewest possible resources and deliver the best result in a cost-effective manner. Many measures can be taken to optimize queries in AWS Athena so that overall performance improves and cost stays in check. Some of the common optimization techniques for Amazon Athena are:

Partitioning the Data in S3

Partitioning is one of the most common practices for storing data in S3. It creates separate directories based on major dimensions such as date and region. You can partition by year, month, and even day, storing files under each day’s directory. You can also partition by region, so that data for the same region is stored under one directory. With partitioning, Athena scans less data per query, which makes the entire job quicker and cheaper.

Data Compression Techniques

Compression costs CPU time, because data is compressed when it is written and decompressed while a query runs. In exchange, less data is scanned. Among the options available, Apache Parquet and Apache ORC are the most popular choices for Athena. Both are columnar file formats that compress the data with default algorithms.

Streamlining JOIN Conditions Within Queries

When querying data across multiple dimensions, it is often necessary to join two tables to carry out the analysis. Joining looks simple but can be complex at times. It is recommended to keep the larger table on the left side of the join and the smaller one on the right. This way, the data processing engine can distribute the smaller right-hand table to the worker nodes and then stream the data from the left table while joining the two.
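The join-ordering advice can be sketched as a small query builder that puts the larger table on the left. This is an illustrative sketch only: the table names, sizes, and column names are hypothetical, and in practice the sizes would come from your own knowledge of the data.

```python
from typing import Dict, Sequence


def build_join(table_sizes: Dict[str, int], join_column: str, columns: Sequence[str]) -> str:
    """Build a two-table JOIN with the larger table on the left (alias l)."""
    if len(table_sizes) != 2:
        raise ValueError("exactly two tables are expected")
    # Sort table names by size, largest first, so the big table is streamed
    # and the small one can be distributed to the worker nodes.
    large, small = sorted(table_sizes, key=table_sizes.get, reverse=True)
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM {large} AS l\n"
        f"JOIN {small} AS r\n"
        f"  ON l.{join_column} = r.{join_column}"
    )


# Hypothetical tables and approximate sizes in bytes.
sql = build_join(
    {"regions": 20_000, "orders": 5_000_000_000},
    join_column="region_id",
    columns=["l.order_id", "r.region_name"],
)
print(sql)
```

Whatever order the tables are passed in, the larger `orders` table ends up on the left and the small `regions` table on the right.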

Using Selected Columns in Query

This is another important optimization technique that can greatly reduce the time and money spent on Athena queries. It is always advisable to explicitly name the columns needed for the analysis in the SELECT statement rather than selecting every column with SELECT *.
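One way to enforce this habit in code is a query helper that refuses to select every column. The helper and the table and column names below are hypothetical, shown only to illustrate the idea.

```python
from typing import Optional, Sequence


def build_select(table: str, columns: Sequence[str], limit: Optional[int] = None) -> str:
    """Build a SELECT statement that names its columns explicitly."""
    if not columns or any(column.strip() == "*" for column in columns):
        raise ValueError("list the columns you need instead of selecting all of them")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


# Hypothetical table and columns.
print(build_select("cost_usage", ["service", "cost"], limit=100))
```

With columnar formats such as Parquet or ORC, naming only the needed columns means the other columns do not have to be read at all.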

Optimize Pattern Matching Technique in Query

It is often necessary to query data based on patterns rather than an exact keyword. In SQL, one easy way to do this is the LIKE operator, where you specify a pattern and the query fetches the rows that match it. In Amazon Athena, one can use regular expressions instead, for example through the regexp_like function, which tends to be faster, particularly when it replaces several LIKE conditions on the same column.
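The difference between the two styles is easiest to see side by side. The sketch below generates both predicates for the same set of search words. It assumes the words are plain alphanumeric strings, so no escaping is needed, and the column name is hypothetical. `regexp_like` is the function name used by the Presto engine.

```python
from typing import Sequence


def _check_words(words: Sequence[str]) -> None:
    # Assumption: only plain alphanumeric words are supported, so there are no
    # regex metacharacters or quotes that would need escaping.
    if not words or not all(word.isalnum() for word in words):
        raise ValueError("words must be non-empty and alphanumeric")


def chained_like(column: str, words: Sequence[str]) -> str:
    """Build the LIKE version: one condition per word, joined with OR."""
    _check_words(words)
    return " OR ".join(f"{column} LIKE '%{word}%'" for word in words)


def regexp_like_predicate(column: str, words: Sequence[str]) -> str:
    """Build the regular-expression version: a single regexp_like call."""
    _check_words(words)
    return f"regexp_like({column}, '{'|'.join(words)}')"


print(chained_like("message", ["timeout", "refused"]))
print(regexp_like_predicate("message", ["timeout", "refused"]))
```

Both predicates match rows whose column contains any of the words, but the second expresses the whole check as a single pattern.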

Conclusion

With data becoming an important part of a company’s development, gaining insights from that data has become all the more important. With public cloud providers offering managed analytics services such as Amazon Athena, many businesses can get more insights without the complications that come with other analytics tools.

As a serverless service, Amazon Athena makes data queries easy to set up, easy to use, and fast to run. Its pay-per-use model also makes running analytics affordable. Moreover, since Athena works with Amazon S3 and benefits from its scalability, reliability, and durability, it is well suited to analytics workloads.

In case you need any support in the implementation and use of Amazon Athena, feel free to get in touch with our consultants at Encaptechno. We have a trained team to offer you extensive support all through your journey with Amazon Athena.