AWS Glue is a fully managed, serverless ETL (extract, transform, load) service. It makes it easy to discover, transform, and load data for consumption by downstream processes and applications. To learn more about AWS Glue, please refer to the video on AWS Glue Overview.
In this article, we will walk through a basic end-to-end CSV-to-Parquet transformation using AWS Glue. The solution uses several services: IAM, S3, and AWS Glue. Within AWS Glue, we will use crawlers, the Data Catalog (databases and tables), and ETL jobs.
Architecture
Let’s understand the above flow.
- Create a crawler, which will connect to the S3 data store
- After a successful connection, the crawler infers the structure (schema) of the CSV file using a built-in classifier
- The crawler writes the metadata as a table in the AWS Glue Data Catalog
- After the Data Catalog is populated, create an ETL job to transform the CSV into Parquet
- The data source for the…
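
The crawler steps above can also be sketched programmatically. The snippet below is a minimal illustration using boto3's Glue client (`create_crawler` and `start_crawler`), not the exact setup used in this article. The crawler name, IAM role ARN, database name, and S3 path are all hypothetical placeholders that you would replace with your own values.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str
) -> Dict[str, Any]:
    """Build the keyword arguments for the Glue `create_crawler` call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes to read S3
        "DatabaseName": database,    # Data Catalog database that receives the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler and start it (requires AWS credentials)."""
    # Imported here so the request builder works without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders, not taken from the article.
request = build_crawler_request(
    name="csv-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-bucket/input/csv/",
)
```

With AWS credentials configured, calling `create_and_start_crawler(request)` would create the crawler and run it. Once the run finishes, the inferred table should appear in the chosen Data Catalog database.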