AWS Glue: How it Works? Serverless Data Integration
What is AWS Glue?
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue’s design is well suited to working with semi-structured data. Here we are going to discuss how AWS Glue works for enterprise data maintenance.
When should we use AWS Glue?
We can use AWS Glue to organize, cleanse, validate, and format data for storage in a data warehouse or data lake. It also lets us transform AWS Cloud data and move it into our data lake. We can also load data from disparate static or streaming data sources into our data warehouse or data lake for regular reporting and analysis.
By storing data in a data warehouse or data lake, we integrate information from different parts of our business and provide a shared data source for decision-making and analysis.
Data Sources that AWS Glue Supports
AWS Glue supports the following data stores:
- Amazon S3
- Amazon Relational Database Service (Amazon RDS)
- Third-party JDBC accessible databases
- Amazon DynamoDB
Data Streams Supported by AWS Glue
- Amazon Kinesis data streams
- Apache Kafka data streams
AWS Glue Environment
AWS Glue calls API operations to transform our data, create runtime logs, store users’ job logic, and create notifications that help users monitor their job runs.
Users define AWS Glue jobs to accomplish the work required: extracting data from a data source, transforming it, and loading it into a data target. For data store sources, the user defines a crawler to populate the AWS Glue Data Catalog with metadata table definitions.
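As a rough sketch of how defining and starting such a job might look from Python, the snippet below builds the arguments for the `create_job` and `start_job_run` calls of the boto3 Glue client. It assumes boto3 is installed and AWS credentials are configured. The job name, IAM role, script location, Glue version, and worker settings are all hypothetical placeholders.

```python
def build_job_definition(name, role_arn, script_s3_path):
    """Return keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL job; "pythonshell" selects a Python shell job.
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumed version, adjust as needed
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


def create_and_start_job(definition, run_arguments=None):
    """Create the job, start one run, and return that run's current state."""
    import boto3  # imported lazily so the builder above works without AWS access

    glue = boto3.client("glue")
    glue.create_job(**definition)
    run = glue.start_job_run(
        JobName=definition["Name"], Arguments=run_arguments or {}
    )
    status = glue.get_job_run(JobName=definition["Name"], RunId=run["JobRunId"])
    return status["JobRun"]["JobRunState"]


# Hypothetical names, for illustration only.
job = build_job_definition(
    name="sales-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_s3_path="s3://example-bucket/scripts/sales_etl.py",
)
```

Keeping the definition as a plain dictionary makes it easy to review or version before anything is sent to AWS; only `create_and_start_job` actually talks to the service.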
AWS Glue is presented as faster, cheaper, and easier to use. Migrating to AWS Glue is said to be 10x faster, and because it is serverless, users do not need to worry about provisioning any cluster or server.
AWS Glue Usage
- To build a data warehouse to organize, cleanse, validate, and format data.
- To run serverless queries against an Amazon S3 data lake.
- To create event-driven ETL pipelines.
- To understand data assets.
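One common way to make a pipeline event-driven is to have an AWS Lambda function start a Glue job whenever a new object lands in S3. The sketch below assumes that setup. The job name and the `--input_path` argument are hypothetical, and the sample event is a trimmed-down version of the S3 notification format.

```python
from urllib.parse import unquote_plus

JOB_NAME = "sales-etl"  # hypothetical Glue job name


def job_arguments_from_s3_event(event):
    """Turn the first record of an S3 notification into Glue job arguments."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 notifications.
    key = unquote_plus(record["object"]["key"])
    # Glue job argument names are conventionally prefixed with "--".
    return {"--input_path": f"s3://{bucket}/{key}"}


def lambda_handler(event, context):
    """Lambda entry point: start one Glue job run for the incoming S3 event."""
    import boto3  # imported lazily so the parsing logic can be tested offline

    glue = boto3.client("glue")
    run = glue.start_job_run(
        JobName=JOB_NAME, Arguments=job_arguments_from_s3_event(event)
    )
    return run["JobRunId"]


# Trimmed-down sample event, for illustration only.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/2024/sales+report.csv"},
            }
        }
    ]
}
args = job_arguments_from_s3_event(sample_event)
```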
AWS Glue Benefits for Enterprise
- Cost-Effective
- Less Hassle
- Easy Management
- Superior Functionality
Glue Data Catalog
AWS Glue has a Data Catalog that holds all the metadata, organized into databases and tables.
AWS Glue Crawler
The crawler connects to a particular data store to retrieve data; the store can be Amazon S3, RDS, Redshift, DynamoDB, or any other source reachable over a JDBC connection. The crawler then crawls through the data. For example:
Suppose an enterprise stores its data in a CSV file in S3 with around 100 million rows. The crawler infers the file’s schema, creates the table, and stores its definition in the Data Catalog. The Data Catalog can then integrate with a service that queries S3 to run the organization’s SQL queries for data analysis.
The AWS Glue Data Catalog can act as a centralized metadata repository. The catalog is not a database that holds the data itself; it stores only table metadata such as table names, column names, and data types. This metadata is used to create tables in Amazon Athena. With Athena, users can run SQL queries to perform data analysis on their organizational data.
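To make the idea of schema inference concrete, here is a deliberately simplified illustration in plain Python. It is not how Glue's built-in classifiers work internally; it only shows the general approach of sampling rows and picking the narrowest type that fits every sampled value in a column. The type names and the sample data are assumptions made for the example.

```python
import csv
import io


def _fits(value, cast):
    """Return True if value can be converted with cast (int or float)."""
    try:
        cast(value)
        return True
    except (ValueError, TypeError):
        return False


def infer_schema(csv_text, sample_rows=100):
    """Guess a type (bigint, double, or string) per column from a row sample."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, value in row.items():
            if name in columns:  # ignore stray extra fields
                columns[name].append(value)

    schema = {}
    for name, values in columns.items():
        if values and all(_fits(v, int) for v in values):
            schema[name] = "bigint"
        elif values and all(_fits(v, float) for v in values):
            schema[name] = "double"
        else:
            schema[name] = "string"
    return schema


sample = "order_id,amount,country\n1,19.99,US\n2,5,DE\n"
schema = infer_schema(sample)
```

Because only a sample is inspected, a 100-million-row file does not have to be read in full to produce a table definition, which is the practical point of the crawler step.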
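A short sketch of how the two services fit together with boto3: `get_table` on the Glue client returns the stored metadata, and `start_query_execution` on the Athena client runs SQL against the table that metadata describes. The database, table, and output bucket names are hypothetical, and the sample response is a hand-written, trimmed-down stand-in for a real `get_table` response.

```python
def columns_from_table(get_table_response):
    """Extract (name, type) pairs from a Glue get_table response."""
    columns = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in columns]


def build_athena_query(database, sql, output_s3_path):
    """Return keyword arguments for the Athena client's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_path},
    }


def run_query(database, table, sql, output_s3_path):
    """Look up the table's columns, then submit the query to Athena."""
    import boto3  # lazy import: the helpers above need no AWS access

    glue = boto3.client("glue")
    athena = boto3.client("athena")
    cols = columns_from_table(glue.get_table(DatabaseName=database, Name=table))
    started = athena.start_query_execution(
        **build_athena_query(database, sql, output_s3_path)
    )
    return cols, started["QueryExecutionId"]


# Hand-written stand-in for a get_table response (hypothetical table).
fake_response = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ]
        },
    }
}
cols = columns_from_table(fake_response)
query = build_athena_query(
    "sales_db", "SELECT COUNT(*) FROM orders", "s3://example-bucket/athena-results/"
)
```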
Glue ETL Jobs
- Extract, Transform and Load
- Leverage Spark
- Can be authored using Python or Scala
- Serverless
AWS Glue Components
Extract, Transform and Load
- Serverless Execution
- Uses Apache Spark / Python shell
- Interactive Development & Auto-generate ETL code
Glue Data Catalog
- Apache Hive metastore compatible
- Many integrated analytic services
Crawlers
- Load and maintain the Data Catalog
- Infer metadata schema, table structure
- Supports schema evolution
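The crawler points above map fairly directly onto the `create_crawler` call of the boto3 Glue client. The sketch below is one possible configuration, not a recommended default: the crawler name, IAM role, database, and S3 path are hypothetical, and the `SchemaChangePolicy` values are just one way to let the catalog follow schema changes.

```python
def build_crawler_definition(name, role_arn, database, s3_path):
    """Return keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        # Catalog database that will receive the inferred table definitions.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Schema evolution: update changed tables, mark removed ones as deprecated.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


def create_and_run_crawler(definition):
    """Create the crawler and start a single crawl."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])


# Hypothetical names, for illustration only.
crawler = build_crawler_definition(
    name="orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/orders/",
)
```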
Workflow Management
- Orchestrate triggers, crawlers, and jobs
- Build and monitor complex flows
- Reliable execution
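As an illustration of orchestrating triggers, crawlers, and jobs, the sketch below describes a two-step workflow: an on-demand trigger starts a crawler, and a conditional trigger starts a job once the crawl has succeeded. It uses the `create_workflow`, `create_trigger`, and `start_workflow_run` calls of the boto3 Glue client; the workflow, crawler, and job names are hypothetical and are assumed to exist already.

```python
def build_workflow_triggers(workflow, crawler, job):
    """Return create_trigger arguments for a crawl-then-transform workflow."""
    start = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler}],
    }
    after_crawl = {
        "Name": f"{workflow}-after-crawl",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        # Fire only when the crawler has finished successfully.
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job}],
    }
    return [start, after_crawl]


def create_and_run_workflow(workflow, triggers):
    """Create the workflow and its triggers, then start one workflow run."""
    import boto3  # imported lazily so the builder works without AWS access

    glue = boto3.client("glue")
    glue.create_workflow(Name=workflow)
    for trigger in triggers:
        glue.create_trigger(**trigger)
    return glue.start_workflow_run(Name=workflow)["RunId"]


# Hypothetical names, for illustration only.
triggers = build_workflow_triggers("orders-pipeline", "orders-crawler", "sales-etl")
```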
Author: SVCIT Editorial
Copyright Silicon Valley Cloud IT, LLC.