Data Engineering is all about building Data Pipelines to get data from multiple sources into Data Lakes or Data Warehouses, and then from Data Lakes or Data Warehouses to downstream systems. As part of this course, I will walk you through how to build Data Engineering Pipelines using the AWS Data Analytics Stack. It includes services such as Glue, Elastic MapReduce (EMR), Lambda Functions, Athena, Kinesis, and many more.
Here are the high-level steps which you will follow as part of the course.
Setup Development Environment
Getting Started with AWS
Storage - All about AWS s3 (Simple Storage Service)
User Level Security - Managing Users, Roles, and Policies using IAM
Infrastructure - AWS EC2 (Elastic Compute Cloud)
Data Ingestion using AWS Lambda Functions
Overview of AWS Glue Components
Setup Spark History Server for AWS Glue Jobs
Deep Dive into AWS Glue Catalog
Exploring AWS Glue Job APIs
AWS Glue Job Bookmarks
Development Life Cycle of Pyspark
Getting Started with AWS EMR
Deploying Spark Applications using AWS EMR
Streaming Pipeline using AWS Kinesis
Consuming Data from AWS s3 using boto3 ingested using AWS Kinesis
Populating GitHub Data to AWS Dynamodb
Overview of Amazon AWS Athena
Amazon AWS Athena using AWS CLI
Amazon AWS Athena using Python boto3
Getting Started with Amazon AWS Redshift
Copy Data from AWS s3 into AWS Redshift Tables
Develop Applications using AWS Redshift Cluster
AWS Redshift Tables with Distkeys and Sortkeys
AWS Redshift Federated Queries and Spectrum
Here are the details of what you will learn as part of this course. We will cover most of the commonly used services available under AWS Data Analytics, with hands-on practice.
Getting Started with AWS
As part of this section, you will be going through the details related to getting started with AWS.
Introduction - AWS Getting Started
Create s3 Bucket
Create
Storage - All about AWS S3 (Simple Storage Service)
All IT professionals who would like to work on AWS should be familiar with AWS S3. We will get into quite a few common features related to AWS S3 in this section.
Getting Started with AWS S3
Setup Data Set locally to upload to AWS s3
Adding AWS S3 Buckets and Managing Objects (files and folders) in AWS s3 buckets
Version Control for AWS S3 Buckets
Cross-Region Replication for AWS S3 Buckets
Overview of AWS S3 Storage Classes
Overview of AWS S3 Glacier
Managing AWS S3 using
User Level Security - Managing Users, Roles, and Policies using IAM
As part of this section, you will understand the details related to AWS IAM users, groups, and roles, as well as policies.
Creating
Infrastructure - AWS EC2 (Elastic Compute Cloud)
As part of this section, we will go through some of the basics related to
Getting Started with AWS EC2
Create
Getting Started with AWS EC2
Understanding
Data Ingestion using AWS Lambda Functions
In this section, we will understand how we can develop and deploy Lambda functions using Python as a programming language. We will also see how to maintain a bookmark or checkpoint using S3.
Hello World using AWS Lambda
Setup Project for local development of AWS Lambda Functions
Deploy Project to AWS Lambda console
Develop download functionality using requests for AWS Lambda Functions
Using 3rd party libraries in AWS Lambda Functions
Validating AWS s3 access for local development of AWS Lambda Functions
Develop upload functionality to s3 using AWS Lambda Functions
Validating AWS Lambda Functions using AWS Lambda Console
Run AWS Lambda Functions using AWS Lambda Console
Validating files incrementally downloaded using AWS Lambda Functions
Reading and Writing Bookmark to s3 using AWS Lambda Functions
Maintaining Bookmark on s3 using AWS Lambda Functions
Review the incremental upload logic developed using AWS Lambda Functions
Deploying AWS Lambda Functions
Schedule AWS Lambda Functions using AWS Event Bridge
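The bookmark idea in this section can be sketched as follows. This is a minimal illustration rather than the course's actual code: the function names, the JSON layout of the bookmark object, and the assumption that file names sort in processing order are all hypothetical. The `s3_client` argument is expected to behave like `boto3.client("s3")`.

```python
import json


def read_bookmark(s3_client, bucket, key, default=None):
    """Return the last processed file name stored in S3, or default if none exists."""
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
    except s3_client.exceptions.NoSuchKey:
        # First run: no bookmark object has been written yet.
        return default
    payload = json.loads(response["Body"].read().decode("utf-8"))
    return payload["last_processed"]


def write_bookmark(s3_client, bucket, key, last_processed):
    """Persist the name of the last successfully processed file."""
    body = json.dumps({"last_processed": last_processed}).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


def files_after_bookmark(candidate_files, bookmark):
    """Keep only files that sort after the bookmark (all files if there is none).

    Assumption: file names sort lexicographically in processing order.
    """
    ordered = sorted(candidate_files)
    if bookmark is None:
        return ordered
    return [name for name in ordered if name > bookmark]
```

Inside a Lambda handler, one would typically read the bookmark, process `files_after_bookmark(...)`, and write the bookmark only after each upload succeeds, so that a failed run can be retried without skipping files.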
Overview of AWS Glue Components
In this section, we will get a broad overview of all important Glue Components such as Glue Crawler, Glue Databases, Glue Tables, etc. We will also understand how to validate Glue tables using AWS Athena. AWS Glue (especially Glue Catalog) is one of the key components in the realm of AWS Data Analytics Services.
Introduction - Overview of AWS Glue Components
Create AWS Glue Crawler and AWS Glue Catalog Database as well as Table
Analyze Data using AWS Athena
Creating AWS S3 Bucket and Role to create AWS Glue Catalog Tables using Crawler on the s3 location
Create and Run the AWS Glue Job to process data in AWS Glue Catalog Tables
Validate using AWS Glue Catalog Table and by running queries using AWS Athena
Create and Run AWS Glue Trigger
Create AWS Glue Workflow
Run AWS Glue Workflow and Validate
Setup Spark History Server for AWS Glue Jobs
AWS Glue uses Apache Spark under the hood to process the data. It is important that we set up the Spark History Server for AWS Glue Jobs to troubleshoot any issues.
Introduction - Spark History Server for AWS Glue
Setup Spark History Server on AWS
Clone AWS Glue Samples repository
Build AWS Glue Spark UI Container
Update AWS IAM Policy Permissions
Start AWS Glue Spark UI Container
Deep Dive into AWS Glue Catalog
AWS Glue has several components, but the most important ones are AWS Glue Crawlers, Databases, and Catalog Tables. In this section, we will go through some of the most important and commonly used features of the AWS Glue Catalog.
Prerequisites for AWS Glue Catalog Tables
Steps for Creating AWS Glue Catalog Tables
Download Data Set to use to create AWS Glue Catalog Tables
Upload data to s3 to crawl using AWS Glue Crawler to create required AWS Glue Catalog Tables
Create AWS Glue Catalog Database - itvghlandingdb
Create AWS Glue Catalog Table - ghactivity
Running Queries using AWS Athena - ghactivity
Crawling Multiple Folders using AWS Glue Crawlers
Managing AWS Glue Catalog using AWS CLI
Managing AWS Glue Catalog using Python Boto3
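As a rough sketch of managing the Glue Catalog from Python, the function below lists every table in a catalog database by following `NextToken` across pages. The function name is hypothetical, and `glue_client` is assumed to behave like `boto3.client("glue")`.

```python
def list_glue_tables(glue_client, database_name):
    """Return the names of all tables in a Glue Catalog database."""
    table_names = []
    kwargs = {"DatabaseName": database_name}
    while True:
        response = glue_client.get_tables(**kwargs)
        table_names.extend(table["Name"] for table in response["TableList"])
        next_token = response.get("NextToken")
        if not next_token:
            # No further pages to fetch.
            return table_names
        kwargs["NextToken"] = next_token
```

With real credentials this might be called as `list_glue_tables(boto3.client("glue"), "itvghlandingdb")`, using the database name from this section.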
Exploring AWS Glue Job APIs
Once we deploy AWS Glue jobs, we can manage them using AWS Glue Job APIs. In this section, we will get an overview of the AWS Glue Job APIs used to run and manage jobs.
Update
AWS Glue Job Bookmarks
In this section, we will go through the details related to AWS Glue Job Bookmarks.
Introduction to AWS Glue Job Bookmarks
Cleaning up the data to run AWS Glue Jobs
Overview of AWS Glue CLI and Commands
Run AWS Glue Job using AWS Glue Bookmark
Validate AWS Glue Bookmark using AWS CLI
Add new data to the landing zone to run AWS Glue Jobs using Bookmarks
Rerun AWS Glue Job using Bookmark
Validate AWS Glue Job Bookmark and Files for Incremental run
Recrawl the AWS Glue Catalog Table using
Development Life Cycle of Pyspark
We will use this application later while exploring EMR in detail.
Setup Virtual Environment and Install Pyspark
Getting Started with Pycharm
Passing Run Time Arguments
Accessing OS Environment Variables
Getting Started with Spark
Create Function for Spark Session
Setup Sample Data
Read data from files
Process data using Spark APIs
Write data to files
Validating Writing Data to Files
Productionizing the Code
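The steps on run-time arguments, environment variables, and a Spark session function could be combined along these lines. This is only a sketch under assumptions: the argument names (`--src-dir`, `--tgt-dir`, `--environ`), the environment variable names, and the rule that `DEV` runs with a local master are all hypothetical, not the course's actual conventions.

```python
import argparse
import os


def parse_run_config(argv, environ=None):
    """Resolve settings from command-line arguments, falling back to environment variables."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Sample Spark application settings")
    parser.add_argument("--src-dir", default=environ.get("SRC_DIR"))
    parser.add_argument("--tgt-dir", default=environ.get("TGT_DIR"))
    parser.add_argument("--environ", default=environ.get("ENVIRON", "DEV"))
    args = parser.parse_args(argv)
    if not args.src_dir or not args.tgt_dir:
        parser.error("source and target directories are required")
    return {"src_dir": args.src_dir, "tgt_dir": args.tgt_dir, "environ": args.environ}


def get_spark_session(env, app_name):
    """Create a Spark session; pyspark is imported lazily so the parser works without it."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    if env == "DEV":
        # Assumption: local development runs Spark in local mode.
        builder = builder.master("local")
    return builder.getOrCreate()
```

Keeping environment-specific values outside the code is what lets the same application run unchanged in local development and later on an EMR cluster.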
Getting Started with AWS EMR (Elastic Map Reduce)
As part of this section, we will understand how to get started with an AWS EMR Cluster. We will primarily focus on the AWS EMR Web Console. Elastic MapReduce is one of the key services in AWS Data Analytics Services; it provides the capability to run applications that process large-scale data using distributed computing frameworks such as Spark.
Planning for
Deploying Spark Applications using AWS EMR
We will be using the Spark Application we deployed earlier.
Deploying Applications using AWS EMR - Introduction
Setup
Streaming Pipeline using AWS Kinesis
We will use AWS Kinesis Firehose Agent and AWS Kinesis Delivery Stream to read the data from log files and ingest it into AWS S3.
Building Streaming Pipeline using AWS Kinesis Firehose Agent and Delivery Stream
Rotating Logs so that files are created frequently and eventually ingested using AWS Kinesis Firehose Agent and AWS Kinesis Firehose Delivery Stream
Set up AWS Kinesis Firehose Agent to get data from logs into AWS Kinesis Delivery Stream
Create AWS Kinesis Firehose Delivery Stream
Planning the Pipeline to ingest data into s3 using AWS Kinesis Delivery Stream
Create AWS IAM Group and User for Streaming Pipelines using AWS Kinesis Components
Granting Permissions to
Start and Validate AWS Kinesis Firehose Agent
Conclusion - Building Simple Streaming Pipeline using AWS Kinesis Firehose
Consuming Data from AWS s3 using Python boto3 ingested using AWS Kinesis
As data is ingested into AWS S3, we will understand how the data ingested into AWS S3 can be processed using boto3.
Customizing AWS s3 folder using AWS Kinesis Delivery Stream
Create AWS IAM Policy to read from AWS s3 Bucket
Validate AWS s3 access using AWS CLI
Setup Python Virtual Environment to explore boto3
Validating access to AWS s3 using Python boto3
Read Content from AWS s3 object
Read multiple AWS s3 Objects
Get the number of AWS s3 Objects using Marker
Get the size of AWS s3 Objects using Marker
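Counting objects and totalling their sizes with `Marker` might look like the sketch below. The function name is hypothetical and `s3_client` is assumed to behave like `boto3.client("s3")`. Because no delimiter is used, the key of the last object on each page serves as the marker for the next call.

```python
def summarize_s3_objects(s3_client, bucket, prefix=""):
    """Return (object_count, total_size_in_bytes) for keys under a prefix."""
    count = 0
    total_size = 0
    marker = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if marker:
            # Resume listing after the last key seen on the previous page.
            kwargs["Marker"] = marker
        response = s3_client.list_objects(**kwargs)
        contents = response.get("Contents", [])
        count += len(contents)
        total_size += sum(obj["Size"] for obj in contents)
        if not response.get("IsTruncated") or not contents:
            return count, total_size
        marker = contents[-1]["Key"]
```

Each `list_objects` call returns at most 1,000 keys, which is why the loop is needed for prefixes that hold many delivered files.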
Populating GitHub Data to AWS Dynamodb
As part of this section, we will understand how we can populate data to AWS Dynamodb tables using Python as a programming language.
Install required libraries to get GitHub Data to AWS Dynamodb tables
Understanding GitHub APIs
Setting up GitHub API Token
Understanding GitHub Rate Limit
Create New Repository for since
Extracting Required Information using Python
Processing Data using Python
Grant Permissions to create AWS dynamodb tables using boto3
Create AWS Dynamodb Tables
AWS Dynamodb CRUD Operations
Populate AWS Dynamodb Table
AWS Dynamodb Batch Operations
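For the batch operations step, one possible shape is shown below. It is a sketch with a hypothetical function name: `dynamodb_client` is assumed to behave like `boto3.client("dynamodb")`, and items are assumed to already be in DynamoDB attribute-value format (for example `{"id": {"N": "1"}}`). `BatchWriteItem` accepts at most 25 requests per call, so the items are sent in chunks.

```python
def batch_put_items(dynamodb_client, table_name, items, batch_size=25):
    """Write items in chunks, resending anything DynamoDB reports as unprocessed.

    Returns the number of batch_write_item calls made.
    """
    calls = 0
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        pending = {table_name: [{"PutRequest": {"Item": item}} for item in chunk]}
        while pending:
            response = dynamodb_client.batch_write_item(RequestItems=pending)
            calls += 1
            # Simplification: real code should back off before retrying.
            pending = response.get("UnprocessedItems") or {}
    return calls
```

The higher-level `Table.batch_writer()` context manager in boto3 handles chunking and resending automatically and is usually the simpler choice. The explicit loop here is only meant to show what happens underneath.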
Overview of Amazon AWS Athena
As part of this section, we will understand how to get started with AWS Athena using AWS Web console. We will also focus on basic DDL and DML or CRUD Operations using AWS Athena Query Editor.
Getting Started with Amazon AWS Athena
Quick Recap of AWS Glue Catalog Databases and Tables
Access AWS Glue Catalog Databases and Tables using AWS Athena Query Editor
Create a Database and Table using AWS Athena
Populate Data into Table using AWS Athena
Using CTAS to create tables using AWS Athena
Overview of Amazon AWS Athena Architecture
Amazon AWS Athena Resources and relationship with Hive
Create a Partitioned Table using AWS Athena
Develop Query for Partitioned Column
Insert into Partitioned Tables using AWS Athena
Validate Data Partitioning using AWS Athena
Drop AWS Athena Tables and Delete Data Files
Drop Partitioned Table using AWS Athena
Data Partitioning in AWS Athena using CTAS
Amazon AWS Athena using AWS CLI
As part of this section, we will understand how to interact with AWS Athena using AWS CLI Commands.
Amazon AWS Athena using AWS CLI - Introduction
Get help and list AWS Athena databases using AWS CLI
Managing AWS Athena Workgroups using AWS CLI
Run AWS Athena Queries using AWS CLI
Get AWS Athena Table Metadata using AWS CLI
Run AWS Athena Queries with a custom location using AWS CLI
Drop AWS Athena table using AWS CLI
Run CTAS under AWS Athena using AWS CLI
Amazon AWS Athena using Python boto3
As part of this section, we will understand how to interact with AWS Athena using Python boto3.
Amazon AWS Athena using Python boto3 - Introduction
Getting Started with Managing AWS Athena using Python boto3
List Amazon AWS Athena Databases using Python boto3
List Amazon AWS Athena Tables using Python boto3
Run Amazon AWS Athena Queries with boto3
Review AWS Athena Query Results using boto3
Persist Amazon AWS Athena Query Results in Custom Location using boto3
Processing AWS Athena Query Results using Pandas
Run CTAS against Amazon AWS Athena using Python boto3
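Running a query and reviewing its results with boto3 generally follows a start, poll, fetch pattern, sketched below. The function name and the polling interval are assumptions, and `athena_client` is expected to behave like `boto3.client("athena")`. Only the first page of results is read here.

```python
import time


def run_athena_query(athena_client, query, database, output_location, poll_seconds=1.0):
    """Run a query, wait for it to finish, and return the first page of rows as lists."""
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    return [
        [column.get("VarCharValue") for column in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]
```

For a `SELECT` statement, the first returned row holds the column headers. The `output_location` is the S3 path where Athena persists the result files, which corresponds to the custom location step above.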
Getting Started with Amazon AWS Redshift
As part of this section, we will understand how to get started with AWS Redshift using AWS Web console. We will also focus on basic DDL and DML or CRUD Operations using AWS Redshift Query Editor.
Getting Started with Amazon AWS Redshift - Introduction
Create AWS Redshift Cluster using Free Trial
Connecting to Database using AWS Redshift Query Editor
Get a list of tables querying information schema
Run Queries against AWS Redshift Tables using Query Editor
Create AWS Redshift Table using Primary Key
Insert Data into AWS Redshift Tables
Update Data in AWS Redshift Tables
Delete data from AWS Redshift tables
Redshift Saved Queries using Query Editor
Deleting AWS Redshift Cluster
Restore AWS Redshift Cluster from Snapshot
Copy Data from s3 into AWS Redshift Tables
As part of this section, we will go through the details about copying data from s3 into AWS Redshift tables using the AWS Redshift Copy command.
Copy Data from s3 to AWS Redshift - Introduction
Setup Data in s3 for AWS Redshift Copy
Copy Database and Table for AWS Redshift Copy Command
Create IAM User with full access on s3 for AWS Redshift Copy
Run Copy Command to copy data from s3 to AWS Redshift Table
Troubleshoot Errors related to AWS Redshift Copy Command
Run Copy Command to copy from s3 to AWS Redshift table
Validate using queries against AWS Redshift Table
Overview of AWS Redshift Copy Command
Create IAM Role for AWS Redshift to access s3
Copy Data from s3 to AWS Redshift table using IAM Role
Setup JSON Dataset in s3 for AWS Redshift Copy Command
Copy JSON Data from s3 to AWS Redshift table using IAM Role
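To make the IAM role variant of the Copy command concrete, here is a small helper that assembles the statement as text. The helper name is hypothetical, and the table name, bucket path, and role ARN in the usage note are placeholders. Only CSV and JSON with the `'auto'` option are covered.

```python
def build_copy_command(table, s3_path, iam_role_arn, data_format="CSV"):
    """Build a Redshift COPY statement that loads from S3 using an IAM role.

    Assumption: all arguments come from trusted configuration, since they
    are interpolated directly into the SQL text.
    """
    format_clauses = {
        "CSV": "FORMAT AS CSV",
        "JSON": "FORMAT AS JSON 'auto'",
    }
    if data_format not in format_clauses:
        raise ValueError(f"Unsupported format: {data_format}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"{format_clauses[data_format]};"
    )
```

For example, `build_copy_command("retail.orders", "s3://my-bucket/orders/", "arn:aws:iam::123456789012:role/ExampleRedshiftRole", "JSON")` produces a statement that can be pasted into the Query Editor or passed to a database cursor's `execute` method.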
Develop Applications using AWS Redshift Cluster
As part of this section, we will understand how to develop applications against databases and tables created as part of AWS Redshift Cluster.
Develop application using AWS Redshift Cluster - Introduction
Allocate Elastic IP for AWS Redshift Cluster
Enable Public Accessibility for AWS Redshift Cluster
Update Inbound Rules in Security Group to access AWS Redshift Cluster
Create Database and User in AWS Redshift Cluster
Connect to the database in AWS Redshift using psql
Change Owner on AWS Redshift Tables
Download AWS Redshift JDBC Jar file
Connect to AWS Redshift Databases using IDEs such as SQL Workbench
Setup Python Virtual Environment for AWS Redshift
Run Simple Query against AWS Redshift Database Table using Python
Truncate AWS Redshift Table using Python
Create IAM User to copy from s3 to AWS Redshift Tables
Validate Access of IAM User using Boto3
Run AWS Redshift Copy Command using Python
AWS Redshift Tables with Distkeys and Sortkeys
As part of this section, we will go through AWS Redshift-specific features such as distribution keys and sort keys to create AWS Redshift tables.
AWS Redshift Tables with Distkeys and Sortkeys - Introduction
Quick Review of AWS Redshift Architecture
Create multi-node AWS Redshift Cluster
Connect to AWS Redshift Cluster using Query Editor
Create AWS Redshift Database
Create AWS Redshift Database User
Create AWS Redshift Database Schema
Default Distribution Style of AWS Redshift Table
Grant Select Permissions on Catalog to AWS Redshift Database User
Update Search Path to query AWS Redshift system tables
Validate AWS Redshift table with
AWS Redshift Federated Queries and Spectrum - Introduction
Overview of integrating AWS RDS and AWS Redshift for Federated Queries
Create IAM Role for AWS Redshift Cluster
Setup Postgres Database Server for AWS Redshift Federated Queries
Create tables in Postgres Database for AWS Redshift Federated Queries
Creating Secret using Secrets Manager for Postgres Database
Accessing Secret Details using Python Boto3
Reading JSON Data to Dataframe using Pandas
Write JSON Data to AWS Redshift Database Tables using Pandas
Create AWS IAM Policy for Secret and associate with Redshift Role
Create AWS Redshift Cluster using AWS IAM Role with permissions on secret
Create AWS Redshift External Schema to Postgres Database
Update AWS Redshift Cluster Network Settings for Federated Queries
Performing ETL using AWS Redshift Federated Queries
Clean up resources added for AWS Redshift Federated Queries
Grant Access on AWS Glue Data Catalog to AWS Redshift Cluster for Spectrum
Setup AWS Redshift Clusters to run queries using Spectrum
Quick Recap of AWS Glue Catalog Database and Tables for AWS Redshift Spectrum
Create External Schema using AWS Redshift Spectrum
Run Queries using AWS Redshift Spectrum
Cleanup the AWS Redshift Cluster
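The step on accessing secret details with boto3 could be sketched as follows. The function name is hypothetical, `secrets_client` is assumed to behave like `boto3.client("secretsmanager")`, and the secret is assumed to be stored as a JSON string of key/value pairs.

```python
import json


def get_secret_dict(secrets_client, secret_id):
    """Fetch a secret from Secrets Manager and parse its JSON payload into a dict."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    if "SecretString" not in response:
        # Binary secrets are returned under SecretBinary and are not handled here.
        raise ValueError(f"Secret {secret_id} does not contain a SecretString")
    return json.loads(response["SecretString"])
```

The returned dictionary can then supply the connection details for the Postgres database without hard-coding the password in the application.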