As part of this course, you will learn Data Engineering using Databricks, a cloud platform-agnostic technology.
About Data Engineering
Data Engineering is nothing but processing data based on downstream needs. As part of Data Engineering, we build different pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering; conventionally, they are known as ETL Development, Data Warehouse Development, etc.
About Databricks
Databricks is the most popular cloud platform-agnostic data engineering tech stack. The company was founded by the original creators of Apache Spark, and the Databricks runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, they came up with the idea of the Lakehouse, providing all the features required for traditional BI as well as AI & ML. Here are some of the core features of Databricks.
Spark - Distributed Computing
Delta Lake - Perform CRUD operations on files in the Data Lake. It is primarily used to build capabilities such as inserting, updating, and deleting data in Data Lake files (a minimal sketch follows this list).
cloudFiles - Ingest files incrementally in the most efficient way, leveraging cloud-native features.
Databricks SQL - A Photon-based interface fine-tuned for running queries submitted by reporting and visualization tools. It is also used for ad-hoc analysis.
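To make these features concrete, here is a minimal, hedged sketch in PySpark, as run in a Databricks notebook (where spark is predefined). The paths and column names are hypothetical, not from the course; it shows Spark reading raw JSON, Delta Lake persisting it, and an in-place update:

```python
# Minimal sketch: read raw JSON with Spark, persist it in Delta format,
# then update records in place. Paths and columns are hypothetical.
from delta.tables import DeltaTable  # ships with Databricks runtimes

df = spark.read.json("dbfs:/public/orders/landing/")
df.write.format("delta").mode("overwrite").save("dbfs:/public/orders/delta/")

orders = DeltaTable.forPath(spark, "dbfs:/public/orders/delta/")
orders.update(
    condition="order_status = 'PENDING_PAYMENT'",
    set={"order_status": "'PENDING'"},  # SQL expression, hence nested quotes
)
```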
Course Details
As part of this course, you will be learning Data Engineering using Databricks.
Getting Started with Databricks
Setup Local Development Environment to develop Data Engineering Applications using Databricks
Using Databricks CLI to manage files, jobs, clusters, etc. related to Data Engineering Applications
Spark Application Development Cycle to build Data Engineering Applications
Databricks Jobs and Clusters
Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application
Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks
Deep Dive into Delta Lake using Dataframes on Databricks Platform
Deep Dive into Delta Lake using Spark SQL on Databricks Platform
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
Overview of Databricks SQL for Data Analysis and reporting.
We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.
Desired Audience
Here is the desired audience for this advanced course.
Experienced application developers with prior knowledge and experience of Spark, who want to gain expertise in Data Engineering.
Experienced Data Engineers who want to gain enough skills to add Databricks to their profile.
Testers who want to improve their testing capabilities for Data Engineering applications built using Databricks.
Prerequisites
Logistics
Computer with decent configuration (At least
Associated Costs
As part of the training, you will only get the material. You need to practice using your own or your corporate cloud account and Databricks account.
You need to take care of the associated AWS or Azure costs.
You need to take care of the associated Databricks costs.
Training Approach
Here are the details related to the training approach.
It is self-paced, with reference material, code snippets, and videos provided via Udemy.
One needs to sign up for their own Databricks environment to practice all the core features of Databricks.
We would recommend completing 2 modules every week by spending 4 to 5 hours per week.
It is highly recommended to complete all the tasks so that you get real experience with Databricks.
Support will be provided through Udemy Q&A.
Here is the detailed course outline.
Getting Started with Databricks on Azure
As part of this section, we will go through the details about signing up for Azure and setting up a Databricks cluster on Azure.
Getting Started with Databricks on Azure
Signup for the Azure Account
Login and Increase Quotas for regional vCPUs in Azure
Create Azure Databricks Workspace
Launching Azure Databricks Workspace or Cluster
Quick Walkthrough of Azure Databricks UI
Create Azure Databricks Single Node Cluster
Upload Data using Azure Databricks UI
Overview of Creating Notebook and Validating Files using Azure Databricks
Develop Spark Application using Azure Databricks Notebook (a minimal sketch follows this outline)
Validate Spark Jobs using Azure Databricks Notebook
Export and Import of Azure Databricks Notebooks
Terminating Azure Databricks Cluster and Deleting Configuration
Delete Azure Databricks Workspace by deleting Resource Group
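As a taste of the notebook development lecture above, here is a minimal, hedged notebook sketch. The file name is hypothetical; dbfs:/FileStore/tables is where files uploaded via the Databricks UI typically land.

```python
# Validate a file uploaded via the Azure Databricks UI by reading it
# into a Spark Data Frame. The file name is hypothetical.
df = (spark.read
      .option("header", "true")
      .csv("dbfs:/FileStore/tables/orders.csv"))

df.printSchema()
display(df.limit(10))  # display() is available in Databricks notebooks
```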
Azure Essentials for Databricks - Azure CLI
As part of this section, we will go through the details about setting up Azure CLI to manage Azure resources using relevant commands.
Azure Essentials for Databricks - Azure CLI
Azure CLI using Azure Portal Cloud Shell
Getting Started with Azure CLI on Mac
Getting Started with Azure CLI on Windows
Warming up with Azure CLI - Overview
Create Resource Group using Azure CLI
Create ADLS Storage Account within Resource Group
Add Container as part of Storage Account
Overview of Uploading the data into ADLS File System or Container
Setup Data Set locally to upload into ADLS File System or Container
Upload local directory into Azure ADLS File System or Container
Delete Azure ADLS Storage Account using Azure CLI
Delete Azure Resource Group using Azure CLI
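To tie the commands in this section together, here is a hedged sketch that drives the same Azure CLI sequence from Python. All resource names are made up, and the flags reflect recent az versions; verify them against az --help before relying on them.

```python
# Sketch: create a resource group, an ADLS Gen2 storage account, and a
# container, then upload a local data set. All names are hypothetical.
import subprocess

def az(*args: str) -> None:
    """Run an az CLI command, raising if it fails."""
    subprocess.run(["az", *args], check=True)

az("group", "create", "--name", "dbdemo-rg", "--location", "eastus")
az("storage", "account", "create",
   "--name", "dbdemoadls", "--resource-group", "dbdemo-rg",
   "--sku", "Standard_LRS", "--kind", "StorageV2",
   "--enable-hierarchical-namespace", "true")   # makes it ADLS Gen2
az("storage", "container", "create",
   "--account-name", "dbdemoadls", "--name", "retail")
az("storage", "blob", "upload-batch",
   "--account-name", "dbdemoadls",
   "--destination", "retail", "--source", "./retail_db")
```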
Mount ADLS onto Azure Databricks to access files from Azure Blob Storage
As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) onto Azure Databricks Clusters.
Mount ADLS onto Azure Databricks - Introduction
Ensure Azure Databricks Workspace
Setup Databricks CLI on Mac or Windows using Python Virtual Environment
Configure Databricks CLI for new Azure Databricks Workspace
Register an Azure Active Directory Application
Create Databricks Secret for AD Application Client Secret
Create ADLS Storage Account
Assign IAM Role on Storage Account to Azure AD Application
Setup Retail DB Dataset
Create ADLS Container or File System and Upload Data
Start Databricks Cluster to mount ADLS
Mount ADLS Storage Account onto Azure Databricks
Validate ADLS Mount Point on Azure Databricks Clusters
Unmount the mount point from Databricks
Delete Azure Resource Group used for Mounting ADLS onto Azure Databricks
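The mount itself boils down to a single dbutils.fs.mount call with OAuth configs pointing at the AD application. Below is a hedged sketch; the secret scope, key names, storage account, container, and tenant id are all hypothetical.

```python
# Mount an ADLS Gen2 container using an Azure AD service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":
        dbutils.secrets.get("demo-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get("demo-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://retail@dbdemoadls.dfs.core.windows.net/",
    mount_point="/mnt/retail",
    extra_configs=configs,
)
display(dbutils.fs.ls("/mnt/retail"))   # validate the mount point
# dbutils.fs.unmount("/mnt/retail")     # clean up when done
```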
Setup Local Development Environment for Databricks
As part of this section, we will go through the details related to setting up a local development environment for Databricks using tools such as PyCharm, Databricks Connect, Databricks dbutils, etc.
Setup Single Node Databricks Cluster
Install Databricks Connect
Configure Databricks Connect
Integrating PyCharm with Databricks Connect
Integrate Databricks Cluster with Glue Catalog
Setup AWS s3 Bucket and Grant Permissions
Mounting s3 Buckets into Databricks Clusters
Using Databricks dbutils from IDEs such as PyCharm
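Once Databricks Connect is configured, local code runs against the remote cluster through a regular SparkSession. A minimal sketch, assuming databricks-connect has already been installed and configured for your workspace and cluster:

```python
# Assumes `pip install databricks-connect` and `databricks-connect configure`
# have been completed for your workspace and cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # connects to the remote cluster
print(spark.range(10).count())               # executes on Databricks, prints 10
```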
Using Databricks CLI
As part of this section, we will get an overview of Databricks CLI to interact with Databricks File System or DBFS.
Introduction to Databricks CLI
Install and Configure Databricks CLI
Interacting with Databricks File System using Databricks CLI
Getting Databricks Cluster Details using Databricks CLI
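The handful of commands this section covers can also be scripted. A hedged sketch using the legacy Python-based Databricks CLI, assuming `databricks configure --token` has already been run:

```python
# Browse DBFS and list clusters via the Databricks CLI from Python.
import subprocess

for cmd in (
    ["databricks", "fs", "ls", "dbfs:/"],     # interact with DBFS
    ["databricks", "clusters", "list"],       # cluster details
):
    subprocess.run(cmd, check=True)
```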
Databricks Jobs and Clusters
As part of this section, we will go through the details related to Databricks Jobs and Clusters.
Introduction to Databricks Jobs and Clusters
Creating Pools in Databricks Platform
Create Cluster on Azure Databricks
Request to Increase CPU Quota on Azure
Creating Job on Databricks
Submitting Jobs using Databricks Job Cluster
Create Pool in Databricks
Running Job using Interactive Databricks Cluster Attached to Pool
Running Job Using Databricks Job Cluster Attached to Pool
Exercise - Submit the application as a job using Databricks interactive cluster
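Under the hood, submitting a job to a job cluster is a single REST call. Here is a hedged sketch against the Databricks Jobs API 2.1; the host, token, file path, and cluster spec are hypothetical.

```python
# Submit a one-time run of a Python application on a new job cluster.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

payload = {
    "run_name": "ghactivity-demo",
    "tasks": [{
        "task_key": "main",
        "spark_python_task": {"python_file": "dbfs:/apps/ghactivity/app.py"},
        "new_cluster": {                      # job cluster, created per run
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",      # use an Azure VM type on Azure
            "num_workers": 2,
        },
    }],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```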
Deploy and Run Spark Applications on Databricks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and running those applications.
Prepare PyCharm for Databricks
Prepare Data Sets
Move files to ghactivity
Refactor Code for Databricks
Validating Data using Databricks
Setup Data Set for Production Deployment
Access File Metadata using Databricks dbutils
Build Deployable bundle for Databricks
Running Jobs using Databricks Web UI
Get Job and Run Details using Databricks CLI
Submitting Databricks Jobs using CLI
Setup and Validate Databricks Client Library
Resetting the Job using Databricks Jobs API
Run Databricks Job programmatically using Python
Detailed Validation of Data using Databricks Notebooks
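For instance, the file-metadata lecture relies on dbutils.fs.ls, which returns FileInfo objects. A short sketch with a hypothetical path:

```python
# Inspect file metadata with dbutils. Works in a notebook, or locally via
# databricks-connect with pyspark.dbutils.DBUtils(spark).
for f in dbutils.fs.ls("dbfs:/public/ghactivity/"):
    print(f.path, f.size)   # size is in bytes
```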
Deploy and Run Spark Jobs using Notebooks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and running those applications using Databricks Notebooks.
Modularizing Databricks Notebooks
Running Job using Databricks Notebook
Refactor application as Databricks Notebooks
Run Notebook using Databricks Development Cluster
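Two building blocks the modularization lectures revolve around are the %run magic and dbutils.notebook.run. A hedged sketch; the child notebook path and arguments are hypothetical:

```python
# %run ./utils  -- inlines another notebook's definitions into this one.
# dbutils.notebook.run runs a child notebook as its own job and returns
# whatever the child passes to dbutils.notebook.exit(...).
result = dbutils.notebook.run(
    "./load_ghactivity",    # hypothetical child notebook
    600,                    # timeout in seconds
    {"ds": "2021-01-13"},   # arguments surfaced as widgets in the child
)
print(result)
```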
Deep Dive into Delta Lake using Spark Data Frames on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.
Introduction to Delta Lake using Spark Data Frames on Databricks
Creating Spark Data Frames for Delta Lake on Databricks
Writing Spark Data Frame using Delta Format on Databricks
Updating Existing Data using Delta Format on Databricks
Delete Existing Data using Delta Format on Databricks
Merge or Upsert Data using Delta Format on Databricks
Deleting using Merge in Delta Lake on Databricks
Point-in-Time Snapshot Recovery using Delta Logs on Databricks
Deleting unnecessary Delta Files using Vacuum on Databricks
Compaction of Delta Lake Files on Databricks
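The upsert lectures center on the DeltaTable merge builder. A self-contained, hedged sketch; the path and columns are hypothetical:

```python
# Merge (upsert) a small batch of updates into an existing Delta table.
from delta.tables import DeltaTable

updates_df = spark.createDataFrame(
    [(1, "COMPLETE"), (999, "PENDING")],
    "order_id INT, order_status STRING",
)

tgt = DeltaTable.forPath(spark, "dbfs:/public/orders/delta/")
(tgt.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()       # update existing order 1
    .whenNotMatchedInsertAll()    # insert new order 999
    .execute())

# Point-in-time snapshot recovery via Delta time travel
old = (spark.read.format("delta")
       .option("versionAsOf", 0)
       .load("dbfs:/public/orders/delta/"))
```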
Deep Dive into Delta Lake using Spark SQL on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.
Introduction to Delta Lake using Spark SQL on Databricks
Create Delta Lake Table using Spark SQL on Databricks
Insert Data to Delta Lake Table using Spark SQL on Databricks
Update Data in Delta Lake Table using Spark SQL on Databricks
Delete Data from Delta Lake Table using Spark SQL on Databricks
Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks
Using Merge Function over Delta Lake Table using Spark SQL on Databricks
Point-in-Time Snapshot Recovery of Delta Lake Tables using Spark SQL on Databricks
Vacuuming Delta Lake Tables using Spark SQL on Databricks
Compaction of Delta Lake Tables using Spark SQL on Databricks
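The same operations expressed in Spark SQL look like the hedged sketch below. The table names are hypothetical, and the statements can equally be run from a notebook SQL cell:

```python
# Upsert, time travel, and vacuum using Spark SQL over Delta tables.
spark.sql("""
    MERGE INTO orders AS t
    USING orders_updates AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

spark.sql("SELECT * FROM orders VERSION AS OF 0").show()  # snapshot recovery
spark.sql("VACUUM orders RETAIN 168 HOURS")               # drop stale files
```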
Accessing Databricks Cluster Terminal via Web as well as SSH
As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.
Enable Web Terminal in Databricks Admin Console
Launch Web Terminal for Databricks Cluster
Setup SSH for the Databricks Cluster Driver Node
Validate SSH Connectivity to the Databricks Driver Node on AWS
Limitations of SSH and comparison with Web Terminal related to Databricks Clusters
Installing Software on Databricks Clusters using Init Scripts
As part of this section, we will see how to bootstrap Databricks clusters by installing relevant third-party libraries for our applications.
Setup gen_logs on Databricks Cluster
Overview of Init Scripts for Databricks Clusters
Create Script to install software from Git on Databricks Cluster
Copy init script to dbfs location
Create Databricks Standalone Cluster with init script
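An init script is just a shell script placed where the cluster can read it. Here is a hedged sketch that stages one on DBFS; the script path and package URL are hypothetical:

```python
# Stage a cluster-scoped init script on DBFS. On start, every node runs it,
# which is how third-party software such as gen_logs gets installed.
dbutils.fs.put(
    "dbfs:/databricks/scripts/install_gen_logs.sh",
    """#!/bin/bash
pip install git+https://github.com/example/gen-logs.git
""",
    True,  # overwrite if the script already exists
)
# Reference dbfs:/databricks/scripts/install_gen_logs.sh under
# Cluster -> Advanced Options -> Init Scripts, then (re)start the cluster.
```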
Quick Recap of Spark Structured Streaming
As part of this section, we will get a quick recap of Spark Structured Streaming.
Validate Netcat on Databricks Driver Node
Push log messages to Netcat Server on Databricks Driver Node
Reading Web Server logs using Spark Structured Streaming
Writing Streaming Data to Files
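The recap boils down to a socket source feeding a file sink. A minimal sketch; the host, port, and paths are hypothetical:

```python
# Read log messages pushed to Netcat on the driver node and persist them.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # Netcat running on the driver
         .option("port", 9000)
         .load())

query = (lines.writeStream
         .format("text")
         .option("checkpointLocation", "dbfs:/tmp/logs_demo/checkpoint")
         .start("dbfs:/tmp/logs_demo/data"))
```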
Incremental Loads using Spark Structured Streaming on Databricks
As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.
Overview of Spark Structured Streaming
Steps for Incremental Data Processing on Databricks
Configure Databricks Cluster with Instance Profile
Upload GHArchive Files to AWS s3 using Databricks Notebooks
Read JSON Data using Spark Structured Streaming on Databricks
Write Data in Delta File Format using Trigger Once on Databricks
Analyze GHArchive Data in Delta files using Spark on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Validate Incremental Load on Databricks
Internals of Spark Structured Streaming File Processing on Databricks
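The core of this section is a JSON file source combined with a Delta sink and Trigger Once. A hedged sketch; the bucket, paths, and the trimmed schema are hypothetical:

```python
# Incrementally pick up new JSON files and append them to a Delta target.
from pyspark.sql.types import StructType, StructField, StringType

ghactivity_schema = StructType([          # trimmed, illustrative schema
    StructField("id", StringType()),
    StructField("type", StringType()),
    StructField("created_at", StringType()),
])

df = (spark.readStream
      .format("json")
      .schema(ghactivity_schema)          # file streams require a schema
      .load("s3://<bucket>/ghactivity/landing/"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "s3://<bucket>/ghactivity/checkpoint/")
   .trigger(once=True)                    # process new files, then stop
   .start("s3://<bucket>/ghactivity/bronze/"))
```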
Incremental Loads using Auto Loader cloudFiles on Databricks
As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.
Overview of Auto Loader cloudFiles on Databricks
Upload GHArchive Files to s3 on Databricks
Write Data using Auto Loader cloudFiles on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Add New GHActivity JSON files on Databricks
Overview of Handling S3 Events using AWS Services on Databricks
Configure IAM Role for cloudFiles file notifications on Databricks
Incremental Load using cloudFiles File Notifications on Databricks
Review AWS Services for cloudFiles Event Notifications on Databricks
Review Metadata Generated for cloudFiles Checkpointing on Databricks
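Switching from the plain file source to Auto Loader is mostly a format change. A hedged sketch with a hypothetical bucket and schema; the notification option flips between the two file discovery modes:

```python
# Same incremental load, but file discovery handled by Auto Loader.
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType()),
    StructField("type", StringType()),
])

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # false = directory listing (default); true = file notifications,
      # which provisions queue resources and needs the IAM role set up earlier
      .option("cloudFiles.useNotifications", "false")
      .schema(schema)
      .load("s3://<bucket>/ghactivity/landing/"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "s3://<bucket>/ghactivity/checkpoint/")
   .trigger(once=True)
   .start("s3://<bucket>/ghactivity/bronze/"))
```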
Overview of Databricks SQL Clusters
As part of this section, we will get an overview of Databricks SQL Clusters.
Overview of Databricks SQL Platform - Introduction
Run First Query using SQL Editor of Databricks SQL
Overview of Dashboards using Databricks SQL
Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables
Use Databricks SQL Editor to develop scripts or queries
Review Metadata of Tables using Databricks SQL Platform
Overview of loading data into retail_db tables
Configure Databricks CLI to push data into the Databricks Platform
Copy JSON Data into DBFS using Databricks CLI
Analyze JSON Data using Spark APIs
Analyze Delta Table Schemas using Spark APIs
Load Data from Spark Data Frames into Delta Tables
Run Ad-hoc Queries using Databricks SQL Editor to validate data
Overview of External Tables using Databricks SQL
Using COPY Command to Copy Data into Delta Tables
Manage Databricks SQL Endpoints
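As an example of the COPY command lecture, here is a hedged sketch; the database, table, and path are hypothetical. The statement can be pasted directly into the Databricks SQL editor, shown here via spark.sql:

```python
# COPY INTO is idempotent: files that were already loaded are skipped,
# which makes re-runs safe.
spark.sql("""
    COPY INTO retail_db.orders
    FROM 'dbfs:/public/retail_db_json/orders'
    FILEFORMAT = JSON
""")
```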