We may earn an affiliate commission when you visit our partners.

Data Engineering Essentials using SQL, Python, and PySpark

Durga Viswanatha Raju Gadiraju, Vaishnavi Kalidindi, Naga Bhuwaneshwar, Siva Kalyan Geddada, and Kavitha Penmetsa

As part of this course, you will learn the Data Engineering Essentials for building Data Pipelines using SQL and Python together with Hadoop, Hive, or Spark SQL, as well as PySpark Data Frame APIs. You will also learn the development and deployment lifecycle of Python applications using Docker and of PySpark applications on multinode clusters, and you will gain basic knowledge of reviewing Spark Jobs using the Spark UI.

About Data Engineering

Data Engineering is the processing of data to meet downstream needs. As part of Data Engineering, we need to build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering; conventionally, these roles were known as ETL Development, Data Warehouse Development, etc.

Here are some of the challenges learners face when learning key Data Engineering Skills such as Python, SQL, PySpark, etc.

  • Having an appropriate environment with Apache Hadoop, Apache Spark, Apache Hive, etc. working together.

  • Good quality content with proper support.

  • Enough tasks and exercises for practice.

This course is designed to address these key challenges for professionals at all levels to acquire the required Data Engineering Skills (Python, SQL, and Apache Spark).

  • Set up an environment to learn Data Engineering Essentials such as SQL (using Postgres), Python, etc.

  • Set up the required tables in Postgres to practice SQL

  • Writing basic SQL Queries with practical examples using

  • Performance Tuning of SQL Queries

  • Exercises and Solutions for SQL Queries.

  • Basics of Programming using Python as Programming Language

  • Python Collections for Data Engineering

  • Data Processing or Data Engineering using Pandas

  • 2 Real Time Python Projects with explanations (File Format Converter and Database Loader)

  • Scenarios covering troubleshooting and debugging in Python Applications

  • Performance Tuning Scenarios related to Data Engineering Applications using Python

  • Getting Started with Google Cloud Platform to set up a Spark Environment using Databricks

  • Writing Basic Spark SQL Queries with practical examples using WHERE, JOIN, GROUP BY, HAVING, ORDER BY, etc

  • Creating Delta Tables in Spark SQL along with CRUD Operations such as INSERT, UPDATE, DELETE, MERGE, etc

  • Advanced Spark SQL Queries with practical examples such as ranking

  • Integration of Spark SQL and PySpark

  • In-depth coverage of Apache Spark Catalyst Optimizer for Performance Tuning

  • Reading Explain Plans of Spark SQL Queries or PySpark Data Frame APIs

  • In-depth coverage of columnar file formats and Performance tuning using Partitioning
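
As a rough illustration of the kind of task the File Format Converter project in the list above suggests, here is a minimal Python sketch that turns CSV text into JSON Lines. It is not the course's implementation: the function name csv_to_json_lines and the sample rows are hypothetical, and it uses only the standard library rather than Pandas.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines (one object per row)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; type inference is deliberately left out.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data, not taken from the course material.
sample_csv = "order_id,order_status\n1,COMPLETE\n2,PENDING\n"
json_lines = csv_to_json_lines(sample_csv)
print(json_lines)
```

A fuller converter would likely read from and write to files and handle data types; this sketch only shows the row-by-row reshaping step.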

What's inside

Learning objectives

  • Set up an environment to learn SQL and Python essentials for Data Engineering
  • Database essentials for Data Engineering using Postgres, such as creating tables and indexes, running SQL queries, using important predefined functions, etc.
  • Data Engineering programming essentials using Python, such as basic programming constructs, collections, Pandas, database programming, etc.
  • Data Engineering using Spark Data Frame APIs (PySpark) on Databricks. Learn all important Spark Data Frame APIs such as select, filter, groupBy, orderBy, etc.
  • Data Engineering using Spark SQL (PySpark and Spark SQL). Learn how to write high-quality Spark SQL queries using SELECT, WHERE, GROUP BY, ORDER BY, etc.
  • Relevance of the Spark Metastore and integration of Data Frames and Spark SQL
  • Ability to build Data Engineering pipelines using Spark, leveraging Python as the programming language
  • Use of different file formats such as Parquet, JSON, CSV, etc. in building Data Engineering pipelines
  • Set up a Hadoop and Spark cluster on GCP using Dataproc
  • Understanding the complete Spark application development life cycle to build Spark applications using PySpark. Review the applications using the Spark UI.
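
The database programming objective above can be pictured with a small Python sketch: load rows with a parameterized insert, then run a query that filters, groups, and sorts. This is an illustration under stated assumptions, not course material: the standard library's sqlite3 module stands in for Postgres so that the sketch runs without a server, and the orders table and its rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_status TEXT, order_amount REAL)"
)

# Hypothetical rows; a parameterized insert keeps values separate from the SQL text.
rows = [
    (1, "COMPLETE", 100.0),
    (2, "COMPLETE", 50.0),
    (3, "PENDING", 75.0),
    (4, "CANCELED", 20.0),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Filter rows, aggregate per status, and sort by the aggregated value.
status_totals = conn.execute(
    """
    SELECT order_status, count(*) AS order_count, sum(order_amount) AS revenue
    FROM orders
    WHERE order_status <> 'CANCELED'
    GROUP BY order_status
    ORDER BY revenue DESC
    """
).fetchall()
print(status_totals)  # [('COMPLETE', 2, 150.0), ('PENDING', 1, 75.0)]
```

The same query shape carries over to Postgres and Spark SQL, although connection handling and placeholder syntax differ between drivers.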

Syllabus

Detailed overview of the topics related to SQL, Python, Hadoop, Spark, etc covered as part of this course.
Introduction to Data Engineering Essentials Course
Overview of our support to Data Engineering Essentials course
Overview of SQL topics covered in the course
Overview of Python topics covered in the course
Overview of Getting Started with GCP related to the course
Overview of Spark and Databricks Environment related topics
Detailed outline of Spark SQL Topics in the course
Detailed outline of Pyspark Topics in the course
Detailed outline of ELT Data Pipelines on Databricks
Overview of Performance Tuning of Spark covered in the course
Understand the relevance of SQL for Data Engineering
Introduction to SQL for Data Engineering
Overview of Application Architecture and RDBMS
Overview of Database Technologies and relevance of SQL
Overview of Purpose Built Databases
Overview of Data Warehouse and Data Lake
Usage of RDBMS and Data Warehouse technologies
Differences and Similarities between RDBMS and Data Warehouse Technologies
Setup required tools to build applications using Python, SQL, etc
Introduction to Setting up Tools for Data Engineering Essentials
Setup VS Code on Windows
Setup Python 3.9 on Windows
Configure Environment Variable PATH for Python on Windows
Overview of learning Python using Python CLI
Integrate VSCode with Python on Windows
Install Postgres 14 on Windows 11
Getting Started with pgAdmin on Windows
Getting Started with pgAdmin on Mac
Conclusion of Setting up Tools for Data Engineering Essentials
Setup required tables and data in Postgres Database to write SQL Queries
Overview of Postgres Database Server and pgAdmin
Overview of Database Connection Details
Overview of Connecting to External Databases using pgAdmin
Create Application Database and User in Postgres Database Server
Clone Data Sets from Git Repository for Database Scripts
Register Server in pgAdmin using Application Database and User
Setup Application Tables and Data in Postgres Database
Overview of pgAdmin to write SQL Queries
Writing Basic SQL Queries
Review Data Model Diagram
Define Problem Statement for SQL Queries
Filtering Data using SQL Queries
Total Aggregations using SQL Queries
Group By Aggregations using SQL Queries
Order of Execution of SQL Queries
Rules and Restrictions to Group and Filter Data in SQL queries
Filter Data based on Aggregated Results using Group By and Having
Inner Joins using SQL Queries
Outer Joins using SQL Queries
Filter and Aggregate on Join Results using SQL
Overview of Database Views
Overview of Common Table Expressions or CTEs
Outer Join with Additional Conditions in SQL Queries
Explanation about Fix of SQL Queries with Filtering on Outer Join Results
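
The last two topics in this section concern a common pitfall: filtering on the outer-joined table in WHERE discards the unmatched rows the outer join was meant to keep. The sketch below shows one way this can play out, assuming hypothetical customers and orders tables and using sqlite3 in place of Postgres so that it runs without a server. It is not the course's own example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_status TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (10, 1, 'COMPLETE'), (11, 1, 'PENDING'), (12, 2, 'PENDING');
    """
)

# Filter in WHERE: rows where o.order_status is NULL or not COMPLETE are dropped,
# so customers without a COMPLETE order vanish from the result.
where_filtered = conn.execute(
    """
    SELECT c.customer_name, count(o.order_id) AS complete_orders
    FROM customers AS c
    LEFT OUTER JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_status = 'COMPLETE'
    GROUP BY c.customer_id, c.customer_name
    ORDER BY c.customer_id
    """
).fetchall()

# Filter in ON: the condition restricts which orders match, and every customer is kept.
on_filtered = conn.execute(
    """
    SELECT c.customer_name, count(o.order_id) AS complete_orders
    FROM customers AS c
    LEFT OUTER JOIN orders AS o
        ON o.customer_id = c.customer_id AND o.order_status = 'COMPLETE'
    GROUP BY c.customer_id, c.customer_name
    ORDER BY c.customer_id
    """
).fetchall()

print(where_filtered)  # [('Ann', 1)]
print(on_filtered)     # [('Ann', 1), ('Bob', 0), ('Cid', 0)]
```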
Advanced SQL Queries with examples such as Cumulative Aggregations and Ranking
Introduction to Cumulative Aggregations and Ranking in SQL Queries
Overview of CTAS to create tables based on Query Results
Create Tables for Cumulative Aggregations and Ranking
Overview of OVER and PARTITION BY Clause in SQL Queries
Compute Total Aggregation using OVER and PARTITION BY in SQL Queries
Overview of Ranking in SQL
Compute Global Ranks using SQL
Compute Ranks based on key using SQL
Rules and Restrictions to Filter Data based on Ranks in SQL
Filtering based on Global Ranks using Nested Queries and CTEs in SQL
Filtering based on Ranks per Partition using Nested Queries and CTEs in SQL
Create Students table with Data for ranking using SQL
Difference between rank and dense rank using SQL
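
To make the difference between rank and dense rank concrete, here is a minimal sketch with a hypothetical students table. It assumes sqlite3 as a stand-in for Postgres and a bundled SQLite of version 3.25 or later, which is when window functions became available. Because a window function cannot be referenced directly in WHERE, the second query filters on the rank through a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (student_name TEXT, score INTEGER);
    INSERT INTO students VALUES ('Ann', 95), ('Bob', 90), ('Cid', 90), ('Dee', 85);
    """
)

# rank() leaves a gap after ties; dense_rank() does not.
ranked = conn.execute(
    """
    SELECT student_name, score,
           rank() OVER (ORDER BY score DESC) AS rnk,
           dense_rank() OVER (ORDER BY score DESC) AS drnk
    FROM students
    ORDER BY score DESC, student_name
    """
).fetchall()
print(ranked)
# [('Ann', 95, 1, 1), ('Bob', 90, 2, 2), ('Cid', 90, 2, 2), ('Dee', 85, 4, 3)]

# Filtering on a rank requires computing it first, here inside a CTE.
top_two_scores = conn.execute(
    """
    WITH ranked_students AS (
        SELECT student_name, dense_rank() OVER (ORDER BY score DESC) AS drnk
        FROM students
    )
    SELECT student_name FROM ranked_students WHERE drnk <= 2 ORDER BY student_name
    """
).fetchall()
print(top_two_scores)  # [('Ann',), ('Bob',), ('Cid',)]
```

Adding PARTITION BY inside OVER would restart the ranking for each key instead of ranking globally.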
Troubleshooting and Debugging Database Connectivity Issues
Introduction to SQL Troubleshooting and Debugging Guide
Overview of Database Connectivity Issues
Validate and Setup Telnet on Mac or PC
Validate Connectivity to Database Server using telnet
Troubleshoot Database Connectivity Issue with Correct Host Details
Current Databases and Users in Postgres Database Server
Troubleshoot Database Credentials and Permissions Issues
Overview of Compilation of SQL Queries
Troubleshooting Syntax Errors in SQL Queries
Troubleshooting Semantic Errors in SQL Queries
Overview of Bugs in SQL Queries
Development Best Practices with tips to troubleshoot SQL bugs
Develop Initial Solution based on the requirement
Identify and Troubleshoot Bugs in SQL Queries
Develop Solution using Development Best Practices
Read Explain Plans, identify performance bottlenecks and solve performance tuning problems
Introduction to Performance Tuning of SQL Queries
Overview of SQL Compilation Process and Explain Plans
Generate Explain Plans for SQL Queries
Review Tables used for Performance Tuning of SQL Queries
Review Data Storage Internals for Tables and Indexes
Review key terms used in Explain Plans for SQL Queries
Interpret Explain Plans for Basic SQL Queries
Review the Common Application Scenarios for Performance Tuning
Write SQL Queries for Customer Orders
Performance Testing of SQL Queries using Stored Procedure
Add Required Indexes to tune performance of SQL Queries
Guidelines on adding Indexes on Tables for SQL Queries
Interpreting the explain plan for SQL Queries using Indexes
Conclusion of Performance Tuning of SQL Queries
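
The performance tuning topics above revolve around reading an explain plan before and after adding an index. The sketch below shows that workflow in miniature. It is only an approximation of what the course covers: it uses SQLite's EXPLAIN QUERY PLAN rather than Postgres's EXPLAIN, and the orders table, the index name, and the helper plan_details are hypothetical. The exact plan wording varies between SQLite versions, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_status TEXT)"
)
# Hypothetical data: 1000 orders spread across 100 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, order_status) VALUES (?, ?)",
    [(i % 100, "COMPLETE") for i in range(1000)],
)

query = "SELECT order_id, order_status FROM orders WHERE customer_id = 42"


def plan_details(sql):
    """Return the human-readable step (last column) of each plan row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index on customer_id, the plan is expected to scan the whole table.
plan_before = plan_details(query)

# With the index in place, the plan is expected to search via the index instead.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = plan_details(query)

print(plan_before)
print(plan_after)
```

In Postgres the equivalent step is to prefix the query with EXPLAIN and compare a sequential scan against an index scan.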

Good to know

Know what's good, what to watch for, and possible dealbreakers
Teaches how to process and pipe data, which is essential to handling the scale of today's production problems
Covers SQL and Python, two of the most important tools for modern operations
Develops skills for building production-grade data pipelines that can be used in enterprise settings
Includes building blocks for both batch and streaming pipelines using PySpark
Teaches how to tune the performance of PySpark pipelines for production environments
Requires a basic understanding of Python, SQL, and data engineering principles

Career center

Learners who complete Data Engineering Essentials using SQL, Python, and PySpark will develop knowledge and skills that may be useful to these careers:
Data Engineer
Data Engineers are responsible for building, testing, and deploying data pipelines. They work closely with data scientists and other stakeholders to ensure that data meets their needs. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Data Engineer, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data engineering and prepare you for a career in this in-demand field.
Data Scientist
Data Scientists use data to solve business problems. They work with data engineers to build data pipelines and with data analysts to analyze data and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Data Scientist, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data science and prepare you for a career in this exciting field.
Data Analyst
Data Analysts use data to analyze trends and patterns. They work with data engineers and data scientists to build data pipelines and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Data Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data analysis and prepare you for a career in this growing field.
Database Administrator
Database Administrators are responsible for managing and maintaining databases. They work with data engineers and data scientists to ensure that data is stored and accessed efficiently. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Database Administrator, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in database administration and prepare you for a career in this essential field.
Software Engineer
Software Engineers design, build, and maintain software applications. They work with data engineers and data scientists to build data pipelines and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Software Engineer, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in software engineering and prepare you for a career in this in-demand field.
Data Architect
Data Architects design and implement data architectures for organizations. They work with data engineers and data scientists to ensure that data is stored and accessed efficiently. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Data Architect, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data architecture and prepare you for a career in this essential field.
Business Analyst
Business Analysts use data to solve business problems. They work with data engineers and data scientists to build data pipelines and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Business Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in business analysis and prepare you for a career in this growing field.
Project Manager
Project Managers plan, execute, and close projects. They work with data engineers and data scientists to ensure that data projects are delivered on time and within budget. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Project Manager, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in project management and prepare you for a career in this essential field.
Data Warehouse Analyst
Data Warehouse Analysts design and implement data warehouses for organizations. They work with data engineers and data scientists to ensure that data is stored and accessed efficiently. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Data Warehouse Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data warehousing and prepare you for a career in this essential field.
Database Developer
Database Developers design and develop databases for organizations. They work with data engineers and data scientists to ensure that data is stored and accessed efficiently. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Database Developer, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in database development and prepare you for a career in this essential field.
Data Visualization Analyst
Data Visualization Analysts use data to create visualizations that communicate insights to stakeholders. They work with data engineers and data scientists to build data pipelines and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Data Visualization Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data visualization and prepare you for a career in this growing field.
Data Governance Analyst
Data Governance Analysts develop and implement data governance policies and procedures for organizations. They work with data engineers and data scientists to ensure that data is used in a consistent and ethical manner. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Data Governance Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data governance and prepare you for a career in this essential field.
Information Security Analyst
Information Security Analysts protect data from unauthorized access, use, disclosure, disruption, modification, or destruction. They work with data engineers and data scientists to ensure that data is stored and accessed securely. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Information Security Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in information security and prepare you for a career in this essential field.
Data Privacy Analyst
Data Privacy Analysts develop and implement data privacy policies and procedures for organizations. They work with data engineers and data scientists to ensure that data is used in a compliant manner. This course provides a comprehensive overview of the skills and knowledge needed to be a successful Data Privacy Analyst, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data privacy and prepare you for a career in this essential field.
Data Science Consultant
Data Science Consultants help organizations to use data to make better decisions. They work with data engineers and data scientists to build data pipelines and develop insights. This course provides a strong foundation in the skills and knowledge needed to be a successful Data Science Consultant, including SQL, Python, and PySpark. With hands-on exercises and real-world examples, this course will help you build a foundation in data science consulting and prepare you for a career in this growing field.

Similar courses

Here are nine courses similar to Data Engineering Essentials using SQL, Python, and PySpark.
Distributed Computing with Spark SQL
Data Engineering using Databricks on AWS and Azure
Getting Started with Apache Spark on Databricks
Introduction to PySpark
Scalable Machine Learning on Big Data using Apache Spark
Data Engineering Capstone Project
Apache Spark for Data Engineering and Machine Learning
Big Data, Hadoop, and Spark Basics
Machine Learning with Apache Spark

© 2016 - 2024 OpenCourser