Data Engineering using Kafka and Spark Structured Streaming

Durga Viswanatha Raju Gadiraju, Naga Bhuwaneshwar, and Kavitha Penmetsa

As part of this course, you will learn to build streaming pipelines by integrating Kafka and Spark Structured Streaming. Let us go through the details of what is covered in the course.

  • First of all, you need a proper environment to build streaming pipelines using Kafka and Spark Structured Streaming on top of Hadoop or another distributed file system. As part of the course, you will start by setting up a self-support lab with all the key components, such as Hadoop, Hive, Spark, and Kafka, on a single-node Linux-based system.

  • Once the environment is set up, you will go through the details of getting started with Kafka. As part of that process, you will create a Kafka topic, produce messages into the topic, and consume messages from it (see the producer and consumer sketch after this list).

  • You will also learn how to use Kafka Connect to ingest data from web server logs into a Kafka topic, as well as from a Kafka topic into HDFS as a sink (a connector sketch appears after the course outline below).

  • Once you understand Kafka from the perspective of data ingestion, you will get an overview of some of the key concepts of Spark Structured Streaming.

  • After learning Kafka and Spark Structured Streaming separately, you will build a streaming pipeline that consumes data from a Kafka topic using Spark Structured Streaming, then processes it and writes it to different targets.

  • You will also learn how to take care of incremental data processing using Spark Structured Streaming.
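
To make the Kafka portion concrete, here is a minimal sketch of producing and consuming messages from Python. It assumes a single-node broker listening on localhost:9092 and uses the third-party kafka-python package; the topic name retail_logs is a hypothetical placeholder, and the course itself drives these steps through the Kafka CLI tools.

```python
# Minimal produce/consume sketch using kafka-python (assumed installed via
# `pip install kafka-python`). Broker address and topic name are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("retail_logs", value=b"sample web server log line")
producer.flush()  # ensure the message actually reaches the broker

consumer = KafkaConsumer(
    "retail_logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.value.decode("utf-8"))
```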

Course Outline

Here is a brief outline of the course. You can choose either Cloud9 or GCP to provision a server to set up the environment.

  • Setting up Environment using AWS Cloud9 or GCP

  • Setup Single Node Hadoop Cluster

  • Setup Hive and Spark on top of Single Node Hadoop Cluster

  • Setup Single Node Kafka Cluster on top of Single Node Hadoop Cluster

  • Getting Started with Kafka

  • Data Ingestion using Kafka Connect - Web server log files as a source to Kafka Topic

  • Data Ingestion using Kafka Connect - Kafka Topic to HDFS as a sink

  • Overview of Spark Structured Streaming

  • Kafka and Spark Structured Streaming Integration

  • Incremental Loads using Spark Structured Streaming
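
Two of the outline items above revolve around Kafka Connect. As a rough illustration of how a source connector is wired up, the sketch below registers a file source connector through the Kafka Connect REST API (assumed to be running on its default port, 8083). The connector class shown ships with Kafka; the log file path and topic name are hypothetical placeholders, and the course may instead use standalone properties files or a different connector.

```python
# Hypothetical sketch: register a FileStreamSource connector via the
# Kafka Connect REST API. File path and topic name are placeholders.
import requests

connector = {
    "name": "web-server-logs-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/opt/gen_logs/logs/access.log",  # placeholder log file path
        "topic": "retail_logs",                   # placeholder Kafka topic
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector)
print(response.status_code, response.json())
```

A sink connector, such as the HDFS 3 sink covered later in the course, is registered the same way, just with a sink connector class and its own connector-specific properties.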

Udemy-based support

In case you run into technical challenges while taking the course, feel free to raise your concerns using Udemy Messenger. We will make sure that the issue is resolved within 48 hours.

What's inside

Learning objectives

  • Setting up a self-support lab with Hadoop (HDFS and YARN), Hive, Spark, and Kafka
  • Overview of Kafka to build streaming pipelines
  • Data ingestion to Kafka topics using Kafka Connect with a file source
  • Data ingestion to HDFS using Kafka Connect with the HDFS 3 Sink Connector plugin
  • Overview of Spark Structured Streaming to process data as part of streaming pipelines
  • Incremental data processing using Spark Structured Streaming with a file source and file target
  • Integration of Kafka and Spark Structured Streaming - reading data from Kafka topics (sketched below)
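
The last objective, reading data from Kafka topics with Spark Structured Streaming, can be pictured as follows. This is a minimal sketch, assuming Spark was launched with the spark-sql-kafka connector package on the classpath and a broker on localhost:9092; the topic name is a hypothetical placeholder, and the console sink is used only for quick inspection.

```python
# Minimal sketch: read a Kafka topic with Spark Structured Streaming and
# print records to the console. Assumes the spark-sql-kafka package is on
# the classpath; broker address and topic name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaStructuredStreamingDemo").getOrCreate()

# Records arrive with binary key/value columns plus topic/partition/offset metadata.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "retail_logs")
    .load()
)

# Cast the raw value to a string before writing it out.
lines = raw.select(col("value").cast("string").alias("log_line"))

query = (
    lines.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```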

Syllabus

Introduction
Introduction to Data Engineering using Kafka and Spark Structured Streaming
Important Note for first time Data Engineering Customers
Important Note for Data Engineering Essentials (Python and Spark) Customers
How to get 30 days of complimentary lab access?
How to access material used for this course?
Getting Started with Kafka
Overview of Kafka
Managing Topics using Kafka CLI
Produce and Consume Messages using CLI
Validate Generation of Web Server Logs
Create Web Server using nc
Produce retail logs to Kafka Topic
Consume retail logs from Kafka Topic
Clean up Kafka CLI Sessions to produce and consume messages
Define Kafka Connect to produce
Validate Kafka Connect to produce
Data Ingestion using Kafka Connect
Overview of Kafka Connect
Define Kafka Connect to Produce Messages
Validate Kafka Connect to produce messages
Cleanup Kafka Connect to produce messages
Write Data to HDFS using Kafka Connect
Setup HDFS 3 Sink Connector Plugin
Overview of Kafka Consumer Groups
Configure HDFS 3 Sink Properties
Run and Validate HDFS 3 Sink
Cleanup Kafka Connect to consume messages
Overview of Spark Structured Streaming
Understanding Streaming Context
Validate Log Data for Streaming
Push log messages to Netcat Webserver
Overview of built-in Input Sources
Reading Web Server logs using Spark Structured Streaming
Overview of Output Modes
Using append as Output Mode
Using complete as Output Mode
Using update as Output Mode
Overview of Triggers in Spark Structured Streaming
Overview of built-in Output Sinks
Previewing the Streaming Data
Kafka and Spark Structured Streaming Integration
Create Kafka Topic
Read Data from Kafka Topic
Preview data using console
Preview data using memory
Transform Data using Spark APIs
Write Data to HDFS using Spark
Validate Data in HDFS using Spark
Write Data to HDFS using Spark using Header
Cleanup Kafka Connect and Files in HDFS
Incremental Loads using Spark Structured Streaming
Overview of Spark Structured Streaming Triggers
Steps for Incremental Data Processing
Create Working Directory in HDFS
Logic to Upload GHArchive Files
Upload GHArchive Files to HDFS
Add new GHActivity JSON Files
Read JSON Data using Spark Structured streaming
Write in Parquet File Format
Analyze GHArchive Data in Parquet files using Spark
Add New GHActivity JSON files
Load Data Incrementally to Target Table
Validate Incremental Load
Using maxFilesPerTrigger and latestFirst
Incremental Load using Archival Process
Setup development environment to learn SQL using Postgres, Python as well as Spark
Getting Started with Cloud9
Creating Cloud9 Environment
Warming up with Cloud9 IDE
Overview of EC2 related to Cloud9
Opening ports for Cloud9 Instance
Associating Elastic IPs to Cloud9 Instance
Increase EBS Volume Size of Cloud9 Instance
Setup Jupyter Lab on Cloud9
[Commands] Setup Jupyter Lab on Cloud9
Setting up Environment - Overview of GCP and Provision Ubuntu VM
Signing up for GCP
Overview of GCP Web Console
Overview of GCP Pricing
Provision Ubuntu VM from GCP
Setup Docker
Validating Python
Setup Jupyter Lab
Setup Jupyter Lab locally on Mac
Setup Single Node Hadoop Cluster
Introduction to Single Node Hadoop Cluster
Material related to setting up the environment
Setup Prerequisites
Setup Password less login
Download and Install Hadoop
Configure Hadoop HDFS
Start and Validate HDFS
Configure Hadoop YARN
Start and Validate YARN
Managing Single Node Hadoop
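
The "Incremental Loads using Spark Structured Streaming" lectures above hinge on the file source options maxFilesPerTrigger and latestFirst. The sketch below shows the general shape of such a job; the HDFS paths are hypothetical placeholders and the schema is inferred from a one-off batch read for brevity, so treat it as an illustration rather than the course's actual notebook code.

```python
# Rough sketch of incremental JSON-to-Parquet loading with Spark Structured
# Streaming. Paths are placeholders; a hand-written schema would be more robust
# than inferring one from a sample batch read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IncrementalLoadsDemo").getOrCreate()

landing = "/user/hadoop/ghactivity/landing"        # placeholder landing directory
target = "/user/hadoop/ghactivity/target"          # placeholder Parquet target
checkpoint = "/user/hadoop/ghactivity/checkpoint"  # placeholder checkpoint directory

schema = spark.read.json(landing).schema  # the file source requires an explicit schema

events = (
    spark.readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 2)   # process at most two new files per micro-batch
    .option("latestFirst", "false")    # pick up the oldest unprocessed files first
    .json(landing)
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", target)
    .option("checkpointLocation", checkpoint)  # tracks which files were already processed
    .outputMode("append")
    .trigger(processingTime="60 seconds")
    .start()
)
query.awaitTermination()
```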

Good to know

Know what's good, what to watch for, and possible dealbreakers:

  • Provides comprehensive knowledge and hands-on experience for building streaming pipelines with Kafka and Spark Structured Streaming
  • Covers real-world use cases, such as data ingestion using Kafka Connect and incremental data processing
  • Instructors have industry experience and are recognized for their expertise in Kafka and Spark
  • Requires setting up a lab with Hadoop, Hive, Spark, and Kafka, which may involve additional effort

Activities

Coming soon: We're preparing activities for Data Engineering using Kafka and Spark Structured Streaming. These are activities you can do before, during, or after a course.

Career center

Learners who complete Data Engineering using Kafka and Spark Structured Streaming will develop knowledge and skills that may be useful to these careers:
Data Engineer
As a Data Engineer, your primary focus is to design, build, and maintain complex data pipelines. Data Engineers are in high demand across a variety of industries due to the increasing amount of data that businesses collect today. Their responsibilities primarily involve building, testing, and deploying data management systems and data pipelines to move and transform data across a variety of data sources and targets. This course may be useful for someone who wishes to become or advance their career as a data engineer, as it will provide a foundation in using Kafka, Spark Structured Streaming, and Hadoop to perform data ingestion, processing, and analysis. These skills are in high demand in the field of data engineering.
Data Analyst
Data Analysts play a vital role in understanding and communicating data insights. Their day-to-day work typically includes working with large datasets, conducting statistical analysis, interpreting results, and communicating these insights to stakeholders in order to inform decision-making. This course may be useful for data analysts who wish to expand their skillset and transition into a role as a data engineer, which combines data analysis with data engineering. This course will help data analysts gain valuable experience in using Kafka, Spark Structured Streaming, and Hadoop to ingest and process data. Data analysts who are interested in working with big data may find this course particularly useful.
Software Engineer
Software Engineers apply engineering principles to the design, development, deployment, and maintenance of software systems. They work on a variety of projects throughout their career, and some focus on specific domains such as data engineering, computer science, or web development. This course may be useful for Software Engineers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Software Engineer more competitive in the job market.
Data Scientist
Data Scientists use scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. They may work on a variety of projects throughout their career, and some focus on specific domains such as NLP, computer vision, or speech recognition. This course may be useful for Data Scientists who want to improve their skills in data engineering. The skills learned in this course will help Data Scientists build data pipelines and perform data processing tasks more efficiently.
Business Intelligence Analyst
Business Intelligence Analysts focus on using data to make better business decisions. They collect, analyze, interpret, and present data in order to help businesses understand their performance and make better decisions. This course may be useful for Business Intelligence Analysts who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Business Intelligence Analyst more competitive in the job market.
Database Administrator
Database Administrators are responsible for the installation, configuration, maintenance, and performance of database systems. Their duties include designing, implementing, and managing databases, as well as ensuring that data is secure and accessible. This course may be helpful for someone who wishes to become or advance their career as a database administrator, as it will provide a foundation in using Kafka, Spark Structured Streaming, and Hadoop to work with data at scale. These skills are in high demand in the field of database administration.
DevOps Engineer
DevOps Engineers focus on bridging the gap between development and operations teams. They work on a variety of projects throughout their career, and some focus on specific domains such as data engineering, software development, or cloud computing. This course may be useful for DevOps Engineers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a DevOps Engineer more competitive in the job market.
Cloud Engineer
Cloud Engineers focus on designing, building, and maintaining cloud-based systems. They work on a variety of projects throughout their career, and some focus on specific domains such as data engineering, cloud computing, or networking. This course may be useful for Cloud Engineers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Cloud Engineer more competitive in the job market.
Data Governance Analyst
Data Governance Analysts focus on developing and implementing data governance policies and procedures. Their responsibilities include ensuring that data is used in a consistent and ethical manner, and that data is protected from unauthorized access. This course may be useful for Data Governance Analysts who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Data Governance Analyst more competitive in the job market.
Machine Learning Engineer
Machine Learning Engineers focus on designing, building, and maintaining machine learning models. They work on a variety of projects throughout their career, and some focus on specific domains such as data engineering, machine learning, or artificial intelligence. This course may be useful for Machine Learning Engineers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Machine Learning Engineer more competitive in the job market.
Data Architect
Data Architects focus on designing and managing data systems. Their responsibilities include designing data models, developing data management strategies, and ensuring that data is accessible and secure. This course may be useful for Data Architects who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Data Architect more competitive in the job market.
Information Security Analyst
Information Security Analysts focus on protecting data and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction. Their responsibilities include developing and implementing security policies and procedures, and monitoring for and responding to security breaches. This course may be useful for Information Security Analysts who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make an Information Security Analyst more competitive in the job market.
Product Manager
Product Managers focus on defining the vision, roadmap, and features of a product. Their responsibilities include working with engineers, designers, and other stakeholders to bring a product to market. This course may be useful for Product Managers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Product Manager more competitive in the job market.
Technical Writer
Technical Writers focus on creating user manuals, technical documentation, and other written materials. Their responsibilities include gathering and organizing technical information, and writing clear and concise documentation. This course may be useful for Technical Writers who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Technical Writer more competitive in the job market.
Quality Assurance Analyst
Quality Assurance Analysts focus on testing and validating software applications to ensure that they meet quality standards. Their responsibilities include developing and executing test plans, and reporting on the results of testing. This course may be useful for Quality Assurance Analysts who are interested in learning more about data engineering. Kafka, Spark Structured Streaming, and Hadoop are popular tools in the field of data engineering, and gaining proficiency in these tools will make a Quality Assurance Analyst more competitive in the job market.

Reading list

We've selected seven books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Data Engineering using Kafka and Spark Structured Streaming.
Essential reading for anyone who wants to learn how Kafka works and how to use it to build and operate real-world streaming applications. It covers everything from the basics of Kafka's architecture and APIs to advanced topics such as performance tuning and security.
Comprehensive guide to using Spark, a popular engine for large-scale data processing. It covers everything from the basics of Spark to advanced topics such as machine learning and graph processing.
Comprehensive guide to using Spark, a popular engine for large-scale data processing. It covers everything from the basics of Spark to advanced topics such as machine learning and graph processing.
Practical guide to building and operating streaming data pipelines using Apache Spark. It covers everything from the basics of Spark to advanced topics such as performance tuning and security.
Provides a comprehensive overview of big data analytics, including its concepts, tools, and techniques. It is a good choice for anyone who wants to learn more about big data analytics and how to use it for data-driven decision-making.
Provides a practical introduction to machine learning with Apache Spark, including its algorithms, techniques, and use cases. It is a good choice for anyone who wants to learn how to use Spark for machine learning.
Provides a comprehensive overview of Apache Hadoop, including its architecture, components, and use cases. It is a good choice for anyone who wants to learn more about Hadoop and how to use it for data storage and processing.

Similar courses

Here are nine courses similar to Data Engineering using Kafka and Spark Structured Streaming.
  • Processing Streaming Data Using Apache Spark Structured...
  • Streaming API Development and Documentation
  • Structured Streaming in Apache Spark 2
  • Windowing and Join Operations on Streaming Data with...
  • Big Data, Hadoop, and Spark Basics
  • Cloud Computing Applications, Part 2: Big Data and...
  • Apache Spark for Data Engineering and Machine Learning
  • Conceptualizing the Processing Model for Apache Spark...
  • Machine Learning with Apache Spark