Conceptualizing the Processing Model for Apache Spark Structured Streaming from Pluralsight

Much real-world data arrives as streams, from self-driving car sensors to weather monitors. Apache Spark 2 is a strong analytics engine with first-class support for streaming operations using micro-batch and continuous processing.

Structured Streaming in Spark 2 is a unified model that treats batch as a prefix of stream. This allows Spark to perform the same operations on streaming data as on batch data, and Spark takes care of the details involved in incrementalizing the batch operation to work on streams.
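The "batch as a prefix of stream" idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark code: the names `batch_word_count` and `IncrementalWordCount` are hypothetical, and the word-count query is only an assumed example. It shows that folding micro-batches into running state yields, after every micro-batch, the same answer a batch query would give on all the data seen so far.

```python
from collections import Counter
from typing import Iterable, List


def batch_word_count(lines: Iterable[str]) -> Counter:
    """Batch query: count words over a complete, bounded input."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Hypothetical incremental form of the same query.

    Keeps running state and folds each micro-batch into it, which is a
    toy stand-in for what an engine does when it incrementalizes a query.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process(self, micro_batch: List[str]) -> Counter:
        # Reuse the batch logic on just the new data, then merge into state.
        self.state.update(batch_word_count(micro_batch))
        return Counter(self.state)


# An assumed stream, delivered as three micro-batches of text lines.
stream = [["spark streams", "spark"], ["streams of data"], ["spark data"]]

query = IncrementalWordCount()
seen: List[str] = []
for micro_batch in stream:
    seen.extend(micro_batch)
    result = query.process(micro_batch)
    # The streaming result always equals the batch result on the prefix seen so far.
    assert result == batch_word_count(seen)
```

The sketch only covers a simple aggregation; it says nothing about how Spark plans or executes the incremental version internally.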

In this course, Conceptualizing the Processing Model for Apache Spark Structured Streaming, you will use the DataFrame API as well as Spark SQL to run queries on streaming sources and write results out to data sinks.

First, you will be introduced to streaming DataFrames in Spark 2 and learn how Structured Streaming in Spark 2 differs from the Spark Streaming available in earlier versions of Spark. You will also get a high-level understanding of how Spark’s architecture works, and the role of drivers, workers, executors, and tasks.

Next, you will execute queries on streaming data from a socket source as well as a file system source. You will perform basic operations on streaming data using DataFrames and register your data as a temporary view to run SQL queries on input streams. You will explore the append, complete, and update modes to write data out to sinks. You will then understand how scheduling and checkpointing work in Spark and explore the differences between the micro-batch mode of execution and the new experimental continuous processing mode that Spark offers.
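The difference between the three output modes is easier to see on a toy model. The sketch below is plain Python rather than Spark, and the helpers `aggregate_batch` and `append_batch` are hypothetical names introduced for illustration. It assumes the usual meaning of the modes: complete writes the whole result table on every trigger, update writes only the rows that changed in that trigger, and append writes only newly added rows that will not change later (modelled here with a simple filter query instead of an aggregation).

```python
from collections import Counter
from typing import Callable, Dict, List


def aggregate_batch(state: Counter, words: List[str], mode: str) -> Dict[str, int]:
    """Fold one micro-batch into running counts and return what a sink would receive."""
    changed = Counter(words)
    state.update(changed)
    if mode == "complete":
        # Complete: the entire result table is written on every trigger.
        return dict(state)
    if mode == "update":
        # Update: only the keys touched by this micro-batch are written.
        return {word: state[word] for word in changed}
    raise ValueError(f"unsupported mode for this toy aggregation: {mode}")


def append_batch(rows: List[str], keep: Callable[[str], bool]) -> List[str]:
    """Append: a non-aggregating query emits only the new rows, each exactly once."""
    return [row for row in rows if keep(row)]


# Two assumed micro-batches of words.
batches = [["a", "b", "a"], ["b", "c"]]

complete_state, update_state = Counter(), Counter()
complete_out = [aggregate_batch(complete_state, b, "complete") for b in batches]
update_out = [aggregate_batch(update_state, b, "update") for b in batches]
append_out = [append_batch(b, lambda w: w != "a") for b in batches]

for trigger, (c, u, a) in enumerate(zip(complete_out, update_out, append_out)):
    print(f"trigger {trigger}: complete={c} update={u} append={a}")
```

In this model the second trigger's complete output still contains the unchanged count for "a", while the update output omits it.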

Finally, you will discuss the Tungsten engine optimizations that make Spark 2 so much faster than Spark 1, as well as the stages of optimization in the Catalyst optimizer, which works with SQL queries.

At the end of this course, you will be able to build and execute streaming queries on input data, write these out to reliable storage using different output modes, and checkpoint your streaming applications for fault tolerance and recovery.
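Checkpointing for recovery can be sketched in a few lines as well. The example below is a simplified plain-Python model, not Spark's checkpoint format or API: the function `run`, the JSON file layout, and the `max_batches` "crash" switch are all assumptions made for illustration. It commits the source offset and the running result after each micro-batch, so a restarted run resumes where the previous one stopped instead of reprocessing from the beginning.

```python
import json
import os
import tempfile
from typing import List


def run(source: List[int], checkpoint_path: str, batch_size: int, max_batches: int) -> int:
    """Sum `source` in micro-batches, committing progress after each one.

    Stops after `max_batches` micro-batches to simulate a crash.
    Returns the running total at the point the run stopped.
    """
    offset, total = 0, 0
    if os.path.exists(checkpoint_path):
        # Recovery: restore the last committed offset and state.
        with open(checkpoint_path) as f:
            saved = json.load(f)
        offset, total = saved["offset"], saved["total"]

    batches = 0
    while offset < len(source) and batches < max_batches:
        batch = source[offset:offset + batch_size]
        total += sum(batch)
        offset += len(batch)
        # Write to a temporary file, then rename, so a checkpoint is never half-written.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"offset": offset, "total": total}, f)
        os.replace(tmp_path, checkpoint_path)
        batches += 1
    return total


source = list(range(1, 11))  # the numbers 1..10
with tempfile.TemporaryDirectory() as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    partial = run(source, path, batch_size=3, max_batches=2)     # stops after 1..6
    recovered = run(source, path, batch_size=3, max_batches=10)  # resumes at offset 6
```

Because both the offset and the running total are committed together, the second run neither skips nor double-counts any input in this model.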

What's inside

Syllabus

Course Overview

Getting Started with Structured Streaming

Executing Streaming Queries

Understanding Scheduling and Checkpointing

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers.

Teaches streaming data analysis using Apache Spark 2, which is widely used in industry

Instructor Janani Ravi is recognized for work in Spark Streaming and Structured Streaming

Builds a strong foundation for learners new to structured streaming

Reviews summary

Conceptual understanding of Spark Structured Streaming

Learners say this course provides an excellent conceptual foundation for Apache Spark Structured Streaming, particularly appreciating its clear explanations of the micro-batch and continuous processing models, along with insights into the Tungsten engine and Catalyst optimizer. Many find the instructor's explanations to be clear and concise, making complex topics like scheduling and checkpointing easy to digest. However, a notable portion of students indicate it's highly theoretical, wishing for more hands-on coding examples and real-world implementation scenarios. While it excels at the 'why', those seeking 'how-to-code' for advanced production use cases may find it requires supplementary practical resources.

Benefits from existing Spark/programming basics.

"It assumes a basic understanding of Spark and Scala/Python which wasn't explicitly stated upfront. I had to brush up on some basics."

"I found myself needing prior Spark and programming knowledge to keep up with the pace and depth."

"I recommend having familiarity with core Spark concepts before diving into this course."

Instructor makes complex topics easily understandable.

"The instructor's explanations were clear and concise, making complex topics like checkpointing and query planning easy to digest."

"I loved how this course broke down the complexities of Spark Structured Streaming... The instructor made the content engaging."

"The explanation of Tungsten and Catalyst was exceptionally clear."

Offers deep insights into Spark's streaming internals.

"This course provided an excellent conceptual foundation for Spark Structured Streaming."

"Very informative for understanding how Spark Structured Streaming works under the hood."

"I needed to solidify my understanding of Spark Structured Streaming's processing model, and this course delivered perfectly."

Strong on theory, but limited practical coding examples.

"I would have liked more hands-on coding examples beyond the basic DataFrame operations, perhaps some more complex joins or stateful operations."

"The course covers the processing model well conceptually, but it felt a bit too theoretical for me. I was hoping for more practical exercises and real-world scenarios."

"While the course provides good conceptual insight, I found the lack of extensive practical examples limiting. For a topic like Spark, I prefer learning by doing."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Conceptualizing the Processing Model for Apache Spark Structured Streaming with these activities:

Review the concepts of stream processing

Reviewing the concepts of stream processing will help you understand the purpose and benefits of using Apache Spark Structured Streaming.

Read articles or books about stream processing.
Watch videos or tutorials about stream processing.
Complete practice exercises on stream processing.

Practice writing SQL queries

Practicing writing SQL queries will help you refresh your skills and prepare for using SQL in Apache Spark Structured Streaming.

Find a dataset and practice writing SQL queries to extract and analyze data.
Use an online SQL editor or IDE to practice writing and executing SQL queries.
Complete practice exercises or quizzes on SQL.

Organize notes, assignments, and materials

Organizing materials will help you identify gaps in your knowledge and prepare for the course.

Gather all course materials, including notes, assignments, and exams.
Review materials and identify areas where you need to focus your studies.
Create a study plan that outlines your goals and how you will achieve them.

Follow tutorials on Apache Spark Structured Streaming

Following tutorials will help you learn the basics of Apache Spark Structured Streaming and how to use it to process streaming data.

Find tutorials on Apache Spark Structured Streaming.
Follow the tutorials and complete the exercises.
Apply what you learn in the tutorials to your own projects.

Join a study group or online forum for Apache Spark Structured Streaming

Joining a study group or online forum will allow you to connect with other learners and discuss Apache Spark Structured Streaming.

Find a study group or online forum for Apache Spark Structured Streaming.
Introduce yourself and share your goals.
Participate in discussions and ask questions.

Solve practice problems on Apache Spark Structured Streaming

Solving practice problems will help you test your understanding of Apache Spark Structured Streaming and identify areas where you need more practice.

Find practice problems on Apache Spark Structured Streaming.
Solve the practice problems.
Review your answers and identify areas where you need more practice.

Build a streaming data application

Building a streaming data application will help you apply the concepts you learn in the course to a real-world scenario.

Choose a data source and a streaming platform.
Design the architecture of your application.
Implement the application using the Spark Structured Streaming API.
Test and deploy your application.

Attend a workshop on Apache Spark Structured Streaming

Attending a workshop will allow you to learn from experts and get hands-on experience with Apache Spark Structured Streaming.

Find a workshop on Apache Spark Structured Streaming.
Register for the workshop.
Attend the workshop and participate in the activities.

Write a blog post or article on Apache Spark Structured Streaming

Writing a blog post or article will help you solidify your understanding of Apache Spark Structured Streaming and share your knowledge with others.

Choose a topic related to Apache Spark Structured Streaming.
Research the topic and gather information.
Write your blog post or article.
Publish your blog post or article online.

Career center

Learners who complete Conceptualizing the Processing Model for Apache Spark Structured Streaming will develop knowledge and skills that may be useful to these careers:

Data Engineer

Data Engineers are responsible for designing, building, and maintaining data pipelines, and may also work on data quality and data governance. This course may help someone in this role by teaching how to build and execute streaming queries on input data, which could help them build more efficient pipelines. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade data pipelines.

Data Scientist

Data Scientists are responsible for developing and applying statistical and machine learning models to data to extract insights and make predictions. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient data pipelines for their models. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their models.

Database Administrator

Database Administrators are responsible for managing and maintaining databases. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their databases. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their databases.

Cloud Engineer

Cloud Engineers are responsible for designing, building, and maintaining cloud computing solutions. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their cloud applications. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade cloud applications.

Data Architect

Data Architects are responsible for designing and building data architectures. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their data architectures. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their data architectures.

Software Engineer

Software Engineers are responsible for designing, developing, and maintaining software applications. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their applications. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade applications.

Data Analyst

Data Analysts are responsible for analyzing data to extract meaningful insights, and may also design and build data processing pipelines. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines.

Business Analyst

Business Analysts are responsible for analyzing business data to identify opportunities for improvement. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their business analysis. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their business analysis.

Product Manager

Product Managers are responsible for managing and developing products. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their products. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their products.

Statistician

Statisticians are responsible for collecting, analyzing, and interpreting data. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their statistical analysis. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their statistical analysis.

Machine Learning Engineer

Machine Learning Engineers are responsible for developing and deploying machine learning models. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their machine learning models. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their machine learning models.

Data Science Manager

Data Science Managers are responsible for leading and managing data science teams. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their teams. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their teams' work.

Chief Data Officer

Chief Data Officers are responsible for overseeing the data strategy for an organization. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data strategy. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data strategy.

Data Governance Officer

Data Governance Officers are responsible for developing and implementing data governance policies. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data governance policies. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data governance policies.

Data Privacy Officer

Data Privacy Officers are responsible for protecting the privacy of an organization's data. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data privacy policies. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data privacy policies.
