We may earn an affiliate commission when you visit our partners.
Janani Ravi

Much real-world data arrives as streams, from self-driving car sensors to weather monitors. Apache Spark 2 is a strong analytics engine with first-class support for streaming operations using micro-batch and continuous processing.

Structured Streaming in Spark 2 is a unified model that treats batch as a prefix of stream. This allows Spark to perform the same operations on streaming data as on batch data, and Spark takes care of the details involved in incrementalizing the batch operation to work on streams.
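
The "batch as a prefix of stream" idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark's API: the names `batch_word_count` and `IncrementalWordCount` are hypothetical, and the word-count query is only an assumed example. It shows a batch query being incrementalized so that, after every micro-batch, the running result matches what the batch query would return over all input seen so far.

```python
from collections import Counter

def batch_word_count(lines):
    """Batch query: count words over a complete, finite list of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

class IncrementalWordCount:
    """Streaming version (hypothetical): keeps running state and folds in each micro-batch."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        # Only the new rows are read; earlier input is never reprocessed.
        self.state.update(batch_word_count(micro_batch))
        return self.state

micro_batches = [
    ["spark streams data", "spark batches data"],
    ["streams never end"],
    ["spark spark spark"],
]

query = IncrementalWordCount()
seen = []
for micro_batch in micro_batches:
    seen.extend(micro_batch)
    result = query.process(micro_batch)
    # The incremental result equals the batch result over the prefix seen so far.
    assert result == batch_word_count(seen)

final_counts = dict(query.state)
```

In Spark itself this bookkeeping is handled by the engine; the sketch only makes the equivalence between the streaming result and the batch result over a prefix concrete.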

In this course, Conceptualizing the Processing Model for Apache Spark Structured Streaming, you will use the DataFrame API as well as Spark SQL to run queries on streaming sources and write results out to data sinks.

First, you will be introduced to streaming DataFrames in Spark 2 and learn how Structured Streaming in Spark 2 differs from the Spark Streaming API available in earlier versions of Spark. You will also get a high-level understanding of how Spark’s architecture works, and the role of drivers, workers, executors, and tasks.

Next, you will execute queries on streaming data from a socket source as well as a file system source. You will perform basic operations on streaming data using DataFrames and register your data as a temporary view to run SQL queries on input streams. You will explore the append, complete, and update modes to write data out to sinks. You will then understand how scheduling and checkpointing work in Spark and explore the differences between the micro-batch mode of execution and the new experimental continuous processing mode that Spark offers.
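
The difference between the three output modes is easier to see on a tiny result table. The sketch below is a plain-Python model under stated assumptions, not Spark code: `run_trigger`, `complete_output`, `update_output`, and `append_output` are hypothetical helpers, and the running word count is an assumed query. Complete mode writes the whole result table on each trigger, update mode writes only the rows that changed, and append mode writes only newly added rows that will not change later.

```python
from collections import Counter

def run_trigger(state, micro_batch):
    """Fold one micro-batch of words into the running counts.

    Returns the set of keys whose count changed in this trigger.
    """
    state.update(micro_batch)
    return set(micro_batch)

def complete_output(state):
    # Complete mode: the whole result table is written on every trigger.
    return dict(state)

def update_output(state, changed):
    # Update mode: only rows that changed since the last trigger are written.
    return {key: state[key] for key in changed}

def append_output(new_rows):
    # Append mode: only rows newly added to the result table are written.
    # Modeled here on a non-aggregating query, where emitted rows never change.
    return list(new_rows)

state = Counter()
run_trigger(state, ["a", "b", "a"])       # first trigger
changed = run_trigger(state, ["b", "c"])  # second trigger

complete_rows = complete_output(state)
update_rows = update_output(state, changed)
append_rows = append_output(word.upper() for word in ["b", "c"])
```

After the second trigger, `complete_rows` holds all three keys, `update_rows` holds only `b` and `c`, and `append_rows` holds just the two transformed rows from that trigger.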

Finally, you will discuss the Tungsten engine optimizations that make Spark 2 so much faster than Spark 1, and discuss the stages of optimization in the Catalyst optimizer, which works with SQL queries.

At the end of this course, you will be able to build and execute streaming queries on input data, write these out to reliable storage using different output modes, and checkpoint your streaming applications for fault tolerance and recovery.
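
Checkpointing for recovery can be sketched in a few lines of plain Python. This is a simplified model, not Spark's implementation: the JSON file layout and the helpers `load_checkpoint`, `save_checkpoint`, and `run_query` are assumptions made for illustration. The idea shown is that the query records its source offset and running state after each micro-batch, so a restarted query resumes where the previous run stopped instead of starting over.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return (next_offset, running_total), or a fresh start if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        saved = json.load(f)
    return saved["offset"], saved["total"]

def save_checkpoint(path, offset, total):
    # Write to a temp file and rename, so a crash never leaves a half-written checkpoint.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset, "total": total}, f)
    os.replace(tmp_path, path)

def run_query(source, path, max_batches=None):
    """Sum the source one record per micro-batch, checkpointing after each one."""
    offset, total = load_checkpoint(path)
    processed = 0
    while offset < len(source):
        if max_batches is not None and processed >= max_batches:
            break  # simulate the application stopping mid-stream
        total += source[offset]
        offset += 1
        save_checkpoint(path, offset, total)
        processed += 1
    return total

source = [5, 10, 15, 20]
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

partial_total = run_query(source, checkpoint_path, max_batches=2)  # stops after two batches
recovered_total = run_query(source, checkpoint_path)               # restart resumes at offset 2
```

The second call does not re-read the first two records; it picks up the saved offset and total. In Spark the checkpoint directory is supplied through the `checkpointLocation` option on the stream writer, and the engine manages the contents.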

What's inside

Syllabus

Course Overview
Getting Started with Structured Streaming
Executing Streaming Queries
Understanding Scheduling and Checkpointing
Configuring Processing Models
Understanding Query Planning

Good to know

Know what's good, what to watch for, and possible dealbreakers
Teaches streaming data analysis using Apache Spark 2, which is widely used in industry
Instructor Janani Ravi is recognized for work on Spark Streaming and Structured Streaming
Builds a strong foundation for learners new to structured streaming

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Conceptualizing the Processing Model for Apache Spark Structured Streaming with these activities:
Review the concepts of stream processing
Reviewing the concepts of stream processing will help you understand the purpose and benefits of using Apache Spark Structured Streaming.
  • Read articles or books about stream processing.
  • Watch videos or tutorials about stream processing.
  • Complete practice exercises on stream processing.
Practice writing SQL queries
Practicing writing SQL queries will help you refresh your skills and prepare for using SQL in Apache Spark Structured Streaming.
  • Find a dataset and practice writing SQL queries to extract and analyze data.
  • Use an online SQL editor or IDE to practice writing and executing SQL queries.
  • Complete practice exercises or quizzes on SQL.
Organize notes, assignments, and materials
Organizing materials will help you identify gaps in your knowledge and prepare for the course.
  • Gather all course materials, including notes, assignments, and exams.
  • Review materials and identify areas where you need to focus your studies.
  • Create a study plan that outlines your goals and how you will achieve them.
Follow tutorials on Apache Spark Structured Streaming
Following tutorials will help you learn the basics of Apache Spark Structured Streaming and how to use it to process streaming data.
  • Find tutorials on Apache Spark Structured Streaming.
  • Follow the tutorials and complete the exercises.
  • Apply what you learn in the tutorials to your own projects.
Join a study group or online forum for Apache Spark Structured Streaming
Joining a study group or online forum will allow you to connect with other learners and discuss Apache Spark Structured Streaming.
  • Find a study group or online forum for Apache Spark Structured Streaming.
  • Introduce yourself and share your goals.
  • Participate in discussions and ask questions.
Solve practice problems on Apache Spark Structured Streaming
Solving practice problems will help you test your understanding of Apache Spark Structured Streaming and identify areas where you need more practice.
  • Find practice problems on Apache Spark Structured Streaming.
  • Solve the practice problems.
  • Review your answers and identify areas where you need more practice.
Build a streaming data application
Building a streaming data application will help you apply the concepts you learn in the course to a real-world scenario.
  • Choose a data source and a streaming platform.
  • Design the architecture of your application.
  • Implement the application using the Structured Streaming API.
  • Test and deploy your application.
Attend a workshop on Apache Spark Structured Streaming
Attending a workshop will allow you to learn from experts and get hands-on experience with Apache Spark Structured Streaming.
  • Find a workshop on Apache Spark Structured Streaming.
  • Register for the workshop.
  • Attend the workshop and participate in the activities.
Write a blog post or article on Apache Spark Structured Streaming
Writing a blog post or article will help you solidify your understanding of Apache Spark Structured Streaming and share your knowledge with others.
  • Choose a topic related to Apache Spark Structured Streaming.
  • Research the topic and gather information.
  • Write your blog post or article.
  • Publish your blog post or article online.

Career center

Learners who complete Conceptualizing the Processing Model for Apache Spark Structured Streaming will develop knowledge and skills that may be useful to these careers:
Data Engineer
Data Engineers are responsible for designing, building, and maintaining data pipelines, and may also work on data quality and data governance. This course may help someone in this role by teaching how to build and execute streaming queries on input data, which could help them build more efficient pipelines. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade data pipelines.
Data Scientist
Data Scientists are responsible for developing and applying statistical and machine learning models to data to extract insights and make predictions. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient data pipelines for their models. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their models.
Database Administrator
Database Administrators are responsible for managing and maintaining databases. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their databases. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their databases.
Data Architect
Data Architects are responsible for designing and building data architectures. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their data architectures. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their data architectures.
Software Engineer
Software Engineers are responsible for designing, developing, and maintaining software applications. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their applications. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade applications.
Data Analyst
Data Analysts are responsible for analyzing data to extract meaningful insights, and may also design and build data processing pipelines. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines.
Cloud Engineer
Cloud Engineers are responsible for designing, building, and maintaining cloud computing solutions. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their cloud applications. Additionally, this course covers how to checkpoint streaming applications for fault tolerance and recovery, which is an important consideration for production-grade cloud applications.
Business Analyst
Business Analysts are responsible for analyzing business data to identify opportunities for improvement. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their business analysis. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their business analysis.
Product Manager
Product Managers are responsible for managing and developing products. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their products. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their products.
Machine Learning Engineer
Machine Learning Engineers are responsible for developing and deploying machine learning models. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their machine learning models. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their machine learning models.
Statistician
Statisticians are responsible for collecting, analyzing, and interpreting data. This course may help someone in this role by teaching how to execute queries on streaming data and write results to data sinks, which could help them build more efficient pipelines for their statistical analysis. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their statistical analysis.
Data Privacy Officer
Data Privacy Officers are responsible for protecting the privacy of an organization's data. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data privacy policies. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data privacy policies.
Chief Data Officer
Chief Data Officers are responsible for overseeing the data strategy for an organization. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data strategy. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data strategy.
Data Science Manager
Data Science Managers are responsible for leading and managing data science teams. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their teams. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their teams' work.
Data Governance Officer
Data Governance Officers are responsible for developing and implementing data governance policies. This course may help someone in this role by teaching how to build and execute streaming queries on input data and write results to data sinks, which could help them build more efficient pipelines for their organization's data governance policies. Additionally, this course covers how to configure processing models and understand query planning, which could help them optimize the performance of their organization's data governance policies.

Similar courses

Here are nine courses similar to Conceptualizing the Processing Model for Apache Spark Structured Streaming.
Structured Streaming in Apache Spark 2
Getting Started with Stream Processing with Spark...
Modeling Streaming Data for Processing with Apache Spark...
Getting Started with Apache Spark on Databricks
Handling Batch Data with Apache Spark on Databricks
Apache Spark Fundamentals
Data Engineering Essentials using SQL, Python, and PySpark
Building Realtime Pipelines in Cloud Data Fusion
Apache Spark 3 Fundamentals