We may earn an affiliate commission when you visit our partners.
Brooke Wenig and Conor Murphy

This course is all about big data. It’s for students with SQL experience who want to take the next step on their data journey by learning distributed computing using Apache Spark. Students will gain a thorough understanding of this open-source standard for working with large datasets, along with the fundamentals of data analysis using SQL on Spark, setting the foundation for combining data with advanced analytics at scale and in production environments. The four modules build on one another, and by the end of the course you will understand the Spark architecture, queries within Spark, common ways to optimize Spark SQL, and how to build reliable data pipelines.

The first module introduces Spark and the Databricks environment, including how Spark distributes computation and Spark SQL. Module 2 covers the core concepts of Spark such as storage vs. compute, caching, partitions, and troubleshooting performance issues via the Spark UI. It also covers new features in Apache Spark 3.x such as Adaptive Query Execution. The third module focuses on Engineering Data Pipelines, including connecting to databases, schemas and data types, file formats, and writing reliable data. The final module covers data lakes, data warehouses, and lakehouses. Students build production-grade data pipelines by combining Spark with the open-source project Delta Lake. By the end of this course, students will have honed their SQL and distributed computing skills to become more adept at advanced analysis and to set the stage for transitioning to more advanced analytics as Data Scientists.

What's inside

Syllabus

Introduction to Spark
In this module, you will be able to discuss the core concepts of distributed computing and be able to recognize when and where to apply them. You'll be able to identify the basic data structure of Apache Spark™, known as a DataFrame. Additionally, you will use the collaborative Databricks workspace and write SQL code that executes against a cluster of machines.
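As a rough illustration of the distributed computing idea this module introduces, the sketch below mimics what a cluster does for a query such as `SELECT country, COUNT(*) FROM sales GROUP BY country`: the rows are split into partitions, each partition is counted independently, and the partial results are merged. This is plain Python rather than Spark code. The threads merely stand in for cluster workers, and the function names and sample rows are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_partition(partition):
    """Partial aggregate: count rows per key within a single partition."""
    return Counter(key for key, _ in partition)


def distributed_count(rows, num_partitions=4):
    """Count rows per key by aggregating each partition, then merging."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for workers; a real cluster would use separate machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: add up the partial counts
    return dict(total)


# Hypothetical (country, amount) rows.
rows = [("US", 10), ("DE", 5), ("US", 7), ("FR", 3), ("DE", 1), ("US", 2)]
counts = distributed_count(rows, num_partitions=3)
print(counts)
```

The merge step works because counts from separate partitions can simply be added together, which is why this kind of aggregation parallelizes well.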
Spark Core Concepts
In this module, you will be able to explain the core concepts of Spark. You will learn common ways to increase query performance by caching data and modifying Spark configurations. You will also use the Spark UI to analyze performance and identify bottlenecks, as well as optimize queries with Adaptive Query Execution.
Engineering Data Pipelines
In this module, you will be able to identify and discuss the general demands of data applications. You'll be able to access data in a variety of formats and compare and contrast the tradeoffs between these formats. You will explore and examine semi-structured JSON data (common in big data environments) as well as schemas and parallel data writes. You will be able to create an end-to-end pipeline that reads data, transforms it, and saves the result.
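The read, transform, and save steps described above can be sketched in a few lines. The example below is a minimal stand-in written in plain Python, not Spark code: it parses semi-structured JSON lines, applies an explicit schema, and writes CSV text. The schema, field names, and sample records are assumptions made up for illustration.

```python
import csv
import io
import json

# Hypothetical schema: field name -> type used to cast each value.
SCHEMA = {"user_id": int, "country": str, "amount": float}


def read_json_lines(text):
    """Read: parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def apply_schema(record, schema):
    """Transform: keep only schema fields and cast them; missing fields become None."""
    return {
        name: cast(record[name]) if record.get(name) is not None else None
        for name, cast in schema.items()
    }


def write_csv(records, schema):
    """Save: serialize the typed records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(schema), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two sample records: one with an extra nested field, one with a missing field.
raw = "\n".join([
    '{"user_id": "1", "country": "US", "amount": "9.50", "extra": {"tag": "a"}}',
    '{"user_id": 2, "country": "DE"}',
])

records = [apply_schema(r, SCHEMA) for r in read_json_lines(raw)]
output = write_csv(records, SCHEMA)
print(output)
```

Declaring the schema up front, instead of inferring it from the data, makes the handling of missing or unexpected fields explicit. That is one of the tradeoffs between semi-structured and strictly typed formats.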
Data Lakes, Warehouses and Lakehouses
In this module, you will identify the key characteristics of data lakes, data warehouses, and lakehouses. Lakehouses combine the scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses. You will build a production-grade lakehouse by combining Spark with the open-source project Delta Lake. Whoever said time travel isn't possible hasn't been to a lakehouse!
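The "time travel" mentioned above refers to querying an earlier version of a table. The toy model below shows the general idea of a versioned table backed by an append-only commit log. It is a conceptual sketch only, not Delta Lake's implementation or API, and the class and method names are hypothetical.

```python
class VersionedTable:
    """Toy append-only commit log; each commit produces a new table version."""

    def __init__(self):
        self._log = []  # one (added_rows, deleted_keys) entry per version

    def commit(self, rows_to_add=(), keys_to_delete=()):
        """Record one change set as a single unit and return its version number."""
        self._log.append((dict(rows_to_add), frozenset(keys_to_delete)))
        return len(self._log) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` (latest if None) to rebuild the table."""
        if version is None:
            version = len(self._log) - 1
        if not -1 <= version < len(self._log):
            raise ValueError(f"unknown version: {version}")
        state = {}
        for added, deleted in self._log[: version + 1]:
            for key in deleted:
                state.pop(key, None)
            state.update(added)
        return state


table = VersionedTable()
v0 = table.commit(rows_to_add={1: "alice", 2: "bob"})
v1 = table.commit(rows_to_add={3: "carol"}, keys_to_delete=[2])

latest = table.snapshot()             # current state of the table
earlier = table.snapshot(version=v0)  # "time travel" to the first commit
print(latest, earlier)
```

Because earlier log entries are never rewritten, any past version can be rebuilt by replaying the log up to that point.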

Good to know

Know what's good, what to watch for, and possible dealbreakers
Teaches distributed computing using Apache Spark, which is highly relevant in industry
Focuses on engineering data pipelines, which is a core skill for data scientists
Builds a strong foundation for learners with SQL experience
Taught by Brooke Wenig and Conor Murphy, who are recognized for their work in big data
Provides hands-on labs and interactive materials
Requires learners to have SQL experience, which may limit accessibility

Reviews summary

Spark SQL and distributed computing

Learners say this course is largely positive and a great introduction to distributed computing with Spark SQL. Using hands-on assignments in Databricks, you'll learn key concepts, including partitions, machine learning queries, and the Delta data lake. While some learners enjoyed the experience, others found the material to be superficial and lacking depth. They also mentioned that the course focused more on Databricks than on Spark SQL and distributed computing.
The hands-on assignments in Databricks were well-received.
"By checking discussion forum, I can see that both instructors check and provide helps to people who posted the questions."
"I highly recommend this course."
"I got started with course and learnt basic concepts, dos and don'ts."
For beginners, learners found the course to be a well-structured and engaging introduction.
"It's worth reviewing the materials multiple times even after getting the certificate."
"Highly recommended."
"The videos are informative and easy to follow, and they did a good job."
Learners felt that the course lacked depth and practical application.
"I believe the course is more focused on the use of a specific application."
"Some of the videos were lengthy and lacked sufficient illustrative tools."
"This courses is focus in data engineering instead of data science."
The course emphasizes Databricks and its features, which some learners found helpful and others found overwhelming.
"The course assumed that the learner knows a lot of things about apache spark beforehand."
"It did not explain concepts very deeply, given very brief overview in a very haste manner."
"There should have been given some solid understanding of concepts."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Distributed Computing with Spark SQL with these activities:
Read 'Spark: The Definitive Guide'
Expand your knowledge beyond the course materials by reading 'Spark: The Definitive Guide' to gain a deeper understanding of Apache Spark's architecture and advanced concepts.
  • Obtain a copy of 'Spark: The Definitive Guide'.
  • Allocate time to read and study the book.
  • Take notes or create summaries as you read.
  • Discuss your findings with classmates or the instructor to enhance your understanding.
Tutorial: Working with DataFrames in Spark
Reinforce your understanding of distributed computing concepts by following tutorials on working with DataFrames in Apache Spark.
  • Locate a tutorial on working with DataFrames in Spark.
  • Follow the tutorial step-by-step, completing any exercises or examples provided.
  • Apply your newfound knowledge to your own projects or coursework.
Organize Course Notes and Assignments
Improve your retention and understanding by organizing and reviewing your course materials regularly.
  • Create a system for organizing your notes, assignments, and quizzes.
  • Regularly review and summarize your notes to reinforce your understanding.
Create a Data Pipeline Using Spark SQL
Solidify your grasp of data pipelines by completing practice exercises and drills on creating pipelines using Spark SQL.
  • Find online practice exercises or coding challenges on creating data pipelines using Spark SQL.
  • Solve the exercises and challenges, focusing on applying your knowledge of Spark SQL syntax and concepts.
Create a Data Analysis Dashboard
Showcase your understanding of data analysis by creating a data visualization dashboard using Spark to present insights from a dataset of your choice.
  • Choose a dataset and define the problem or question you want to address.
  • Use Spark to analyze the dataset and extract insights.
  • Design and create a data visualization dashboard using a tool of your choice.
  • Present your findings and insights using your dashboard.
Develop a Data Lake Prototype
Apply your skills by creating a prototype of a data lake, utilizing Spark and Delta Lake, to enhance your practical experience.
  • Define the scope and purpose of your data lake prototype.
  • Design the architecture and infrastructure of your prototype.
  • Implement the prototype using Spark and Delta Lake.
  • Test and evaluate the performance of your prototype.
Mentor Junior Data Engineers
Reinforce your knowledge by mentoring junior data engineers, providing guidance and support on Apache Spark concepts and projects.
  • Identify opportunities to mentor junior data engineers through online forums, meetups, or other platforms.
  • Provide guidance on Spark architecture, programming techniques, and best practices.
  • Review their code, offer feedback, and suggest improvements.
Contribute to a Spark Open-Source Project
Gain practical experience in the Apache Spark community by contributing to an open-source project.
  • Identify an open-source project related to Spark that aligns with your interests.
  • Review the project's documentation and codebase.
  • Identify an area where you can contribute and propose a solution.
  • Implement and test your contribution.

Career center

Learners who complete Distributed Computing with Spark SQL will develop knowledge and skills that may be useful to these careers:
Data Engineer
As a Data Engineer, you'll design, build, and maintain data pipelines to move data between systems. This course will help build a foundation in Apache Spark, a popular open-source tool for distributed computing and data analysis. You'll learn how to use Spark to read, write, and transform data, and how to optimize Spark queries for performance.
Data Analyst
Data Analysts use data to solve business problems. They collect, clean, and analyze data to identify trends and patterns. This course will help you develop the skills you need to be a successful Data Analyst. You'll learn how to use Spark to analyze large datasets, and how to use SQL to query data and generate reports.
Data Scientist
Data Scientists use data to build predictive models. They use these models to make informed decisions about business problems. This course will help you develop the skills you need to be a successful Data Scientist. You'll learn how to use Spark to build and train predictive models, and how to use SQL to query data and generate reports.
Machine Learning Engineer
Machine Learning Engineers build and deploy machine learning models. They use these models to automate tasks and improve business outcomes. This course will help you develop the skills you need to be a successful Machine Learning Engineer. You'll learn how to use Spark to build and train machine learning models, and how to use SQL to query data and generate reports.
Software Engineer
Software Engineers design, develop, and maintain software applications. They use a variety of programming languages and technologies to create software that meets the needs of users. This course will help you develop the skills you need to be a successful Software Engineer. You'll learn how to use Spark to develop distributed computing applications, and how to use SQL to query data and generate reports.
Database Administrator
Database Administrators maintain and optimize databases. They ensure that databases are running smoothly and that data is safe and secure. This course will help you develop the skills you need to be a successful Database Administrator. You'll learn how to use Spark to manage and optimize databases, and how to use SQL to query data and generate reports.
Business Analyst
Business Analysts use data to solve business problems. They help businesses understand their customers, identify opportunities, and make better decisions. This course will help you develop the skills you need to be a successful Business Analyst. You'll learn how to use Spark to analyze data and generate reports, and how to use SQL to query data.
Project Manager
Project Managers plan, execute, and close projects. They ensure that projects are completed on time, within budget, and to the required quality standards. This course may be useful for Project Managers who want to learn more about data analysis and how to use Spark to manage and analyze project data.
Product Manager
Product Managers develop and manage products. They work with engineers, designers, and marketers to create products that meet the needs of users. This course may be useful for Product Managers who want to learn more about data analysis and how to use Spark to analyze product data.
Marketing Manager
Marketing Managers develop and execute marketing campaigns. They work with a variety of stakeholders to create marketing campaigns that reach the target audience and achieve the desired results. This course may be useful for Marketing Managers who want to learn more about data analysis and how to use Spark to analyze marketing data.
Sales Manager
Sales Managers lead and motivate sales teams. They work with sales representatives to develop and execute sales strategies. This course may be useful for Sales Managers who want to learn more about data analysis and how to use Spark to analyze sales data.
Customer Success Manager
Customer Success Managers help customers achieve success with a company's products or services. They work with customers to identify their needs and develop solutions that meet those needs. This course may be useful for Customer Success Managers who want to learn more about data analysis and how to use Spark to analyze customer data.
Financial Analyst
Financial Analysts analyze financial data to make investment recommendations. They work with a variety of stakeholders to create financial models that help investors make informed decisions. This course may be useful for Financial Analysts who want to learn more about data analysis and how to use Spark to analyze financial data.
Operations Manager
Operations Managers oversee the day-to-day operations of a business. They work with a variety of stakeholders to ensure that the business runs smoothly and efficiently. This course may be useful for Operations Managers who want to learn more about data analysis and how to use Spark to analyze operational data.
Human Resources Manager
Human Resources Managers oversee the human resources department of a company. They work with a variety of stakeholders to develop and implement human resources policies and procedures. This course may be useful for Human Resources Managers who want to learn more about data analysis and how to use Spark to analyze human resources data.

Reading list

We've selected eight books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Distributed Computing with Spark SQL.
Provides a comprehensive overview of the principles and patterns for designing data-intensive applications. It is a valuable resource for students who want to learn how to design and build scalable, reliable, and efficient data systems.
Provides a comprehensive overview of Apache Spark, including its architecture, programming model, and ecosystem. It is a valuable resource for students who want to gain a deeper understanding of Spark and its applications.
Provides a comprehensive overview of using Apache Spark with Python. It is a valuable resource for students who want to learn how to use Spark with Python to build and manage big data applications.
Provides a comprehensive overview of Apache Hadoop, an open-source distributed computing framework. It is a valuable resource for students who want to learn how to use Hadoop to build and manage big data applications.
Provides a comprehensive overview of using Apache Spark CV for computer vision. It is a valuable resource for students who want to learn how to use Spark CV to build and manage computer vision models.
Provides a comprehensive overview of Apache Kafka, an open-source distributed streaming platform. It is a valuable resource for students who want to learn how to use Kafka to build and manage real-time data pipelines.
Provides a collection of recipes for using Apache Spark SQL to solve common data analytics problems. It is a valuable resource for students who want to learn how to use Spark SQL effectively.
Provides a comprehensive introduction to big data analytics with Apache Spark. It covers topics such as data engineering, machine learning, and data visualization.

Similar courses

Here are nine courses similar to Distributed Computing with Spark SQL.
Data Engineering Essentials using SQL, Python, and PySpark
Apache Spark for Data Engineering and Machine Learning
Data Engineering using Databricks on AWS and Azure
Building Your First Data Lakehouse Using Azure Synapse...
Getting Started with Spark 2
Building Machine Learning Models in Spark 2
Data Engineering with Databricks
Handling Streaming Data with Azure Databricks Using Spark...
Building Batch Data Pipelines on Google Cloud

© 2016 - 2024 OpenCourser