Distributed Computing with Spark SQL from Coursera

This course is all about big data. It’s for students with SQL experience who want to take the next step on their data journey by learning distributed computing with Apache Spark. Students will gain a thorough understanding of this open-source standard for working with large datasets, along with the fundamentals of data analysis using SQL on Spark, which sets the foundation for combining data with advanced analytics at scale and in production environments. The four modules build on one another, and by the end of the course you will understand the Spark architecture, queries within Spark, common ways to optimize Spark SQL, and how to build reliable data pipelines.

The first module introduces Spark and the Databricks environment, including how Spark distributes computation, and Spark SQL. Module 2 covers core Spark concepts such as storage vs. compute, caching, partitions, and troubleshooting performance issues via the Spark UI. It also covers new features in Apache Spark 3.x such as Adaptive Query Execution. The third module focuses on engineering data pipelines, including connecting to databases, schemas and data types, file formats, and writing reliable data. The final module covers data lakes, data warehouses, and lakehouses. Students build production-grade data pipelines by combining Spark with the open-source project Delta Lake. By the end of the course, students will have honed their SQL and distributed computing skills, becoming more adept at advanced analysis and setting the stage for a transition to more advanced analytics as Data Scientists.

What's inside

Syllabus

Introduction to Spark

In this module, you will learn to discuss the core concepts of distributed computing and to recognize when and where to apply them. You'll be able to identify the basic data structure of Apache Spark™, known as a DataFrame. Additionally, you will use the collaborative Databricks workspace and write SQL code that executes against a cluster of machines.
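To make the idea of distributing a computation concrete, here is a minimal plain-Python sketch of how a grouped count can be split across partitions and then merged. It is not Spark code and does not use any Spark API: the `calls` rows, the helper names, and the round-robin partitioning are all assumptions made for illustration, and the partitions are processed one after another here rather than in parallel on a cluster. Conceptually it mirrors what a query like `SELECT call_type, COUNT(*) FROM calls GROUP BY call_type` (a hypothetical table) asks for.

```python
from collections import Counter
from functools import reduce

def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin across partitions (hypothetical helper)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def count_by_key(partition):
    """Compute a partial count per key using only the rows in one partition."""
    return Counter(key for key, _ in partition)

def merge_counts(left, right):
    """Combine two partial results into one."""
    return left + right

# Made-up sample data: (call_type, call_id) pairs.
rows = [
    ("fire", 1), ("medical", 2), ("fire", 3),
    ("alarm", 4), ("medical", 5), ("fire", 6),
]

# Step 1: split the data so each worker would only see its own slice.
partitions = split_into_partitions(rows, 3)

# Step 2: each partition is aggregated independently (sequentially in this sketch).
partial = [count_by_key(p) for p in partitions]

# Step 3: the partial results are merged into the final answer.
total = reduce(merge_counts, partial, Counter())
print(dict(total))
```

The point of the design is that step 2 needs no communication between partitions, so only the small partial results have to be brought together in step 3.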

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers

Teaches distributed computing using Apache Spark, which is highly relevant in industry

Focuses on engineering data pipelines, which is a core skill for data scientists

Builds a strong foundation for learners with SQL experience

Taught by Brooke Wenig and Conor Murphy, who are recognized for their work in big data

Provides hands-on labs and interactive materials

Requires learners to have SQL experience, which may limit accessibility

Reviews summary

Spark SQL and distributed computing

Learners' reviews are largely positive and describe the course as a great introduction to distributed computing with Spark SQL. Using hands-on assignments in Databricks, you'll learn key concepts, including partitions, machine learning queries, and Delta Lake. While some learners enjoyed the experience, others found the material superficial and lacking depth. They also mentioned that the course focused more on Databricks than on Spark SQL and distributed computing.

The hands-on assignments in Databricks were well-received.

"By checking discussion forum, I can see that both instructors check and provide helps to people who posted the questions."

"I highly recommend this course."

"I got started with course and learnt basic concepts, dos and don'ts."

For beginners, learners found the course to be a well-structured and engaging introduction.

"It's worth reviewing the materials multiple times even after getting the certificate."

"Highly recommended."

"The videos are informative and easy to follow, and they did a good job."

Learners felt that the course lacked depth and practical application.

"I believe the course is more focused on the use of a specific application."

"Some of the videos were lengthy and lacked sufficient illustrative tools."

"This courses is focus in data engineering instead of data science."

The course emphasizes Databricks and its features, which some learners found helpful and others found overwhelming.

"The course assumed that the learner knows a lot of things about apache spark beforehand."

"It did not explain concepts very deeply, given very brief overview in a very haste manner."

"There should have been given some solid understanding of concepts."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Distributed Computing with Spark SQL with these activities:

Read 'Spark: The Definitive Guide'

Expand your knowledge beyond the course materials by reading 'Spark: The Definitive Guide' to gain a deeper understanding of Apache Spark's architecture and advanced concepts.

Obtain a copy of 'Spark: The Definitive Guide'.
Allocate time to read and study the book.
Take notes or create summaries as you read.
Discuss your findings with classmates or the instructor to enhance your understanding.

Tutorial: Working with DataFrames in Spark

Reinforce your understanding of distributed computing concepts by following tutorials on working with DataFrames in Apache Spark.

Locate a tutorial on working with DataFrames in Spark.
Follow the tutorial step-by-step, completing any exercises or examples provided.
Apply your newfound knowledge to your own projects or coursework.

Organize Course Notes and Assignments

Improve your retention and understanding by organizing and reviewing your course materials regularly.

Create a system for organizing your notes, assignments, and quizzes.
Regularly review and summarize your notes to reinforce your understanding.

Create a Data Pipeline Using Spark SQL

Solidify your grasp of data pipelines by completing practice exercises and drills on creating pipelines using Spark SQL.

Find online practice exercises or coding challenges on creating data pipelines using Spark SQL.
Solve the exercises and challenges, focusing on applying your knowledge of Spark SQL syntax and concepts.

Create a Data Analysis Dashboard

Showcase your understanding of data analysis by creating a data visualization dashboard using Spark to present insights from a dataset of your choice.

Choose a dataset and define the problem or question you want to address.
Use Spark to analyze the dataset and extract insights.
Design and create a data visualization dashboard using a tool of your choice.
Present your findings and insights using your dashboard.

Develop a Data Lake Prototype

Apply your skills by creating a prototype of a data lake, utilizing Spark and Delta Lake, to enhance your practical experience.

Define the scope and purpose of your data lake prototype.
Design the architecture and infrastructure of your prototype.
Implement the prototype using Spark and Delta Lake.
Test and evaluate the performance of your prototype.

Mentor Junior Data Engineers

Reinforce your knowledge by mentoring junior data engineers, providing guidance and support on Apache Spark concepts and projects.

Identify opportunities to mentor junior data engineers through online forums, meetups, or other platforms.
Provide guidance on Spark architecture, programming techniques, and best practices.
Review their code, offer feedback, and suggest improvements.

Contribute to a Spark Open-Source Project

Gain practical experience in the Apache Spark community by contributing to an open-source project.

Identify an open-source project related to Spark that aligns with your interests.
Review the project's documentation and codebase.
Identify an area where you can contribute and propose a solution.
Implement and test your contribution.

Career center

Learners who complete Distributed Computing with Spark SQL will develop knowledge and skills that may be useful to these careers:

Data Engineer

As a Data Engineer, you'll design, build, and maintain data pipelines to move data between systems. This course will help build a foundation in Apache Spark, a popular open-source tool for distributed computing and data analysis. You'll learn how to use Spark to read, write, and transform data, and how to optimize Spark queries for performance.

Data Analyst

Data Analysts use data to solve business problems. They collect, clean, and analyze data to identify trends and patterns. This course will help you develop the skills you need to be a successful Data Analyst. You'll learn how to use Spark to analyze large datasets, and how to use SQL to query data and generate reports.

Data Scientist

Data Scientists use data to build predictive models. They use these models to make informed decisions about business problems. This course will help you develop the skills you need to be a successful Data Scientist. You'll learn how to use Spark to build and train predictive models, and how to use SQL to query data and generate reports.

Machine Learning Engineer

Machine Learning Engineers build and deploy machine learning models. They use these models to automate tasks and improve business outcomes. This course will help you develop the skills you need to be a successful Machine Learning Engineer. You'll learn how to use Spark to build and train machine learning models, and how to use SQL to query data and generate reports.

Software Engineer

Software Engineers design, develop, and maintain software applications. They use a variety of programming languages and technologies to create software that meets the needs of users. This course will help you develop the skills you need to be a successful Software Engineer. You'll learn how to use Spark to develop distributed computing applications, and how to use SQL to query data and generate reports.

Database Administrator

Database Administrators maintain and optimize databases. They ensure that databases are running smoothly and that data is safe and secure. This course will help you develop the skills you need to be a successful Database Administrator. You'll learn how to use Spark to manage and optimize databases, and how to use SQL to query data and generate reports.

Business Analyst

Business Analysts use data to solve business problems. They help businesses understand their customers, identify opportunities, and make better decisions. This course will help you develop the skills you need to be a successful Business Analyst. You'll learn how to use Spark to analyze data and generate reports, and how to use SQL to query data.

Project Manager

Project Managers plan, execute, and close projects. They ensure that projects are completed on time, within budget, and to the required quality standards. This course may be useful for Project Managers who want to learn more about data analysis and how to use Spark to manage and analyze project data.

Product Manager

Product Managers develop and manage products. They work with engineers, designers, and marketers to create products that meet the needs of users. This course may be useful for Product Managers who want to learn more about data analysis and how to use Spark to analyze product data.

Marketing Manager

Marketing Managers develop and execute marketing campaigns. They work with a variety of stakeholders to create marketing campaigns that reach the target audience and achieve the desired results. This course may be useful for Marketing Managers who want to learn more about data analysis and how to use Spark to analyze marketing data.

Sales Manager

Sales Managers lead and motivate sales teams. They work with sales representatives to develop and execute sales strategies. This course may be useful for Sales Managers who want to learn more about data analysis and how to use Spark to analyze sales data.

Customer Success Manager

Customer Success Managers help customers achieve success with a company's products or services. They work with customers to identify their needs and develop solutions that meet those needs. This course may be useful for Customer Success Managers who want to learn more about data analysis and how to use Spark to analyze customer data.

Financial Analyst

Financial Analysts analyze financial data to make investment recommendations. They work with a variety of stakeholders to create financial models that help investors make informed decisions. This course may be useful for Financial Analysts who want to learn more about data analysis and how to use Spark to analyze financial data.

Operations Manager

Operations Managers oversee the day-to-day operations of a business. They work with a variety of stakeholders to ensure that the business runs smoothly and efficiently. This course may be useful for Operations Managers who want to learn more about data analysis and how to use Spark to analyze operational data.

Human Resources Manager

Human Resources Managers oversee the human resources department of a company. They work with a variety of stakeholders to develop and implement human resources policies and procedures. This course may be useful for Human Resources Managers who want to learn more about data analysis and how to use Spark to analyze human resources data.
