Sundog Education by Frank Kane, Frank Kane, and Sundog Education Team

New. Updated for Spark 3, more hands-on exercises, and a stronger focus on DataFrames and Structured Streaming.

"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark, and specifically PySpark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think.

In this course, you'll learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services. You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

  • Learn the concepts of Spark's DataFrames and Resilient Distributed Datasets

  • Develop and run Spark jobs quickly using Python and pyspark

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes. 

This course uses the familiar Python programming language; if you'd rather use Scala to get the best performance out of Spark, see my "Apache Spark with Scala - Hands On with Big Data" course instead.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you like in the process. We'll analyze a social graph of superheroes, and learn who the most "popular" superhero is – and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to The Incredible Hulk? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Wrangling big data with Apache Spark is an important skill in today's technical world. Enroll now.

  • " I studied "Taming Big Data with Apache Spark and Python" with Frank Kane, and helped me build a great platform for Big Data as a Service for my company. I recommend the course.   " - Cleuton Sampaio De Melo Jr.

What's inside

Syllabus

Set up a working development environment for Spark with Python on your desktop.

Meet your instructor, and we'll review what this course will cover and what you need to get started.

How to find the scripts and data associated with the lectures in this course.

We'll install Anaconda, a JDK, and Apache Spark on your Windows system. When we're done, we'll run a simple little Spark script on your desktop to test it out!
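
Once those pieces are in place, a tiny script along the lines of this sketch can confirm the setup works (the app name and local master setting are just illustrative; any working PySpark installation should run it):

```python
# smoke_test.py -- a minimal sketch to verify a local Spark installation
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SmokeTest")
sc = SparkContext(conf=conf)

print("Spark version:", sc.version)
# Run a trivial distributed computation end to end
print("Sum of 1..100:", sc.parallelize(range(1, 101)).sum())

sc.stop()
```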

Before we can analyze data with Spark, we need some data to analyze! Let's install the MovieLens dataset of movie ratings, which we'll use throughout the course.

We'll run a simple Spark script using Python, and analyze the 100,000 movie ratings you installed in the previous lecture. What is the breakdown of the rating scores in this data set? You'll find it's easy to find out!
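
For a sense of what that first script looks like, here's a sketch in the same spirit; it assumes the tab-separated u.data layout from MovieLens 100K (userID, movieID, rating, timestamp), with the path adjusted to wherever you placed the data:

```python
# ratings_histogram.py -- sketch of a rating-score breakdown (path illustrative)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

lines = sc.textFile("ml-100k/u.data")
ratings = lines.map(lambda line: line.split("\t")[2])  # extract the rating field
result = ratings.countByValue()                        # action: {rating: count}

for rating, count in sorted(result.items()):
    print(rating, count)

sc.stop()
```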

Apache Spark 3 was released in early 2020 - here's what's new, what's improved, and what's deprecated.

This high-level introduction will help you understand what Spark is for, who's using it, and why it's such a big deal.

Understand the core object of Spark: the Resilient Distributed Dataset (RDD), and how you can use Spark to transform and perform actions upon RDDs.
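
The key idea is that transformations are lazy, and only an action triggers computation, as this minimal sketch shows:

```python
# Transformations build a lineage; nothing executes until an action runs
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RDDBasics")
sc = SparkContext(conf=conf)

nums = sc.parallelize([1, 2, 3, 4])            # RDD from a Python collection
squares = nums.map(lambda x: x * x)            # transformation: lazy
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still lazy

print(evens.collect())                         # action: computes -> [4, 16]

sc.stop()
```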

We'll dissect our original ratings histogram Spark example, and understand exactly how every line of it works!

You'll learn how to use key/value pairs in RDDs, and special operations you can perform on them. To make it real, we'll introduce a new example: computing the average number of friends by age using a fake social network data set.
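
The averaging pattern looks roughly like this sketch, assuming a headerless CSV with columns id, name, age, numFriends:

```python
# friends_by_age.py -- sketch of per-key averages with reduceByKey()
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("FriendsByAge")
sc = SparkContext(conf=conf)

def parse_line(line):
    fields = line.split(",")
    return int(fields[2]), int(fields[3])      # (age, numFriends)

rdd = sc.textFile("fakefriends.csv").map(parse_line)
# Pair each friend count with a 1, sum both, then divide for the average
totals = rdd.mapValues(lambda x: (x, 1)).reduceByKey(
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = totals.mapValues(lambda x: x[0] / x[1])

for age, avg in averages.collect():
    print(age, round(avg, 2))

sc.stop()
```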

We'll take another look at our "average number of friends by age" example script, actually run it, and examine the results.

Learn how the filter() operation works on RDDs, and apply this toward finding the minimum temperatures from a real-world weather data set.
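
In sketch form, assuming comma-separated records where the first field is a station ID, the third is an entry type such as TMIN, and the fourth is a temperature reading (the file name is a placeholder):

```python
# min_temperatures.py -- sketch of filter() on a key/value RDD
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("MinTemperatures")
sc = SparkContext(conf=conf)

def parse_line(line):
    fields = line.split(",")
    return fields[0], fields[2], float(fields[3])   # (station, type, temp)

parsed = sc.textFile("1800.csv").map(parse_line)
min_entries = parsed.filter(lambda x: x[1] == "TMIN")  # keep minimum readings
station_temps = min_entries.map(lambda x: (x[0], x[2]))
min_temps = station_temps.reduceByKey(min)             # coldest per station

for station, temp in min_temps.collect():
    print(station, temp)

sc.stop()
```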

We'll look at the minimum temperatures by location example as a whole, and actually run it! Then, you've got an activity: modify this script to find the maximum temperatures instead. This lecture reinforces using filters and key/value RDDs.

Check your results from writing a maximum-temperature Spark script against my own.

We'll do the standard "count the number of occurrences of each word in a book" exercise here, and review the differences between map() and flatMap() in the process.
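
The difference is easy to see in a tiny sketch: map() produces exactly one output per input, while flatMap() can produce many (or none) and flattens the results:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("FlatMapDemo")
sc = SparkContext(conf=conf)

lines = sc.parallelize(["the quick brown fox", "jumped over"])
print(lines.map(lambda l: l.split()).collect())
# -> [['the', 'quick', 'brown', 'fox'], ['jumped', 'over']]  one list per line
print(lines.flatMap(lambda l: l.split()).collect())
# -> ['the', 'quick', 'brown', 'fox', 'jumped', 'over']      flattened words

sc.stop()
```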

You'll learn how to use regular expressions in Python, and use them to improve the results of our word count script.

Finally, we'll learn how to implement countByValue() in a way that returns a new RDD, and sort that RDD to produce our final results for word frequency.
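
Putting the pieces together, here's a sketch of the sorted word count; the regex cleanup comes from the previous lecture, and book.txt is a placeholder path:

```python
# word_count_sorted.py -- countByValue() re-expressed as RDD transformations
from pyspark import SparkConf, SparkContext
import re

conf = SparkConf().setMaster("local").setAppName("WordCountSorted")
sc = SparkContext(conf=conf)

def normalize_words(text):
    return re.compile(r"\W+", re.UNICODE).split(text.lower())

words = sc.textFile("book.txt").flatMap(normalize_words)
# Same result as countByValue(), but it stays an RDD we can keep transforming
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# Flip to (count, word) so sortByKey() orders by frequency
for count, word in counts.map(lambda x: (x[1], x[0])).sortByKey().collect():
    if word:
        print(word, count)

sc.stop()
```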

Write your first Spark script on your own! I'll give you the strategy and tips you need to be successful. You're given a fake e-commerce data set, and your task is to find the total amount spent, broken down by customer ID.

Compare your code to my solution for finding the total spent by customer - and take on a new challenge! Modify your script to sort your final results by the amount spent, and find the biggest spender.

Compare your solution for sorting customers by total amount spent with mine, and check your results.

We'll cover the concepts of Spark SQL, DataFrames, and Datasets, and why they are so important in Spark 2.0 and above.

We'll dive into a real example, revisiting our fake social network data and analyzing it with DataFrames through a SparkSession object.
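
A sketch of that style of script, again assuming the headerless id,name,age,numFriends CSV from earlier:

```python
# spark_sql_friends.py -- DataFrames and SQL through a SparkSession
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

def mapper(line):
    fields = line.split(",")
    return Row(ID=int(fields[0]), name=fields[1],
               age=int(fields[2]), numFriends=int(fields[3]))

lines = spark.sparkContext.textFile("fakefriends.csv")
people = spark.createDataFrame(lines.map(mapper)).cache()

people.createOrReplaceTempView("people")             # expose as a SQL table
spark.sql("SELECT * FROM people WHERE age BETWEEN 13 AND 19").show()

people.groupBy("age").count().orderBy("age").show()  # or use the DataFrame API

spark.stop()
```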

Let's revisit our "most popular movie" example, and implement it using a DataFrame instead of RDDs. DataFrames are the preferred API in Spark 2.0+.

We'll write and run a simple script to find the most-rated movie in the MovieLens data set, which we'll build upon later.

You'll learn how to use "broadcast variables" in Spark to efficiently distribute large objects to every node your Spark program may be running on, and apply this to looking up movie names in our "most popular movie" script.
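
The pattern, sketched against the pipe-delimited u.item file that maps MovieLens movie IDs to titles (the paths and encoding here are assumptions based on the 100K data set):

```python
# popular_movies_named.py -- sketch of a broadcast variable for name lookups
from pyspark import SparkConf, SparkContext

def load_movie_names():
    names = {}
    with open("ml-100k/u.item", encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split("|")
            names[int(fields[0])] = fields[1]
    return names

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf=conf)

name_dict = sc.broadcast(load_movie_names())   # shipped to each node once

counts = (sc.textFile("ml-100k/u.data")
            .map(lambda line: (int(line.split("\t")[1]), 1))
            .reduceByKey(lambda a, b: a + b))
# Read the broadcast copy via .value inside the closure
named = counts.map(lambda mc: (name_dict.value[mc[0]], mc[1]))

for name, count in named.sortBy(lambda x: x[1], ascending=False).take(10):
    print(name, count)

sc.stop()
```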

We introduce the Marvel superhero social graph data set, and write a Spark job to find the superhero with the most co-occurrences with other heroes in comic books.

Review the source code of our script to discover the most popular superhero, run it, and reveal the answer!

We'll introduce the Breadth-First Search (BFS) algorithm, and how we can use it to discover degrees of separation between superheroes.

We'll learn how to turn breadth-first search into a Spark problem, and craft our strategy for writing the code. Along the way, we'll cover Spark accumulators and how we can use them to signal our driver script when it's done.
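
The accumulator mechanism itself is simple; this minimal sketch uses one as a hit counter that the driver reads after an action completes, the same signaling trick the BFS driver relies on:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("AccumulatorDemo")
sc = SparkContext(conf=conf)

hit_counter = sc.accumulator(0)   # executors may add; only the driver reads

def check(value):
    if value == 42:               # stand-in for "found the target hero"
        hit_counter.add(1)
    return value

sc.parallelize(range(100)).map(check).count()  # the action forces evaluation
print("Hits:", hit_counter.value)

sc.stop()
```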

We'll get our hands on the code to actually implement breadth-first search, and run it to discover the degrees of separation between any two superheroes!

Learn one technique for finding similar movies based on the MovieLens rating data, and how we can frame it as a Spark problem. We'll also introduce the importance of using cache() or persist() on RDDs that will have more than one action performed on them.
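
In miniature, the caching idea looks like this sketch (the squaring step stands in for an expensive computation):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("CacheDemo")
sc = SparkContext(conf=conf)

expensive = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
expensive.cache()                     # keep results in memory after first use

print("count:", expensive.count())    # first action computes and caches
print("max:", expensive.max())        # second action reuses the cached data

sc.stop()
```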

We'll review the code for finding similar movies in Spark with the MovieLens ratings data, run it on every available core of your desktop computer, and review the results.

Get your hands dirty! I'll give you some ideas on improving the quality of your similar movie results - go try some out, and mess around with our movie similarity code.

Learn how Amazon's Elastic MapReduce makes it easy to rent time on your very own Spark cluster, running on top of Hadoop YARN.

Learn how to set up your AWS account, create a key pair for logging into your Spark / Hadoop cluster, and set up PuTTY to connect to your instances from a Windows desktop.

We'll see what needs to be done to our Movie Similarities script in order to get it to run successfully with one million ratings, on a cluster, by introducing the partitionBy() function.
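
A toy sketch of the idea (the key scheme and partition count are arbitrary): partitioning a keyed RDD once means the expensive self-join that follows can reuse that partitioning instead of reshuffling.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PartitionDemo")
sc = SparkContext(conf=conf)

pairs = sc.parallelize(range(10_000)).map(lambda x: (x % 100, x))
partitioned = pairs.partitionBy(100)    # split the work into sane task sizes

joined = partitioned.join(partitioned)  # self-join reuses the partitioner
print("partitions:", joined.getNumPartitions(), "rows:", joined.count())

sc.stop()
```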

We'll study the code of our modified movie similarities script, and get it ready to run on a cluster.

We'll launch a Hadoop cluster with Spark using Amazon's Elastic MapReduce service, and kick off our script to produce similar movies to Star Wars given one million movie ratings.

We'll look at our similar-movies results from one million ratings, and discuss them.

We'll look at the Spark console UI and the information it offers to help understand how to diagnose problems and optimize your large Spark jobs.

I'll share some more troubleshooting tips when running Spark on a cluster, and talk about how to manage dependencies your code may have.

We'll briefly cover the capabilities of Spark's MLLib machine learning library, and how it can help you solve data mining, machine learning, and statistical problems you may encounter. We'll go into more depth on MLLib's Alternating Least Squares (ALS) recommendation engine, and how we can use it to produce movie recommendations with the MovieLens data set.
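
For a feel of the ALS API in its DataFrame form, here's a minimal sketch on a tiny in-memory ratings table (column names and values are purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSDemo").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", rank=8)
model = als.fit(ratings)
model.recommendForAllUsers(3).show(truncate=False)  # top 3 movies per user

spark.stop()
```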

We'll run MLLib's Alternating Least Squares recommender system on the MovieLens 100K dataset.

We'll finish running Alternating Least Squares recommendations on the MovieLens ratings data set using MLLib, and evaluate the results.

DataFrames are the preferred API for MLLib in Spark 2.0+. Let's look at an example of using linear regression with DataFrames.
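
A minimal sketch of that API, fitting a line to a tiny in-memory dataset (a real script would load a file instead):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("LinearRegressionDemo").getOrCreate()

# Three points on the line y = 2x - 1
data = spark.createDataFrame(
    [(1.0, Vectors.dense([1.0])),
     (3.0, Vectors.dense([2.0])),
     (5.0, Vectors.dense([3.0]))],
    ["label", "features"])

model = LinearRegression(maxIter=10, regParam=0.0).fit(data)
print("slope:", model.coefficients, "intercept:", model.intercept)

spark.stop()
```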

An overview of how Spark Streaming lets you process continual streams of input data and aggregate them over time, and how GraphX lets you compute properties of networks.

We'll run an example of Spark structured streaming in Python to keep track of status code counts in a directory that receives Apache access logs.
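
The shape of such a job, sketched with an assumed "logs" directory and a rough regular expression for pulling the status code out of an access-log line:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("StructuredStreaming").getOrCreate()

lines = spark.readStream.text("logs")   # watch a directory for new log files

# Extract the HTTP status code (assumes "... <status> <bytes>" at line end)
status = F.regexp_extract("value", r"\s(\d{3})\s\d+$", 1).alias("status")
counts = lines.select(status).groupBy("status").count()

query = (counts.writeStream.outputMode("complete")
               .format("console").start())
query.awaitTermination()
```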

GraphX isn't currently supported in Python, but you should at least know what it is.

Some suggested resources for learning more about Apache Spark, and data mining and machine learning in general.

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers.

  • Uses Apache Spark 3, which is a recent version, suggesting the course is up-to-date with current industry standards and practices in big data processing
  • Emphasizes hands-on exercises and real-world examples, allowing learners to immediately apply their knowledge and build a portfolio of practical projects using Spark and Python
  • Covers Spark SQL, Spark Streaming, and GraphX, which are essential tools for handling structured data, real-time data streams, and graph-based analysis in big data environments
  • Includes exercises using Amazon's Elastic MapReduce service, which allows learners to gain experience with cloud-based big data processing and scaling Spark applications
  • Requires installing Anaconda, a JDK, and Apache Spark on a Windows system, which may present a barrier to learners who do not have access to these resources or are not familiar with the installation process
  • Uses Python, but suggests that Scala offers better performance with Spark, which may lead some learners to feel they are not using the optimal language for the task

Reviews summary

Hands-on Spark with Python

According to learners, this course offers a strong practical introduction to Apache Spark using Python (PySpark), emphasizing hands-on exercises and real-world examples like analyzing movie data and social graphs. Many students praise the clear explanations and the instructor's ability to demystify complex topics. The course structure, which moves from RDDs to DataFrames and Spark SQL, is seen as well-paced and logical. While some mention initial setup difficulties or that the course might be challenging for complete programming novices, the emphasis on practical application and the inclusion of updated content for Spark 3 are frequently highlighted as major strengths.
Suitable for those with some coding background.
"This course is excellent for anyone with basic Python who wants to get into Spark. It builds knowledge gradually."
"If you have some programming experience, especially with Python, you will find this course very accessible."
"Might be a bit fast-paced if you are completely new to programming and big data concepts simultaneously."
"Provides a solid foundation for users with some prior technical experience."
Content is updated and relevant (Spark 3).
"I appreciate that the course is updated for Spark 3, keeping the content current with the latest versions."
"The updates to include DataFrames and Structured Streaming with Spark 3 are very valuable."
"Good to see the course is maintained and updated to reflect newer versions of Spark."
"The course content feels relevant and uses modern Spark features like DataFrames effectively."
Code examples are useful and well-explained.
"The code examples provided are very helpful and directly applicable to the concepts being taught."
"I liked that the instructor walked through the code line by line, making it easy to understand how it works."
"The scripts are well-structured and serve as a great starting point for building my own Spark applications."
"The exercises and code examples are well-designed and reinforce learning effectively."
Experienced instructor is knowledgeable.
"The instructor is knowledgeable and presents the material in an engaging way. You can tell he has real-world experience."
"Frank Kane is an excellent instructor with deep understanding of the subject matter."
"The instructor's background at Amazon adds credibility and valuable insights to the lectures."
"I really like the instructor's teaching style and his ability to anticipate common issues."
Concepts are explained clearly and simply.
"Frank has a gift for explaining complicated concepts in a simple and understandable way. I never felt lost."
"The instructor's explanations are very clear, breaking down complex Spark ideas into manageable pieces."
"I found the way RDDs and DataFrames were explained to be particularly clear and easy to follow."
"His explanation of RDD concepts is awesome... easy to understand the core concept."
Focus on practical application and coding.
"The hands-on coding and projects are the strongest part of the course for me. I really enjoyed working through the examples."
"I appreciate the emphasis on practical exercises; it helped solidify the concepts much better than just theoretical lectures."
"This course is incredibly hands-on, giving me practical code examples to work through right from the start."
"It provided me with practical tools and strategies I could immediately apply to big data problems."
Initial setup environment can be tricky.
"Getting the local environment set up correctly with Spark and Python was a bit of a hurdle at the beginning."
"I struggled a bit with the initial setup instructions, had to troubleshoot for a while to get everything running."
"While the content is great, the setup process could be smoother. There were some version conflicts I had to resolve."
"Initial setup requires attention to detail, but the provided updates were helpful in overcoming issues."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Taming Big Data with Apache Spark and Python - Hands On! with these activities:
Review Python Fundamentals
Solidify your understanding of Python fundamentals to ensure a smooth transition into PySpark.
  • Review Python syntax and data structures.
  • Practice writing basic Python functions.
  • Work through online Python tutorials.
Review 'Spark: The Definitive Guide'
Deepen your understanding of Spark concepts and architecture with a comprehensive guide.
  • Read the chapters relevant to the course syllabus.
  • Experiment with the code examples provided in the book.
PySpark DataFrame Exercises
Reinforce your understanding of PySpark DataFrames through targeted practice exercises.
  • Find online resources with PySpark DataFrame exercises.
  • Complete exercises on data filtering and aggregation.
  • Practice writing SQL-style queries using DataFrames.
Blog Post: Spark Optimization Tips
Solidify your knowledge by writing a blog post summarizing Spark optimization techniques.
  • Research common Spark optimization strategies.
  • Write a clear and concise blog post explaining these techniques.
  • Include code examples and practical tips.
Analyze Public Datasets with Spark
Apply your Spark skills to a real-world problem by analyzing a public dataset.
  • Choose a public dataset relevant to your interests.
  • Use Spark to clean, transform, and analyze the data.
  • Visualize your findings and present your results.
Review 'High Performance Spark'
Explore advanced techniques for optimizing Spark performance and scalability.
  • Read the chapters on data partitioning and serialization.
  • Experiment with different optimization techniques on your Spark projects.
Contribute to a Spark Project
Deepen your understanding of Spark by contributing to an open-source project.
  • Identify a Spark-related open-source project on GitHub.
  • Find an issue to work on or propose a new feature.
  • Submit a pull request with your changes.

Career center

Learners who complete Taming Big Data with Apache Spark and Python - Hands On! will develop knowledge and skills that may be useful to these careers:
Data Engineer
A Data Engineer designs, builds, and maintains data pipelines and infrastructure. This course helps build a foundation for developing expertise in big data technologies like Apache Spark and Python, which are essential tools for a successful data engineer. The course covers Spark's DataFrames, Resilient Distributed Datasets, and how to scale up to larger datasets using Amazon's Elastic MapReduce service. Someone wishing to become a data engineer should especially note that this course focuses on structuring data analysis problems as Spark problems, and running them on cloud computing services.
Machine Learning Engineer
A Machine Learning Engineer focuses on developing and deploying machine learning models at scale. The course provides a practical grounding in using Apache Spark with Python, a common combination for large-scale machine learning tasks. As a machine learning engineer, you will find that this course covers Spark's MLLib, which is useful for implementing machine learning algorithms, and provides hands-on experience with real-world examples. Aspiring machine learning engineers will find this course useful for learning how to scale their models using Spark, a must in today's machine learning landscape.
Data Scientist
A Data Scientist analyzes and interprets complex data sets to uncover insights and solve business challenges. This course may be useful as it provides an introduction to Apache Spark and Python, powerful tools for handling large datasets. Using this course, a data scientist can learn how to frame analysis problems as Spark problems, use Spark SQL, and even apply machine learning algorithms with Spark's MLLib. This course gives hands-on experience in setting up a development environment and running Spark jobs, which is a key skill for aspiring data scientists.
Big Data Architect
A Big Data Architect designs and oversees the implementation of big data solutions for organizations. This course builds a foundation by teaching the core concepts of Apache Spark and how it integrates with Hadoop YARN. The course touches on scaling Spark across computing clusters using Amazon's Elastic MapReduce service. Big data architects will benefit from learning how to structure complex analysis problems as Spark problems and gaining hands-on experience with setting up and running Spark jobs on real-world data.
Cloud Solutions Architect
A Cloud Solutions Architect designs and implements cloud-based solutions for businesses. This course covers how to scale up data analysis using Amazon's Elastic MapReduce service. The course gives hands-on knowledge of setting up an AWS account and launching a Hadoop cluster with Spark, which are valuable skills for a cloud solutions architect. As a Cloud Solutions Architect, you will find that understanding how to optimize Spark jobs in the cloud can lead to better performance and cost efficiency for your company.
Data Analyst
A Data Analyst collects, cleans, and analyzes data to identify trends and insights. This course may be useful in teaching how to use Apache Spark and Python to handle and process large datasets. Using this course, a data analyst can learn to translate analysis problems into Spark scripts and gain hands-on experience with analyzing movie ratings and social network data. This course also covers Spark SQL and DataFrames, which can be useful when working with structured data.
Business Intelligence Analyst
A Business Intelligence Analyst uses data to help organizations make better business decisions. The course teaches skills valuable for processing and analyzing large datasets using Apache Spark and Python. As a business intelligence analyst, you can use this course to learn how to use Spark SQL to query and analyze data, translate complex analysis problems into Spark scripts, and use Spark Streaming to process real-time data. This course will provide you with practical experience in using Spark to extract insights from data.
Research Scientist
A Research Scientist conducts research and experiments to advance scientific knowledge. This course may be useful as it helps one learn how to use Apache Spark and Python to process and analyze large datasets. Research scientists can use the skills learned in this course to scale their data analysis pipelines and apply machine learning algorithms with Spark's MLLib. The course provides hands-on experience in framing analysis problems as Spark problems and running them on cloud computing services.
Statistician
A Statistician collects, analyzes, and interprets numerical data to identify significant trends, patterns, and relationships. The course may be useful as it covers the use of Apache Spark and Python for handling large datasets. As a statistician, the knowledge of Spark's MLLib can be useful for applying statistical models and machine learning algorithms to big data. The course provides hands-on experience in setting up a development environment and running Spark jobs, which is a key asset when working with large datasets.
Software Engineer
A Software Engineer designs, develops, and tests software applications. This course may be useful if the software engineer needs to work with big data technologies or integrate Spark into their applications. As a software engineer, this course may provide you with the skills to develop Spark jobs using Python and scale them up using Amazon's Elastic MapReduce service. It also covers Spark SQL and DataFrames, which are important for handling structured data.
Database Administrator
A Database Administrator is responsible for managing and maintaining databases. This course may be useful as it relates to understanding how big data technologies like Apache Spark interact with databases. Database administrators can use this course to gain insight into how Spark SQL can be used for querying and analyzing data stored in databases. This course provides an overview of Spark's capabilities in handling structured data and integrating with other data processing technologies.
Quantitative Analyst
A Quantitative Analyst develops and implements mathematical or statistical models for financial analysis and risk management. Data analysis concepts, and using Spark to scale data-driven analytics, are useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on finance, which is a Quantitative Analyst's main domain.
Bioinformatician
A Bioinformatician analyzes biological data using computational tools and techniques. Data analysis concepts, and using Spark to scale data-driven analytics, are useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the biological domain.
Econometrician
An Econometrician applies statistical methods to economic data to test and quantify economic theories and models. Data analysis concepts, and using Spark to scale data-driven analytics, are useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the economic domain.
Financial Analyst
A Financial Analyst provides guidance to businesses and individuals making investment decisions. This course may be useful as it teaches data analysis and how to use Spark to scale data-driven analytics. It may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the finance domain.

Reading list

We've selected two books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Taming Big Data with Apache Spark and Python - Hands On!.
Spark: The Definitive Guide provides a comprehensive overview of Apache Spark, covering everything from basic concepts to advanced techniques. It serves as an excellent reference for understanding Spark's architecture, data processing capabilities, and various APIs. It is particularly useful for gaining a deeper understanding of Spark's internals and optimization strategies. This book is commonly used as a textbook at academic institutions and by industry professionals.
High Performance Spark delves into the performance tuning aspects of Apache Spark. It covers topics such as data partitioning, serialization, and memory management. It is more valuable as additional reading than as a current reference. This book is helpful in providing background knowledge on how to optimize Spark applications for speed and efficiency. It is commonly used by industry professionals.
