Taming Big Data with Apache Spark and Python

What's inside

Syllabus

Set up a working development environment for Spark with Python on your desktop.

Meet your instructor, and we'll review what this course will cover and what you need to get started.

How to find the scripts and data associated with the lectures in this course.

We'll install Anaconda, a JDK, and Apache Spark on your Windows system. When we're done, we'll run a simple little Spark script on your desktop to test it out!

Before we can analyze data with Spark, we need some data to analyze! Let's install the MovieLens dataset of movie ratings, which we'll use throughout the course.

We'll run a simple Spark script using Python, and analyze the 100,000 movie ratings you installed in the previous lecture. What is the breakdown of the rating scores in this data set? You'll find it's easy to find out!

Apache Spark 3 was released in early 2020 - here's what's new, what's improved, and what's deprecated.

This high-level introduction will help you understand what Spark is for, who's using it, and why it's such a big deal.

Understand the core object of Spark: the Resilient Distributed Dataset (RDD), and how you can use Spark to transform and perform actions upon RDD's.

We'll dissect our original ratings histogram Spark example, and understand exactly how every line of it works!

You'll learn how to use key/value pairs in RDD's, and special operations you can perform on them. To make it real, we'll introduce a new example: computing the average number of friends by age using a fake social network data set.

We'll take another look at our "average number of friends by age" example script, actually run it, and examine the results.

Learn how the filter() operation works on RDD's, and apply this toward finding the minimum temperatures from a real-world weather data set.

We'll look at the minimum temperatures by location example as a whole, and actually run it! Then, you've got an activity: modify this script to find the maximum temperatures instead. This lecture reinforces using filters and key/value RDD's.

Check your results for writing a maximum temperature Spark script to my own.

We'll do the standard "count the number of occurrences of each word in a book" exercise here, and review the differences between map() and flatmap() in the process.

You'll learn how to use regular expressions in Python, and use them to improve the results of our word count script.

Finally, we'll learn how to implement countByValue() in a way that returns a new RDD, and sort that RDD to produce our final results for word frequency.

Write your first Spark script on your own! I'll give you the strategy and tips you need to be successful. You're given a fake e-commerce data set, and your task is to find the total amount spent, broken down by customer ID.

Compare your code to my solution for finding the total spent by customer - and take on a new challenge! Modify your script to sort your final results by the amount spent, and find the biggest spender.

Compare your solution to sorting the customers by total amount ordered to mine, and check your results.

We'll cover the concepts of SparkSQL, DataFrames, and DataSets, and why they are so important in Spark 2.0 and above.

We'll dive into a real example, revisiting our fake social network data and analyzing it with DataFrames through a SparkSession object.

Let's revisit our "most popular movie" example, and implement it using a DataFrame instead of with RDD's. DataFrames are the preferred API in Spark 2.0+.

We'll write and run a simple script to find the most-rated movie in the MovieLens data set, which we'll build upon later.

You'll learn how to use "broadcast variables" in Spark to efficiently distribute large objects to every node your Spark program may be running on, and apply this to looking up movie names in our "most popular movie" script.

We introduce the Marvel superhero social graph data set, and write a Spark job to find the superhero with the most co-occurrences with other heroes in comic books.

Review the source code of our script to discover the most popular superhero, run it, and reveal the answer!

We'll introduce the Breadth-First Search (BFS) algorithm, and how we can use it to discover degrees of separation between superheroes.

We'll learn how to turn breadth-first search into a Spark problem, and craft our strategy for writing the code. Along the way, we'll cover Spark accumulators and how we can use them to signal our driver script when it's done.

We'll get our hands on the code to actually implement breadth-first search, and run it to discover the degrees of separation between any two superheroes!

Learn one technique for finding similar movies based on the MovieLens rating data, and how we can frame it as a Spark problem. We'll also introduce the importance of using cache() or persist() on rdd's that will have more than one action performed on them.

We'll review the code for finding similar movies in Spark with the MovieLens ratings data, run it on every available core of your desktop computer, and review the results.

Get your hands dirty! I'll give you some ideas on improving the quality of your similar movie results - go try some out, and mess around with our movie similarity code.

Learn how Amazon's Elastic MapReduce makes it easy to rent time on your very own Spark cluster, running on top of Hadoop YARN

Learn how to set up your AWS account, create a key pair for logging into your Spark / Hadoop cluster, and set up PuTTY to connect to your instances from a Windows desktop.

We'll see what needs to be done to our Movie Similarities script in order to get it to run successfully with one million ratings, on a cluster, by introducing the partitionBy() function.

We'll study the code of our modified movie similarities script, and get it ready to run on a cluster.

We'll launch a Hadoop cluster with Spark using Amazon's Elastic MapReduce service, and kick off our script to produce similar movies to Star Wars given one million movie ratings.

We'll look at our results from similar movies from one million ratings, and discuss them.

We'll look at the Spark console UI and the information it offers to help understand how to diagnose problems and optimize your large Spark jobs.

I'll share some more troubleshooting tips when running Spark on a cluster, and talk about how to manage dependencies your code may have.

We'll briefly cover the capabilities of Spark's MLLib machine learning library, and how it can help you solve data mining, machine learning, and statistical problems you may encounter. We'll go into more depth on MLLib's Alternating Least Squares (ALS) recommendation engine, and how we can use it to produce movie recommendations with the MovieLens data set.

We'll run MLLib's Alternating Least Squares recommender system on the MovieLens 100K dataset.

We'll finish running Alternating Least Squares recommendations on the MovieLens ratings data set using MLLib, and evaluate the results.

DataFrames are the preferred API for MLLib in Spark 2.0+. Let's look at an example of using linear regression with DataFrames.

An overview of how Spark Streaming lets you process continual streams on input data and aggregate it over time, and how GraphX lets you compute properties of networks.

We'll run an example of Spark structured streaming in Python to keep track of status code counts in a directory that receives Apache access logs.

GraphX isn't currently supported in Python, but you should at least know what it is.

Some suggested resources for learning more about Apache Spark, and data mining and machine learning in general.

Traffic lights

Read about what's good

what should give you pause

and possible dealbreakers

Uses Apache Spark 3, which is a recent version, suggesting the course is up-to-date with current industry standards and practices in big data processing

Emphasizes hands-on exercises and real-world examples, allowing learners to immediately apply their knowledge and build a portfolio of practical projects using Spark and Python

Covers Spark SQL, Spark Streaming, and GraphX, which are essential tools for handling structured data, real-time data streams, and graph-based analysis in big data environments

Includes exercises using Amazon's Elastic MapReduce service, which allows learners to gain experience with cloud-based big data processing and scaling Spark applications

Requires installing Anaconda, a JDK, and Apache Spark on a Windows system, which may present a barrier to learners who do not have access to these resources or are not familiar with the installation process

Uses Python, but suggests that Scala offers better performance with Spark, which may lead some learners to feel they are not using the optimal language for the task

Reviews summary

Hands-on spark with python

According to learners, this course offers a strong practical introduction to Apache Spark using Python (PySpark), emphasizing hands-on exercises and real-world examples like analyzing movie data and social graphs. Many students praise the clear explanations and the instructor's ability to demystify complex topics. The course structure, which moves from RDDs to DataFrames and Spark SQL, is seen as well-paced and logical. While some mention initial setup difficulties or that the course might be challenging for complete programming novices, the emphasis on practical application and the inclusion of updated content for Spark 3 are frequently highlighted as major strengths.

Suitable for those with some coding background.

"This course is excellent for anyone with basic Python who wants to get into Spark. It builds knowledge gradually."

"If you have some programming experience, especially with Python, you will find this course very accessible."

"Might be a bit fast-paced if you are completely new to programming and big data concepts simultaneously."

"Provides a solid foundation for users with some prior technical experience."

Content is updated and relevant (Spark 3).

"I appreciate that the course is updated for Spark 3, keeping the content current with the latest versions."

"The updates to include DataFrames and Structured Streaming with Spark 3 are very valuable."

"Good to see the course is maintained and updated to reflect newer versions of Spark."

"The course content feels relevant and uses modern Spark features like DataFrames effectively."

Code examples are useful and well-explained.

"The code examples provided are very helpful and directly applicable to the concepts being taught."

"I liked that the instructor walked through the code line by line, making it easy to understand how it works."

"The scripts are well-structured and serve as a great starting point for building my own Spark applications."

"The exercises and code examples are well-designed and reinforce learning effectively."

Experienced instructor is knowledgeable.

"The instructor is knowledgeable and presents the material in an engaging way. You can tell he has real-world experience."

"Frank Kane is an excellent instructor with deep understanding of the subject matter."

"The instructor's background at Amazon adds credibility and valuable insights to the lectures."

"I really like the instructor's teaching style and his ability to anticipate common issues."

Concepts are explained clearly and simply.

"Frank has a gift for explaining complicated concepts in a simple and understandable way. I never felt lost."

"The instructor's explanations are very clear, breaking down complex Spark ideas into manageable pieces."

"I found the way RDDs and DataFrames were explained to be particularly clear and easy to follow."

"His explanation of RDD concepts is awesome... easy to understand the core concept."

Focus on practical application and coding.

"The hands-on coding and projects are the strongest part of the course for me. I really enjoyed working through the examples."

"I appreciate the emphasis on practical exercises; it helped solidify the concepts much better than just theoretical lectures."

"This course is incredibly hands-on, giving me practical code examples to work through right from the start."

"It provided me with practical tools and strategies I could immediately apply to big data problems."

Initial setup environment can be tricky.

"Getting the local environment set up correctly with Spark and Python was a bit of a hurdle at the beginning."

"I struggled a bit with the initial setup instructions, had to troubleshoot for a while to get everything running."

"While the content is great, the setup process could be smoother. There were some version conflicts I had to resolve."

"Initial setup requires attention to detail, but the provided updates were helpful in overcoming issues."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Taming Big Data with Apache Spark and Python - Hands On! with these activities:

Review Python Fundamentals

Show steps

Solidify your understanding of Python fundamentals to ensure a smooth transition into PySpark.

Browse courses on Python Basics

Show steps

Review Python syntax and data structures.
Practice writing basic Python functions.
Work through online Python tutorials.

Review 'Spark: The Definitive Guide'

Show steps

Deepen your understanding of Spark concepts and architecture with a comprehensive guide.

View Spark: The Definitive Guide on Amazon

Show steps

Read the chapters relevant to the course syllabus.
Experiment with the code examples provided in the book.

PySpark DataFrame Exercises

Show steps

Reinforce your understanding of PySpark DataFrames through targeted practice exercises.

Show steps

Find online resources with PySpark DataFrame exercises.
Complete exercises on data filtering and aggregation.
Practice writing SQL-style queries using DataFrames.

Four other activities

Expand to see all activities and additional details

Show all seven activities

Blog Post: Spark Optimization Tips

Show steps

Solidify your knowledge by writing a blog post summarizing Spark optimization techniques.

Show steps

Research common Spark optimization strategies.
Write a clear and concise blog post explaining these techniques.
Include code examples and practical tips.

Analyze Public Datasets with Spark

Show steps

Apply your Spark skills to a real-world problem by analyzing a public dataset.

Show steps

Choose a public dataset relevant to your interests.
Use Spark to clean, transform, and analyze the data.
Visualize your findings and present your results.

Review 'High Performance Spark'

Show steps

Explore advanced techniques for optimizing Spark performance and scalability.

View High Performance Spark: Best Practices for... on Amazon

Show steps

Read the chapters on data partitioning and serialization.
Experiment with different optimization techniques on your Spark projects.

Contribute to a Spark Project

Show steps

Deepen your understanding of Spark by contributing to an open-source project.

Show steps

Identify a Spark-related open-source project on GitHub.
Find an issue to work on or propose a new feature.
Submit a pull request with your changes.

Career center

Learners who complete Taming Big Data with Apache Spark and Python - Hands On! will develop knowledge and skills that may be useful to these careers:

Data Engineer

A Data Engineer designs, builds, and maintains data pipelines and infrastructure. This course helps build a foundation for developing expertise in big data technologies like Apache Spark and Python, which are essential tools for a successful data engineer. The course covers Spark's DataFrames, Resilient Distributed Datasets, and how to scale up to larger datasets using Amazon's Elastic MapReduce service. Someone wishing to become a data engineer should especially note that this course focuses on structuring data analysis problems as Spark problems, and running them on cloud computing services.

See salaries and explore the career path for Data Engineer

Machine Learning Engineer

A Machine Learning Engineer focuses on developing and deploying machine learning models at scale. The course provides a practical grounding in using Apache Spark with Python, a common combination for large-scale machine learning tasks. As a machine learning engineer you will find that this course covers Spark's MLLib, which is useful for implementing machine learning algorithms, and provides hands-on experience with real-world examples. Aspiring machine learning engineers will find this course useful in order to learn how to scale their models using Spark, which is a must in today's machine learning landscape.

See salaries and explore the career path for Machine Learning Engineer

Data Scientist

A Data Scientist analyzes and interprets complex data sets to uncover insights and solve business challenges. This course may be useful as it provides an introduction to Apache Spark and Python, powerful tools for handling large datasets. As a data scientist, using this course, one can learn how to frame analysis problems as Spark problems, use Spark SQL, and even apply machine learning algorithms with Spark's MLLib. This course gives hands-on experience in setting up a development environment and running Spark jobs, which is a key skill for aspiring data scientists.

See salaries and explore the career path for Data Scientist

Big Data Architect

A Big Data Architect designs and oversees the implementation of big data solutions for organizations. This course builds a foundation by teaching the core concepts of Apache Spark and how it integrates with Hadoop YARN. The course touches on scaling Spark across computing clusters using Amazon's Elastic MapReduce service. Big data architects will benefit from learning how to structure complex analysis problems as Spark problems and gaining hands-on experience with setting up and running Spark jobs on real-world data.

See salaries and explore the career path for Big Data Architect

Cloud Solutions Architect

A Cloud Solutions Architect designs and implements cloud-based solutions for businesses. This course covers how to scale up data analysis using Amazon's Elastic MapReduce service. The course gives hands-on knowledge of setting up an AWS account and launching a Hadoop cluster with Spark, which are valuable skills for a cloud solutions architect. As a Cloud Solutions Architect, you will find that understanding how to optimize Spark jobs in the cloud can lead to better performance and cost efficiency for your company.

See salaries and explore the career path for Cloud Solutions Architect

Data Analyst

A Data Analyst collects, cleans, and analyzes data to identify trends and insights. This course may be useful in teaching how to use Apache Spark and Python to handle and process large datasets. As a data analyst, using this course, you can learn to translate analysis problems into Spark scripts and gain hands-on experience with analyzing movie ratings and social network data. This course also covers Spark SQL and DataFrames, which can be useful when working with structured data.

See salaries and explore the career path for Data Analyst

Business Intelligence Analyst

A Business Intelligence Analyst uses data to help organizations make better business decisions. The course teaches skills valuable for processing and analyzing large datasets using Apache Spark and Python. As a business intelligence analyst you can use this course to learn how to use Spark SQL to query and analyze data, translate complex analysis problems into Spark scripts, and use Spark Streaming to process real-time data. This course will provide you with practical experience in using Spark to extract insights from data.

See salaries and explore the career path for Business Intelligence Analyst

Research Scientist

A Research Scientist conducts research and experiments to advance scientific knowledge. This course may be useful as it helps one learn how to use Apache Spark and Python to process and analyze large datasets. Research scientists can use the skills learned in this course to scale their data analysis pipelines and apply machine learning algorithms with Spark's MLLib. The course provides hands-on experience in framing analysis problems as Spark problems and running them on cloud computing services.

See salaries and explore the career path for Research Scientist

Statistician

A Statistician collects, analyzes, and interprets numerical data to identify significant trends, patterns, and relationships. The course may be useful as it covers the use of Apache Spark and Python for handling large datasets. As a statistician, the knowledge of Spark's MLLib can be useful for applying statistical models and machine learning algorithms to big data. The course provides hands-on experience in setting up a development environment and running Spark jobs, which is a key asset when working with large datasets.

See salaries and explore the career path for Statistician

Software Engineer

A Software Engineer designs, develops, and tests software applications. This course may be useful if the software engineer needs to work with big data technologies or integrate Spark into their applications. As a software engineer this course may provide you with the skills to develop Spark jobs using Python and scale them up using Amazon's Elastic MapReduce service. It also covers Spark SQL and DataFrames, which are important for handling structured data.

See salaries and explore the career path for Software Engineer

Database Administrator

A Database Administrator is responsible for managing and maintaining databases. This course may be useful as it relates to understanding how big data technologies like Apache Spark interact with databases. Database administrators can use this course to gain insight into how Spark SQL can be used for querying and analyzing data stored in databases. This course provides an overview of Spark's capabilities in handling structured data and integrating with other data processing technologies.

See salaries and explore the career path for Database Administrator

Quantitative Analyst

A Quantitative Analyst develops and implements mathematical or statistical models for financial analysis and risk management. The concepts of data analysis, and using Spark to scale data-driven analytics can be useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on finance, which is the main subject of focus for a Quantitative Analyst.

See salaries and explore the career path for Quantitative Analyst

Bioinformatician

A Bioinformatician analyzes biological data using computational tools and techniques. The concepts of data analysis, and using Spark to scale data-driven analytics can be useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the biological domain.

See salaries and explore the career path for Bioinformatician

Econometrician

An Econometrician applies statistical methods to economic data to test and quantify economic theories and models. The concepts of data analysis, and using Spark to scale data-driven analytics can be useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the economic domain.

See salaries and explore the career path for Econometrician

Financial Analyst

A Financial Analyst provides guidance to businesses and individuals making investment decisions. This course may be useful as it helps to learn about data analysis, and using Spark to scale data-driven analytics can be useful tools to learn. This course may introduce you to these tools and help you think about how to conduct analysis. However, please note that this course does not focus on the finance domain.

See salaries and explore the career path for Financial Analyst

Taming Big Data with Apache Spark and Python - Hands On!

Here's a deal for you

What's inside

Syllabus

Traffic lights

Save this course

Reviews summary

Hands-on spark with python

Activities

Career center

Reading list

Share

Similar courses