PySpark helps you perform data analysis at-scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into various Spark components and its architecture.
You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL, overcoming the challenges involved in reading it. You'll use the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and for performance tuning.
By the end of this course, you will not only be able to perform efficient data analytics but will have also learned to use PySpark to easily analyze large datasets at-scale in your organization.
About the Author
Danny Meijer works as the Lead Data Engineer in the Netherlands for the Data and Analytics department of a leading sporting goods retailer. He is a business process expert, big data scientist, and data engineer, which gives him a unique mix of skills, the foremost of which is his business-first approach to data science and data engineering.
He has over 13 years of IT experience across various domains, with skills ranging from (big) data modeling, architecture, design, and development to project and process management; he also has extensive experience with process mining, data engineering on big data, and process improvement.
As a certified data scientist and big data professional, he knows his way around data and analytics and is proficient in various programming languages. He has extensive experience with a range of big data technologies, including NoSQL, Hadoop, Python, and, of course, Spark.
Danny is a driven person, motivated by everything related to data and big data. He loves math, machine learning, and tackling difficult problems.
This video gives an overview of the entire course.
One might wonder, why Spark, and where does Python fit in? In this video we will cover why Python is a good pick when working with Spark.
• Compare various programming languages; understand how Spark interacts with them
• Explore how Spark creates jobs
• Get a good understanding of where Python fits in
Here, we prepare for the course by downloading the data and exploring what the lab environment will look like.
• Download all the courseware
• Familiarize yourself with the layout of the courseware
• Learn how to use Docker and Jupyter
To follow along with the labs in this course, it is important to do some setting up.
• Set up the local development environment
• Run the first PySpark ‘Hello World’ script (a minimal sketch follows below)
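As a point of reference once your environment is ready, here is a minimal ‘Hello World’ sketch; it assumes a local Spark installation, and the app name and sample values are purely illustrative:

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession running locally on all available cores
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("hello-world")
             .getOrCreate())

    # Build a tiny DataFrame and show it to confirm the environment works
    df = spark.createDataFrame([("Hello", 1), ("World", 2)], ["word", "id"])
    df.show()

    spark.stop()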
What is Spark and where did it come from? Understanding where Spark came from and why it was created will help you understand why Spark is so good at what it does.
• Understand Spark’s history and where it came from
• Compare Spark with Hadoop’s MapReduce
• Introduce you to Spark’s ecosystem and walk through Spark components
How does Spark work internally, and how does it do cluster computing? Familiarizing yourself with Spark’s architecture and how it manages cluster computing will help build your understanding of how Spark works and how you work with it.
• Get familiarized with some of Spark’s architecture
• Walk you through Spark’s support for cluster managers
• Explain how Spark applications interact with a cluster
How does Spark fit into a data scientist’s workflow? To understand why Spark fits so well into that workflow, we will explore Spark’s machine learning library.
• Look at what MLlib is and cover its (high-level) internal components
• Glance at companies using Spark MLlib and use cases
One cannot do analytics without data. Hence it is important to understand how Spark handles data at its core – we will learn about Spark’s powerful DataFrame API.
• Cover Spark’s core abstraction, see how Spark handles data internally
• Look at how RDDs, datasets, DataFrames, and pandas work (together)
• Understand data immutability and how that affects Spark operations
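To make these relationships concrete, here is a small, hedged sketch (the column names and values are made up, not the course’s data) showing that a DataFrame is backed by an RDD, that transformations return new immutable DataFrames, and that small results can be handed over to pandas:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    # Every DataFrame is backed by an RDD of Row objects
    print(df.rdd.take(2))

    # DataFrames are immutable: withColumn returns a *new* DataFrame, df is unchanged
    df2 = df.withColumn("id_plus_one", df["id"] + 1)

    # Hand a (small) result over to pandas when it fits in driver memory
    pdf = df2.toPandas()
    print(pdf.head())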
How does Spark handle data operations? In this video we will continue zooming in on the Spark DataFrame, but this time we will focus on data operations, how Spark plans out its executions, and how it optimizes resources and execution.
• Look at why Spark is lazy and why that is a good thing
• Zoom in on the differences between the different kinds of operations
• Understand how the catalyst query optimizer optimizes resources and execution
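As a brief illustration of laziness (the numbers are arbitrary): transformations only build up a query plan, explain() shows what Catalyst intends to do, and nothing runs until an action is called:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)  # a DataFrame with a single 'id' column

    # Transformations: nothing is computed yet, Spark only records the plan
    filtered = df.filter(F.col("id") % 2 == 0)
    doubled = filtered.withColumn("doubled", F.col("id") * 2)

    # Inspect the plan produced by the Catalyst optimizer before running anything
    doubled.explain()

    # Actions (count, show, collect, write, ...) are what trigger execution
    print(doubled.count())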
Apache Spark’s MLlib has a built-in way of handling parameters, allowing you to set, tune, read, and deal with them centrally. In this video, we will explore this concept and see how Spark has unified APIs across its rich set of algorithms (the features module), as well as learn about pipeline persistence.
• Look at how parameters have self-contained documentation and are unified across algorithms
• Learn about feature extractors, transformers, selectors, and LSH
• Glance at how to save and load ML instances from and to disk
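A minimal sketch of this unified Param and persistence API, using the Tokenizer feature transformer as an example (the output path is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.getOrCreate()
    tokenizer = Tokenizer(inputCol="text", outputCol="words")

    # Every Param carries its own documentation, in the same format across algorithms
    print(tokenizer.explainParams())

    # Params are read and set through the same generic API everywhere
    tokenizer.setParams(outputCol="tokens")
    print(tokenizer.getOutputCol())

    # ML instances (and later, whole pipelines) can be saved to and loaded from disk
    tokenizer.save("/tmp/tokenizer-demo")          # illustrative path
    loaded = Tokenizer.load("/tmp/tokenizer-demo")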
We’ve learned a lot about Spark’s internal workings, but how do you actually load data with it? Get hands-on with Spark’s SQL module, and learn how to load data from a CSV file.
• Get introduced to Spark’s SQL module
• Learn how to load data using Spark
• Look at how to handle data schemas
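A hedged sketch of what loading a CSV with an explicit schema looks like; the file path and column names are illustrative, not the course’s dataset:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()

    # An explicit schema avoids a costly inferSchema pass and silently wrong types
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("title", StringType(), nullable=True),
    ])

    df = (spark.read
          .option("header", "true")
          .schema(schema)
          .csv("data/movies.csv"))  # illustrative path

    df.printSchema()
    df.show(5)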
A hands-on introduction to using Spark functions and DataFrame operations to wrangle data and fix data issues we might encounter, with a specific focus on data types and timestamps (part 1 of 2). A short sketch follows the list below.
• Explore Spark SQL functions
• Learn how to fix issues in our data using Spark
• Learn how to rename, add, manipulate, and drop columns
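A short sketch of these column operations, with made-up column names and values, showing a timestamp being parsed, a type being fixed, and columns being renamed and dropped:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2019-01-01 10:00:00", "42")], ["ts_str", "amount_str"])

    cleaned = (df
        .withColumn("event_time", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))  # parse a timestamp
        .withColumn("amount", F.col("amount_str").cast("int"))                      # fix a data type
        .withColumnRenamed("event_time", "ts")                                      # rename a column
        .drop("ts_str", "amount_str"))                                              # drop the originals

    cleaned.show(truncate=False)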
A hands-on introduction to using Spark functions and DataFrame operations to wrangle data and fix data issues we might encounter, with a specific focus on data types and timestamps (part 2 of 2).
• Look at complex data types and learn about converting data to arrays
• Glance at advanced cleaning techniques on how to extract data from existing data
• Apply some advanced filtering techniques
Grouping, joining, and aggregating are important parts of the Data Wrangling process. In this video, we will cover these topics in detail (part 1 of 2).
• Look at Spark’s join types in a nutshell
• Learn how to apply grouping
• Look at how to avoid ambiguous columns in your output when joining
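For illustration (the tables and columns are invented), joining on a column name rather than an expression keeps a single join column and avoids ambiguity downstream:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(1, 101, 4.0), (1, 102, 3.0), (2, 101, 5.0)], ["user_id", "movie_id", "rating"])
    movies = spark.createDataFrame(
        [(101, "Heat"), (102, "Casino")], ["movie_id", "title"])

    # Joining on a column *name* (rather than an expression) keeps a single
    # movie_id column in the output, which avoids ambiguous-column errors later
    joined = ratings.join(movies, on="movie_id", how="inner")

    joined.groupBy("title").count().show()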
Grouping, joining, and aggregating are important parts of the Data Wrangling process. In this video, we will cover these topics in detail (part 2 of 2).
• Zoom in on aggregations and aggregate functions
• Look at how to apply aliases to columns and DataFrames
• Glance at a list of important classes in PySpark SQL
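A small sketch of aggregate functions and aliases, on both columns and the DataFrame itself (the data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 4.0), (1, 3.0), (2, 5.0)], ["user_id", "rating"])

    summary = (df.alias("r")                          # alias the DataFrame itself
        .groupBy("user_id")
        .agg(F.count("rating").alias("n_ratings"),    # alias the aggregated columns
             F.avg("rating").alias("avg_rating"),
             F.max("rating").alias("max_rating")))

    summary.show()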
Spark ships with two machine learning libraries. How does one tell them apart? When should you use which? In this video, we zoom in on these two libraries, explore the differences, and see how to properly use them in code (a short import sketch follows the list below).
• Zoom in on Spark’s machine learning libraries
• Explore how to use MLlib properly in code
• Explore Spark MLlib and its documentation
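One quick, hedged way to keep the two apart in code is by their import paths: pyspark.mllib is the older RDD-based library, while pyspark.ml is the newer DataFrame-based one:

    # pyspark.mllib: the original, RDD-based machine learning library
    from pyspark.mllib.recommendation import ALS as RDDBasedALS

    # pyspark.ml: the DataFrame-based machine learning library used in this course
    from pyspark.ml.recommendation import ALS as DataFrameBasedALS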
Our use case is to build a system that recommends movies to users using PySpark. In this video, we start by understanding how to build a recommender system and how Spark goes about this.
• Explore what a recommender system consists of
• Explore collaborative filtering and look at how Spark implements this
• Look at how to go about building a recommender system
Our use case is to build a system that recommends movies to users using PySpark. In this video, we get hands-on with MLlib and explore what a recommender system looks like in code.
• Zoom in on the ALS algorithm and how it looks in code
• Learn about hyperparameters
• Start building a recommender system in a hands-on way
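A minimal sketch of the ALS estimator with a few hyperparameters set explicitly; the column names and values are illustrative rather than the course’s exact setup:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0)], ["userId", "movieId", "rating"])

    als = ALS(
        userCol="userId", itemCol="movieId", ratingCol="rating",
        rank=10,                     # number of latent factors (hyperparameter)
        maxIter=5,                   # number of ALS iterations (hyperparameter)
        regParam=0.1,                # regularization strength (hyperparameter)
        coldStartStrategy="drop")    # drop NaN predictions for unseen users/items

    model = als.fit(ratings)         # fitting the estimator returns an ALSModel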
Our use case is to build a system that recommends movies to users using PySpark. In this video, we get hands-on with MLlib, focusing on model performance.
• Determine model performance using an evaluator
• Tune the model and find the best fit for our data
• Apply hyperparameter tuning, use parameter grid and apply cross validation
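A hedged sketch of how an evaluator, a parameter grid, and cross-validation fit together for ALS; the grid values are arbitrary, and training_df stands in for your training DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    spark = SparkSession.builder.getOrCreate()

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop")

    # The evaluator scores predictions; here, root-mean-square error on the rating
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")

    # The grid enumerates hyperparameter combinations to try
    grid = (ParamGridBuilder()
            .addGrid(als.rank, [5, 10])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())

    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    # cv.fit(training_df) would train every combination and keep the best model
    # (training_df stands in for a ratings DataFrame like the one built earlier)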
Our use case is to build a system that recommends movies to users using PySpark. In this video, we will finalize our recommender system and use it to make some recommendations.
• Assemble all the parts into an end-to-end solution
• Learn how a recommendation model can be used and integrated
• Let the model make some actual recommendations
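Assuming the fitted model and the ratings DataFrame from the ALS sketch above, generating recommendations is then a single call per flavor:

    # Assumes `model` (a fitted ALSModel) and `ratings` from the sketch above

    # Top 5 movie recommendations for every user
    user_recs = model.recommendForAllUsers(5)
    user_recs.show(truncate=False)

    # Or recommendations for a specific subset of users
    some_users = ratings.select("userId").distinct().limit(3)
    subset_recs = model.recommendForUserSubset(some_users, 5)
    subset_recs.show(truncate=False)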
You have now built your first use case using Spark MLlib. Along the way, you should have acquired a basic understanding of how to use Spark MLlib and become familiar with its syntax. In this video, we will recap these learnings and explore some additional takeaways.
• Explore explicit versus implicit feedback strategies
• Look at how Spark MLlib tackles scalability
• Learn how to tackle the so-called cold start problem
In the last section you got to use Spark’s machine learning library, specifically the recommendation part of it. There is, however, so much more to learn about MLlib. Here, we set out to discover which things (about MLlib) are important but not explicitly or easily available in the official documentation.
• Revisit MLlib documentation, focusing on the implicit details
• See where to find linear algebra (vectors and matrices)
• Explore built-in data sources, sample data, utilities, and mixins
When one does machine learning, it practically never happens that a single algorithm is enough for your analysis. So, how does Spark handle chaining multiple algorithms? MLlib’s pipeline API makes it easy to combine multiple algorithms. We will focus on this topic in this video.
• Understand the key concepts of pipelines: estimator, transformer, and parameter
• Build a solid understanding of how pipelines consist of stages
• Look at how pipelines and pipeline models are assembled (a minimal sketch follows this list)
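A minimal sketch of a pipeline with two transformers and one estimator as stages (the tiny training set is invented):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    training = spark.createDataFrame(
        [("spark is great", 1.0), ("boring movie", 0.0)], ["text", "label"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")        # Transformer
    hashing_tf = HashingTF(inputCol="words", outputCol="features")   # Transformer
    lr = LogisticRegression(maxIter=10)                              # Estimator

    # A Pipeline is itself an Estimator; fitting it returns a PipelineModel
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(training)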
We have learned about pipelines and pipeline models. In this video, we will put our learnings to the test and get hands-on with pipelines. See how a pipeline is defined and how regressions and classifications are handled.
• Get hands-on with pipeline syntax
• Look at how a pipeline stage needs to be unique
• Glance at how to and where to find regression and classification modules
We will round off our journey through Spark’s rich machine learning library by diving deep into frequent pattern mining and statistics, while briefly revisiting hyperparameter tuning, this time applying it to a pipeline.
• Learn about algorithms to perform basic statistics and frequent pattern mining
• Apply hyperparameter tuning to a machine learning pipeline
• Look at a high-level rundown of all remaining MLlib algorithms and modules
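A brief, hedged sketch of both topics: FP-Growth for frequent pattern mining and a Pearson correlation matrix from the statistics module (the baskets and vectors are invented):

    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth
    from pyspark.ml.stat import Correlation
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Frequent pattern mining over baskets of items
    baskets = spark.createDataFrame(
        [(0, ["a", "b"]), (1, ["a", "b", "c"]), (2, ["a"])], ["id", "items"])
    fp_model = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.5).fit(baskets)
    fp_model.freqItemsets.show()

    # Basic statistics: a Pearson correlation matrix over a vector column
    vectors = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([2.0, 4.0]),)], ["features"])
    Correlation.corr(vectors, "features").show(truncate=False)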
Throughout this section, we will be exploring and preparing for our sentiment analysis use case. We start by taking a structured approach to big data analysis. Additionally, we will explore how Spark handles natural language processing and how to do sentiment analysis with Spark.
• Take a structured approach to big data analysis
• Explore how Spark handles natural language processing
• Look at how we are going to handle sentiment analysis with Spark
A model can’t be trained without the right data. Here, we will identify the data source we will use for training our model. We focus on data identification and data acquisition and then start collecting the data we need to train a sentiment analysis model on.
• Identify datasets required for our use case
• Start exploring the data using PySpark
• Learn how to set a schema that is reusable across data sources
Continuing the structured approach, you will learn additional tips and tricks on how the Spark lab environment can be used effectively for exploring a dataset, focusing heavily on the interaction between pandas and PySpark. We will also get to see some data visualization using seaborn.
• Explore the data in detail using the Spark lab
• Record (initial) findings for the data cleaning step
• Glance at data visualization using seaborn
The next thing to do is to ensure our data is ready for future use. Here, we end our exploration and acquisition phase by verifying that our data is ready for cleaning (a short sketch follows the list below).
• Ensure that the data is ‘stable’ (checking for null data)
• Make the data fit for use by applying partitioning operators
• Write the ‘RAW’ data to a specified location
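A small sketch of these checks and the write step; the sample rows, partition count, and output path are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("good movie", "2019-01-01"), (None, "2019-01-02")], ["text", "date"])

    # Count nulls per column to verify the data is 'stable' enough to move on
    df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

    # Repartition for a sensible file layout, then write the RAW data out
    (df.repartition(4)
       .write.mode("overwrite")
       .parquet("output/raw/reviews"))  # illustrative location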
No natural language processing is complete without understanding regular expressions. Here, we will start our data cleaning process by taking the findings from the acquire phase of the previous section and applying those to the data that we prepared using regular expressions.
• Explore and learn about regular expressions
• Prepare our data by applying cleaning logic on it
Here, we will finish off our data cleaning process by applying Spark SQL functions and RegEx on our data set. By the end of this video, we will have stored the cleaned data set in a new location, ready for the next step.
• Continue preparing our data by applying cleaning logic on it
• Explore regular expressions using Spark
• Store our clean/transformed data in a new location
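A hedged sketch of this kind of regex-based cleaning and the final write; the patterns, sample text, and output path are illustrative, not the course’s exact logic:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Loved it!!! http://example.com @someone",)], ["text"])

    cleaned = (df
        .withColumn("text", F.regexp_replace("text", r"http\S+", ""))     # strip URLs
        .withColumn("text", F.regexp_replace("text", r"@\w+", ""))        # strip mentions
        .withColumn("text", F.regexp_replace("text", r"[^a-zA-Z\s]", "")) # keep letters only
        .withColumn("text", F.trim(F.lower(F.col("text")))))              # normalize case and whitespace

    cleaned.show(truncate=False)

    # Store the cleaned data in a new location for the next step
    cleaned.write.mode("overwrite").parquet("output/clean/reviews")  # illustrative location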
The data preparation and wrangling steps are now finished. The next step is to do the data analysis and train a sentiment analysis model. In this video, we start this process by selecting algorithms, then training and tuning them.
• Select which algorithms to use
• Split the data for training, validation, and test
• Train and tune our sentiment analysis ML pipeline
Here, we explore the results from the previous model training step and persist the resulting model to disk for future use.
• Explore the results from the previous model training step
• Use ML persistence to store the model for future use
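Assuming model is a fitted PipelineModel (the path is illustrative), persisting and reloading it looks roughly like this:

    from pyspark.ml import PipelineModel

    # Persist the fitted model to disk (illustrative path)
    model.write().overwrite().save("models/sentiment-pipeline")

    # Later (for example in the streaming job), reload it for scoring
    loaded_model = PipelineModel.load("models/sentiment-pipeline")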
To perform real-time streaming analysis, we are going to need a source capable of providing data in a fast and steady stream. For this reason, we are going to use the Twitter API to grab data in real time.
• Set up Twitter developer API credentials
• Use the Twitter API to grab data in real time
Next, it is important to understand how Spark handles streaming data. Here, we will focus on understanding Spark (Structured) Streaming.
• Understand the difference between Spark Streaming and Spark Structured Streaming
• Learn about structured streaming unbounded tables
• Look at the structured streaming programming model: triggers, inputs, queries, results, and outputs
Next, we continue our focus on structured streaming, this time getting more hands-on with it. We look at how to manage streams and how to convert between static and streaming applications (a short sketch follows the list below).
• Learn about streaming DataFrames
• Convert a static job to a streaming job
• Interact with a Twitter datastream using Spark structured streaming
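A minimal sketch of a streaming job; it uses the built-in socket source (start a test source first, for example with nc -lk 9999) instead of Twitter, since the Twitter wiring is course-specific, but the readStream/writeStream pattern is the same:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # readStream instead of read turns a static job into a streaming one
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # The same DataFrame operations apply to the streaming DataFrame
    words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each trigger updates the unbounded result table; print it to the console
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    # query.awaitTermination()  # uncomment to block until the stream stops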
We will round off our use case by putting all we have learned across the last three sections together and finally assemble a real-time sentiment analysis solution.
• Deploy sentiment analysis model
• Implement and integrate MLlib with Spark SQL and structured streaming
Here, we will round off our deep-dive journey into Spark MLlib and Spark Streaming, recapping all we have gone through in the previous three sections and focusing on the structured approach we took to get there.
• Recap the structured approach across the sections
So far, we have used Spark only in a lab environment. How do we go beyond that? Here, we will share with you how Spark can be run in a production setting. You will learn about submitting and packaging Spark applications.
• Understand how to submit an application to Spark
• Use spark-submit to submit PySpark applications
• Learn about how you can package your PySpark applications
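A hedged sketch of a self-contained PySpark application and the kind of spark-submit command used to run it; the file and package names are illustrative:

    # my_app.py -- a self-contained PySpark application (file name is illustrative)
    #
    # Submit it with, for example:
    #   spark-submit --master local[*] --py-files deps.zip my_app.py
    from pyspark.sql import SparkSession


    def main():
        spark = SparkSession.builder.appName("my-app").getOrCreate()
        df = spark.range(10)
        print(df.count())
        spark.stop()


    if __name__ == "__main__":
        main()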
What is required to scale Spark up, and where could you do that? Here, we talk about running Spark at scale, sharing vendors that can run Spark in the cloud and showing various things to keep in mind while scaling up, such as configuration mappings.
• Cover Spark configuration mappings
• Explore various commercial cloud vendors that offer Spark
Here, we will round off this section by sharing tips, tricks, and takeaways, and by recapping all we have learned throughout the course.
• Look at tips and tricks for your further journey into Spark
• Deep dive into Databricks
• Highlight all the take-aways as a closure to the course