PySpark helps you perform data analysis at-scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into various Spark components and its architecture.
You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL, overcoming the challenges involved in reading it. You'll use the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and for performance tuning.
By the end of this course, you will not only be able to perform efficient data analytics but will have also learned to use PySpark to easily analyze large datasets at-scale in your organization.
About the Author
Danny Meijer works as the Lead Data Engineer in the Netherlands for the Data and Analytics department of a leading sporting goods retailer. He is a business process expert, big data scientist, and data engineer, which gives him a unique mix of skills, the foremost of which is his business-first approach to data science and data engineering.
He has over 13 years of IT experience across various domains, with skills ranging from (big) data modeling, architecture, design, and development to project and process management; he also has extensive experience with process mining, data engineering on big data, and process improvement.
As a certified data scientist and big data professional, he knows his way around data and analytics and is proficient in various programming languages. He has extensive experience with a range of big data technologies, including NoSQL, Hadoop, Python, and, of course, Spark.
Danny is a driven person, motivated by everything related to data and big data. He loves math, machine learning, and tackling difficult problems.
This video gives an overview of the entire course.
One might wonder, why Spark, and where does Python fit in? In this video we will cover why Python is a good pick when working with Spark.
• Compare various programming languages; understand how Spark interacts with them
• Explore how Spark creates jobs
• Get a good understanding of where Python fits in
Here, we prepare for the course by downloading the data and exploring what the lab environment will look like.
• Download all the courseware
• Familiarize yourself with the layout of the courseware
• Learn how to use Docker and Jupyter
To follow along with the labs in this course, it is important to do some setting up.
• Set up the local development environment
• Run the first PySpark ‘Hello World’ script (a minimal sketch follows below)
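As a point of reference once your environment is ready, here is a minimal ‘Hello World’ sketch; it assumes a local Spark installation, and the app name and sample values are purely illustrative:

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession running locally on all available cores
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("hello-world")
             .getOrCreate())

    # Build a tiny DataFrame and show it to confirm the environment works
    df = spark.createDataFrame([("Hello", 1), ("World", 2)], ["word", "id"])
    df.show()

    spark.stop()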
What is Spark and where did it come from? Understanding where Spark came from and why it was created will help you understand why Spark is so good at what it does.
• Understand Spark’s history and where it came from
• Compare Spark with Hadoop’s MapReduce
• Introduce you to Spark’s ecosystem and walk through Spark components
How does Spark work internally, and how does it do cluster computing? Familiarizing yourself with Spark’s architecture and how it manages cluster computing will help build your understanding of how Spark works and how you work with it.
• Get familiarized with some of Spark’s architecture
• Walk you through Spark’s support for cluster managers
• Explain how Spark applications interact with a cluster
How does Spark fit into a data scientist’s workflow? To understand why Spark fits so well into that workflow, we will explore Spark’s machine learning library.
• Look at what MLlib is and cover its (high-level) internal components
• Glance at companies using Spark MLlib and use cases
One cannot do analytics without data. Hence it is important to understand how Spark handles data at its core – we will learn about Spark’s powerful DataFrame API.
• Cover Spark’s core abstraction, see how Spark handles data internally
• Look at how RDDs, datasets, DataFrames, and pandas work (together)
• Understand data immutability and how that affects Spark operations
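To make these relationships concrete, here is a small, hedged sketch (the column names and values are made up, not the course’s data) showing that a DataFrame is backed by an RDD, that transformations return new immutable DataFrames, and that small results can be handed over to pandas:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    # Every DataFrame is backed by an RDD of Row objects
    print(df.rdd.take(2))

    # DataFrames are immutable: withColumn returns a *new* DataFrame, df is unchanged
    df2 = df.withColumn("id_plus_one", df["id"] + 1)

    # Hand a (small) result over to pandas when it fits in driver memory
    pdf = df2.toPandas()
    print(pdf.head())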
How does Spark handle data operations? In this video we will continue zooming in on the Spark DataFrame, but this time we will focus on data operations, how Spark plans out its executions, and how it optimizes resources and execution.
• Look at why Spark is lazy and why that is a good thing
• Zoom in on the differences between the different kinds of operations
• Understand how the catalyst query optimizer optimizes resources and execution
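As a brief illustration of laziness (the numbers are arbitrary): transformations only build up a query plan, explain() shows what Catalyst intends to do, and nothing runs until an action is called:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)  # a DataFrame with a single 'id' column

    # Transformations: nothing is computed yet, Spark only records the plan
    filtered = df.filter(F.col("id") % 2 == 0)
    doubled = filtered.withColumn("doubled", F.col("id") * 2)

    # Inspect the plan produced by the Catalyst optimizer before running anything
    doubled.explain()

    # Actions (count, show, collect, write, ...) are what trigger execution
    print(doubled.count())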
Apache Spark’s MLlib has a built-in way of handling parameters, allowing you to set, tune, read, and deal with them centrally. In this video, we will explore this concept and see how Spark has unified APIs across its rich set of algorithms (the features module), as well as learn about pipeline persistence.
• Look at how parameters have self-contained documentation and are unified across algorithms
• Learn about feature extractors, transformers, selectors, and LSH
• Glance at how to save and load ML instances from and to disk
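A minimal sketch of this unified Param and persistence API, using the Tokenizer feature transformer as an example (the output path is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer

    spark = SparkSession.builder.getOrCreate()
    tokenizer = Tokenizer(inputCol="text", outputCol="words")

    # Every Param carries its own documentation, in the same format across algorithms
    print(tokenizer.explainParams())

    # Params are read and set through the same generic API everywhere
    tokenizer.setParams(outputCol="tokens")
    print(tokenizer.getOutputCol())

    # ML instances (and later, whole pipelines) can be saved to and loaded from disk
    tokenizer.save("/tmp/tokenizer-demo")          # illustrative path
    loaded = Tokenizer.load("/tmp/tokenizer-demo")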
We’ve learned a lot about Spark’s internal workings, but how do you actually load data with it? Get hands-on with Spark’s SQL module, and learn how to load data from a CSV file.
• Get introduced to Spark’s SQL module
• Learn how to load data using Spark
• Look at how to handle data schemas
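A hedged sketch of what loading a CSV with an explicit schema looks like; the file path and column names are illustrative, not the course’s dataset:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()

    # An explicit schema avoids a costly inferSchema pass and silently wrong types
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("title", StringType(), nullable=True),
    ])

    df = (spark.read
          .option("header", "true")
          .schema(schema)
          .csv("data/movies.csv"))  # illustrative path

    df.printSchema()
    df.show(5)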
A hands-on introduction to using Spark functions and DataFrame operations to wrangle data and fix data issues we might encounter, with a specific focus on data types and timestamps (part 1 of 2). A short sketch follows the list below.
• Explore Spark SQL functions
• Learn how to fix issues in our data using Spark
• Learn how to rename, add, manipulate, and drop columns
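A short sketch of these column operations, with made-up column names and values, showing a timestamp being parsed, a type being fixed, and columns being renamed and dropped:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2019-01-01 10:00:00", "42")], ["ts_str", "amount_str"])

    cleaned = (df
        .withColumn("event_time", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))  # parse a timestamp
        .withColumn("amount", F.col("amount_str").cast("int"))                      # fix a data type
        .withColumnRenamed("event_time", "ts")                                      # rename a column
        .drop("ts_str", "amount_str"))                                              # drop the originals

    cleaned.show(truncate=False)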
A hands-on introduction to using Spark functions and DataFrame operations to wrangle data and fix data issues we might encounter, with a specific focus on data types and timestamps (part 2 of 2).
• Look at complex data types and learn about converting data to arrays
• Glance at advanced cleaning techniques on how to extract data from existing data
• Apply some advanced filtering techniques
Grouping, joining, and aggregating are important parts of the Data Wrangling process. In this video, we will cover these topics in detail (part 1 of 2).
• Look at Spark’s join types in a nutshell
• Learn how to apply grouping
• Look at how to avoid ambiguous columns in your output when joining
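For illustration (the tables and columns are invented), joining on a column name rather than an expression keeps a single join column and avoids ambiguity downstream:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(1, 101, 4.0), (1, 102, 3.0), (2, 101, 5.0)], ["user_id", "movie_id", "rating"])
    movies = spark.createDataFrame(
        [(101, "Heat"), (102, "Casino")], ["movie_id", "title"])

    # Joining on a column *name* (rather than an expression) keeps a single
    # movie_id column in the output, which avoids ambiguous-column errors later
    joined = ratings.join(movies, on="movie_id", how="inner")

    joined.groupBy("title").count().show()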
Grouping, joining, and aggregating are important parts of the Data Wrangling process. In this video, we will cover these topics in detail (part 2 of 2).
• Zoom in on aggregations and aggregate functions
• Look at how to apply aliases to columns and DataFrames
• Glance at a list of important classes in PySpark SQL
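A small sketch of aggregate functions and aliases, on both columns and the DataFrame itself (the data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 4.0), (1, 3.0), (2, 5.0)], ["user_id", "rating"])

    summary = (df.alias("r")                          # alias the DataFrame itself
        .groupBy("user_id")
        .agg(F.count("rating").alias("n_ratings"),    # alias the aggregated columns
             F.avg("rating").alias("avg_rating"),
             F.max("rating").alias("max_rating")))

    summary.show()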
Spark ships with two machine learning libraries. How does one tell them apart? When should you use which? In this video, we zoom in on these two libraries, explore the differences, and see how to properly use them in code (a short import sketch follows the list below).
• Zoom in on Spark’s machine learning libraries
• Explore how to use MLlib properly in code
• Explore Spark MLlib and its documentation
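One quick, hedged way to keep the two apart in code is by their import paths: pyspark.mllib is the older RDD-based library, while pyspark.ml is the newer DataFrame-based one:

    # pyspark.mllib: the original, RDD-based machine learning library
    from pyspark.mllib.recommendation import ALS as RDDBasedALS

    # pyspark.ml: the DataFrame-based machine learning library used in this course
    from pyspark.ml.recommendation import ALS as DataFrameBasedALS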
Our use case is to build a system that recommends movies to users using PySpark. In this video, we start by understanding how to build a recommender system and how Spark goes about this.
• Explore what a recommender system consists of
• Explore collaborative filtering and look at how Spark implements this
• Look at how to go about building a recommender system
Our use case is to build a system that recommends movies to users using PySpark. In this video, we get hands-on with MLlib and explore what a recommender system looks like in code.
• Zoom in on the ALS algorithm and how it looks in code
• Learn about hyperparameters
• Start building a recommender system in a hands-on way
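A minimal sketch of the ALS estimator with a few hyperparameters set explicitly; the column names and values are illustrative rather than the course’s exact setup:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.getOrCreate()
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0)], ["userId", "movieId", "rating"])

    als = ALS(
        userCol="userId", itemCol="movieId", ratingCol="rating",
        rank=10,                     # number of latent factors (hyperparameter)
        maxIter=5,                   # number of ALS iterations (hyperparameter)
        regParam=0.1,                # regularization strength (hyperparameter)
        coldStartStrategy="drop")    # drop NaN predictions for unseen users/items

    model = als.fit(ratings)         # fitting the estimator returns an ALSModel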
Our use case is to build a system that recommends movies to users using PySpark. In this video, we get hands-on with MLlib, focusing on model performance.
• Determine model performance using an evaluator
• Tune the model and find the best fit for our data
• Apply hyperparameter tuning, use parameter grid and apply cross validation
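A hedged sketch of how an evaluator, a parameter grid, and cross-validation fit together for ALS; the grid values are arbitrary, and training_df stands in for your training DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    spark = SparkSession.builder.getOrCreate()

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop")

    # The evaluator scores predictions; here, root-mean-square error on the rating
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")

    # The grid enumerates hyperparameter combinations to try
    grid = (ParamGridBuilder()
            .addGrid(als.rank, [5, 10])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())

    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    # cv.fit(training_df) would train every combination and keep the best model
    # (training_df stands in for a ratings DataFrame like the one built earlier)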
Our use case is to build a system that recommends movies to users using PySpark. In this video, we will finalize our recommender system and use it to make some recommendations.
• Assemble all the parts into an end-to-end solution
• Learn how a recommendation model can be used and integrated
• Let the model make some actual recommendations
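Assuming the fitted model and the ratings DataFrame from the ALS sketch above, generating recommendations is then a single call per flavor:

    # Assumes `model` (a fitted ALSModel) and `ratings` from the sketch above

    # Top 5 movie recommendations for every user
    user_recs = model.recommendForAllUsers(5)
    user_recs.show(truncate=False)

    # Or recommendations for a specific subset of users
    some_users = ratings.select("userId").distinct().limit(3)
    subset_recs = model.recommendForUserSubset(some_users, 5)
    subset_recs.show(truncate=False)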
You have now built your first use case using Spark MLlib. Along the way, you should have acquired a basic understanding of how to use Spark MLlib and become familiar with its syntax. In this video, we will recap these learnings and explore some additional takeaways.
• Explore explicit versus implicit feedback strategies
• Look at how Spark MLlib tackles scalability
• Learn how to tackle the so-called cold start problem
In the last section you got to use Spark’s machine learning library, specifically the recommendation part of it. There is, however, so much more to learn about MLlib. Here, we set out to discover which things (about MLlib) are important but not explicitly or easily available in the official documentation.
• Revisit MLlib documentation, focusing on the implicit details
• See where to find linear algebra (vectors and matrices)
• Explore built-in data sources, sample data, utilities, and mixins
When one does machine learning, it practically never happens that a single algorithm is enough for your analysis. So, how does Spark handle chaining multiple algorithms? MLlib’s pipeline API makes it easy to combine multiple algorithms. We will focus on this topic in this video.
• Understand the key concepts of pipelines: estimator, transformer, and parameter
• Build a solid understanding of how pipelines consist of stages
• Look at how pipelines and pipeline models are assembled (a minimal sketch follows this list)
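A minimal sketch of a pipeline with two transformers and one estimator as stages (the tiny training set is invented):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    training = spark.createDataFrame(
        [("spark is great", 1.0), ("boring movie", 0.0)], ["text", "label"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")        # Transformer
    hashing_tf = HashingTF(inputCol="words", outputCol="features")   # Transformer
    lr = LogisticRegression(maxIter=10)                              # Estimator

    # A Pipeline is itself an Estimator; fitting it returns a PipelineModel
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(training)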
We have learned about pipelines and pipeline models. In this video, we will put our learnings to the test and get hands-on with pipelines. See how a pipeline is defined and how regressions and classifications are handled.
• Get hands-on with pipeline syntax
• Look at how a pipeline stage needs to be unique
• Glance at how to and where to find regression and classification modules
We will round off our journey through Spark’s rich machine learning library by diving deep into frequent pattern mining and statistics, while briefly revisiting hyperparameter tuning, this time applying it to a pipeline.
• Learn about algorithms to perform basic statistics and frequent pattern mining
• Apply hyperparameter tuning to a machine learning pipeline
• Look at a high-level rundown of all remaining MLlib algorithms and modules
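A brief, hedged sketch of both topics: FP-Growth for frequent pattern mining and a Pearson correlation matrix from the statistics module (the baskets and vectors are invented):

    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth
    from pyspark.ml.stat import Correlation
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Frequent pattern mining over baskets of items
    baskets = spark.createDataFrame(
        [(0, ["a", "b"]), (1, ["a", "b", "c"]), (2, ["a"])], ["id", "items"])
    fp_model = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.5).fit(baskets)
    fp_model.freqItemsets.show()

    # Basic statistics: a Pearson correlation matrix over a vector column
    vectors = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([2.0, 4.0]),)], ["features"])
    Correlation.corr(vectors, "features").show(truncate=False)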
Throughout this section, we will be exploring and preparing for our sentiment analysis use case. We start by taking a structured approach to big data analysis. Additionally, we will explore how Spark handles natural language processing and how to do sentiment analysis with Spark.
• Take a structured approach to big data analysis
• Explore how Spark handles natural language processing
• Look at how we are going to handle sentiment analysis with Spark
A model can’t be trained without the right data. Here, we will identify the data source we will use for training our model. We focus on data identification and data acquisition and then start collecting the data we need to train a sentiment analysis model on.
• Identify datasets required for our use case
• Start exploring the data using PySpark
• Learn how to set a schema that is reusable across data sources
Continuing the structured approach, you will learn additional tips and tricks on how the Spark lab environment can be used effectively for exploring a dataset, focusing heavily on the interaction between pandas and PySpark. We will also get to see some data visualization using seaborn.
• Explore the data in detail using the Spark lab
• Record (initial) findings for the data cleaning step
• Glance at data visualization using seaborn
The next thing to do is to ensure our data is ready for future use. Here, we end our exploration and acquisition phase by verifying that our data is ready for cleaning (a short sketch follows the list below).
• Ensure that the data is ‘stable’ (checking for null data)
• Make the data fit for use by applying partitioning operators
• Write the ‘RAW’ data to a specified location
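A small sketch of these checks and the write step; the sample rows, partition count, and output path are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("good movie", "2019-01-01"), (None, "2019-01-02")], ["text", "date"])

    # Count nulls per column to verify the data is 'stable' enough to move on
    df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

    # Repartition for a sensible file layout, then write the RAW data out
    (df.repartition(4)
       .write.mode("overwrite")
       .parquet("output/raw/reviews"))  # illustrative location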
No natural language processing is complete without understanding regular expressions. Here, we will start our data cleaning process by taking the findings from the acquire phase of the previous section and applying those to the data that we prepared using regular expressions.
• Explore and learn about regular expressions
• Prepare our data by applying cleaning logic on it
Here, we will finish off our data cleaning process by applying Spark SQL functions and RegEx on our data set. By the end of this video, we will have stored the cleaned data set in a new location, ready for the next step.
• Continue preparing our data by applying cleaning logic on it
• Explore regular expressions using Spark
• Store our clean/transformed data in a new location
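A hedged sketch of this kind of regex-based cleaning and the final write; the patterns, sample text, and output path are illustrative, not the course’s exact logic:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Loved it!!! http://example.com @someone",)], ["text"])

    cleaned = (df
        .withColumn("text", F.regexp_replace("text", r"http\S+", ""))     # strip URLs
        .withColumn("text", F.regexp_replace("text", r"@\w+", ""))        # strip mentions
        .withColumn("text", F.regexp_replace("text", r"[^a-zA-Z\s]", "")) # keep letters only
        .withColumn("text", F.trim(F.lower(F.col("text")))))              # normalize case and whitespace

    cleaned.show(truncate=False)

    # Store the cleaned data in a new location for the next step
    cleaned.write.mode("overwrite").parquet("output/clean/reviews")  # illustrative location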
The data preparation and wrangling steps are now finished. The next step is to do the data analysis and train a sentiment analysis model. In this video, we start this process by selecting algorithms, then training and tuning them.
• Select which algorithms to use
• Split the data for training, validation, and test
• Train and tune our sentiment analysis ML pipeline
Here, we explore the results from the previous model training step and persist the resulting model to disk for future use.
• Explore the results from the previous model training step
• Use ML persistence to store the model for future use
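Assuming model is a fitted PipelineModel (the path is illustrative), persisting and reloading it looks roughly like this:

    from pyspark.ml import PipelineModel

    # Persist the fitted model to disk (illustrative path)
    model.write().overwrite().save("models/sentiment-pipeline")

    # Later (for example in the streaming job), reload it for scoring
    loaded_model = PipelineModel.load("models/sentiment-pipeline")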
To perform real-time streaming analysis, we are going to need a source capable of providing data in a fast and steady stream. For this reason, we are going to use the Twitter API to grab data in real time.
• Set up Twitter developer API credentials
• Use the Twitter API to grab data in real time
Next, it is important to understand how Spark handles streaming data. Here, we will focus on understanding Spark (Structured) Streaming.
• Understand the difference between Spark Streaming and Spark Structured Streaming
• Learn about structured streaming unbounded tables
• Look at the structured streaming programming model: triggers, inputs, queries, results, and outputs
Next, we continue our focus on structured streaming, this time getting more hands-on with it. We look at how to manage streams and how to convert between static and streaming applications (a short sketch follows the list below).
• Learn about streaming DataFrames
• Convert a static job to a streaming job
• Interact with a Twitter datastream using Spark structured streaming
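A minimal sketch of a streaming job; it uses the built-in socket source (start a test source first, for example with nc -lk 9999) instead of Twitter, since the Twitter wiring is course-specific, but the readStream/writeStream pattern is the same:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # readStream instead of read turns a static job into a streaming one
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # The same DataFrame operations apply to the streaming DataFrame
    words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each trigger updates the unbounded result table; print it to the console
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    # query.awaitTermination()  # uncomment to block until the stream stops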
We will round off our use case by putting all we have learned across the last three sections together and finally assemble a real-time sentiment analysis solution.
• Deploy sentiment analysis model
• Implement and integrate MLlib with Spark SQL and structured streaming
Here, we will round off our deep-dive journey into Spark MLlib and Spark Streaming, recapping all we have gone through in the previous three sections and focusing on the structured approach we took to get there.
• Recap the structured approach across the sections
So far, we have used Spark only in a lab environment. How do we go beyond that? Here, we will share with you how Spark can be run in a production setting. You will learn about submitting and packaging Spark applications.
• Understand how to submit an application to Spark
• Use spark-submit to submit PySpark applications
• Learn about how you can package your PySpark applications
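A hedged sketch of a self-contained PySpark application and the kind of spark-submit command used to run it; the file and package names are illustrative:

    # my_app.py -- a self-contained PySpark application (file name is illustrative)
    #
    # Submit it with, for example:
    #   spark-submit --master local[*] --py-files deps.zip my_app.py
    from pyspark.sql import SparkSession


    def main():
        spark = SparkSession.builder.appName("my-app").getOrCreate()
        df = spark.range(10)
        print(df.count())
        spark.stop()


    if __name__ == "__main__":
        main()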
What is required to scale Spark up, and where could you do that? Here, we talk about running Spark at scale, sharing vendors that can run Spark in the cloud and showing various things to keep in mind while scaling up, such as configuration mappings.
• Cover Spark configuration mappings
• Explore various commercial cloud vendors that offer Spark
Here, we will round off this section by sharing tips, tricks, and takeaways, and by recapping all we have learned throughout the course.
• Look at tips and tricks for your further journey into Spark
• Deep dive into Databricks
• Highlight all the take-aways as a closure to the course