Apache Spark with Scala - Hands On with Big Data! from Udemy

What's inside

Learning objectives

Develop distributed code using the scala programming language
Transform structured data using sparksql, datasets, and dataframes
Frame big data analysis problems as apache spark scripts
Optimize spark jobs through partitioning, caching, and other techniques

Build, deploy, and run spark scripts on hadoop clusters
Process continual streams of data with spark streaming
Traverse and analyze graph structures using graphx
Analyze massive data set with machine learning on spark

Develop distributed code using the scala programming language
Transform structured data using sparksql, datasets, and dataframes
Frame big data analysis problems as apache spark scripts
Optimize spark jobs through partitioning, caching, and other techniques
Build, deploy, and run spark scripts on hadoop clusters
Process continual streams of data with spark streaming
Traverse and analyze graph structures using graphx
Analyze massive data set with machine learning on spark

Syllabus

Install a complete Spark / Scala development environment, and run a simple Scala program in Spark.

Udemy 101: Getting the Most From This Course

A brief introduction to the course, and then we'll get your development environment for Spark and Scala all set up on your desktop, using IntelliJ and SBT. A quick test application will confirm Spark is working on your system!

Let's review some of the high-level material you've learned about Apache Spark so far.

We'll go over the basic syntax and structure of Scala code with lots of examples. It's backwards from most other languages, but you quickly get used to it.

We'll go over the basic syntax and structure of Scala code with lots of examples. It's backwards from most other languages, but you quickly get used to it. Part 2 of 2, with some hands-on practice at the end.

Scala is a functional programming language, and so functions are central to the language. We'll go over the many ways functions can be declared and used in Scala, and practice what you've learned.

We'll cover the common data structures in Scala such as Map and List, and put them into practice.

The core object of Spark programming is the Resilient Distributed Dataset, or RDD. Once you know how to use RDD's, you know how to use Spark. We'll go over what they are, and what you can do with them.

Now that we understand Scala and have the theory of Spark behind us, let's start with a simple example of using RDD's to count up how many of each rating exists in the MovieLens data set.

How does Spark convert your script into a Directed Acyclic Graph and figure out how to distribute it on a cluster? Understanding how this process works under the hood can be important in writing optimal Spark driver scripts.

RDD's that contain a tuple of two values are key/value RDD's, and you can use them much like you might use a NoSQL data store. We'll use key/value RDD's to figure out the average number of friends by age in some fake social network data.

We'll run the average friends by age example on your desktop, and give you some ideas for further extending this script on your own.

We'll cover how to filter data out of an RDD efficiently, and illustrate this with a new example that finds the minimum temperature by location using real weather data.

We'll run our minimum temperature by location example, and modify it to find maximum temperatures as well. Plus, some ideas for extending this script on your own.

flatmap() on an RDD can return variable amounts of new entries into the resulting RDD. We'll use this as part of a hands-on example that finds how often each word is used inside a real book's text.

We extend the previous lecture's example by using regular expressions to better extract words from our book.

Finally, we sort the final results to see what the most common words in this book really are! And some ideas to extend this script on your own.

Your assignment: write a script that finds the total amount spent per customer using some fabricated e-commerce data, using what you've learned so far.

We'll review my solution to the previous lecture's assignment, and challenge you further to sort your results to find the biggest spenders.

Check your results for finding the biggest spenders in our e-commerce data against my own solution.

Understand SparkSQL and the DataFrame and DataSet API's used for querying structured data in an efficient, scalable manner.

We'll revisit our fabricated social network data, but load it into a DataFrame and analyze it with actual SQL queries!

We'll analyze our social network data another way - this time using SQL-like functions on a DataSet, instead of actual SQL query strings.

Earlier we broke down the average number of friends by age using RDD's - see if you can do it using DataSets instead!

We'll revisit our movie ratings data set, and start off with a simple example to find the most-rated movie.

Broadcast variables can be used to share small amounts of data to all of the machines on your cluster. We'll use them to share a lookup table of movie ID's to movie names, and use that to get movie names in our final results.

We introduce the Marvel superhero social network data set, and write a script to find the most-connected superhero in it. It's not who you might think!

As a more complex example, we'll apply a breadth-first-search (BFS) algorithm to the Marvel dataset to compute the degrees of separation between any two superheroes. In this lecture, we go over how BFS works.

We'll go over our strategy for implementing BFS within a Spark script that can be distributed, and introduce the use of Accumulators to maintain running totals that are synced across a cluster.

Finally, we'll review the code for finding the degrees of separation using breadth-first-search, run it, and see the results!

Back to our movie ratings data - we'll discover movies that are similar to each other just based on user ratings. We'll cover the algorithm, and how to implement it as a Spark script.

We'll run our movie similarties script and see the results.

Your challenge: make the movie similarity results even better! Here are some ideas for you to try out.

In a production environment, you'll use spark-submit to start your driver scripts from a command line, cron job, or the like. We'll cover the details on what you need to do differently in this case.

Spark / Scala scripts that have external dependencies can be bundled up into self-contained packages using the SBT tool. We'll use SBT to package up our movie similarities script as an exercise.

Amazon Web Services (AWS) offers the Elastic MapReduce service (EMR,) which gives us a way to rent time on a Hadoop cluster of our choosing - with Spark pre-installed on it. We'll use EMR to illustrate running a Spark script on a real cluster, so let's go over what EMR is and how it works first.

Let's compute movie similarities on a real cluster in the cloud, using one million user ratings!

Explicitly partitioning your Datasets and RDD's can be an important optimization; we'll go over when and how to do this.

Other tips and tricks for taking your script to a real cluster and getting it to run as you expect.

How to troubleshoot Spark jobs on a cluster using the Spark UI and logs, and more on managing dependencies of your script and data.

MLLib offers several distributed machine learning algorithms that you can run on a Spark cluster. We'll cover what MLLib can do and how it fits in.

We'll use MLLib's Alternating Least Squares recommender algorithm to produce movie recommendations using our MovieLens ratings data. The results are... unexpected!

A brief overview of what linear regression is and how it works, followed by a hands-on example of finding a regression and applying it to fabricated page speed vs. revenue data.

We'll run our Spark ML example of linear regression, using DataFrames.

Spark Streaming allows you create Spark driver scripts that run indefinitely, continually processing data as it streams in! We'll cover how it works and what it can do, using the original DStream micro-batch API.

Structured Streaming is a newer DataFrame-based API in Spark for writing continuous applications.

We cover Spark's GraphX library and how it works.

We'll revisit our "superhero degrees of separation" example, and see how its breadth-first-search algorithm could be implemented using Pregel and GraphX.

We'll use GraphX and Pregel to recreate our earlier results analyzing the superhero social network data - but with a lot less code!

You made it to the end! Here are some book recommendations if you want to learn more, as well as some career advice on landing a job in "big data".

Traffic lights

Read about what's good

what should give you pause

and possible dealbreakers

Provides a comprehensive study of Scala, a programming language used in data science and big data

Explores SparkSQL, DataSets, and DataFrames, which are essential for working with structured data in Big Data

Covers advanced Spark features like partitioning, caching, and tuning for optimal performance

Introduces Spark Streaming for processing data in real time

Teaches GraphX for analyzing and visualizing graph data

Emphasizes hands-on practice with over 20 real-world examples

Reviews summary

Hands-on spark & scala for big data

According to learners, this course provides a solid foundation and practical, hands-on experience with Apache Spark using Scala. Students particularly praise the instructor's ability to provide clear explanations and break down complex topics into manageable parts, making big data concepts accessible. The course covers a wide range of Spark features, including RDDs, DataFrames, Datasets, Spark SQL, Streaming, MLlib, and GraphX, and is updated for Spark 3. The inclusion of a Scala crash course is helpful for those new to the language. While some learners found the initial environment setup challenging, the real-world examples and emphasis on coding are seen as highly valuable for professional application. Overall, reviews indicate this course is an excellent starting point for anyone looking to work with Spark and Scala.

Basic Scala intro is included.

"The Scala crash course was sufficient to get started with Spark."

"As someone new to Scala, the intro helped, but I needed supplemental resources."

"The Scala basics were covered well enough for the course examples."

"It's good that they include Scala basics, as it's required for the course."

"The Scala section gives you just enough to follow the Spark examples."

"A quick Scala primer is included which is useful if you are not already familiar."

Content reflects newer Spark versions.

"Glad to see the course is updated for Spark 3."

"The updated content using Structured Streaming was very relevant."

"It's great that the course keeps up with recent versions of Spark."

"The focus on DataFrames and Datasets is current practice."

"Instructor clearly put effort into updating the material."

"The course feels current and relevant."

Covers various Spark modules.

"The course touches on RDDs, DataFrames, Spark SQL, Streaming, and MLlib."

"It covers core Spark concepts thoroughly before moving to other libraries."

"Good overview of different Spark APIs like DataFrames and Datasets."

"I liked that it introduced GraphX and Spark Streaming briefly."

"Comprehensive coverage of Spark's main components."

"Provides a good taste of Spark SQL, MLlib, etc."

Excellent intro for beginners.

"This was a great starting point for learning Spark with Scala."

"Highly recommend this course as an introduction to Spark."

"Perfect for someone with some programming background new to big data."

"It gives you all the fundamentals you need to get started."

"An excellent overview of Spark concepts and implementation."

"A very good beginner to intermediate level course on Spark."

Instructor simplifies complex topics.

"Frank Kane is an excellent instructor, simplifying many complex Big Data topics."

"Frank explains things very well and clearly."

"The explanations are very clear and easy to follow."

"The instructor has a very clear teaching style."

"I really appreciate how the instructor breaks down the material."

"Great instructor, makes complex topics easy to grasp."

Practical coding activities are key.

"This course is very hands on which is great for learning."

"I liked all the hands-on parts of the course very much. It helped to understand Spark and Scala better."

"The coding examples were well demonstrated and easy to follow along."

"Doing the exercises helped cement the concepts presented in the lectures."

"The hands-on approach with real data sets makes learning engaging and effective."

"Loved the practicality of the course and the hands on exercises."

Setting up environment can be tricky.

"Getting the environment set up correctly was a bit tricky initially."

"The setup process required careful attention to versions (Java, Scala, Spark)."

"Debugging the IntelliJ and SBT setup took some time for me."

"Sometimes wrestling with the setup was more challenging than the Spark concepts."

"Had some issues with version compatibility during installation."

"The environment setup section could be smoother for absolute beginners."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Apache Spark with Scala - Hands On with Big Data! with these activities:

Find a mentor who can help you with Spark

Show steps

Finding a mentor who can help you with Spark will provide you with guidance and support as you learn.

Show steps

Identify someone in your network who has experience with Spark.
Ask them if they would be willing to mentor you.

Review Scala basics

Show steps

Refresh your understanding of Scala's syntax and constructs to prepare for the course's emphasis on Scala for Spark development.

Browse courses on Scala

Show steps

Review the basic syntax of Scala, including data types, variables, and expressions.
Practice writing simple Scala functions and methods.
Familiarize yourself with Scala's object-oriented features, such as classes and inheritance.

Review the syntax for Scala

Show steps

Reviewing the syntax for Scala will refresh your memory and ensure a smoother learning experience.

Browse courses on Scala

Show steps

Go over the basic syntax of Scala.
Practice writing simple Scala programs.

11 other activities

Expand to see all activities and additional details

Show all 14 activities

Practice RDD operations

Show steps

Practice the basic RDD operations for filtering, mapping, and reducing data to gain proficiency in working with Spark's core data structure.

Show steps

Create an RDD from a list of numbers and filter out even numbers.
Map the remaining numbers to their squares.
Reduce the squared numbers to find their sum.

Attend a Spark meetup or conference

Show steps

Attending a Spark meetup or conference will allow you to connect with other Spark users and learn about the latest developments in Spark.

Show steps

Find a Spark meetup or conference in your area.
Register for the event.
Attend the event and participate in the discussions.

Create a glossary of Spark terms

Show steps

Creating a glossary of Spark terms will help you understand and remember the key concepts of Spark.

Show steps

Identify the key terms in Spark.
Write a definition for each term.
Organize the terms into a glossary.

Build a DataFrame from scratch

Show steps

Gain hands-on experience in constructing a DataFrame from raw data, allowing you to manipulate and analyze structured data efficiently.

Browse courses on DataFrames

Show steps

Create a DataFrame from a CSV file.
Filter the DataFrame based on specific criteria.
Group the DataFrame by a column and calculate aggregate values.

Participate in a Spark coding challenge

Show steps

Challenge yourself and test your Spark skills by participating in coding competitions, pushing your limits and gaining valuable experience.

Browse courses on Challenges

Show steps

Find a suitable Spark coding challenge.
Analyze the problem statement and design an efficient solution.
Implement your solution and submit it for evaluation.
Review the results and learn from your experience.

Solve Spark coding challenges

Show steps

Solving Spark coding challenges will help you improve your skills in applying Spark to real-world problems.

Show steps

Find a website or platform that offers Spark coding challenges.
Choose a challenge and attempt to solve it.
Review your solution and identify areas for improvement.

Write a blog post on Spark optimization techniques

Show steps

Deepen your understanding of Spark's performance by researching and writing about optimization techniques, reinforcing your knowledge and helping others learn.

Show steps

Research different Spark optimization techniques.
Choose a specific technique and write a detailed blog post explaining its benefits and implementation.
Share your blog post with the community and gather feedback.

Mentor a junior developer on Spark

Show steps

Deepen your understanding of Spark by mentoring a junior developer, reinforcing your knowledge while helping others grow their skills.

Show steps

Find a junior developer who is interested in learning Spark.
Establish regular mentoring sessions.
Share your knowledge of Spark concepts and best practices.
Provide guidance on projects and assignments.

Participate in a Spark competition or hackathon

Show steps

Participating in a Spark competition or hackathon will challenge you to apply your skills to solve real-world problems and learn from others.

Show steps

Find a Spark competition or hackathon that interests you.
Form a team or work individually.
Develop a solution to the problem.
Submit your solution and present it to the judges.

Develop a Spark application for real-time data analysis

Show steps

Apply your Spark skills to a practical project, building a real-time data analysis application that demonstrates your proficiency in handling streaming data.

Browse courses on Spark Streaming

Show steps

Design the architecture of your application.
Implement data ingestion and processing pipelines.
Visualize and analyze the results in real-time.
Deploy and monitor your application.

Attend a Spark workshop on advanced topics

Show steps

Expand your knowledge of Spark by attending workshops focused on advanced topics, such as machine learning or graph processing.

Show steps

Identify a workshop that aligns with your interests.
Register and attend the workshop.
Actively participate in discussions and hands-on exercises.
Network with other attendees and experts.

Career center

Learners who complete Apache Spark with Scala - Hands On with Big Data! will develop knowledge and skills that may be useful to these careers:

Data Analyst

A Data Analyst uses data to create reports, solve problems, and improve decision-making. This course can help build a foundation in Spark, a big data platform widely used by Data Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Data Analyst

Data Engineer

A Data Engineer builds and maintains big data infrastructure. This course covers how to build and deploy Spark scripts on Hadoop clusters, a common platform used by Data Engineers. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Data Engineer

Data Scientist

A Data Scientist uses data to build predictive models and solve business problems. This course can help build a foundation in Spark, a popular platform for Data Scientists. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to use Spark ML to perform machine learning algorithms.

See salaries and explore the career path for Data Scientist

Software Developer

A Software Developer builds and maintains software applications. This course covers how to use Scala, a popular programming language for big data, to develop Spark applications. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Software Developer

Data Architect

A Data Architect designs and manages data systems. This course can help build a foundation in Spark, a big data platform commonly used by Data Architects. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.

See salaries and explore the career path for Data Architect

Big Data Engineer

A Big Data Engineer designs and implements big data solutions. This course covers how to use Scala, a popular programming language for big data, to develop Spark applications. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.

See salaries and explore the career path for Big Data Engineer

Machine Learning Engineer

A Machine Learning Engineer builds and deploys machine learning models. This course covers how to use Spark ML, a machine learning library for Spark, to build and deploy machine learning models. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Machine Learning Engineer

Cloud Architect

A Cloud Architect designs and manages cloud computing systems. This course covers how to deploy Spark scripts on Hadoop clusters, a common platform used by Cloud Architects. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Cloud Architect

Data Warehouse Architect

A Data Warehouse Architect designs and manages data warehouses. This course can help build a foundation in Spark, a big data platform commonly used by Data Warehouse Architects. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.

See salaries and explore the career path for Data Warehouse Architect

Business Intelligence Analyst

A Business Intelligence Analyst uses data to improve business decision-making. This course can help build a foundation in Spark, a big data platform widely used by Business Intelligence Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Business Intelligence Analyst

Database Administrator

A Database Administrator manages and maintains databases. This course covers how to use Spark to transform and analyze data in databases, a common task for Database Administrators. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to work with data in databases.

See salaries and explore the career path for Database Administrator

Quantitative Analyst

A Quantitative Analyst uses data to make investment decisions. This course can help build a foundation in Spark, a big data platform commonly used by Quantitative Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Quantitative Analyst

Statistician

A Statistician uses data to solve problems and make decisions. This course can help build a foundation in Spark, a big data platform commonly used by Statisticians. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Statistician

Market Researcher

A Market Researcher uses data to understand consumer behavior. This course can help build a foundation in Spark, a big data platform commonly used by Market Researchers. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Market Researcher

Actuary

An Actuary uses data to assess risk and uncertainty. This course can help build a foundation in Spark, a big data platform commonly used by Actuaries. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

See salaries and explore the career path for Actuary