Sundog Education by Frank Kane, Frank Kane, and Sundog Education Team

New. Completely updated and re-recorded for Spark 3, IntelliJ, Structured Streaming, and a stronger focus on the DataSet API.

"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash course in Scala to get you up to speed quickly. For those more familiar with Python, however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

In this course, learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services.

  • Learn the concepts of Spark's Resilient Distributed Datasets, DataFrames, and Datasets.

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala, IntelliJ, and SBT

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, Machine Learning, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes' worth of information – in the cloud – in a matter of minutes.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move on to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you like in the process. We'll analyze a social graph of superheroes and learn who the most "popular" superhero is – and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to Spider-Man? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system and in the cloud using Amazon's Elastic MapReduce service. Over 8 hours of video content is included, with over 20 real examples of increasing complexity you can build, run, and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course.

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data. ". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with.   " - Joey Faherty

What's inside

Learning objectives

  • Develop distributed code using the Scala programming language
  • Transform structured data using Spark SQL, Datasets, and DataFrames
  • Frame big data analysis problems as Apache Spark scripts
  • Optimize Spark jobs through partitioning, caching, and other techniques
  • Build, deploy, and run Spark scripts on Hadoop clusters
  • Process continual streams of data with Spark Streaming
  • Traverse and analyze graph structures using GraphX
  • Analyze massive data sets with machine learning on Spark

Syllabus

Install a complete Spark / Scala development environment, and run a simple Scala program in Spark.
Udemy 101: Getting the Most From This Course
Alternate download link for the ml-100k dataset
WARNING: DO NOT INSTALL JAVA 21+ IN THE NEXT LECTURE

A brief introduction to the course, and then we'll get your development environment for Spark and Scala all set up on your desktop, using IntelliJ and SBT. A quick test application will confirm Spark is working on your system!

Introduction to Apache Spark

Let's review some of the high-level material you've learned about Apache Spark so far.

Important note
Understand the basics of Scala and code simple Scala programs.

We'll go over the basic syntax and structure of Scala code with lots of examples. It's backwards from most other languages, but you quickly get used to it. Part 1 of 2.

We'll go over the basic syntax and structure of Scala code with lots of examples. It's backwards from most other languages, but you quickly get used to it. Part 2 of 2, with some hands-on practice at the end.

Scala is a functional programming language, and so functions are central to the language. We'll go over the many ways functions can be declared and used in Scala, and practice what you've learned.

We'll cover the common data structures in Scala such as Map and List, and put them into practice.
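
To make the flavor concrete, here is a small, self-contained sketch (not the course's code – names and data are made up) touching the constructs these lectures cover: function declarations, Map, and List.

```scala
object ScalaPractice {
  // A function with an explicit parameter and return type.
  def squareIt(x: Int): Int = x * x

  def main(args: Array[String]): Unit = {
    val captains = Map("Enterprise" -> "Kirk", "Voyager" -> "Janeway") // immutable Map
    val numbers  = List(1, 2, 3, 4, 5)                                 // immutable List
    val squares  = numbers.map(squareIt)       // pass the named function
    val evens    = numbers.filter(_ % 2 == 0)  // or use an anonymous function
    println(s"Squares: $squares, evens: $evens")
    println(s"Who commands Voyager? ${captains.getOrElse("Voyager", "Unknown")}")
  }
}
```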

Understand the Resilient Distributed Dataset (RDD) and how to use it.

The core object of Spark programming is the Resilient Distributed Dataset, or RDD. Once you know how to use RDDs, you know how to use Spark. We'll go over what they are, and what you can do with them.
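
As a minimal, shell-style illustration of the transformation/action distinction (a sketch, not course code; the local master setting is just for desktop experimentation):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "RDDIntro") // run locally on all cores
val numbers = sc.parallelize(1 to 100)            // distribute a local collection
val doubled = numbers.map(_ * 2)                  // transformation: lazy, just builds a plan
println(doubled.count())                          // action: actually runs the job
```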

Now that we understand Scala and have the theory of Spark behind us, let's start with a simple example of using RDDs to count up how many of each rating exists in the MovieLens data set.
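
A hedged sketch of that idea, assuming the SparkContext sc from the earlier snippet; the path and the tab-separated layout of u.data (rating in the third field) follow the ml-100k convention but are assumptions here:

```scala
val lines   = sc.textFile("data/ml-100k/u.data")      // hypothetical local path
val ratings = lines.map(line => line.split("\t")(2))  // pull out the rating field
val counts  = ratings.countByValue()                  // Map[String, Long] on the driver
counts.toSeq.sorted.foreach(println)
```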

How does Spark convert your script into a Directed Acyclic Graph and figure out how to distribute it on a cluster? Understanding how this process works under the hood can be important in writing optimal Spark driver scripts.

RDDs that contain a tuple of two values are key/value RDDs, and you can use them much like you might use a NoSQL data store. We'll use key/value RDDs to figure out the average number of friends by age in some fake social network data.
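
A sketch of the (sum, count) averaging pattern on a key/value RDD; the CSV layout (age in field 2, friend count in field 3) and the file name are assumptions about the fake social data:

```scala
val pairs = sc.textFile("data/fakefriends.csv").map { line =>
  val f = line.split(",")
  (f(2).toInt, f(3).toInt)                           // (age, numFriends)
}
val averagesByAge = pairs
  .mapValues(n => (n, 1))                            // (age, (friends, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // sum both fields per age
  .mapValues(t => t._1 / t._2.toDouble)              // sum / count
averagesByAge.collect().sorted.foreach(println)
```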

We'll run the average friends by age example on your desktop, and give you some ideas for further extending this script on your own.

We'll cover how to filter data out of an RDD efficiently, and illustrate this with a new example that finds the minimum temperature by location using real weather data.
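
A sketch of filtering and then reducing per key; the field positions are assumptions about the weather CSV:

```scala
val readings = sc.textFile("data/1800.csv").map { line =>
  val f = line.split(",")
  (f(0), f(2), f(3).toFloat)               // (stationID, entryType, temperature)
}
val minTemps = readings
  .filter(_._2 == "TMIN")                  // keep only minimum-temperature rows
  .map(r => (r._1, r._3))
  .reduceByKey((a, b) => math.min(a, b))   // coldest reading per station
minTemps.collect().foreach(println)
```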

We'll run our minimum temperature by location example, and modify it to find maximum temperatures as well. Plus, some ideas for extending this script on your own.

flatMap() on an RDD can return a variable number of new entries in the resulting RDD. We'll use this as part of a hands-on example that finds how often each word is used inside a real book's text.

We extend the previous lecture's example by using regular expressions to better extract words from our book.

Finally, we sort the final results to see what the most common words in this book really are! And some ideas to extend this script on your own.
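
A compact sketch combining these three lectures – flatMap, a regex split, and a final sort (the book path is hypothetical):

```scala
val wordCounts = sc.textFile("data/book.txt")
  .flatMap(_.split("\\W+"))                // regex: break on non-word characters
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
  .map((_, 1))
  .reduceByKey(_ + _)                      // count occurrences of each word
val sorted = wordCounts
  .map { case (word, n) => (n, word) }     // flip so we can sort by count
  .sortByKey(ascending = false)
sorted.take(20).foreach { case (n, word) => println(s"$word: $n") }
```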

Your assignment: write a script that finds the total amount spent per customer using some fabricated e-commerce data, using what you've learned so far.

We'll review my solution to the previous lecture's assignment, and challenge you further to sort your results to find the biggest spenders.

Check your results for finding the biggest spenders in our e-commerce data against my own solution.

Quiz: RDDs
Use higher-level APIs in Spark to execute queries on massive, structured data.

Understand SparkSQL and the DataFrame and DataSet APIs used for querying structured data in an efficient, scalable manner.

We'll revisit our fabricated social network data, but load it into a DataFrame and analyze it with actual SQL queries!
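
A minimal sketch of the pattern – load a CSV into a DataFrame, register it as a view, and query it with SQL; the file name and header/inferSchema options are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("SparkSQLSketch")
  .master("local[*]")
  .getOrCreate()

val people = spark.read
  .option("header", "true")        // first line holds column names
  .option("inferSchema", "true")   // guess column types from the data
  .csv("data/fakefriends-header.csv")

people.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) FROM people GROUP BY age ORDER BY age").show()
```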

We'll analyze our social network data another way - this time using SQL-like functions on a DataSet, instead of actual SQL query strings.
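
The same shape of query expressed with built-in DataFrame/Dataset functions instead of a SQL string, continuing the people DataFrame from the previous sketch (the friends column name is an assumption):

```scala
import org.apache.spark.sql.functions._

people.groupBy("age")
  .agg(round(avg("friends"), 2).alias("avg_friends")) // aggregate without SQL strings
  .sort("age")
  .show()
```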

Earlier we broke down the average number of friends by age using RDDs - see if you can do it using DataSets instead!

Exercise Solution: Friends by Age, with Datasets.
[Activity] Word Count example, using Datasets
[Activity] Revisiting the Minimum Temperature example, with Datasets
[Exercise] Implement the "Total Spent by Customer" problem with Datasets
Exercise Solution: Total Spent by Customer with Datasets
Quiz: SparkSQL
Practice framing complex problems as Spark problems, and use advanced features of Spark.

We'll revisit our movie ratings data set, and start off with a simple example to find the most-rated movie.

Broadcast variables can be used to share small amounts of data with all of the machines on your cluster. We'll use them to share a lookup table of movie IDs to movie names, and use that to get movie names in our final results.
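
A sketch of the broadcast pattern with placeholder data: ship a small lookup table to every executor once, then read it inside a transformation (the movie IDs and counts here are made up):

```scala
val movieNames = Map(50 -> "Star Wars (1977)", 181 -> "Return of the Jedi (1983)")
val nameDict   = sc.broadcast(movieNames)             // sent to each executor once

val ratingCounts = sc.parallelize(Seq((50, 583), (181, 507))) // (movieID, count)
val named = ratingCounts.map { case (id, n) =>
  (nameDict.value.getOrElse(id, "Unknown"), n)        // .value reads the broadcast copy
}
named.collect().foreach(println)
```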

We introduce the Marvel superhero social network data set, and write a script to find the most-connected superhero in it. It's not who you might think!

[Exercise] Find the Most Obscure Superheroes
Exercise Solution: Find the Most Obscure Superheroes

As a more complex example, we'll apply a breadth-first-search (BFS) algorithm to the Marvel dataset to compute the degrees of separation between any two superheroes. In this lecture, we go over how BFS works.

We'll go over our strategy for implementing BFS within a Spark script that can be distributed, and introduce the use of Accumulators to maintain running totals that are synced across a cluster.
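
A sketch of an accumulator as a cluster-wide running counter – the same mechanism the BFS script uses to signal that the target hero was reached (IDs here are illustrative):

```scala
val hitCounter = sc.longAccumulator("Hit counter")
val heroIDs = sc.parallelize(Seq(1, 859, 5306, 14))
heroIDs.foreach { id =>
  if (id == 14) hitCounter.add(1)  // executors add; only the driver reads .value
}
println(s"Target hero found ${hitCounter.value} time(s)")
```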

Finally, we'll review the code for finding the degrees of separation using breadth-first-search, run it, and see the results!

Back to our movie ratings data - we'll discover movies that are similar to each other just based on user ratings. We'll cover the algorithm, and how to implement it as a Spark script.

We'll run our movie similarities script and see the results.

Your challenge: make the movie similarity results even better! Here are some ideas for you to try out.

Learn how to run Spark on a cluster, using Amazon's Elastic MapReduce and Hadoop YARN

In a production environment, you'll use spark-submit to start your driver scripts from a command line, cron job, or the like. We'll cover the details on what you need to do differently in this case.

Spark / Scala scripts that have external dependencies can be bundled up into self-contained packages using the SBT tool. We'll use SBT to package up our movie similarities script as an exercise.
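
A hedged build.sbt sketch of the packaging idea; the project name and version numbers are illustrative, not the course's exact ones. Marking Spark "provided" keeps it out of the packaged jar, since the cluster supplies Spark at runtime:

```scala
name := "MovieSimilarities"
version := "1.0"
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided"
)
```

You would then run `sbt package` (or `sbt assembly`, with the sbt-assembly plugin, for fat jars with bundled dependencies) and hand the resulting jar to spark-submit.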

[Exercise] Package a Script with SBT and Run it Locally with spark-submit
Exercise solution: Using SBT and spark-submit

Amazon Web Services (AWS) offers the Elastic MapReduce service (EMR), which gives us a way to rent time on a Hadoop cluster of our choosing - with Spark pre-installed on it. We'll use EMR to illustrate running a Spark script on a real cluster, so let's go over what EMR is and how it works first.

Let's compute movie similarities on a real cluster in the cloud, using one million user ratings!

Explicitly partitioning your Datasets and RDDs can be an important optimization; we'll go over when and how to do this.
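
A sketch of explicit partitioning before a heavy keyed workload; 100 partitions is an illustrative number, not a universal recommendation:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 1000000).map(n => (n % 1000, n))
val partitioned = pairs
  .partitionBy(new HashPartitioner(100)) // co-locate identical keys
  .cache()                               // reuse the partitioned layout across jobs
println(partitioned.getNumPartitions)
```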

Other tips and tricks for taking your script to a real cluster and getting it to run as you expect.

How to troubleshoot Spark jobs on a cluster using the Spark UI and logs, and more on managing dependencies of your script and data.

Quiz: Spark on a Cluster
Use Spark's MLlib library to perform machine learning algorithms across a cluster!

MLlib offers several distributed machine learning algorithms that you can run on a Spark cluster. We'll cover what MLlib can do and how it fits in.

We'll use MLlib's Alternating Least Squares recommender algorithm to produce movie recommendations using our MovieLens ratings data. The results are... unexpected!
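
A hedged sketch of spark.ml's ALS recommender on a toy DataFrame; the column names and tiny in-memory data are illustrative only, and the spark session is assumed from the earlier sketches:

```scala
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._

val ratings = Seq((0, 50, 5.0f), (0, 181, 4.0f), (1, 50, 3.0f), (1, 313, 5.0f))
  .toDF("userID", "movieID", "rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userID")
  .setItemCol("movieID")
  .setRatingCol("rating")

val model = als.fit(ratings)
model.recommendForAllUsers(10).show(truncate = false) // top-10 movies per user
```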

A brief overview of what linear regression is and how it works, followed by a hands-on example of finding a regression and applying it to fabricated page speed vs. revenue data.
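
A minimal spark.ml linear regression sketch on made-up page-speed vs. revenue pairs; all values are illustrative, and the spark session/implicits are assumed from earlier:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

val data = Seq((1.0, 9.5), (2.0, 7.2), (3.0, 5.1), (4.0, 3.4))
  .toDF("pageSpeed", "revenue")

val features = new VectorAssembler()
  .setInputCols(Array("pageSpeed"))      // pack the predictor into a vector column
  .setOutputCol("features")
  .transform(data)

val model = new LinearRegression()
  .setLabelCol("revenue")
  .fit(features)
println(s"slope=${model.coefficients} intercept=${model.intercept}")
```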

We'll run our Spark ML example of linear regression, using DataFrames.

[Exercise] Predict Real Estate Values with Decision Trees in Spark
Exercise Solution: Predicting Real Estate with Decision Trees in Spark
Quiz: Spark ML
Use Spark Streaming to develop continuous applications that process ongoing streams of data in real time!

Spark Streaming allows you to create Spark driver scripts that run indefinitely, continually processing data as it streams in! We'll cover how it works and what it can do, using the original DStream micro-batch API.
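
A hedged DStream sketch: one-second micro-batches from a socket source. The host and port are placeholders (for testing, `nc -lk 9999` can feed it):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))  // one-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)  // word counts within each micro-batch
  .print()
ssc.start()
ssc.awaitTermination()
```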

[Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter

Structured Streaming is a newer DataFrame-based API in Spark for writing continuous applications.
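
A hedged Structured Streaming sketch: treat files arriving in a directory as an unbounded table and keep a running count per HTTP status code. The path, regex, and console sink are assumptions:

```scala
import org.apache.spark.sql.functions._

val logLines = spark.readStream.text("logs/")   // each new file extends the "table"
val statusCounts = logLines
  .select(regexp_extract(col("value"), "\\s(\\d{3})\\s", 1).alias("status"))
  .groupBy("status").count()

val query = statusCounts.writeStream
  .outputMode("complete")   // re-emit the full aggregate table each trigger
  .format("console")
  .start()
query.awaitTermination()
```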

[Activity] Using Structured Streaming for real-time log analysis
[Exercise] Windowed Operations with Structured Streaming
Exercise Solution: Top URLs in a 30-second Window
Quiz: Spark Streaming
Analyze and traverse graphs with Spark and GraphX

We cover Spark's GraphX library and how it works.

We'll revisit our "superhero degrees of separation" example, and see how its breadth-first-search algorithm could be implemented using Pregel and GraphX.

We'll use GraphX and Pregel to recreate our earlier results analyzing the superhero social network data - but with a lot less code!
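
A tiny hedged GraphX sketch – build a graph and rank heroes by degree (connection count). The vertices and edges here are made up, not the Marvel data:

```scala
import org.apache.spark.graphx._

val heroes = sc.parallelize(Seq((1L, "Spider-Man"), (2L, "Hulk"), (3L, "Thor")))
val links  = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))
val graph  = Graph(heroes, links)

graph.degrees.join(heroes)  // (vertexID, (degree, name))
  .sortBy(-_._2._1)         // most-connected first
  .collect()
  .foreach { case (_, (deg, name)) => println(s"$name: $deg connections") }
```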

Continue learning through books, websites, and additional courses.

You made it to the end! Here are some book recommendations if you want to learn more, as well as some career advice on landing a job in "big data".

Bonus Lecture: More courses to explore!

Good to know

Know what's good, what to watch for, and possible dealbreakers:
  • Provides a comprehensive study of Scala, a programming language used in data science and big data
  • Explores Spark SQL, Datasets, and DataFrames, which are essential for working with structured data
  • Covers advanced Spark features like partitioning, caching, and tuning for optimal performance
  • Introduces Spark Streaming for processing data in real time
  • Teaches GraphX for analyzing and traversing graph data
  • Emphasizes hands-on practice with over 20 real-world examples

Reviews summary

Positive experience with Spark and Scala

According to learners, this five-star rated course offers a great introduction to programming in Spark and Scala through hands-on activities. The instructor is described as knowledgeable and engaging.
The course offers a solid grounding in Apache Spark.
"you can get a nice initial contact to Apache Spark"
Exercises help to reinforce learning.
"There are several exercices to be done, so you can learn by doing."
The instructor is highly praised.
"Frank Kane is such great instructor!!"

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Apache Spark with Scala - Hands On with Big Data! with these activities:
Find a mentor who can help you with Spark
Finding a mentor who can help you with Spark will provide you with guidance and support as you learn.
Show steps
  • Identify someone in your network who has experience with Spark.
  • Ask them if they would be willing to mentor you.
Review Scala basics
Refresh your understanding of Scala's syntax and constructs to prepare for the course's emphasis on Scala for Spark development.
Browse courses on Scala
Show steps
  • Review the basic syntax of Scala, including data types, variables, and expressions.
  • Practice writing simple Scala functions and methods.
  • Familiarize yourself with Scala's object-oriented features, such as classes and inheritance.
Review the syntax for Scala
Reviewing the syntax for Scala will refresh your memory and ensure a smoother learning experience.
Browse courses on Scala
Show steps
  • Go over the basic syntax of Scala.
  • Practice writing simple Scala programs.
Practice RDD operations
Practice the basic RDD operations for filtering, mapping, and reducing data to gain proficiency in working with Spark's core data structure.
Show steps
  • Create an RDD from a list of numbers and filter out even numbers.
  • Map the remaining numbers to their squares.
  • Reduce the squared numbers to find their sum.
Attend a Spark meetup or conference
Attending a Spark meetup or conference will allow you to connect with other Spark users and learn about the latest developments in Spark.
Show steps
  • Find a Spark meetup or conference in your area.
  • Register for the event.
  • Attend the event and participate in the discussions.
Create a glossary of Spark terms
Creating a glossary of Spark terms will help you understand and remember the key concepts of Spark.
Show steps
  • Identify the key terms in Spark.
  • Write a definition for each term.
  • Organize the terms into a glossary.
Build a DataFrame from scratch
Gain hands-on experience in constructing a DataFrame from raw data, allowing you to manipulate and analyze structured data efficiently.
Browse courses on DataFrames
Show steps
  • Create a DataFrame from a CSV file.
  • Filter the DataFrame based on specific criteria.
  • Group the DataFrame by a column and calculate aggregate values.
Participate in a Spark coding challenge
Challenge yourself and test your Spark skills by participating in coding competitions, pushing your limits and gaining valuable experience.
Browse courses on Challenges
Show steps
  • Find a suitable Spark coding challenge.
  • Analyze the problem statement and design an efficient solution.
  • Implement your solution and submit it for evaluation.
  • Review the results and learn from your experience.
Solve Spark coding challenges
Solving Spark coding challenges will help you improve your skills in applying Spark to real-world problems.
Show steps
  • Find a website or platform that offers Spark coding challenges.
  • Choose a challenge and attempt to solve it.
  • Review your solution and identify areas for improvement.
Write a blog post on Spark optimization techniques
Deepen your understanding of Spark's performance by researching and writing about optimization techniques, reinforcing your knowledge and helping others learn.
Show steps
  • Research different Spark optimization techniques.
  • Choose a specific technique and write a detailed blog post explaining its benefits and implementation.
  • Share your blog post with the community and gather feedback.
Mentor a junior developer on Spark
Deepen your understanding of Spark by mentoring a junior developer, reinforcing your knowledge while helping others grow their skills.
Show steps
  • Find a junior developer who is interested in learning Spark.
  • Establish regular mentoring sessions.
  • Share your knowledge of Spark concepts and best practices.
  • Provide guidance on projects and assignments.
Participate in a Spark competition or hackathon
Participating in a Spark competition or hackathon will challenge you to apply your skills to solve real-world problems and learn from others.
Show steps
  • Find a Spark competition or hackathon that interests you.
  • Form a team or work individually.
  • Develop a solution to the problem.
  • Submit your solution and present it to the judges.
Develop a Spark application for real-time data analysis
Apply your Spark skills to a practical project, building a real-time data analysis application that demonstrates your proficiency in handling streaming data.
Browse courses on Spark Streaming
Show steps
  • Design the architecture of your application.
  • Implement data ingestion and processing pipelines.
  • Visualize and analyze the results in real-time.
  • Deploy and monitor your application.
Attend a Spark workshop on advanced topics
Expand your knowledge of Spark by attending workshops focused on advanced topics, such as machine learning or graph processing.
Show steps
  • Identify a workshop that aligns with your interests.
  • Register and attend the workshop.
  • Actively participate in discussions and hands-on exercises.
  • Network with other attendees and experts.

Career center

Learners who complete Apache Spark with Scala - Hands On with Big Data! will develop knowledge and skills that may be useful to these careers:
Data Scientist
A Data Scientist uses data to build predictive models and solve business problems. This course can help build a foundation in Spark, a popular platform for Data Scientists. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to use Spark ML to perform machine learning algorithms.
Data Analyst
A Data Analyst uses data to create reports, solve problems, and improve decision-making. This course can help build a foundation in Spark, a big data platform widely used by Data Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Data Engineer
A Data Engineer builds and maintains big data infrastructure. This course covers how to build and deploy Spark scripts on Hadoop clusters, a common platform used by Data Engineers. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Big Data Engineer
A Big Data Engineer designs and implements big data solutions. This course covers how to use Scala, a popular programming language for big data, to develop Spark applications. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.
Data Architect
A Data Architect designs and manages data systems. This course can help build a foundation in Spark, a big data platform commonly used by Data Architects. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.
Software Developer
A Software Developer builds and maintains software applications. This course covers how to use Scala, a popular programming language for big data, to develop Spark applications. You will also learn how to use Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Machine Learning Engineer
A Machine Learning Engineer builds and deploys machine learning models. This course covers how to use Spark ML, a machine learning library for Spark, to build and deploy machine learning models. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Business Intelligence Analyst
A Business Intelligence Analyst uses data to improve business decision-making. This course can help build a foundation in Spark, a big data platform widely used by Business Intelligence Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Cloud Architect
A Cloud Architect designs and manages cloud computing systems. This course covers how to deploy Spark scripts on Hadoop clusters, a common platform used by Cloud Architects. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Data Warehouse Architect
A Data Warehouse Architect designs and manages data warehouses. This course can help build a foundation in Spark, a big data platform commonly used by Data Warehouse Architects. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets, as well as how to deploy Spark scripts on Hadoop clusters.
Database Administrator
A Database Administrator manages and maintains databases. This course covers how to use Spark to transform and analyze data in databases, a common task for Database Administrators. You will also learn how to use Scala, Spark SQL, DataFrames, and Datasets to work with data in databases.
Quantitative Analyst
A Quantitative Analyst uses data to make investment decisions. This course can help build a foundation in Spark, a big data platform commonly used by Quantitative Analysts. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Market Researcher
A Market Researcher uses data to understand consumer behavior. This course can help build a foundation in Spark, a big data platform commonly used by Market Researchers. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Statistician
A Statistician uses data to solve problems and make decisions. This course can help build a foundation in Spark, a big data platform commonly used by Statisticians. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.
Actuary
An Actuary uses data to assess risk and uncertainty. This course can help build a foundation in Spark, a big data platform commonly used by Actuaries. You will learn how to use Scala, Spark SQL, DataFrames, and Datasets to transform and analyze large datasets.

Reading list

We've selected 12 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Apache Spark with Scala - Hands On with Big Data!.
Provides a thorough introduction to Apache Spark, covering its core concepts, APIs, and use cases. It is an excellent resource for both beginners and experienced developers who want to learn more about Spark.
Comprehensive guide to Apache Spark, written by the creators of the framework. It covers everything from the basics of Spark to advanced topics such as machine learning and graph processing.
Provides a deep dive into the internals of Apache Spark, with a focus on performance optimization. It is a valuable resource for developers who want to get the most out of Spark.
Provides a comprehensive introduction to machine learning with Apache Spark. It covers a wide range of topics, from the basics of machine learning to advanced topics such as deep learning.
Provides a comprehensive introduction to Apache GraphX, a graph processing framework built on top of Apache Spark. It covers a wide range of topics, from the basics of graph processing to advanced topics such as machine learning on graphs.
Provides a comprehensive introduction to structured streaming with Apache Spark. It covers a wide range of topics, from the basics of structured streaming to advanced topics such as fault tolerance and performance optimization.
Provides a comprehensive introduction to advanced analytics with Apache Spark. It covers a wide range of topics, from the basics of advanced analytics to advanced topics such as machine learning and graph processing.
Provides a comprehensive introduction to Apache Spark, with a focus on practical use cases. It covers a wide range of topics, from the basics of Spark to advanced topics such as machine learning and graph processing.
Provides a comprehensive introduction to real-time data analytics with Apache Spark and Apache Kafka. It covers a wide range of topics, from the basics of real-time data analytics to advanced topics such as stream processing and machine learning.
Provides a comprehensive introduction to the Scala programming language, with a focus on using Scala for big data applications. It covers a wide range of topics, from the basics of Scala to advanced topics such as functional programming and concurrency.
Provides a comprehensive introduction to the Scala programming language. It covers a wide range of topics, from the basics of Scala to advanced topics such as functional programming and concurrency.
Provides a collection of recipes for solving common problems in Scala. It covers a wide range of topics, from basic data structures to advanced topics such as functional programming and concurrency.

Similar courses

Here are nine courses similar to Apache Spark with Scala - Hands On with Big Data!.
  • Getting Started with Stream Processing with Spark...
  • Developing Spark Applications Using Scala & Cloudera
  • Structured Streaming in Apache Spark 2
  • Scala and Spark for Big Data and Machine Learning
  • Big Data Analysis with Scala and Spark (Scala 2 version)
  • Handling Fast Data with Apache Spark SQL and Streaming
  • Processing Streaming Data Using Apache Spark Structured...
  • Big Data Analysis with Scala and Spark
  • Spark 3.0 & Big Data Essentials with Scala | Rock the JVM
Our mission

OpenCourser helps millions of learners each year. People visit us to learn workplace skills, ace their exams, and nurture their curiosity.

Our extensive catalog contains over 50,000 courses and twice as many books. Browse by search, by topic, or even by career interests. We'll match you to the right resources quickly.

Find this site helpful? Tell a friend about us.

Affiliate disclosure

We're supported by our community of learners. When you purchase or subscribe to courses and programs or purchase books, we may earn a commission from our partners.

Your purchases help us maintain our catalog and keep our servers humming without ads.

Thank you for supporting OpenCourser.

© 2016 - 2024 OpenCourser