We may earn an affiliate commission when you visit our partners.
Prof. Heather Miller

Manipulating big data distributed over a cluster using functional concepts is rampant in industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and most recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we'll see how the data parallel paradigm can be extended to the distributed case, using Spark throughout. We'll cover Spark's programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when important issues related to distribution like latency and network communication should be considered and how they can be addressed effectively for improved performance.

Learning Outcomes. By the end of this course you will be able to:

- read data from persistent storage and load it into Apache Spark,

- manipulate data with Spark and Scala,

- express algorithms for data analysis in a functional style,

- recognize how to avoid shuffles and recomputation in Spark.

Recommended background: You should have at least one year of programming experience. Proficiency with Java or C# is ideal, but experience with other languages such as C/C++, Python, JavaScript, or Ruby is also sufficient. You should have some familiarity with the command line. This course is intended to be taken after Parallel Programming: https://www.coursera.org/learn/parprog1.

What's inside

Syllabus

Getting Started + Spark Basics
Get up and running with Scala on your computer. Complete an example assignment to familiarize yourself with our unique way of submitting assignments. This week, we'll bridge the gap between data parallelism in the shared-memory scenario (covered in the prerequisite Parallel Programming course) and the distributed scenario. We'll look at important concerns that arise in distributed systems, like latency and failure. We'll go on to cover the basics of Spark, a functionally-oriented framework for big data processing in Scala. We'll end the first week by putting what we learned about Spark into practice, analyzing a real-world data set.
Reduction Operations & Distributed Key-Value Pairs
This week, we'll look at a special kind of RDD called pair RDDs. With this specialized kind of RDD in hand, we'll cover essential operations on large data sets, such as reductions and joins.
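As a rough illustration of what a per-key reduction and a join compute, here is a minimal sketch that uses ordinary Scala collections (Scala 2.13 or later) as a stand-in for a pair RDD. It does not use Spark; the object name `PairOps`, the sample data, and the helper signatures are assumptions made for this example. In Spark itself, the corresponding pair RDD operations are `reduceByKey` and `join`.

```scala
// Plain Scala collections standing in for a pair RDD; no Spark dependency.
object PairOps {
  // Hypothetical sample data: (customerId, purchaseAmount) pairs.
  val purchases: Seq[(Int, Double)] =
    Seq((1, 10.0), (2, 5.0), (1, 2.5), (3, 7.0), (2, 1.0))

  // Hypothetical sample data: (customerId, city) pairs.
  val cities: Seq[(Int, String)] = Seq((1, "Lausanne"), (2, "Bern"))

  // Per-key reduction: combine all values that share a key with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupMapReduce(_._1)(_._2)(f)

  // Inner join on keys: only keys present on both sides survive.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}
```

For example, `PairOps.reduceByKey(PairOps.purchases)(_ + _)` totals the purchases per customer, and `PairOps.join(PairOps.purchases, PairOps.cities)` drops customer 3, who has no city on record.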
Partitioning and Shuffling
This week we'll look at some of the performance implications of using operations like joins. Is it possible to get the same result without having to pay for the overhead of moving data over the network? We'll answer this question by delving into how we can partition our data to achieve better data locality, in turn optimizing some of our Spark jobs.
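The idea behind partitioning for data locality can be sketched without Spark: if two key-value data sets are split by the same hash function into the same number of partitions, records with equal keys land at the same partition index, so a join can proceed partition by partition. The code below is a simplified model in plain Scala, not Spark's implementation; `Partitioning`, `partitionOf`, `hashPartition`, and `colocatedJoin` are hypothetical names introduced here. In Spark, the related pair RDD operation is `partitionBy` with a partitioner such as `HashPartitioner`.

```scala
// A plain-Scala model of hash partitioning; no Spark dependency.
object Partitioning {
  // Non-negative modulo, so negative hash codes still map to a valid index.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split key-value pairs into numPartitions buckets by key hash.
  def hashPartition[K, V](pairs: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] =
    Vector.tabulate(numPartitions) { i =>
      pairs.filter { case (k, _) => partitionOf(k, numPartitions) == i }
    }

  // Join two data sets that share the same partitioning, bucket by bucket.
  // Matching keys are guaranteed to sit at the same bucket index, so no
  // record has to be compared against a different bucket.
  def colocatedJoin[K, V, W](
      left: Vector[Seq[(K, V)]],
      right: Vector[Seq[(K, W)]]
  ): Seq[(K, (V, W))] =
    left.zip(right).flatMap { case (l, r) =>
      for { (k, v) <- l; (k2, w) <- r if k == k2 } yield (k, (v, w))
    }
}
```

In this model each bucket stands for the data held on one machine. The bucket-wise join only works because both sides were split with the same function and the same bucket count; with mismatched partitioning, matching keys could sit in different buckets and would have to be moved first.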
Structured data: SQL, DataFrames, and Datasets
With our newfound understanding of the cost of data movement in a Spark job, and some experience optimizing jobs for data locality last week, this week we'll focus on how we can more easily achieve similar optimizations. Can structured data help us? We'll look at Spark SQL and its powerful optimizer which uses structure to apply impressive optimizations. We'll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.
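To give a feel for the kind of rewrite an optimizer can make once it can see the structure of the data, the sketch below compares two plans for the same query in plain Scala: joining first and filtering afterwards, versus filtering first so that fewer rows take part in the join. This is only an illustration of one such rewrite (often called predicate pushdown), not a description of how Spark SQL works internally; the case classes, sample rows, and method names are all invented for this example.

```scala
// Plain Scala collections; no Spark dependency.
object FilterBeforeJoin {
  // Hypothetical record types and rows, invented for this example.
  final case class Person(id: Int, city: String)
  final case class Order(personId: Int, amount: Double)

  val people: Seq[Person] =
    Seq(Person(1, "Bern"), Person(2, "Zurich"), Person(3, "Bern"))
  val orders: Seq[Order] =
    Seq(Order(1, 20.0), Order(2, 35.0), Order(3, 5.0), Order(2, 8.0))

  // Naive plan: join everything, then keep only people in the given city.
  // Returns the result and the number of rows the join produced.
  def joinThenFilter(city: String): (Seq[(Person, Order)], Int) = {
    val joined = for { p <- people; o <- orders if p.id == o.personId } yield (p, o)
    (joined.filter(_._1.city == city), joined.size)
  }

  // Rewritten plan: filter first, so fewer rows take part in the join.
  def filterThenJoin(city: String): (Seq[(Person, Order)], Int) = {
    val local = people.filter(_.city == city)
    val joined = for { p <- local; o <- orders if p.id == o.personId } yield (p, o)
    (joined, joined.size)
  }
}
```

Both plans return the same rows for a given city, but the second one produces fewer intermediate rows. When a filter is an arbitrary function over opaque objects, a framework cannot reorder it safely on its own; when the query is expressed over named columns, a rewrite like this can be applied automatically.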

Good to know

Know what's good, what to watch for, and possible dealbreakers
Taught by Prof. Heather Miller, who is recognized for work in parallel programming in Scala
Examines big data manipulation using functional concepts, which is standard in industry
Develops familiarity with Apache Spark, a fast, in-memory distributed collections framework written in Scala
Assumes proficiency with Java or C#, or experience with C/C++, Python, JavaScript, or Ruby
Requires some familiarity with the command line
Completion of the Parallel Programming course is recommended as a prerequisite

Reviews summary

Spark and Scala for big data

Students' reviews for Big Data Analysis with Scala and Spark are mixed. Some learners appreciate the course, calling it great, but others say it is of lower quality than earlier courses in the specialization.
One student thinks this course is great.
"Great course."
"Thanks for everything."
One student thinks this course is of a lower quality than earlier courses in the specialization.
"4th Course of the Specialization 'Functional Programming In Scala': clearly down in quality compared to 2 first ones."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Big Data Analysis with Scala and Spark with these activities:
Read 'Learning Spark' by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Read 'Learning Spark' to gain a comprehensive understanding of the Spark framework, its architecture, and best practices for data processing and analysis.
  • Read 'Learning Spark' and take notes on key concepts and techniques.
  • Complete the exercises and examples provided in the book to practice applying Spark in real-world scenarios.
Participate in a study group to discuss Spark concepts
Join a study group or discussion forum to connect with other learners and discuss Spark concepts, share knowledge, and enhance your understanding.
  • Join a study group or online discussion forum focused on Spark.
  • Actively participate in discussions, asking questions, sharing insights, and collaborating with others.
Build a small Spark application for data analysis
Develop a mini project that utilizes Spark to analyze a dataset and gain practical experience in applying Spark concepts and techniques.
  • Choose a dataset and define a specific data analysis task.
  • Design and implement a Spark application to perform the data analysis task.
  • Evaluate the results and identify areas for improvement.

Career center

Learners who complete Big Data Analysis with Scala and Spark will develop knowledge and skills that may be useful to these careers:
Data Analyst
Data Analysts use statistical and mathematical modeling and other data analysis techniques to extract meaningful information from data. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Data Analyst by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Data Scientist
Data Scientists work with data to create knowledge. This can involve collecting, cleaning, and analyzing data, as well as developing algorithms and models to make predictions or recommendations. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Data Scientist by teaching you the basics of data analysis and how to use Apache Spark, a framework for big data processing in Scala.
Machine Learning Engineer
Machine Learning Engineers design and develop machine learning models, which are used to make predictions or recommendations based on data. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Machine Learning Engineer by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Software Engineer
Software Engineers design, develop, test, and maintain software systems. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Software Engineer by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Statistician
Statisticians collect, analyze, interpret, and present data. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Statistician by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Data Engineer
Data Engineers design, develop, test, and maintain data systems. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Data Engineer by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Database Administrator
Database Administrators manage and maintain databases. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Database Administrator by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Business Analyst
Business Analysts use data to help businesses make informed decisions. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Business Analyst by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Operations Research Analyst
Operations Research Analysts use mathematical and statistical models to solve business problems. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming an Operations Research Analyst by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Financial Analyst
Financial Analysts use data to evaluate investments and make recommendations to clients. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Financial Analyst by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Market Researcher
Market Researchers use data to understand consumer behavior and trends. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Market Researcher by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
User Experience Researcher
User Experience Researchers use data to understand how users interact with products and services. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a User Experience Researcher by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Quantitative Analyst
Quantitative Analysts use data to model and analyze financial markets. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Quantitative Analyst by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Actuary
Actuaries use data to assess and manage risk. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming an Actuary by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.
Data Visualization Specialist
Data Visualization Specialists create visual representations of data. This course in Big Data Analysis with Scala and Spark can help you build a foundation for becoming a Data Visualization Specialist by teaching you the basics of Apache Spark, a framework for big data processing in Scala, as well as how to express algorithms for data analysis in a functional style.

Reading list

We've selected nine books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Big Data Analysis with Scala and Spark.
Provides a detailed overview of the internals of Apache Spark. It covers topics such as how to install and set up a Hadoop cluster, how to read and write data to and from HDFS, and how to debug and monitor your programs. It also teaches how to write efficient Spark programs by using profiling techniques and tuning the runtime.
Provides a comprehensive overview of Apache Spark, a powerful framework for big data processing. It covers the basics of Spark, including its programming model, APIs, and cluster management. The book also provides a number of case studies that demonstrate how Spark can be used to solve real-world problems.
Provides a hands-on introduction to Apache Spark, with a focus on practical applications. It covers a wide range of topics, including data loading, transformations, machine learning, and streaming. The book also provides a number of code examples and exercises.
Provides a comprehensive overview of functional programming in Scala, with a focus on practical applications. It covers a wide range of topics, including data structures, monads, and concurrency.
Provides a collection of recipes that demonstrate how to use Apache Spark 3.0 to solve real-world problems. The recipes cover a wide range of topics, including data loading, transformations, machine learning, and streaming.
Provides a collection of recipes that demonstrate how to use Scala to solve real-world problems. The recipes cover a wide range of topics, including data structures, functional programming, and concurrency.
Provides a comprehensive overview of text processing using MapReduce. It covers a wide range of topics, including text tokenization, stemming, lemmatization, and machine learning. The book also provides a number of code examples and exercises.
Scala is a functional programming language that is well suited to big data processing. This book provides a comprehensive overview of functional programming in Scala, with a focus on practical applications.
Provides a comprehensive overview of Apache Hadoop, the popular framework for big data processing. It covers a wide range of topics, including HDFS, MapReduce, and YARN.

Similar courses

Here are nine courses similar to Big Data Analysis with Scala and Spark.
Big Data Analysis with Scala and Spark (Scala 2 version)
Parallel programming (Scala 2 version)
Parallel programming
Thinking Functionally in Scala 2
Functional Program Design in Scala (Scala 2 version)
Distributed Programming in Java
Functional Program Design in Scala
Scala 2 Methods and Functions
Functional Programming Principles in Scala (Scala 2...