We may earn an affiliate commission when you visit our partners.
Prof. Heather Miller

Manipulating big data distributed over a cluster using functional concepts is rampant in industry, and is arguably one of the first widespread industrial uses of functional ideas. This is evidenced by the popularity of MapReduce and Hadoop, and most recently Apache Spark, a fast, in-memory distributed collections framework written in Scala. In this course, we'll see how the data parallel paradigm can be extended to the distributed case, using Spark throughout. We'll cover Spark's programming model in detail, being careful to understand how and when it differs from familiar programming models, like shared-memory parallel collections or sequential Scala collections. Through hands-on examples in Spark and Scala, we'll learn when important issues related to distribution like latency and network communication should be considered and how they can be addressed effectively for improved performance.

Learning Outcomes. By the end of this course you will be able to:

- read data from persistent storage and load it into Apache Spark,

- manipulate data with Spark and Scala,

- express algorithms for data analysis in a functional style,

- recognize how to avoid shuffles and recomputation in Spark.

Recommended background: You should have at least one year of programming experience. Proficiency with Java or C# is ideal, but experience with other languages such as C/C++, Python, JavaScript, or Ruby is also sufficient. You should have some familiarity with the command line. This course is intended to be taken after Parallel Programming: https://www.coursera.org/learn/parprog1.

Note that this version of the course uses Scala 2.13. You can find a more recent version of the course that uses Scala 3 here: https://www.coursera.org/learn/scala-spark-big-data

What's inside

Syllabus

Getting Started + Spark Basics
Get up and running with Scala on your computer. Complete an example assignment to familiarize yourself with our unique way of submitting assignments. This week, we'll bridge the gap between data parallelism in the shared-memory scenario (covered in the prerequisite Parallel Programming course) and the distributed scenario. We'll look at important concerns that arise in distributed systems, like latency and failure. We'll go on to cover the basics of Spark, a functionally oriented framework for big data processing in Scala. We'll end the first week by putting what we learned about Spark into practice, analyzing a real-world data set.
Reduction Operations & Distributed Key-Value Pairs
This week, we'll look at a special kind of RDD called pair RDDs. With this specialized kind of RDD in hand, we'll cover essential operations on large data sets, such as reductions and joins.
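As a rough illustration of what a per-key reduction and a join compute, here is a minimal sketch on plain Scala 2.13 collections rather than on Spark itself, so it runs without a cluster. In Spark, the corresponding pair RDD operations are `reduceByKey` and `join`. The names `PairOpsSketch`, `reduceByKeyLocal`, and `joinLocal`, and the sample data, are hypothetical and not taken from the course.

```scala
// Plain Scala collections standing in for a pair RDD; no Spark dependency.
object PairOpsSketch {
  // Hypothetical sample data: (customerId, purchaseAmount) and (customerId, city).
  val purchases: Seq[(Int, Double)] = Seq((1, 20.0), (2, 5.0), (1, 7.5), (3, 12.0))
  val cities: Seq[(Int, String)] = Seq((1, "Lausanne"), (2, "Zurich"))

  // Reduction per key: combine all values that share a key using `op`.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs.groupMapReduce(_._1)(_._2)(op)

  // Inner join: keep only keys present on both sides, pairing up every match.
  def joinLocal[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for {
      (k, a) <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))

  def main(args: Array[String]): Unit = {
    val totals = reduceByKeyLocal(purchases)(_ + _)
    println(totals)
    println(joinLocal(totals.toSeq, cities))
  }
}
```

On a cluster, both operations may need to bring values with the same key onto the same node, which is why the next week of the syllabus turns to partitioning and shuffling.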
Partitioning and Shuffling
This week we'll look at some of the performance implications of using operations like joins. Is it possible to get the same result without having to pay for the overhead of moving data over the network? We'll answer this question by delving into how we can partition our data to achieve better data locality, in turn optimizing some of our Spark jobs.
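The idea behind partitioning for data locality can be sketched locally, again with plain Scala 2.13 collections instead of Spark. The sketch assumes a simple hash-based scheme (key hash code modulo the number of partitions); the names `PartitionSketch`, `partitionOf`, `partitionByHash`, and `joinCoPartitioned` are hypothetical. If two data sets are split with the same function, matching keys always land in buckets with the same index, so a join can proceed bucket by bucket.

```scala
object PartitionSketch {
  // Assign a key to one of `numPartitions` buckets by hash code, using a
  // non-negative modulo so that negative hash codes still map into range.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split key-value pairs into buckets; index i holds the pairs for partition i.
  def partitionByHash[K, V](pairs: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = pairs.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Join two data sets that were partitioned the same way: each bucket is joined
  // only with the bucket of the same index, so no pair has to change partition.
  def joinCoPartitioned[K, A, B](
      left: Vector[Seq[(K, A)]],
      right: Vector[Seq[(K, B)]]
  ): Vector[Seq[(K, (A, B))]] =
    left.zip(right).map { case (l, r) =>
      for {
        (k, a) <- l
        (k2, b) <- r
        if k == k2
      } yield (k, (a, b))
    }

  def main(args: Array[String]): Unit = {
    val left = partitionByHash(Seq((1, "a"), (2, "b"), (5, "c")), 4)
    val right = partitionByHash(Seq((5, 50), (2, 20)), 4)
    println(joinCoPartitioned(left, right))
  }
}
```

In this local sketch the buckets are just in-memory sequences. On a real cluster each bucket would live on a different node, and the bucket-by-bucket join is what avoids moving data over the network.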
Structured data: SQL, DataFrames, and Datasets
With our newfound understanding of the cost of data movement in a Spark job, and some experience optimizing jobs for data locality last week, this week we'll focus on how we can more easily achieve similar optimizations. Can structured data help us? We'll look at Spark SQL and its powerful optimizer which uses structure to apply impressive optimizations. We'll move on to cover DataFrames and Datasets, which give us a way to mix RDDs with the powerful automatic optimizations behind Spark SQL.
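One kind of rewrite that a query optimizer can apply when it knows the structure of the data is to filter rows before a join instead of after it. The sketch below imitates that by hand on plain Scala 2.13 collections. It does not use Spark SQL, and the record types `Person` and `Order`, the object `PushdownSketch`, and the sample data are hypothetical. Both plans return the same rows, but the second one builds a smaller intermediate result.

```scala
object PushdownSketch {
  // Hypothetical record types for illustration only.
  case class Person(id: Int, name: String, city: String)
  case class Order(personId: Int, amount: Double)

  val people: Seq[Person] = Seq(Person(1, "Ana", "Bern"), Person(2, "Ben", "Basel"))
  val orders: Seq[Order] = Seq(Order(1, 10.0), Order(2, 20.0), Order(2, 5.0))

  // Pair every person with each of their orders.
  private def joinOnId(ps: Seq[Person], os: Seq[Order]): Seq[(Person, Order)] =
    for { p <- ps; o <- os; if p.id == o.personId } yield (p, o)

  // Naive plan: join everything, then filter by city.
  // Returns the result rows and the number of rows the join produced.
  def joinThenFilter(ps: Seq[Person], os: Seq[Order], city: String): (Seq[(String, Double)], Int) = {
    val joined = joinOnId(ps, os)
    val result = joined.collect { case (p, o) if p.city == city => (p.name, o.amount) }
    (result, joined.size)
  }

  // Rewritten plan: filter by city first, so the join sees fewer rows.
  def filterThenJoin(ps: Seq[Person], os: Seq[Order], city: String): (Seq[(String, Double)], Int) = {
    val joined = joinOnId(ps.filter(_.city == city), os)
    (joined.map { case (p, o) => (p.name, o.amount) }, joined.size)
  }

  def main(args: Array[String]): Unit = {
    println(joinThenFilter(people, orders, "Bern"))
    println(filterThenJoin(people, orders, "Bern"))
  }
}
```

With hand-written RDD code, the programmer has to choose the cheaper ordering. With structured data, the optimizer has enough information to make that kind of choice automatically.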

Good to know

Know what's good, what to watch for, and possible dealbreakers
Develops data analysis skills in Scala, a highly sought-after programming language
Covers Spark, a widely used framework for big data processing and analytics
Taught by Professor Heather Miller, a respected researcher and educator in the field
Emphasis on hands-on practice through real-world data analysis assignments
Recommends proficiency in Java or C#, though experience with other languages such as C/C++, Python, JavaScript, or Ruby is also sufficient

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Big Data Analysis with Scala and Spark (Scala 2 version) with these activities:
Organize Course Materials
Stay organized by gathering and arranging all relevant course materials in one place.
  • Create folders for lecture notes, assignments, and other resources.
  • Label and file materials systematically.
Review: Scala Basics
Refresh your knowledge of Scala syntax and constructs before starting the course.
  • Review online resources or tutorials on Scala.
Read: Learning Spark: Lightning-Fast Big Data Analytics
Gain a deeper understanding of Spark concepts and best practices from a comprehensive book.
  • Purchase or borrow the book.
  • Read and make notes on the relevant chapters.
Apache Spark Resources Compilation
Enhance your knowledge of Apache Spark by collecting and organizing relevant tools and resources.
  • Search for articles, tutorials, and documentation on Apache Spark.
  • Bookmark or save useful resources.
  • Create a document or online repository for your compilation.
Attend Apache Spark Meetup or Conference
Connect with professionals and learn about the latest advancements in Apache Spark by attending an industry event.
  • Search for upcoming Apache Spark events in your area.
  • Register and attend the event.
  • Network with other attendees and speakers.
Spark Tutorial: Spark SQL, DataFrames, and Datasets
Deepen your understanding of Apache Spark's SQL and DataFrames capabilities by following an in-depth tutorial.
  • Visit the tutorial website and follow the instructions.
Spark Exercises: Transformations and Actions
Reinforce your understanding of Spark transformations and actions by solving a series of challenging exercises.
  • Download the exercise dataset.
  • Use Spark to load the dataset into an RDD.
  • Apply various transformations and actions to the RDD.
  • Verify the results of your operations.
Study Group: Spark RDDs and Transformations
Collaborate with peers to strengthen your understanding of Spark RDDs and transformations through group discussions.
  • Form a study group with other course participants.
  • Choose a topic related to Spark RDDs and transformations.
  • Meet regularly to discuss the topic, solve problems, and share insights.
Spark Project: Data Analysis on a Real-World Dataset
Apply your Spark skills to a practical project, analyzing a real-world dataset to uncover insights.
  • Identify a suitable dataset.
  • Load the dataset into Spark.
  • Explore and analyze the data using Spark operations.
  • Visualize and present your findings.

Career center

Learners who complete Big Data Analysis with Scala and Spark (Scala 2 version) will develop knowledge and skills that may be useful to these careers:
Data Scientist
Data Scientists develop new algorithms and solve complex problems by using large data sets. As a part of their work, Data Scientists collaborate with colleagues in the fields of engineering, statistics, and business to understand a company's needs. This course would be useful, as it helps build a foundation in big data analysis, which is a core skill for Data Scientists. Skills learned in this course will assist in evaluating and implementing data sets more efficiently and effectively.
Data Analyst
Data Analysts collect, analyze, and interpret data to help organizations make informed decisions. Similar to Data Scientists, they collaborate with technical teams to translate their findings to actionable items. This course would be useful, as it teaches techniques and methods for data analysis that are in high demand.
Database Administrator
Database Administrators ensure that data storage systems operate smoothly. They work with data analysts, application developers, and end-users to provide access to data when needed. This course may be useful, as it provides a solid foundation in how data is stored and accessed.
Software Engineer
Software Engineers apply engineering principles to the design, development, deployment, and maintenance of software systems. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Data Architect
Data Architects design and build data architectures. They work with data analysts and other stakeholders to understand the business needs and then design and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in big data analysis and manipulation.
Business Analyst
Business Analysts identify and solve business problems by using data analysis and other techniques. They work with stakeholders to understand the business needs and then develop and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Machine Learning Engineer
Machine Learning Engineers develop and implement machine learning algorithms. They work with data scientists and other stakeholders to understand the business needs and then design and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in big data analysis and manipulation.
Data Engineer
Data Engineers build and maintain data pipelines. They work with data analysts and other stakeholders to understand the business needs and then design and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in big data analysis and manipulation.
Cloud Architect
Cloud Architects design, build, and manage cloud computing systems. They work with clients to understand their business needs and then design and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in big data analysis and manipulation.
Quantitative Analyst
Quantitative Analysts use mathematical and statistical models to analyze financial data. They work with portfolio managers and other stakeholders to develop and implement investment strategies. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Epidemiologist
Epidemiologists study the distribution and determinants of health-related states or events in specified populations. They work with public health officials and other stakeholders to develop and implement programs to prevent and control disease. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Biostatistician
Biostatisticians apply statistical methods to solve problems in the field of biology. They work with biologists and other scientists to design and conduct studies, and to analyze and interpret data. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Research Scientist
Research Scientists conduct research in a variety of fields, including science, engineering, and medicine. They develop and test new theories and technologies, and publish their findings in academic journals. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Statistician
Statisticians collect, analyze, and interpret data to help organizations make informed decisions. They work with stakeholders to understand the business needs and then develop and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in data analysis and manipulation.
Data Warehouse Architect
Data Warehouse Architects design and build data warehouses. They work with data analysts and other stakeholders to understand the business needs and then design and implement solutions that meet those needs. This course may be useful, as it provides a solid foundation in big data analysis and manipulation.

Reading list

We've selected seven books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Big Data Analysis with Scala and Spark (Scala 2 version).
Comprehensive reference guide to Spark, covering topics such as architecture, programming models, and performance tuning.
Practical guide to using Spark for big data analysis, covering topics such as data loading, transformations, and machine learning.
Provides a comprehensive overview of data science and big data analytics, covering topics such as data mining, machine learning, and data visualization.
Provides a comprehensive overview of algorithms, covering topics such as data structures, algorithms, and complexity theory.
Provides a comprehensive overview of algorithms and theory of computation, covering topics such as data structures, algorithms, and complexity theory.

© 2016 - 2024 OpenCourser