We may earn an affiliate commission when you visit our partners.
Romeo Kienzler

This course will empower you with the skills to scale data science and machine learning (ML) tasks on Big Data sets using Apache Spark. Most real-world machine learning work involves very large data sets that go beyond the CPU, memory, and storage limitations of a single computer.

Apache Spark is an open-source framework that leverages cluster computing and distributed storage to process extremely large data sets in an efficient and cost-effective manner. Applied knowledge of Apache Spark is therefore a great asset and a potential differentiator for a machine learning engineer.

After completing this course, you will be able to:

- gain a practical understanding of Apache Spark, and apply it to solve machine learning problems involving both small and big data

- understand how to write parallel code capable of running on thousands of CPUs

- make use of large-scale compute clusters to apply machine learning algorithms to petabytes of data using Apache SparkML Pipelines

- eliminate out-of-memory errors generated by traditional machine learning frameworks when data doesn’t fit in a computer's main memory

- test thousands of different ML models in parallel to find the best performing one – a technique used by many successful Kagglers

- (Optional) run SQL statements on very large data sets using Apache SparkSQL and the Apache Spark DataFrame API.
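
To make the idea of testing many models in parallel concrete, here is a minimal sketch in plain Python rather than Spark. The toy data, the parameter grid, and the `mean_squared_error` helper are illustrative assumptions, not material from the course. Each candidate is scored independently, which is the property that lets a cluster spread the work across many machines.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical toy data that follows y = 3x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]


def mean_squared_error(params):
    """Score one candidate model y = slope * x + intercept."""
    slope, intercept = params
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Every (slope, intercept) pair is an independent candidate model.
grid = list(product([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]))

# Candidates share no state, so they can be scored concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(mean_squared_error, grid))

# Pick the candidate with the lowest error.
best_score, best_params = min(zip(scores, grid))
print(best_params, best_score)
```

Threads are used here only to show that the candidates are independent; on a real cluster the same search would be distributed across worker nodes instead.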

Enrol now to learn the machine learning techniques for working with Big Data that have been successfully applied by companies like Alibaba, Apple, Amazon, Baidu, eBay, IBM, NASA, Samsung, SAP, TripAdvisor, Yahoo!, Zalando and many others.

NOTE: During the course, you will practice running machine learning tasks hands-on on an Apache Spark cluster provided by IBM at no charge, which you can continue to use afterwards.

Prerequisites:

- basic Python programming

- basic machine learning (optional introduction videos are provided in this course as well)

- basic SQL skills for optional content

The following courses are recommended before taking this class (unless you already have the skills):

https://www.coursera.org/learn/python-for-applied-data-science or similar

https://www.coursera.org/learn/machine-learning-with-python or similar

https://www.coursera.org/learn/sql-data-science for optional lectures

What's inside

Syllabus

Week 1: Introduction
This is an introduction to Apache Spark. You'll learn how Apache Spark works internally and how to use it for data processing. The RDD, the low-level API, is introduced in conjunction with parallel programming and functional programming. Then, different types of data storage solutions are contrasted. Finally, Apache Spark SQL, the Catalyst optimizer, and the Tungsten execution engine are explained.
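
The functional style behind the RDD API can be sketched in plain Python without a Spark installation. The `partitions` list and the `process_partition` helper below are illustrative assumptions; on a real RDD the corresponding methods are `filter`, `map`, and `reduce`.

```python
from functools import reduce

# Hypothetical data, split into "partitions" the way a cluster would
# split a large data set across worker nodes.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]


def process_partition(part):
    """Keep the even numbers, square them, and sum the result locally."""
    evens = filter(lambda x: x % 2 == 0, part)
    squares = map(lambda x: x * x, evens)
    return reduce(lambda a, b: a + b, squares, 0)


# Each partition is processed independently (this step could run in
# parallel), then the partial results are merged with the same
# associative operation.
partial_sums = [process_partition(p) for p in partitions]
total = reduce(lambda a, b: a + b, partial_sums, 0)
print(partial_sums, total)
```

Because the functions have no side effects and addition is associative, the result does not depend on how the data is partitioned or in which order the partial sums are merged.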
Week 2: Scaling Math for Statistics on Apache Spark
Applying basic statistical calculations using the Apache Spark RDD API in order to experience how parallelization in Apache Spark works
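
As a rough illustration of how such statistics parallelize, the sketch below computes a mean and a population standard deviation from per-partition summaries in plain Python. The sample values and the `summarize` and `merge` helpers are assumptions made for this example; the same pattern maps onto `map` and `reduce` calls on an RDD.

```python
import math
from functools import reduce

# Hypothetical sample, already split into partitions.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]


def summarize(part):
    """Return (count, sum, sum of squares) for one partition."""
    return (len(part), sum(part), sum(x * x for x in part))


def merge(a, b):
    """Combine two partial summaries; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


n, total, total_sq = reduce(merge, map(summarize, partitions))
mean = total / n
# Population variance via E[x^2] - (E[x])^2.
variance = total_sq / n - mean ** 2
std_dev = math.sqrt(variance)
print(n, mean, std_dev)
```

Only three numbers per partition travel between workers, regardless of partition size. Note that the sum-of-squares formula can lose precision when values are large relative to their spread, so it is best read as a teaching device.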
Week 3: Introduction to Apache SparkML
Learn the concept of machine learning pipelines to see how Apache SparkML works programmatically
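
The pipeline idea can be sketched in a few lines of plain Python. Every class below is a hypothetical stand-in written for this example, not the Spark API. The sketch mirrors the general pattern in which fitting a pipeline fits each stage in order and yields a fitted model that can transform new data. As a simplification, stages with nothing to learn are given a trivial `fit` method.

```python
class MeanCenterer:
    """Estimator-like stage: learns the mean in fit(), returns a transformer."""

    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return Shift(-mean)


class Shift:
    """Transformer-like stage: adds a fixed offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [x + self.offset for x in rows]


class Scale:
    """Transformer-like stage: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [x * self.factor for x in rows]


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Unfitted pipeline: fits each stage on the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [1.0, 2.0, 3.0]
model = Pipeline([MeanCenterer(), Scale(10.0)]).fit(train)
centered_train = model.transform(train)
centered_new = model.transform([4.0])
print(centered_train, centered_new)
```

The fitted model reuses the mean learned from the training data when it transforms new values, which is the main reason to bundle preprocessing and modeling steps into one object.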
Week 4: Supervised and Unsupervised learning with SparkML
Apply supervised and unsupervised machine learning using SparkML

Good to know

Know what's good, what to watch for, and possible dealbreakers
This course offers hands-on labs and interactive materials, providing practical experience with Apache Spark
Taught by instructors who are recognized for their work in the field of machine learning, ensuring high-quality content
Develops skills in working with Big Data using Apache Spark, a framework highly relevant in industry
Provides a comprehensive study of Apache SparkML Pipelines, a powerful tool for applying machine learning algorithms to large datasets
Requires basic Python programming and machine learning skills to participate effectively
Explores topics not covered in the course's name, such as SQL-based data processing using Apache SparkSQL and DataFrame API

Reviews summary

Sparkly scalable machine learning

Learner reviews of Scalable Machine Learning on Big Data using Apache Spark are largely positive, citing engaging assignments and clear explanations. The course covers the basics of Apache Spark, including its use in parallel computing, machine learning, and data analysis. It is project-based, with hands-on assignments that help learners apply the concepts they learn. However, a few areas could be improved, such as the quality of some videos and the need for more in-depth explanations and more challenging exercises.
The instructor provides clear and detailed explanations of the concepts, making them easy to understand.
"Romeo Kienzler Sir is so helpful for students. there all lectures are great."
"The videos were not so awesome, but the curse was superb overall. It really addressed the intricacies of large data."
"The videos were really great to understand what is Apache Spark and How to use it."
The course offers numerous hands-on programming assignments that allow learners to apply their knowledge and gain practical experience.
"Great notebooks."
"Programming assignments helped a lot in grasping the concepts"
"It was a very interesting and skillful course."
The course lacks organization and coherence, with topics covered in a disjointed manner.
"The course should give more in-depth assignments and also more explanation."
"The course seems to be copy and pasted together from other courses and the versions/classes/functions of the software have changed compared to videos and assignments."
"The course is definitely one of the worst i had in coursera."
The instructor's accent can be difficult for some learners to understand, affecting the clarity of the explanations.
"Instructor pronunciation is not the best for someone who are not usually listening explain so fast."
"The instructor in this course lacks thorough explanation of the topics being discussed."
"Instructor was definitely knowledgeable, but i felt the course was more like a code review and more emphasis should be given to teaching/explaining."
The course lacks depth in some areas, and the explanations could be more detailed and thorough.
"Very high level, exercises could have been more challenging and hands-on"
"Compared to other courses in AI Engineering, this one was a bit too technical"
"Not enough coding opportunities provided. More Coding assignments and practice will be better and more content is very much needed."
Some of the videos and course materials are outdated and do not reflect the latest versions of Apache Spark.
"The content is quite old and full of mistakes."
"Some of the content could have been presented more clearly and recorded in the same manner consistently, few items seemed to repeat also while other are not covered well."
"Out of date and confusing examples."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Scalable Machine Learning on Big Data using Apache Spark with these activities:
Review: Basic Python programming
Refresh your basic Python programming skills to ensure a solid foundation for understanding Apache Spark.
  • Review Python syntax and data structures
  • Practice writing simple Python scripts
  • Complete coding exercises
Review foundational concepts of machine learning before taking the course
Strengthen understanding of fundamental machine learning principles to support learning in this course.
  • Review introductory machine learning materials.
  • Complete practice problems and exercises.
  • Attend a refresher workshop or tutorial.
Review: Basic machine learning concepts
Revisit fundamental machine learning concepts to enhance your understanding of Apache Spark ML.
  • Review supervised and unsupervised learning algorithms
  • Understand model evaluation metrics
  • Practice implementing basic machine learning algorithms
Join a study group or online forum for Apache Spark
Engage with peers, discuss concepts, and deepen understanding of Apache Spark concepts.
  • Find a suitable study group or online forum.
  • Participate regularly in discussions.
  • Contribute to group projects or assignments.
  • Seek help and provide support to other members.
Exercise: Perform basic statistical calculations using Apache Spark RDD API
Solidify your understanding of parallel programming and statistical calculations in Apache Spark.
  • Set up an Apache Spark environment
  • Load a dataset into an RDD
  • Perform basic statistical operations (e.g., mean, standard deviation, etc.)
  • Visualize the results
Tutorial: Introduction to Apache Spark ML pipelines
Gain a comprehensive understanding of how to construct and utilize machine learning pipelines in Apache Spark ML.
  • Understand the concept of machine learning pipelines
  • Create a Spark ML pipeline
  • Train and evaluate a machine learning model using the pipeline
  • Deploy the pipeline for real-time predictions
Solve Apache Spark RDD practice problems
Practice solving problems involving RDDs to solidify understanding of distributed computing concepts.
  • Identify and define the problem you want to solve.
  • Convert the problem into a series of RDD operations.
  • Run the operations on a Spark cluster.
  • Validate the results.
Create a SparkML Pipeline for a supervised learning task
Develop a practical understanding of creating and executing SparkML pipelines for supervised learning tasks.
  • Load and explore the dataset.
  • Create a SparkML pipeline.
  • Fit and evaluate the pipeline.
  • Deploy the pipeline for inference.
Project: Build a machine learning model using Apache Spark ML
Apply your skills to build a complete machine learning model using Apache Spark ML, encompassing data preparation, feature engineering, model training, and evaluation.
  • Define the problem and gather data
  • Clean and prepare the data
  • Create features and select relevant variables
  • Train, evaluate, and tune the machine learning model
  • Deploy the model and monitor its performance
Follow tutorials to explore advanced SparkML algorithms
Acquire hands-on experience with various SparkML algorithms through guided tutorials.
  • Identify the specific algorithms you want to learn.
  • Find appropriate tutorials and documentation.
  • Follow the tutorials step-by-step.
  • Experiment with different parameters and datasets.
Contribute to Apache Spark community
Make meaningful contributions to the Apache Spark community, deepening your understanding of the framework and its applications.
  • Report bugs or issues
  • Write documentation or tutorials
  • Contribute code to the Apache Spark project
  • Participate in community discussions and forums
Mentor junior data scientists
Share your knowledge and expertise in Apache Spark and machine learning by mentoring junior data scientists or students.
  • Identify and connect with mentees
  • Provide guidance on Apache Spark and machine learning concepts
  • Review mentee's projects and assignments
  • Offer encouragement and support
Develop a small-scale Apache Spark project
Gain practical experience by applying Apache Spark skills to solve a real-world problem.
  • Define the project scope and objectives.
  • Design the project architecture.
  • Implement the project using Apache Spark.
  • Test and evaluate the project.
  • Document and share the project outcomes.

Career center

Learners who complete Scalable Machine Learning on Big Data using Apache Spark will develop knowledge and skills that may be useful to these careers:
Machine Learning Researcher
A Machine Learning Researcher conducts research in the field of machine learning. This course helps build the foundation for success in this role. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Machine Learning Engineer
A Machine Learning Engineer designs, develops, and deploys machine learning models and systems. This course helps build the foundation for success in this role. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data. You also learn how to use Apache SparkML Pipelines to apply machine learning algorithms on large-scale compute clusters. This course may also be useful for Data Scientists and Data Analysts who want to learn more about Apache Spark.
Data Analyst
A Data Analyst transforms raw data into insights. This course helps build the foundation for success in this role. Apache Spark is a popular platform for data analysis. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data. You also learn how to use Apache SparkML Pipelines to apply machine learning algorithms on large-scale compute clusters.
Data Engineer
A Data Engineer designs, builds, and maintains data systems. This course helps build the foundation for success in this role. Apache Spark is a popular platform for data engineering. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Big Data Architect
A Big Data Architect designs and builds big data systems. This course helps build the foundation for success in this role. Apache Spark is a popular platform for big data. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Data Science Manager
A Data Science Manager leads a team of data scientists and engineers. This course helps build the foundation for success in this role. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Data Scientist
A Data Scientist uses data to solve complex problems. This course helps build the foundation for success in this role. Apache Spark is a popular platform for data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data. You also learn how to use Apache SparkML Pipelines to apply machine learning algorithms on large-scale compute clusters.
Research Scientist
A Research Scientist conducts research in a specific field of science. This course may be useful for Research Scientists who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
System Administrator
A System Administrator installs, maintains, and repairs computer systems. This course may be useful for System Administrators who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Quantitative Analyst
A Quantitative Analyst uses mathematics and statistics to solve financial problems. This course may be useful for Quantitative Analysts who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Business Analyst
A Business Analyst helps businesses make better decisions. This course may be useful for Business Analysts who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Statistician
A Statistician collects, analyzes, and interprets data. This course may be useful for Statisticians who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Software Engineer
A Software Engineer designs, develops, and tests software systems. This course may be useful for Software Engineers who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Cloud Engineer
A Cloud Engineer designs, builds, and maintains cloud computing systems. This course may be useful for Cloud Engineers who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.
Database Administrator
A Database Administrator designs, builds, and maintains databases. This course may be useful for Database Administrators who want to learn more about Apache Spark. Apache Spark is a popular platform for machine learning and data science. This course provides a comprehensive introduction to Apache Spark. Through hands-on exercises, you learn how to use Apache Spark to solve machine learning problems involving both small and big data.

Reading list

We've selected 14 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Scalable Machine Learning on Big Data using Apache Spark.
A comprehensive reference covering the theoretical foundations of machine learning and pattern recognition. It emphasizes statistical modeling, probabilistic graphical models, and Bayesian inference for machine learning problems.
An advanced book on machine learning theory and algorithms. Explores concepts like supervised learning, unsupervised learning, and reinforcement learning in detail. Discusses the fundamental concepts and techniques involved.
Provides strong coverage of statistical methods used in machine learning, especially in supervised learning. Introduces regularization techniques and sparse models, which are commonly used in many machine learning applications.
An accessible and practical introduction to machine learning for those with a basic understanding of programming. Provides hands-on examples and case studies to demonstrate the application of machine learning in real-world scenarios.
Provides a broad overview of deep learning theory and its application potential. It dives into the mathematical intuition behind different deep learning models and provides hands-on examples using Python.
This provides a starting point for learning reinforcement learning concepts. Introduces concepts, algorithms, and theoretical foundations of reinforcement learning. It is valuable to understand the learning aspect of machine learning.
An introductory book to machine learning and data mining. Presents a wide range of techniques and concepts commonly used for machine learning tasks and data mining applications. Provides good hands-on examples using the WEKA software.
Mostly focused on supervised learning and big data clustering, this book is appropriate for readers who have already started their data science journey while providing depth to their knowledge. Useful advanced concepts such as hyperparameter optimization are provided in a clear manner.
A comprehensive reference book for Apache Spark. It takes a deep dive into the internals and covers advanced topics like graph processing and streaming that this course does not.
Provides a good amount of content on Apache Spark SQL and big data query processing with SQL. It is a useful resource, especially for learners who come from a SQL background and are new to Apache Spark.
A great resource for learning about distributed computing and the Hadoop ecosystem. It covers advanced topics like security, scalability, and monitoring that this course does not cover. Focuses on Apache Hadoop rather than Spark, but provides context and good depth on big data fundamentals.
Focuses on the Hadoop ecosystem, with chapters covering Apache Spark and Hive. It gives context and an overall view of big data processing along with the Hadoop platform's history.
Aims to present recipes for adapting your data to Apache Spark and writing code capable of running on large clusters. This book is a good companion for practicing different use cases.

Similar courses

Here are nine courses similar to Scalable Machine Learning on Big Data using Apache Spark.
Apache Spark for Data Engineering and Machine Learning
AI Workflow: Enterprise Model Deployment
Apache Spark 2.0 with Java -Learn Spark from a Big Data...
Big Data Analysis with Scala and Spark (Scala 2 version)
Working with Big Data
Apache Spark with Scala - Hands On with Big Data!
Fundamentals of Scalable Data Science
Big Data with Scala and Spark
Big Data Analysis with Scala and Spark

© 2016 - 2024 OpenCourser