We may earn an affiliate commission when you visit our partners.
Aije Egwaikhide, Romeo Kienzler, and Rav Ahuja

This self-paced IBM course will teach you all about big data! You will become familiar with the characteristics of big data and its application in big data analytics. You will also gain hands-on experience with big data processing tools like Apache Hadoop and Apache Spark.

Bernard Marr defines big data as the digital trace that we are generating in this digital era. You will start the course by understanding what big data is and exploring how insights from big data can be harnessed for a variety of use cases. You’ll also explore how big data uses technologies like parallel processing, scaling, and data parallelism.
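To make data parallelism concrete, here is a minimal sketch in plain Python rather than course material. It splits a list into chunks, applies the same function to each chunk on a worker, and combines the partial results. The function names are hypothetical, and a thread pool is used only to keep the example self-contained; real big data tools distribute the chunks across many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Split a list into at most num_chunks contiguous slices."""
    if not values:
        return []
    size = -(-len(values) // num_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """The same operation is applied independently to every chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, num_workers=4):
    """Apply sum_of_squares to each chunk on a worker, then combine."""
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


data = list(range(1, 1001))
total = parallel_sum_of_squares(data)
print(total)
```

Because each chunk is processed without reference to the others, adding more workers (scaling out) does not change the result, only how the work is divided.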

Next, you will learn about Hadoop, an open-source framework for the distributed processing of large data sets, and its ecosystem. You will discover important components and applications that go hand in hand with Hadoop, like the Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will also become familiar with Hive, data warehouse software that provides an SQL-like interface for efficiently querying and manipulating large data sets.
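As a rough illustration of the MapReduce idea, the sketch below counts words in plain Python. It is not Hadoop code: the three phase functions are hypothetical names, and everything runs in one process, whereas Hadoop would run the map and reduce steps on different nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data big insights", "data at scale"]
counts = word_count(lines)
print(counts)
```

The map step looks at each line in isolation and the reduce step looks at each key in isolation, which is what allows a framework to spread both steps across machines.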

You’ll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the components that make up Apache Spark.

You’ll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You’ll also explore how Spark processes and monitors the requests your application submits and how you can track work using the Spark Application UI.

This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.

What's inside

Syllabus

What Is Big Data?
In this module, you’ll start with an up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions through Big Data use cases. You’ll also learn how Big Data uses parallel processing, scaling, and data parallelism. Going further, you’ll explore commonly used Big Data tools and the role of open source in Big Data. Finally, you’ll go beyond the hype and explore additional Big Data viewpoints.
Introduction to the Hadoop Ecosystem
In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including the Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. You’ll also gain practical skills in hands-on labs, where you'll query data added using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.
Apache Spark
In this module, you’ll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. You’ll also explore Resilient Distributed Datasets (RDDs), parallel programming, and resilience in Apache Spark, and relate RDDs and parallel programming to Apache Spark. Then, you’ll dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data also means working with queries, including structured queries using SQL. You’ll learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.
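For a feel of the functional style and lambda functions mentioned in this module, here is a small sketch using only Python built-ins. It is not Spark code and the sample values are made up; Spark RDDs expose similarly named map, filter, and reduce operations, but those run across a cluster rather than on a local list.

```python
from functools import reduce

# Hypothetical sample values, used only for illustration.
readings = [12, 7, 30, 18, 5]

# map: apply a lambda to every element.
doubled = list(map(lambda x: x * 2, readings))

# filter: keep only the elements for which the lambda returns True.
large = list(filter(lambda x: x > 20, doubled))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda a, b: a + b, large)

print(doubled, large, total)
```

Each step takes a function as an argument and leaves its input unchanged, which is the property that makes this style easy to parallelize.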
DataFrames and Spark SQL
In this module, you’ll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You’ll explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you’ll fortify your skills with a guided hands-on lab to create a table view and apply data aggregation techniques.
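The distinction between transformations and actions can be sketched with Python's lazy built-ins. This is an analogy under stated assumptions, not Spark code: in Spark, transformations only describe the work and an action triggers it, and Python's map and filter happen to behave the same way until their result is consumed. The traced_square helper and the log list are hypothetical names added to show when the work actually runs.

```python
# Records every value that has actually been processed.
log = []


def traced_square(x):
    """Square a number and record that the computation ran."""
    log.append(x)
    return x * x


numbers = [1, 2, 3, 4]

# "Transformations": these lines only build a lazy pipeline.
squared = map(traced_square, numbers)
evens = filter(lambda v: v % 2 == 0, squared)

# Nothing has been computed yet, so the log is still empty.
calls_before_action = len(log)

# "Action": materializing the result forces the whole pipeline to run.
result = list(evens)
calls_after_action = len(log)

print(calls_before_action, result, calls_after_action)
```

Deferring the work until a result is requested is what gives an engine the chance to plan and optimize the whole pipeline before running it.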
Development and Runtime Environment Options
In this module, you’ll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Spark cluster managers, their components, and their benefits. You’ll also learn how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, you’ll learn about Apache Spark application submission, including the use of Spark’s unified interface, “spark-submit,” and its options and dependencies. You’ll describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. You’ll also look at recommended practices for Spark's static and dynamic configuration options and perform hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.
Monitoring and Tuning
Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. You’ll also identify common Apache Spark application issues and learn about debugging issues using the application UI and locating related log files. Further, you’ll gain real-world knowledge about how Spark manages memory and processor resources in a hands-on lab.
Final Project and Assessment
In this module, you’ll perform a practice lab where you’ll explore two critical aspects of data processing using Spark: working with Resilient Distributed Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames to gain insights and manipulate the data effectively. Further, you’ll apply your knowledge in a final project where you will create a DataFrame by loading data from a CSV file and applying transformations and actions using Spark SQL. Finally, you’ll be assessed based on your learning from the course.

Good to know

Know what's good, what to watch for, and possible dealbreakers
Provides hands-on labs to practice concepts learned, thus offering practical experience and immediate feedback
Develops proficiency in big data tools, such as Apache Hadoop and Apache Spark, which are widely used in the industry
Taught by subject-matter experts: Romeo Kienzler, Rav Ahuja, and Aije Egwaikhide, who hold expertise in big data
Provides an overview of the platform, diving into the components of Apache Spark
Incorporates a blend of media, including videos, readings, and discussions, to create a multi-modal learning experience
Strengthens learners' foundation in big data, allowing for further advancement in the field

Reviews summary

Well-received big data course

Learners say that this course is well received with an overall rating of 79% positive, according to our analysis of 74 reviews. Many students praise the practical labs, comprehensive content, and engaging assignments that help make this a great course.
rated as largely positive
"This is really helpful for me to understand"
"That is a well packaged course"
"I found this course very interesting"
features engaging assignments
"The labs are short and have concise material."
"Fantastic blend of theory and practical (labs)."
"I like the content about Spark, it was well organised and demonstrated with hands-on lab"
has technical difficulties
"IBM stopped to maintain/support it's platform"
"The course is great, but when they updated the course a while ago, the certificate disappeared"
"Some labs environments where is not possible to work on the practical exercises."
includes difficult exams
"To much concepts and theory, i need more practice examples and less theory concepts"
"Too dry and technical"
"personally, I've found that the knowledge delivering method should be redone in a more attractive manner."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Introduction to Big Data with Spark and Hadoop with these activities:
Connect with experienced Big Data professionals
Seek guidance and support from individuals with expertise in Big Data.
  • Attend industry events or join online communities.
  • Reach out to professionals on LinkedIn or other platforms.
Attend a Big Data meetup or conference
Connect with other learners and industry professionals to expand your network and knowledge.
  • Find upcoming Big Data events in your area or online.
  • Register and attend the event.
  • Engage in discussions and ask questions.
Practice parallel processing concepts
Solidify your understanding of parallel processing concepts to prepare for Hadoop.
  • Review the concept of parallel processing.
  • Solve practice problems involving parallel processing.
Review basic probability and statistics
Refresh your knowledge of probability and statistics to support your understanding of data analysis concepts.
  • Review concepts such as probability distributions, hypothesis testing, and statistical inference.
  • Solve practice problems to reinforce your understanding.
Review parallel programming concepts
Strengthen your foundation in parallel programming before starting the course.
  • Review concepts such as concurrency, synchronization, and thread management.
  • Practice writing simple parallel programs.
Gather resources on Big Data tools
Organize and review relevant materials to enhance your understanding of Big Data tools.
  • Search for articles, tutorials, and documentation on Hadoop and Spark.
  • Compile the resources in a folder or online repository.
  • Review the resources to reinforce your knowledge.
Explore additional tutorials on big data technologies
Tutorials will provide supplemental instruction and insights that can complement the course content, enhancing your understanding of big data concepts.
  • Identify relevant tutorials on platforms like YouTube, Pluralsight, or LinkedIn Learning.
  • Choose tutorials that cover specific topics or concepts you want to delve deeper into.
  • Follow the tutorials and complete any accompanying exercises or challenges.
Explore Hadoop Ecosystem Tools
Get hands-on experience with Hadoop tools such as HDFS and MapReduce to enhance your understanding.
  • Find tutorials on Hadoop ecosystem tools.
  • Follow the tutorials to work with HDFS and MapReduce.
Solve Hadoop and Spark practice problems
Practice applying Hadoop and Spark concepts by solving problems and challenges.
  • Review Hadoop and Spark concepts covered in the course.
  • Find practice problems online or in textbooks.
  • Attempt to solve the problems on your own.
  • Compare your solutions to provided solutions or discuss with classmates.
Follow Spark SQL tutorials
Gain hands-on experience with Spark SQL by following guided tutorials.
  • Find Spark SQL tutorials online or in documentation.
  • Follow the tutorials step-by-step.
  • Experiment with different queries and datasets.
Perform practice problems on Hadoop and Spark
Practice problems will reinforce your understanding of Hadoop and Spark concepts and help you develop proficiency in using these tools.
  • Access online practice problems or exercises from educational platforms like Coursera, edX, or Udemy.
  • Set aside dedicated time for practice and work through the problems.
  • Review your solutions and identify areas for improvement.
Build a Spark application
Apply your knowledge by building a functional Spark application.
  • Identify a problem or task that can be solved using Spark.
  • Design the application architecture.
  • Implement the application using Spark APIs.
  • Test and debug the application.
  • Deploy the application.
Develop a blog post or presentation on a specific aspect of big data
Creating content will allow you to synthesize your knowledge, demonstrate your understanding, and reinforce key concepts through the process of teaching others.
  • Choose a specific topic or concept related to big data that you want to explore further.
  • Research and gather information from reliable sources.
  • Organize your ideas and outline a structure for your blog post or presentation.
  • Create your content, ensuring it is well-written, visually appealing, and engaging.
  • Share your blog post or presentation with others and seek feedback.

Career center

Learners who complete Introduction to Big Data with Spark and Hadoop will develop knowledge and skills that may be useful to these careers:
Big Data Architect
Big Data Architects design and manage big data systems. This course may be useful for Big Data Architects as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Big Data Architects can improve their ability to design and manage big data systems that can handle the demands of big data.
Machine Learning Engineer
Machine Learning Engineers build and deploy machine learning models. This course may be useful for Machine Learning Engineers as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Machine Learning Engineers can improve their ability to build and deploy machine learning models that can handle the demands of big data.
Data Scientist
Data Scientists use data to build models and make predictions. This course may be useful for Data Scientists as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Scientists can improve their ability to build and train models that can make accurate predictions.
Data Warehouse Architect
Data Warehouse Architects design and build data warehouses. This course may be useful for Data Warehouse Architects as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Warehouse Architects can improve their ability to design and build data warehouses that can handle the demands of big data.
Data Security Analyst
Data Security Analysts protect data from unauthorized access, use, disclosure, disruption, modification, or destruction. This course may be useful for Data Security Analysts as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Security Analysts can improve their ability to protect data from unauthorized access, use, disclosure, disruption, modification, or destruction.
Data Engineer
Data Engineers design, build, and maintain data pipelines and systems. This course may be useful for Data Engineers as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Engineers can improve their ability to build and manage data pipelines and systems that can handle the demands of big data.
Cloud Architect
Cloud Architects design and manage cloud computing systems. This course may be useful for Cloud Architects as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Cloud Architects can improve their ability to design and manage cloud computing systems that can handle the demands of big data.
Systems Engineer
Systems Engineers design, implement, and maintain computer systems. This course may be useful for Systems Engineers as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Systems Engineers can improve their ability to design, implement, and maintain computer systems that can handle the demands of big data.
Data Governance Analyst
Data Governance Analysts develop and enforce data governance policies and procedures. This course may be useful for Data Governance Analysts as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Governance Analysts can improve their ability to develop and enforce data governance policies and procedures that can help organizations manage their data effectively.
Information Security Analyst
Information Security Analysts protect information systems from unauthorized access, use, disclosure, disruption, modification, or destruction. This course may be useful for Information Security Analysts as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Information Security Analysts can improve their ability to protect information systems from unauthorized access, use, disclosure, disruption, modification, or destruction.
Software Engineer
Software Engineers design, develop, and maintain software systems. This course may be useful for Software Engineers as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Software Engineers can improve their ability to design and develop software systems that can handle the demands of big data.
Database Administrator
Database Administrators maintain and manage databases. This course may be useful for Database Administrators as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Database Administrators can improve their ability to maintain and manage databases that can handle the demands of big data.
Business Analyst
Business Analysts analyze business data to identify opportunities and solve problems. This course may be useful for Business Analysts as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Business Analysts can improve their ability to analyze business data and make better recommendations to businesses.
Data Analyst
Data Analysts collect, analyze, interpret, and present data in order to help businesses make informed decisions. This course may be useful for Data Analysts as it provides a strong foundation in big data, Hadoop, and Spark, all of which are essential technologies for working with large and complex datasets. By understanding these technologies, Data Analysts can improve their ability to extract insights from data and make better recommendations to businesses.

Reading list

We've selected eight books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Introduction to Big Data with Spark and Hadoop.
The definitive guide to Hadoop, co-authored by one of the original creators of Hadoop. Provides an in-depth overview of Hadoop's architecture, components, and ecosystem, and includes hands-on examples and case studies.
The definitive guide to Apache Spark, co-authored by one of the original creators of Spark. Provides a comprehensive overview of Spark's architecture, components, and ecosystem, and includes hands-on examples and case studies.
Beginner-friendly guide to data science, covering topics such as data collection, data cleaning, data analysis, data visualization, and data presentation.
Provides a comprehensive overview of cloud computing, its principles and paradigms. Covers cloud computing technologies, architectures, and services, and includes case studies from various industries.
Provides a comprehensive overview of data mining, its tools and techniques. Covers data mining algorithms, techniques, and applications, and includes case studies from various industries.
Provides a practical introduction to data visualization. Covers data visualization techniques, tools, and best practices, and includes case studies from various industries.

Similar courses

Here are nine courses similar to Introduction to Big Data with Spark and Hadoop.
Big Data, Hadoop, and Spark Basics
Developing Spark Applications Using Scala & Cloudera
Apache Spark Fundamentals
Apache Spark 2.0 with Java -Learn Spark from a Big Data...
Getting Started with Apache Spark on Databricks
SQL Big Data Convergence - The Big Picture
Apache Spark for Data Engineering and Machine Learning
Master Big Data - Apache...
Data Engineering and Machine Learning using Spark

© 2016 - 2024 OpenCourser