We may earn an affiliate commission when you visit our partners.
Edward Viaene

Important update: Effective January The sandbox can still be downloaded, but the full install requires a Cloudera subscription to get access to the yum repository.

In this course you will learn Big Data using the Hadoop Ecosystem. Why Hadoop? It is one of the most sought-after skills in the IT industry. The average salary in the US is $112,000 per year, up to an average of $160,000 in San Francisco (source: Indeed).

The course is aimed at Software Engineers, Database Administrators, and System Administrators who want to learn about Big Data. Other IT professionals can also take this course, but might have to do some extra research to understand some of the concepts.

You will learn how to use the most popular software in the Big Data industry at the moment, using batch processing as well as realtime processing. This course will give you enough background to be able to talk about real problems and solutions with experts in the industry. Adding these technologies to your LinkedIn profile can attract recruiters and help you get interviews at the most prestigious companies in the world.

The course is very practical, with more than 6 hours of lectures. You will want to try everything out yourself, which adds multiple hours of learning. If you get stuck with the technology while trying, there is support available. I will answer your messages on the message boards and we have a Facebook group where you can post questions.

What's inside

Syllabus

Introduction

Course introduction, lecture overview, course objectives

This document provides a guide to doing the demos in this course

The 3 (or 4) V's of Big Data explained

What is Big Data? Some examples of companies using Big Data, like Spotify, Amazon, Google, and Tesla

What can we do with Big Data? Data Science explained.

How to build a Big Data System? What is Hadoop?

Hadoop Distributions: a comparison between Apache Hadoop, Hortonworks Data Platform, Cloudera, and MapR

How to install Hadoop? You can install Hadoop using Vagrant with VirtualBox / VMware, or on the Cloud using AWS. Hortonworks also provides a Sandbox.

This is a demo of how to install and use the Hortonworks Sandbox. The Sandbox is an alternative to the full installation using Ambari if your machine doesn't have a lot of memory available. You can also use both in conjunction.

A walkthrough of how to install the Hortonworks Data Platform (HDP) on your Laptop or Desktop

A walkthrough of how to install the Hortonworks Data Platform (HDP) on your Laptop or Desktop (Part II)

An introduction to HDFS, the Hadoop Distributed File System

Communications between the DataNode and the NameNode explained

An introduction to HDFS using hadoop fs -put. I also show how a file gets divided into blocks and where those blocks are stored.
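
To make the block division concrete, here is a minimal Python sketch of how a file's size maps onto block sizes. It assumes a block size of 128 MB (the actual value depends on the cluster's configuration), and `split_into_blocks` is a hypothetical helper for illustration, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size of 128 MB; clusters can configure this


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file of file_size bytes would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 300 MB file becomes two full blocks plus one smaller final block.
    print([size // mb for size in split_into_blocks(300 * mb)])
```

Under these assumptions a 300 MB file occupies blocks of 128, 128, and 44 MB. Where each block is stored is decided by the NameNode and is not modeled here.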

An introduction to downloading, uploading and listing files. This time I'm using the Ambari HDFS Viewer and the NameNode UI. I also show what configuration changes are necessary to make this work.

MapReduce WordCount, step by step explained
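
The WordCount steps can be sketched in plain Python. This is only an in-memory illustration of the map, shuffle, and reduce phases, not the Hadoop Java API used in the course; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

On a real cluster the mappers and reducers run on different nodes, and the shuffle moves data over the network; here everything happens in one process.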

A demo of MapReduce WordCount on our HDP cluster

In HDFS, files are divided into blocks and stored on the DataNodes. In this lecture we're going to see what happens when we're reading lines from files that potentially span multiple blocks.
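
One common way to handle lines that cross a boundary can be sketched as follows. This is a simplified model, assuming newline-delimited text and the rule that every reader except the first skips its leading (possibly partial) line, while each reader finishes any line that starts at or before the end of its split. `read_split` and `read_all` are hypothetical names, and this is not the actual Hadoop reader code.

```python
def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines that the reader responsible for bytes [start, end) would emit."""
    pos = start
    if start != 0:
        # Skip the first line: it is emitted by the reader of the previous split.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    # Read whole lines, even when the last one runs past the end of this split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def read_all(data: bytes, split_size: int) -> list[list[bytes]]:
    """Cut data into fixed-size splits and read each one independently."""
    return [
        read_split(data, start, min(start + split_size, len(data)))
        for start in range(0, len(data), split_size)
    ]


if __name__ == "__main__":
    text = b"alpha\nbravo\ncharlie\ndelta\n"
    for index, lines in enumerate(read_all(text, 8)):
        print(index, lines)
```

With this rule every line is emitted exactly once, no matter where the boundaries fall, at the cost of a reader sometimes fetching a few bytes from the next block.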

Introducing Yarn, and concepts like the ResourceManager, the Scheduler, the ApplicationsManager, the NodeManager, and the ApplicationMaster. I explain how an application is executed and the consequences when a node crashes.

A demo of an application executed using yarn jar. I provide an overview of Ambari Yarn metrics and the ResourceManager UI.

Ambari also exposes a REST API. Commands can be sent directly to this API. Ambari also lets you do unattended installs using Ambari Blueprints.

A demo showing you the Ambari API and how to work with blueprints

An introduction to ETL processing in Hadoop. MapReduce, Pig, and Spark are suitable to do batch processing. Hive is more suitable for data exploration.

An introduction to Pig and Pig Latin.

This demo shows how to install Pig and Tez using Ambari on the Hortonworks Data Platform

In this demo I will show you basic Pig commands to load, dump and store data. I'll also show you an example of how to filter data.

More Pig commands in this final part of the pig demo. I'll go over commands like GROUP BY, FOREACH ... GENERATE and COUNT()

An introduction to Apache Spark. This lecture explains the differences between running spark-submit in local mode, yarn-cluster mode, and yarn-client mode.

An introduction to WordCount in Spark using Python (pyspark)

Spark installation using Ambari and a demo of the Spark Wordcount using the pyspark shell.

This lecture gives an introduction to Resilient Distributed Datasets (RDDs). This abstraction allows you to do transformations and actions in Spark. I give an example using filter RDDs, and explain how shuffle RDDs impact disk and network I/O.

A demo of RDD transformations and actions in Spark

An overview of the most common RDD actions and transformations

An overview of what Spark MLlib (Machine Learning Library) can do. I explain a Recommendation Engine example, and a Clustering Example (K-Means / DBSCAN).

An introduction to SQL on Hadoop using Hive, enabling data warehouse capabilities. This lecture provides an architecture overview and an overview of the Hive CLI and beeline using JDBC.

An overview of Hive Queries: creating tables, creating databases, inserting data, and selecting data. This lecture also shows where the Hive data is stored in HDFS.

A demo that shows the installation of HiveServer2 and the clients. Afterwards I show you a few example queries using a JDBC beeline connection.

Optimizing Hive can't be done using indexes. This lecture explains how queries in Hive should be optimized, using partitions and buckets. This lecture also covers User Defined Functions (UDFs) and Serialization / Deserialization.

The Stinger initiative brings optimizations to Hive. Query time has dropped significantly over the years. This lecture explains the details.

You can also use Hive in Spark using the Spark SQLContext.

All the lectures up until now were batch oriented. From now on we're going to discuss Realtime processing technologies like Kafka, Storm, Spark Streaming, and HBase / Phoenix.

An introduction to Kafka and its terminology like Producers, Consumers, Topics and Partitions.

An explanation of Kafka Topics covering Leader partitions, Follower partitions, and how writes are sent to the partitions. Also covers Consumer groups to show the difference between the publish-subscribe (pubsub) mechanism and queuing.
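
The relationship between keys, partitions, and consumer groups can be sketched in Python. This is an illustration only: it uses CRC32 as a stand-in hash (Kafka's own partitioner uses a different hash function) and a simple round-robin assignment, and `partition_for` and `assign_partitions` are hypothetical helpers, not Kafka client calls.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from the key's hash, so the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers: list[str], num_partitions: int) -> dict[str, list[int]]:
    """Spread partitions round-robin over the consumers of ONE group; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Two consumers in one group divide the partitions between them (queuing).
    print(assign_partitions(["billing-1", "billing-2"], 4))
    # A second group with a single consumer receives every partition again (publish-subscribe).
    print(assign_partitions(["audit-1"], 4))
```

Within one group the work is divided, so each message is handled by one consumer. Across groups every group sees the full topic, which is the publish-subscribe behavior.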

Kafka guarantees at-least-once message delivery, but can also be configured for at-most-once. Log Compaction is a technique that Kafka provides to have a full dataset maintained in the commit log. This lecture shows an example of a customer dataset fully kept in Kafka and explains Log Tail, Cleaner Point and Log Head and how it impacts consumers.
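
The idea behind log compaction can be illustrated with a small Python sketch: for each key, only the most recent record survives, and surviving records keep their original offsets. This simplification compacts the whole log at once and ignores delete markers; `compact` is a hypothetical function, not a Kafka API.

```python
def compact(log):
    """Keep only the latest record per key, preserving original offsets and order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


if __name__ == "__main__":
    customer_log = [
        ("c1", "Alice"),
        ("c2", "Bob"),
        ("c1", "Alice Smith"),
        ("c3", "Carol"),
        ("c2", "Bob Jones"),
    ]
    for record in compact(customer_log):
        print(record)
```

A consumer that reads the compacted log from the beginning still ends up with the latest value for every customer, which is what makes it possible to keep a full dataset in Kafka.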

A few example use cases of Kafka

The installation of Kafka on the Hortonworks Data Platform and a demo of a producer - consumer example.

This lecture provides an introduction to Storm, a realtime computing system. The architecture overview explains components like Nimbus, Zookeeper, and the Supervisor

This lecture explains what Storm topologies are. I talk about streams, tuples, spouts, and bolts.

A demo of a Storm Topology ingesting data from Kafka and doing computation on the data.

Message Delivery explained:

  • At most once delivery
  • At least once delivery
  • Exactly once delivery

This lecture also explains Storm's reliability API (Anchoring and Acking) and the performance impact of acking.
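
The difference between the first two guarantees comes down to when progress is recorded relative to processing. The following Python sketch simulates a single consumer that crashes once and restarts from its last committed offset. It is a generic illustration, not Storm's or Kafka's API, and `consume` and its parameters are hypothetical.

```python
def consume(messages, commit_first, crash_at=None):
    """Process messages in order, simulating one crash at index crash_at.

    commit_first=True  records progress before processing (at most once).
    commit_first=False records progress after processing (at least once).
    """
    committed = 0      # next offset to read after a restart
    processed = []
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before processing: the message is lost
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])
            if offset == crash_at and not crashed:
                crashed = True
                continue  # crash before committing: the message is redelivered
            committed = offset + 1
    return processed


if __name__ == "__main__":
    print(consume(["a", "b", "c"], commit_first=True, crash_at=1))
    print(consume(["a", "b", "c"], commit_first=False, crash_at=1))
```

With a crash at the second message, the first call drops "b" and the second call processes "b" twice. Exactly-once delivery needs extra machinery on top of this, such as idempotent or transactional updates.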

An introduction to the Trident API, an alternative interface for Storm that supports exactly-once processing of messages.

Spark streaming is an alternative to Storm that gained a lot of popularity in the last few years. It allows you to reuse the code you wrote in batch and use it for stream processing.

Spark Streaming generates DStreams, micro-batches of RDDs. This lecture explains the Spark Streaming Architecture

This lecture explains possible receivers, like Kafka. It also shows a WordCount streaming example, where data is ingested from Kafka and processed using WordCount in Spark Streaming

This demo shows the Kafka-spark-streaming example.

In the previous lecture we did a WordCount using Spark Streaming, but our example was stateless. In this lecture I'm adding state, using updateStateByKey to keep state and checkpointing to save the data to HDFS.
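
The stateful pattern can be sketched without a cluster. The plain-Python example below mimics the shape of the update function used with updateStateByKey (new values for a key plus its previous state) and applies it to a sequence of micro-batches. It does not use Spark, and checkpointing is not modeled; `process_batch` is a hypothetical driver.

```python
def update_count(new_values, running_count):
    """Update function: combine this batch's values for a key with the key's previous state."""
    return sum(new_values) + (running_count or 0)


def process_batch(state, batch_words):
    """Apply update_count to every known key for one micro-batch and return the new state."""
    new_values = {}
    for word in batch_words:
        new_values.setdefault(word, []).append(1)
    updated = {}
    # Keys with no new data are still passed through the update function with an empty list.
    for word in set(state) | set(new_values):
        updated[word] = update_count(new_values.get(word, []), state.get(word))
    return updated


if __name__ == "__main__":
    state = {}
    for batch in (["a", "b", "a"], ["b", "c"]):
        state = process_batch(state, batch)
        print(sorted(state.items()))
```

The stateless version would report counts per batch only; carrying `state` from one batch to the next is what turns it into a global WordCount.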

A demo of a stateful Spark Streaming application. It performs a global WordCount on a Kafka topic and checkpoints to HDFS.

More Spark Streaming Features, like Windowing and streaming algorithms

Introduction to HBase: a realtime, distributed, scalable, big data store on top of Hadoop. The lecture also briefly explains the CAP theorem.

An HBase table is different from a table in a Relational Database. This lecture explains the differences and talks about the row key, Column Families, Column Qualifiers, versions, and regions.

A lecture that explains the hbase:meta table, whose location is retrieved using ZooKeeper when a client connects. This way the client knows which RegionServer to contact to read/write data.

This lecture shows how a write (a PUT request) is handled by HBase. It shows how writes go to the WAL (write-ahead log) and the MemStore. I also show how flushes work to persist the data in HDFS.
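
The write path can be modeled with a toy Python class. This sketch assumes a flush is triggered by a row count (a real flush is driven by memory size) and represents each flushed file as a sorted list in memory. `ToyRegion` is hypothetical and not related to the HBase client API.

```python
class ToyRegion:
    """A toy model of the write path: log first, then memory, then flush to a sorted file."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log used for crash recovery
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable, sorted files produced by flushes
        self.flush_threshold = flush_threshold  # assumed: flush after this many rows

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. record the write in the log
        self.memstore[row_key] = value     # 2. apply it to the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered rows out as one sorted file and clear the buffer."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}


if __name__ == "__main__":
    region = ToyRegion()
    for key in ("row-c", "row-a", "row-b", "row-d"):
        region.put(key, "value")
    print(region.hfiles)
    print(region.memstore)
```

Because every write is in the log before it is acknowledged, rows that were still only in memory at the time of a crash can be replayed from the log.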

HBase reads go to the MemStore and the BlockCache first, then to HFiles on HDFS. The lecture shows how indexes and Bloom filters are used to speed up reads from disk.
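
A Bloom filter answers "is this key definitely absent?" without reading the file. The Python sketch below is a generic, minimal version that assumes SHA-256 based hashing and one byte per bit for simplicity; it is not how HBase implements its filters, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, key: str):
        """Derive num_hashes bit positions for a key by seeding the hash input."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        """False means the key was never added; True means it may have been."""
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    row_keys = BloomFilter()
    for key in ("row-001", "row-002", "row-003"):
        row_keys.add(key)
    print(row_keys.might_contain("row-002"))
```

When the filter says a row key is absent, the corresponding file can be skipped entirely, which saves a disk read.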

HBase does minor and major compactions to merge HFiles in HDFS.

This lecture explains how a crash recovery in HBase happens, how Zookeeper and the HMaster are involved, how recovery uses the WAL files and how data is persisted to disk after a crash.

When tables become bigger, they split. This lecture explains how Regions are split and balanced over the RegionServers, and how pre-splitting can help with performance.

HBase hotspotting is something to avoid. This lecture explains when hotspotting can happen and how to avoid it using salting.
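
Salting can be sketched as adding a small, deterministic prefix to each row key so that keys that sort next to each other (such as timestamps) are spread over several key ranges. The example below assumes four buckets and uses CRC32 as the hash; `salted_key` is a hypothetical helper, not an HBase or Phoenix function.

```python
import zlib


def salted_key(row_key: str, num_buckets: int = 4) -> str:
    """Prefix the row key with a bucket number derived from the key itself."""
    salt = zlib.crc32(row_key.encode("utf-8")) % num_buckets
    return f"{salt:02d}-{row_key}"


if __name__ == "__main__":
    # Sequential timestamps would otherwise all be written to the same region.
    for second in range(5):
        print(salted_key(f"2024-01-01T00:00:0{second}"))
```

Because the salt is derived from the key, a reader can recompute it for a point lookup. The trade-off is that a range scan has to query every bucket and merge the results.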

This demo shows how to install HBase using Ambari.

This demo gives you an introduction to the HBase Shell, where tables can be created, data can be retrieved using get / scan, and data can be written using put.

An example of a stateful Spark Streaming application that ingests data from a Kafka topic, runs the wordcount on the data, and stores the data in an HBase table.

An introduction to Phoenix, which brings SQL back into HBase.

An overview of Phoenix features like Salting, Compression, and Indexes. All implemented using standard SQL commands to make it easier for the database administrators and analysts to use HBase.

More Phoenix features like JOINs, VIEWs, and a Phoenix in Spark plugin.

A demo showing the Phoenix features

An introduction to Kerberos, which we are going to use to secure our Hadoop cluster

An overview of different deployment strategies of Kerberos in Hadoop

Getting familiar with Kerberos Technologies like Principals, Realms, and keytabs

A demo showing you how to install MIT Kerberos and enable Kerberos in Ambari, and how this impacts the users using HDFS

Introduction to SPNEGO, protecting the HTTP interfaces in Hadoop against unauthorized access

A demo showing how SPNEGO works

The Knox gateway provides a single entry point to the Hadoop APIs and UIs. This lecture explains the Knox gateway architecture and how it can be used.

This lecture gives an introduction to Ranger, which can be used for access control on the Hadoop services (authorization)

Demo of installing Ranger using Ambari

A demo of Ranger with Hive. Ranger can be used to put granular access controls on Hive databases, tables, and columns.

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers
Covers batch processing and real-time processing, which are essential for building modern data pipelines and handling diverse data workloads
Explores the Hadoop ecosystem, which is a foundational technology for big data processing and storage in many organizations
Includes hands-on demos using the Hortonworks Data Platform (HDP), allowing learners to gain practical experience with a widely used Hadoop distribution
Requires a Cloudera subscription to get access to the yum repository, which may pose a barrier to entry for some learners
Uses Ambari for installation and management, which may not be the latest standard, as Cloudera has moved to Cloudera Manager
Focuses on the Hortonworks Data Platform, whose vendor, Hortonworks, merged with Cloudera in 2019, so some technologies may be outdated

Reviews summary

Comprehensive Hadoop ecosystem introduction

According to learners, this course offers a comprehensive overview of the Hadoop ecosystem, covering essential tools like HDFS, Spark, Hive, and Kafka. Many find the theoretical explanations clear and the practical demonstrations helpful for grasping concepts. It's often seen as a solid foundation for those new to Big Data or looking to transition careers. However, a significant challenge highlighted by students is the difficulty with environment setup, particularly regarding installing the necessary platforms and potential costs associated with current distributions. While providing a broad introduction, some students feel it lacks depth on more advanced topics, suggesting it's best suited for beginners or intermediate learners.
Good introduction, sometimes lacks depth
"This is a good high-level overview, but I feel I need more in-depth coverage on optimizing specific components like Spark jobs."
"The pace was just right for me as someone completely new to Big Data technologies."
"Some sections felt a bit rushed, while others went over very basic concepts I already knew."
"Provides a solid foundation, but isn't sufficient if you need to become an expert in any single technology."
Instructor explains complex ideas well
"The instructor did a commendable job of breaking down complex topics like HDFS internals and Kafka concepts."
"I found the explanations clear and easy to follow, especially the architectural diagrams."
"He seemed very knowledgeable and was passionate about teaching the subject matter."
"Complex ideas were presented in an understandable way for someone new to the field."
Many demos help solidify concepts
"The hands-on demos for Spark and Hive were incredibly helpful for seeing how things work in practice."
"I appreciated the step-by-step walkthroughs provided for setting things up, even though it was tricky."
"Seeing the code examples run and the results made the theoretical concepts much clearer for me."
"The demos helped bridge the gap between theory and practical application."
Covers a wide range of Hadoop tools
"I got a great overview of HDFS, MapReduce, Spark, Hive, Kafka, and HBase in this course."
"The course provided a really broad introduction to the core technologies within the Hadoop ecosystem."
"It touches upon pretty much all the main components you need to know about Big Data."
"I feel like I understand the different pieces of the Hadoop puzzle now."
Setting up the environment is hard
"Getting the HDP sandbox or the full installation running was a major hurdle; I spent a lot of time on setup."
"The installation instructions seemed a bit outdated or didn't quite match my system, leading to many errors."
"The mention of needing a Cloudera subscription for a full install is a significant barrier and extra cost."
"I felt like I spent more time troubleshooting environment issues than actually learning the course material."
"Setting up the cluster environment was the hardest part by far."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Learn Big Data: The Hadoop Ecosystem Masterclass with these activities:
Review Basic Linux Commands
Familiarize yourself with basic Linux commands to navigate the Hadoop file system and manage Hadoop processes more effectively.
  • Review common commands like ls, cd, mkdir, rm, cp, mv.
  • Practice using these commands in a virtual Linux environment.
  • Familiarize yourself with file permissions and ownership.
Review: Hadoop: The Definitive Guide
Deepen your understanding of Hadoop's core components and architecture by studying a definitive guide.
  • Read the chapters on HDFS and MapReduce.
  • Study the examples provided in the book.
  • Experiment with the concepts on a local Hadoop installation.
Build a Simple Data Pipeline with Pig
Gain practical experience with Pig by building a data pipeline to process and analyze a sample dataset.
  • Choose a sample dataset (e.g., web server logs).
  • Write a Pig script to load, transform, and analyze the data.
  • Run the Pig script on a Hadoop cluster.
  • Analyze the results and refine the script as needed.
Review: Kafka: The Definitive Guide
Enhance your understanding of Kafka's architecture and features by studying a definitive guide.
  • Read the chapters on Kafka's architecture and core concepts.
  • Study the examples provided in the book.
  • Experiment with Kafka producers and consumers.
Create a Blog Post on Spark Streaming
Solidify your knowledge of Spark Streaming by writing a blog post explaining its concepts and use cases.
  • Research Spark Streaming and its applications.
  • Outline the key concepts and features of Spark Streaming.
  • Write a clear and concise blog post with examples.
  • Publish the blog post on a platform like Medium or your personal website.
Practice HBase Data Modeling
Improve your HBase skills by designing data models for different use cases and practicing data access patterns.
  • Choose several use cases (e.g., time-series data, user profiles).
  • Design HBase data models for each use case.
  • Implement data access patterns using the HBase shell or Java API.
  • Evaluate the performance of your data models and optimize as needed.
Create a Data Visualization Dashboard
Showcase your big data skills by creating a data visualization dashboard using tools like Tableau or Apache Superset to present insights from a Hadoop dataset.
  • Choose a relevant dataset from your Hadoop environment.
  • Use Hive or Spark to query and transform the data.
  • Design and build a data visualization dashboard using a suitable tool.
  • Present the dashboard and explain the insights you've uncovered.

Career center

Learners who complete Learn Big Data: The Hadoop Ecosystem Masterclass will develop knowledge and skills that may be useful to these careers:
Hadoop Developer
A Hadoop developer builds and maintains applications that run on the Hadoop platform. This course helps a Hadoop developer by providing hands-on experience with the core components of the Hadoop ecosystem. The course covers HDFS, MapReduce, and Yarn, which are essential for developing distributed data processing applications. The sections on Pig, Hive, and Spark are particularly relevant for data transformation and analysis within Hadoop. The course includes practical demos and exercises that Hadoop developers can use to gain proficiency in writing and deploying applications on Hadoop clusters. Learning about Hadoop security is also beneficial.
HBase Administrator
An HBase administrator is responsible for the setup, configuration, and maintenance of HBase clusters. HBase is a NoSQL database that runs on top of Hadoop. This course helps an HBase administrator by teaching about HBase architecture, including the meta table, regions, and RegionServers. The course also covers how writes and reads are handled, the importance of the WAL, and compaction. It also touches on topics like crash recovery, region splitting, and hotspotting.
Big Data Architect
A big data architect designs and oversees the implementation of big data solutions. This course may be useful to a big data architect by providing a comprehensive understanding of the Hadoop ecosystem. The course covers various components, including HDFS, MapReduce, Yarn, Pig, Hive, and Spark, which are foundational for building scalable and efficient data architectures. A big data architect would benefit from the sections on real-time processing with Kafka, Storm, and Spark Streaming, as well as data storage solutions like HBase and Phoenix. The security aspects covered, such as Kerberos and Ranger, are essential for designing secure big data environments.
Spark Developer
A Spark developer focuses on building and optimizing applications using Apache Spark, a unified analytics engine for big data processing. This course helps a Spark developer by introducing them to Spark, explaining the differences between local mode, yarn-cluster, and yarn-client. The course provides an introduction to WordCount in Spark using Python, and dives into Resilient Distributed Datasets, which are crucial for Spark development.
Data Engineer
A data engineer designs, builds, and manages the infrastructure required for data storage, processing, and analysis. This course may be helpful to a data engineer because it covers essential big data technologies within the Hadoop ecosystem. Understanding Hadoop Distributed File System, MapReduce, and Yarn will enable data engineers to handle large datasets efficiently. Experience with Pig, Hive, and Spark helps in data transformation and warehousing. Real-time processing technologies like Kafka, Storm, and Spark Streaming, and data stores like HBase are crucial for building robust data pipelines.
Kafka Engineer
A Kafka engineer specializes in designing, implementing, and managing Apache Kafka, a distributed streaming platform. This course may be useful to someone pursuing this role by diving into Kafka architecture, terminology, and use cases. Specifically, the course goes over topics, partitions, producers, and consumers. Also touched on are message delivery guarantees and log compaction, providing a solid foundation.
Data Scientist
A data scientist analyzes large datasets to extract insights and develop predictive models. This course may be useful to a data scientist by equipping one with the skills to work with big data technologies. The course covers Hadoop, Spark, and related tools, which are essential for processing and analyzing large datasets. Understanding MapReduce, Spark MLlib, and Hive allows a data scientist to perform data mining, machine learning, and data warehousing tasks. Real-time processing technologies like Kafka and Spark Streaming are also beneficial for analyzing streaming data. A data scientist can also use the material on security to ensure compliance and protect data.
ETL Developer
An extract, transform, load developer designs and implements processes to extract data from various sources, transform it into a usable format, and load it into a data warehouse or data lake. This course helps an ETL developer by providing insight into how to do ETL processing in Hadoop. MapReduce, Pig, and Spark are covered and are suitable for doing batch processing. Hive is more suitable for data exploration. A solid foundation may prove to be valuable when looking into this role.
Database Administrator
A database administrator manages and maintains databases, ensuring their availability, performance, and security. This course may be useful to a database administrator looking to expand their skills into the realm of big data. It covers Hadoop, HBase, and Hive, which are commonly used for storing and processing large datasets. The course provides insights into data storage, data processing, and data warehousing within the Hadoop ecosystem. Knowledge of Kerberos, SPNEGO, and Ranger, covered in the course, helps the database administrator in securing Hadoop clusters and managing user access.
System Administrator
A system administrator is responsible for the configuration, management, and maintenance of computer systems and servers. This course may be useful to a system administrator looking to manage Hadoop clusters. The course provides insights into installing and configuring Hadoop, HDFS, and Yarn. Understanding the different Hadoop distributions, such as Apache Hadoop, Hortonworks Data Platform, and Cloudera, helps in choosing the right platform. Moreover, the course covers security aspects like Kerberos and Ranger, which are crucial for securing Hadoop environments. A system administrator can use the knowledge to optimize big data environments.
Machine Learning Engineer
A machine learning engineer develops, deploys, and maintains machine learning models and systems, often working with large datasets and distributed computing frameworks. This course may be useful to a machine learning engineer by providing the necessary skills to handle big data for model training and deployment. The course covers Spark MLlib, a machine learning library within the Spark ecosystem, and how to use it for recommendation engines and clustering. Understanding real-time processing with Kafka and Spark Streaming is also crucial for building real-time machine learning applications.
Data Analyst
A data analyst collects, processes, and performs statistical analyses of data. This course may be useful to a data analyst interested in the Hadoop ecosystem. The course covers essential tools like Hive and Pig, which are used for data extraction, transformation, and loading. Additionally, the course provides insight into data warehousing, which can be very useful for data analysts. This can help a data analyst build a solid foundation.
Cloud Solutions Architect
A cloud solutions architect designs and implements cloud-based solutions, often involving big data technologies. This course may be useful to a cloud solutions architect who needs to deploy and manage Hadoop clusters in the cloud. The course mentions installing Hadoop on AWS, providing a practical understanding of cloud deployment. Familiarity with Hadoop distributions and components helps the cloud solutions architect design scalable and cost-effective big data solutions. This course builds a knowledge base about Hadoop.
Business Intelligence Analyst
A business intelligence analyst analyzes data to identify trends and insights that can improve business decision-making. This course may be useful to a business intelligence analyst who needs to work with large datasets stored in Hadoop. The course covers Hive, which allows analysts to query and analyze data using SQL-like syntax. The course can also aid familiarity with data warehousing concepts within the Hadoop ecosystem. A business intelligence analyst can leverage this knowledge to extract valuable insights from big data sources.
Solutions Architect
A solutions architect designs and implements comprehensive technology solutions that align with business needs, leveraging various technologies and platforms. This course may be useful to a solutions architect working with big data projects. By understanding the Hadoop ecosystem, including HDFS, MapReduce, Spark, Kafka, and HBase, the solutions architect can design scalable and efficient data processing pipelines. The course also covers security aspects like Kerberos and Ranger, essential for building secure solutions.

Reading list

We've selected two books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Learn Big Data: The Hadoop Ecosystem Masterclass.
Comprehensive guide to Hadoop, covering HDFS, MapReduce, and YARN in detail. It provides a solid foundation for understanding the core concepts of Hadoop and its ecosystem. It is commonly used as a textbook in academic institutions. Reading this book will significantly enhance your understanding of the underlying principles of Hadoop.
Provides a comprehensive overview of Apache Kafka, covering its architecture, design principles, and use cases. It delves into topics such as Kafka Streams, Kafka Connect, and Kafka's integration with other big data technologies. This book is particularly useful for understanding real-time data processing with Kafka. It serves as a valuable reference for both beginners and experienced Kafka users.
