The Ultimate Hands-On Hadoop: Tame your Big Data! from Udemy

What's inside

Syllabus

Identify the major components of the Hadoop ecosystem, and run Hadoop on your desktop.

How to ask questions, tune the video playback, enable captions, and leave reviews.

After a quick intro, we'll dive right in and install Hortonworks Sandbox in a virtual machine right on your own PC. This is the quickest way to get up and running with Hadoop so you can start learning and experimenting with it. We'll then download some real movie ratings data, and use Hive to analyze it!

The activities in this course use the Hortonworks Data Platform (HDP). But Hortonworks merged with Cloudera, and they're working on a new thing called CDP. Don't worry... here's why.

What's Hadoop for? What problems does it solve? Where did it come from? We'll learn Hadoop's backstory in this lecture.

We'll take a quick tour of all the technologies we'll cover in this course, and how they all fit together. You'll come out of this lecture knowing all the buzzwords!

Learn how Hadoop's Distributed Filesystem allows you to store massive data sets across a cluster of commodity computers, in a reliable and scalable manner.

Before we can analyze movie ratings data from GroupLens using Hadoop, we need to load it into HDFS. You don't need to mess with command lines or programming to use HDFS. We'll start by importing some real movie ratings data into HDFS just using a web-based UI provided by Ambari.

Developers might be more comfortable interacting with HDFS via the command line interface. We'll import the same data, this time from a terminal prompt.

Learn how mappers and reducers provide a clever way to analyze massive distributed datasets quickly and reliably.

Learn what makes MapReduce so powerful, by horizontally scaling across a cluster of computers.

Let's look at a very simple example of MapReduce - counting how many of each rating type exists in our movie ratings data.
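To make the idea concrete, here is a minimal plain-Python sketch of the map, shuffle, and reduce phases for this rating breakdown. It is not the course's script and it does not run on Hadoop. The function names, the sample rows, and the tab-separated userID, movieID, rating, timestamp line layout are assumptions for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit (rating, 1) for one tab-separated line: userID, movieID, rating, timestamp."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1

def shuffle(pairs):
    """Group values by key, which a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(rating, counts):
    """Sum the 1s that were emitted for a single rating value."""
    return rating, sum(counts)

def count_ratings(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample rows, not real MovieLens data.
sample = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t172\t5\t881250951",
    "4\t133\t1\t881250952",
]
rating_counts = count_ratings(sample)
print(rating_counts)
```

On a real cluster the mapper and reducer would run in parallel on different machines, and the framework would handle the shuffle step for you.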

We need to adjust the HDP sandbox's repository paths prior to installing the software we need for subsequent activities.

We'll study our code for building a breakdown of movie ratings, and actually run it on your system!

As a challenge, see if you can write your own MapReduce script that sorts movies by how many ratings they received. I'll give you some hints, set you off, and then review my solution to the problem.

Let's see how I solved the challenge from the previous lecture - we'll change our script to count movies instead of ratings, and then review and run my solution for sorting by rating count.
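One way to think about this kind of problem is as two chained stages: first count ratings per movie, then re-key the results by count so that sorting on the key orders the movies. The plain-Python sketch below illustrates that idea only. It is not the instructor's solution, and the function names, sample rows, and tab-separated line layout are assumptions.

```python
from collections import Counter

def count_by_movie(lines):
    """Stage 1: count how many ratings each movieID received."""
    counts = Counter()
    for line in lines:
        user_id, movie_id, rating, timestamp = line.split("\t")
        counts[movie_id] += 1  # same effect as emitting (movie_id, 1) and summing
    return counts

def sort_by_count(counts):
    """Stage 2: re-key as (count, movieID) so sorting by key orders by rating count."""
    rekeyed = [(count, movie_id) for movie_id, count in counts.items()]
    return sorted(rekeyed)

# Hypothetical sample rows, not real MovieLens data.
sample = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t133\t1\t881250951",
    "4\t50\t3\t881250952",
    "5\t172\t5\t881250953",
    "6\t133\t2\t881250954",
]
movies_by_count = sort_by_count(count_by_movie(sample))
print(movies_by_count)
```

The counts here are compared as integers. If the keys pass between stages as text, they may need padding so that string ordering matches numeric ordering.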

Ambari is Hortonworks' web-based UI (similar to Hue, used by Cloudera). We can use it as an easy way to experiment with Pig, so let's take a closer look at it before moving ahead.

An overview of what Pig is used for, who it's for, and how it works.

We'll use Pig to script a chain of queries on MovieLens to solve a more complex problem.

Let's actually run our example from the previous lecture on your Hadoop sandbox, and find some good, old movies!

We covered most of the basics of Pig in our example, but let's look at what else Pig Latin can do.

I'll give you some pointers, and challenge you to write your own Pig script that finds the most popular really bad movie!

Let's look at my code for finding the most popular bad movies, and you can compare my results to yours.

What's so special about Spark? Learn how its efficiency and versatility make Apache Spark one of the hottest Hadoop-related technologies right now, and how it achieves this under the hood.

The core building block of Spark is the RDD; learn how RDDs are used and the functions available on them.

As an example, let's write a Spark script to find the movie with the lowest average rating. We'll start by doing it just with RDDs.
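The sequence of steps in a job like this can be sketched without Spark at all. The plain-Python version below mirrors the usual RDD-style pipeline (map each line to a key/value pair, combine values per key, compute an average, sort), but it is a stand-in for illustration only and not the course's script. The function name, sample rows, and tab-separated line layout are assumptions.

```python
def lowest_rated(lines):
    """Return (movieID, average rating) pairs, lowest average first."""
    # Map step: each line becomes (movieID, (rating, 1)).
    pairs = []
    for line in lines:
        fields = line.split("\t")
        pairs.append((int(fields[1]), (float(fields[2]), 1)))

    # Reduce-by-key step: add up rating totals and rating counts per movie.
    totals = {}
    for movie_id, (rating, count) in pairs:
        running_total, running_count = totals.get(movie_id, (0.0, 0))
        totals[movie_id] = (running_total + rating, running_count + count)

    # Map-values step: turn (total, count) into an average.
    averages = {m: total / n for m, (total, n) in totals.items()}

    # Sort step: ascending by average rating.
    return sorted(averages.items(), key=lambda item: item[1])

# Hypothetical sample rows, not real MovieLens data.
sample = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t133\t1\t881250951",
    "4\t133\t2\t881250952",
    "5\t172\t3\t881250953",
]
lowest = lowest_rated(sample)
print(lowest[0])
```

Carrying a (total, count) pair per movie, rather than averaging early, is what lets the per-key combine step run in any order across a cluster.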

Spark 2.0 placed a new emphasis on Datasets and SparkSQL. Learn how Datasets can make your Spark scripts even faster and easier to write.

Let's revisit the previous problem of finding the lowest-rated movies, but this time using DataFrames.

As an example of the more complicated things Spark is capable of, we'll use Spark's machine learning library to produce movie recommendations using the ALS algorithm.

As a very simple exercise, we'll build upon our earlier activity to filter the results by movies with a given number of ratings.

We'll review my solution to the previous exercise, and run the resulting scripts.

An introduction to Apache Hive and how it enables relational queries on HDFS-hosted data.

We'll import the MovieLens data set into Hive using the Ambari UI, and run a simple query to find the most popular movies.
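Because Hive queries look a lot like SQL, the shape of a "most popular movies" query can be tried out with Python's built-in sqlite3 module. This is only a stand-in: it runs on SQLite rather than Hive, HiveQL syntax can differ, and the table name, column names, and sample rows are all assumptions.

```python
import sqlite3

# Hypothetical sample rows: (user_id, movie_id, rating, rating_time).
rows = [
    (1, 50, 5, 881250949),
    (2, 50, 4, 881250950),
    (3, 50, 3, 881250951),
    (4, 172, 5, 881250952),
    (5, 172, 4, 881250953),
    (6, 133, 1, 881250954),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, "
    "rating INTEGER, rating_time INTEGER)"
)
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)

# Count the ratings per movie and list the most-rated movies first.
query = """
    SELECT movie_id, COUNT(movie_id) AS rating_count
    FROM ratings
    GROUP BY movie_id
    ORDER BY rating_count DESC, movie_id
"""
popular = conn.execute(query).fetchall()
conn.close()
print(popular)
```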

Learn how Hive works under the hood of your Hadoop cluster, to efficiently query your data across a cluster using SQL commands. Well, technically it's HiveQL, but it will definitely seem familiar.

As a challenge, use this same Hive database to find the best-rated movie.

Compare your solution to mine for the exercise of finding the highest-rated movies using Hive.

A quick overview of MySQL and how it might fit into your Hadoop-based work.

Let's import the MovieLens data set into MySQL, and run a query to view the most popular movies just to see that it's working.

Learn how Sqoop works as a way to transfer data from an existing RDBMS like MySQL into Hadoop.

Sqoop can also work the other way - let's build a new table with Hive and export it back into MySQL.

Learn why "NoSQL" databases are important for efficiently and scalably vending your data.

HBase is a NoSQL columnar data store that sits on top of Hadoop. Learn what it's for and how it works.

We'll import our movie ratings into HBase through a RESTful service interface, using a Python script running on our desktop to both populate and query the table.

We'll see how HBase can integrate with Pig to store big data into HBase in a distributed manner.

Cassandra is a popular NoSQL database that is appropriate for vending data at massive scale outside of Hadoop.

In the next lecture, we'll install Cassandra into your sandbox. It's a complicated process, and a lot can go wrong. Really, if you're not pretty comfortable with Linux, you might want to just watch the exercises that involve Cassandra instead of running them yourself.

One common issue is ending up in a state where your RPM database (which keeps track of what packages you have installed on your system) becomes corrupt. You'll experience this as seeing an error message like this:

rpmts_HdrFromFdno – error: rpmdbNextIterator – Header V3 RSA/SHA1 Signature, key ID BAD

If you encounter this, "yum" will no longer work at all. But, there is a way to fix it.

Just enter the following commands as the root user (be sure you've already run "su root"). You can paste them into PuTTY by right-clicking in the PuTTY terminal window after copying them:

cd ~

wget http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm

rpm2cpio nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm | cpio -idmv

cp ./lib64/libfreeblpriv3.* /lib64

Now, yum should work again. Note that if you do a big "yum update" and the SSL library is updated, you may lose your connection via PuTTY. If you're disconnected, wait a couple of minutes to allow yum to finish what it's doing, issue an ACPI Shutdown command to the virtual machine (via the Machine menu), restart the sandbox, and connect again.

Cassandra isn't a part of Hortonworks, so we'll need to install it ourselves.

We'll modify our HBase example to write results into a Cassandra database instead, and look at the results.

MongoDB is a popular alternative to Cassandra. Learn what's different about it.

We'll install MongoDB on our virtual machine using Ambari. Then, we'll study and run a script to load up a Spark DataFrame of user data, store it into MongoDB, and query MongoDB to get users under 20 years old.

We'll query our movie user data using MongoDB's command line interface, and set up an index on it.

With so many options for choosing a database, how do you decide? We'll look at the requirements of given problems, such as consistency, latency, and scalability, and how that can inform your decision.

In the previous lecture, I challenged you to choose a database for a stock trading application. Let's talk about my own thought process in this decision, and see if we reached the same conclusion.

What is Drill and what problems does it solve?

We'll install Drill so we can play with it, after installing a Hive and MongoDB database to work with.

We'll use Drill to execute a query that spans data on MongoDB and Hive at the same time!

What is Phoenix for? How does it work?

We'll get our hands dirty with Phoenix and use it to query our HBase database.

We'll use Phoenix with Pig to store and load MovieLens users data, and accelerate queries on it.

What is Presto, and how does it differ from Drill and Phoenix?

We'll install Presto, and issue some queries on Hive through it.

We'll configure Presto to also talk to our Cassandra database that we set up earlier, and do a JOIN query that spans both data in Cassandra and Hive!

Learn how YARN works in more depth as it controls and allocates the resources of your Hadoop cluster.

Like Spark, Tez also uses Directed Acyclic Graphs to optimize tasks on your cluster. Learn how it works, and how it's different.

As an example of the power of Tez, we'll execute a Hive query with and without it.

Mesos is an alternative cluster manager to Hadoop YARN. Learn how it differs, who uses Mesos, and why.

ZooKeeper is a deceptively simple service for maintaining state across your cluster, like which servers are in service, in a highly reliable manner. Learn how it works, and what systems depend on ZooKeeper for reliable operation.

Let's use ZooKeeper's command line interface to explore how it works.

Oozie allows you to set up complex workflows on your cluster using multiple technologies, and schedule them. Let's look at some examples of how it works.

As a hands-on example, we'll use Oozie to import movie data into HDFS from MySQL using Sqoop, then analyze that data using Hive.

Apache Zeppelin provides a notebook-based environment for importing, transforming, and analyzing your data.

We'll set up a Zeppelin notebook to load movie ratings and titles into Spark dataframes, and interactively query and visualize them.

Hue is a popular alternative to Ambari views, especially on Cloudera platforms. Let's see what it offers and how it's different.

Let's talk about Chukwa and Ganglia, just so you know what they are.

Learn how Kafka provides a scalable, reliable means for collecting data across a cluster of computers and broadcasting it for further processing.

We'll get Kafka running, and set it up to publish and consume some data from a new topic.

We'll simulate a web server by monitoring an Apache log file using a Kafka connector, and watch Kafka pick up new lines in it.

Flume is another way to publish logs from a cluster. Learn about sinks and Flume's architecture, and how it differs from Kafka.

As a simple way to get started with Flume, we'll connect a source listening to a telnet connection to a sink that just logs information received.

As something closer to a real-world example, we'll configure Flume to monitor a directory on our local filesystem for new files, and publish their data into HDFS, organized by the time the data was received.

Spark Streaming allows you to write "continuous applications" that process micro-batches of information in real time. Learn how it works, about DStreams, windowing, and the new Structured Streaming API.

We'll write and run a Spark Streaming application that analyzes web logs as they are streamed in from Flume.

As a challenge, extend the previous activity to look for status codes in the web log and aggregate how often different status codes appear. Also, let's fiddle with the slide interval.
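The windowing idea behind this challenge can be sketched in plain Python before touching Spark. The example below treats each inner list as one micro-batch of log lines and reports status-code counts over the most recent batches, sliding forward one batch at a time. It is a stand-in and not Spark Streaming code. The regular expression, function names, and sample log lines are assumptions.

```python
import re
from collections import Counter, deque

# Assumes the status code follows the quoted request, as in common Apache access logs.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def extract_status(log_line):
    """Return the three-digit status code from a log line, or None if absent."""
    match = STATUS_RE.search(log_line)
    return match.group(1) if match else None

def windowed_status_counts(batches, window_length):
    """Yield status-code counts over the last `window_length` micro-batches."""
    window = deque(maxlen=window_length)  # oldest batch drops off automatically
    for batch in batches:
        codes = (extract_status(line) for line in batch)
        window.append(Counter(code for code in codes if code))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield dict(total)

# Hypothetical log lines grouped into three micro-batches.
batches = [
    ['1.2.3.4 - - [01/Jan/2020:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
     '1.2.3.4 - - [01/Jan/2020:00:00:02 +0000] "GET /missing HTTP/1.1" 404 128'],
    ['1.2.3.5 - - [01/Jan/2020:00:00:03 +0000] "GET /about HTTP/1.1" 200 256'],
    ['1.2.3.6 - - [01/Jan/2020:00:00:04 +0000] "POST /form HTTP/1.1" 500 64'],
]
results = list(windowed_status_counts(batches, window_length=2))
for counts in results:
    print(counts)
```

Here the window covers two batches and slides by one batch each step. Changing how often results are emitted relative to the window length is the kind of adjustment the slide interval controls.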

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers

Covers a wide array of technologies within the Hadoop ecosystem, offering a comprehensive overview for those looking to build expertise in big data processing and analysis

Includes web-based UIs for many activities, allowing individuals with limited programming knowledge to grasp the fundamentals of Hadoop and big data concepts

Provides a hands-on approach to learning Hadoop, allowing learners to gain practical experience through exercises and real-world examples, which is essential for solidifying understanding

Features technologies like Spark, Kafka, and Cassandra, which are highly sought after in the big data industry, making this course valuable for career advancement

Requires installing Cassandra, which the course itself admits can be a complicated process and may pose challenges for those not comfortable with Linux environments

Focuses on application development rather than Hadoop administration, so learners seeking expertise in cluster management may need to supplement their learning with additional resources

Reviews summary

Comprehensive hands-on guide to big data

According to learners, this course offers a positive and comprehensive introduction to the vast Hadoop and big data ecosystem. Students appreciate the course's hands-on activities and practical approach, which helps make complex topics understandable and provides marketable skills for career development. However, some reviews note that the content can become outdated quickly due to the fast pace of technology in this field, and report encountering installation and setup issues, which can be challenging to troubleshoot.

Course moves fast, covering many topics.

"Because so many technologies are covered, the pace can feel very fast, and depth is sometimes sacrificed."

"It's a mile wide and an inch deep on many topics, which is fine for an overview but not for mastery."

"I wish certain key areas, like Spark or Hive, had been explored in more detail."

"The sheer volume of information can be overwhelming at times."

Helpful for starting a big data career.

"This course was instrumental in helping me get my first big data job."

"It provided me with the foundational knowledge and buzzwords needed for interviews."

"Taking this course gave me the confidence to pursue roles in the big data space."

"A great starting point if you're trying to transition into a big data role."

Clear explanations from an experienced instructor.

"The instructor does a fantastic job of explaining complex concepts simply and clearly."

"I felt the instructor's real-world experience really showed in the way the topics were presented."

"His explanations made sense and were easy to follow, even for someone new to some of these tools."

"The lectures were well-structured and the teaching style was engaging."

Strong focus on practical exercises and labs.

"The hands-on labs and exercises were the most useful part; I learned by doing."

"I really appreciated the practical steps and coding examples; it wasn't just theory."

"Working with the actual sandbox environment helped solidify my understanding of how these tools work in practice."

"The projects helped me apply what I learned immediately."

Explores a wide array of big data tools.

"This course covers an incredible number of different big data technologies, giving a really broad overview of the ecosystem."

"I was amazed by how many tools and frameworks were introduced; it really helped me understand the landscape."

"Getting exposure to over 25 technologies was very valuable for understanding how things fit together, even if not deeply."

"It's a solid overview of Hadoop and related components like Spark, Hive, Kafka, and NoSQL databases."

Some course material may be outdated.

"Given how fast big data tech changes, some sections felt a bit outdated with older versions of tools."

"The rapid evolution of the Hadoop ecosystem means parts of the course might not reflect the latest best practices or versions."

"I found some of the examples or tools used were not the most current, which required extra research."

"Keeping up with updates in this space is hard, and the course reflects that challenge with some older content."

Setting up the environment can be difficult.

"Getting the Hortonworks sandbox set up correctly was a major headache and took a lot of time."

"I struggled with the installation steps; they seemed prone to errors depending on your system configuration."

"The VM setup process was frustrating and required significant troubleshooting outside the course material."

"Updating packages and dealing with dependencies in the sandbox was often problematic."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in The Ultimate Hands-On Hadoop: Tame your Big Data! with these activities:

Review Linux Command Line Basics

Reinforce your understanding of basic Linux commands, as the course involves working with the command line interface.

Complete an online tutorial covering basic Linux commands.
Practice navigating the file system using the terminal.
Experiment with file manipulation commands like cp, mv, and rm.

Review: Hadoop: The Definitive Guide

Deepen your understanding of Hadoop's core concepts and architecture, as this book provides comprehensive coverage of the Hadoop ecosystem.

Read the chapters on HDFS and MapReduce.
Study the examples provided in the book.
Take notes on key concepts and terminology.

Review: Spark: The Definitive Guide

Enhance your understanding of Apache Spark, as this book provides a comprehensive guide to its features and capabilities.

Read the chapters on Spark's core concepts and APIs.
Study the examples provided in the book.
Experiment with different Spark features and functionalities.

Follow Kafka Tutorials

Improve your proficiency with Kafka by following online tutorials that demonstrate its features and use cases.

Find online tutorials covering Kafka's core concepts.
Follow the tutorials to set up a Kafka cluster.
Experiment with producing and consuming messages using Kafka.
Explore Kafka's advanced features, such as Kafka Streams.

Blog Post: Hadoop Ecosystem Overview

Solidify your understanding of the Hadoop ecosystem by creating a blog post that explains the different components and their roles.

Research the different components of the Hadoop ecosystem.
Write a blog post explaining each component and its function.
Include diagrams or illustrations to enhance understanding.
Publish the blog post on a platform like Medium or your own website.

Analyze Movie Data with Hadoop

Apply your knowledge of Hadoop to a real-world problem by analyzing a movie dataset, similar to the examples used in the course.

Download a movie dataset from a source like MovieLens.
Load the data into HDFS.
Write MapReduce or Spark jobs to analyze the data.
Visualize the results using tools like Zeppelin or Hue.

Data Visualization Dashboard with Zeppelin

Practice creating data visualizations using Zeppelin to present insights derived from Hadoop-processed data.

Set up a Zeppelin notebook connected to your Hadoop cluster.
Load data from HDFS into Spark DataFrames.
Create visualizations using Zeppelin's built-in charting tools.
Design a dashboard to present key insights from the data.

Career center

Learners who complete The Ultimate Hands-On Hadoop: Tame your Big Data! will develop knowledge and skills that may be useful to these careers:

Data Engineer

A data engineer designs, builds, and manages the infrastructure that allows organizations to collect, process, and analyze large datasets. This course helps aspiring data engineers understand the Hadoop ecosystem, which is a core technology in big data processing. The course covers essential tools like HDFS and MapReduce for managing data on a cluster. You'll gain hands-on experience with Pig and Spark for analyzing data, and learn how to store and query data using Sqoop, Hive, and HBase. This knowledge prepares a data engineer to build and maintain efficient data pipelines.

Big Data Architect

A big data architect designs and oversees the implementation of big data solutions within an organization. This course helps provide a comprehensive understanding of the Hadoop ecosystem, enabling architects to make informed decisions about technology selection and system design. You'll explore various components like HDFS, MapReduce, Pig, Spark, Hive, and HBase. Furthermore, the course covers cluster management with YARN, Mesos, and ZooKeeper. This allows the big data architect to develop robust, scalable, and efficient big data architectures.

Hadoop Developer

A Hadoop developer writes code to process and analyze large datasets using the Hadoop framework. This course focuses on the practical skills needed to be a successful Hadoop developer. You'll learn how to install and work with a real Hadoop installation, manage big data with HDFS and MapReduce, and write programs to analyze data with Pig and Spark. The course also covers storing and querying data with tools like Sqoop and Hive. With hands-on activities, you'll gain experience writing real scripts on a Hadoop system using Scala, Pig Latin, and Python, essential for a Hadoop developer.

Spark Developer

A Spark developer uses the Apache Spark framework to build high-performance, distributed data processing applications. The course includes extensive coverage of Apache Spark, a crucial skill for any Spark developer. You'll learn how Spark achieves efficiency and versatility, and master the use of Resilient Distributed Datasets. The course guides you through writing Spark scripts to perform complex data transformations. This education equips a Spark developer with the skills to build scalable and efficient data processing solutions.

Data Scientist

A data scientist uses statistical analysis, machine learning, and data visualization techniques to extract insights from data. This course helps data scientists gain proficiency in using Hadoop to process and analyze large datasets. You'll learn how to use Spark's machine learning library to solve real-world problems, such as producing movie recommendations. Moreover, the course covers tools like Hive for querying data, enabling data scientists to perform complex analyses and derive valuable insights from big data.

Database Administrator

A database administrator is responsible for managing and maintaining databases, ensuring data integrity, security, and availability. This course helps database administrators learn how to integrate Hadoop with existing relational databases and manage NoSQL data stores. You'll explore tools like Sqoop for transferring data between Hadoop and relational databases, and learn about NoSQL databases like HBase and Cassandra. This knowledge is crucial for database administrators who need to manage diverse data environments and integrate Hadoop into their existing infrastructure.

Business Intelligence Analyst

A business intelligence analyst analyzes data to identify trends and insights that can help organizations make better business decisions. This course may be useful, as it helps analysts gain hands-on experience with tools like Hive for querying data stored in Hadoop. You'll learn how to use SQL-style queries to analyze data and extract valuable insights. The course also covers data visualization tools like Zeppelin, which enables analysts to create interactive dashboards and reports. This knowledge equips business intelligence analysts with the skills to work with big data and derive actionable insights.

Machine Learning Engineer

A machine learning engineer develops and deploys machine learning models to solve complex problems. This course may be useful, as it helps machine learning engineers learn how to use Spark's machine learning library to build and train models on large datasets. You'll gain experience with algorithms such as ALS, which the course uses to produce movie recommendations. The course also covers tools for data processing and feature engineering, which are essential for building effective machine learning models. This knowledge empowers machine learning engineers to leverage big data for model training and deployment.

Cloud Solutions Architect

Cloud solutions architects design and implement cloud-based solutions for organizations, ensuring scalability, reliability, and cost-effectiveness. This course may be useful, as it helps cloud solutions architects understand how to deploy and manage Hadoop clusters in the cloud. You'll learn about cluster management tools like YARN and Mesos, and explore various technologies for data storage and processing. The course provides insights into how to design and implement big data solutions in cloud environments, which is essential for cloud solutions architects.

Data Warehouse Architect

A data warehouse architect designs and oversees the construction of data warehouses, which are central repositories for storing and analyzing structured data. This course may be useful, as it helps data warehouse architects understand how to integrate Hadoop with existing data warehousing systems. You'll learn how to use Sqoop to transfer data between Hadoop and data warehouses, and explore tools like Hive for querying data. The course provides insights into how to leverage Hadoop for storing and processing large volumes of data within a data warehouse environment.

Systems Engineer

A systems engineer is responsible for designing, implementing, and managing the infrastructure that supports an organization's IT systems. This course may be useful, as it helps systems engineers learn how to manage Hadoop clusters and integrate them with existing infrastructure. You'll explore cluster management tools like YARN and ZooKeeper, and gain insights into how to monitor and troubleshoot Hadoop systems. The course provides systems engineers with the knowledge to ensure the reliable and efficient operation of Hadoop-based systems.

Software Developer

A software developer designs, writes, and tests code for software applications. This course may be useful, as it helps software developers learn how to integrate Hadoop into their applications for processing large datasets. You'll gain experience writing scripts in Scala, Pig Latin, and Python to analyze data on Hadoop. The course also covers tools for data storage and querying, enabling software developers to build applications that leverage the power of big data.

Technical Project Manager

A technical project manager oversees and coordinates technical projects, ensuring they are completed on time and within budget. This course may be useful, as it helps technical project managers gain a foundational understanding of the Hadoop ecosystem. You'll learn about the major components and technologies involved in big data processing. The course provides project managers with the knowledge to effectively communicate with technical teams and manage Hadoop-related projects.

Quality Assurance Engineer

A quality assurance engineer tests software and systems to ensure they meet quality standards and function correctly. This course may be useful, as it helps quality assurance engineers learn how to test Hadoop-based systems and applications. You'll gain an understanding of the Hadoop ecosystem and the various technologies involved in big data processing. The course provides quality assurance engineers with the knowledge to develop and execute test plans for Hadoop environments.

Technical Writer

A technical writer creates documentation for software, hardware, and other technical products. This course may be useful, as it helps technical writers gain a basic understanding of the Hadoop ecosystem. You'll learn about the major components and technologies involved in big data processing. The course provides technical writers with the knowledge to create accurate and informative documentation for Hadoop-related products and systems.
