Learn By Example: Hadoop, MapReduce for Big Data problems from Udemy

What's inside

Syllabus

Introduction

We start off with an introduction on what this course is all about.

Why is Big Data a Big Deal

Big data may be a cliched term, but what does it really mean? Where does this data come from and why is it big?

Distributed computing makes processing very fast - but why? Let's take a simple example and see why distributed computing is so powerful.

What exactly is Hadoop? Its origins and its logical components explained.

HDFS, based on GFS (the Google File System), is the storage layer within Hadoop. It stores files in blocks of 128MB by default.
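To make the block idea concrete, here is a small Python sketch (not part of the course material) that computes how many blocks a file would occupy, assuming the 128MB block size mentioned above. The function name `hdfs_block_count` is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # block size stated above; assumed fixed for this sketch

def hdfs_block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks a file of the given size is split into."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_block_count(1000))  # 8 (7 full blocks plus 1 partial block)
```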

MapReduce is the framework which allows developers to write massively parallel programs without worrying about the underlying details of distributed computing. The developer simply implements the map() and reduce() functions in order to crunch large input sets of data.

YARN is responsible for managing resources in the Hadoop cluster. It was introduced in Hadoop 2.0.

Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.

How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.

Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!

In the world of MapReduce, every problem can be thought of in terms of key-value pairs. Map transforms each key-value pair in a meaningful way, the results are sorted and merged, and reduce combines the key-value pairs in a meaningful way.
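The flow described above can be sketched on a single machine in plain Python. This is only a simulation under simplifying assumptions (everything fits in memory, no real cluster), and `run_mapreduce`, the temperature data and the mapper/reducer names are all made up for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> sort/merge -> reduce on one machine."""
    # Map: each input record yields zero or more (key, value) pairs.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Sort/merge: keys reach the reducer in sorted order, values grouped per key.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output

# Example: maximum temperature per city (made-up data).
def max_temp_mapper(line):
    city, temp = line.split(",")
    yield city, int(temp)

def max_temp_reducer(city, temps):
    yield city, max(temps)

readings = ["Pune,31", "Oslo,4", "Pune,35", "Oslo,-2"]
print(run_mapreduce(readings, max_temp_mapper, max_temp_reducer))
# [('Oslo', 4), ('Pune', 35)]
```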

If you're learning MapReduce for the very first time - it's best to visualize what exactly it does before you get down into the little details.

What really goes on with a single record as it flows through the map and then reduce phase?

Counting the number of times a word occurs in input text is the Hello World of MapReduce. This was the very first example given in Jeff Dean and Sanjay Ghemawat's original paper on MapReduce.

Nothing is real unless it is in code. Setting up our very first Mapper.

Nothing is real unless it is in code. Setting up our very first Reducer.

Nothing is real unless it is in code. Setting up our very first MapReduce Job.

Learn how to use HDFS's command line interface and add data to HDFS to run your jobs on.

Run your very first MapReduce Job. We'll also explore the Web interface for YARN and HDFS and see how to track your jobs.

The reduce phase can be optimized by combining the output of the map phase at the map node itself. This is an optimization of the reduce phase to allow it to work on data that has been "partially reduced".

Using a Combiner should not change the output of the MapReduce job, which means not every Reducer can work as a combine function.
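A quick Python illustration of why this matters (a sketch with made-up numbers, not course code): a sum can be safely applied to partial results, while a mean cannot.

```python
def mean(values):
    return sum(values) / len(values)

# Values for one key, as produced on two different map nodes (made-up data).
node_a = [1, 2, 3]
node_b = [10]

# Sum: reducing partial sums gives the same answer as reducing raw values.
assert sum([sum(node_a), sum(node_b)]) == sum(node_a + node_b)

# Mean: a mean of partial means is NOT the mean of the raw values.
mean_of_means = mean([mean(node_a), mean(node_b)])  # (2 + 10) / 2
true_mean = mean(node_a + node_b)                    # 16 / 4
print(mean_of_means, true_mean)  # 6.0 4.0

# One common workaround: the combiner emits (sum, count) pairs instead.
partials = [(sum(node_a), len(node_a)), (sum(node_b), len(node_b))]
total, count = map(sum, zip(*partials))
print(total / count)  # 4.0
```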

The number of mapper processes depends on the number of input splits of your data. It's not really in your control. What you, as a developer, do control is the number of reducers.

In order to have more than one Reducer work on your map data, you need partitions. Visualize how partitions and shuffle and sort work.
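As a rough sketch of partitioning followed by shuffle and sort, the Python below routes each key to a reducer by hashing it. Hadoop's default partitioner also works from the key's hash modulo the number of reducers; here `zlib.crc32` is used purely as a stable stand-in hash, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to its partition, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return {p: sorted(kvs) for p, kvs in partitions.items()}

map_output = [("fox", 1), ("dog", 1), ("fox", 1), ("cat", 1)]
for partition, kvs in sorted(shuffle(map_output, 2).items()):
    print(partition, kvs)
```

The important property is that every pair with the same key lands in the same partition, so a single reducer sees all the values for that key.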

The Hadoop Streaming API uses the standard input and output to communicate with mapper and reducer functions in any language. Understand how Hadoop interacts with mappers and reducers in other languages.

It's not real till it's in code. Implement the word count MapReduce example in Python using the Streaming API.
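A minimal sketch of what such a word count might look like is shown below. In a real Streaming job the mapper and reducer would typically live in separate scripts that read `sys.stdin` and print to standard output; here both are written as functions over lines (`map_lines` and `reduce_lines` are names chosen for this sketch) so the pipeline can be simulated locally.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local simulation of: cat input | mapper | sort | reducer
text = ["the quick brown fox", "the lazy dog", "The end"]
for out in reduce_lines(sorted(map_lines(text))):
    print(out)
```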

Let's understand HDFS and its data replication strategy in some detail.

Name nodes provide an index of what file is stored where in the data nodes. If the name node is lost, the mapping of where the files are is lost, which means that even though the data is present in the data nodes, we'll have no idea how to access it!

Hadoop backs up name nodes using two strategies: backing up the snapshot and edits to the file system, and setting up a secondary name node.

The Resource Manager assigns resources to processes based on the policies and constraints of the cluster, while the Node Manager manages memory and other resources for a single node. These two form the basic components of YARN.

What happens under the hood when you submit a job to YARN? The Resource Manager, the Container, the Application Master and the Node Manager all work together to run your MapReduce job.

The Resource Manager acts as a pure scheduler and allows plugging in different policies to schedule jobs. Understand how the FIFO scheduler, the Capacity scheduler and the Fair scheduler work.

The user has a lot of leeway in configuring how the scheduler works. Let's study some of the options we can specify in the various config files.

The Main class in your MapReduce needs some special set up before it can accept command line arguments.

The library classes and interfaces which allow parsing command line arguments. Learn what they are and how to use them.

The Job object allows you to plug in your own classes to control inputs, outputs and many intermediate steps in the MapReduce.

Between the Map phase and the Reduce phase lie a number of intermediate steps performed by the Hadoop framework. Partitioning, Sorting and Grouping are 3 specific operations, and each of these can be customized to fit your problem statement.

The Inverted Index, which provides a mapping from every word to the page on which that word occurs, is at the heart of every search engine. This is one of the original use cases for MapReduce.

It's not real unless it's in code. Generate the inverted index using a MapReduce job.
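For a feel of the shape of this job, here is a single-machine Python sketch (the course builds the real thing as a MapReduce job). The page contents and the names `index_mapper` and `index_reducer` are invented for illustration.

```python
from collections import defaultdict

def index_mapper(page_id, text):
    """Emit (word, page_id) for every word on the page."""
    for word in text.lower().split():
        yield word, page_id

def index_reducer(word, page_ids):
    """Collapse all pages for a word into a sorted, de-duplicated list."""
    return word, sorted(set(page_ids))

pages = {"page1": "hadoop stores data", "page2": "hadoop runs mapreduce"}

# Simulated shuffle: group page ids by word.
grouped = defaultdict(list)
for page_id, text in pages.items():
    for word, pid in index_mapper(page_id, text):
        grouped[word].append(pid)

inverted_index = dict(index_reducer(w, ids) for w, ids in sorted(grouped.items()))
print(inverted_index["hadoop"])  # ['page1', 'page2']
```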

Understand why we need the Writable and the WritableComparable interface and why the keys in the Mapper output implement these interfaces.

A Bigram is a pair of adjacent words. Use a special data type to represent a Bigram; it needs to be a WritableComparable so that it can be serialized across the network and sorted and merged by Hadoop.

Use the Bigram data type in your MapReduce to produce a count of all Bigrams in the input text file.
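The idea can be previewed in Python. A frozen, ordered dataclass is used below as a loose stand-in for a WritableComparable key (it is hashable and sortable, but does no Hadoop serialization); the class and function names are assumptions made for this sketch.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Bigram:
    """Stand-in for a WritableComparable key: hashable and sortable."""
    first: str
    second: str

def bigrams(line):
    """Yield a Bigram for every pair of adjacent words in the line."""
    words = line.lower().split()
    for first, second in zip(words, words[1:]):
        yield Bigram(first, second)

lines = ["to be or not to be"]
counts = Counter(b for line in lines for b in bigrams(line))
for bigram in sorted(counts):
    print(bigram.first, bigram.second, counts[bigram])
```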

Follow these instructions to set up your Hadoop project.

No code is complete without unit tests. The MRUnit framework uses JUnit to test MapReduce jobs. Write test cases for the Bigram count code.

The Input Format specifies the kind of input data that feeds into the MapReduce. The FileInputFormat is the base class for all inputs which are files.

The most common kind of files are text files and binary files and Hadoop has built in library classes to represent both of these.

What if you want to partition on something other than key hashes? Custom partitioners allow you to partition on whatever metric you choose; you just need to write a bit of code.

Total Order Partitioning is a mind bending concept in Hadoop. This allows you to locally sort data such that it's in globally sorted order. Sounds confusing? It is a hard concept to wrap one's head around but the results are pretty amazing!

Input sampling samples the input data to produce a key-to-partition mapping. The total order partitioner uses this mapping to partition the data in such a manner that locally sorting the data results in a globally sorted result.
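The Python sketch below imitates this idea on one machine: sample the keys, pick partition boundaries from the sorted sample, route keys by boundary, then sort each partition locally. The function names, the sample size and the integer keys are assumptions for illustration, not Hadoop's actual sampler API.

```python
import bisect
import random

def split_points(keys, num_partitions, sample_size=100, seed=0):
    """Sample the keys and pick num_partitions - 1 boundaries from the sorted sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def total_order_partition(key, boundaries):
    """Keys below the first boundary go to partition 0, and so on."""
    return bisect.bisect_right(boundaries, key)

keys = list(range(1000))
random.Random(1).shuffle(keys)
boundaries = split_points(keys, num_partitions=4)

partitions = [[] for _ in range(4)]
for key in keys:
    partitions[total_order_partition(key, boundaries)].append(key)

# Each "reducer" sorts only its own partition; concatenating them is globally sorted.
result = [key for part in partitions for key in sorted(part)]
print(result == sorted(keys))  # True
```

Because the boundaries come from a sample, the partitions are only roughly equal in size, but the ordering guarantee holds regardless.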

The Hadoop Sort/Merge operation sorts the output keys of the mapper. Here is a neat trick to sort the values for each key as well.
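One way to picture the trick, sketched in Python with made-up data: move the value into a composite key so the sort orders it too, then group on the natural key alone. In Hadoop this would involve a custom sort comparator, grouping comparator and partitioner; the list operations below only mimic their effect.

```python
from itertools import groupby

# Map output: (natural key, value). Made-up data: (user, purchase amount).
map_output = [("bob", 30), ("alice", 20), ("bob", 10), ("alice", 5), ("bob", 20)]

# 1. Build a composite key (natural key, value) so the sort also orders values.
composite = [((user, amount), amount) for user, amount in map_output]

# 2. Sort on the full composite key (the role of a custom sort comparator).
composite.sort(key=lambda kv: kv[0])

# 3. Group on the natural key only (the role of a custom grouping comparator).
sorted_values = {
    user: [amount for _, amount in group]
    for user, group in groupby(composite, key=lambda kv: kv[0][0])
}
print(sorted_values)  # {'alice': [5, 20], 'bob': [10, 20, 30]}
```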

At the heart of recommendation systems is a beautifully simple idea called collaborative filtering. If 2 users have a lot in common then the chances are that what one user likes the other will as well. You can recommend the users' likes to each other!

Recommend potential friends to all users of a social network. This involves using 2 MapReduce jobs and chaining them in such a way that the output of one MapReduce feeds into the second MapReduce.

The first MapReduce finds the number of common friends for every pair of users. This requires special treatment for users who are already friends.

The second MapReduce takes in the common friends for every pair of users and generates the top 10 friend recommendations for every user of the social network.

Note there are 2 MR jobs chained together.
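The two chained jobs can be approximated in plain Python as two functions, where the output of the first feeds the second. This is a simplified single-machine sketch: the friend lists, the `ALREADY_FRIENDS` marker value and the function names are all assumptions, and the course's actual implementation may differ.

```python
from collections import defaultdict
from itertools import combinations

friends = {  # made-up adjacency lists
    "ann": ["bob", "cat"],
    "bob": ["ann", "cat", "dan"],
    "cat": ["ann", "bob", "dan"],
    "dan": ["bob", "cat"],
}
ALREADY_FRIENDS = -1  # marker for pairs that should not be recommended

def count_common_friends(friends):
    """Job 1: number of common friends per pair, with existing friendships marked."""
    common = defaultdict(int)
    for user, flist in friends.items():
        for friend in flist:
            common[tuple(sorted((user, friend)))] = ALREADY_FRIENDS
    for user, flist in friends.items():
        for pair in combinations(sorted(flist), 2):  # both are friends of `user`
            if common[pair] != ALREADY_FRIENDS:
                common[pair] += 1
    return common

def recommend(common, n=10):
    """Job 2: top n candidates per user, ranked by number of common friends."""
    per_user = defaultdict(list)
    for (a, b), count in common.items():
        if count > 0:
            per_user[a].append((count, b))
            per_user[b].append((count, a))
    return {
        user: [name for _, name in sorted(cands, key=lambda c: (-c[0], c[1]))[:n]]
        for user, cands in per_user.items()
    }

common = count_common_friends(friends)
print(recommend(common))  # {'ann': ['dan'], 'dan': ['ann']}
```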

How is Hadoop different from a database? Can we leverage the power of Hadoop for structured data?

Let's see how to implement the SQL Select and Where constructs using MapReduce.

Select and Where constructs are implemented in the Mapper. Group By and Having constructs are implemented in the Reducer.
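As an illustration of that split, the Python sketch below mimics a query along the lines of `SELECT dept, COUNT(*) FROM employees WHERE salary > 50 GROUP BY dept HAVING COUNT(*) >= 2`. The table, the thresholds and the function names are invented for this example.

```python
from collections import defaultdict

rows = [  # made-up table: (name, dept, salary)
    ("ann", "eng", 120), ("bob", "eng", 90), ("cat", "ops", 40),
    ("dan", "ops", 70), ("eve", "eng", 30),
]

def sql_mapper(row):
    """WHERE salary > 50, then SELECT dept as the output key."""
    name, dept, salary = row
    if salary > 50:
        yield dept, 1

def sql_reducer(dept, ones):
    """GROUP BY dept with COUNT(*), then HAVING COUNT(*) >= 2."""
    count = sum(ones)
    if count >= 2:
        yield dept, count

# Simulated shuffle: group mapper output by key.
grouped = defaultdict(list)
for row in rows:
    for key, value in sql_mapper(row):
        grouped[key].append(value)

result = [out for dept in sorted(grouped) for out in sql_reducer(dept, grouped[dept])]
print(result)  # [('eng', 2)]
```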

Joins can be surprisingly tricky to implement with MapReduce - Let's see what the Mapper looks like for a Join.

What should the Reducer do in a Join?

For the Join to work properly, you'll need to customize how the Sorting and Partitioning occurs in the MapReduce.

We continue with MapReduce joins. Let's put everything together in the Job.
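A reduce-side join is one common way to put these pieces together; the Python sketch below shows the general pattern and is not necessarily the course's exact approach. Records are tagged with their source table, sorted on (join key, tag) and grouped on the join key alone. The tables and names are made up.

```python
from itertools import groupby

users = [(1, "ann"), (2, "bob")]                  # (user_id, name)
orders = [(1, "book"), (2, "pen"), (1, "lamp")]   # (user_id, item)

# Mapper: key on the join column and tag each record with its source table.
# Tag 0 sorts before tag 1, so the user record reaches the reducer first.
tagged = [((uid, 0), name) for uid, name in users]
tagged += [((uid, 1), item) for uid, item in orders]

# Sort on (join key, tag); partition and group on the join key alone.
tagged.sort(key=lambda kv: kv[0])

def join_reducer(records):
    """The first record is the user row; the rest are that user's orders."""
    records = list(records)
    (_, tag), name = records[0]
    if tag != 0:
        return  # no matching user: an inner join drops these orders
    for _, item in records[1:]:
        yield name, item

joined = [
    row
    for _, group in groupby(tagged, key=lambda kv: kv[0][0])
    for row in join_reducer(group)
]
print(joined)  # [('ann', 'book'), ('ann', 'lamp'), ('bob', 'pen')]
```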

K-Means Clustering is a popular machine learning algorithm. We start with an intro to Clustering and how the K-Means algorithm works.

We continue with understanding the K-Means algorithm. We'll break down the algorithm into a MapReduce task.

We'll start describing the code to implement K-Means with MapReduce. First some setup to represent data as points and measure the distance between them.

We need to set up a couple of Custom Writables that can be used by MapReduce for the input and output.

We're finally on to the MapReduce bits. We start by configuring the job and doing some setup that will be needed for the Mapper/Reducer.

The Mapper and Reducer for K-Means run once for each iteration and update the cluster centers.

Finally, we need to set it up so that the Jobs run iteratively and stop when convergence occurs.
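The overall loop can be sketched in Python as follows, assuming Euclidean distance and a simple "centers stopped moving" convergence test. The points, starting centers, tolerance and function names are illustrative assumptions rather than the course's code; `math.dist` requires Python 3.8 or later.

```python
import math

def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans_iteration(points, centers):
    """One map + reduce pass: assign points, then average each cluster."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:                              # map: emit (center index, point)
        clusters[nearest(p, centers)].append(p)
    new_centers = []
    for i, members in sorted(clusters.items()):   # reduce: mean of each cluster
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(centers[i])        # keep an empty cluster's center
    return new_centers

def kmeans(points, centers, tolerance=1e-6, max_iterations=100):
    """Repeat iterations until no center moves more than the tolerance."""
    for _ in range(max_iterations):
        new_centers = kmeans_iteration(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:
            break
    return centers

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
print(kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)]))
# [(1.0, 1.5), (9.5, 9.0)]
```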

Manually configure a Hadoop cluster. You'll use Linux Virtual Machines to do this. Please go through "Setting up a Virtual Linux Instance" (in the Installing Hadoop on Local Environment section) before this video.

You can use a cloud service to set up a Hadoop cluster. This video gets you started with AWS and the EC2 (Elastic Compute Cloud) CLI tools.

Install Cloudera Manager on AWS and use it to launch a Hadoop Cluster.

Hadoop is built primarily for Linux/Unix systems. If you are on Windows, you can set up a Linux Virtual Machine on your computer and use that for the install.

If you are unfamiliar with software that requires working in a shell/command line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers.

Covers advanced MapReduce topics like Total Sort and Secondary Sort, which are essential for optimizing big data processing workflows and handling complex data transformations

Includes hands-on exercises for setting up Hadoop clusters using both VMs and cloud environments, providing practical experience in deploying and managing big data infrastructure

Explores the implementation of K-Means clustering using MapReduce, offering insights into parallelizing machine learning algorithms for large datasets

Demonstrates how to integrate Hadoop with Python using the Hadoop Streaming API, enabling learners to leverage their existing Python skills for big data processing tasks

Requires learners to set up a Hadoop cluster using Linux VMs, which may require familiarity with Linux environments and command-line tools

Teaches Hadoop, MapReduce, and YARN, which are technologies that have been superseded by newer cloud-based big data processing tools

Reviews summary

Foundational Hadoop MapReduce concepts & practice

According to learners, this course provides a solid foundation in Hadoop MapReduce and the fundamentals of distributed computing. Many highlight the hands-on exercises and practical examples, such as building an Inverted Index or a Recommendation System, as particularly valuable for understanding the "think parallel" paradigm. The instructors' experience is frequently praised. However, some students found the environment setup challenging due to potentially outdated software versions. While the course covers core MapReduce concepts thoroughly, some reviews note that Spark is now more prevalent in the industry, which is a factor to consider. Overall, it's seen as a good deep dive into classic big data processing.

MapReduce vs modern tools

"Learning MapReduce is a good way to understand the basics, but for current projects, Spark is often the preferred tool."

"The course focuses heavily on MapReduce; be aware that industry trends lean towards Spark."

"Good for historical context and fundamentals, but might not be sufficient if you only want to learn modern tools."

Practical problems demonstrated

"While the examples like Inverted Index and Recommendation Systems were engaging..."

"Seeing how to apply MapReduce to problems like collaborative filtering was very insightful."

"The Bigram counting example was a simple but effective way to show custom data types."

Practical setup and coding practice

"The hands-on labs for setting up the cluster on VMs were incredibly helpful..."

"I appreciated the practical coding examples and exercises; they made the concepts concrete."

"Setting up the Hadoop cluster manually was tough but very educational."

Builds strong core understanding

"The foundational concepts of MapReduce and HDFS were explained clearly. I now understand how big data processing works under the hood."

"I gained a solid understanding of the core principles of Hadoop and MapReduce from this course."

"Really helped solidify my understanding of the core components like YARN and the MapReduce paradigm."

Installation can be difficult

"The course uses older versions of Hadoop, which made setting up the environment a frustrating challenge."

"Getting the specific Hadoop versions to install correctly was the hardest part for me."

"Spent a lot of time debugging setup errors that weren't fully covered in the videos."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Learn By Example: Hadoop, MapReduce for Big Data problems with these activities:

Review Distributed Computing Concepts

Reinforce your understanding of distributed computing principles, which are fundamental to Hadoop and MapReduce.

Review key concepts like data partitioning and fault tolerance.
Study common distributed system architectures.
Practice with simple parallel programming exercises.

Review: Hadoop: The Definitive Guide

Deepen your understanding of Hadoop architecture and components.

Read the chapters on HDFS and MapReduce.
Study the configuration options for Hadoop.
Experiment with the examples provided in the book.

Implement Word Count in MapReduce

Solidify your understanding of MapReduce by implementing the classic word count example.

Write a Map function to tokenize the input text.
Write a Reduce function to sum the counts for each word.
Configure and run the MapReduce job on a sample dataset.

Review: Data-Intensive Text Processing with MapReduce

Explore advanced text processing techniques using MapReduce.

Read the chapters relevant to your interests (e.g., information retrieval).
Implement some of the algorithms described in the book.
Adapt the algorithms to your own datasets.

Create a Blog Post on YARN Schedulers

Reinforce your knowledge of YARN by explaining different scheduling algorithms in a blog post.

Research FIFO, Capacity, and Fair schedulers.
Write a clear and concise explanation of each scheduler.
Include diagrams to illustrate the scheduling process.
Publish the blog post on a platform like Medium or your personal website.

Build a Simple Recommendation System

Apply your MapReduce skills to build a basic recommendation system using collaborative filtering.

Collect user-item interaction data (e.g., movie ratings).
Implement MapReduce jobs to calculate user similarity.
Implement MapReduce jobs to generate recommendations.
Evaluate the performance of your recommendation system.

Contribute to an Open Source Hadoop Project

Deepen your understanding of Hadoop by contributing to an open-source project.

Identify a Hadoop project on GitHub or Apache.
Find a bug or feature request to work on.
Submit a pull request with your changes.
Participate in code reviews and discussions.

Career center

Learners who complete Learn By Example: Hadoop, MapReduce for Big Data problems will develop knowledge and skills that may be useful to these careers:

Hadoop Developer

A Hadoop developer works specifically with the Hadoop ecosystem to develop and maintain big data applications. For a Hadoop developer, this course offers hands-on experience with setting up Hadoop clusters, customizing MapReduce jobs, and using the Hadoop Streaming API with Python. The course covers advanced topics like Total Sort and Secondary Sort, enhancing a Hadoop developer's ability to optimize data processing. Those pursuing a career as a Hadoop developer will find the practical examples and in-depth coverage of Hadoop components invaluable for building and deploying efficient big data solutions.

Data Engineer

A data engineer designs, builds, and manages the infrastructure that allows organizations to process and analyze large datasets. This course on Hadoop and MapReduce helps build a foundation for a data engineer who needs to work with big data technologies. The course covers setting up Hadoop clusters, understanding HDFS and YARN, and customizing MapReduce jobs. Someone who wants to become a data engineer will find the hands-on experience with Hadoop, MapReduce, and parallel thinking extremely valuable. This course may be helpful to those who want to learn to build data pipelines, optimize data processing, and ensure data quality.

Cloud Architect

A cloud architect designs and implements cloud computing solutions. This course is useful for a cloud architect who needs to understand how to deploy and manage big data infrastructure in the cloud. The course's coverage of setting up Hadoop clusters on AWS with Cloudera Manager provides practical experience in deploying big data solutions in the cloud. This course may be helpful for those who want to become cloud architects by providing hands-on experience and comprehensive coverage of Hadoop components and their interaction in a cloud environment.

Big Data Architect

A big data architect is responsible for designing and implementing the overall architecture of a big data ecosystem. This Hadoop and MapReduce course may provide a solid background for a big data architect, covering essential concepts like setting up Hadoop clusters, understanding HDFS and YARN, and thinking in parallel. A big data architect benefits from the course's practical examples, such as building an inverted index for search engines and generating bigrams from text. Someone who wants to become a big data architect will benefit from the comprehensive coverage of Hadoop components and their interactions, helping them design scalable and efficient big data solutions.

Machine Learning Engineer

A machine learning engineer focuses on building and deploying machine learning models at scale. For a machine learning engineer, this course helps to understand the infrastructure needed to process large datasets for training machine learning models. The course's coverage of Hadoop, MapReduce, and parallel thinking helps the machine learning engineer efficiently manage and analyze large datasets. The K-Means clustering example provides practical experience in applying big data techniques to machine learning problems. A prospective machine learning engineer will find the hands-on experience and comprehensive coverage of Hadoop components invaluable for building scalable machine learning solutions.

Solutions Architect

A solutions architect designs and implements technology solutions that address specific business problems. This course is helpful for a solutions architect who needs to understand how to integrate big data technologies into overall system architecture. The course's coverage of Hadoop, MapReduce, and cloud deployment helps the solutions architect design scalable and efficient solutions. For a solutions architect, this Hadoop and MapReduce course provides hands-on experience and comprehensive coverage of Hadoop components and their interaction.

Data Scientist

A data scientist uses statistical methods, machine learning, and data visualization techniques to extract insights and knowledge from data. This course may be useful for a data scientist to understand the underlying infrastructure for big data processing. The course's focus on Hadoop, MapReduce, and parallel thinking helps the data scientist process and analyze large datasets efficiently. Practical examples, such as recommending friends in a social network, provide valuable experience in applying big data techniques to real-world problems. This course may be helpful for a data scientist who needs to scale their analysis and modeling efforts to large datasets.

Technical Consultant

A technical consultant advises organizations on how to use technology to achieve their business goals. A technical consultant would find this Hadoop and MapReduce course helpful when advising clients on big data solutions. The course covers setting up Hadoop clusters, understanding HDFS and YARN, and customizing MapReduce jobs, which provides a solid background for recommending and implementing big data technologies. A technical consultant would improve their skills with the course's practical examples and in-depth coverage of Hadoop components and their interactions, enabling them to design scalable and efficient big data solutions.

Data Analyst

A data analyst collects, cleans, and analyzes data to identify trends and insights that can help organizations make better decisions. This course provides a data analyst with the skills to process and analyze large datasets using Hadoop and MapReduce. The course's focus on parallel thinking and practical examples, such as generating bigrams from text, helps the data analyst efficiently extract insights from data. This course may be useful for those who want to become data analysts by providing hands-on experience and comprehensive coverage of Hadoop components.

Software Engineer

A software engineer designs, develops, and maintains software systems. This course may be helpful for a software engineer to understand how to build and integrate applications with big data infrastructure. The course's coverage of Hadoop, MapReduce, and parallel thinking helps the software engineer design scalable and efficient systems. This course may be useful for those who want to integrate software applications with big data platforms. Someone interested in becoming a software engineer may find value in the hands-on experience and comprehensive coverage of Hadoop components.

Business Intelligence Analyst

A business intelligence analyst analyzes data to identify trends and insights that can help organizations make better decisions. While this course focuses on the technical aspects of Hadoop and MapReduce, it may be helpful for a business intelligence analyst to understand the underlying infrastructure for big data processing. The course's coverage of Hadoop and MapReduce enables analysts to process and analyze large datasets efficiently. Someone who wants to be a business intelligence analyst may improve their skills by learning how to access and process data stored in Hadoop.

Research Scientist

A research scientist conducts research to advance knowledge in a particular field. This course may be helpful for a research scientist who needs to process and analyze large datasets for their research. The course's coverage of Hadoop, MapReduce, and parallel thinking helps the research scientist efficiently manage and analyze large datasets. Practical examples, such as building an inverted index for search engines, may give the research scientist valuable experience in applying big data techniques to research problems. Someone interested in becoming a research scientist may find value in the hands-on experience.

Database Administrator

A database administrator (DBA) manages and maintains database systems, ensuring data integrity, security, and availability. While this course focuses on Hadoop and MapReduce, it may be helpful for a DBA to understand how big data technologies complement traditional database systems. The course may provide insights into how Hadoop can be used to process and analyze large datasets that are too complex for traditional databases. A prospective database administrator will find the course interesting as it discusses integrating Hadoop with other systems.

Analytics Manager

An analytics manager leads a team of analysts and data scientists to extract insights and knowledge from data. While this course focuses on the technical aspects of Hadoop and MapReduce, it may be helpful for an analytics manager to understand the underlying infrastructure for big data processing. The course's coverage may help the analytics manager make informed decisions about technology investments and resource allocation for their team. Someone interested in becoming an analytics manager may find value in understanding how Hadoop can be used to process and analyze large datasets.

Data Visualization Specialist

A data visualization specialist creates visual representations of data to help people understand complex information. This course may be useful for a data visualization specialist to understand how to access and process large datasets using Hadoop and MapReduce. This course may be helpful for those who want to become data visualization specialists by providing insight into how various data technologies work together. Someone interested in becoming a data visualization specialist may find value in learning how to work with data infrastructure.
