Navdeep Kaur

In this course, you will start by learning what the Hadoop Distributed File System (HDFS) is and the most common Hadoop commands required to work with it.
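
For example, the everyday HDFS commands covered look like the following (paths and file names are illustrative):

    hdfs dfs -mkdir -p /user/hadoop/data          # create a directory in HDFS
    hdfs dfs -put orders.csv /user/hadoop/data    # copy a local file into HDFS
    hdfs dfs -ls /user/hadoop/data                # list directory contents
    hdfs dfs -cat /user/hadoop/data/orders.csv    # print a file to the console
    hdfs dfs -get /user/hadoop/data/orders.csv .  # copy a file back to local disk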

Then you will be introduced to Sqoop import (a sample import command follows the list below):

  • Understand the lifecycle of a Sqoop command.

  • Use the sqoop import command to migrate data from MySQL to HDFS.

  • Use the sqoop import command to migrate data from MySQL to Hive.

  • Use various file formats, compressions, field delimiters, WHERE clauses, and queries while importing data.

  • Understand split-by and boundary queries.

  • Use incremental mode to migrate data from MySQL to HDFS.
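
A typical import might look like the command below. This is a minimal sketch: the connection string, credentials, and the retail_db/orders names are illustrative assumptions, not course code.

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username sqoop_user -P \
      --table orders \
      --where "order_status = 'COMPLETE'" \
      --split-by order_id \
      --as-avrodatafile \
      --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
      --target-dir /user/hadoop/orders

Adding --incremental append --check-column order_id --last-value 0 would turn the same command into an incremental import that only fetches rows beyond the last recorded value.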

Further, you will learn how to use Sqoop export to migrate data (a sample export command follows the list below):

  • What sqoop export is.

  • Using sqoop export, migrate data from HDFS to MySQL.

  • Using sqoop export, migrate data from Hive to MySQL.
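
An export reverses the direction: rows are read from files in HDFS (for example, a Hive warehouse directory) and inserted into a MySQL table. The paths and table names below are illustrative assumptions.

    sqoop export \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username sqoop_user -P \
      --table order_summary \
      --export-dir /user/hive/warehouse/order_summary \
      --input-fields-terminated-by '\001'

The '\001' delimiter is Hive's default field separator, which is why it appears when exporting Hive-managed text tables.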

Next, you will learn about Apache Flume (a minimal agent configuration follows the list below):

  • Understand the Flume architecture.

  • Using Flume, ingest data from Twitter and save it to HDFS.

  • Using Flume, ingest data from netcat and save it to HDFS.

  • Using Flume, ingest data from an exec source and show it on the console.

  • Describe Flume interceptors and see examples of using them.

  • Flume multi-agent flows.

  • Flume consolidation.
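
As a preview of the Flume section, here is a minimal agent configuration wiring a netcat source to an HDFS sink through a memory channel; the agent name, port, and HDFS path are assumptions for illustration.

    # netcat-hdfs.conf - one agent (a1) with one source, channel, and sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Netcat source: listens for newline-terminated events on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Memory channel: buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # HDFS sink: writes events as plain text files
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /user/flume/netcat_events
    a1.sinks.k1.hdfs.fileType = DataStream

    # Wire the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Such an agent would be started with something like: flume-ng agent --name a1 --conf-file netcat-hdfs.conf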

In the next section, we will learn about Apache Hive (sample HiveQL statements follow the list below):

  • Hive Intro

  • External & Managed Tables

  • Working with Different File Formats - Parquet, Avro

  • Compressions

  • Hive Analysis

  • Hive String Functions

  • Hive Date Functions

  • Partitioning

  • Bucketing
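
The statements below sketch a few of these ideas together: an external Parquet table, a partition column, and a simple analysis query. The table and column names are illustrative.

    -- External table: Hive manages only the metadata; the data stays at LOCATION
    CREATE EXTERNAL TABLE orders (
      order_id    INT,
      customer_id INT,
      amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION '/user/hadoop/external/orders';

    -- Simple analysis: order count and revenue per day, using the partition column
    SELECT order_date, COUNT(*) AS num_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;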

You will then learn about Apache Spark (a short Scala example follows the list below):

  • Spark Intro

  • Cluster Overview

  • RDD

  • DAG/Stages/Tasks

  • Actions & Transformations

  • Transformation & Action Examples

  • Spark DataFrames

  • Spark DataFrames - working with different file formats & compression

  • DataFrame APIs

  • Spark SQL

  • DataFrame Examples

  • Spark with Cassandra Integration

  • Running Spark in the IntelliJ IDE

  • Running Spark on EMR
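
As a preview of the Spark material, here is a small self-contained Scala job that reads Parquet into a DataFrame, runs the same aggregation through the DataFrame API and through Spark SQL, and writes the result back out. The paths and column names (order_status, customer_id) are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object OrdersSummary {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("orders-summary").getOrCreate()

        // Read a Parquet dataset into a DataFrame (path is hypothetical)
        val orders = spark.read.parquet("/user/hadoop/orders")

        // DataFrame API: count completed orders per customer
        val summary = orders
          .filter(col("order_status") === "COMPLETE")
          .groupBy("customer_id")
          .agg(count(lit(1)).as("num_orders"))

        // The same query expressed in Spark SQL
        orders.createOrReplaceTempView("orders")
        val viaSql = spark.sql(
          """SELECT customer_id, COUNT(*) AS num_orders
            |FROM orders WHERE order_status = 'COMPLETE'
            |GROUP BY customer_id""".stripMargin)
        viaSql.show()

        // Write the result as Parquet (Snappy-compressed by default)
        summary.write.mode("overwrite").parquet("/user/hadoop/orders_summary")

        spark.stop()
      }
    }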

What's inside

Learning objectives

Hadoop Distributed File System and commands; the lifecycle of a Sqoop command; using sqoop import to migrate data from MySQL to HDFS and to Hive; working with various file formats, compressions, field delimiters, WHERE clauses, and queries while importing data; understanding split-by and boundary queries; using incremental mode to migrate data from MySQL to HDFS; using sqoop export to migrate data from HDFS and Hive to MySQL; understanding the Flume architecture; using Flume to ingest data from Twitter and netcat into HDFS, and from an exec source to the console; and Flume interceptors.

Syllabus

Hadoop Distributed File System and Hadoop Commands
Meet your Instructor
Course Intro
Big Data Intro
Understanding Big Data Ecosystem
Google Cloud Cluster Setup
Google Cloud Account Setup
Dataproc Cluster Setup - Part 1
Dataproc Cluster Setup - Part 2
Upload Files on Google Cloud
Sqoop Setup
Environment Update
Hadoop & Yarn
HDFS and Hadoop Commands
Yarn Cluster Overview
Sqoop lifecycle; migrating data from MySQL to HDFS/Hive; working with various file formats, compressions, field delimiters, conditional imports, queries, split-by and boundary queries, and incremental imports
Sqoop Introduction
Managing Target Directories
Working with Parquet File Format
Working with Avro File Format
Working with Different Compressions
Conditional Imports
Split-by and Boundary Queries
Field delimiters
Incremental Appends
Sqoop-Hive Cluster Fix
Access Hive on Google Cloud
Sqoop Hive Import
Sqoop List Tables/Database
Sqoop Assignment1
Sqoop Assignment2
Sqoop Import Practice1
Sqoop Import Practice2
Export data from HDFS/Hive to MySQL
Export from HDFS to MySQL
Export from Hive to MySQL
Export Avro-Compressed Data to MySQL
Bonus Lecture: Sqoop with Airflow
Apache Flume Architecture; Working with Various Sources, Channels, and Sinks; Interceptors
Flume Setup
Flume Introduction & Architecture
Exec Source and Logger Sink
Moving data from Twitter to HDFS
Moving data from NetCat to HDFS
Flume Interceptors
Flume Interceptor Example
Flume Multi-Agent Flow
Flume Consolidation
Big Data Analytics with Apache Hive
Access Hive Shell on Google Cloud
Hive Introduction
Hive Database
Hive Managed Tables
Hive External Tables
Hive Inserts
Hive Analytics
Working with Parquet
Compressing Parquet
Working with Fixed File Format
Alter Command
Hive String Functions
Hive Date Functions
Hive Partitioning
Hive Bucketing
Spark with Yarn & HDFS
What is Apache Spark
Understanding Cluster Manager (Yarn)
Understanding Distributed Storage (HDFS)
Running Spark on Yarn/HDFS
Understanding Deploy Modes
GCS Cluster
Spark on GCS Cluster
Upload Data files for Spark
Spark Internals
Drivers & Executors
RDDs & Dataframes
Transformation & Actions
Wide & Narrow Transformations
Understanding Execution Plan
Different Plans by Driver
Spark RDD: Transformations & Actions
Map/FlatMap Transformation
Filter/Intersection
Union/Distinct Transformation
GroupByKey / Group people based on birthday months
ReduceByKey / Total number of students in each subject
SortByKey / Sort students based on their roll number
MapPartition / MapPartitionWithIndex
Change number of Partitions
Join / join email address based on customer name
Spark Actions
Spark RDD Practice
Upload Files
Scala Tuples
Filter Error Logs
Frequency of word in Text File
Population of each city
Orders placed by Customers
Average rating of movies
Spark DataFrames & Spark SQL

Good to know

Know what's good, what to watch for, and possible dealbreakers:

  • Develops foundational skills for big data analytics, which are core skills for data engineering
  • Covers a comprehensive range of big data tools that are highly relevant to industry, including Hadoop, Apache Flume, Apache Hive, and Apache Spark
  • Teaches foundational big data analytics concepts that are useful for personal growth and development
  • Instructors are not recognized for their work in the topic that this course teaches

Reviews summary

Big Data tutorial for beginners

According to students, this beginner-friendly Big Data tutorial is well-organized, engaging, and covers a wide range of relevant tools such as Apache Spark, Hadoop, Sqoop, Hive, Flume, and Mongo. It's a great option for those looking to start their journey in Big Data.
Course content is well-paced and easy to follow.
"Lectures are straight to the point."
This course is especially good for beginners to Big Data.
"This course is very good for those who wants to learn sqoop for the first time."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume/Mongo with these activities:
Review Basic Data Structures and Algorithms
Refresh your foundational understanding of data structures and algorithms to strengthen your problem-solving capabilities in Big Data. A small practice sketch follows the steps below.
Steps:
  • Review concepts such as arrays, linked lists, stacks, and queues.
  • Practice implementing basic algorithms such as sorting, searching, and recursion.
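
As a quick refresher, a search-plus-recursion exercise in Scala might look like the sketch below; the function and data are illustrative, not course material.

    // Recursive binary search over a sorted array:
    // returns the index of target, or -1 if absent.
    def binarySearch(xs: Array[Int], target: Int): Int = {
      @annotation.tailrec
      def loop(lo: Int, hi: Int): Int =
        if (lo > hi) -1
        else {
          val mid = lo + (hi - lo) / 2
          if (xs(mid) == target) mid
          else if (xs(mid) < target) loop(mid + 1, hi)
          else loop(lo, mid - 1)
        }
      loop(0, xs.length - 1)
    }

    println(binarySearch(Array(1, 3, 5, 7, 9), 7)) // prints 3
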
Review 'Hadoop: The Definitive Guide' by Tom White
Gain a comprehensive understanding of Hadoop's architecture, components, and use cases by reviewing this authoritative book.
Steps:
  • Read the book's introduction and overview of Hadoop.
  • Review the chapters on HDFS, Yarn, and MapReduce.
  • Read the chapters on advanced topics such as security, performance tuning, and Hadoop ecosystem tools.
  • Take notes and highlight key concepts.
Follow Tutorials on Apache Spark RDD
Enhance your understanding of Apache Spark RDDs by following online tutorials and applying the concepts to practice problems. A short RDD sketch follows the steps below.
Steps:
  • Find online tutorials that cover Apache Spark RDD.
  • Follow the tutorials step-by-step and try out the examples.
  • Practice using RDD transformations and actions on your own datasets.
  • Join online forums or communities to discuss your progress and ask questions.
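
For instance, a classic count-per-key exercise might be sketched as below; the input path and the "rollno,name,subject" record layout are assumptions.

    import org.apache.spark.sql.SparkSession

    object RddPractice {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rdd-practice").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input: lines of "rollno,name,subject"
        val lines = sc.textFile("/user/hadoop/students.csv")

        // Transformations are lazy: nothing executes yet
        val bySubject = lines
          .map(_.split(","))
          .map(fields => (fields(2), 1))
          .reduceByKey(_ + _) // total students per subject

        // collect() is the action that triggers the whole DAG
        bySubject.collect().foreach { case (subject, n) => println(s"$subject: $n") }

        spark.stop()
      }
    }
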
Build a Data Ingestion Pipeline with Apache Flume
Build a data ingestion pipeline using Apache Flume to gain hands-on experience in collecting and processing real-world data. An example interceptor configuration follows the steps below.
Steps:
  • Set up a Flume agent and configure data sources (e.g., Twitter, Netcat).
  • Create a data sink (e.g., HDFS) and configure the Flume agent to send data to the sink.
  • Write a Flume interceptor to pre-process or filter data before sending to the sink.
  • Monitor and troubleshoot the data ingestion pipeline to ensure smooth data flow.
  • Present the results of the data ingestion pipeline and discuss its potential applications.
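
For the interceptor step, one standard built-in choice is the timestamp interceptor, which stamps each event's headers so the HDFS sink can bucket output by time. A sketch, reusing the hypothetical agent a1 and source r1 naming from the earlier configuration:

    # Attach a built-in timestamp interceptor to source r1
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = timestamp

    # The HDFS sink can then use the timestamp header to bucket paths by day
    a1.sinks.k1.hdfs.path = /user/flume/events/%Y-%m-%d
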
Practice Apache Hive Functions
Practice various Apache Hive functions to solidify understanding and reinforce skills in data analysis and manipulation. A sample practice query follows the steps below.
Steps:
  • Create a Hive session and load a dataset into a Hive table.
  • Practice using string functions such as UPPER(), LOWER(), SUBSTRING().
  • Practice using date functions such as DATE_FORMAT(), DATE_ADD(), DATE_SUB().
  • Practice using aggregation functions such as COUNT(), SUM(), AVG().
  • Practice using conditional functions such as CASE WHEN().
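
A couple of practice queries can exercise several of these functions at once; the orders table and its columns are illustrative.

    -- Row-level string, date, and conditional functions
    SELECT UPPER(customer_name)               AS name_upper,
           SUBSTRING(customer_name, 1, 3)     AS name_prefix,
           DATE_FORMAT(order_date, 'yyyy-MM') AS order_month,
           DATE_ADD(order_date, 30)           AS due_date,
           CASE WHEN amount > 100 THEN 'large' ELSE 'small' END AS size_bucket
    FROM orders;

    -- Aggregations per month
    SELECT DATE_FORMAT(order_date, 'yyyy-MM') AS month,
           COUNT(*)    AS num_orders,
           SUM(amount) AS revenue,
           AVG(amount) AS avg_order
    FROM orders
    GROUP BY DATE_FORMAT(order_date, 'yyyy-MM');
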
Volunteer at a Big Data Project
Gain practical experience in working on a real-world Big Data project by volunteering with organizations that specialize in this field.
Steps:
  • Research Big Data projects and identify potential organizations to volunteer with.
  • Contact the organization and express your interest in volunteering.
  • Participate in the project and contribute your skills.
  • Network with other volunteers and professionals in the field.
Develop a Presentation on Spark SQL
Create a presentation that showcases your understanding of Spark SQL and its capabilities for data analysis.
Steps:
  • Gather information on Spark SQL's features and use cases.
  • Design the presentation slides with clear and concise content.
  • Practice delivering the presentation and get feedback.
  • Present the presentation to an audience.

Career center

Learners who complete Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume/Mongo will develop knowledge and skills that may be useful to these careers:
Data Engineer
Data Engineers use Apache Spark to wrangle big data. This course will give a Data Engineer the skills to work with Apache Spark, which is increasingly popular in big data management.
Operations Research Analyst
Operations Research Analysts may use Apache Spark to analyze big data to help improve operations and processes. This course teaches the basics of Apache Spark, which can help an Operations Research Analyst build a foundation in Apache Spark.
Data Scientist
Data Scientists use Apache Spark to process large datasets. This course teaches the fundamentals of Apache Spark, allowing a Data Scientist to build on these skills and improve their capabilities with Apache Spark.
Risk Analyst
Risk Analysts may use Apache Spark to analyze large sets of data to identify and assess risk. This course teaches the basics of Apache Spark, which can help a Risk Analyst build a foundation in Apache Spark.
Data Architect
Data Architects may use Apache Spark to design big data systems. This course teaches the basics of Apache Spark, which can help a Data Architect build a foundation for working with Apache Spark.
Database Administrator
Database Administrators may use Apache Spark to assist with big data administration tasks. This course teaches the basics of Apache Spark, which can help a Database Administrator expand their big data skillset.
Software Engineer
Software Engineers may work with Apache Spark when working with big data in a development environment. This course teaches the basics and fundamentals of Apache Spark.
Data Analyst
A Data Analyst may use Apache Spark to work with big data. This course teaches the basics of Apache Spark. These skills can help a Data Analyst succeed, particularly as Apache Spark has become more popular in data analysis.
Financial Analyst
Financial Analysts may use Apache Spark to analyze large sets of financial data. This course teaches the fundamentals of Apache Spark, which can help a Financial Analyst get started with Apache Spark.
Statistician
Statisticians may use Apache Spark to process large datasets for statistical analysis purposes. This course may be useful for Statisticians who wish to learn about Apache Spark and expand their big data skillset.
Quantitative Analyst
Quantitative Analysts may use Apache Spark to process big data for quantitative analysis purposes. This course may be useful for a Quantitative Analyst who wants to learn the basics of Apache Spark.
Market Research Analyst
Market Research Analysts may utilize Apache Spark to process and analyze big data for market research purposes. This course may be useful for a Market Research Analyst who wants to learn the basics of Apache Spark.
Machine Learning Engineer
Machine Learning Engineers may use Apache Spark to process large amounts of data for model creation. While this course focuses on the fundamentals, it may be useful for a Machine Learning Engineer who wants to learn the basics of Apache Spark.
Actuary
Actuaries may use Apache Spark to analyze big data to assess risk. While this course is focused on the fundamentals of big data processing, it may be useful for Actuaries who wish to learn about Apache Spark.
Business Intelligence Analyst
Business Intelligence Analysts may use Apache Spark to analyze large sets of data. This course may be useful for Business Intelligence Analysts who wish to expand their big data skillset and work with Apache Spark.

Reading list

We've selected nine books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume/Mongo.
"Spark: The Definitive Guide" provides a comprehensive overview of Apache Spark, a unified analytics engine for large-scale data processing. covers the basics of Spark, as well as advanced topics such as machine learning and graph processing.
"Hadoop: The Definitive Guide" provides a comprehensive overview of Hadoop, covering its architecture, components, and use cases. is recommended for readers who want to gain a deep understanding of Hadoop and its ecosystem.
"Natural Language Processing with Python" provides a practical guide to using Python for natural language processing. covers the basics of natural language processing, as well as advanced topics such as machine translation and text classification.
"Big Data Analytics with Java" provides a practical guide to using Java for big data analytics. covers the basics of big data analytics, as well as advanced topics such as machine learning and graph processing.
"Machine Learning with Spark" provides a practical guide to using Spark for machine learning. covers the basics of machine learning, as well as advanced topics such as deep learning and natural language processing.
"Deep Learning with Python" provides a practical guide to using Python for deep learning. covers the basics of deep learning, as well as advanced topics such as convolutional neural networks and recurrent neural networks.
"Data Science with Python" provides a practical guide to using Python for data science. covers the basics of data science, as well as advanced topics such as machine learning and deep learning.
"Hadoop Operations" provides a practical guide to operating a Hadoop cluster. covers the basics of Hadoop administration, as well as advanced topics such as security and performance tuning.
"Apache Flume: Getting Started" tutorial that introduces Apache Flume, a distributed log collection system. covers the basics of Flume, including its architecture, components, and use cases.

Similar courses

Here are nine courses similar to Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume/Mongo.
Hadoop Developer In Real World
Big Data Essentials
Data Engineering using Kafka and Spark Structured...
Introduction to Big Data with Spark and Hadoop
Data Engineering Essentials using SQL, Python, and PySpark
Big Data, Hadoop, and Spark Basics
Learning Apache Hadoop EcoSystem- Hive
Kafka Integration with Storm, Spark, Flume, and Security
Modeling Data Warehouses using Apache Hive