Hive to ADVANCE Hive (Real time usage) :Hadoop querying tool from Udemy

What's inside

Syllabus

Introduction (Theory)

This video gives a brief description of Hive.

Announcement

This video explains why Hive was developed.

Because Hive's syntax is similar to SQL, this video explains the similarities and differences between SQL and Hive.

This video explains the general working of Hive. Hive is designed to process and store structured data. It does this by linking the metadata of its tables to the files in HDFS.

Important note: Hive is not a database. After data is loaded into a Hive table, the file stays in HDFS; Hive simply presents that file in a tabular way.

Architecture of Hive:

Hive has these components in its architecture:

UI
Driver
Compiler
Metastore
Execution engine

This PDF contains a step-by-step procedure for installing Hadoop and Hive, along with other resources such as Java, VirtualBox, and Ubuntu.

Hive is not a database, but it uses a database to store the metadata of its tables. By default Hive provides the Derby database, but real-world projects typically use a more robust database such as MySQL.

This video explains how to create a database in various ways using different options.

Because Hive presents data in a structured format, we create tables. In this lecture we create tables in Hive. Tables can be created in many ways, with many options.

After creating a table, we load data into it. Note that loading does not mean transferring data into Hive, because Hive is not a database; rather, it links the metadata of the Hive table to the corresponding HDFS file.

Part 2

This lecture continues table creation in Hive with further options.

Hive supports two types of tables: internal (managed) tables and external tables.

The main difference between the two appears when a table is dropped. Dropping an internal table loses both the schema and the data, since Hive is responsible for both. Dropping an external table loses only the schema (the table's metadata); the data remains in the same HDFS location and can still be accessed by other applications.
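The drop behaviour described above can be sketched with a toy model. This is only an illustration, not Hive's API: the `metastore` and `hdfs` dictionaries, the function names, and the paths are all hypothetical stand-ins.

```python
# Toy model: "metastore" holds table metadata, "hdfs" holds file contents.
# Every name and path here is hypothetical; this is not Hive's API.

def create_table(metastore, name, location, external):
    """Register table metadata that points at an existing location."""
    metastore[name] = {"location": location, "external": external}

def drop_table(metastore, hdfs, name):
    """Drop a table: metadata always goes, data only for managed tables."""
    table = metastore.pop(name)
    if not table["external"]:
        hdfs.pop(table["location"], None)

hdfs = {"/data/sales": ["1,apple", "2,pear"], "/warehouse/orders": ["7,book"]}
metastore = {}
create_table(metastore, "sales_ext", "/data/sales", external=True)
create_table(metastore, "orders", "/warehouse/orders", external=False)

drop_table(metastore, hdfs, "sales_ext")  # external: files survive
drop_table(metastore, hdfs, "orders")     # managed: files are removed too
```

After both drops the metadata for both tables is gone, but only the external table's files are still present.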

The INSERT statement is used to load data from one Hive table into another Hive table.

The multi-insert statement is used to load data from one table into multiple tables.

Once a Hive table has been created, its schema can be changed to meet new requirements. This lecture explains the various ways a Hive table's schema can be changed.

Order by - In Hive, ORDER BY guarantees total ordering of data, and for that it has to be passed on to a single reducer.

Sort by- Sort by does not ensure full ordering of data rather it ensures ordering of data within a reducer.

Distribute by- Distribute by ensures that all rows with the same Distribute By columns will go to the same reducer.

Cluster by - CLUSTER BY is a shortcut for DISTRIBUTE BY followed by SORT BY on the same columns. DISTRIBUTE BY first sends all rows with the same column values to a single reducer, and SORT BY then sorts those rows within the reducer.

The lecture also explains:

The difference between ORDER BY and SORT BY
ORDER BY with the LIMIT clause
The behaviour of ORDER BY in strict and non-strict mode
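A rough Python simulation may help separate the four clauses. It is a sketch under simplifying assumptions: reducers are plain lists, keys are integers, and `key % num_reducers` stands in for Hive's real hash partitioning. All function and column names are made up for the illustration.

```python
def distribute_by(rows, key, num_reducers):
    """DISTRIBUTE BY: rows with the same key land on the same reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Integer keys only in this sketch; modulo stands in for hashing.
        reducers[row[key] % num_reducers].append(row)
    return reducers

def sort_by(reducers, key):
    """SORT BY: each reducer sorts its own rows; there is no global order."""
    return [sorted(part, key=lambda r: r[key]) for part in reducers]

def cluster_by(rows, key, num_reducers):
    """CLUSTER BY key: DISTRIBUTE BY key, then SORT BY key."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

def order_by(rows, key):
    """ORDER BY: a single reducer sees every row, so the order is total."""
    return sorted(rows, key=lambda r: r[key])

rows = [{"dept": 3, "pay": 50}, {"dept": 1, "pay": 70},
        {"dept": 2, "pay": 60}, {"dept": 1, "pay": 40}, {"dept": 3, "pay": 90}]

per_reducer = sort_by(distribute_by(rows, "dept", 2), "pay")
total = order_by(rows, "pay")
```

Concatenating the `per_reducer` lists does not give a globally sorted result, while `total` does; that is the practical difference between SORT BY and ORDER BY.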

This lecture covers the date and mathematical functions that Hive provides.

This video explains various string functions in Hive.

Split() function in Hive
Substr() function in Hive
instr() function in Hive

These string functions are widely used in Hive.

The following conditional functions are used in Hive.

if(boolean testCondition, T valueTrue, T valueFalseOrNull)	Returns valueTrue when testCondition is true, returns valueFalseOrNull otherwise.
isnull( a )	Returns true if a is NULL and false otherwise.
isnotnull ( a )	Returns true if a is not NULL and false otherwise.
nvl(T value, T default_value)	Returns default value if value is null else returns value (as of Hive 0.11).
COALESCE(T v1, T v2, ...)	Returns the first v that is not NULL, or NULL if all v's are NULL.
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END	When a = b, returns c; when a = d, returns e; else returns f.
CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END	When a = true, returns b; when c = true, returns d; else returns e.
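As a rough aid to reading the list above, here are Python equivalents of three of the functions, with `None` standing in for NULL. The Python function names are hypothetical and the sketch ignores Hive's type rules.

```python
def hive_if(test_condition, value_true, value_false_or_null):
    """if(): a NULL (None) or false condition yields the second value."""
    return value_true if test_condition else value_false_or_null

def nvl(value, default_value):
    """nvl(): replace NULL with a default; non-NULL values pass through."""
    return default_value if value is None else value

def coalesce(*values):
    """COALESCE(): first non-NULL argument, or NULL if all are NULL."""
    return next((v for v in values if v is not None), None)
```

Note that `nvl(0, 9)` returns 0 in this sketch: zero is a value, not NULL.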

These are advanced Hive concepts: the explode function and lateral view.

explode() takes in an array as an input and outputs the elements of the array as separate rows.

Lateral view is used to select other table columns with exploded columns.
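The combined effect of explode() with a lateral view can be sketched in Python. This is an illustration only: rows are dictionaries, and the function, column, and alias names are hypothetical. It models the plain (non-OUTER) lateral view, where a row with an empty array produces no output rows.

```python
def lateral_view_explode(rows, array_col, alias):
    """Emit one output row per array element, keeping the other columns."""
    out = []
    for row in rows:
        for item in row[array_col]:
            new_row = {k: v for k, v in row.items() if k != array_col}
            new_row[alias] = item
            out.append(new_row)
    return out

people = [{"name": "ann", "skills": ["hive", "pig"]},
          {"name": "bob", "skills": []}]
exploded = lateral_view_explode(people, "skills", "skill")
```

Here `exploded` holds two rows for "ann", one per skill, and none for "bob".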

RLIKE is an advanced operator in Hive: A RLIKE B evaluates to true if any substring of A matches the regular expression B.

Example: ‘Super’ RLIKE ‘Su’ –> True
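The substring-match behaviour can be approximated in Python with `re.search`. Treat this as a sketch: Hive evaluates RLIKE with Java regular expressions, whose syntax differs from Python's `re` in some details, and the helper name `rlike` is made up for the example.

```python
import re

def rlike(a, pattern):
    """True if any substring of a matches the regular expression pattern."""
    return re.search(pattern, a) is not None
```

With this helper, `rlike("Super", "Su")` is true, while `rlike("Super", "^per")` is false because the anchor forces the match to start at the beginning of the string.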

These are advanced Hive functions that come under Hive's analytical (windowing) functions.

rank() gives tied rows the same rank value and leaves gaps in the sequence after ties.

dense_rank() also gives tied rows the same rank value, but leaves no gaps as rank() does.

row_number() gives every row a distinct number, even when rows are tied.

The lecture also contains difference between rank() and dense_rank().
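The difference between the three functions is easiest to see on data with ties. The Python sketch below computes all three over a list of scores ordered from highest to lowest; the function name and the sample scores are invented for the illustration.

```python
def rank_functions(values):
    """Return (value, row_number, rank, dense_rank) tuples, highest value first."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = dense = 0
    previous = object()  # sentinel that equals no real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # rank jumps to the current position (gaps after ties)
            dense += 1        # dense_rank advances by exactly one (no gaps)
            previous = value
        result.append((value, position, rank, dense))
    return result

table = rank_functions([90, 80, 90, 70])
```

For the scores 90, 90, 80, 70 this gives row numbers 1, 2, 3, 4, ranks 1, 1, 3, 4, and dense ranks 1, 1, 2, 3.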

Partitioning is a data organizing technique in Hive. It organizes a table into smaller partitions based on the values of one or more of the table's columns.

Partitioning can be done in two ways.

Static Partitioning: In static partitioning the partition value is hardcoded, and we have to name the partition manually while loading data into it.

Dynamic Partitioning: In dynamic partitioning Hive automatically decides the partitions based on the values of the partition column.

This lecture covers:

Static partition in Hive table
How to load data into partitioned table in Hive

Dynamic Partitioning in Hive: In dynamic partitioning Hive automatically decides the partitions based on the values of the partition column.

This lecture covers:

What is dynamic partition
Load data into dynamic partitioned table.
Difference between static and dynamic partition.
Set dynamic partition property
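The contrast between static and dynamic partitioning can be sketched in Python by modelling partitions as `column=value` directories. The function names, the table path, and the sample rows are hypothetical; real loads are done with HiveQL statements, and dynamic partitioning additionally depends on Hive settings such as `hive.exec.dynamic.partition`.

```python
def static_partition_load(table_dir, partition_col, partition_value, rows):
    """Static: the caller names the partition and every row goes there."""
    return {f"{table_dir}/{partition_col}={partition_value}": list(rows)}

def dynamic_partition_load(table_dir, partition_col, rows):
    """Dynamic: the partition is derived from each row's column value."""
    partitions = {}
    for row in rows:
        row = dict(row)
        # The partition column is encoded in the directory name, not the file.
        value = row.pop(partition_col)
        path = f"{table_dir}/{partition_col}={value}"
        partitions.setdefault(path, []).append(row)
    return partitions

rows = [{"id": 1, "country": "IN"}, {"id": 2, "country": "US"},
        {"id": 3, "country": "IN"}]
dynamic = dynamic_partition_load("/wh/sales", "country", rows)
static = static_partition_load("/wh/sales", "country", "IN", [{"id": 1}, {"id": 3}])
```

In the dynamic case the two partition directories appear without the caller ever naming "IN" or "US".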

This is an advanced Hive concept that often comes up in interviews.

The schema of a partitioned table can also be altered, for example by changing a partition's location, adding a new partition, or dropping a partition.

The MSCK REPAIR TABLE command in Hive updates the table's metadata when a partition has been added manually in the HDFS location.
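Conceptually, the repair compares the partition directories that exist on disk with the partitions the metastore knows about and registers the missing ones. The Python sketch below models only that idea; the function name and the lists are hypothetical stand-ins, not Hive internals.

```python
def msck_repair(metastore_partitions, hdfs_partition_dirs):
    """Register partition directories that exist on disk but not in the metastore."""
    missing = sorted(set(hdfs_partition_dirs) - set(metastore_partitions))
    metastore_partitions.extend(missing)
    return missing

metastore = ["country=IN"]
on_disk = ["country=IN", "country=US", "country=UK"]  # two added by hand
added = msck_repair(metastore, on_disk)
```

After the call the metastore list also contains the two manually added partitions, so queries can see their data.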

Bucketing is another data organizing technique in Hive. While partitioning organizes a table into a number of directories, bucketing organizes a table into files.

This video explains:

Bucketing in Hive
Difference between Partitioning and Bucketing in Hive.
How to do bucketing.
Properties to be set to do bucketing.
Where to use bucketing
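The core idea of bucketing is that a row's bucket is chosen by hashing the bucketing column and taking the result modulo the number of buckets. The Python sketch below assumes an integer column, for which the value itself serves as the hash; the function name and the bucket file names are illustrative placeholders.

```python
def write_buckets(rows, column, num_buckets):
    """Assign each row to one of num_buckets files by the bucketing column."""
    # File names are placeholders, one per bucket.
    buckets = {f"bucket_{i}": [] for i in range(num_buckets)}
    for row in rows:
        index = row[column] % num_buckets  # stand-in for hash(column) mod num_buckets
        buckets[f"bucket_{index}"].append(row)
    return buckets

rows = [{"id": i} for i in range(1, 8)]
buckets = write_buckets(rows, "id", 3)
```

Unlike partitioning, the number of files is fixed in advance, no matter how many distinct values the column holds.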

Table sampling is an advanced Hive concept and a feature built on bucketing.

Video explains:

What is tablesampling.
Difference between tablesampling and limit operator in hive

Advance Hive concept:

With the NO_DROP and OFFLINE options we can prevent a Hive table or partition from being dropped or queried, respectively.

Joins in Hive behave the same as in SQL, i.e. they join two tables based on a join condition.

This video explains how to join two tables in Hive.

Hive supports these types of joins: inner join, left outer join, right outer join, and full outer join.

Advance Hive concept

We can also join 3 tables in a single query in Hive.

This video contains:

How to join 3 tables in Hive
Memory organisation while joining tables.

Map join is an advanced Hive join.

The logic behind a map join is that the join operation is executed entirely on the map side. No reducer is used in a map join.

This video contains:

What is Map join in Hive.
How Map join executes
When to use Map join in Hive
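A map-side join can be pictured as a hash join in which the small table is held in memory and the large table is streamed past it, so no shuffle to reducers is needed. The Python sketch below shows that idea for an inner join; the function, table, and column names are hypothetical, and it is not how Hive implements the operator internally.

```python
def map_join(big_rows, small_rows, key):
    """Inner join: build a lookup from the small table, then stream the big one."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)  # small table fits in memory
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 20},
          {"order_id": 3, "cust": 10}]
customers = [{"cust": 10, "name": "ann"}]
joined = list(map_join(orders, customers, "cust"))
```

The approach only pays off when one side is small enough to be loaded into memory on every mapper.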

Views in Hive serve the same purpose as in SQL. A view on a table can be thought of as an image of that table.

This video contains:

What are views in Hive
How to create views
Different ways to create views
Dropping views in Hive
Advantages of views in Hive
Where to use views in Hive

Indexing in Hive is an optimization technique for reducing query execution time. A separate index table is created in which the indexes of all indexed columns are stored.

In this lecture we will learn:

What is Indexing in Hive
How to create indexing in Hive
Types of Indexing in Hive
With Deferred rebuild command in Indexing
Advantages of Indexing
Where and When to use Indexing in Hive
When not to use Indexing in Hive

There are two types of indexes in Hive. Can we create both types of index on the same table at the same time? Yes: we can create multiple indexes on the same table. More is explained in the videos.

Indexing should not be used blindly everywhere in Hive, since indexing can also be disadvantageous in some cases. This video explains when and when not to use indexing in Hive.

UDFs (user-defined functions) are the backbone of any real-time Hive project, because in live projects the requirements are complicated and cannot always be met with built-in functions. With UDFs we can write our own functions according to the requirement and then use those functions in Hive queries.

This video explains how to create a UDF in Java and how to use it in Hive.

Advance Hive table property:

This property is used on Hive tables while loading data into them. It lets us skip some rows of the file being loaded into the Hive table.

The immutable property is also used in real-time Hive projects. This lecture shows the behaviour of the INSERT statement with the INTO and OVERWRITE options when the immutable property is set to true.

The purge property is used in Hive when we don't want our data going to the trash. When the purge property is set to true, dropped data is completely gone and cannot be recovered.

This Advance Hive lecture contains:

Purge property in Hive
How to drop a table in Hive
How to truncate table in Hive
Difference between dropping a table and truncating a table in Hive

Advance Hive table property:

By default, Hive does not read an empty field between delimiters as NULL (for string columns it is an empty string). This property tells Hive which value should be considered NULL.

ACID properties in Hive - Advance Hive concept

Advance Hive table property

Hive supports several file formats. It can store data in RC, Parquet, and Textfile formats, among others; one of these is the ORC file format. This video explains the table properties of a Hive table that stores data in ORC format.

This lecture covers the first set of configurations and settings that can be made in Hive.

This lecture covers the second set of configurations and settings that can be made in Hive.

Hive creates its own small files during and after query execution. By setting some Hive properties we can tell Hive to merge these files automatically. This is an optimization technique and an advanced Hive concept.

Parallelism in Hive means executing independent portions of a query in parallel. This is done to reduce the execution time.

Note: Parallelism should be used wisely; it may lead to a deadlock situation.

This video contains:

Parallel processing in Hive.
hive.exec.parallel property
Hive parallel join

Advance Hive concept

Just as we can execute Unix and HDFS commands in the Hive shell, Hive queries can also be executed from the Bash shell. We use the hive -e option to run individual Hive commands and the hive -f option to run Hive scripts from the Bash shell.

Advance Hive:

We can also run Unix and Hadoop commands from the Hive shell. This is a good approach because it avoids opening a new JVM instance.

Advance Hive:

Hive variables are widely used in real-time Hive projects. Variables in Hive behave much as they do in any other programming language. We can declare variables in two ways: hiveconf and hivevar.
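Variable references in a query are written as `${hivevar:name}` or `${hiveconf:name}` and are replaced by text before the query runs. The Python sketch below imitates that textual substitution; the function name, the variable names, and the query are hypothetical, and Hive's own substitution handles more cases than this.

```python
import re

def substitute(query, hivevar, hiveconf):
    """Replace ${hivevar:name} and ${hiveconf:name} references with their values."""
    namespaces = {"hivevar": hivevar, "hiveconf": hiveconf}

    def replace(match):
        namespace, name = match.group(1), match.group(2)
        # Unknown variables are left untouched in this sketch.
        return namespaces[namespace].get(name, match.group(0))

    return re.sub(r"\$\{(hivevar|hiveconf):([^}]+)\}", replace, query)

query = "SELECT * FROM ${hivevar:tbl} WHERE yr = ${hiveconf:yr}"
resolved = substitute(query, {"tbl": "sales"}, {"yr": "2020"})
```

Because the substitution is purely textual, a variable can stand in for a table name, a literal, or any other fragment of the query.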

Advance Hive concept:

We can also pass values to Hive variables from the Bash shell. These Hive variables can be passed to a query or to a Hive script.
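As a sketch of what such invocations look like, the Python helpers below only build the command lines; nothing is executed. The `-e` and `-f` options come from the course description and `--hivevar` is the Hive CLI flag for setting a variable, while the helper names, the script name `report.hql`, and the variable `run_date` are made up for the example.

```python
import shlex

def hive_query_command(query):
    """Command line that runs a single query with hive -e."""
    return ["hive", "-e", query]

def hive_script_command(script_path, hivevars=None):
    """Command line that runs a script with hive -f, passing variables first."""
    command = ["hive"]
    for name, value in (hivevars or {}).items():
        command += ["--hivevar", f"{name}={value}"]
    return command + ["-f", script_path]

# Print the shell-quoted form of each command (hypothetical script and variable).
print(shlex.join(hive_query_command("SELECT 1")))
print(shlex.join(hive_script_command("report.hql", {"run_date": "2020-01-01"})))
```

Inside the script, the passed value would then be referenced as `${hivevar:run_date}`.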

Advance Hive concept

This Hive lecture explains how a variable gets its value from another variable, and which Hive property must be set to true to do this.

Custom input format in hive - Advance Hive

Hive works in the following 3 modes:

Embedded mode
Local mode
Remote mode

This Advance Hive lecture explains:

What is compression in Hive
What are its advantages
What types of files can be compressed in Hive
Different types of compression formats in Hive
How to compress files in Hive

Hiverc is a configuration file in Hive. It runs when we launch the Hive shell. The Hiverc file is used to:

Add jar files
Set configurations
Set properties

Archiving is a little-known and rarely used feature of Hive. By archiving we bundle less frequently used files in Hive to save memory.

Hive Use case 1:

This use case is about how to load semi-structured data into Hive. Usually we load structured data into Hive tables, but we can also load semi-structured data such as XML.

This video explains:

What is XML file
Structure of XML file
Creation of Hive table for XML data
How to load XML file into Hive tables

Hive Use case 1 (continued)

The XML use case is explained further.

Hive Use case 2:

This Hive use case is often asked about in interviews. It explains how to capture changed data in Hive, that is, how to deal with Hive tables that are updated frequently.

Hive Use case 3

This use case is very common in purely Hive-based interviews. In Big Data training we all learn word count using a MapReduce program. The same word count can also be achieved with Hive.

This lecture explains how to implement word count in Hive.
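The usual shape of a Hive word count is to split each line into words, explode the resulting array into one row per word, and then group by word and count. The Python sketch below mirrors those three steps; it is an illustration of the logic rather than the course's actual query, and the function name is hypothetical.

```python
from collections import Counter

def word_count(lines):
    """Split lines into words, flatten them, and count each word."""
    # Step 1 and 2: split every line and flatten to one word per entry,
    # the role split() and explode() play in a Hive query.
    words = [word for line in lines for word in line.split(" ") if word]
    # Step 3: group by word and count.
    return dict(Counter(words))

counts = word_count(["hive is fast", "hive is sql like"])
```

In this sample, "hive" and "is" are each counted twice and every other word once.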

Hive Use case 4:

This use case explains how to deal with the situation where multiple Hive tables are built on the same file. We can create any number of Hive tables on one file. The question is how those Hive tables, each with a different number of columns, behave, since the file's data is the same for every table.
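For a delimited text file, the outcome can be sketched as follows: each table applies its own column list to the same line, so a table with fewer columns ignores the extra fields and a table with more columns sees NULL for the missing ones. The Python sketch assumes a comma-delimited file and untyped columns; the function name and schemas are hypothetical, and other file formats or SerDes may behave differently.

```python
def read_row(line, columns, delimiter=","):
    """Apply a table's column list to one line of a shared delimited file."""
    fields = line.split(delimiter)
    # Missing trailing fields become None (NULL); extra fields are ignored.
    return {col: (fields[i] if i < len(fields) else None)
            for i, col in enumerate(columns)}

line = "1,ann,30"
narrow = read_row(line, ["id", "name"])
wide = read_row(line, ["id", "name", "age", "city"])
```

The file itself is never changed; only the schema applied at read time differs between the tables.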

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers.

Covers advanced Hive concepts like partitioning, bucketing, and user-defined functions, which are essential for optimizing performance in real-world big data environments

Includes practical use cases and interview questions, which prepares learners for applying their knowledge in professional settings and job interviews

Explores configuration settings and optimization techniques, which enables learners to fine-tune Hive for specific use cases and improve query performance

Requires familiarity with basic Hive concepts, so learners without prior experience may need to acquire foundational knowledge before taking this course

Focuses on advanced features and real-time usage, which may not be suitable for learners seeking a general introduction to big data processing with Hadoop

Examines Hive, which is often used with older versions of Hadoop, so learners should verify its relevance to their current technology stack

Reviews summary

Comprehensive Hive for real-world use

According to learners, this course offers a comprehensive and in-depth exploration of Apache Hive, covering both basic and advanced concepts essential for real-time Big Data projects and interviews. Students appreciate the focus on practical application and the inclusion of topics like partitioning, bucketing, UDFs, and optimization techniques. The section on interview use cases is frequently highlighted as particularly valuable and unique. While some reviewers may note challenges with setup or specific technical details, the overall sentiment is largely very positive, indicating that the course provides a strong foundation and prepares learners effectively for professional contexts.

Includes practical exercises and demos.

"The practical working demos really helped solidify my understanding."

"Liked that the course includes hands-on examples."

"Ability to run Unix/Hadoop commands from Hive shell shown is practical."

Goes into detail on many Hive features.

"The instructor explains even very thin details of Hive which are hard to find elsewhere."

"Loved the detailed explanations of partitioning, bucketing, and UDFs."

"Found the depth on topics like ACID properties and indexing very informative."

Emphasizes practical, real-world application of Hive.

"This course is very relevant for real-time usage of Hive in Big Data environments."

"The concepts taught are directly applicable to live projects."

"I found the real-world examples extremely useful for my job."

Course covers fundamental to advanced Hive topics.

"The course covers both basic and advance Hive concepts, which is really helpful for understanding everything."

"It goes from the very beginning to quite complex topics like bucketing and partitioning."

"I liked that it started with the basics and then moved into advanced features necessary for real projects."

Includes a valuable section on interview questions.

"The use cases for interviews section is absolutely brilliant and helped me a lot."

"This course is worth it just for the interview questions section."

"Got asked about one of the use cases in an actual interview!"

Potential difficulties with setting up the environment.

"Setting up Hadoop and Hive environment can be tricky based on the provided PDF."

"Struggled a bit with the installation steps initially."

"Could perhaps benefit from more detailed or updated setup instructions for various environments."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Hive to ADVANCE Hive (Real time usage) :Hadoop querying tool with these activities:

Review SQL Fundamentals

Solidify your understanding of SQL syntax and concepts, as HiveQL is based on SQL. This will make learning Hive's specific syntax and functions easier.

Review basic SQL commands like SELECT, INSERT, UPDATE, and DELETE.
Practice writing SQL queries on sample databases.
Familiarize yourself with SQL JOIN operations.

Review 'Hadoop: The Definitive Guide'

Gain a deeper understanding of the Hadoop ecosystem, which Hive is built upon. This will provide valuable context for understanding Hive's architecture and functionality.

Read the chapters related to HDFS and MapReduce.
Take notes on the key concepts and architecture.
Relate the concepts to Hive's operation.

Practice HiveQL Queries

Reinforce your understanding of HiveQL syntax and functions through hands-on practice. This will help you become more proficient in writing and executing Hive queries.

Set up a local Hive environment with sample data.
Write HiveQL queries to perform various data analysis tasks.
Experiment with different Hive functions and features.

Create a Hive Cheat Sheet

Consolidate your knowledge of Hive syntax and commands by creating a cheat sheet. This will serve as a quick reference guide for future use.

Compile a list of commonly used Hive commands and functions.
Organize the commands and functions into logical categories.
Add brief explanations and examples for each command and function.

Analyze a Real-World Dataset with Hive

Apply your Hive skills to analyze a real-world dataset and gain practical experience. This will help you solidify your understanding of Hive and its applications.

Find a publicly available dataset relevant to your interests.
Load the dataset into Hive.
Write Hive queries to analyze the data and extract insights.
Document your analysis and findings.

Review 'Programming Hive'

Deepen your understanding of HiveQL and its advanced features, such as UDFs and custom input formats. This will enable you to tackle more complex data analysis tasks.

Read the chapters related to advanced HiveQL features.
Experiment with UDFs and custom input formats.
Apply these features to your real-world dataset analysis.

Optimize Hive Queries for Performance

Learn and apply techniques to optimize Hive queries for better performance. This will help you improve the efficiency of your data analysis workflows.

Research Hive query optimization techniques, such as partitioning, bucketing, and indexing.
Apply these techniques to your real-world dataset analysis.
Measure the performance improvement of your optimized queries.
Document your optimization strategies and results.

Career center

Learners who complete Hive to ADVANCE Hive (Real time usage) :Hadoop querying tool will develop knowledge and skills that may be useful to these careers:

Data Engineer

A data engineer designs, builds, and manages the infrastructure required for data storage and processing. This often involves working with large datasets and utilizing tools like Hadoop and Hive. The course provides in depth coverage of Hive, including advanced concepts such as partitioning, bucketing, custom input formatters, and user defined functions, all of which helps data engineers in real world big data projects. The configuration settings covered can give a data engineer greater mastery over Hive. The course is particularly helpful because it teaches how to use Hive to analyze data stored in HDFS.

Big Data Architect

A big data architect is responsible for designing and overseeing the implementation of big data solutions within an organization. This involves selecting appropriate technologies, defining data pipelines, and ensuring the scalability and reliability of the data infrastructure. Because the course addresses advanced features of Hive such as compression techniques, working with multiple tables, and loading unstructured data, it can help one become familiar with the tasks of a big data architect. The course is especially valuable because it delves into rarely used Hive commands and concepts.

Database Administrator

A database administrator manages and maintains databases, ensuring their performance, security, and availability. In the context of big data, this may involve working with Hive and other related technologies. This course will be helpful in understanding Hive architecture, database creation, table management, and data loading, all essential skills for a database administrator working with Hadoop based systems. In particular, taking this course helps one learn to use Hive for real time big data projects.

Data Analyst

A data analyst examines data to identify trends, patterns, and insights that can inform business decisions. This often involves writing SQL like queries to extract and manipulate data from large datasets. Because the course covers both basic and advanced Hive concepts, including mathematical, date, and string functions, it provides a solid basis for a data analyst to perform complex data analysis tasks. The course is especially helpful in teaching how to deal with use cases that are frequently asked in interviews.

Business Intelligence Analyst

A business intelligence analyst uses data to help organizations make better decisions. This includes collecting data, analyzing it, and creating reports and dashboards to visualize key metrics. This often requires proficiency in SQL like querying languages such as Hive. Through its coverage of topics such as joins, views, and user defined functions, the course prepares a business intelligence analyst to extract meaningful insights from large datasets. The course may be helpful in teaching how to perform analytics on data stored in HDFS.

Hadoop Developer

A Hadoop developer builds and maintains applications that run on the Hadoop ecosystem. This often involves writing code to process and analyze large datasets using tools like Hive. Since the course includes the creation of custom input formatters in Hive, this allows a Hadoop developer to work with different data formats. The course is particularly helpful because it covers the practical application of Hive in real world scenarios.

ETL Developer

An extract, transform, load developer is responsible for designing and implementing processes to extract data from various sources, transform it into a usable format, and load it into a data warehouse or other storage system. The advanced Hive features covered in the course, such as partitioning, bucketing, and compression techniques, can enable an ETL developer to efficiently process large datasets. Learning about loading unstructured data in Hive can be particularly helpful to an ETL developer.

Software Engineer

A software engineer designs, develops, and tests software applications. In the context of big data, a software engineer may work on building tools and platforms that leverage technologies like Hadoop and Hive. The skills learned in this course, such as writing SQL like queries and creating user defined functions, can be applied to developing data processing applications in Hive. The course may be useful because it explains rarely used commands and concepts in Hive.

Data Scientist

A data scientist uses statistical and machine learning techniques to analyze data and build predictive models. While data scientists often use languages like Python and R, knowledge of SQL like querying languages such as Hive is essential for accessing and manipulating data stored in Hadoop based systems. With its coverage of advanced functions and configuration settings in Hive, the course helps a data scientist to perform complex data analysis tasks. The course may be useful in teaching how to apply Hive in real world projects.

Machine Learning Engineer

A machine learning engineer develops and deploys machine learning models. This often involves working with large datasets and using tools like Hadoop and Hive to process and prepare data for model training. The advanced Hive concepts covered in the course, such as partitioning, bucketing, and user defined functions, can enable a machine learning engineer to efficiently handle large datasets. The course may be helpful in understanding the practical aspects of using Hive in big data environments.

Cloud Engineer

A cloud engineer is responsible for managing and maintaining cloud computing infrastructure. This may involve working with big data technologies like Hadoop and Hive, especially in cloud based environments. Acquiring familiarity with the configuration settings of Hive, as taught in this course, can help cloud engineers optimize the performance of Hive installations. The course may be helpful in showing how Hive can be leveraged in cloud environments.

Solutions Architect

A solutions architect designs and implements technology solutions to meet business needs. This may involve integrating big data technologies like Hadoop and Hive into an overall system architecture. The course teaches about the architecture of Hive. This can help a solutions architect understand how Hive interacts with other components in a big data ecosystem. The course may be useful in teaching how to apply Hive in various business scenarios.

Technical Consultant

A technical consultant provides expert advice and guidance to organizations on technology related issues. This may involve helping clients implement and optimize big data solutions using technologies like Hadoop and Hive. Exposure to advanced Hive concepts and features, as provided in this course, can enhance a technical consultant's ability to provide effective recommendations. The course may be useful in enhancing a consultant's knowledge of Hive for real time projects.

Project Manager

A Project Manager is in charge of the planning, execution, and closing of projects. This usually involves ensuring that a team meets its goals within budget and timing constraints. A Project Manager may find the course materials useful in managing those who use the Hive querying tool. The course may be useful in providing familiarity with the scope of big data projects using Hive.

Product Manager

A product manager is responsible for the strategy, roadmap, and feature definition for a product or product line. While a product manager may not directly work with Hive, understanding its capabilities and limitations can inform product decisions related to big data solutions. The course may be useful in providing insights into the use of Hive in real world scenarios.
