Ramesh Sannareddy, Yan Luo, Jeff Grossman, and Sabrina Spillner

Delve into the two different approaches to converting raw data into analytics-ready data. One approach is the Extract, Transform, Load (ETL) process. The other contrasting approach is the Extract, Load, and Transform (ELT) process. ETL processes apply to data warehouses and data marts. ELT processes apply to data lakes, where the data is transformed on demand by the requesting/calling application.

In this course, you will learn about the different tools and techniques that are used with ETL and data pipelines. Both ETL and ELT extract data from source systems, move the data through the data pipeline, and store the data in destination systems. During this course, you will experience how ELT and ETL processing differ and identify use cases for both. You will identify methods and tools used for extracting data, merging extracted data either logically or physically, and loading data into data repositories.

You will also define transformations to apply to source data to make the data credible, contextual, and accessible to data users. You will be able to outline several methods for loading data into the destination system, verifying data quality, monitoring load failures, and using recovery mechanisms in case of failure.

By the end of this course, you will know how to use Apache Airflow to build data pipelines and understand the advantages of this approach. You will also learn how to use Apache Kafka to build streaming pipelines, along with the core components of Kafka, which include brokers, topics, partitions, replications, producers, and consumers.

Finally, you will complete a shareable final project that enables you to demonstrate the skills you acquired in each module.

What's inside

Syllabus

Data Processing Techniques
ETL, or Extract, Transform, and Load, processes are used for cases where flexibility, speed, and scalability of data are important. You will explore some key differences between the similar processes ETL and ELT, which include the place of transformation, flexibility, Big Data support, and time-to-insight. You will learn that an increasing demand for access to raw data drives the evolution from ETL to ELT. Data extraction involves advanced technologies including database querying, web scraping, and APIs. You will also learn that data transformation is about formatting data to suit the application, and that data is either loaded in batches or streamed continuously.
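To make the ETL ordering concrete, here is a minimal, hedged sketch in Python (the CSV file, column names, and SQLite destination are illustrative assumptions, not course materials): the data is transformed before it is loaded, whereas an ELT pipeline would load the raw rows first and leave transformation to the consuming application.

```python
# Minimal ETL sketch: transform happens BEFORE load (contrast with ELT,
# where raw data is loaded as-is and transformed on demand).
# Assumes a local "sales.csv" with hypothetical columns: date, amount.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Format the data to suit the target application: cast types, drop bad rows.
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["date"], float(row["amount"])))
        except (KeyError, ValueError):
            continue  # skip rows that fail validation
    return cleaned

def load(rows, db_path="warehouse.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))
```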
ETL & Data Pipelines: Tools and Techniques
Extract, transform, and load (ETL) pipelines can be created with Bash scripts that run on a schedule using cron. Data pipelines move data from one place, or form, to another. Data pipeline processes include scheduling or triggering, monitoring, maintenance, and optimization. Batch pipelines extract and operate on batches of data, whereas streaming data pipelines ingest data packets one by one in rapid succession. In this module, you will learn that streaming pipelines apply when the most current data is needed. You will explore how parallelization and I/O buffers help mitigate bottlenecks. You will also learn how to describe data pipeline performance in terms of latency and throughput.
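As a rough, hedged illustration of the scheduling and performance ideas in this module (the script path, the crontab line, and the toy workload are assumptions, not the course's lab code), a batch job can be timed to report its own latency and throughput and run on a schedule with cron:

```python
#!/usr/bin/env python3
# Batch pipeline sketch. A crontab entry such as the following (illustrative,
# not from the course) would run it at the top of every hour:
#   0 * * * * /usr/bin/python3 /path/to/batch_job.py >> /tmp/batch_job.log 2>&1
import time

def run_batch(records):
    """Process one batch and report simple performance numbers."""
    start = time.monotonic()
    processed = [r.strip().upper() for r in records]  # stand-in transformation
    elapsed = time.monotonic() - start
    # Latency here is the time to process the batch; throughput is records/second.
    rate = len(processed) / max(elapsed, 1e-9)
    print(f"latency={elapsed:.4f}s throughput={rate:.0f} records/s")
    return processed

if __name__ == "__main__":
    run_batch([f"record {i}\n" for i in range(100_000)])
```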
Building Data Pipelines using Airflow
The key advantage of Apache Airflow's approach of representing data pipelines as DAGs is that they are expressed as code, which makes your data pipelines more maintainable, testable, and collaborative. Tasks, the nodes in a DAG, are created by implementing Airflow's built-in operators. In this module, you will learn that Apache Airflow has a rich UI that simplifies working with data pipelines. You will explore how to visualize your DAG in graph or tree mode. You will also learn about the key components of a DAG definition file, and you will learn that Airflow logs are saved to local file systems and then sent to cloud storage, search engines, and log analyzers.
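For orientation, here is a minimal DAG definition sketch (assuming Apache Airflow 2.x; the DAG id, schedule, and bash commands are illustrative, not the course's lab DAG). The operators create the tasks, and the >> operator declares the edges between them:

```python
# Minimal Airflow DAG sketch: three tasks wired as extract -> transform -> load.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "example",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl_dag",          # illustrative name
    default_args=default_args,
    description="A minimal ETL DAG sketch",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Task dependencies define the DAG's edges.
    extract >> transform >> load
```

Because the pipeline is plain Python, it can be version controlled, reviewed, and unit tested like any other code, which is the maintainability advantage this module highlights.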
Building Streaming Pipelines using Kafka
Apache Kafka is a very popular open source event streaming platform. An event is a type of data that describes an entity's observable state updates over time. Popular Kafka service providers include Confluent Cloud, IBM Event Streams, and Amazon MSK. Additionally, the Kafka Streams API is a client library that supports data processing in event streaming pipelines. In this module, you will learn that the core components of Kafka are brokers, topics, partitions, replications, producers, and consumers. You will explore two special types of processors in the Kafka Streams API stream-processing topology: the source processor and the sink processor. You will also learn about building event streaming pipelines using Kafka.
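As a hedged sketch of the producer and consumer roles (assuming the kafka-python client and a broker at localhost:9092; the topic name is made up and this is not the course's lab code):

```python
# Event-streaming sketch with kafka-python: one producer, one consumer.
from kafka import KafkaProducer, KafkaConsumer

def produce(messages, topic="events"):
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for msg in messages:
        producer.send(topic, value=msg.encode("utf-8"))  # events are appended to the topic
    producer.flush()

def consume(topic="events"):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",   # read the topic from the beginning
        consumer_timeout_ms=5000,       # stop iterating after 5s of inactivity
    )
    for record in consumer:
        print(record.topic, record.partition, record.offset, record.value)

if __name__ == "__main__":
    produce(["sensor reading 1", "sensor reading 2"])
    consume()
```

Brokers host the topics, topics are split into partitions and replicated across brokers, producers append events, and consumers read them, which is the component breakdown this module covers.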
Final Assignment
In this final assignment module, you will apply your newly gained knowledge in two hands-on labs: “Creating ETL Data Pipelines using Apache Airflow” and “Creating Streaming Data Pipelines using Kafka”. You will build these ETL pipelines using real-world scenarios, extracting, transforming, and loading data into a CSV file. You will also create a topic named “toll” in Apache Kafka, download and customize a streaming data consumer, and verify that the streaming data has been collected in the database table.
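The lab instructions themselves are not reproduced here, but as a hedged sketch of the kind of steps the assignment describes (assuming the kafka-python client and a broker at localhost:9092; the SQLite file and table schema are illustrative guesses, not the lab's actual setup):

```python
# Sketch only: create the "toll" topic, then store consumed messages in a
# local SQLite table so the collected streaming data can be verified.
import sqlite3

from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="toll", num_partitions=1, replication_factor=1)])

conn = sqlite3.connect("toll_data.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS toll_data (message TEXT)")

consumer = KafkaConsumer("toll", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest", consumer_timeout_ms=10000)
for record in consumer:
    conn.execute("INSERT INTO toll_data (message) VALUES (?)",
                 (record.value.decode("utf-8"),))
conn.commit()
conn.close()
```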

Good to know

Know what's good, what to watch for, and possible dealbreakers
Examines ETL and ELT processes, which are standard in data analytics
Teaches both ETL and ELT, which helps learners gain skills in both approaches
Develops foundational theory and practical skills in ETL and Data Pipelines
Taught by instructors recognized for their work in data management
Uses real-world scenarios for hands-on labs in ETL and streaming pipelines

Reviews summary

ETL and Data Pipelines with Shell, Airflow, and Kafka

Learners say this hands-on course teaches Apache Airflow, Apache Kafka, and ETL pipelining for data engineers and developers. There are positive comments about the labs and assignments, which learners say are practical and helpful. However, learners mention that some of the lectures seem too basic or even boring. Some reviews also mention occasional technical issues with the course's labs.
This course emphasizes hands-on experience, providing learners with practical labs and assignments.
"Labs in this course are very helpful and to the point."
"Amazing for beginners to this subject! The labs are super useful and everything is explained in a really nice way."
"The final project to connect Airflow as a pipeline management tool to Kafka server is a very useful hands-on project."
Some learners found the lectures to be too basic or boring, suggesting that the course may not be suitable for experienced learners.
"As with all these IBM courses this one is super boring. Robot voice talking over powerpoints, as usual."
"The course material was basic so make sure do to a lot of your own additional learning outside of the coureswork."
"Week 1 feels useless because the main idea is to learn about Airflow and Kafka, and all this information about ETL it is not relevant if the course is positioned as an advanced one."
Some learners encountered technical issues with the course's labs, which may have hindered their experience.
"Buggy practice. Not possible to complete without fixing airflow start script yourself."
"The lab exercises were not loaded, so I had to move to the next section and it was not understandable, there is a technical issue!"
"I cannot proceed with the "SUBMIT a DAG" lab as I am constantly being shown the error - "cp: cannot create regular file '/home/project/airflow/dags/my_first_dag.py': Permission denied" when I run the command - "cp my_first_dag.py $AIRFLOW_HOME/dags"."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in ETL and Data Pipelines with Shell, Airflow and Kafka with these activities:
Use our data modeling tutorial
The course covers database querying and data modeling, so this refresher will prepare you in advance.
Browse courses on Data Modeling
Show steps
  • Visit the tutorial and read the introductory sections
  • Work your way through the exercises and sample problems
  • Review the sample solutions
Review concepts of data extraction, transformation, and loading
Reviewing these concepts will provide a strong foundation for understanding the course material.
Browse courses on Data Extraction
Show steps
  • Read course syllabus and skim assigned textbooks
  • Review notes or materials from previous courses on data management
Run SQL queries against different databases
SQL is covered heavily in this course, so these practice drills will improve your results (a small practice sketch follows these steps).
Browse courses on Data Extraction
Show steps
  • Load data into a local database
  • Write SQL queries to retrieve specific sets of data
  • Practice aggregating, sorting, and filtering data
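A minimal practice sketch (table and column names are made up; SQLite is used only because it ships with Python):

```python
# Load a few rows into an in-memory SQLite database and practice
# aggregating, sorting, and filtering with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)])

# Aggregate and sort: total spend per customer, largest first.
for row in conn.execute("SELECT customer, SUM(amount) AS total FROM orders "
                        "GROUP BY customer ORDER BY total DESC"):
    print(row)

# Filter: count orders above a threshold.
print(conn.execute("SELECT COUNT(*) FROM orders WHERE amount > 10").fetchone())
conn.close()
```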
Eight other activities
Create a study guide or summary of key concepts
Creating a study guide will help consolidate learning and improve retention.
Browse courses on ETL
Show steps
  • Review course notes, readings, and assignments
  • Identify and summarize important concepts
Follow online tutorials or workshops to learn specific data analysis techniques
Following tutorials and workshops will provide additional guidance and support in learning data analysis techniques.
Browse courses on Data Analysis Techniques
Show steps
  • Identify tutorials or workshops that focus on specific techniques you want to learn
  • Complete the tutorials or participate in the workshops
Practice data transformation techniques
Practicing data transformation techniques will enhance understanding and proficiency.
Browse courses on Data Manipulation
Show steps
  • Complete practice exercises provided in the course modules
  • Find additional practice problems online or in textbooks
Attend industry meetups or conferences related to data engineering
Attending industry events will provide opportunities to connect with professionals and learn about current trends.
Show steps
  • Research and identify relevant events
  • Register and attend the events
Create a data transformation pipeline using Python or another programming language
Creating a data transformation pipeline will provide hands-on experience and reinforce learning; a brief sketch follows these steps.
Browse courses on Python Programming
Show steps
  • Choose a dataset and define the transformation rules
  • Write code to implement the transformations
  • Test and evaluate the pipeline
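As one hedged way to structure such a project (the rules, field names, and sample records below are invented for illustration), transformation rules can be written as small functions applied in order to each record:

```python
# Sketch of a tiny transformation pipeline: each rule is a function that
# returns the modified record, or None to filter the record out.
def strip_whitespace(record):
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize_price(record):
    record["price"] = round(float(record["price"]), 2)
    return record

def drop_missing_name(record):
    return record if record.get("name") else None

RULES = [strip_whitespace, normalize_price, drop_missing_name]

def run_pipeline(records, rules=RULES):
    for record in records:
        for rule in rules:
            record = rule(record)
            if record is None:
                break               # a rule filtered this record out
        else:
            yield record

if __name__ == "__main__":
    data = [{"name": " widget ", "price": "3.499"}, {"name": "", "price": "1"}]
    print(list(run_pipeline(data)))  # the record with no name is dropped
```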
Volunteer at a non-profit organization that utilizes data analytics
Volunteering will provide practical experience and exposure to real-world data analysis applications.
Browse courses on Data Analytics
Show steps
  • Identify organizations that align with your interests
  • Contact the organizations and inquire about volunteer opportunities
Develop a personal data analytics project
Working on a personal project will allow you to apply your skills and explore your interests in data analysis.
Show steps
  • Identify a problem or opportunity that you want to address
  • Gather and analyze data
  • Develop and implement a solution
Create a presentation or report that showcases your data analysis findings
Creating a deliverable will provide a structured way to communicate your analysis and insights.
Browse courses on Presentation
Show steps
  • Organize and analyze your data
  • Develop a clear and concise message

Career center

Learners who complete ETL and Data Pipelines with Shell, Airflow and Kafka will develop knowledge and skills that may be useful to these careers:
Data Engineer
Data Engineers build and maintain the infrastructure that supports data-driven applications and products. They design, build, test, deploy, maintain, and monitor data pipelines and data platforms. In this course, you will learn about data pipelines, data integration, and data cleansing. These concepts will help you to design and implement robust and scalable data engineering systems.
Data Warehouse Engineer
Data Warehouse Engineers design, develop, and maintain data warehouses. They are responsible for ensuring that the data in the data warehouse is accurate, consistent, and accessible to users. This course provides a foundation in ETL and data pipelines, which are essential skills for Data Warehouse Engineers.
Data Integration Architect
Data Integration Architects design and implement data integration solutions. They work with stakeholders to identify and understand their data integration needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Integration Architects.
Data Analyst
Data Analysts collect, transform, and analyze data to provide insights that inform decision-making. In this course, you will learn about data extraction, transformation, and loading (ETL), as well as how to build data pipelines. These skills will be essential for collecting and preparing data for analysis.
Software Engineer
Software Engineers design, develop, test, and maintain software systems. In this course, you will learn about shell scripting, Apache Airflow, and Apache Kafka. These technologies are essential for building and managing data pipelines.
Data Migration Specialist
Data Migration Specialists migrate data from one system to another. They work with stakeholders to identify and understand their data migration needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Migration Specialists.
Data Architect
A Data Architect takes responsibility for the design of data management solutions that address business requirements. They consider the overall architecture of an organization's data landscape, which may include data lakes, data warehouses, and operational systems. This course provides a basis for the data management lifecycle. Concepts of data integration will also be useful for a Data Architect looking at how to handle data from diverse sources.
Machine Learning Engineer
Machine Learning Engineers build and deploy machine learning models to solve complex problems in a variety of industries. This course provides a foundation in Apache Kafka, which is a popular open source event streaming platform. Event streaming is a key technology for building real-time machine learning applications.
Database Administrator
Database Administrators are responsible for the performance and security of databases. They design, implement, and maintain database systems. This course provides a foundation in data integration and data cleansing, which are essential skills for Database Administrators.
Data Scientist
Data Scientists use scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Scientists.
Data Governance Specialist
Data Governance Specialists develop and implement data governance policies and procedures. They work with stakeholders to identify and understand their data governance needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Governance Specialists.
Information Architect
Information Architects design and implement information systems. They work with stakeholders to identify and understand their information needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Information Architects.
Data Quality Analyst
Data Quality Analysts ensure that data is accurate, consistent, and complete. They work with stakeholders to identify and understand their data quality needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Quality Analysts.
Data Privacy Analyst
Data Privacy Analysts ensure that data is used in a compliant and ethical manner. They work with stakeholders to identify and understand their data privacy needs. This course provides a foundation in data integration and data cleansing, which are essential skills for Data Privacy Analysts.
Business Analyst
Business Analysts work with stakeholders to identify and understand their business needs. They analyze data to identify opportunities for improvement and develop solutions to business problems. This course provides a foundation in data integration and data cleansing, which are essential skills for Business Analysts.

Reading list

We've selected the following books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in ETL and Data Pipelines with Shell, Airflow and Kafka.
Provides an in-depth understanding of Apache Kafka's architecture, components, and use cases. It can serve as a valuable reference for the course's section on building streaming pipelines with Kafka.
Provides a comprehensive guide to Apache Spark, including topics such as data processing, data transformation, and data visualization.
Covers data engineering concepts and techniques using Python. It can provide additional insights into data extraction, transformation, and loading processes, complementing the course's ETL focus.
Provides a comprehensive guide to data-intensive text processing with MapReduce, including topics such as text mining, natural language processing, and machine learning.
Provides a comprehensive overview of data-intensive applications and their architectural patterns. It can serve as background reading for the course, helping learners understand the broader context of data pipelines.
Provides a comprehensive overview of Hadoop and its ecosystem. It can serve as background reading for the course, helping learners understand the foundations of data processing and storage in large-scale environments.
Provides a solid foundation in data warehousing principles and practices. It can serve as background reading for the course, helping learners understand the context and evolution of data pipelines in data warehousing.

Similar courses

Here are nine courses similar to ETL and Data Pipelines with Shell, Airflow and Kafka.
Building ETL and Data Pipelines with Bash, Airflow and...
Most relevant
Building Batch Data Pipelines on Google Cloud
Most relevant
Building Batch Data Pipelines on Google Cloud
Most relevant
Extract, Transform, and Load Data
Most relevant
Data Analytics and Databases on AWS
Most relevant
The Path to Insights: Data Models and Pipelines
Most relevant
Building Your First ETL Pipeline Using Azure Databricks
Most relevant
Apache Spark for Data Engineering and Machine Learning
Most relevant
Designing SSIS Integration Solutions
Most relevant