Spark, Hadoop, and Snowflake for Data Engineering

Noah Gift and Kennedy Behrman

In this course, you will:

  • Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them
  • Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks
  • Hone your Python data science skills with PySpark
  • Discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks
  • Learn methodologies that improve your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices

This course is designed for learners who want to pursue or advance their career in data science or data engineering, or for software developers or engineers who want to grow their data management skill set. With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.

What's inside

Learning objectives

  • Optimize and manage Hadoop, Spark, and Snowflake platforms
  • Execute data analytics and machine learning tasks using Databricks
  • Enhance Python data science skills with PySpark
  • Manage the end-to-end machine learning lifecycle with MLflow
  • Apply Kaizen, DevOps, and DataOps methodologies for data engineering

Syllabus

Module 1: Overview and Introduction to PySpark (7 hours)
- 10 videos (Total 25 minutes)
- Meet your Co-Instructor: Kennedy Behrman (0 minutes, Preview module)
- Meet your Co-Instructor: Noah Gift (1 minute)
- Overview of Big Data Platforms (1 minute)
- Getting Started with Hadoop (1 minute)
- Getting Started with Spark (1 minute)
- Introduction to Resilient Distributed Datasets (RDD) (2 minutes)
- Resilient Distributed Datasets (RDD) Demo (4 minutes)
- Introduction to Spark SQL (1 minute)
- PySpark Dataframe Demo: Part 1 (3 minutes)
- PySpark Dataframe Demo: Part 2 (7 minutes)
- 9 readings (Total 90 minutes)
- Welcome to Data Engineering Platforms with Python! (10 minutes)
- What is Apache Hadoop? (10 minutes)
- What is Apache Spark? (10 minutes)
- Use Apache Spark in Azure Databricks (optional) (10 minutes)
- Choosing between Hadoop and Spark (10 minutes)
- What are RDDs? (10 minutes)
- Getting Started: Creating RDDs with PySpark (10 minutes)
- Spark SQL, Dataframes and Datasets (10 minutes)
- PySpark and Spark SQL (10 minutes)
- 7 quizzes (Total 210 minutes)
- PySpark (30 minutes)
- Big Data Platforms (30 minutes)
- Apache Hadoop Concepts (30 minutes)
- Apache Spark Concepts (30 minutes)
- RDD Concepts (30 minutes)
- Spark SQL Concepts (30 minutes)
- PySpark Dataframe Concepts (30 minutes)
- 2 discussion prompts (Total 20 minutes)
- Meet and Greet (optional) (10 minutes)
- Let Us Know if Something's Not Working (10 minutes)
- 2 ungraded labs (Total 120 minutes)
- Practice: Creating RDDs with PySpark (60 minutes)
- Practice: Reading Data into Dataframes (60 minutes)
Module 2: Snowflake (4 hours)
- 8 videos (Total 27 minutes)
- What is Snowflake? (2 minutes, Preview module)
- Snowflake Layers (2 minutes)
- Snowflake Web UI (3 minutes)
- Navigating Snowflake (3 minutes)
- Creating a Table in Snowflake (5 minutes)
- Snowflake Warehouses (3 minutes)
- Writing to Snowflake (3 minutes)
- Reading from Snowflake (2 minutes)
- 5 readings (Total 50 minutes)
- Accessing Snowflake (10 minutes)
- Detailed View Inside Snowflake (10 minutes)
- Snowsight: The Snowflake Web Interface (10 minutes)
- Working with Warehouses (10 minutes)
- Python Connector Documentation (10 minutes)
- 6 quizzes (Total 180 minutes)
- Snowflake (30 minutes)
- Snowflake Architecture (30 minutes)
- Snowflake Layers (30 minutes)
- Navigating Snowflake (30 minutes)
- Creating a Table (30 minutes)
- Writing to Snowflake (30 minutes)
Module 3: Azure Databricks and MLflow (5 hours)
- 16 videos (Total 71 minutes)
- Accessing Databricks (0 minutes, Preview module)
- Spark Notebooks with Databricks (4 minutes)
- Using Data with Databricks (4 minutes)
- Working with Workspaces in Databricks (3 minutes)
- Advanced Capabilities of Databricks (1 minute)
- PySpark Introduction on Databricks (7 minutes)
- Exploring Databricks Azure Features (3 minutes)
- Using the DBFS to AutoML Workflow (4 minutes)
- Load, Register and Deploy ML Models (2 minutes)
- Databricks Model Registry (2 minutes)
- Model Serving on Databricks (2 minutes)
- What is MLOps? (12 minutes)
- Exploring Open-Source MLflow Frameworks (5 minutes)
- Running MLflow with Databricks (6 minutes)
- End to End Databricks MLflow (4 minutes)
- Databricks Autologging with MLflow (4 minutes)
- 7 readings (Total 70 minutes)
- What is Azure Databricks? (10 minutes)
- Introduction to Databricks Machine Learning (10 minutes)
- What is the Databricks File System (DBFS)? (10 minutes)
- Serverless Compute with Databricks (10 minutes)
- MLOps Workflow on Azure Databricks (10 minutes)
- Run MLflow Projects on Azure Databricks (10 minutes)
- Databricks Autologging (10 minutes)
- 4 quizzes (Total 120 minutes)
- Databricks (30 minutes)
- PySpark SQL (30 minutes)
- PySpark DataFrames (30 minutes)
- MLflow with Databricks (30 minutes)
- 1 ungraded lab (Total 60 minutes)
- ETL-Part-1: Keyword Extractor Tool to HashTag Tool (60 minutes)
Module 4: DataOps and Operations Methodologies (12 hours)
- 21 videos (Total 502 minutes)
- Kaizen Methodology for Data (4 minutes, Preview module)
- Introducing GitHub Codespaces (9 minutes)
- Compiling Python in GitHub Codespaces (18 minutes)
- Walking through SageMaker Studio Lab (28 minutes)
- Pytest Master Class (Optional) (166 minutes)
- What is DevOps? (2 minutes)
- DevOps Key Concepts (35 minutes)
- Continuous Integration Overview (32 minutes)
- Build an NLP in Cloud9 with Python (43 minutes)
- Build a Continuously Deployed Containerized FastAPI Microservice (43 minutes)
- Hugo Continuous Deploy on AWS (18 minutes)
- Container Based Continuous Delivery (8 minutes)
- What is DataOps? (1 minute)
- DataOps and MLOps with Snowflake (61 minutes)
- Building Cloud Pipelines with Step Functions and Lambda (16 minutes)
- What is a Data Lake? (2 minutes)
- Data Warehouse vs. Feature Store (2 minutes)
- Big Data Challenges (1 minute)
- Types of Big Data Processing (1 minute)
- Real-World Data Engineering Pipeline (2 minutes)
- Data Feedback Loop (0 minutes)
- 6 readings (Total 60 minutes)
- GitHub Codespaces Overview (10 minutes)
- Getting Started with Amazon SageMaker Studio Lab (10 minutes)
- Teaching MLOps at Scale with GitHub (Optional) (10 minutes)
- Getting Started with DevOps and Cloud Computing (10 minutes)
- Benefits of Serverless ETL Technologies (10 minutes)
- Next Steps (10 minutes)
- 4 quizzes (Total 120 minutes)
- DataOps and Operations Methodologies (30 minutes)
- Kaizen Methodology (30 minutes)
- DevOps (30 minutes)
- DataOps (30 minutes)
- 1 ungraded lab (Total 60 minutes)
- ETL-Part2: SQLite ETL Destination (60 minutes)

Good to know

Know what's good, what to watch for, and possible dealbreakers
Develops foundational and advanced skills for those in data science and engineering fields
Taught by experienced professionals in the industry
Explores a range of essential data engineering platforms and technologies
Covers project management and workflow optimization practices
Requires some prior programming experience, limiting accessibility for absolute beginners

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Spark, Hadoop, and Snowflake for Data Engineering with these activities:
Review Concepts in PySpark Study Materials
Reviewing these topics at the start of the course will refresh important PySpark concepts and prime your thinking.
  • Review big data platforms, such as Hadoop and Spark
  • Go over PySpark dataframes
  • Examine RDDs, Spark SQL, and dataframe concepts
Work Through PySpark Practice Exercises
These exercises let you start implementing the concepts and solidify your understanding of PySpark.
  • Load PySpark and set up the environment
  • Implement PySpark dataframe operations
  • Execute PySpark SQL queries
Join a Mentoring Program for Data Engineering
Enhance your understanding and practical skills by sharing your knowledge and guiding others in their data engineering journey.
  • Identify a mentoring program in data engineering
  • Apply and get matched with a mentee
  • Provide guidance and support to your mentee
Read "Spark: The Definitive Guide"
Gain a comprehensive understanding of Spark's architecture, programming model, and advanced techniques.
  • Obtain a copy of the book
  • Read through the relevant chapters
  • Take notes and highlight important concepts
Explore Snowflake Documentation and Tutorials
Exploring these resources on your own will give you deeper insight into Snowflake's features and functionality.
  • Familiarize yourself with Snowflake architecture and components
  • Review best practices for Snowflake data management
  • Experiment with Snowflake's scripting and programming capabilities
Run PySpark examples
Reinforce your understanding of PySpark by running the examples provided in the course materials.
  • Navigate to the PySpark examples directory.
  • Run the examples using the provided commands.
  • Observe the output and compare it to the expected results.
Follow Databricks Academy Tutorials
Supplement your learning by following structured tutorials from Databricks Academy to enhance your practical skills.
  • Identify relevant tutorials
  • Follow the tutorials step-by-step
  • Complete the exercises and quizzes
Create a Resource Collection on Snowflake
Build your knowledge base by gathering and organizing resources related to Snowflake's features and capabilities.
  • Identify relevant resources (e.g., documentation, tutorials, articles)
  • Organize the resources into a structured format
  • Share the resource collection with others
Attend a Data Analytics Workshop
Participate in a hands-on workshop to gain practical experience and deepen your understanding of data analytics concepts.
  • Identify a relevant workshop
  • Register and attend the workshop
  • Engage actively in the hands-on exercises
Spark SQL Practice Problems
Reinforce your understanding of Spark SQL syntax and operations by attempting to solve practice problems.
  • Access the practice problems
  • Attempt to solve the problems on your own
  • Review the solutions provided
Write a Blog Post on Data Engineering with Hadoop
Solidify your understanding by explaining concepts related to Hadoop and data engineering in a blog post.
  • Choose a specific topic within Hadoop and data engineering
  • Research and gather information
  • Write a well-structured and informative blog post
  • Publish and promote your blog post
Build a Data Pipeline Prototype in Databricks
Implementing a pipeline hands-on in Databricks will strengthen your comprehension and real-world readiness.
  • Establish a Databricks environment
  • Design and develop your data pipeline architecture
  • Implement data ingestion, transformation, and visualization
Build a Data Pipeline with PySpark
Apply your skills to a practical project involving data extraction, transformation, and loading using PySpark.
  • Define the project scope and objectives
  • Design the data pipeline architecture
  • Implement the pipeline using PySpark
  • Test and evaluate the pipeline
Contribute to an Open-Source Data Engineering Project
Enhance your skills and contribute to the data engineering community by participating in open-source projects.
  • Identify an open-source data engineering project
  • Explore the project's codebase and documentation
  • Identify an area where you can contribute
  • Submit a pull request with your contribution

Similar courses

Here are nine courses similar to Spark, Hadoop, and Snowflake for Data Engineering.
  • Spark, Hadoop, and Snowflake for Data Engineering
  • Machine Learning with Apache Spark
  • Apache Spark for Data Engineering and Machine Learning
  • Data Engineering Essentials using SQL, Python, and PySpark
  • Data Engineering and Machine Learning using Spark
  • Data Engineering using Kafka and Spark Structured...
  • Developing Spark Applications Using Scala & Cloudera
  • Big Data Computing with Spark
  • Building Deep Learning Models on Databricks

© 2016 - 2024 OpenCourser