Spark, Hadoop, and Snowflake for Data Engineering from edX

In this course, you will:

Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them
Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks
Hone your Python data science skills with PySpark
Discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks
Learn methodologies that improve your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices

This course is designed for learners who want to pursue or advance their career in data science or data engineering, or for software developers or engineers who want to grow their data management skill set. With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.

What's inside

Learning objectives

Optimize and manage Hadoop, Spark, and Snowflake platforms
Execute data analytics and machine learning tasks using Databricks
Enhance Python data science skills with PySpark
Manage the end-to-end machine learning lifecycle with MLflow
Apply Kaizen, DevOps, and DataOps methodologies for data engineering

Syllabus

Module 1: Overview and Introduction to PySpark (7 hours)

- 10 videos (Total 25 minutes)

- Meet your Co-Instructor: Kennedy Behrman (0 minutes, Preview module)

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers.

Develops foundational and advanced skills for those in data science and engineering fields

Taught by experienced professionals in the industry

Explores a range of essential data engineering platforms and technologies

Covers project management and workflow optimization practices

Requires some prior programming experience, limiting accessibility for absolute beginners

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Spark, Hadoop, and Snowflake for Data Engineering with these activities:

Review Concepts in PySpark Study Materials

Reviewing these topics at the start of the course will refresh important PySpark concepts and get you thinking about them.

Review big data platforms, such as Hadoop and Spark
Go over PySpark dataframes
Examine RDDs, Spark SQL, and dataframe concepts

Work Through PySpark Practice Exercises

These exercises let you start implementing the concepts and solidify your understanding of PySpark.

Load PySpark and set up the environment
Implement PySpark dataframe operations
Execute PySpark SQL queries

Join a Mentoring Program for Data Engineering

Enhance your understanding and practical skills by sharing your knowledge and guiding others in their data engineering journey.

Identify a mentoring program in data engineering
Apply and get matched with a mentee
Provide guidance and support to your mentee

Read "Spark: The Definitive Guide"

Gain a comprehensive understanding of Spark's architecture, programming model, and advanced techniques.

Obtain a copy of the book
Read through the relevant chapters
Take notes and highlight important concepts

Explore Snowflake Documentation and Tutorials

Exploring these resources on your own will give you deeper insight into Snowflake's features and functionality.

Familiarize yourself with Snowflake architecture and components
Review best practices for Snowflake data management
Experiment with Snowflake's scripting and programming capabilities

Run PySpark Examples

Reinforce your understanding of PySpark by running the examples provided in the course materials.

Navigate to the PySpark examples directory
Run the examples using the provided commands
Observe the output and compare it to the expected results

Follow Databricks Academy Tutorials

Supplement your learning by following structured tutorials from Databricks Academy to enhance your practical skills.

Identify relevant tutorials
Follow the tutorials step-by-step
Complete the exercises and quizzes

Create a Resource Collection on Snowflake

Build your knowledge base by gathering and organizing resources related to Snowflake's features and capabilities.

Identify relevant resources (e.g., documentation, tutorials, articles)
Organize the resources into a structured format
Share the resource collection with others

Attend a Data Analytics Workshop

Participate in a hands-on workshop to gain practical experience and deepen your understanding of data analytics concepts.

Identify a relevant workshop
Register and attend the workshop
Engage actively in the hands-on exercises

Spark SQL Practice Problems

Reinforce your understanding of Spark SQL syntax and operations by attempting to solve practice problems.

Access the practice problems
Attempt to solve the problems on your own
Review the solutions provided

Write a Blog Post on Data Engineering with Hadoop

Solidify your understanding by explaining concepts related to Hadoop and data engineering in a blog post.

Choose a specific topic within Hadoop and data engineering
Research and gather information
Write a well-structured and informative blog post
Publish and promote your blog post

Build a Data Pipeline Prototype in Databricks

Building a pipeline hands-on in Databricks will reinforce the course concepts and prepare you for real-world work.

Establish a Databricks environment
Design and develop your data pipeline architecture
Implement data ingestion, transformation, and visualization

Build a Data Pipeline with PySpark

Apply your skills to a practical project involving data extraction, transformation, and loading using PySpark.

Define the project scope and objectives
Design the data pipeline architecture
Implement the pipeline using PySpark
Test and evaluate the pipeline

Contribute to an Open-Source Data Engineering Project

Enhance your skills and contribute to the data engineering community by participating in open-source projects.

Identify an open-source data engineering project
Explore the project's codebase and documentation
Identify an area where you can contribute
Submit a pull request with your contribution

Career center

Learners who complete Spark, Hadoop, and Snowflake for Data Engineering will develop knowledge and skills that may be useful to these careers:
