Durga Viswanatha Raju Gadiraju, Madhuri Gadiraju, Sathvika Dandu, Pratik Kumar, Sai Varma, Phani Bhushan Bozzam, and Siva Kalyan Geddada

As part of this course, you will learn all the key skills needed to build Data Engineering pipelines with Spark SQL and the Spark Data Frame APIs, using Scala as the programming language. This course was originally a CCA 175 Spark and Hadoop Developer course for Certification Exam preparation. The exam was sunset on 10/31/2021, so we have renamed the course to Spark SQL and Spark 3 using Scala, as it covers industry-relevant topics beyond the scope of certification.

About Data Engineering


Data Engineering is, at its core, processing data according to downstream needs. As part of Data Engineering, we build different pipelines such as batch pipelines and streaming pipelines. All roles related to data processing are consolidated under Data Engineering; conventionally, they were known as ETL Development, Data Warehouse Development, and so on. Apache Spark has evolved into a leading technology for Data Engineering at scale.

I have prepared this course for anyone who would like to transition into a Data Engineer role using Spark (Scala). I am a Data Engineering Solution Architect with proven experience in designing solutions using Apache Spark.

Let us go through the details of what you will be learning in this course. Keep in mind that the course includes a lot of hands-on tasks, which will give you enough practice using the right tools, as well as plenty of exercises to evaluate yourself.

Setup of Single Node Big Data Cluster

Many of you would like to transition to Big Data from conventional technologies such as Mainframes or Oracle PL/SQL, and you might not have access to Big Data clusters. It is very important that you set up the environment in the right manner. Don't worry if you do not have a cluster handy; we will guide you through support via Udemy Q&A.

  • Set up an Ubuntu-based AWS Cloud9 instance with the right configuration

  • Ensure Docker is set up

  • Set up Jupyter Lab and other key components

  • Set up and validate Hadoop, Hive, YARN, and Spark

Are you feeling a bit overwhelmed about setting up the environment? Don't worry. We will provide complimentary lab access for up to 2 months. Here are the details.

  • Training using an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing a 5* rating and feedback, the lab access will be extended by an additional 6 weeks (2 months in total). Feel free to send an email to support@itversity.com to get complimentary lab access. Also, if your employer provides a multi-node environment, we will help you set up the material for practice as part of a live session. On top of Q&A support, we also provide the required support via live sessions.

A quick recap of Scala

This course requires a decent knowledge of Scala. To make sure you understand Spark from a Data Engineering perspective, we have added a module to quickly warm up with Scala. If you are not familiar with Scala, we suggest you first go through relevant courses on the Scala programming language.
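
If you want a quick self-check before the warm-up module, the collection combinators below are the plain-Scala counterparts of the Data Frame operations used later in the course. This is only an illustrative sketch; the case class and field names are made up, not taken from the course material.

```scala
// Quick Scala warm-up: map, filter, and sum over a collection mirror the
// select/filter/aggregate transformations applied later to Data Frames.
// All names here are illustrative.
case class Order(id: Int, status: String, amount: Double)

val orders = List(
  Order(1, "COMPLETE", 100.0),
  Order(2, "PENDING", 250.0),
  Order(3, "COMPLETE", 75.5)
)

// filter + map, as you would with a Data Frame's filter/select
val completedAmounts = orders.filter(_.status == "COMPLETE").map(_.amount)

// aggregate, as you would with groupBy().agg()
val totalCompleted = completedAmounts.sum
```

If this snippet reads naturally to you, you are likely ready for the Spark modules; if not, a Scala refresher first will pay off.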

Data Engineering using Spark SQL

Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering pipelines. Spark SQL gives us the distributed computing capabilities of Spark coupled with an easy-to-use, developer-friendly SQL-style syntax.

  • Getting Started with Spark SQL

  • Basic Transformations using Spark SQL

  • Managing Spark Metastore Tables - Basic DDL and DML

  • Managing Spark Metastore Tables - DML and Partitioning

  • Overview of Spark SQL Functions

  • Windowing Functions using Spark SQL
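
As a taste of the aggregation and windowing topics listed above, the per-group logic of a GROUP BY and a RANK() OVER (PARTITION BY ... ORDER BY ...) can be previewed with plain Scala collections. Spark SQL runs the same logic over distributed Metastore tables; the table and column names below are hypothetical, not from the course.

```scala
// Sketch of what a Spark SQL GROUP BY and a RANK() window function compute,
// modeled on in-memory Scala collections with illustrative names.
case class Sale(region: String, amount: Double)

val sales = List(
  Sale("EU", 300.0), Sale("EU", 100.0),
  Sale("US", 500.0), Sale("US", 200.0)
)

// SELECT region, SUM(amount) FROM sales GROUP BY region
val totals: Map[String, Double] =
  sales.groupBy(_.region).view.mapValues(_.map(_.amount).sum).toMap

// RANK() OVER (PARTITION BY region ORDER BY amount DESC):
// rank rows within each region by descending amount
val ranked: Map[String, List[(Sale, Int)]] =
  sales.groupBy(_.region).view.mapValues { rows =>
    rows.sortBy(-_.amount).zipWithIndex.map { case (s, i) => (s, i + 1) }
  }.toMap
```

The course covers the real SQL syntax for these; the sketch only shows what the engine computes per partition.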

Data Engineering using Spark Data Frame APIs

Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale, leveraging the distributed computing capabilities of Spark. Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL for building Data Engineering applications.

  • Data Processing Overview using Spark Data Frame APIs leveraging Scala as Programming Language

  • Processing Column Data using Spark Data Frame APIs leveraging Scala as Programming Language

  • Basic Transformations using Spark Data Frame APIs leveraging Scala as Programming Language - Filtering, Aggregations, and Sorting

  • Joining Data Sets using Spark Data Frame APIs leveraging Scala as Programming Language
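
To preview the joining module above, here is the relational logic of an inner join sketched on in-memory Scala collections. Spark's Data Frame join expresses the same idea over distributed data; the case classes and field names below are hypothetical.

```scala
// Inner-join sketch: only rows whose keys match on both sides survive,
// which is what a Data Frame inner join computes at scale.
case class Customer(id: Int, name: String)
case class Purchase(customerId: Int, item: String)

val customers = List(Customer(1, "Asha"), Customer(2, "Ravi"))
val purchases = List(Purchase(1, "laptop"), Purchase(1, "mouse"), Purchase(3, "desk"))

// join on customer id; Purchase(3, "desk") has no matching customer
// and Customer(2, "Ravi") has no purchases, so both drop out
val joined: List[(String, String)] = for {
  c <- customers
  p <- purchases if p.customerId == c.id
} yield (c.name, p.item)
// joined == List(("Asha", "laptop"), ("Asha", "mouse"))
```

The course then builds on this with outer joins and join strategies using the actual Data Frame APIs.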

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one month of complimentary lab access by reaching out to support@itversity.com with a Udemy receipt.


What's inside

Syllabus

Introduction
CCA 175 Spark and Hadoop Developer - Curriculum
Set up self support lab to prepare for CCA 175 Certification on AWS using Cloud9
Getting Started with Cloud9

Traffic lights

Read about what's good, what should give you pause, and possible dealbreakers:
Covers building data engineering pipelines using Spark SQL and Spark Data Frame APIs, which are essential for processing data at scale in modern data architectures
Includes setting up a single-node big data cluster, which is helpful for those transitioning from conventional technologies and lacking access to existing big data environments
Requires a decent knowledge of Scala and includes a module to quickly warm up with Scala, but suggests relevant courses for those not familiar with the language
Provides complementary lab access for hands-on practice, enhancing the learning experience and skill development in a practical environment
Teaches Spark 2 and Spark 3, which may require learners to manage multiple versions of Spark and understand the differences between them
Includes content related to CCA 175 Spark and Hadoop Developer certification, which has been sunset, so some content may be less relevant to current industry practices


Reviews summary

Practical Spark SQL and DataFrames with Scala

According to learners, this course offers a solid foundation in Spark SQL and DataFrames using Scala, primarily designed for those aiming for Data Engineering roles. Many students highlighted the hands-on labs and exercises as a major strength, providing practical experience essential for application. The course provides detailed guidance for setting up a Big Data environment, a step some found complex but necessary. It effectively covers the essentials of Spark for data processing. Reviewers consistently noted the importance of having a decent prior understanding of Scala, as the introductory section is brief. While generally well-received for its practical focus, some reviews suggested certain parts might feel slightly outdated. Overall, it is viewed as a valuable course for getting started with Spark.
Strong Scala skills are recommended.
"You really need a decent understanding of Scala before taking this course."
"The Scala warm-up section is too brief if you're not already familiar."
"Wish I had stronger Scala skills going into this, it would have helped."
"Recommends a good grasp of Scala fundamentals, which I found necessary."
Environment setup is detailed but can be challenging.
"Setting up the environment was quite involved, taking significant time."
"The instructions for Cloud9 setup were helpful, although troubleshooting was needed at times."
"Environment setup felt like the hardest part, requiring patience."
"Setting up the lab environment is crucial and well-documented, but be prepared for potential issues."
Geared towards practical application.
"The course has a very practical, hands-on approach."
"Liked that it focuses on applying Spark to real-world tasks."
"Great for learning how to use Spark for actual data processing jobs."
"The course is highly practical and focused on implementation."
Good introduction to core Spark concepts.
"Provides a solid introduction to Spark SQL and DataFrames."
"I got a good understanding of the basic Spark operations needed for data engineering."
"Covers the essentials of Spark DataFrames effectively."
"The course material explains fundamental Spark concepts clearly."
Provides essential practical experience.
"The hands-on labs were the most valuable part of the course for me."
"I learned so much by actually doing the exercises in the labs."
"The emphasis on hands-on coding and labs made the concepts stick better."
"Practical application through labs is a major strength here."
Some parts feel a bit old.
"Some sections felt a bit outdated compared to current industry practices."
"While it mentions Spark 3, parts of the course seem based on older versions or approaches."
"Could use some updates to reflect the latest Spark features and best practices."
"Some material felt slightly behind the curve."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Spark SQL and Spark 3 using Scala Hands-On with Labs with these activities:
Review Scala Fundamentals
Reviewing Scala fundamentals will ensure a smoother learning experience when applying Spark Data Frame APIs.
Show steps
  • Review basic syntax and data types.
  • Practice writing simple Scala functions.
  • Familiarize yourself with Scala collections.
Practice Basic HDFS Commands
Practicing HDFS commands will help you manage data within the Hadoop environment used by Spark.
Show steps
  • Practice listing, creating, and deleting directories.
  • Practice copying files between local and HDFS.
  • Practice checking file metadata and storage usage.
Read "Learning Spark"
Reading "Learning Spark" will provide a deeper understanding of the underlying concepts and best practices for using Spark SQL and DataFrames.
Show steps
  • Read the chapters related to Spark SQL and DataFrames.
  • Work through the examples provided in the book.
  • Compare the book's examples with the course's labs.
Implement Spark SQL Queries
Practicing Spark SQL queries will reinforce your understanding of SQL syntax and its application within the Spark environment.
Show steps
  • Write SQL queries to filter, aggregate, and sort data.
  • Experiment with different Spark SQL functions.
  • Optimize query performance using techniques learned in the course.
Build a Simple Data Pipeline
Building a data pipeline will allow you to apply your knowledge of Spark SQL and DataFrames to a real-world problem.
Show steps
  • Define the data source and target for the pipeline.
  • Implement data extraction, transformation, and loading (ETL) using Spark SQL or DataFrames.
  • Test and validate the pipeline's functionality.
  • Document the pipeline's design and implementation.
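
Following the steps above, a minimal end-to-end extract/transform/load flow can be sketched in plain Scala. The course builds the same flow with Spark SQL or Data Frames over distributed storage; the CSV-style input and all names here are illustrative.

```scala
// Minimal ETL sketch mirroring the pipeline steps above, using in-memory
// data in place of a real source and target.

// Extract: raw CSV-like lines (the "source")
val raw = List("1,alice,120.0", "2,bob,80.0", "3,carol,200.0")

case class Row(id: Int, name: String, amount: Double)

// Transform: parse each line and filter out small amounts
val transformed = raw
  .map(_.split(","))
  .map { case Array(id, name, amt) => Row(id.toInt, name, amt.toDouble) }
  .filter(_.amount >= 100.0)

// Load: write to the "target" (here an in-memory map keyed by id)
val target: Map[Int, Row] = transformed.map(r => r.id -> r).toMap
```

Swapping the source for files on HDFS and the target for a Metastore table turns this sketch into the kind of pipeline the activity asks you to build and document.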
Write a Blog Post on Spark Optimization
Writing a blog post will help you consolidate your knowledge of Spark optimization techniques and share your insights with others.
Show steps
  • Research different Spark optimization techniques.
  • Choose a specific optimization technique to focus on.
  • Write a clear and concise explanation of the technique.
  • Provide examples of how to apply the technique in practice.
Read "Spark: The Definitive Guide"
Reading "Spark: The Definitive Guide" will provide a more in-depth understanding of Spark's capabilities and advanced features.
Show steps
  • Read the chapters related to advanced Spark SQL features.
  • Explore the book's examples of complex data transformations.
  • Compare the book's recommendations with your own experiences.

Career center

Learners who complete Spark SQL and Spark 3 using Scala Hands-On with Labs will develop knowledge and skills that may be useful to these careers:
Data Engineer
A Data Engineer designs, builds, and manages data pipelines, and this course is directly relevant to this role. Data Engineers transform data into a format that is more useful for analysis. This course helps build a foundation in using Spark SQL and Spark Data Frame APIs with Scala, which are core technologies for data engineering pipelines. You can learn how to process data at scale, set up a big data cluster, and use tools like Hadoop, Hive, and YARN. If you want to become a Data Engineer, this course will prepare you, especially if you want to use Spark and Scala in your daily work.
Analytics Engineer
An Analytics Engineer focuses on transforming raw data into usable datasets for analysis. This course is directly relevant because it covers data engineering pipelines using Spark SQL and Spark Data Frame APIs. This course helps build skills in data transformation, data modeling, and pipeline development, which are core to analytics engineering. Learning Scala within the Spark ecosystem provides the practical knowledge needed for a successful career as an Analytics Engineer working with big data technologies.
Big Data Developer
A Big Data Developer is involved in developing and maintaining scalable big data solutions. This course provides the relevant skills for developing these solutions using Spark and Scala. You will gain hands-on experience in setting up and configuring a big data cluster, using Spark SQL and Data Frame APIs for data processing, and working with related technologies like Hadoop and Hive. The course's focus on practical exercises and real-world tasks makes it valuable for anyone wanting to develop big data applications. If your goal is to become a Big Data Developer, this course can help you gain the necessary hands-on skills.
ETL Developer
An Extract, Transform, Load or ETL Developer builds data pipelines to extract data from various sources, transforms it, and loads it into a data warehouse. This course may be useful because it focuses on building data engineering pipelines using Spark SQL and Spark Data Frame APIs with Scala. The course content on data processing and transformations directly applies to building efficient ETL processes. Additionally, the experience of setting up and managing Big Data clusters as taught in the course is also relevant to an ETL Developer. You will gain practical experience with industry-standard tools.
Data Architect
A Data Architect designs and manages the data infrastructure for an organization. This course provides a strong foundation in the technologies used to build scalable data pipelines. You will gain hands-on experience with Spark SQL, Data Frame APIs, and big data cluster setup, which are all essential for designing efficient and reliable data architectures. If you want to become a Data Architect, this course will help you develop a practical understanding of the technologies needed to build modern data infrastructure including data lakes and data warehouses.
Data Warehouse Architect
A Data Warehouse Architect designs and oversees the implementation of data warehousing solutions. This course can help build a solid foundation in the technologies used in modern data warehousing, particularly Apache Spark. Knowing how to use Spark SQL and Data Frame APIs is essential for anyone architecting data solutions at scale. Furthermore, the course's coverage of setting up and managing big data clusters provides practical insights into the infrastructure aspects of data warehousing. If you aspire to be a Data Warehouse Architect, this course will help you understand the practical considerations of building scalable and efficient data warehouses.
Machine Learning Engineer
A Machine Learning Engineer develops and deploys machine learning models. This course may be useful for understanding how to process and prepare data for machine learning at scale. The course's focus on Spark SQL and Data Frame APIs allows you to become proficient in data manipulation and transformation, which is an important step in the machine learning pipeline. You will also learn how to work with big data technologies that are commonly used in machine learning workflows. For those who wish to become a Machine Learning Engineer, this course will help develop skills in data engineering for machine learning.
Data Scientist
A Data Scientist analyzes large datasets, develops statistical models, and derives insights to inform business decisions. While Data Scientists often focus on the analytical aspects, understanding data engineering is becoming increasingly important. This course can provide a strong foundation in data processing using Spark SQL and Data Frame APIs. You will gain the skills to manipulate and transform data at scale, which is valuable for preparing data for analysis and modeling. This course is relevant for Data Scientists who want to expand their skillset into data engineering aspects.
Solutions Architect
A Solutions Architect designs and implements IT solutions to address business problems. This course can be beneficial in understanding how to design data-centric solutions using Apache Spark. The knowledge of Spark SQL, Data Frame APIs, and big data cluster setup will help you design scalable and efficient data processing solutions. This course provides the practical skills needed to make informed decisions about data architecture and technology choices. For aspiring Solutions Architects, this course helps develop a practical understanding of big data technologies.
Cloud Engineer
A Cloud Engineer builds and maintains cloud infrastructure and services. The course can prove useful because it involves setting up a big data cluster on AWS Cloud9 or GCP, which helps build practical experience with cloud environments. Understanding how to deploy and manage big data technologies such as Spark, Hadoop, and related tools in the cloud is also relevant. For Cloud Engineers who want to specialize in big data deployments, this course offers a useful skillset.
Software Engineer
A Software Engineer designs, develops, and tests software applications. This course provides valuable experience in using Scala, a programming language often used in building scalable and high-performance applications. The course's coverage of Spark SQL and Data Frame APIs within the context of Scala can be valuable for Software Engineers working on data-intensive applications. If you are a Software Engineer looking to expand your skillset into big data processing, this course is relevant.
Database Administrator
A Database Administrator manages and maintains databases, ensuring their availability, performance, and security. This course can be relevant because it covers aspects of managing data within a big data environment using technologies like Hadoop and Hive. Setting up and configuring these systems, as taught in the course, provides valuable experience that can be applied to managing data in distributed systems. The course may be useful in expanding a Database Administrator's skill set into the realm of big data technologies.
Data Analyst
A Data Analyst interprets data and transforms it into insights that inform business decisions. While Data Analysts often use tools like SQL and Excel, understanding big data technologies can be increasingly valuable. This course may be useful by giving familiarity with Spark SQL and Data Frame APIs for data manipulation. You can learn how to process large datasets and extract meaningful information. This course is relevant for Data Analysts who wish to expand their skills in big data processing.
Application Developer
An Application Developer designs and codes applications. This course can be beneficial for Application Developers who want to work on data-intensive applications that require scalable data processing. Learning Spark SQL and Data Frame APIs with Scala will enable you to build applications that can efficiently handle large datasets. This course helps expand an Application Developer's skill set into the realm of big data and distributed computing.
Business Intelligence Analyst
A Business Intelligence Analyst analyzes data to identify trends and insights that help improve business performance. This course can be helpful in understanding how data is processed and transformed in a big data environment. The course's coverage of Spark SQL and data frame APIs allows you to learn how to efficiently query and analyze large datasets. For aspiring Business Intelligence Analysts, this course provides an understanding of data processing technologies used in modern business intelligence systems.

Reading list

We've selected two books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Spark SQL and Spark 3 using Scala Hands-On with Labs.
Provides a comprehensive overview of Apache Spark, covering Spark SQL and DataFrames in detail. It serves as an excellent reference for understanding the core concepts and APIs used in the course. The book offers practical examples and use cases that complement the hands-on labs. It is commonly used as a reference by data engineers and data scientists.
Offers a comprehensive guide to Apache Spark, covering a wide range of topics from basic concepts to advanced techniques. It provides in-depth explanations of Spark SQL, DataFrames, and other key components. This book is valuable as additional reading to expand on the course material. It is commonly used by industry professionals and academics.

Our mission

OpenCourser helps millions of learners each year. People visit us to learn workplace skills, ace their exams, and nurture their curiosity.

Our extensive catalog contains over 50,000 courses and twice as many books. Browse by search, by topic, or even by career interests. We'll match you to the right resources quickly.

Find this site helpful? Tell a friend about us.

Affiliate disclosure

We're supported by our community of learners. When you purchase or subscribe to courses and programs or purchase books, we may earn a commission from our partners.

Your purchases help us maintain our catalog and keep our servers humming without ads.

Thank you for supporting OpenCourser.

© 2016 - 2025 OpenCourser