Handling Batch Data with Apache Spark on Databricks

Janani Ravi

This course will teach you how to transform and aggregate batch data with Apache Spark on the Azure Databricks platform using selection, filter, and aggregation queries and built-in and user-defined functions, and how to perform windowing and join operations on batch data.

Azure Databricks allows you to work with big data processing and queries using the Apache Spark unified analytics engine. It works with a variety of batch sources and makes it seamless to analyze, visualize, and process data on the Azure cloud platform.

In this course, Handling Batch Data with Apache Spark on Databricks, you will learn how to perform transformations and aggregations on batch data with selection, filtering, grouping, and ordering queries that use the DataFrame API. You will understand the difference between narrow and wide transformations in Spark, which will help you figure out why certain transformations are more efficient than others. You will also see how to run these same transformations as SQL queries on your data.

Next, you will learn how to implement your own custom user-defined functions (UDFs) to process your data. You will write code in Azure Databricks notebooks to define and register your UDFs and use them to transform your data. You will also understand how to define and use different flavors of vectorized UDFs for data processing and learn why vectorized UDFs are often more efficient than regular UDFs. Along the way, you will also see how to read from Azure Cosmos DB as a source for your batch data.

Finally, you will see how to repartition your data in memory to improve processing performance, use window functions to compute statistics on your data, and combine DataFrames using union and join operations. When you’re finished with this course, you will have the skills and ability to perform advanced transformations and aggregations on batch data, including defining and using user-defined functions for processing.
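
As a rough illustration of the narrow versus wide distinction mentioned in the description above, here is a minimal plain-Python sketch. It is not Spark code and is not taken from the course: partitions are modeled as ordinary lists, and the data and the names `narrow_filter`, `wide_group_sum`, and `partition_for` are hypothetical. The point is only that a filter can run on each partition independently, while a grouped aggregation has to move rows with the same key to the same place first.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands in for one partition of (key, value) rows.
partitions = [
    [("apples", 3), ("pears", 5)],
    [("apples", 2), ("plums", 7)],
]


def narrow_filter(parts, predicate):
    """Narrow: every output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def partition_for(key, num_partitions):
    """Deterministic toy partitioner (an assumption, not Spark's hash partitioner)."""
    return sum(key.encode("utf-8")) % num_partitions


def wide_group_sum(parts, num_output_partitions):
    """Wide: rows sharing a key are routed to one output partition before summing."""
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            shuffled[partition_for(key, num_output_partitions)][key] += value
    return [dict(bucket) for bucket in shuffled]


filtered = narrow_filter(partitions, lambda row: row[1] >= 3)
totals = wide_group_sum(partitions, 2)
```

In this model, `filtered` keeps the original two-partition layout because no row ever leaves its partition, whereas `totals` is rebuilt from rows drawn from both inputs. That cross-partition movement is the extra cost that makes wide transformations generally more expensive.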

What's inside

Syllabus

Course Overview
Transforming Data Using DataFrames
Transforming Data Using Spark SQL
Applying User-defined Functions to Transform Data
Processing Data Using Joins and Window Functions

Good to know

Know what's good, what to watch for, and possible dealbreakers
Develops data processing capabilities, which are core skills for data analysts and engineers
Explores concepts of batch data manipulation with Apache Spark and Azure Databricks, which is standard in industry
Offers practical experience through Azure Databricks notebooks, valuable for hands-on learners
Instructed by Janani Ravi, a recognized expert in data analytics
Requires prior experience with data processing concepts and programming
May require additional resources, such as a subscription fee, to access Azure Databricks notebooks

Career center

Learners who complete Handling Batch Data with Apache Spark on Databricks will develop knowledge and skills that may be useful to these careers:
Data Scientist
A data scientist, using both programming and statistical analysis, uncovers knowledge from data. Data science is used to create predictive models that can, for example, predict weather patterns or determine sales outcomes. This course is a great introduction to transforming and aggregating data using Apache Spark on Azure Databricks, skills that are essential to data science.
Data Engineer
A data engineer is responsible for designing, building, and maintaining data pipelines. This course may be useful for data engineers who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data.
Data Architect
A data architect designs and builds the architecture for data systems. This course may be useful for data architects who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data.
Data Visualization Analyst
A data visualization analyst creates visual representations of data. This course may be useful for data visualization analysts who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate data in order to create visualizations that can be used to communicate insights to stakeholders.
Machine Learning Engineer
A machine learning engineer constructs and deploys machine learning models. This course may be useful for machine learning engineers who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data for use in machine learning models.
Statistician
A statistician collects, analyzes, interprets, and presents data. This course may be useful for statisticians who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of data.
Operations Research Analyst
An operations research analyst uses mathematical and analytical methods to solve complex business problems. This course may be useful for operations research analysts who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of data.
Quantitative Analyst
A quantitative analyst uses mathematical and statistical models to analyze and interpret financial data. This course may be useful for quantitative analysts who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of financial data.
Computer Scientist
A computer scientist researches and develops new computing technologies. This course may be useful for computer scientists who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of data.
Software Developer
A software developer designs, builds, and maintains software systems. This course may be useful for software developers who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data.
Actuary
An actuary is a business professional who assesses financial risks in the insurance and finance industries. This course may be useful for actuaries who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of data.
Database Administrator
A database administrator is responsible for the installation, configuration, maintenance, and performance of database management systems. This course may be useful for database administrators who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate large amounts of data.
Business Analyst
A business analyst is a professional who can identify and solve business problems through the use of data analysis. This course may be useful for business analysts who need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data in order to gain insights that can be used to improve business processes.
Data Analyst
Data analysts gather, clean, and analyze data in order to obtain insights that may be useful for a business. This course, Handling Batch Data with Apache Spark on Databricks, may be useful to anyone in this field because it shows how to read and analyze data from a variety of sources, including Azure Cosmos DB.
Software Engineer
Software engineers research, plan, design, develop, test, and deploy computer systems and applications. This course can be useful for software engineers who are working on big data projects and need to learn how to use Apache Spark on Azure Databricks to transform and aggregate batch data.

Reading list

We've selected 12 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Handling Batch Data with Apache Spark on Databricks.
Goes into great depth on Apache Spark and would be a valuable reference for anyone working with Spark. This book is considered more in-depth and would be useful for learners who want to gain a deeper understanding of Spark's capabilities.
Provides a comprehensive overview of using Python for data analysis. It covers a wide range of topics, from basic concepts to advanced techniques, and includes hands-on examples and exercises.
Provides a comprehensive overview of using Apache Spark for data processing and analytics. It covers a wide range of topics, from basic concepts to advanced techniques, and includes hands-on examples and exercises.
Provides a comprehensive overview of using Python for data science. It covers a wide range of topics, from basic concepts to advanced techniques, and includes hands-on examples and exercises.
This is an introductory book on Apache Spark that would be helpful for learners who are new to big data. It can be used as a primary or supplementary resource for this course.
Focuses on using R for data science. This book would be helpful for learners who are interested in learning more about using R for data science.
Focuses on using Python for data analysis. This book would be helpful for learners who are interested in learning more about using Python for data analysis.
Focuses on data science from scratch. This book would be helpful for learners who are interested in learning more about data science from a beginner's perspective and want to fill in any knowledge gaps.
Focuses on data analysis with Pandas. This book would be helpful for learners who are interested in learning more about data analysis with Pandas.
Focuses on machine learning with Python. This book would be helpful for learners who are interested in learning more about machine learning with Python.
Focuses on deep learning with Python. This book would be helpful for learners who are interested in learning more about deep learning with Python.
Focuses on natural language processing with Python. This book would be helpful for learners who are interested in learning more about natural language processing with Python.

Similar courses

Here are nine courses similar to Handling Batch Data with Apache Spark on Databricks.
Getting Started with Apache Spark on Databricks
Data Engineering using Databricks on AWS and Azure
Building Batch Data Processing Solutions in Microsoft...
DP-203: Processing in Azure Using Streaming Solutions
Apache Spark 3 Fundamentals
Conceptualizing the Processing Model for Azure Databricks...
Conceptualizing the Processing Model for Apache Spark...
Building Your First ETL Pipeline Using Azure Databricks
Optimizing Apache Spark on Databricks