We may earn an affiliate commission when you visit our partners.
Bill Howe

Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered.

You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project. At the end of this course, you will be able to:

Learning Goals:

1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.

2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, MapReduce, and other data flow models.

3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.

4. Evaluate key-value stores and NoSQL systems, describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.

5. “Think” in MapReduce to effectively write algorithms for systems including Hadoop and Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages. You will also write programs in Spark.

6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.

What's inside

Syllabus

Data Science Context and Concepts
Understand the terminology and recurring principles associated with data science, and understand the structure of data science projects and emerging methodologies to approach them. Why does this emerging field exist? How does it relate to other fields? How does this course distinguish itself? What do data science projects look like, and how should they be approached? What are some examples of data science projects?
Relational Databases and the Relational Algebra
Relational databases are the workhorse of large-scale data management. Although originally motivated by problems in enterprise operations, they have proven remarkably capable for analytics as well. Most importantly, the principles underlying relational databases are universal in managing, manipulating, and analyzing data at scale. Even as the landscape of large-scale data systems has expanded dramatically in the last decade, relational models and languages have remained a unifying concept. For working with large-scale data, there is no more important programming model to learn.
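To make the relational algebra concrete, here is a minimal sketch of three core operators (selection, projection, and natural join) written in plain Python over lists of dictionaries. It is an illustration only and is not taken from the course materials; the function names, the `employees` and `departments` relations, and their contents are all hypothetical.

```python
# Illustrative sketch only: relations are lists of dicts.
# All names and data below are hypothetical, not from the course.

def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the named attributes and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def natural_join(left, right):
    """Natural join: combine rows that agree on every shared attribute."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in shared):
                result.append({**l_row, **r_row})
    return result


employees = [
    {"name": "Ada", "dept_id": 1},
    {"name": "Grace", "dept_id": 2},
    {"name": "Edgar", "dept_id": 1},
]
departments = [
    {"dept_id": 1, "dept": "Research"},
    {"dept_id": 2, "dept": "Systems"},
]

# Roughly: SELECT name FROM employees NATURAL JOIN departments
#          WHERE dept = 'Research'
research_names = project(
    select(natural_join(employees, departments),
           lambda row: row["dept"] == "Research"),
    ["name"],
)
print(research_names)
```

Because each operator takes relations in and returns a relation, the operators compose freely, which is the property that lets a database rewrite and parallelize a query. The nested-loop join here is the simplest possible strategy; real systems choose among several join algorithms.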
MapReduce and Parallel Dataflow Programming
The MapReduce programming model (as distinct from its implementations) was proposed as a simplifying abstraction for parallel manipulation of massive datasets, and remains an important concept to know when using and evaluating modern big data platforms.
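As a rough illustration of the programming model itself, separate from any implementation such as Hadoop, the sketch below runs a word count through explicit map, shuffle, and reduce phases in a single Python process. Everything here (the helper names, the two sample documents) is hypothetical; a real platform would distribute each phase across many machines.

```python
from collections import defaultdict

# Illustrative single-process sketch of the MapReduce model.
# Helper names and sample data are hypothetical.


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(doc_id, text):
    # Emit (word, 1) for every word occurrence in the document.
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


documents = [
    ("d1", "big data big systems"),
    ("d2", "data systems at scale"),
]

counts = reduce_phase(
    shuffle(map_phase(documents, word_count_mapper)),
    word_count_reducer,
)
print(counts)
```

The programmer supplies only the mapper and the reducer. Grouping by key is the framework's job, which is what makes the model a simplifying abstraction: the two user functions contain no parallelism logic at all.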
NoSQL: Systems and Concepts
NoSQL systems are purely about scale rather than analytics, and are arguably less relevant for the practicing data scientist. However, they occupy an important place in many practical big data platform architectures, and data scientists need to understand their limitations and strengths to use them effectively.
Graph Analytics
Graph-structured data are increasingly common in data science contexts due to their ubiquity in modeling the communication between entities: people (social networks), computers (Internet communication), cities and countries (transportation networks), or corporations (financial transactions). Learn the common algorithms for extracting information from graph data and how to scale them up.
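Two of the simplest ways to extract information from graph data are counting each node's connections (its degree) and measuring how many hops separate nodes. The sketch below shows both on a small undirected graph using breadth-first search. It is an assumption-laden illustration rather than course material: the edge list, the node names, and the function names are all made up, and it runs on one machine rather than at scale.

```python
from collections import defaultdict, deque

# Illustrative sketch only: a tiny undirected "social network".
# Node names, edges, and function names are hypothetical.


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


def hop_distances(adjacency, start):
    """Breadth-first search: fewest hops from start to each reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


edges = [
    ("ann", "bob"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("ann", "eve"),
]

adjacency = build_adjacency(edges)
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
distances = hop_distances(adjacency, "ann")

print(degrees)
print(distances)
```

Degree counting is a single group-by over the edge list, so it fits the relational and MapReduce models directly. Breadth-first search is iterative, expanding one frontier per round, which is one reason graph workloads often motivate specialized systems.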

Good to know

Know what's good, what to watch for, and possible dealbreakers
Developers and data engineers who work with large-scale data will benefit from building MapReduce programming skills for writing algorithms in Hadoop and Spark.
Students will gain an understanding of the tradeoffs associated with the different NoSQL systems.
Develops an understanding of relational databases, NoSQL systems, and the MapReduce programming model.
Discusses why the MapReduce programming model (as opposed to its implementations) is an important concept for understanding how parallel data manipulation is managed.
Teaches how to evaluate key-value stores and NoSQL systems.
Develops competency in writing programs in Spark.

Reviews summary

Data manipulation course

According to students, this data manipulation course is fundamentally robust and covers a wide range of material including SQL, Python, Twitter API, and MapReduce. Introductory students may find the material to be challenging, but most appreciate the practical nature of the projects and assignments.
Projects and assignments involve real-world applications.
"Doing manipulation and calculations directly in the database was a new idea to me..."
Course material spans many data manipulation techniques.
"Very wide and fundamentally robust introduction."
Students without prior experience may struggle.
"If you have no prior experience in python or sql, you should get some before enrolling."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Data Manipulation at Scale: Systems and Algorithms with these activities:
Organize Course Notes and Assignments
Improves your organization and facilitates efficient review of course materials.
  • Create a system for organizing your notes, assignments, and other course materials.
  • Regularly review and update your notes to ensure they are comprehensive and accurate.
  • Use your organized materials to prepare for exams and assignments.
Solve LeetCode Problems on Data Structures and Algorithms
Sharpens your problem-solving skills and strengthens your understanding of fundamental data science concepts.
  • Select LeetCode problems related to data structures and algorithms.
  • Work through the problems, implementing efficient solutions.
  • Review your solutions and identify areas for improvement.
Join a Data Science Study Group
Provides a collaborative environment to discuss concepts, solve problems, and share knowledge.
  • Find a study group that aligns with your interests and learning goals.
  • Attend regular meetings and actively participate in discussions.
  • Work together on projects or assignments to reinforce your learning.
Read The Data Science Handbook
Provides a comprehensive overview of the field and its core principles, helping you build a strong foundation.
  • Read each chapter thoroughly, taking notes on key concepts and ideas.
  • Complete the end-of-chapter exercises to test your understanding.
  • Discuss the book's content with classmates or colleagues to reinforce your learning.
Follow a Spark Tutorial Series
Enhances your practical skills by guiding you through the process of working with Spark.
  • Choose a reputable tutorial series that covers the fundamentals of Spark.
  • Follow the tutorials step-by-step, implementing the code examples provided.
  • Experiment with different Spark features and functions to deepen your understanding.
Build a Data Visualization Project
Reinforces your understanding of data analysis and presentation techniques.
  • Identify a dataset that you're interested in exploring.
  • Choose a data visualization tool and learn its basic functionality.
  • Create a visualization that effectively communicates the insights from your data.
Develop a Data Analysis Pipeline
Provides hands-on experience in designing and implementing data analysis pipelines, enhancing your practical skills.
  • Define the data sources and the desired output of your pipeline.
  • Choose appropriate tools and technologies for data ingestion, processing, and analysis.
  • Implement the pipeline and test its functionality.
Attend Data Science Industry Meetups
Connects you with professionals in the field, exposes you to industry trends, and expands your professional network.
  • Identify relevant meetups in your area or online.
  • Attend meetups and actively participate in discussions.
  • Connect with individuals who share your interests and career goals.

Career center

Learners who complete Data Manipulation at Scale: Systems and Algorithms will develop knowledge and skills that may be useful to these careers:
Data Scientist
Data Scientists mine and analyze massive datasets using advanced algorithms and techniques, with the end goal of discovering actionable insights and patterns from complex data. This course provides foundational knowledge of data manipulation at scale, including relational databases, MapReduce, and NoSQL systems. This powerful skillset will be critical for you as a Data Scientist, as it will allow you to handle large-scale data efficiently and effectively, supporting your efforts to draw meaningful conclusions and make data-driven decisions.
Data Analyst
Data Analysts are responsible for collecting, cleaning, and analyzing data to identify trends and patterns, extracting meaningful insights, and recommending actions. This course will be a great foundation for your work as a Data Analyst, as it covers practical systems, principles, and tradeoffs involved in data manipulation at scale. With this knowledge, you will be well-prepared to tackle the challenges of working with large and complex datasets in your role as a Data Analyst.
Database Administrator
Database Administrators (DBAs) are responsible for managing and maintaining database systems, ensuring data integrity, performance, and security. By taking this course, you will gain a solid foundation in data manipulation at scale, including relational databases and advanced data processing techniques. This knowledge is critical for DBAs, as it will help you manage and optimize database systems effectively, ensuring the availability and reliability of data for your organization.
Data Engineer
Data Engineers design, build, and maintain the infrastructure and systems that support data storage, processing, and analysis. This course will equip you with a deep understanding of the principles and techniques used in data manipulation at scale. You will learn about relational databases, MapReduce, and NoSQL systems, which are essential technologies for Data Engineers. This knowledge will be instrumental in your ability to design and implement scalable and efficient data systems.
Data Architect
Data Architects design and manage the overall data architecture for an organization, ensuring that data is structured and organized to meet the needs of the business. This course will be highly valuable to you in your role as a Data Architect, as it will provide you with a comprehensive understanding of data manipulation at scale. You will learn about the principles, techniques, and tradeoffs involved in designing and implementing scalable data architectures, empowering you to make informed decisions and create efficient data systems.
Software Engineer
Software Engineers design, develop, and maintain software systems. As a Software Engineer specializing in data-intensive applications, you will need to have a strong understanding of data manipulation at scale. This course will provide you with the knowledge and skills you need to work with large datasets efficiently using relational databases, MapReduce, and NoSQL systems.
Statistician
Statisticians collect, analyze, interpret, and present data to help businesses and organizations make informed decisions. This course may be helpful for your work as a Statistician, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge can help you better understand the data your organization has and how it can be used to make better decisions.
Data Visualization Engineer
Data Visualization Engineers design and build data visualizations that help people understand data. This course may be helpful for your work as a Data Visualization Engineer, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge will be essential for your work, as you will need to be able to efficiently prepare and process data to create clear and informative visualizations.
Machine Learning Engineer
Machine Learning Engineers design and build machine learning models that can learn from data and make predictions. This course may be helpful for you in your role as a Machine Learning Engineer, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge will be essential for your work, as you will need to be able to efficiently prepare and process large amounts of data to train and evaluate your machine learning models.
Business Analyst
Business Analysts help businesses understand their data and make better decisions. This course may be helpful for your work as a Business Analyst, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge can help you better understand the data your business has and how it can be used to improve decision-making.
Cloud Architect
Cloud Architects design and manage cloud computing systems for organizations. This course may be helpful for your work as a Cloud Architect, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge will be helpful for you when designing and managing cloud-based data systems.
Database Developer
Database Developers design and build databases for organizations. This course may be helpful for your work as a Database Developer, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge will be essential for your work, as you will need to be able to design and build efficient and scalable databases.
Systems Analyst
Systems Analysts analyze and design computer systems for organizations. This course may be helpful for your work as a Systems Analyst, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge can help you better understand the data needs of an organization and design systems to meet those needs.
Software Developer
Software Developers design, develop, and maintain software applications. This course may be helpful for your work as a Software Developer, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge will be essential for your work, as you may need to handle large amounts of data in your software applications.
Web Developer
Web Developers design and develop websites and web applications. This course may be helpful for your work as a Web Developer, as it will provide you with a foundation in data manipulation at scale. You will learn about the principles and techniques used to work with large datasets, including relational databases and advanced data processing techniques. This knowledge can help you better understand the data needs of a website or web application and design systems to meet those needs.

Reading list

We've selected 22 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Data Manipulation at Scale: Systems and Algorithms.
Provides a comprehensive overview of big data systems, including their architectures, programming models, and use cases. It is a valuable resource for anyone who wants to learn more about the principles and best practices of big data systems.
Provides a comprehensive overview of data science for business professionals. It covers the key concepts of data mining, data analytics, machine learning, and big data, and provides practical advice on how to use these technologies to improve business decision-making.
Provides a practical introduction to data science using Python, covering the fundamental concepts, techniques, and applications. It is particularly useful for readers who want to learn how to apply data science to real-world problems using Python.
Provides a comprehensive introduction to data science, covering the basics of data manipulation, data visualization, and machine learning. It is a valuable resource for anyone who wants to learn more about the fundamentals of data science.
Provides a gentle introduction to machine learning, covering the fundamental concepts, algorithms, and applications. It is particularly useful for readers who are new to the field or who want to gain a high-level understanding of machine learning.
Is the definitive guide to Spark, the open-source framework for distributed processing of large data sets. It covers all aspects of Spark, from installation and configuration to programming and troubleshooting.
Provides a comprehensive overview of deep learning, covering the basics of neural networks, convolutional neural networks, and recurrent neural networks. It is a valuable resource for anyone who wants to learn more about the fundamentals of deep learning.
Provides a comprehensive overview of graph databases, including their theory, use, and applications. It is a valuable resource for anyone who wants to learn more about graph databases and how to use them to solve real-world problems.
Provides a comprehensive overview of data science. It covers a wide range of topics, including data collection, data analysis, and data visualization.
Provides a comprehensive overview of Spark, covering the basics of Spark SQL, Spark Streaming, and Spark MLlib. It is a valuable resource for anyone who wants to learn more about the fundamentals of Spark.
Provides a comprehensive overview of machine learning for data science. It covers a wide range of topics, including supervised learning, unsupervised learning, and deep learning.
Provides a hands-on introduction to Hadoop. It covers a wide range of topics, including data storage, data processing, and data analysis.
Provides a comprehensive overview of Hadoop, covering the basics of the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Hadoop YARN. It is a valuable resource for anyone who wants to learn more about the fundamentals of Hadoop.
Provides a comprehensive overview of NoSQL databases, covering the basics of key-value stores, document databases, and graph databases. It is a valuable resource for anyone who wants to learn more about the fundamentals of NoSQL databases.
Provides a hands-on introduction to data science. It covers a wide range of topics, including data cleaning, data analysis, and data visualization.
Provides a comprehensive overview of graph databases, covering the basics of graph data models, graph query languages, and graph algorithms. It is a valuable resource for anyone who wants to learn more about the fundamentals of graph databases.
Provides a comprehensive overview of data-intensive text processing with MapReduce, covering the basics of text tokenization, text classification, and text clustering. It is a valuable resource for anyone who wants to learn more about the fundamentals of data-intensive text processing with MapReduce.
Provides a comprehensive overview of Elasticsearch, covering the basics of the Elasticsearch data model, the Elasticsearch query language, and Elasticsearch administration. It is a valuable resource for anyone who wants to learn more about the fundamentals of Elasticsearch.
Provides a comprehensive overview of MongoDB, covering the basics of the MongoDB data model, the MongoDB query language, and MongoDB administration. It is a valuable resource for anyone who wants to learn more about the fundamentals of MongoDB.
Provides a comprehensive overview of machine learning, covering the basics of supervised learning, unsupervised learning, and reinforcement learning. It is a valuable resource for anyone who wants to learn more about the fundamentals of machine learning.
Provides a comprehensive overview of data mining, covering the basics of data mining concepts, data mining algorithms, and data mining applications. It is a valuable resource for anyone who wants to learn more about the fundamentals of data mining.

© 2016 - 2024 OpenCourser