We may earn an affiliate commission when you visit our partners.
Google Cloud Training

Service level indicators (SLIs) and service level objectives (SLOs) are fundamental tools for measuring and managing reliability. In this course, students learn approaches for devising appropriate SLIs and SLOs and managing reliability through the use of an error budget.

What's inside

Syllabus

Introduction to SRE
This module is intended to bring you up to speed on the concepts underpinning SRE, CRE, and SLOs. If you're already familiar with these concepts, you may still find new information and perspectives in this module, but it is not necessary to complete it.
Targeting Reliability
In this module we’re going to talk about how you measure the desired reliability of a service. We will address what to consider when setting SLOs for your application within your organization. We'll look at the three principles we use to measure the desired reliability of a service: figuring out what you want to promise and to whom, figuring out the metrics you care about that make your service reliability “good”, and finally, deciding how much reliability is good enough.
Operating for Reliability
In this module, we’ll start by introducing a mechanism for quantifying unreliability using something called an error budget. We'll show how error budgets help you decide when to focus on making a service more reliable. And then we'll learn about some of the engineering and operational improvements that can help you do that.
Choosing a Good SLI
In this module we will start off by taking a look at some characteristics of monitoring metrics that can make them useful as SLIs and contrast these against other metrics that are less useful. Because the choice of where to measure an SLI is a key variable, we'll cover the five main ways you can measure an SLI and compare their pros and cons.
Developing SLOs and SLIs
In this module, we'll start off with an overview of our four-step process for developing SLOs and SLIs for a user journey. We'll introduce the fictional company that created our example mobile game, the infrastructure that we'll be working with, and the simple user journey we'll be applying the four-step process to.
Quantifying Risks to SLOs
In this module we'll be taking a critical look at the availability risks for our example service. We want to answer the question: "are our SLO targets and error budgets realistic?"
Consequences of SLO Misses
In this module, we'll cover best practices for documenting your SLOs, the rationale behind a formal error budget policy, and how best to create one. Finally, we'll look at an example error budget policy to understand the trade-offs and incentives that play out during negotiations over such a policy.

Good to know

Know what's good, what to watch for, and possible dealbreakers
Engages with fundamentals of SRE, CRE, and SLOs, which are important both in industry and academia
Teaches learners how to create an error budget, a valuable tool for quantifying unreliability
Explores the four-step process for developing SLOs and SLIs, providing a structured approach for learners
Taught by experts from Google Cloud Training, which is recognized for its work in the field of SRE

Reviews summary

Measuring and managing reliability through SRE

Learners say this course provides a comprehensive overview of Site Reliability Engineering (SRE), measuring reliability, and setting Service Level Objectives (SLOs) and Service Level Indicators (SLIs). The course includes hands-on assignments and peer-graded exercises. Reviewers report that assignments accurately reflect real-world scenarios. Many reviewers recommend the course for its clear explanations, engaging content, and practical knowledge.
Challenging but rewarding
"The assignments are very challenging but they are worth the effort!"
"The assessments are not quiz, were challenging due to the lack of guides, and the peer-to-peer reviews were not working well."
"The course is well crafted in a sequence which guides one to understand how should one create and manage SLIs for SLO."
Clear and well-organized
"The course is well structured with 4 peer-graded exercises with the right level of difficulty to really solidify the theory into practice."
"This course is very well presented and enjoyable while learning great things along the way !"
"The course is well crafted in a sequence which guides one to understand how should one create and manage SLIs for SLO."
"It is an amazing course to start journey in SRE. It explains the basic concepts about SRE with practical trainings."
Concepts can be immediately applied
"The exercises for defining SLI and target SLOs were helpful in understanding the theory better."
"Pretty great material, immediately applicable, useful artifacts"
"I was able to put those concepts into practice by completing the assignments."
Can be inconsistent and subjective
"The lack of exemples made me wonder many times how to formulate my answer, but comparing your work with peers is valuable when they did the exercise seriously."
"The grading and the students grading each other on such wide domain is not always fair or professional"
"Peer review portion is terrible"
"Having others who don't know what they are doing grading each other is just a bad idea."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Site Reliability Engineering: Measuring and Managing Reliability with these activities:
Review the principles of reliability engineering
Refresh your understanding of the foundational concepts of reliability engineering.
  • Read articles or watch videos about reliability engineering principles.
  • Review the concepts of availability, reliability, and maintainability.
  • Discuss reliability engineering principles with peers or mentors.
Review key server monitoring metrics
Reinforce your understanding of key metrics used in server monitoring.
  • Identify the most common server monitoring metrics.
  • Understand how each metric is calculated.
  • Explain the significance of each metric in relation to server performance.
Calculate error budgets for different SLIs
Develop proficiency in calculating error budgets, which are crucial for reliability management.
  • Review the concept of error budgets.
  • Practice calculating error budgets for various SLIs.
  • Analyze the impact of different error budgets on service reliability.
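As a rough illustration of the calculation practiced in this activity, the sketch below treats an error budget as the share of events (or time) that an SLO target allows to be bad, that is, 1 minus the target. The function names, the 99.9% target, the request counts, and the 30-day window are hypothetical values chosen for the example, not figures from the course.

```python
def error_budget_events(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_events


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_events(slo_target, total_events)
    if budget == 0:
        # A 100% target leaves no budget: any bad event overspends it.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, one million requests in the window.
    print(round(error_budget_events(0.999, 1_000_000)))      # allowed bad requests
    print(round(error_budget_minutes(0.999, 30), 1))         # allowed downtime, minutes
    print(round(budget_remaining(0.999, 1_000_000, 250), 2)) # share of budget left
```

Trying the same functions with a stricter target such as 0.9999 shows how quickly the budget shrinks, which is one way to approach the third step above.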
Explore open-source tools for SLI and SLO management
Enhance your knowledge of industry-standard tools used for SLI and SLO management.
  • Research popular open-source tools for SLI and SLO management.
  • Select a tool and follow its tutorials to set up and use it.
  • Evaluate the tool's capabilities and limitations.
Develop an SLO for a critical service
Gain practical experience in defining and quantifying service reliability.
  • Identify a critical service.
  • Define the desired reliability target.
  • Create an SLO that measures the service's reliability.
  • Present the SLO to stakeholders for feedback.
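The steps above can be sketched in code. This minimal example assumes a request-based SLI defined as the ratio of good events to valid events, which is one common formulation and not necessarily the one used in the course; the `SLO` class, the service name, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A reliability target for one user journey (illustrative structure)."""
    name: str
    target: float  # e.g. 0.995 means 99.5% of valid events must be good


def sli(good_events: int, valid_events: int) -> float:
    """Service level indicator as the proportion of valid events that were good."""
    if valid_events < 0 or good_events < 0 or good_events > valid_events:
        raise ValueError("need 0 <= good_events <= valid_events")
    if valid_events == 0:
        # Assumption: a window with no traffic is treated as fully reliable.
        return 1.0
    return good_events / valid_events


def meets_slo(slo: SLO, good_events: int, valid_events: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, valid_events) >= slo.target


if __name__ == "__main__":
    # Hypothetical critical service with a 99.5% availability target.
    checkout = SLO(name="checkout-availability", target=0.995)
    print(meets_slo(checkout, good_events=99_700, valid_events=100_000))
    print(meets_slo(checkout, good_events=99_200, valid_events=100_000))
```

Keeping the target in a small, named structure like this makes it easier to show stakeholders exactly what is being promised and to adjust the number after feedback.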
Gather resources on best practices for SLO and error budget management
Curate a collection of valuable resources for ongoing learning and reference.
  • Search for articles, blog posts, and videos on SLO and error budget management best practices.
  • Organize the resources into a central document or online repository.
  • Share the compilation with peers or contribute it to an online community.
Contribute to an open-source project focused on SLI or SLO
Gain hands-on experience and make meaningful contributions to the SLI/SLO community.
  • Identify an open-source project related to SLI or SLO.
  • Familiarize yourself with the project's codebase and documentation.
  • Identify an area where you can contribute and propose your changes.
  • Implement your changes and submit a pull request.

Career center

Learners who complete Site Reliability Engineering: Measuring and Managing Reliability will develop knowledge and skills that may be useful to these careers:
Site Reliability Engineer
A Site Reliability Engineer designs, implements, and maintains the infrastructure and software that power websites and online services. This course helps build a foundation in the principles and practices of Site Reliability Engineering. Students learn how to measure and manage reliability using service level indicators (SLIs) and service level objectives (SLOs). This course is especially relevant for Site Reliability Engineers who want to develop a deeper understanding of reliability engineering and best practices.
DevOps Engineer
A DevOps Engineer collaborates with software developers and operations teams to ensure that software is built, tested, and deployed reliably and efficiently. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is essential for DevOps Engineers who want to develop a deeper understanding of reliability engineering and best practices.
Cloud Architect
A Cloud Architect designs and manages cloud computing solutions. This course helps build a foundation in the principles and practices of Site Reliability Engineering, which is essential for Cloud Architects who want to develop a deeper understanding of reliability engineering and best practices in the cloud.
Software Engineer
A Software Engineer designs, develops, and maintains software applications. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Software Engineers who want to develop a deeper understanding of reliability engineering and best practices.
Data Engineer
A Data Engineer designs and builds data pipelines and data warehouses. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Data Engineers who want to develop a deeper understanding of reliability engineering and best practices for managing data pipelines and data warehouses.
Quality Assurance Analyst
A Quality Assurance Analyst tests and evaluates software products to ensure that they meet quality standards. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Quality Assurance Analysts who want to develop a deeper understanding of reliability engineering and best practices for testing and evaluating software products.
System Administrator
A System Administrator manages and maintains computer systems and networks. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for System Administrators who want to develop a deeper understanding of reliability engineering and best practices for managing and maintaining computer systems and networks.
Network Engineer
A Network Engineer designs, builds, and maintains computer networks. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Network Engineers who want to develop a deeper understanding of reliability engineering and best practices for designing, building, and maintaining computer networks.
Security Engineer
A Security Engineer designs and implements security measures to protect computer systems and networks from unauthorized access and attack. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Security Engineers who want to develop a deeper understanding of reliability engineering and best practices for designing and implementing security measures.
Database Administrator
A Database Administrator manages and maintains databases. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Database Administrators who want to develop a deeper understanding of reliability engineering and best practices for managing and maintaining databases.
Cloud Engineer
A Cloud Engineer designs, builds, and maintains cloud computing solutions. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for Cloud Engineers who want to develop a deeper understanding of reliability engineering and best practices for designing, building, and maintaining cloud computing solutions.
IT Manager
An IT Manager plans, organizes, and directs the activities of an organization's IT department. This course provides a foundation in the principles and practices of Site Reliability Engineering, which is beneficial for IT Managers who want to develop a deeper understanding of reliability engineering and best practices for managing an IT department.
Project Manager
A Project Manager plans, organizes, and executes projects. This course provides a foundation in the principles and practices of Site Reliability Engineering, which may be helpful for Project Managers who want to develop a deeper understanding of reliability engineering and best practices for planning, organizing, and executing projects.
Business Analyst
A Business Analyst analyzes business needs and develops solutions to improve business processes. This course may be helpful for Business Analysts who want to develop a deeper understanding of reliability engineering and best practices for analyzing business needs and developing solutions to improve business processes.
Technical Writer
A Technical Writer writes technical documentation, such as user manuals, white papers, and training materials. This course may be helpful for Technical Writers who want to develop a deeper understanding of reliability engineering and best practices for writing technical documentation.

Reading list

We've selected 12 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Site Reliability Engineering: Measuring and Managing Reliability.
Provides a comprehensive overview of SRE best practices, including how to measure and manage reliability using SLIs and SLOs.
This novel tells the story of an IT team that uses SRE principles to improve their reliability and performance.
This practical companion to the foundational SRE book provides exercises and templates to help readers apply SRE principles to their own organizations.
Provides a comprehensive overview of reliability engineering, including how to measure and manage reliability.
Provides a deep dive into the design of data-intensive applications, covering topics such as data modeling, consistency, and fault tolerance.
Provides a comprehensive overview of the theory and practice of reliability engineering.
Provides a comprehensive overview of the principles and practices of domain-driven design.
Provides a comprehensive overview of the principles and practices of agile software development.

Similar courses

Here are nine courses similar to Site Reliability Engineering: Measuring and Managing Reliability.
Site Reliability Engineering: Measuring and Managing...
Implementing Site Reliability Engineering (SRE)...
Site Reliability Engineering (SRE) Fluency
Identifying and Resolving Application Latency for Site...
Managing Teams for Site Reliability Engineering (SRE)
Establishing a Culture of Reliability
Reliability Engineering Concepts
SRE Fundamentals and Security
Incorporating Site Reliability Engineering (SRE) in Your...

© 2016 - 2024 OpenCourser