Packt Publishing

Reliability in AWS includes the ability of a system to recover from infrastructure or service disruptions. It's essential to acquire computing resources to meet the demand, and mitigate disruptions such as configuration issues or transient network problems.

In this course, you will first explore the key concepts and core services of AWS and Site Reliability Engineering (SRE). We show you step by step how to implement a real-world application built on the reliability principles defined in the AWS Well-Architected Framework, using the SRE approach, so that you can increase the reliability of application architectures on AWS by implementing resilient infrastructure and application resilience.

You will cover common architectural patterns used every day by real-world AWS solution architects to build reliable systems and implement fault tolerance in an application architecture running on AWS. You will also learn how to further increase the reliability of application architectures on AWS by implementing multi-region solutions for disaster recovery on a global scale.

By the end of this course, you will have gained a variety of AWS architecture skills that you can then apply to the real world.

About the Author

Malcolm Orr is a Principal Architect with over 20 years' experience in the IT industry. He has worked for consultancies and end clients in the UK, US, and Asia and delivered complex software and infrastructure solutions in the cloud. Malcolm is currently an application architect for AWS.

What's inside

Learning objectives

  • Understand the core principles of site reliability engineering, and how cloud computing enables this
  • Design applications for fault tolerance, auto-healing, resilience, and reliability
  • Examine a simple Python microservice ecosystem and understand its limitations
  • Identify critical stack components, and redesign them so they're resilient and reliable
  • Map design changes to native AWS services with ease
  • Deploy redesigned applications in a globally accessible, resilient, and reliable way

Syllabus

The Basics of Site Reliability Engineering

This video will give you an overview of the course.

Reliability is a broad word; what does it mean in today's app world?

   •  Know that reliability is not just availability; users will only wait 3 seconds for a page to load

   •  Introduce the SRE role and production readiness review

100% reliability is not possible, so how do we determine and meet the objective?

   •  Discuss the various types of failures and cascading failures

   •  Understand SLI>SLO>SLA relationship and error budgets

   •  Set and agree on your SLIs/SLOs
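
The SLI, SLO, and error-budget relationship in the list above can be made concrete with a little arithmetic. The following Python sketch is illustrative only and is not taken from the course: the names `evaluate_slo`, `SloReport`, and `allowed_downtime_minutes`, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good requests in the window
    budget_total: float      # failed requests the SLO tolerates in the window
    budget_remaining: float  # failures still allowed (negative means SLO breached)


def evaluate_slo(good: int, total: int, slo_target: float) -> SloReport:
    """Compare a request-based SLI against an SLO target (hypothetical helper)."""
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need total > 0 and 0 <= good <= total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    bad = total - good
    budget_total = (1.0 - slo_target) * total
    return SloReport(
        sli=good / total,
        budget_total=budget_total,
        budget_remaining=budget_total - bad,
    )


def allowed_downtime_minutes(slo_target: float, days: int = 30) -> float:
    """Time-based view of the same budget: minutes of downtime the SLO permits."""
    return (1.0 - slo_target) * days * 24 * 60


if __name__ == "__main__":
    # Assumed example: 1,000,000 requests, 600 failed, SLO target of 99.9%.
    report = evaluate_slo(good=999_400, total=1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Error budget: {report.budget_total:.0f} requests")
    print(f"Remaining: {report.budget_remaining:.0f} requests")
    print(f"Allowed downtime over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

Under these assumed numbers, the service has used 600 of roughly 1,000 permitted failures, so about 40% of the budget remains for the rest of the window. An SLA would typically be set looser than the SLO, so that the internal objective is breached before the contractual one.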

Designing for failure is a critical process.

   •  Design for infrastructure failure

   •  Design for application failure

   •  Design for people failures

SRE is not a team or tool, it is an approach.

   •  Know the difference between DevOps and SRE, and know how SRE teams are organized

   •  Learn to get started and get developer buy-in

   •  Explore SRE @ AWS

Test your knowledge
Gaining Resilience and Reliability On AWS

The starting point for any design is availability, so how do you leverage AWS foundational capabilities to get a base level of design?

   •  Review regions and basic networking: subnets/zones, load balancers, and Direct Connect

   •  Explore advanced networking with global accelerators and transit gateways

   •  Review basic AWS SLA structure
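
When reviewing availability figures for a design built from several services, it can help to see how per-component availability combines. The sketch below is a simplified model and not part of the course: the function names `serial` and `parallel` are hypothetical, failures are assumed to be independent, and the percentages are made-up illustrations, not figures from any AWS SLA.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where any one being up is enough."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not AWS SLA values).
    one_zone = 0.99
    two_zones = parallel(one_zone, one_zone)       # redundancy across two zones
    stack = serial(0.9999, two_zones, 0.9995)      # load balancer -> compute -> database
    print(f"single zone:  {one_zone:.4%}")
    print(f"two zones:    {two_zones:.4%}")
    print(f"whole stack:  {stack:.4%}")
```

The model shows two general effects: chaining dependencies lowers overall availability below that of the weakest component, while redundancy across zones raises it. Real failures are often correlated, so results from a model like this are an upper bound rather than a prediction.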

S3 forms the basis for a lot of content-based applications; how can you make it reliable?

   •  Recap S3: it is about objects

   •  Learn about replication

   •  Explore storage classes

Databases are required to manage state; how can you make them reliable?

   •  Recap AWS database tech, SQL and NoSQL

   •  Know about RDS reliability and application integration

   •  Learn about DynamoDB reliability and application integration

Compute resources are required to run applications; how can you make them reliable?

   •  Recap of AWS compute tech, EC2 and serverless

   •  Explore EC2 reliability

   •  Learn about serverless reliability

A deeper dive into load balancing.

   •  Recap of the different ELBs

   •  Achieve reliability with ALB

   •  Achieve reliability with Network Load Balancer (NLB)

Look at how you run containers reliably on AWS.

   •  Recap of differences between Kubernetes (K8s) and ECS

   •  Learn about scaling and reliability on ECS

   •  Learn about scaling and reliability on K8s

Accepting Failure In Multi-Tier Applications

3-tier architectures work well for traditional applications, but in the cloud-native world of containers and microservices they can easily become a bottleneck.

   •  Know that 3-tier is not bad and can work well in AWS

   •  Look at an example microservices (MS) architecture running on AWS

How do we begin to build resilience for cloud-native microservices?

   •  Review and understand your infrastructure reliability and make the right choices

   •  Consider a few things when building your application using an MS architecture

   •  Learn how a service mesh helps

State management is one of the trickier areas of MS design.

   •  Look at different types of state and consider platform/config state in detail

   •  Consider session state and how to manage it

   •  Consider application state/data and how to manage it

To compensate for unreliable infrastructure, we need to use common cloud reliability patterns in the application.

   •  Look at measuring health and addressing retries and timeouts

   •  Look at circuit breakers and bulkheads

   •  Look at compensating transactions
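
Two of the patterns listed above, retries and circuit breakers, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: `retry`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names, the thresholds and delays are arbitrary, and the breaker is not thread-safe. Timeouts are assumed to be enforced by whatever client performs the wrapped call.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    """Minimal single-threaded circuit breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func, retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has isolated
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller might combine the two as `retry(lambda: breaker.call(fetch_order, order_id))`, where `fetch_order` stands in for any remote call. Retries absorb transient faults, while the breaker stops retries from piling load onto a dependency that is already failing, which is one way cascading failures start.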

Quick review of the case study and approach for our course.

   •  Look at the current state

   •  Look at the drivers and requirements

   •  Look at the future state

Surviving Failure of a Global Scale
Deploying Py-Simple On AWS

Most development happens locally, so we begin by containerizing our application and using AWS tools to store it.

   •  Review pull request and code changes

   •  Review our application and use some AWS tools to help

   •  Create our repository with Terraform and copy/push code

CI is an important part of any release; here we will build a basic CI pipeline with CodeBuild.

   •  Create the image repo with Terraform

   •  Add buildspec.yaml to the code and create the build project

   •  Run the build

Kubernetes is too much of a learning curve initially for widgets.com, so a simpler container platform is used.

   •  Run TF to deploy RDS

   •  Use TF to deploy ECS

   •  Quick review of the AWS UI

We now have all the component parts; next we need to create a task.

   •  Configure and deploy our ECS task

   •  Test our API with Postman and review CloudWatch metrics

   •  Fail some components

We have added quite a bit of reliability, but limitations remain.

   •  Know that the code is not scalable or modular

Designing Py-Global

Py-simple has some issues, and FMA helps us categorize and prioritize those issues.

   •  Describe Py-simple

   •  Explain FMA and perform a simple FMA exercise

   •  Describe the next phases

Multi-region support is the simplest next step; describe the steps required for this.

   •  Describe clusters and ECR mirroring

   •  Describe networking

   •  Describe RDS (read replicas)

Describing how to split the application.

   •  Learn basic MS design

   •  Split Py-simple and enhance tests

   •  Show code

Describing how to use Cognito as an authentication backend.

   •  Use Cognito for auth and authz

   •  Use Cognito and JWT

   •  Show config

Describe how to build a pipeline.

   •  Know what CodePipeline is

   •  Integrate with CodeBuild for build and test

   •  Learn about CodeDeploy and ECS

Describe using X-Ray and App Mesh.

   •  Review general MS ecosystem flows

   •  Describe App Mesh deployment

   •  Describe X-Ray integration

Using analytics and change data capture.

   •  Describe what data we are capturing

   •  Talk about DB support

   •  Describe the approach

Discuss using Aurora for a multi-region DB.

   •  Explore what Aurora is

   •  Get a view of Aurora Postgres

   •  Know about cross-region replication

Deploying a Resilient, Fault Tolerant Py-Global Application

ECS is an AWS technology and won't support on-premises development or multi-cloud.

   •  Deploy an EKS cluster

   •  Deploy our pycar application using Skaffold

   •  Use Lens to view our cluster

Aurora can provide greater resilience in regions.

   •  Deploy an in-region Postgres cluster using Terraform

   •  Configure table using config script

   •  Point our application at the new instance

App Mesh can help build more resilient services.

   •  Deploy our services to EKS

   •  Create our mesh configuration

   •  Deploy our proxy and show mesh working

Describe what we have built over the last few sections.

   •  Know about reliable infrastructure

   •  Learn about reliable databases

   •  Learn about reliable compute and developer experience

DNS and CDN are two very important aspects of any reliable web service.

   •  Review how DNS works

   •  Look at Route 53 in detail

   •  Know what CloudFront does and how it works with Route 53

What impact have these changes had on the users/developers?

   •  Know that the impact on users is good, but is the SLA/SLO too high?

   •  Explore how users will expect more from Kevin and Ian

How do you address some of these challenges?

   •  Know that the operating model is key; otherwise you end up with a platform people cannot support

   •  Learn that SRE can help with providing SME knowledge and focusing on specific things

   •  Look at release engineering and postmortems specifically

Get a review of the entire course.

   •  Review the topics covered in the course

Good to know

Know what's good, what to watch for, and possible dealbreakers
Covers common architectural patterns used by AWS solution architects, providing practical knowledge for building reliable systems and implementing fault tolerance in AWS environments
Explores the core principles of Site Reliability Engineering (SRE) and how cloud computing enables it, offering a strong foundation for those in SRE roles
Examines a Python microservice ecosystem, identifying its limitations and redesigning it for resilience and reliability, which is useful for developers working with microservices
Requires familiarity with AWS services, which may necessitate additional learning for those new to the AWS ecosystem before fully benefiting from the course
Focuses on AWS-specific tools and services, which may limit its applicability for those working primarily with other cloud platforms or on-premises infrastructure
Involves deploying applications using Terraform, ECS, and Kubernetes, requiring familiarity with infrastructure-as-code and containerization technologies, which may pose a challenge for some learners

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Site Reliability Engineering on AWS with these activities:
Review AWS Networking Fundamentals
Solidify your understanding of AWS networking concepts before diving into SRE principles on AWS. This will help you better grasp the infrastructure-level resilience strategies discussed in the course.
  • Review the AWS VPC documentation.
  • Practice creating VPCs and subnets in the AWS console.
  • Experiment with different routing configurations.
Read 'The Phoenix Project'
Understand the cultural and organizational aspects of SRE by reading this popular novel. This will help you advocate for SRE principles within your organization.
  • Read the book cover to cover.
  • Identify the key challenges faced by the characters.
  • Relate the challenges to your own experiences.
  • Consider how SRE principles could have helped the characters overcome the challenges.
Read 'Site Reliability Engineering' by Google
Gain a deeper understanding of SRE principles by studying Google's approach. This will provide a strong foundation for applying SRE concepts within the AWS ecosystem.
  • Read the book cover to cover.
  • Take notes on key concepts and practices.
  • Relate the concepts to AWS services and architectures.
Implement a Basic Monitoring Dashboard
Practice implementing monitoring and alerting for a simple application deployed on AWS. This will reinforce your understanding of SLIs, SLOs, and error budgets.
  • Deploy a simple application on EC2 or Lambda.
  • Use CloudWatch to collect metrics.
  • Create a CloudWatch dashboard to visualize the metrics.
  • Set up CloudWatch alarms to trigger on SLO violations.
Write a Blog Post on AWS Resilience Strategies
Solidify your understanding of AWS resilience strategies by writing a blog post explaining different techniques and best practices. This will help you articulate your knowledge and share it with others.
  • Research different AWS resilience strategies.
  • Choose a specific topic, such as multi-region deployments or fault-tolerant architectures.
  • Write a clear and concise blog post explaining the concepts.
  • Include examples and diagrams to illustrate your points.
Simulate Failure Scenarios in AWS
Gain hands-on experience with failure scenarios by simulating them in your AWS environment. This will help you identify potential weaknesses and improve your system's resilience.
  • Identify critical components of your application.
  • Simulate failures, such as instance outages or network disruptions.
  • Observe how your system responds to the failures.
  • Identify areas for improvement and implement corrective actions.
Contribute to an Open Source SRE Tool
Deepen your understanding of SRE by contributing to an open-source project related to monitoring, alerting, or automation. This will give you hands-on experience with real-world SRE challenges.
  • Identify an open-source SRE tool that interests you.
  • Explore the project's codebase and documentation.
  • Identify a bug or feature that you can contribute to.
  • Submit a pull request with your changes.

Career center

Learners who complete Site Reliability Engineering on AWS will develop knowledge and skills that may be useful to these careers:
Site Reliability Engineer
A Site Reliability Engineer focuses on ensuring that systems are reliable, scalable, and efficient. This course directly aligns with the core responsibilities of a Site Reliability Engineer, as it teaches the key concepts and core services of Site Reliability Engineering and how to apply these to AWS. This course covers fault tolerance, auto-healing, resilience, and reliability, all critical aspects of the role. In particular, the focus on architectural patterns and multi-region solutions for disaster recovery will equip you with the skills a Site Reliability Engineer needs to design and implement robust systems.
Solutions Architect
A Solutions Architect translates business requirements into technical solutions, often involving cloud technologies. This course helps a Solutions Architect because it provides practical experience on how to build reliable systems on AWS, something fundamental to the role. The course not only covers core AWS services but also delves into architectural patterns, fault tolerance, and disaster recovery strategies. As a solutions architect, you will appreciate the hands-on approach to deploying applications and understanding the nuances of multi-region deployments covered by this course. It will help a solutions architect with real-world architecture skills.
Cloud Architect
A Cloud Architect designs and oversees the implementation of cloud computing strategies. This course is a great fit because it delves into building reliable systems on AWS. Covering architectural patterns, fault tolerance, and the use of AWS services for resilience, this course helps build a foundation for the cloud architect. The course specifically explores multi-region solutions, which matter when designing global-scale applications, a key responsibility of a Cloud Architect. A Cloud Architect should take this course to enhance knowledge of designing highly resilient and reliable cloud infrastructures.
DevOps Engineer
DevOps Engineers are responsible for automating and streamlining software development and deployment processes, with a focus on infrastructure and operations. The course content, particularly the sections on deploying applications on AWS and implementing CI/CD pipelines using tools like CodeBuild, is directly relevant to the DevOps Engineer. This course is helpful for someone in this role due to its coverage of infrastructure as code using Terraform, containerization with ECS and EKS, and load balancing, all of which build a foundation for a DevOps Engineer. The course's emphasis on reliability and fault tolerance also speaks to DevOps principles of building stable systems.
Systems Engineer
Systems Engineers focus on the design, implementation, and management of complex systems. This course is useful because it teaches how to design resilient and reliable systems using AWS. The course covers various topics such as setting up infrastructure, configuring databases, and deploying applications, all of which are areas that a Systems Engineer engages with. This course helps a systems engineer gain skills in designing reliable systems on AWS, which are crucial for their success. The course's focus on fault tolerance and disaster recovery specifically enhances a systems engineer's skill set.
Infrastructure Engineer
An Infrastructure Engineer is responsible for building and maintaining the core IT infrastructure, often using cloud-based solutions. The course is a strong fit because it focuses on building infrastructure on AWS, specifically for reliability and fault tolerance. This course covers technologies such as AWS networking, compute, storage, and databases which are crucial for an infrastructure engineer. By learning how to deploy, scale, and make these components reliable, the Infrastructure Engineer will learn to improve and advance their skills.
Cloud Consultant
A Cloud Consultant advises organizations on adopting and optimizing their cloud strategies. This course provides the consultant with valuable knowledge of designing reliable and resilient systems on AWS. The consultant needs the skills this course develops to make recommendations to clients. Specifically, the course's focus on fault tolerance, multi-region solutions, and architectural patterns helps the consultant with the practical knowledge required to guide clients. The deep-dive into AWS services and best practices will prove helpful in the long run.
Software Engineer
Software Engineers design, develop, and maintain software applications. This course can be useful for a Software Engineer, particularly one working in cloud-native environments because it introduces concepts of fault tolerance and application resilience. As a software engineer, understanding how infrastructure and application architecture impact reliability will improve how you write code and design software. The hands-on experience of deploying applications on AWS and configuring infrastructure can also be helpful. The course's focus on microservices is particularly relevant to modern software engineering.
Systems Administrator
A Systems Administrator manages an organization's computer systems and networks. Although this course is not an exact fit, it may be useful to a Systems Administrator because it enhances their understanding of AWS. The course will show the Systems Administrator how to deploy applications, configure services, and understand AWS infrastructure. The course explores how to apply SRE principles in real-world scenarios. For a systems administrator looking to broaden their knowledge of cloud technologies, this course may be useful.
Network Engineer
A Network Engineer designs, implements, and manages network infrastructure, and this course is helpful because it delves into aspects of AWS networking. Though not exclusively about network engineering, the course offers insights into how network configurations affect the reliability of applications, how to implement load balancers, and how to design multi-region networks. This would help a Network Engineer understand how to ensure network uptime and reliability within a cloud architecture. The course offers some insight into network topics such as direct connect, global accelerators, and transit gateways.
Cloud Support Engineer
A Cloud Support Engineer provides technical support to customers using cloud services. This course may be useful for someone in the role because it introduces different aspects of SRE and cloud architecture. The course may also help a Cloud Support Engineer learn about the AWS services they are supporting in production. By exploring topics such as fault tolerance, resilience, and disaster recovery, as well as specific AWS tools, they may improve their skills in problem solving. This course may help prepare the Cloud Support Engineer to troubleshoot issues related to cloud reliability.
Technical Project Manager
A Technical Project Manager oversees software and infrastructure projects. This course may be useful to someone in this role because it provides a high-level overview of how to construct a resilient infrastructure on AWS. Understanding the basics of SRE, fault tolerance, and disaster recovery helps the project manager in the understanding and planning of technical projects. The course's focus on deploying applications and managing AWS resources can assist in managing the technical aspects of projects within a cloud environment. It is useful for the project manager looking to better understand cloud technologies.
Database Administrator
A Database Administrator manages and maintains databases, ensuring their performance and reliability. This course may be helpful as it touches upon database reliability in AWS, specifically concerning RDS and DynamoDB. The course covers topics such as replication and application integration. This may be useful for the Database Administrator because understanding how databases are managed in the cloud is important. The course's discussions on multi-regional database solutions and state management may be helpful for database administration in a cloud environment. However, note that this is not a course primarily about databases.
Release Engineer
A Release Engineer focuses on the process of releasing and deploying software. This course may be useful for this role because it includes content that covers automating deployments, CI/CD pipelines, and containerization. Learning about deploying applications on AWS, especially using tools like CodeBuild and Terraform, would be helpful. The course also touches on topics of multi-region deployments, which is pertinent to the work of a Release Engineer. However, note that this is not a course primarily about release engineering.
Technical Writer
A Technical Writer creates documentation for technical products and services. This course may help a technical writer because it introduces cloud architecture, system resilience, and fault tolerance. While not a direct fit, understanding these topics may help the technical writer produce better documentation on cloud system best practices. The course's focus on AWS architecture and the various services it offers may help inform the technical writer on the subject matter. In particular, the technical writer might find it helpful to learn about AWS services like ECS and EKS.

Reading list

We've selected two books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Site Reliability Engineering on AWS.
'Site Reliability Engineering' by Google is considered the bible of SRE. It provides a comprehensive overview of SRE principles and practices as implemented at Google. Reading this book will give you a deeper understanding of the concepts discussed in the course and provide real-world examples of how to apply them. It is valuable as additional reading to provide breadth to the existing course.
'The Phoenix Project' is a novel that illustrates the importance of DevOps principles and practices in improving IT performance and business outcomes. While not directly focused on SRE, it provides valuable context for understanding the cultural and organizational aspects of reliability. It is valuable as additional reading to provide breadth to the existing course.
