Reliability in AWS includes the ability of a system to recover from infrastructure or service disruptions. It's essential to acquire computing resources to meet the demand, and mitigate disruptions such as configuration issues or transient network problems.
In this course, you will first explore the key concepts and core services of AWS and Site Reliability Engineering (SRE). We show you step by step how to implement a real-world application built on the reliability principles defined in the AWS Well-Architected Framework, using the SRE approach, so that you can increase the reliability of application architectures on AWS by implementing resilient infrastructure and application resilience.
You will cover common architectural patterns used every day by real-world AWS solution architects to build reliable systems and implement fault tolerance in an application architecture running on AWS. You will also learn how to further increase the reliability of application architectures on AWS by implementing multi-region solutions for disaster recovery on a global scale.
By the end of this course, you will have gained a variety of AWS architecture skills that you can then apply to the real world.
About the Author
Malcolm Orr is a Principal Architect with over 20 years' experience in the IT industry. He has worked for consultancies and end clients in the UK, US, and Asia and delivered complex software and infrastructure solutions in the cloud. Malcolm is currently an application architect for AWS.
This video will give you an overview of the course.
Reliability is a broad word; what does it mean in today's app world?
• Know that reliability is not just availability; users will only wait 3 seconds for a page to load
• Introduce the SRE role and production readiness review
100% reliability is not possible, so how do we determine and meet the objective?
• Discuss the various types of failures, including cascading failures
• Understand SLI>SLO>SLA relationship and error budgets
• Set and agree your SLI/SLO
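The SLO and error budget relationship above comes down to simple arithmetic. The following is a minimal sketch, not material from the course: the function names and the 99.9% example target are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    The SLI here is request-based: good_events / total_events.
    A negative result means the budget is overspent.
    """
    allowed_failures = total_events * (1 - slo_percent / 100)
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all.
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over 30 days.
    print(f"Allowed downtime: {error_budget_minutes(99.9):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(99.9, 999_500, 1_000_000):.0%}")
```

A team that agrees on an SLO can use a figure like this to decide when to slow releases: once the remaining budget approaches zero, reliability work takes priority over new features.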
Designing for failure is a critical process.
• Design for infrastructure failure
• Design for application failure
• Design for people failures
SRE is not a team or tool, it is an approach.
• Know the difference between DevOps and SRE and how SRE teams are organized
• Learn to get started and get developer buy-in
• Explore SRE @ AWS
The starting point for any design is availability, so how do you leverage AWS foundational capabilities to get a base level of design?
• Review regions and basic networking, Subnets/zones, load balancers and direct connect
• Explore advanced network with global accelerators and transit gateways
• Review basic AWS SLA structure
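When reviewing how availability figures combine across a design, two standard rules help: components in series multiply their availabilities, and redundant copies fail only when every copy fails. The sketch below assumes independent failures, which real systems only approximate, and the numbers are illustrative rather than actual AWS SLA values.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of N independent copies where one surviving copy is enough."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures: a load balancer, a compute tier, and a database in series.
    print(f"Serial chain: {serial_availability([0.9999, 0.999, 0.9995]):.4%}")
    # The same 99% component deployed in two zones.
    print(f"Two copies:   {redundant_availability(0.99, 2):.4%}")
```

The takeaway is that each dependency added in series lowers the ceiling on overall availability, while redundancy across zones or regions raises it.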
S3 forms the basis for a lot of content-based applications; how can you make it reliable?
• Recap of S3, it is about objects
• Learn about replication
• Explore storage classes
Databases are required to manage state; how can you make them reliable?
• Recap AWS database tech, SQL and NoSQL
• Know about RDS reliability and application integration
• Learn about DynamoDB reliability and application integration
Compute resources are required to run applications; how can you make them reliable?
• Recap of AWS compute tech, EC2 and serverless
• Explore EC2 reliability
• Learn about serverless reliability
A deeper dive into load balancing.
• Recap of the different ELBs
• Achieve reliability with ALB
• Achieve reliability with network
Look at how you run containers reliably on AWS.
• Recap of differences between Kubernetes (K8s) and ECS
• Learn about scaling and reliability on ECS
• Learn about scaling and reliability on K8s
3-tier architectures work well for traditional applications, but in the cloud-native world of containers and microservices they can easily become a bottleneck.
• Know that 3-tier is not bad and can work well in AWS
• Look at an example microservices (MS) architecture running on AWS
How do we begin to build resilience for cloud-native microservices?
• Review and understand your infrastructure reliability and make the right choices
• Consider a few things when building your application using a microservices architecture
• Learn how a service mesh helps
State management is one of the trickier areas of microservices design.
• Look at different types of state and consider platform/config state in detail
• Consider session state and how to manage it
• Consider application state/data and how to manage it
To mitigate infrastructure reliability issues, we need to use common cloud reliability patterns in the application.
• Look at measuring health and addressing retries and timeouts
• Look at circuit breakers and bulkheads
• Look at compensating transactions
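Two of the patterns listed above, retries and circuit breakers, can be sketched in a few lines. This is a minimal illustration and not the course's code: the function and class names, thresholds, and delays are all assumptions, and a production system would more likely rely on a tested library or on service mesh features.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(); on failure, wait with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected without being tried."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Retries handle brief, transient faults, while the breaker protects a struggling dependency from being hammered by callers. Combining them usually means wrapping the retried call inside the breaker so that a rejected call is not retried immediately.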
Quick review of the case study and approach for our course.
• Look at the current state
• Look at the drivers and requirements
• Look at the future state
Most development happens locally, so we begin by containerizing our application and using AWS tools to store it.
• Review pull request and code changes
• Review our application and use some AWS tools to help
• Create our repository with Terraform and copy/push code
CI is an important part of any release; here we will build a basic CI pipeline with CodeBuild.
• Create the image repo with Terraform
• Add buildspec.yaml to the code and create the build project
• Run the build
Kubernetes is too much of a learning curve initially for widgets.com so a simpler container platform is used.
• Run TF to deploy RDS
• Use TF to deploy ECS
• Quick review of the AWS UI
We now have all the component parts; next we need to create a task.
• Configure and deploy our ECS task
• Test our API with Postman and review CloudWatch metrics
• Fail some components
We have added quite a bit of reliability, but some issues remain.
• Know that the code is not scalable or modular
Py-simple has some issues, and FMA helps us categorize and prioritize those issues.
• Describe Py-simple
• Explain FMA and perform a simple FMA exercise
• Describe the next phases
Multi-region support is the simplest next step; describe the steps required for this.
• Describe clusters and ECR mirroring
• Describe networking
• Describe RDS (read replicas)
Describing how to split the application.
• Learn basic MS design
• Split Py-simple and enhance tests
• Show code
Describing how to use Cognito as an authentication backend.
• Use Cognito for auth and authz
• Use Cognito and JWT
• Show config
Describe how to build a pipeline.
• Know what CodePipeline is
• Integrate with CodeBuild for build and test
• Learn about CodeDeploy and ECS
Describe using X-Ray and App Mesh.
• Review general MS ecosystem flows
• Describe App Mesh deployment
• Describe X-Ray integration
Using analytics and change data capture.
• Describe what data we are capturing
• Talk about DB support
• Describe the approach
Discuss using Aurora for a multi-region DB.
• Explore what Aurora is
• Get a view of Aurora Postgres
• Know about cross-region replication
ECS is an AWS technology and won't support on-premises development or multi-cloud.
• Deploy an EKS cluster
• Deploy our pycar application using Skaffold
• Use Lens to view our cluster
Aurora can provide greater resilience in regions.
• Deploy an in-region Postgres cluster using Terraform
• Configure table using config script
• Point our application at the new instance
App Mesh can help build more resilient services.
• Deploy our services to EKS
• Create our mesh configuration
• Deploy our proxy and show mesh working
Describes what we have built over the last few sections.
• Know about reliable infrastructure
• Learn about reliable databases
• Learn about reliable compute and developer experience
DNS and CDN are two very important aspects of any reliable web service.
• Review how DNS works
• Look at Route 53 in detail
• Know what CloudFront does and how it works with Route 53
What impact have these changes had on the users/developers?
• Know that the impact on users is good, but ask whether the SLA/SLO is too high
• Explore how users will expect more from Kevin and Ian
How do you address some of these challenges?
• Know that the operating model is key; otherwise you end up with a platform people can't support
• Learn that SRE can help with providing SME knowledge and focusing on specific things
• Look at release engineering and postmortems specifically
Get a review of the entire course.
• Review the topics covered in the course