[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71203":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":45,"readmeContent":46,"aiSummary":47,"trendingCount":16,"starSnapshotCount":16,"syncStatus":18,"lastSyncTime":48,"discoverSource":49},71203,"howtheysre","upgundecha\u002Fhowtheysre","upgundecha","A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)","",null,"JavaScript",9726,886,232,5,0,1,2,39.84,"Creative Commons Zero v1.0 Universal",false,"main",true,[25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44],"alerting","chaos-engineering","dev-ops","devops","hacktoberfest","hacktoberfest-accepted","incident-management","incident-response","infrastructure","ml-ops","monitoring","observability","on-call","post-mortem","reliability","security","site-reliability-engineering","software-engineering","sre","sre-culture","2026-06-12 02:02:49","# How they SRE\n\n![PRs Welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg?style=flat-square) [![CI](https:\u002F\u002Fgithub.com\u002Fupgundecha\u002Fhowtheysre\u002Factions\u002Fworkflows\u002Fworkflow.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fupgundecha\u002Fhowtheysre\u002Factions\u002Fworkflows\u002Fworkflow.yml) 
[![CodeQL](https:\u002F\u002Fgithub.com\u002Fupgundecha\u002Fhowtheysre\u002Factions\u002Fworkflows\u002Fcodeql.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fupgundecha\u002Fhowtheysre\u002Factions\u002Fworkflows\u002Fcodeql.yml)\n\n![How they SRE](headline.png)\n\n\u003C\u002Fbr>\n\n## Introduction\n\n__How They SRE__ is a curated knowledge repository of Site Reliability Engineering (SRE) best practices, tools, techniques, and culture adopted by leading technology or tech-savvy organizations.\n\nNumerous organizations frequently share their insights and expertise, encompassing best practices, tools, and techniques that shape their engineering culture. They do this through various public platforms such as engineering blogs, conferences, and meetups. This repository compiles and presents content gathered from these sources.\n\n### Topics\n\n* Site Reliability Engineering\n* Hiring and Building SRE teams\n* SRE Culture\n* DevOps\n* Monitoring & Observability\n* Alerting\n* Incident Response & Post-Mortem\n* On-Call\n* Testing in Production\n* Chaos Engineering\n* Automation\n* Performance\n* Platform Engineering\n\n## Organizations\n\n\u003Cdetails>\n  \u003Csummary>Achievers\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Enter the Abattoir - Building 'à la carte' gitops tooling](https:\u002F\u002Fachievers.engineering\u002Fenter-the-abattoir-ee5e2019f0b3)\n* [Scaling Production Globally — The service mesh facelift (Part-1)](https:\u002F\u002Fachievers.engineering\u002Fscaling-production-globally-service-mesh-face-lift-part-1-30ad6d393d04)\n* [Scaling Production Globally - Solving observability problems for developers (Part-2)](https:\u002F\u002Fachievers.engineering\u002Fscaling-production-globally-solving-observability-problems-for-developers-part-2-b5416ce5eb8a)\n* [Load Testing Kubernetes: Building a Framework (Part-1)](https:\u002F\u002Fachievers.engineering\u002Fload-testing-kubernetes-building-a-framework-part-1-bdc0af4ae7e2)\n* [Load 
Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)](https:\u002F\u002Fachievers.engineering\u002Fload-testing-kubernetes-resolving-bottlenecks-and-improving-performance-part-2-c4f08102f105)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Airbnb\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Automated Incident Management Through Slack](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fincident-management-ae863dc5d47f)\n* [Detecting Vulnerabilities With Vulnture](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fdetecting-vulnerabilities-with-vulnture-f5f23387f6ec)\n* [Alerting Framework at Airbnb](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Falerting-framework-at-airbnb-35ba48df894f)\n* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fwhen-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)\n* [Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fintelligent-automation-platform-empowering-conversational-ai-and-beyond-at-airbnb-869c44833ff2)\n* [Production Secret Management at Airbnb](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fproduction-secret-management-at-airbnb-ad230e1bc0f6)\n* [Automating Data Protection at Scale, Part 1](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fautomating-data-protection-at-scale-part-1-c74909328e08)\n* [Automating Data Protection at Scale, Part 2](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fautomating-data-protection-at-scale-part-2-c2b8d2068216)\n* [Automating Data Protection at Scale, Part 3](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fautomating-data-protection-at-scale-part-3-34e592c45d46)\n* [Dynamic Kubernetes Cluster Scaling at 
Airbnb](https:\u002F\u002Fmedium.com\u002Fairbnb-engineering\u002Fdynamic-kubernetes-cluster-scaling-at-airbnb-d79ae3afa132)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Algolia\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [May 30 SSL incident](https:\u002F\u002Fwww.algolia.com\u002Fblog\u002Fmay-30-ssl-incident\u002F)\n* [A Journey Into SRE](https:\u002F\u002Fwww.algolia.com\u002Fblog\u002Fa-journey-into-sre\u002F)\n* [CI\u002FCDay 2024: What makes a good CI\u002FCD platform?](https:\u002F\u002Fwww.algolia.com\u002Fblog\u002Fengineering\u002Fcicday-2024-what-makes-a-good-ci-cd-platform\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Alibaba Cloud\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Why Are the Top Internet Companies Choosing SRE over Traditional O&M?](https:\u002F\u002Fwww.alibabacloud.com\u002Fblog\u002Fwhy-are-the-top-internet-companies-choosing-sre-over-traditional-o%26m_596099)\n* [Architecture and Practices of Bilibili's Real-time Platform](https:\u002F\u002Fwww.alibabacloud.com\u002Fblog\u002Farchitecture-and-practices-of-bilibilis-real-time-platform_596676)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Asana\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How Asana uses Asana: Security incident response](https:\u002F\u002Fblog.asana.com\u002F2021\u002F09\u002Fengineering-security-incident-response\u002F#close)\n* [How Asana ships stable web application releases](https:\u002F\u002Fblog.asana.com\u002F2021\u002F01\u002Fasana-engineering-ships-web-application-releases\u002F)\n* [Analysis of recent downtime & what we’re doing to prevent future incidents](https:\u002F\u002Fblog.asana.com\u002F2019\u002F09\u002Fdowntime-what-were-doing-to-prevent-future-downtime\u002F)\n* [Developer environment: Achieving reliability by making it fast to reset](https:\u002F\u002Fblog.asana.com\u002F2017\u002F07\u002Fdeveloper-environment-making-it-reliable-by-making-it-fast-to-reset\u002F)\n* [Three security tactics for every IT 
leader to consider this fall](https:\u002F\u002Fblog.asana.com\u002F2022\u002F08\u002Fit-security-hybrid-workers\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>ASOS\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Playing the blame-less game](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fplaying-the-blame-less-game-3708f8195344)\n* [A day in the life of… Cat S (Head of Reliability Engineering)](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fa-day-in-the-life-of-cat-smith-head-of-reliability-engineering-629e10a26590)\n* [An AKS Performance Journey: Part 1 — Sizing Everything Up](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fan-aks-performance-journey-part-1-sizing-everything-up-ee6d2346ea99)\n* [An AKS Performance Journey: Part 2 — Networking It Out](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fan-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69)\n* [Cyber Security @ ASOS.com](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fcyber-security-asos-com-7d1d1f346e57)\n* [Security Operations 24x7](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fsecurity-operations-24-x-7-2e90c8e5e7e)\n* [The skills we look for in Cyber Security Incident Response](https:\u002F\u002Fmedium.com\u002Fasos-techblog\u002Fthe-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Atlassian\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Best practices for change management in the age of DevOps](https:\u002F\u002Fwww.atlassian.com\u002Fengineering\u002Fbest-practices-for-change-management-in-the-age-of-devops)\n* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https:\u002F\u002Fwww.atlassian.com\u002Fengineering\u002Fautomated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)\n* [How to export Kubernetes events for observability and 
alerting](https:\u002F\u002Fwww.atlassian.com\u002Fengineering\u002Fhow-to-export-kubernetes-events-for-observability-and-alerting)\n* [Incident Postmortem Template](https:\u002F\u002Fwww.atlassian.com\u002Fincident-management\u002Fpostmortem\u002Ftemplates)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>BackMarket\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How Back Market SREs prepared for Black Friday](https:\u002F\u002Fmedium.com\u002Fback-market-engineering\u002Fhow-back-market-sres-prepared-for-black-friday-5f017f343408)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Baidu\u003C\u002Fsummary>\n\n### Videos\n\n* [Anomaly Detection on Golden Signals](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fchen-yu)\n* [NetRadar: Monitoring the Datacenter Network](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fchen-yun)\n* [Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x3c0PPkSf14)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Basecamp\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Inside a CODE RED: Network Edition](https:\u002F\u002Fm.signalvnoise.com\u002Finside-a-code-red-network-edition\u002F)\n* [Three Basecamp outages. One week. 
What happened?](https:\u002F\u002Fm.signalvnoise.com\u002Fthree-basecamp-outages-one-week-what-happened\u002F)\n* [Basecamp 2 and Basecamp 3 search outage report](https:\u002F\u002Fm.signalvnoise.com\u002Fbasecamp-2-and-basecamp-3-search-outage-report\u002F)\n* [Reducing Incident Escalations at Basecamp](https:\u002F\u002Fm.signalvnoise.com\u002Freducing-incident-escalations-at-basecamp\u002F)\n\n### Books\n\n* [Shape Up](https:\u002F\u002Fbasecamp.com\u002Fshapeup\u002Fwebbook)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Bloomberg\u003C\u002Fsummary>\n\n### Videos\n\n* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fchen)\n* [Why SREs can't afford to NOT do Chaos Engineering](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fpawlikowski)\n* [Tracing Real-Time Distributed Systems](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fyakimov)\n* [The Bloomberg Story: Building SRE Teams in an \"Immeasurable\" Organisation](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fsorensen)\n* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fchen)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Booking.com\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How Reliability and Product Teams Collaborate at Booking.com](https:\u002F\u002Fmedium.com\u002Fbooking-com-infrastructure\u002Fhow-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)\n* [Incidents, fixes, and the day after](https:\u002F\u002Fmedium.com\u002Fbooking-com-infrastructure\u002Fincidents-fixes-and-the-day-after-c5d9aeae28c3)\n* [Troubleshooting: A journey into the 
unknown](https:\u002F\u002Fmedium.com\u002Fbooking-com-infrastructure\u002Ftroubleshooting-a-journey-into-the-unknown-e31b524fa86)\n\n### Videos\n\n* [Sailing the Database Seas: Applying SRE Principles at Scale](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon24emea\u002Fpresentation\u002Fandroulidakis)\n* [SLOs for Data-Intensive Services](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Ffouquet)\n* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fiacoboaia)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Capital One\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Automate Application Monitoring with Slack](https:\u002F\u002Fwww.capitalone.com\u002Ftech\u002Fsoftware-engineering\u002Fhow-to-automate-application-monitoring-slack-bots\u002F)\n* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fautomate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)\n* [Active-Active Shared-Nothing Database Architecture](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Factive-active-shared-nothing-database-architecture-304957ffb89)\n* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fthe-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)\n* [5 Steps to Getting Your App Chaos Ready](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002F5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)\n* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002F4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)\n* [Embrace the Chaos … Engineering](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fembrace-the-chaos-engineering-203fd6fc6ff7)\n* [3 
Lessons Learned From Implementing Chaos Engineering at Enterprise](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002F3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)\n* [A Deep Dive Into Seamless Blue\u002FGreen Deployment Using AWS CodeDeploy](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fseamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)\n* [Secure Docker Containers Require Secure Applications](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fsecure-docker-containers-require-secure-applications-75eb358abef9)\n* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002F4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)\n* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fcontainer-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)\n* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fdeploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)\n* [Architecting for Resiliency](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Farchitecting-for-resiliency-9ec663db5c94)\n* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fcontinuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)\n* [The Mon-ifesto Part 1: Metrics](https:\u002F\u002Fmedium.com\u002Fcapital-one-tech\u002Fthe-mon-ifesto-part-1-metrics-808f6c944765)\n\n### Major incidents & analysis reports\n\n* [Information on the Capital One Cyber Incident](https:\u002F\u002Fwww.capitalone.com\u002Ffacts2019\u002F)\n* [A Case Study of the Capital One Data 
Breach](http:\u002F\u002Fweb.mit.edu\u002Fsmadnick\u002Fwww\u002Fwp\u002F2020-16.pdf)\n  \n### Videos\n\n* [Banking on Continuous Delivery - Capital One](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_DnYSQEUTfo)\n* [Continuous Chaos in DevOps - Capital One](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=U_Uh5RMCwPI)\n* [DevOps at Capital One: Focusing on Pipeline and Measurement](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6Q0mtVnnthQ)\n* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fwalls)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Coinbase\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Open Sourcing Coinbase’s Secure Deployment Pipeline](https:\u002F\u002Fblog.coinbase.com\u002Fopen-sourcing-coinbases-secure-deployment-pipeline-ae6c78e25517)\n  \n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>DAZN\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Site Reliability at DAZN](https:\u002F\u002Fmedium.com\u002Fdazn-tech\u002Fsite-reliability-at-dazn-a3ba4af0638d)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>DBS\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fpresenting-at-ithomes-sre-conference-our-dbs-sre-transformation-journey-thus-far-9b6778ce53e8)\n* [Debunking the seven most popular Site Reliability Engineering myths](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fdebunking-the-seven-most-popular-site-reliability-engineering-myths-a3be8d870ff2)\n* [How To Use SRE To Cultivate A Blameless Culture In The Workplace](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fhow-to-use-sre-to-cultivate-a-blameless-culture-in-the-workplace-1981fd1c7871)\n* [Site Reliability Engineering at DBS 
Bank](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fsite-reliability-engineering-at-dbs-bank-32c02228ccf4)\n* [Automating Configuration Management at Scale](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fautomating-configuration-management-at-scale-5c7927f83df3)\n* [How DBS dispelled the myths of Chaos Engineering](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fhow-dbs-dispelled-the-myths-of-chaos-engineering-e5873ac78c9)\n* [Double, Double Toil and Trouble](https:\u002F\u002Fmedium.com\u002Fdbs-tech-blog\u002Fdouble-double-toil-and-trouble-applying-sre-practices-to-alleviate-toil-for-devops-teams-259b958a10dd)\n\n### Videos\n\n* [SREcon Conversations Asia\u002FPacific with Koon Seng Lim, DBS](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=URwkaRbOLxI&feature=emb_title)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>DeepSource\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Redis diskless replication: What, how, why and the caveats](https:\u002F\u002Fdeepsource.io\u002Fblog\u002Fredis-diskless-replication\u002F)\n* [How to setup Vault with Kubernetes](https:\u002F\u002Fdeepsource.io\u002Fblog\u002Fsetup-vault-kubernetes\u002F)\n* [Breaking down zero downtime deployments in Kubernetes](https:\u002F\u002Fdeepsource.io\u002Fblog\u002Fzero-downtime-deployment\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Dream11\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.](https:\u002F\u002Fblog.dream11engineering.com\u002Fdeployment-at-scale-story-behind-dream11s-in-house-blue-green-deployment-platform-oneclick-b2c761b12896)\n* [Enhancing security and trust with AWS WAFv2](https:\u002F\u002Fblog.dream11engineering.com\u002Fenhancing-security-and-trust-with-aws-wafv2-8b050b1cba37)\n* [Lessons learned from running GraphQL at scale](https:\u002F\u002Fblog.dream11engineering.com\u002Flessons-learned-from-running-graphql-at-scale-2ad60b3cefeb)\n* 
[Break circuits, save Kong 🦍](https:\u002F\u002Fblog.dream11engineering.com\u002Fbreak-circuits-save-kong-3680d88a0639)\n* [Finding Order in Chaos: How We Automated Performance Testing with Torque](https:\u002F\u002Fblog.dream11engineering.com\u002Ffinding-order-in-chaos-how-we-automated-performance-testing-with-torque-6eb63706fcea)\n* [Maintaining hyper-sonic releases at Dream11](https:\u002F\u002Fblog.dream11engineering.com\u002Fmaintaining-hyper-sonic-releases-at-dream11-c26f2145fe28)\n* [To Scale In Or Scale Out? Here’s How We Scale at Dream11](https:\u002F\u002Fblog.dream11engineering.com\u002Fto-scale-in-or-scale-out-heres-how-we-scale-at-dream11-f88ef5e71cbc)\n* [Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11](https:\u002F\u002Fblog.dream11engineering.com\u002Fbuilding-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Dropbox\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Dropbox Engineering Career Framework - Reliability Engineer (SRE)](https:\u002F\u002Fdropbox.github.io\u002Fdbx-career-framework\u002F)\n* [Atlas: Our journey from a Python monolith to a managed platform](https:\u002F\u002Fdropbox.tech\u002Finfrastructure\u002Fatlas--our-journey-from-a-python-monolith-to-a-managed-platform)\n* [Monitoring server applications with Vortex](https:\u002F\u002Fdropbox.tech\u002Finfrastructure\u002Fmonitoring-server-applications-with-vortex)\n* [Athena: Our automated build health management system](https:\u002F\u002Fdropbox.tech\u002Finfrastructure\u002Fathena-our-automated-build-health-management-system)\n* [Interested in becoming a Site Reliability Engineer?](https:\u002F\u002Ftammybutow.medium.com\u002Fgraduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b)\n\n### Videos\n\n* [Service Discovery Challenges at 
Scale](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fnigmatullin)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>eBay\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Resiliency and Disaster Recovery with Kafka](https:\u002F\u002Ftech.ebayinc.com\u002Fengineering\u002Fresiliency-and-disaster-recovery-with-kafka\u002F)\n* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https:\u002F\u002Ftech.ebayinc.com\u002Fengineering\u002Fsre-case-study-triage-a-non-heap-jvm-out-of-memory-issue\u002F)\n* [SRE Case Study: Mysterious Traffic Imbalance](https:\u002F\u002Ftech.ebayinc.com\u002Fengineering\u002Fsre-case-study-mysterious-traffic-imbalance\u002F)\n* [Zero Downtime, Instant Deployment and Rollback](https:\u002F\u002Ftech.ebayinc.com\u002Fengineering\u002Fzero-downtime-instant-deployment-and-rollback\u002F)\n* [How eBay’s Notification Platform Used Fault Injection in New Ways](https:\u002F\u002Finnovation.ebayinc.com\u002Ftech\u002Fengineering\u002Fhow-ebays-notification-platform-used-fault-injection-in-new-ways\u002F)\n\n### Video\n\n* [Madaari: Ordering for the Monkeys](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fraina)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Epic Games\u003C\u002Fsummary>\n\n### Video\n\n* [AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players](https:\u002F\u002Fyoutu.be\u002FMCLrA401vHw)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Etsy\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Improving the Deployment Experience of a Ten-Year Old Application](https:\u002F\u002Fcodeascraft.com\u002F)\n* [How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020](https:\u002F\u002Fcodeascraft.com\u002F2021\u002F02\u002F25\u002Fhow-etsy-prepared-for-historic-volumes-of-holiday-traffic-in-2020\u002F)\n* [Your brain on 
progress](https:\u002F\u002Fincrement.com\u002Freliability\u002Fbrain-on-progress\u002F)\n* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https:\u002F\u002Fcodeascraft.com\u002F2016\u002F11\u002F17\u002Fdebriefing-facilitation-guide\u002F)\n* [Opsweekly: Measuring on-call experience with alert classification](https:\u002F\u002Fcodeascraft.com\u002F2014\u002F06\u002F19\u002Fopsweekly-measuring-on-call-experience-with-alert-classification\u002F)\n* [Demystifying Site Outages](https:\u002F\u002Fblog.etsy.com\u002Fnews\u002F2012\u002Fdemystifying-site-outages\u002F)\n* [Blameless PostMortems and a Just Culture](https:\u002F\u002Fcodeascraft.com\u002F2012\u002F05\u002F22\u002Fblameless-postmortems\u002F)\n* [Measure Anything, Measure Everything](https:\u002F\u002Fcodeascraft.com\u002F2011\u002F02\u002F15\u002Fmeasure-anything-measure-everything\u002F)\n\n### Videos\n\n* [Velocity 09: John Allspaw and Paul Hammond, \"10+ Deploys Pe](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=LdOe18KhtT4)\n* [Migrating a Monolith to the Cloud](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fgovande)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Expedia\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Automating Performance Standards](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fautomating-performance-standards-b51efc92d237)\n* [Error Budget Policy - Part 1 - Adoption at Expedia Group](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Ferror-budget-policy-adoption-at-expedia-group-7d80d41c4a8b)\n* [Error Budget Policy - Part 2 - Practices at Expedia Group](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Ferror-budget-policies-in-practice-4c98f56a28c1)\n* [Using Fault-Injection to Improve our new Runtime Platform’s Reliability](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fusing-fault-injection-to-improve-our-new-platforms-reliability-656b1147b132)\n* [Learning from 
Incidents at Expedia Group](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Flearning-from-incidents-at-expedia-group-51a8c72a4286)\n* [Improving Vrbo Homepage Loading Experience](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fimproving-vrbo-homepage-loading-experience-e4b2207535f4)\n* [Troubleshooting 502 errors: ECS Checklist](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Ftroubleshooting-502-errors-ecs-checklist-9da383399d96)\n* [Getting Started with Elasticsearch](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fgetting-started-with-elastic-search-6af62d7df8dd)\n* [All about ISTIO-PROXY 5xx Issues](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fall-about-istio-proxy-5xx-issues-e0221b29e692)\n* [Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fautoscaling-in-kubernetes-why-doesnt-the-horizontal-pod-autoscaler-work-for-me-5f0094694054)\n* [How to Keep Your Kubernetes Deployments Balanced Across Multiple zones](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fhow-to-keep-your-kubernetes-deployments-balanced-across-multiple-zones-dfe719847b41)\n* [Are Your Dropwizard Latency Metrics Misleading You?](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fyour-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374)\n* [The Cost of 100% Reliability](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fthe-cost-of-100-reliability-ecb2901f23a4)\n* [Creating Monitoring Dashboards](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fcreating-monitoring-dashboards-1f3fbe0ae1ac)\n* [Using Bash for DevOps](https:\u002F\u002Fmedium.com\u002Fexpedia-group-tech\u002Fusing-bash-for-devops-7046eed1aa63)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Fastly\u003C\u002Fsummary>\n\n### Videos\n\n* [SRE & Product Management: How to Level up Your Team (and Career!) 
by Thinking like a Product Manager](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fwohlner)\n* [Resilience Engineering Mythbusting](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fgallego)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>G-Research\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Our SRE Journey at G-Research](https:\u002F\u002Fwww.gresearch.com\u002Fblog\u002Farticle\u002Four-sre-journey-at-g-research\u002F)\n* [The SRE Journey Continues](https:\u002F\u002Fwww.gresearch.com\u002Fblog\u002Farticle\u002Fthe-sre-journey-continues\u002F)\n* [OpenTSDB Meta Cache – trade-offs for performance](https:\u002F\u002Fwww.gresearch.com\u002Fblog\u002Farticle\u002Fopentsdb-meta-cache-trade-offs-for-performance\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Getaround\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How we handle incidents at Getaround](https:\u002F\u002Fgetaround.tech\u002Fincident-handling-at-getaround\u002F)\n* [Evolution Of Our Continuous Delivery Process](https:\u002F\u002Fgetaround.tech\u002Fcontinuous-integration\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>GitHub\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How we improved availability through iterative simplification](https:\u002F\u002Fgithub.blog\u002Fengineering\u002Fengineering-principles\u002Fhow-we-improved-availability-through-iterative-simplification\u002F)\n* [How we improved push processing on GitHub](https:\u002F\u002Fgithub.blog\u002Fengineering\u002Farchitecture-optimization\u002Fhow-we-improved-push-processing-on-github\u002F)\n* [How GitHub uses merge queue to ship hundreds of changes every day](https:\u002F\u002Fgithub.blog\u002Fengineering\u002Fengineering-principles\u002Fhow-github-uses-merge-queue-to-ship-hundreds-of-changes-every-day\u002F)\n* [Fixing security vulnerabilities with 
AI](https:\u002F\u002Fgithub.blog\u002Fengineering\u002Fplatform-security\u002Ffixing-security-vulnerabilities-with-ai\u002F)\n* [GitHub’s Engineering Fundamentals program: How we deliver on availability, security, and accessibility](https:\u002F\u002Fgithub.blog\u002Fengineering\u002Fengineering-principles\u002Fgithubs-engineering-fundamentals-program-how-we-deliver-on-availability-security-and-accessibility\u002F)\n* [How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com](https:\u002F\u002Fgithub.blog\u002F2023-09-26-how-github-uses-github-actions-and-actions-larger-runners-to-build-and-test-github-com\u002F)\n* [The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects](https:\u002F\u002Fgithub.blog\u002F2023-09-21-the-github-security-labs-journey-to-disclosing-500-cves-in-open-source-projects\u002F)\n* [CodeQL team uses AI to power vulnerability detection in code](https:\u002F\u002Fgithub.blog\u002F2023-09-12-codeql-team-uses-ai-to-power-vulnerability-detection-in-code\u002F)\n* [Addressing GitHub’s recent availability issues](https:\u002F\u002Fgithub.blog\u002F2023-05-16-addressing-githubs-recent-availability-issues\u002F)\n* [Building organization-wide governance and re-use for CI\u002FCD and automation with GitHub Actions](https:\u002F\u002Fgithub.blog\u002F2023-04-05-building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions\u002F)\n* [Enabling branch deployments through IssueOps with GitHub Actions](https:\u002F\u002Fgithub.blog\u002F2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions\u002F)\n* [Using ChatOps to help Actions on-call engineers](https:\u002F\u002Fgithub.blog\u002F2021-12-01-using-chatops-to-help-actions-on-call-engineers\u002F)\n* [Partitioning GitHub’s relational databases to handle scale](https:\u002F\u002Fgithub.blog\u002F2021-09-27-partitioning-githubs-relational-databases-scale\u002F)\n* [Increasing developer happiness with 
GitHub code scanning](https:\u002F\u002Fgithub.blog\u002F2021-09-07-increasing-developer-happiness-github-code-scanning\u002F)\n* [Why (and how) GitHub is adopting OpenTelemetry](https:\u002F\u002Fgithub.blog\u002F2021-05-26-why-and-how-github-is-adopting-opentelemetry\u002F)\n* [Improving large monorepo performance on GitHub](https:\u002F\u002Fgithub.blog\u002F2021-03-16-improving-large-monorepo-performance-on-github\u002F)\n* [Deployment reliability at GitHub](https:\u002F\u002Fgithub.blog\u002F2021-02-03-deployment-reliability-at-github\u002F)\n* [Improving how we deploy GitHub](https:\u002F\u002Fgithub.blog\u002F2021-01-25-improving-how-we-deploy-github\u002F)\n* [Building On-Call Culture at GitHub](https:\u002F\u002Fgithub.blog\u002F2021-01-06-building-on-call-culture-at-github\u002F)\n* [Reducing flaky builds by 18x](https:\u002F\u002Fgithub.blog\u002F2020-12-16-reducing-flaky-builds-by-18x\u002F)\n* [The evolving role of operations in DevOps](https:\u002F\u002Fgithub.blog\u002F2020-12-03-the-evolving-role-of-operations-in-devops\u002F)\n* [Getting started with DevOps automation](https:\u002F\u002Fgithub.blog\u002F2020-10-29-getting-started-with-devops-automation\u002F)\n* [MySQL High Availability at GitHub](https:\u002F\u002Fgithub.blog\u002F2018-06-20-mysql-high-availability-at-github\u002F)\n\n### Major incidents & analysis reports\n\n* [GitHub Availability Report: August 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-august-2024\u002F)\n* [GitHub Availability Report: July 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-july-2024\u002F)\n* [GitHub Availability Report: June 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-june-2024\u002F)\n* [GitHub Availability Report: May 
2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-may-2024\u002F)\n* [GitHub Availability Report: April 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-april-2024\u002F)\n* [GitHub Availability Report: March 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-march-2024\u002F)\n* [GitHub Availability Report: February 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-february-2024\u002F)\n* [GitHub Availability Report: January 2024](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-january-2024\u002F)\n* [GitHub Availability Report: December 2023](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-december-2023\u002F)\n* [GitHub Availability Report: November 2023](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-november-2023\u002F)\n* [GitHub Availability Report: October 2023](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-october-2023\u002F)\n* [GitHub Availability Report: September 2023](https:\u002F\u002Fgithub.blog\u002Fnews-insights\u002Fcompany-news\u002Fgithub-availability-report-september-2023\u002F)\n* [GitHub Availability Report: August 2023](https:\u002F\u002Fgithub.blog\u002F2023-09-13-github-availability-report-august-2023\u002F)\n* [GitHub Availability Report: July 2023](https:\u002F\u002Fgithub.blog\u002F2023-08-09-github-availability-report-july-2023\u002F)\n* [GitHub Availability Report: June 2023](https:\u002F\u002Fgithub.blog\u002F2023-07-12-github-availability-report-june-2023\u002F)\n* [GitHub Availability Report: May 2023](https:\u002F\u002Fgithub.blog\u002F2023-06-14-github-availability-report-may-2023\u002F)\n* [GitHub 
Availability Report: April 2023](https:\u002F\u002Fgithub.blog\u002F2023-05-03-github-availability-report-april-2023\u002F)\n* [GitHub Availability Report: March 2023](https:\u002F\u002Fgithub.blog\u002F2023-04-05-github-availability-report-march-2023\u002F)\n* [GitHub Availability Report: February 2023](https:\u002F\u002Fgithub.blog\u002F2023-03-01-github-availability-report-february-2023\u002F)\n* [GitHub Availability Report: January 2023](https:\u002F\u002Fgithub.blog\u002F2023-02-01-github-availability-report-january-2023\u002F)\n* [GitHub Availability Report: December 2022](https:\u002F\u002Fgithub.blog\u002F2023-01-04-github-availability-report-december-2022\u002F)\n* [GitHub Availability Report: November 2022](https:\u002F\u002Fgithub.blog\u002F2022-12-07-github-availability-report-november-2022\u002F)\n* [GitHub Availability Report: October 2022](https:\u002F\u002Fgithub.blog\u002F2022-11-02-github-availability-report-october-2022\u002F)\n* [GitHub Availability Report: September 2022](https:\u002F\u002Fgithub.blog\u002F2022-10-05-github-availability-report-september-2022\u002F)\n* [GitHub Availability Report: August 2022](https:\u002F\u002Fgithub.blog\u002F2022-09-07-github-availability-report-august-2022\u002F)\n* [GitHub Availability Report: July 2022](https:\u002F\u002Fgithub.blog\u002F2022-08-03-github-availability-report-july-2022\u002F)\n* [GitHub Availability Report: June 2022](https:\u002F\u002Fgithub.blog\u002F2022-07-06-github-availability-report-june-2022\u002F)\n* [GitHub Availability Report: May 2022](https:\u002F\u002Fgithub.blog\u002F2022-06-01-github-availability-report-may-2022\u002F)\n* [GitHub Availability Report: April 2022](https:\u002F\u002Fgithub.blog\u002F2022-05-04-github-availability-report-april-2022\u002F)\n* [GitHub Availability Report: March 2022](https:\u002F\u002Fgithub.blog\u002F2022-04-06-github-availability-report-march-2022\u002F)\n* [GitHub Availability Report: February 
2022](https:\u002F\u002Fgithub.blog\u002F2022-03-02-github-availability-report-february-2022\u002F)\n* [GitHub Availability Report: January 2022](https:\u002F\u002Fgithub.blog\u002F2022-02-02-github-availability-report-january-2022\u002F)\n* [GitHub Availability Report: December 2021](https:\u002F\u002Fgithub.blog\u002F2022-01-05-github-availability-report-december-2021\u002F)\n* [GitHub Availability Report: November 2021](https:\u002F\u002Fgithub.blog\u002F2021-12-01-github-availability-report-november-2021\u002F)\n* [GitHub Availability Report: October 2021](https:\u002F\u002Fgithub.blog\u002F2021-11-04-github-availability-report-october-2021\u002F)\n* [GitHub Availability Report: September 2021](https:\u002F\u002Fgithub.blog\u002F2021-10-06-github-availability-report-september-2021\u002F)\n* [GitHub Availability Report: August 2021](https:\u002F\u002Fgithub.blog\u002F2021-09-01-github-availability-report-august-2021\u002F)\n* [GitHub Availability Report: July 2021](https:\u002F\u002Fgithub.blog\u002F2021-08-04-github-availability-report-july-2021\u002F)\n* [GitHub Availability Report: June 2021](https:\u002F\u002Fgithub.blog\u002F2021-07-07-github-availability-report-june-2021\u002F)\n* [GitHub Availability Report: May 2021](https:\u002F\u002Fgithub.blog\u002F2021-06-02-github-availability-report-may-2021\u002F)\n* [GitHub Availability Report: April 2021](https:\u002F\u002Fgithub.blog\u002F2021-05-05-github-availability-report-april-2021\u002F)\n* [GitHub Availability Report: March 2021](https:\u002F\u002Fgithub.blog\u002F2021-04-07-github-availability-report-march-2021\u002F)\n* [GitHub Availability Report: February 2021](https:\u002F\u002Fgithub.blog\u002F2021-03-03-github-availability-report-february-2021\u002F)\n* [GitHub Availability Report: January 2021](https:\u002F\u002Fgithub.blog\u002F2021-02-02-github-availability-report-january-2021\u002F)\n* [GitHub Availability Report: December 
2020](https:\u002F\u002Fgithub.blog\u002F2021-01-06-github-availability-report-december-2020\u002F)\n* [GitHub Availability Report: November 2020](https:\u002F\u002Fgithub.blog\u002F2020-12-02-availability-report-november-2020\u002F)\n* [GitHub Availability Report: August 2020](https:\u002F\u002Fgithub.blog\u002F2020-09-02-github-availability-report-august-2020\u002F)\n* [GitHub Availability Report: July 2020](https:\u002F\u002Fgithub.blog\u002F2020-08-05-github-availability-report-july-2020\u002F)\n* [Introducing the GitHub Availability Report](https:\u002F\u002Fgithub.blog\u002F2020-07-08-introducing-the-github-availability-report\u002F)\n* [February service disruptions post-incident analysis](https:\u002F\u002Fgithub.blog\u002F2020-03-26-february-service-disruptions-post-incident-analysis\u002F)\n* [October 21 post-incident analysis](https:\u002F\u002Fgithub.blog\u002F2018-10-30-oct21-post-incident-analysis\u002F)\n* [February 28th DDoS Incident Report](https:\u002F\u002Fgithub.blog\u002F2018-03-01-ddos-incident-report\u002F)\n* [Incident Report: Inadvertent Private Repository Disclosure](https:\u002F\u002Fgithub.blog\u002F2016-10-28-incident-report-inadvertent-private-repository-disclosure\u002F)\n\n### Videos\n\n* [One on One SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Ftobey)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>GitLab\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [This SRE attempted to roll out an HAProxy config change. 
You won't believe what happened next...](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2021\u002F01\u002F14\u002Fthis-sre-attempted-to-roll-out-an-haproxy-change\u002F)\n* [My week shadowing a GitLab Site Reliability Engineer](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2019\u002F12\u002F16\u002Fsre-shadow\u002F)\n* [Update: Elasticsearch lessons learnt for Advanced Global Search](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2020\u002F04\u002F28\u002Felasticsearch-update\u002F)\n* [Lessons in iteration from a new team in infrastructure](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2020\u002F11\u002F09\u002Flessons-in-iteration-from-new-infrastructure-team\u002F)\n* [How we optimized infrastructure spend at GitLab](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2020\u002F10\u002F27\u002Fhow-we-optimized-our-infrastructure-spend-at-gitlab\u002F)\n* [How we scaled async workload processing at GitLab.com using Sidekiq](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2020\u002F06\u002F24\u002Fscaling-our-use-of-sidekiq\u002F)\n* [Inside GitLab: How we release software patches](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2020\u002F05\u002F13\u002Fhow-we-release-software-patches\u002F)\n* [What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2019\u002F11\u002F15\u002Ftracking-down-missing-tcp-keepalives\u002F)\n* [How we used delayed replication for disaster recovery with PostgreSQL](https:\u002F\u002Fabout.gitlab.com\u002Fblog\u002F2019\u002F02\u002F13\u002Fdelayed-replication-for-disaster-recovery-with-postgresql\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>GoCardless\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https:\u002F\u002Fmedium.com\u002Fgocardless-tech\u002Fdeploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)\n* 
[How we compress Pub\u002FSub messages and more, saving a load of money](https:\u002F\u002Fmedium.com\u002Fgocardless-tech\u002Fhow-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)\n* [Fear-free PostgreSQL migrations for Rails](https:\u002F\u002Fgocardless.com\u002Fblog\u002Ffear-free-postgresql-migrations-for-rails\u002F)\n* [Observability at GoCardless: a tale of API performance improvement](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fobservability-at-gocardless-a-tale-of-api-performance-improvement\u002F)\n* [Debugging the PostgreSQL query planner](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fdebugging-the-postgres-query-planner\u002F)\n* [Zero-downtime Postgres migrations - the hard parts](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fzero-downtime-postgres-migrations-the-hard-parts\u002F)\n* [In search of performance - how we shaved 200ms off every POST request](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fin-search-of-performance-how-we-shaved-200ms-off-every-post-request\u002F)\n\n### Major incidents & analysis reports\n\n* [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fincident-review-service-outage-on-25-october-2020\u002F)\n* [Incident review: API and Dashboard outage on 10 October 2017](https:\u002F\u002Fgocardless.com\u002Fblog\u002Fincident-review-api-and-dashboard-outage-on-10th-october\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>GoDaddy\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Kubernetes Gated Deployments](https:\u002F\u002Fwww.godaddy.com\u002Fengineering\u002F2019\u002F08\u002F13\u002Fkubernetes-gated-deployments\u002F)\n* [Kubernetes External Secrets](https:\u002F\u002Fwww.godaddy.com\u002Fengineering\u002F2019\u002F04\u002F16\u002Fkubernetes-external-secrets\u002F)\n* [Kubernetes - A Practical Introduction for Application 
Developers](https:\u002F\u002Fwww.godaddy.com\u002Fengineering\u002F2018\u002F05\u002F02\u002Fkubernetes-introduction-for-developers\u002F)\n* [An Intuitive Node.js Client for the Kubernetes API](https:\u002F\u002Fwww.godaddy.com\u002Fengineering\u002F2018\u002F04\u002F10\u002Fan-intuitive-nodejs-client-for-the-kubernetes-api\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Gojek\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Introducing Skynet: Infrastructure as Code for Gojek](https:\u002F\u002Fwww.gojek.io\u002Fblog\u002Fintroducing-skynet\u002F)\n* [Scaling Our Geo-Search Service For 10x Load](https:\u002F\u002Fwww.gojek.io\u002Fblog\u002Fscaling-our-geo-search-service-for-10x-load\u002F)\n* [Why We Swear by the RCA](https:\u002F\u002Fwww.gojek.io\u002Fblog\u002Fwhy-we-swear-by-the-rca)\n* [How We Upgrade Kubernetes on GKE](https:\u002F\u002Fblog.gojek.io\u002Fhow-we-upgrade-kubernetes-on-gke\u002F)\n* [How We Monitor Apache Airflow in Production](https:\u002F\u002Fblog.gojek.io\u002Fhow-we-monitor-apache-airflow-in-production\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Goldman Sachs\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [SecDb Observability Journey](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fsecdb-observability-journey)\n* [Chaos Testing an Application on AWS](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fchaos-testing-an-application-on-aws)\n* [Forecasting Capacity Outages Using Machine Learning to Bolster Application Resiliency](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fforecasting-capacity-outages-using-machine-learning-to-bolster-application-resiliency)\n* [Providing 99.9% Availability and Sub-Second Response Times with Sybase IQ Multiplexes by Using HAProxy](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fproviding-999-availability-and-sub-second-response-times-with-sybase-iq-multiplexes-by-using-haproxy)\n* [Building Multi-Region Resiliency with 
Amazon RDS and Amazon Aurora](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fbuilding-multi-region-resiliency-with-amazon-rds-and-amazon-aurora)\n* [Enabling Highly Available Trino Clusters at Goldman Sachs](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fenabling-highly-available-trino-clusters-at-goldman-sachs)\n* [Observability at Scale](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fobservability-at-scale)\n* [Infrastructure and the Command Chain Pattern](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Finfrastructure-and-command-chain-pattern)\n* [Mobile CICD with EC2 macOS](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fmobile-cicd-with-ec2-macos)\n* [Announcing CatchIT - Source Code Secret Scanner](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Fcatchit-source-code-secret-scanner)\n* [Building Platforms for Data Engineering](https:\u002F\u002Fdeveloper.gs.com\u002Fblog\u002Fposts\u002Flegend_data_engineering_platforms)\n\n### Videos\n\n* [Granular CPU Capacity Management at Scale with eBPF](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon24emea\u002Fpresentation\u002Fbrighton)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Google\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Accelerating incident response using generative AI](https:\u002F\u002Fsecurity.googleblog.com\u002F2024\u002F04\u002Faccelerating-incident-response-using.html)\n* [Pitfalls and Patterns in Microservice Dependency Management](https:\u002F\u002Fwww.infoq.com\u002Farticles\u002Fpitfalls-patterns-microservice-dependency-management\u002F)\n* [SRE Practices & Processes](https:\u002F\u002Fsre.google\u002Fresources\u002F#practicesandprocesses)\n* [Google site reliability using Go](https:\u002F\u002Fgo.dev\u002Fsolutions\u002Fgoogle\u002Fsitereliability)\n* [Three months, 30x demand: How we scaled Google Meet during 
COVID-19](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fg-suite\u002Fkeeping-google-meet-ahead-of-usage-demand-during-covid-19)\n* [SRE Classroom: Distributed PubSub](https:\u002F\u002Fsre.google\u002Fresources\u002Fpractices-and-processes\u002Fdistributed-pubsub\u002F)\n* [How SRE teams are organized, and how to get started](https:\u002F\u002Fcloud.google.com\u002Fblog\u002Fproducts\u002Fdevops-sre\u002Fhow-sre-teams-are-organized-and-how-to-get-started)\n\n### Videos\n\n* [Get Your Non-SREs Oncall Ready!](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon24emea\u002Fpresentation\u002Fvan-winkel)\n* [Reliable Data for Large ML Models: Principles and Practices](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon23emea\u002Fpresentation\u002Fmcglohon)\n* [New Grads Becoming New SREs: Catalyzing a “Circle of Life” in Ireland](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon23emea\u002Fpresentation\u002Fpetoff)\n* [SRE for [cyber]security](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon23emea\u002Fpresentation\u002Ffischbach)\n* [Artificial Intelligence: How Much Will It Cost You?](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon23emea\u002Fpresentation\u002Funderwood)\n* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https:\u002F\u002Fyoutu.be\u002FuTEL8Ff1Zvk)\n* [Risk and Error Budgets with Seth Vargo and Liz Fong-Jones of Google](https:\u002F\u002Fyoutu.be\u002Fy2ILKr8kCJU)\n* [Pragmatic Automation with Max Luebbe of GCP](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=oDcjAcFTFC0&t=0m56s)\n* [Must Watch! 
- Google SRE YouTube Playlist](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)\n* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fstanke)\n* [Implementing Distributed Consensus](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fludtke)\n* [The SRE I Aspire to Be](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Faknin)\n* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fperry)\n* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fczapinski)\n* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Funderwood)\n* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fdesai)\n* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fpetoff)\n* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fgleason)\n* [Practical Instrumentation for Observability](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fkrabbe)\n* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fsato)\n* [Unified 
Reporting of Service Reliability](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fzhang)\n* [How to Trade off Server Utilization and Tail Latency](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fplenz)\n* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fnolan-loadbalancing)\n* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fvirji)\n* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Flutz)\n* [Pragmatic Automation](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fluebbe)\n* [Sublinear Scaling in Practice: The 1k SRE Project](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Frath)\n* [Strategies to Edit Production Data](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fqiu)\n* [The Curse of SRE Autonomy and How to Manage It](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fbondi)\n* [Scaling SRE Organizations: The Journey from 1 to Many Teams](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Ffranco)\n* [SRE Classroom - How to Design a Distributed System in 3 Hours](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fthomas)\n* [Using PRDs and User Journeys to Design User-Friendly Tools](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fstockman)\n* [How Google SRE and Developers Work 
Together](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DOQqOrHs3VY)\n* [SREcon21 - Experiments for SRE](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=yjusNjAFxFg)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Grab\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Our Journey to Continuous Delivery at Grab (Part 1)](https:\u002F\u002Fengineering.grab.com\u002Four-journey-to-continuous-delivery-at-grab)\n* [Our Journey to Continuous Delivery at Grab (Part 2)](https:\u002F\u002Fengineering.grab.com\u002Fblog\u002F2\u002F)\n* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)](https:\u002F\u002Fengineering.grab.com\u002Fdesigning-resilient-systems-part-1)\n* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)](https:\u002F\u002Fengineering.grab.com\u002Fdesigning-resilient-systems-part-2)\n* [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https:\u002F\u002Fengineering.grab.com\u002Fbeyond-retries-part-3)\n* [Orchestrating Chaos using Grab's Experimentation Platform](https:\u002F\u002Fengineering.grab.com\u002Fchaos-engineering)\n* [How We Designed the Quotas Microservice to Prevent Resource Abuse](https:\u002F\u002Fengineering.grab.com\u002Fquotas-service)\n* [How We Scaled Our Cache and Got a Good Night's Sleep](https:\u002F\u002Fengineering.grab.com\u002Fhow-we-scaled-our-cache-and-got-a-good-nights-sleep)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Grammarly\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Scaling AWS Infrastructure to Support Multiple Regions](https:\u002F\u002Fwww.grammarly.com\u002Fblog\u002Fengineering\u002Fscaling-aws-infrastructure\u002F)\n* [Security Operations in an AWS Environment](https:\u002F\u002Fwww.grammarly.com\u002Fblog\u002Fengineering\u002Fsecurity-infrastructure-aws\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Gusto\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Service Level Objectives for On-call Peace of 
Mind](https:\u002F\u002Fengineering.gusto.com\u002Fslos-for-peace-of-mind\u002F)\n* [Debugging Sidekiq Poison Pills](https:\u002F\u002Fengineering.gusto.com\u002Fdebugging-sidekiq-poison-pills\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Halodoc\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Site Reliability Engineering for Native mobile apps](https:\u002F\u002Fwww.infoq.com\u002Farticles\u002Fsite-reliability-engineering-mobile-apps\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Heroku\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [The Adventures of Rendezvous in Heroku’s New Architecture](https:\u002F\u002Fblog.heroku.com\u002Fengineering)\n* [Incident Response at Heroku](https:\u002F\u002Fblog.heroku.com\u002Fincident-response-at-heroku-2020)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>IBM\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [What is Site Reliability Engineering (SRE)?](https:\u002F\u002Fwww.ibm.com\u002Fcloud\u002Flearn\u002Fsite-reliability-engineering)\n* [AIOps tools and solutions](https:\u002F\u002Fwww.ibm.com\u002Fcloud\u002Faiops)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Indeed\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Indeed SRE: An Inside Look](https:\u002F\u002Fengineering.indeedblog.com\u002Fblog\u002F2022\u002F04\u002Fsre\u002F)\n* [Being Just Reliable Enough](https:\u002F\u002Fengineering.indeedblog.com\u002Fblog\u002F2019\u002F10\u002Fbeing-just-reliable-enough\u002F)\n* [Automating Indeed’s Release Process](https:\u002F\u002Fengineering.indeedblog.com\u002Fblog\u002F2017\u002F03\u002Fautomating-release-process\u002F)\n* [Sloth, a Tool for Inducing Network Failures with Preetha Appan of Indeed.com](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon17americas\u002Fprogram\u002Fpresentation\u002Fappan)\n\n### Videos\n\n* [Are We Getting Better Yet? 
Progress Toward Safer Operations](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Felman)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>JioCinema\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [SRE Playbook - Practical Guide](https:\u002F\u002Fblog.jiocinema.com\u002Fsre-playbook-practical-guide\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Khan Academy\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https:\u002F\u002Fblog.khanacademy.org\u002Fhow-khan-academy-successfully-handled-2-5x-traffic-in-a-week\u002F)\n* [Evolving our content infrastructure](https:\u002F\u002Fblog.khanacademy.org\u002Fevolving-our-content-infrastructure\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>LinkedIn\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Rethinking site capacity projections with Capacity Analyzer](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2021\u002Frethinking-site-capacity-projections-with-capacity-analyzer)\n* [Insights into a Product SRE team at LinkedIn](https:\u002F\u002Fwww.linkedin.com\u002Fpulse\u002Finsights-product-sre-team-linkedin-zaina-afoulki\u002F?trackingId=mxKJgZ3kp8l2WI9D4UZv7Q%3D%3D)\n* [Hiring SREs at LinkedIn](https:\u002F\u002Fengineering.linkedin.com\u002Fengineering-culture\u002Fhiring-sres-linkedin)\n* [Open source update: School of SRE](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2021\u002Fopen-source-update--school-of-sre)\n* [Fixing Linux filesystem performance regressions](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2020\u002Ffixing-linux-filesystem-performance-regressions)\n* [Production testing with dark canaries](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2020\u002Fproduction-testing-with-dark-canaries)\n* [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring 
platform](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2019\u002F06\u002Fsmart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)\n* [Iris mobile: An open source, mobile interface for incident management](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2019\u002F05\u002Firis-mobile--an-open-source--mobile-interface-for-incident-manag)\n* [LinkedOut: A Request-Level Failure Injection Framework](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2018\u002F05\u002Flinkedout--a-request-level-failure-injection-framework)\n* [Eliminating toil with fully automated load testing](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2019\u002Feliminating-toil-with-fully-automated-load-testing)\n* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2018\u002F03\u002Fthe-makeup-of-successful-geographically-distributed-sre-teams--p)\n* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2018\u002F03\u002Fthe-makeup-of-successful-geographically-distributed-sre-teams--p0)\n* [Project STAR*: Streamlining Our On-Call Process](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2018\u002F01\u002Fproject-star-streamlining-our-on-call-process)\n* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F12\u002Fopen-sourcing-fossor-and-ascii-etch)\n* [Resilience Engineering at LinkedIn with Project Waterbear](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F11\u002Fresilience-engineering-at-linkedin-with-project-waterbear)\n* [Hiring SREs at LinkedIn, 2017](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F07\u002Fhiring-sres-at-linkedin)\n* [Open Sourcing Iris and 
Oncall](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F06\u002Fopen-sourcing-iris-and-oncall)\n* [Building the SRE Culture at LinkedIn](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F05\u002Fbuilding-the-sre-culture-at-linkedin)\n* [Failure is Not an Option](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2017\u002F01\u002Ffailure-is-not-an-option)\n* [MTTD and MTTR Are Key](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2016\u002F12\u002Fmttd-and-mttr-are-key)\n* [What Gets Measured Gets Fixed](https:\u002F\u002Fengineering.linkedin.com\u002Fblog\u002F2016\u002F12\u002Fwhat-gets-measured-gets-fixed)\n\n### Videos\n\n* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ZemNg9GYvOA)\n* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fdoherty)\n* [Weathering the Storm: How Early Warnings Save the Farm](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fsherwin)\n* [Unconference: Unsolved Problems in SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fandersen)\n* [Leading without Managing: Becoming an SRE Technical Leader](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fpalino-leading)\n* [Why Does (My) Monitoring Suck?](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fpalino-monitoring)\n* [Traffic Forecasting and Stress Testing Infrastructure](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fsulakhe)\n* [Collective Mindfulness for Better Decisions in SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fandersen-mindfulness)\n* [TCP—Architecture, Enhancements, and 
Tuning](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fdhakal)\n* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Flamba)\n* [Understanding Business Metrics Can Make You a Better SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fsuley)\n* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fkehoe)\n* [Differences in SRE Implementations across Companies](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fandersen)\n\n### Tools\n\n* [On-Call](https:\u002F\u002Fgithub.com\u002Flinkedin\u002Foncall)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Loggi\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [The Release Manager model](https:\u002F\u002Fpartiu.loggi.com\u002Fthe-release-manager-model-7af93f9f499f)\n* [SRE Teams #8: Loggi](https:\u002F\u002Fsreteams.substack.com\u002Fp\u002Floggi)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Loveholidays\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Dynamic alert routing with Prometheus and Alertmanager](https:\u002F\u002Ftech.loveholidays.com\u002Fdynamic-alert-routing-with-prometheus-and-alertmanager-f6a919edb5f8)\n* [Making loveholidays 18% faster with HTTP\u002F3](https:\u002F\u002Ftech.loveholidays.com\u002Fmaking-loveholidays-18-faster-with-http-3-1860879528a7)\n* [Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code](https:\u002F\u002Ftech.loveholidays.com\u002Fenforcing-best-practice-on-self-serve-infrastructure-with-terraform-atlantis-and-policy-as-code-911f4f8c3e00)\n* [The 5 principles that helped scale 
loveholidays](https:\u002F\u002Ftech.loveholidays.com\u002Fthe-5-principles-that-helped-scale-loveholidays-7ea0b0fd3df9)\n* [Realtime Fastly logs with Grafana Loki for under $1 a day](https:\u002F\u002Ftech.loveholidays.com\u002Frealtime-fastly-logs-with-grafana-loki-for-under-1-a-day-5b63ccf32d66)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Macquarie\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Our DevSecOps journey with Golang](https:\u002F\u002Fmedium.com\u002Fmacquarie-engineering-blog\u002Four-devsecops-journey-with-golang-a1af38328c36)\n* [Pipeline Configuration as Code with Kotlin](https:\u002F\u002Fmedium.com\u002Fmacquarie-engineering-blog\u002Fpipeline-configuration-as-code-with-kotlin-dec9ab9ee6fa)\n* [DevOps and Segregation of Duties](https:\u002F\u002Fmedium.com\u002Fmacquarie-engineering-blog\u002Fdevops-and-segregation-of-duties-ea4a7dcc7217)\n* [Macquarie embraces DevOps](https:\u002F\u002Fmedium.com\u002Fmacquarie-engineering-blog\u002Fmacquarie-embraces-devops-30f0fe62496a)\n* [Scaling a Kubernetes Platform across the Enterprise](https:\u002F\u002Fmedium.com\u002Fmacquarie-engineering-blog\u002Fscaling-a-kubernetes-platform-across-the-enterprise-c07a53b6022e)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Mattermost\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Monitoring Cloud Environments at Scale with Prometheus and Thanos](https:\u002F\u002Fmattermost.com\u002Fblog\u002Fmonitoring-cloud-environments-at-scale-with-prometheus-and-thanos\u002F)\n* [How We Use Sloth to do SLO Monitoring and Alerting with Prometheus](https:\u002F\u002Fmattermost.com\u002Fblog\u002Fsloth-for-slo-monitoring-and-alerting-with-prometheus\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Meituan (美团)\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [The development and practice of SRE in the cloud 
(云端的SRE发展与实践)](https:\u002F\u002Ftech.meituan.com\u002F2017\u002F08\u002F03\u002Fmeituanyun-sre.html)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Mercari\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20220805-who-watches-the-watchmen-keeping-an-eye-on-our-monitoring-systems\u002F)\n* [What the Microservices SRE Team are doing as SRE Evangelists](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20220225-cdb2b6deff\u002F)\n* [What it’s like to work as an embedded microservices SRE](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20220228-work-as-an-embedded-microservices-sre\u002F)\n* [The Merpay SRE Team: Past and future](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20210831-a91c3dca9d\u002F)\n* [Embedded SRE at Mercari](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20220221-embedded-sre-at-mercari\u002F)\n* [What the SRE team wants to achieve with the development team](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20210129-embedded-sre\u002F)\n* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry\u002F)\n* [How do we share troubleshooting skills](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F2020-01-28-143339\u002F)\n* [Datadog Dashboard at Scale w \u002F Terraform](https:\u002F\u002Fengineering.mercari.com\u002Fen\u002Fblog\u002Fentry\u002F2019-12-09-122134\u002F)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Meta\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Leveraging AI for efficient incident 
response](https:\u002F\u002Fengineering.fb.com\u002F2024\u002F06\u002F24\u002Fdata-infrastructure\u002Fleveraging-ai-for-efficient-incident-response\u002F)\n* [Improving Meta’s SLO workflows with data annotations](https:\u002F\u002Fengineering.fb.com\u002F2022\u002F08\u002F29\u002Fdeveloper-tools\u002Fimproving-metas-slo-workflows-with-data-annotations\u002F)\n* [SLICK: Adopting SLOs for improved reliability](https:\u002F\u002Fengineering.fb.com\u002F2021\u002F12\u002F13\u002Fproduction-engineering\u002Fslick\u002F)\n* [More details about the October 4 outage](https:\u002F\u002Fengineering.fb.com\u002F2021\u002F10\u002F05\u002Fnetworking-traffic\u002Foutage-details\u002F)\n* [Update about the October 4th outage](https:\u002F\u002Fengineering.fb.com\u002F2021\u002F10\u002F04\u002Fnetworking-traffic\u002Foutage\u002F)\n\n### Videos\n\n* [Scheduling at Scale: eBPF Schedulers with Sched_ext](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon24emea\u002Fpresentation\u002Fhodges)\n* [A Customer Service Approach to SRE](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Flooney)\n* [How (Not) to Scale a Project: A Post-Mortem](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fbagnoli)\n* [Releasing the World's Largest Python Site Every 7 Minutes](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fwong-shuhong)\n* [Using ML to Automate Dynamic Error Categorization](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fdavoli)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Microsoft\u003C\u002Fsummary>\n\n### Videos\n\n* [‘SLI & Reliability Deep-Dive’ with David N. 
Blank-Edelman of Microsoft](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=1iMo3SkdQqQ)\n* [‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=U3ubcoNzx9k)\n* [Sustainable Software Engineering & SREs](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fjohnson)\n* [Study on Human Factors and Team Culture to Improve Pager Fatigue](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Fbarteneva)\n* [Prioritizing Trust While Creating Applications](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fdavis)\n* [Building Resilience: How to Learn More from Incidents](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fstenning)\n* [A Tale of Two Postmortems: A Human Factors View](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Flund-postmortem)\n* [Availability—Thinking beyond 9s](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Fsrinivasamurthy)\n* [Ironies of Automation: A Comedy in Three Parts](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19asia\u002Fpresentation\u002Flund-comedy)\n* [The Ops in Serverless](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fdavis)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>MIRO\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https:\u002F\u002Fmedium.com\u002Fmiro-engineering\u002Fprometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)\n* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps 
culture](https:\u002F\u002Fmedium.com\u002Fmiro-engineering\u002Fmanaging-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)\n* [Reliable load testing with regards to unexpected nuances](https:\u002F\u002Fmedium.com\u002Fmiro-engineering\u002Freliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Monzo\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Autoscaling Monzo: How we optimise our platform to be just the right size](https:\u002F\u002Fmonzo.com\u002Fblog\u002F2020\u002F10\u002F19\u002Fautoscaling-monzo)\n* [How we’ve evolved on-call at Monzo](https:\u002F\u002Fmonzo.com\u002Fblog\u002Fhow-weve-evolved-on-call-at-monzo)\n* [How we respond to incidents](https:\u002F\u002Fmonzo.com\u002Fblog\u002F2019\u002F07\u002F08\u002Fhow-we-respond-to-incidents)\n* [How we monitor Monzo](https:\u002F\u002Fmonzo.com\u002Fblog\u002F2018\u002F07\u002F27\u002Fhow-we-monitor-monzo)\n\n### Videos\n\n* [Eventually Consistent Service Discovery](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19emea\u002Fpresentation\u002Fpatel)\n\n### Tools\n\n* [Response](https:\u002F\u002Fgithub.com\u002Fmonzo\u002Fresponse)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Netflix\u003C\u002Fsummary>\n\n### Blog Posts\n\n* [Achieving observability in async workflows](https:\u002F\u002Fnetflixtechblog.com\u002Fachieving-observability-in-async-workflows-cd89b923c784)\n* [Building Netflix’s Distributed Tracing Infrastructure](https:\u002F\u002Fnetflixtechblog.com\u002Fbuilding-netflixs-distributed-tracing-infrastructure-bb856c319304)\n* [Lessons from Building Observability Tools at Netflix](https:\u002F\u002Fnetflixtechblog.com\u002Flessons-from-building-observability-tools-at-netflix-7cfafed6ab17)\n* [Edgar: Solving Mysteries Faster with Observability](https:\u002F\u002Fnetflixtechblog.com\u002Fedgar-solving-mysteries-faster-with-observability-e1a76302c71f)\n* 
[Telltale: Netflix Application Monitoring Simplified](https:\u002F\u002Fnetflixtechblog.com\u002Ftelltale-netflix-application-monitoring-simplified-5c08bfa780ba)\n* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https:\u002F\u002Fnetflixtechblog.com\u002Fkeeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)\n* [Introducing Dispatch](https:\u002F\u002Fnetflixtechblog.com\u002Fintroducing-dispatch-da4b8a2a8072)\n* [Applying Netflix DevOps Patterns to Windows](https:\u002F\u002Fnetflixtechblog.com\u002Fapplying-netflix-devops-patterns-to-windows-2a57f2dbbf79)\n* [ChAP: Chaos Automation Platform](https:\u002F\u002Fnetflixtechblog.com\u002Fchap-chaos-automation-platform-53e6d528371f)\n* [Starting the Avalanche](https:\u002F\u002Fnetflixtechblog.com\u002Fstarting-the-avalanche-640e69b14a06)\n* [Netflix Chaos Monkey Upgraded](https:\u002F\u002Fnetflixtechblog.com\u002Fnetflix-chaos-monkey-upgraded-1d679429be5d)\n* [Chaos Engineering Upgraded](https:\u002F\u002Fnetflixtechblog.com\u002Fchaos-engineering-upgraded-878d341f15fa)\n* [Automated Failure Testing](https:\u002F\u002Fnetflixtechblog.com\u002Fautomated-failure-testing-86c1b8bc841f)\n* [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https:\u002F\u002Fnetflixtechblog.com\u002Ffrom-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)\n* [Introducing Atlas: Netflix’s Primary Telemetry Platform](https:\u002F\u002Fnetflixtechblog.com\u002Fintroducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a)\n* [FIT: Failure Injection Testing](https:\u002F\u002Fnetflixtechblog.com\u002Ffit-failure-injection-testing-35d8e2a9bb2)\n* [Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis](https:\u002F\u002Fnetflixtechblog.com\u002Fannouncing-security-monkey-aws-security-configuration-monitoring-and-analysis-1f2bfb001708)\n* [Lessons Netflix 
Learned from the AWS Outage](https:\u002F\u002Fnetflixtechblog.com\u002Flessons-netflix-learned-from-the-aws-outage-deefe5fd0c04)\n* [Scryer: Netflix’s Predictive Auto Scaling Engine](https:\u002F\u002Fnetflixtechblog.com\u002Fscryer-netflixs-predictive-auto-scaling-engine-a3f8fc922270)\n\n### Major incidents & analysis reports\n\n* [Post-mortem of October 22, 2012 AWS degradation](https:\u002F\u002Fnetflixtechblog.com\u002Fpost-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)\n  \n### Videos\n\n* [Achieving Excellence: SLO Thresholds That Transform Service Quality](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon24emea\u002Fpresentation\u002Fortiz)\n* [AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0QS1TWLooo0)\n* [When \u002Fbin\u002Fsh Attacks: Revisiting \"Automate All the Things\"](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon20americas\u002Fpresentation\u002Freed)\n* [How Did Things Go Right? 
Learning More from Incidents](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fsrecon19americas\u002Fpresentation\u002Fkitchens)\n* [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DlWYNoLmma8)\n* [Real user performance monitoring at Netflix scale ‐ Martin Spier](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=4RG2DUK03_0)\n* [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=rgfww8tLM0A)\n* [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=LaKGx0dAUlo)\n* [Netflix: Multi-Regional Resiliency and Amazon Route 53](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=WDDkLOT8SCk)\n* [Designing Services for Resilience: Netflix Lessons](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RWyZkNzvC-c)\n* [South Bay SRE Meetup - Netflix Cloud Performance Team](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uQ0flQOtQEA)\n* [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=T_D1G42G0dE)\n* [How Netflix Uses Kinesis Streams to","This project is a carefully curated collection of publicly available resources on how technology companies around the world practice Site Reliability Engineering (SRE). It brings together best practices, tools, techniques, and culture from leading technology organizations, covering topics from building SRE teams to chaos engineering, and aims to give readers a comprehensive body of SRE knowledge. The project is built with JavaScript and uses automated tests to ensure content quality. It is a useful reference for organizations or individuals looking to improve system stability and operational efficiency, and is especially valuable in scenarios such as DevOps transformation, strengthening monitoring and observability, and optimizing incident response.","2026-06-11 03:36:34","high_star"]