🙏 A big thank you to our new sponsor, NexusTek! 🙏


Fudge Sunday - Cloud in Public: Mean Time To RCA

by Jay Cuthrell

Start the week more informed

This week we continue to take a look at public things for a public cloud.

☁️✅⚠️🛑

This issue is part 4 of a 5-part series

  1. Fudge Sunday - Cloud in Public: Status Dashboards
  2. Fudge Sunday - Cloud in Public: Engineering SLO
  3. Fudge Sunday - Cloud in Public: DevCommsOps
  4. Fudge Sunday - Cloud in Public: Mean Time To RCA
  5. Fudge Sunday - Cloud in Public: Impact Mapping

As of this issue, we now have historical perspectives and definitions for status dashboards, Engineering SLO, and DevCommsOps. Next, let’s talk about cultural values and the innovations which drive continuous improvement in pursuit of publishing timely Root Cause Analyses (RCAs) that, in turn, help further the development of key performance indicators (KPIs).

Mean Time To Root Cause Analysis in practice

Last week we covered Who, What, and Where for cloud companies that “write it down” to pursue goals for The Perfect Team. This issue will get to one of the two remaining questions, When, and next week we will explore Why.

Now, perhaps, is the time for another neologism: Mean Time To RCA. As of now, a search for “Mean Time To RCA” will likely return only this newsletter, and “Mean Time To Root Cause Analysis” will likely return Splunk too.

“Mean Time To RCA” can be viewed through several lenses within a learning-focused postmortem culture. Vendors of tooling used by SRE and incident management practitioners have a variety of perspectives on the fastest or most complete approach to get to RCA (Ishikawa diagrams, Kaizen methods, Cause Maps, Postmortem Templates, etc.), yet they all tend to build on some other Mean Time To X as a foundation. That said, marketing teams for tooling vendors may look for a way to, at best, differentiate or, at worst, obfuscate with a thesaurus approach to naming conventions.

  • If X = R = Respond, Repair, Recovery, Resolve, or Resolution
  • If X = I = Identify, Isolate, or Insights
  • If X = F = Failure, Fix, Fidelity, or Facilitate
  • If X = A = Acknowledge, Activity, or Action
  • If X = D = Determine, Detect, or Diagnose
  • If X = V = Verify or Validate
  • If X = T = Triage or Telemetry
  • If X = C = Confirm, Clarity, or Closure
  • If X = RR… 🤣🤣🤣🤣
  • and so on
  • but it ALL adds up to the time it takes to get to RCA

So, one may wonder if MTTAA is the Mean Time To Another Acronym.🤔

Effectively, Mean Time To RCA (for this series) refers to the time it takes to produce actionable insights from a root cause analysis. The lessons learned will inform, refine, or result in creating KPIs or Objectives and Key Results (OKRs) for the organization as part of a commitment to conspicuous and continuous improvement.
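As a minimal sketch of the arithmetic, assume each incident record carries the start of the outage and the publication time of its RCA. The `Incident` type, its field names, the choice to start the clock at the outage start, and the sample timestamps are all hypothetical illustrations rather than anything a provider publishes in this form.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    outage_start: datetime
    rca_published: datetime


def time_to_rca(incident: Incident) -> timedelta:
    """Elapsed time from the start of the outage to the published RCA."""
    return incident.rca_published - incident.outage_start


def mean_time_to_rca(incidents: list[Incident]) -> timedelta:
    """Arithmetic mean of the per-incident time to RCA."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((time_to_rca(i) for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data, not real incidents.
sample = [
    Incident(datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 4, 9, 0)),    # 3 days
    Incident(datetime(2021, 2, 1, 12, 0), datetime(2021, 2, 6, 12, 0)),  # 5 days
]
print(mean_time_to_rca(sample))  # 4 days, 0:00:00
```

Starting the clock at the end of the outage instead would be an equally defensible choice; what matters is applying one convention consistently.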

We know there is an increasingly personalized approach to DevCommsOps among hyperscale public cloud service providers. So, we need to understand the impact on Mean Time To RCA from both general public DevCommsOps and the effect from personalized approaches.

To provide examples, let’s examine where Mean Time To RCA is found within the hyperscale public cloud service providers today using our previous searches for “Root Cause Analyses (RCAs) / Incidents.” Once again, the list is in no particular order or weighting other than shorter names to longer names.

IBM Cloud Mean Time to RCA examples:

  • ~5 days for an outage duration of ~3 days
  • ~10 days for an outage duration of ~12 hours
  • ~10 days for an outage duration of ~9 hours
  • ~10 days for an outage duration of ~6 hours
  • ~2 days for an outage duration of ~3 hours
  • ~3 days for an outage duration of ~2 hours
  • And so on
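
To put a single number on the IBM Cloud list above, here is a rough sketch that averages the six approximate figures exactly as listed. The inputs are rounded estimates and the list is not exhaustive, so the result is only indicative.

```python
from statistics import mean, median

# (approximate days to RCA, approximate outage duration in hours),
# copied from the six IBM Cloud examples listed above.
ibm_examples = [
    (5, 72),
    (10, 12),
    (10, 9),
    (10, 6),
    (2, 3),
    (3, 2),
]

days_to_rca = [days for days, _ in ibm_examples]
print(round(mean(days_to_rca), 1))  # 6.7
print(median(days_to_rca))          # 7.5
```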

Alibaba Cloud Mean Time to RCA examples:

  • Unable to find any notices that include outage duration
  • Unable to find any links from news coverage of outages
  • And so on?

Microsoft Azure Mean Time to RCA examples:

  • RCA (detailed) can be made available upon request
  • Unable to find any notices with an actual publication date
  • RCA publishing is organized by the start date of an outage
  • Several RCA reference outages lasting to the following day
  • Otherwise, ~1 day for an outage duration of any length (unlikely?)
  • And so on?

Amazon Web Services Mean Time to RCA examples:

  • ~9 days for the April 21, 2011 “disruption” and no duration calculated
  • ~5 days for the July 2, 2012 “event” and no duration calculated
  • ~5 days for the October 22, 2012 “event” based on Twitter update
  • ~5 days for the December 24, 2012 “event” based on Twitter update
  • ~3 days for the December 17, 2012 “event”
  • ~5 days for the June 13, 2014 “disruption” based on Twitter update
  • The August 7, 2014 message URI seems to be recycled from 2011 🤷‍♂️
  • ~3 days for the November 25, 2020 “event”
  • And so on

Google Cloud Platform Mean Time to RCA examples:

  • ~9 days for the October 31, 2019 “incident” duration of ~3 days
  • ~14 days for the May 20, 2021 “incident” duration of ~1 hour
  • And so on
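
One way to compare such examples is to express the time to RCA as a multiple of the outage duration. The sketch below uses only the two approximate Google Cloud Platform figures listed above; the helper name `rca_to_outage_ratio` is made up for illustration.

```python
# (incident date, approximate days to RCA, approximate outage duration in hours),
# copied from the two Google Cloud Platform examples listed above.
gcp_examples = [
    ("October 31, 2019", 9, 3 * 24),
    ("May 20, 2021", 14, 1),
]


def rca_to_outage_ratio(days_to_rca: float, outage_hours: float) -> float:
    """Time to RCA divided by outage duration, both converted to hours."""
    return (days_to_rca * 24) / outage_hours


ratios = {label: rca_to_outage_ratio(days, hours) for label, days, hours in gcp_examples}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.0f}x")
# October 31, 2019: 3x
# May 20, 2021: 336x
```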

Oracle Cloud Infrastructure Mean Time to RCA examples:

Notes:

In summary, there are stark variations amongst the hyperscalers in expressing Mean Time To RCA. Further, it is reasonable to expect the market will drive demand for standards that normalize the variations.

At the same time, DevCommsOps mixes public and personalized views that are unique to the customer experience. Further, the drive for personalization will result in a Mean Time To RCA for each customer that is informed by their unique dependency mapping. The Azure and Oracle Cloud approaches will appeal to particular Enterprise customers.

As a reminder, we have established definitions for status dashboards, Engineering SLO, DevCommsOps, and Mean Time To RCA. We have a baseline that is ready to compare general public dependencies and customer personalized views of the underlying dependencies among hyperscale public cloud service providers.

Our last issue in the series will look at the increasing importance of dependency mapping across hyperscale public cloud service providers. Finally, we will consider business value engineering and customer journeys.

Stay tuned!

Disclosure

I am linking to my disclosure.

