Don't Worry 'Bout SRE
by Jay Cuthrell
Music: Frank Sinatra - “Don’t Worry ‘Bout Me” (1966)
Getting Informed
This week our topic comes from Substack Chat. 🤔🙏🤓
Is it realistic to worry if systems like Twitter will go down without SRE support?
What kinds of cloud support requires human intervention and what is just automatic?
First, let’s start with a quick definition for Site Reliability Engineering (SRE). “SRE is what you get when you treat operations as if it’s a software problem” per Google.
Next, I’ll explore both questions posed in this topic as they relate to SRE and the junction of human intervention and automation. Strap in tight.
Why not call it a day, the sensible way 🎶
Is it realistic to worry if systems like Twitter will go down without SRE support?
It’s reasonable to take precautions with any feature of any platform, including Twitter, on the assumption that an extended reduction in SRE staffing, one that prevents timely response to incidents affecting feature uptime, can become a longer-term condition.
I make the distinction about the feature of any platform for a reason. Specifically, if you recall from the SRE Engagement Model[1], there are service lifecycle considerations.
Some early Twitter users from 2007-2012 might recall the early scaling challenges and the fail whale. Now, a decade later, the fail whale is a far rarer sighting[2] that has been replaced with more subtle degradation of feature availability to manage user experiences.
From a completely speculative point of view, SRE shifting to focus on the error budgets[3] of the service lifecycle for new features could come at the expense of other existing features. Or, for short, what was deemed important to the existing user (you?) may become far less important to any company that shifts focus to the attraction of the new user (not you?) associated with top line revenue growth.
Recent newsworthy outages for Twitter in the range of 45 minutes[4] are useful to compare to both the duration (hours) and frequency of outages in the early years. As such, the massive outages that were the subject of prior pontification and speculation[5] are less likely than incremental shifts in SRE priorities.
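To put a 45 minute outage in error budget terms, here is a minimal sketch. The SLO targets and the 30 day window are assumptions chosen for illustration, not figures published by Twitter, and the function names are hypothetical.

```python
# Illustrative only: the SLO targets below are assumed, not Twitter's.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Allowed downtime in the window for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_minutes


def budget_consumed(outage_minutes: float, slo: float) -> float:
    """Fraction of the window's error budget used up by a single outage."""
    return outage_minutes / error_budget_minutes(slo)


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(
            f"SLO {slo:.2%}: budget {error_budget_minutes(slo):.1f} min, "
            f"45 min outage uses {budget_consumed(45, slo):.0%} of it"
        )
```

Under these assumptions, a 99.9% monthly target leaves roughly 43 minutes of budget, so a single 45 minute outage would spend all of it and then some. A looser 99% target would barely notice.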
For what it’s worth, the last tweet from “TwitterSRE” was over a year ago and “TwitterEng(ineering)” was over a month ago. Clearly, there are other priorities outside of tweets from the SRE and Engineering teams left at Twitter now.
[Embedded tweet with a meetup.com link card, 12:03 AM ∙ Nov 11, 2021]
Twitter Engineering @TwitterEng
What is your favorite metasyntactic variable?
As for me, I was a user of Twitter for over a decade and traded in Twitter shares when they were a public company. I have no insider information, know exactly nobody at Twitter now, and made my personal decision about gradually diminishing my Twitter use well over five years ago.
That said…
For any company that has built and staffed its own SRE team or outsources its SRE team, there is the risk of staff turnover and of running short of the operating capital needed to meet payroll and pay the bills. Indeed, a sudden en masse SRE resignation, compounded by a decline in revenue that affects paying the SRE as a Service provider used to outsource SRE roles to a third party… or the inability to find and reliably pay any of the colocation bills, cloud bills, domain name registration renewals, etc… would be the proverbial nail[6] for Internet legends, both those permanently offline and those presently still online.
Look out for yourself should be the rule 🎶
What kinds of cloud support requires human intervention and what is just automatic?
Cloud support typically requires human intervention once a case (ticket) is successfully opened after satisfying the required disposition coding or mandatory triage categories. This assumes the cloud support website or portal itself is functional.
Google Cloud Customer Care Portal for example…
The requirement for humans varies. More human intervention is required when automation adoption based upon quantitative measurements is lower or deferred by favoring qualitative observation, individual decisions, and bespoke actions.
To unpack that a bit more, consider that qualitative observations are subjective. Disposition is the determination of when an issue is raised (a ticket!) that a human might be responsible for bringing to resolution.
[ sarcasm ] No ticket? No problem! Mission accomplished! [ / sarcasm ]
On the other hand, getting your now benign end user activity caught up in the previously determined suspicious end user activity window for a crudely tuned spatial or temporal parameter will subjectively be a pain in the rear end. So anyone who has felt the arbitrary hand of automation gone awry will nod their head at this cloud condition as they attempt to communicate with their cloud service provider, managed service provider, or colocation provider to begin manual attestation of identity, resolve the issue, and get back to work again.
Now imagine that last paragraph in the wake of sustained or seemingly stochastic key staff turnover.
Now imagine new machines, new keys[7], new networks, and new process drift during the timespan associated with all of these items.
Now imagine previously moving exabytes of data[8] around to reflect new company goals.
Now imagine any or all of these decisions needing to be altered or reversed.
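To make the "crudely tuned temporal parameter" scenario above a bit more concrete, here is a rough sketch of a sliding-window check. Everything in it is hypothetical: the class name, the threshold of 10 actions, and the 60 second window are assumptions for illustration and do not describe any real provider's abuse detection.

```python
from collections import deque


class SlidingWindowFlagger:
    """Hypothetical check: flag when more than max_events occur within window_seconds."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent actions, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one action; return True if the account is flagged right now."""
        self._events.append(timestamp)
        # Drop actions that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_events


if __name__ == "__main__":
    # Assumed tuning: more than 10 actions in 60 seconds looks "suspicious".
    flagger = SlidingWindowFlagger(max_events=10, window_seconds=60)
    # A benign bulk upload: 12 files, one every 3 seconds.
    flags = [flagger.record(t) for t in range(0, 36, 3)]
    print("flagged during upload:", any(flags))
```

With this tuning, the perfectly ordinary bulk upload trips the flag on its eleventh file. Whether anyone revisits that threshold after the people who picked it have left is exactly the human part of the problem.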
As background, it’s useful to call back to a prior topic from June 2022: POSSE is my goal to AOYP and RYO. Or, for short, if your content lives on a platform run by someone else then you should make regular backups of what you can.
The reason I share this advice isn’t because I am rooting for the demise of any service in particular like Twitter. It’s just realistic because human beings are also notoriously unreliable things[9] and they are likely to change careers if not their minds about what, subjectively, is important over time.
Lastly, it is important to remember that burnout is real. So, big #hugops to any SRE teams that stay, join in the future, or find themselves running and maintaining legacy scripts that arrived in their digital laps with less than precise usage guides that are now lacking in deep dependency mapping.
So, what will be the next online service feature that gets deprecated or goes offline?
Until then… Place your bets!
Be like Kermit y’all
Disclosure
I am linking to my disclosure.
[1] Read: Google - Site Reliability Engineering
[2] Read: How Twitter Slayed the Fail Whale
[3] Read: How maintenance windows affect your error budget
[4] Read: Twitter experiences longest global outage in years
[5] Read: Here’s how a Twitter engineer says it will break in the coming weeks
[6] Read: For Want of a Nail
[7] Read: How we rolled out security keys at Twitter
[8] Read: Scaling data access by moving an exabyte of data to Google Cloud
[9] Watch: Machines of Loving Grace: Butterfly Wings