3:37 AM: The phone rings, dragging you out of sleep and into a living nightmare. On-call has begun.
Most of us Network Engineers have been part of on-call rotations throughout our careers, and depending on your role, it may not be so bad. But more often than not, on-call is a dreaded shadow that hangs over you, a cycle you can’t seem to break.
I served as part of a six-week on-call rotation for more than 5 years on an operations team. In that time, I felt the toll it takes if you don’t prepare yourself mentally. My on-call weeks were filled with stress, sleep deprivation, and ultimately feelings of helplessness.
But the worst challenges came from the pressures I created for myself, inventions of my own imagination that I constantly needed to be reminded were unfounded.
The Four Fears
Much of the stress from on-call fell into one of these four fears.
1. Fear of the unknown.
The unknown nature of when a call might come kept me on edge.
After being interrupted again and again while trying to maintain some aspect of social life, I started to isolate myself to spare the pain of canceling plans and missing out.
On the rare nights and days when I wasn’t called, I might as well have been. On-call took over my schedule either way as I shut out the world. This fear also manifested itself in my sleep, making me spend those precious nighttime hours lying awake, worrying about being called. By the time the call came, I might have only just fallen asleep. Self-torment at its finest.
2. Fear of failing to resolve the issue.
The fear of failure is common, but this was the most ridiculous fear that haunted me. As the highest-level engineers in the organization, our team was the final stop for all technical problems. Failure, quite literally, was not an option. We did whatever was necessary to restore the network.
I’d worry about being faced with a problem I couldn’t overcome, but in all the outages I worked, this was never the case. That doesn’t mean I always knew what to do, not by a long shot, but there wasn’t a single issue that went unresolved. The question was just how long and how many attempts it would take to get there.
3. Fear of letting my team down.
This one was the hardest to deal with, and was linked to “fear of failure”: I felt responsible for protecting the team from outages.
I didn’t want to engage additional people or reach out for peer reviews when I was up against a wall, especially in the wee hours of the night. I wanted to spare my team any involvement. It’s absurd because the team always had my back. And if the situation were reversed, I’d always be there for them.
Despite knowing this, I struggled to ask for help.
4. Fear that on-call is never-ending.
It doesn’t take much sleep deprivation to compromise your decision-making capabilities. And when you’ve barely slept, and are only a few days into an on-call week, there’s a highly irrational fear that the experience will last forever.
Setting a timer to count down until on-call ended was necessary to see time pass even though it felt like it was standing still. In these moments of mania, I’d tell myself, “a year from now, this won’t matter.”
None of these fears were rational, and yet, despite knowing this, I couldn’t stop these feelings early on.
I was able to survive on-call by reverse-engineering the stressors and creating a framework applicable to any situation, especially when it felt like I was out of options.
The New OSI Model
We all know the first step of troubleshooting is to work the OSI model layer by layer.
Introducing the “Oh Sh!t Inspiration (OSI) Model”
These are the decisions you would make if you were firing on all cylinders but can overlook when your brain is in a state of fatigue and stress. By making your own “OSI” model, you’ll know what to do when you are three days into on-call with only a few hours of sleep and are working an outage on 16-year-old proprietary equipment no one knew existed.
The following list was first drafted during one of my particularly brutal on-call nights in November 2013. Feel free to use it as a jumping-off point for your own OSI model. Despite these common and somewhat simplistic themes, I’ve returned to this list time and time again when overwhelmed.
1. Breathe
This isn’t some new-age magic; it’s hardcore science. When the body is under stress, we take shallow breaths and deprive our muscles and brain of oxygen. There’s a reason panic attacks are accompanied by shortness of breath. Taking a few moments to breathe will help clear the mind and control anxiety.
2. Trust, but Verify
There’s a snarky saying, “Your lack of planning is not my emergency,” which I’ve found especially accurate at a certain organizational scale. Just because someone wants to make something your problem doesn’t mean you have to take on that responsibility.
Our team was the escalation point for the NOC, and it was oh-so-convenient to hit the “easy button” by escalating to the on-call. Often this was due to poor management, or for political reasons, and a technical escalation was not required.
Another common type of unwarranted escalation involved non-urgent issues, problems that could be solved during standard hours. Because the NOC operates 24/7, running multiple shifts, they had no incentive to hold those escalations until an appropriate time.
Ideally, you’re called on outages and not just noise, but understanding the why of who’s calling will preserve some level of sanity and help you determine the appropriate response.
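If it helps to see the “why of who’s calling” written down, here is a minimal sketch in Python of the kind of triage described above. The Escalation fields, the triage function, and the three response labels are all hypothetical, invented for illustration; your own rules will depend on your organization.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical summary of an incoming call (all fields are assumptions)."""
    service_down: bool       # is there a confirmed outage right now?
    needs_engineering: bool  # does it actually require a technical fix?
    can_wait: bool           # could it be handled during standard hours?


def triage(call: Escalation) -> str:
    """Suggest a response for an escalation; a sketch, not a policy."""
    # A real outage that needs engineering work is what on-call is for.
    if call.service_down and call.needs_engineering:
        return "respond now"
    # No technical work required: likely a management or political escalation.
    if not call.needs_engineering:
        return "push back"
    # Technical, but not urgent: it can be solved during standard hours.
    if call.can_wait:
        return "defer to business hours"
    # Technical and urgent, even without a confirmed outage.
    return "respond now"


print(triage(Escalation(service_down=False, needs_engineering=True, can_wait=True)))
```

The point is not the code itself but having the questions decided ahead of time, so you aren’t inventing them at 3:37 AM.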
3. Net-fix and Chill
Depending on your organization’s business, it may feel like the weight of the world has fallen on your shoulders.
I’ve worked on rolling broadcast storm outages on layer 2 networks supporting emergency services. It can feel like a life or death situation because sometimes it is.
When the pressure is on, and you’re out of ideas, one of the greatest hacks around is to put yourself in someone else’s shoes. Think of the best person to handle this situation, someone you know would resolve it easily, and do what they would do. It sounds silly, but it works surprisingly well.
4. Call Backup
There are occasions when you’ve exhausted every option and don’t know where to go next. Don’t be afraid to reach out to vendor support, your team, and even your management. Obviously, Google is your friend, but you also have a whole network (pun intended) of people who support you.
I keep a list of co-workers and their specializations so I can call on their strengths when needed.
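A list like this can live in a notebook or a wiki, but as a rough sketch, here is one way it might look in Python. The names and specializations in ROSTER are placeholders, and build_index and who_to_call are hypothetical helpers, not part of any real tool.

```python
from collections import defaultdict

# Hypothetical roster: placeholder names and specializations.
ROSTER = {
    "alice": {"bgp", "mpls"},
    "bob": {"layer 2", "spanning tree"},
    "carol": {"firewalls", "bgp"},
}


def build_index(roster):
    """Invert the roster so each specialization maps to the people who know it."""
    index = defaultdict(list)
    for person, skills in roster.items():
        for skill in skills:
            index[skill.lower()].append(person)
    # Sort names so lookups always come back in a stable order.
    return {skill: sorted(people) for skill, people in index.items()}


def who_to_call(index, topic):
    """Return the co-workers listed for a topic, or an empty list if nobody is."""
    return index.get(topic.strip().lower(), [])


INDEX = build_index(ROSTER)
print(who_to_call(INDEX, "BGP"))  # ['alice', 'carol']
```

Looking people up by topic rather than by name is the useful part: in the middle of an outage you know what is broken long before you remember who is good at it.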
5. Debug it
The issues that nag or scare us are useful signposts for where we need to spend more time studying. Every hard escalation is worth using as a learning opportunity for topics that require further education and training. The most painful outages were also the moments where I grew the most as an engineer.
It hurts to learn your weaknesses, but there’s nothing more valuable.
6. Know the Cost
On-calls shouldn’t be consistently brutal, and if they are, it’s likely there’s a systemic issue.
If you’ve worked to fill in your knowledge gaps and pushed for organizational change without any improvement, it’s time to reevaluate. Ask yourself regularly if this is still the right fit for you. Ideally, the answer is always yes, but if it’s not, there’s no need to stay in abusive positions with uncaring management.
It’s probably time to start looking for another job.
Over the past five years, I’ve worked hundreds of hours on-call, and once I had a framework to fall back on, it became significantly easier. While it was never effortless, it was manageable. My hope is that by creating your own OSI model, you can survive any on-call nightmare.
If you have a tip or strategy for getting through on-call, let us know in the comments.