The Tyranny of 'ETA?': Leading Through an Outage

It’s 2:00 PM on a Tuesday. Monitoring alerts turn red. Customer support tickets spike. The site is down.

And then, inevitable as the tide, a message pops up in the incident channel from an executive: "ETA?"

Balancing these pressures is delicate: the team is stressed, and leadership is pushing for updates. But as an engineering leader, your job is not simply to pass that stress down to the people working to resolve the crisis.

The Psychology of Panic Coding

When you demand an immediate timeline from an engineer who is deep in the logs trying to figure out why the database locked up, you aren't getting accurate information. You are simply increasing their cognitive load.

A stressed-out team tends to reach for the first fix rather than the best fix, just to relieve the pressure.

They might restart a service without capturing the heap dump needed to understand the root cause.
They might manually patch a server instead of fixing the configuration script, ensuring the bug returns on the next deploy.
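
One way to make the first of these shortcuts harder to take is to put evidence capture in front of the restart itself. The following is a minimal Python sketch, not a real tool: `restart_with_evidence`, `capture_steps`, and `restart` are hypothetical names, and each capture step stands in for whatever a team actually collects, such as a heap dump or recent logs.

```python
from typing import Callable, Dict, List, Tuple

# (label, function that captures an artifact and returns where it was saved)
CaptureStep = Tuple[str, Callable[[], str]]


def restart_with_evidence(
    capture_steps: List[CaptureStep],
    restart: Callable[[], None],
) -> Dict[str, str]:
    """Attempt every capture step, then restart. Returns what was captured.

    A failing capture step is recorded instead of raised, so a broken
    diagnostic never blocks recovery.
    """
    evidence: Dict[str, str] = {}
    for label, capture in capture_steps:
        try:
            evidence[label] = capture()
        except Exception as exc:  # keep going: recovery matters more than one artifact
            evidence[label] = f"capture failed: {exc}"
    restart()  # reached only after every capture step has been attempted
    return evidence
```

The design choice worth noting is that the restart is never skipped. The wrapper adds seconds to recovery rather than adding a gate.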

Your goal is recovery, but your pressure is incentivizing a band-aid.

Preparation: The Antidote to Chaos

The time to manage an outage is not during the outage. It is in the months leading up to it. You avoid the boiling point by building systems that allow for safety and speed.

The Panic Button: Have easily accessible rollback and reversion mechanisms in place. If a deploy breaks the site, the fix shouldn't be "write new code"; it should be "press the undo button."
Permissionless Repair: Build trust by letting engineers act on their own judgment without approval gates. If an engineer needs approval from a VP to restart a server, your recovery will always be too slow.
Observability, Not Just Monitoring: Surface potential issues before they escalate. Monitoring tells you the site is down; well-implemented observability tells you why (e.g., "Latency spiked in the payment service after the last config change").
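
To make the "undo button" concrete, here is a minimal Python sketch of the idea behind it: keep a history of releases so that reverting means reactivating the previous one, not writing anything new. `ReleaseHistory` and the version labels are hypothetical, and a real deploy system would switch traffic or artifacts where this sketch only updates a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseHistory:
    """Tracks deployed versions so the previous one can be restored."""

    versions: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        if not self.versions:
            raise LookupError("nothing has been deployed yet")
        return self.versions[-1]

    def deploy(self, version: str) -> None:
        self.versions.append(version)

    def rollback(self) -> str:
        """Drop the current release and return the one that is now live."""
        if len(self.versions) < 2:
            raise LookupError("no earlier release to roll back to")
        self.versions.pop()
        return self.current


history = ReleaseHistory()
history.deploy("v41")
history.deploy("v42")  # the deploy that broke the site
restored = history.rollback()
print(restored)  # prints v41
```

The point of the sketch is that `rollback` takes no arguments and requires no decisions, which is what makes it usable under pressure.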

The Leader's Role: Shield and Support

There is a human element to reliability. You can build resilience by delegating high-stakes tasks to engineers during normal operations. Doing so establishes communication habits and makes urgent issues less uncomfortable when they do arise.

When problems do occur, your role shifts from "Manager" to "Support Staff."

Minimize Distractions: Designate an Incident Commander. Everyone else (especially executives) stays out of the dedicated troubleshooting channel.
Keep Communication Flowing: You handle the updates to the business. You tell the CEO, "We are investigating, I will update you in 30 minutes." You buy the team the silence they need to think.
Offer Help, Not Pressure: Don't ask "When will it be up?" Ask "Do you need me to get the database vendor on the phone?"
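
The update to the business is easier to keep consistent if every message commits to a time for the next one. Here is a small Python sketch of that habit; `status_update` is a hypothetical helper, the 30-minute default mirrors the example above, and the timestamp is arbitrary.

```python
from datetime import datetime, timedelta


def status_update(summary: str, now: datetime, interval_minutes: int = 30) -> str:
    """Append a committed time for the next update to a status summary."""
    next_update = now + timedelta(minutes=interval_minutes)
    return f"{summary} Next update by {next_update:%H:%M}."


# Arbitrary example timestamp: 2:05 PM
print(status_update("We are investigating.", datetime(2024, 1, 2, 14, 5)))
# prints: We are investigating. Next update by 14:35.
```

Stating the next update time up front answers "ETA?" for the communication itself, even when no one can yet answer it for the fix.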

Reliability is a culture. If you lead with trust and preparation, your team will respond with competence. If you lead with panic, they will respond with patches.