Preserve Operational Excellence; Prevent Operational Drift

Complacency in cybersecurity allows for a “normalization of deviance.” Operational excellence asks us to build habits. Normalization of deviance is what happens when those habits decay without anyone noticing, or when the decay is simply accepted because surfacing it would be inconvenient.

Space shuttle Atlantis lifting off on its STS-27 mission on Dec. 2, 1988. Credit: NASA

Normalization of Deviance in Cybersecurity

In the McManShow, and in my day-to-day, I focus on “operational excellence”, the discipline of healthy habits, planning, and clear communication; it is the force multiplier that turns technical expertise into positive organizational impact. 

Why is operational excellence so critical in a field like cybersecurity?

Cybersecurity risk management can never be viewed as a one-time project. You cannot “set it and forget it.” As the environment changes (new AI tool, anyone?) and threat actor tactics evolve, so too must an organization's detection schemes and operational capabilities.

The opposite of operational excellence is complacency, or “operational drift.” It is in an environment of operational drift that the early or ongoing signs of an incident are often missed.

Complacency in cybersecurity allows for a “normalization of deviance.” This term is largely credited to sociologist Diane Vaughan's post-mortem of the 1986 Challenger disaster. She used it to describe the slow, almost invisible process by which organizations come to accept warning signs that should have stopped them cold. NASA's own Safety & Mission Assurance bulletin defines it bluntly: "The gradual process through which unacceptable practice or standards become acceptable." The O-rings on the solid rocket boosters were eroding on flight after flight. Each successful launch became evidence that the erosion was tolerable, until, of course, it wasn't.

Threat actors thrive on tolerated erosion. Investigation of any cyber or privacy incident reveals breadcrumbs that might have allowed the event to be prevented or minimized: the lack of follow-up to DLP alerts months after a new tool has been implemented, for example. Every program I've assessed, advised, or run has carried some version of: "We know that's not great, but it's been like that for a while and nothing has happened." That sentence should make any of us nervous. It's the cyber equivalent of an eroded O-ring, and it shows up in two places more than anywhere else: in how we manage risk, and in how we run detection and response.

THE CYBER VERSION OF AN ERODED O-RING

Operational excellence asks us to build habits. Normalization of deviance is what happens when those habits decay without anyone noticing, or when the decay is simply accepted because surfacing it would be inconvenient. Vaughan's point was never that the engineers at NASA and its contractors were reckless. They were brilliant and conscientious people who had simply lived with a problem long enough that it stopped looking like a problem.

If you have spent any time in a security operations center, on a GRC team, or as an outside advisor, you have seen this dynamic. Risks sit idle on the risk register with no updates; more and more temporary exceptions for external media or MFA disablement become permanent; no new detection rules have been implemented this year. None of these are catastrophic zero-days on their own. That is precisely the danger.

RISK MANAGEMENT: EXCEPTIONS THAT FORGET THEY WERE EXCEPTIONS

Cybersecurity risk management is, at its core, a discipline of trade-offs. We document a control, we identify the threat it mitigates, and when we can't fully implement it, we accept a piece of residual risk on behalf of the business. Acceptance is where normalization of deviance creeps in.

The mechanics are familiar to anyone who has run a risk program:

The temporary exception that becomes permanent: A business unit needs to spin up a workload without MFA for a 30-day pilot. The pilot succeeds, the workload moves to production, and the exception simply… stays. By the time anyone revisits it, the original requestor has changed roles and the compensating controls were never implemented.

Residual risk drift: The risk you accepted in 2023 was scored against a 2023 environment. The threat landscape has moved, the asset has accumulated more sensitive data, and the integrations have multiplied. The acceptance is the same; the underlying risk is not.

The "we'll fix it next quarter" loop, essentially, continuing technical debt. A finding gets deferred once for legitimate reasons. The deferral becomes the precedent. By the fourth quarter in a row, no one is asking why anymore.

Each of these looks reasonable in isolation. Stacked on top of each other across an enterprise, they are how programs end up materially out of step with their own stated risk appetite, and are ripe for threat actor exploitation. The lesson from Vaughan is not that any single deviation is fatal; it is that the act of repeatedly tolerating small deviations rewires what the organization considers normal.

Practical countermeasures here are not glamorous, and that is why they get skipped. Exceptions need expiration dates that actually trigger something. Risk acceptances need to be re-signed on a defined cadence, by a named owner, against the current environment rather than the one that existed when the acceptance was first granted. The risk register needs to be a living document that someone is paid to argue with, not a SharePoint artifact that gets dusted off for the audit.
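To make that concrete, here is a minimal sketch of the kind of check that keeps exceptions honest. It assumes the exception register can be exported as a CSV with hypothetical columns (exception_id, owner, expires_on, last_reviewed_on); adapt the field names and the review cadence to whatever your GRC tooling actually produces.

```python
"""Flag expired exceptions and stale risk acceptances.

A minimal sketch, not a product: the CSV columns below are assumptions
about how a register might be exported. Map them to your own tooling.
"""
import csv
import sys
from datetime import date, datetime, timedelta

REVIEW_CADENCE = timedelta(days=365)  # assumption: re-sign acceptances at least annually

def parse(d: str) -> date:
    return datetime.strptime(d, "%Y-%m-%d").date()

def audit(path: str) -> None:
    today = date.today()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if parse(row["expires_on"]) < today:
                print(f"EXPIRED  {row['exception_id']} (owner: {row['owner']}) "
                      f"lapsed {row['expires_on']}")
            elif today - parse(row["last_reviewed_on"]) > REVIEW_CADENCE:
                print(f"STALE    {row['exception_id']} (owner: {row['owner']}) "
                      f"last re-signed {row['last_reviewed_on']}")

if __name__ == "__main__":
    audit(sys.argv[1])
```

The point is not the script; it is that expiration and re-review become something a machine nags you about on a schedule, rather than something a human has to remember during audit season.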

DETECTION & RESPONSE: RULES THAT QUIETLY STOP WORKING

If risk management is where deviance gets formalized on paper, detection and response is where it gets baked into operational muscle memory. The D&R version of Vaughan's O-ring is the detection rule that nobody trusts but nobody removes.

The following are patterns that you may recognize:

Rule decay: A detection written against a 2021 endpoint footprint does not necessarily fire correctly against a 2026 environment full of new SaaS applications and non-human identities. Logs change schemas. Field names get renamed. The rule still "exists" in the data lake, but not in an effective state. It catches its original test use cases, and that is about it. 

Suppression that outlives its justification: Analysts mute a noisy rule on a Friday afternoon to get through the queue. The mute persists. By the time anyone audits the suppression list, no one remembers who created the suppression or whether the underlying noise was ever resolved.

Tuning to the false positive rather than the threat: A rule throws too many alerts, so the threshold is raised. Then raised again. Eventually the rule only fires on the most obvious version of the behavior, which a real adversary will avoid.

The runbook that doesn't match the tooling anymore: The IR playbook references a console that was decommissioned, a vendor that was replaced, or a Slack channel that no longer has the right people in it. Until an incident hits, no one notices.

None of these failures look like negligence the moment they happen. Each one is a small, defensible operational decision, and is easy to deprioritize in the face of major product pushes. That is what makes them so insidious, and why detection engineering as a discipline has rightly emphasized treating detections as code, with version control, testing, and a defined lifecycle, rather than as a set-and-forget configuration.
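As one illustration of what "detections as code" can look like in practice, here is a minimal sketch of a regression test for a single rule. The rule logic, field names, and known-bad sample are hypothetical stand-ins; in a real pipeline the rule would live in your SIEM's rule language (Sigma, KQL, SPL) and the sample would come from a replayed attack or purple-team exercise.

```python
"""Detection-as-code regression test: a minimal sketch.

Everything here (the rule, the event fields, the schema) is a
hypothetical stand-in used to show the testing pattern, not a real
detection.
"""

KNOWN_BAD_EVENT = {
    "process_name": "powershell.exe",
    "command_line": "powershell -enc SQBFAFgA...",  # truncated encoded payload
    "parent_process": "winword.exe",
}

CURRENT_SCHEMA_FIELDS = {"process_name", "command_line", "parent_process", "user"}

def rule_encoded_powershell(event: dict) -> bool:
    """Fires on encoded PowerShell spawned by an Office process."""
    return (
        event.get("process_name", "").lower() == "powershell.exe"
        and "-enc" in event.get("command_line", "").lower()
        and event.get("parent_process", "").lower() in {"winword.exe", "excel.exe"}
    )

def test_rule_still_fires_on_known_bad():
    assert rule_encoded_powershell(KNOWN_BAD_EVENT)

def test_rule_fields_still_exist_in_schema():
    # Catches silent rule decay when the log pipeline renames or drops fields.
    assert {"process_name", "command_line", "parent_process"} <= CURRENT_SCHEMA_FIELDS
```

Run under pytest in CI, a test like this turns silent rule decay (a renamed field, a schema change) into a failing build instead of a gap no one notices until an incident.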

AI: DETECTION DRIFT AT MACHINE SPEED

Everything I just described accelerates when machine learning is involved. The newer generation of detection capabilities — UEBA, behavioral analytics, ML-tuned anomaly detection — carries a flavor of normalization of deviance that is harder to spot because the "tuning" is happening inside a model rather than in a rule you can read.

Two failure modes are worth naming explicitly:

Concept drift: The model was trained on a baseline of "normal" behavior in your environment. As your environment changes (new SaaS adoption, M&A, agentic AI tools issuing API calls on behalf of users), the baseline moves. If the model is not retrained or evaluated against ground truth, it will quietly start treating genuinely anomalous behavior as normal.

Feedback loop poisoning: Many ML detection systems learn from analyst dispositions. If the SOC is consistently dispositioning a class of alerts as benign because they're noisy, the model will learn to stop surfacing them and may eventually miss the needle in the haystack it was intended to detect. The deviance has been encoded into the detection logic itself.

The same principle Vaughan identified at NASA applies here, just with a faster clock. The system is learning what its operators are willing to tolerate, and over time it will optimize for that tolerance rather than for the underlying mission. 
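If you want a concrete starting point for spotting concept drift, here is a minimal sketch that compares the distribution of a model's anomaly scores between a baseline window and a recent window using the Population Stability Index (PSI). The thresholds and the way you load historical scores are assumptions; wire it to however your platform actually exposes them.

```python
"""Concept-drift spot check: a minimal sketch using PSI.

Thresholds and data loading are assumptions; the point is to compare
what the model scored "normal" then versus what it scores now.
"""
import math
from typing import Sequence

def psi(baseline: Sequence[float], recent: Sequence[float], bins: int = 10) -> float:
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0] = float("-inf")   # catch recent scores below the baseline range
    edges[-1] = float("inf")   # and above it

    def share(scores: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for s in scores:
            for i in range(bins):
                if edges[i] <= s < edges[i + 1]:
                    counts[i] += 1
                    break
        total = max(len(scores), 1)
        return [max(c / total, 1e-6) for c in counts]  # avoid log(0)

    b, r = share(baseline), share(recent)
    return sum((ri - bi) * math.log(ri / bi) for bi, ri in zip(b, r))

if __name__ == "__main__":
    # Common rule of thumb (an assumption, tune for your environment):
    # PSI < 0.1 stable, 0.1-0.25 investigate, > 0.25 re-baseline or retrain.
    baseline_scores = [0.10, 0.20, 0.15, 0.30, 0.25, 0.20, 0.10, 0.35, 0.22, 0.18]
    recent_scores = [0.50, 0.60, 0.55, 0.40, 0.65, 0.50, 0.45, 0.60, 0.52, 0.58]
    print(f"PSI: {psi(baseline_scores, recent_scores):.3f}")
```

A rising PSI does not prove the model is wrong, but it is exactly the kind of early warning that deserves a review rather than a shrug.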

HOW TO SPOT NORMALIZATION OF DEVIANCE IN YOUR PROGRAM

I have put together a short, honest checklist for any cybersecurity leader to validate their program's operational excellence:

1. Pull your policy exception list. How many exceptions are past their original expiration date? How many were signed by someone no longer in the role? How many were scored against an environment that no longer exists?

2. Audit your suppression and mute lists. For each suppression, can someone on the team explain who created it, when, and why? If not, treat the suppression as suspect until proven otherwise (a minimal audit sketch follows this checklist).

3. Sample your detection rules. Pick ten rules at random. When did each one last fire? When was each one last reviewed? When was each one last tested against a known-bad sample?

4. Test your IR runbooks against current tooling. Do the consoles, contacts, and channels referenced still exist? When was the last tabletop that actually exercised them end-to-end?

5. Reconcile your control framework against reality. For each control, who owns it today, and when did they last confirm it's operating as designed?

6. For ML-driven detections, ask for a drift report. When was each model last re-baselined or evaluated against ground truth?

7. Listen for the language. "That's always been like that." "We've never had a problem with it." "The business won't let us change it." These are the cybersecurity equivalent of NASA's O-ring data, and they deserve the same scrutiny.
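For item 2, here is a minimal sketch of what that suppression audit can look like. It assumes the suppressions can be exported as JSON with hypothetical fields (rule_id, created_by, created_on, reason); most SIEM and SOAR platforms expose something similar via API or export.

```python
"""Suppression-list hygiene check: a minimal sketch for checklist item 2.

The JSON field names and the 90-day threshold are assumptions; adjust
both to match your platform's export and your own review cadence.
"""
import json
import sys
from datetime import date, datetime, timedelta

MAX_AGE = timedelta(days=90)  # assumption: review mutes at least quarterly

def audit_suppressions(path: str) -> None:
    today = date.today()
    with open(path) as f:
        suppressions = json.load(f)
    for s in suppressions:
        age = today - datetime.strptime(s["created_on"], "%Y-%m-%d").date()
        problems = []
        if not s.get("created_by"):
            problems.append("no owner")
        if not s.get("reason"):
            problems.append("no documented reason")
        if age > MAX_AGE:
            problems.append(f"{age.days} days old")
        if problems:
            print(f"SUSPECT  {s['rule_id']}: {', '.join(problems)}")

if __name__ == "__main__":
    audit_suppressions(sys.argv[1])
```

Anything the script flags is not automatically wrong; it is simply a mute that can no longer explain itself, which is deviance waiting to be normalized.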

CLOSING THE LOOP ON OPERATIONAL EXCELLENCE

Operational excellence isn't the absence of compromise. Every program lives with risk it can't fully eliminate. There is no “perfect” in cybersecurity. Preventing normalization of deviance, however, is a crucial component of that ongoing pursuit of operational excellence.

Have any thoughts or feedback? As always, reach out!


REFERENCES

- NASA Safety & Mission Assurance, "Normalization of Deviance," safety message, November 2014. https://sma.nasa.gov/docs/default-source/safety-messages/safetymessage-normalizationofdeviance-2014-11-03b.pdf

- Diane Vaughan, "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA," University of Chicago Press, 1996. https://books.google.com/books?id=erYjCwAAQBAJ&pg=PT30#v=onepage&f=false