
When the Wall Is Breached — Designing Medical Systems with Safety-II

Safety-I builds walls against failure. Safety-II asks how people succeed when the walls don't hold. In disaster medicine, the difference determines whether a forced evacuation loses patients or saves them.

The 95% We Ignore

In 2014, Erik Hollnagel published a book that quietly changed how safety scientists think. His argument was simple and unsettling: the safety field had spent decades studying the small percentage of cases where things go wrong, while almost completely ignoring the vast majority of cases where things go right.

He called the traditional approach Safety-I — identify failure modes, build defenses, prevent bad outcomes. It's the backbone of every checklist, every failsafe, every "are you sure?" dialog box. And it works. Except when it doesn't.

The alternative, Safety-II, starts from a different question: instead of asking why things occasionally fail, ask why they usually succeed. The answer isn't "because we built good barriers." The answer is "because people adapt." Nurses improvise. Doctors deviate from protocol when the protocol doesn't fit. Logistics staff find workarounds that nobody planned for.

Safety-I treats this variation as a problem. Safety-II treats it as the primary source of resilience.

The Scenario That Broke Our Safety-I Thinking

We were building xGrid, a disaster medical logistics platform that runs offline on Raspberry Pi devices. Early in development, our design was thoroughly Safety-I: barcode scanning, identity verification, multi-step confirmation dialogs, role-based access controls.

Then we ran a tabletop exercise. The scenario:

Three of eight medical stations are forced to evacuate simultaneously. At the moment of evacuation, the following operations are in progress:

  • A blood transfusion with strict chain-of-custody requirements
  • An active surgery that must be interrupted and resumed elsewhere
  • Medications in transit between stations that need reallocation

Our Safety-I system handled the individual checks fine. What it couldn't handle was the situation. The system assumed stable operating conditions. The real world provided the opposite.

We needed something that didn't just prevent errors in normal conditions, but actively helped people succeed in abnormal ones.

Four Principles We Learned

1. Make Uncertainty Visible

The instinct in software design is to present clean, confident information. Numbers without qualifiers. Status indicators that are green or red, never yellow.

In a disaster scenario, this confidence is dangerous. If inventory data hasn't synced in three days, showing the last known count as if it's current leads to bad decisions. A nurse might skip a physical count because the system says there are 12 units. There might be 4.

xGrid tags every data point with freshness metadata. When sync exceeds a threshold, the interface shows a clear stale marker. This isn't a bug indicator — it's an honesty signal. It tells the operator: "This number might be wrong. Verify before acting."

The counterintuitive insight: a system that admits uncertainty is more trustworthy than one that hides it. Users learn to trust the system precisely because it tells them when not to trust the data.
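A minimal sketch of this freshness tagging. The names (InventoryReading, STALE_AFTER) and the six-hour threshold are illustrative assumptions, not xGrid's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed threshold: how long without a sync before a value is "stale".
STALE_AFTER = timedelta(hours=6)

@dataclass
class InventoryReading:
    item: str
    count: int
    synced_at: datetime  # when this value last synced across the mesh

    def is_stale(self, now: datetime) -> bool:
        return now - self.synced_at > STALE_AFTER

    def display(self, now: datetime) -> str:
        # The count is never shown without its freshness qualifier.
        if self.is_stale(now):
            age_h = (now - self.synced_at).total_seconds() / 3600
            return (f"{self.item}: {self.count} units "
                    f"[STALE: last sync {age_h:.0f}h ago, verify before acting]")
        return f"{self.item}: {self.count} units"

now = datetime.now(timezone.utc)
reading = InventoryReading("O-neg blood", 12, now - timedelta(days=3))
print(reading.display(now))
```

The key design choice is that staleness lives on the reading itself, so no screen can render the count without also deciding how to render its age.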

2. Degrade Gracefully, Not Permissively

When systems enter emergency mode, designers face a dilemma. Lock everything down and risk blocking critical operations? Or open everything up and risk errors?

Most systems choose one extreme. xGrid chooses neither.

In emergency mode, we strip non-essential inputs. Address fields, insurance numbers, detailed allergy documentation — all skippable. But three confirmation steps remain mandatory regardless of system state:

  • Patient identity verification
  • Medication name and dosage confirmation
  • Blood product cross-match verification

The design philosophy: people under pressure will naturally skip steps. That's not a bug in human behavior — it's an adaptation. The system's job is to let them skip the steps that can wait while making it impossible to skip the steps that can't.
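That split can be sketched as a validation rule. The field identifiers and the emergency flag are assumptions for illustration, not xGrid's real form schema:

```python
# Skippable fields become optional in emergency mode; the three hard gates
# stay mandatory in every mode.
MANDATORY_ALWAYS = {"patient_identity", "medication_and_dose", "blood_crossmatch"}
SKIPPABLE_IN_EMERGENCY = {"address", "insurance_number", "allergy_detail"}

def missing_fields(form: dict, emergency: bool) -> set:
    """Return the fields that still block this submission."""
    required = set(MANDATORY_ALWAYS)
    if not emergency:
        required |= SKIPPABLE_IN_EMERGENCY
    return {field for field in required if not form.get(field)}

form = {"patient_identity": "P-1043", "medication_and_dose": "TXA 1 g IV"}
# In emergency mode only the cross-match gate still blocks this form;
# in normal mode the administrative fields block it as well.
```

Note that emergency mode shrinks the required set rather than switching validation off, so "degraded" never means "permissive".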

3. Design Records for Learning, Not Liability

Every handoff in xGrid generates a record. Every medication dispense, every transfer, every inventory movement. The data is granular enough to reconstruct a complete event timeline.

But we designed the recording system for a specific purpose that most audit systems miss: learning from success, not just from failure.

Consider: after a particularly smooth evacuation, the records showed that a nurse had proactively scanned blood product data to her phone three minutes before the evacuation order came. This wasn't in any protocol. It was improvisation. And it saved fifteen minutes of re-scanning at the receiving station.

Safety-I would never discover this, because Safety-I only investigates when something goes wrong. Safety-II investigates routinely, because the causes of success are as valuable to understand as the causes of failure. That nurse's improvisation is now part of the recommended workflow.
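The record structure behind this can be sketched as an append-only event log. EventLog and its fields are hypothetical stand-ins, not the real record format; the point is granularity sufficient to replay a timeline even when nothing went wrong:

```python
import time

class EventLog:
    """Append-only log of handoffs, dispenses, and inventory movements."""

    def __init__(self):
        self._events = []

    def record(self, actor: str, action: str, **details):
        self._events.append({
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "details": details,
        })

    def timeline(self):
        # Stable sort preserves insertion order for equal timestamps.
        return sorted(self._events, key=lambda e: e["ts"])

log = EventLog()
log.record("nurse-07", "scan_blood_products", target="handheld")
log.record("station-3", "evacuation_order")
```

Replaying a timeline like this one, from a run that went well, is exactly the kind of review that surfaces a proactive pre-scan.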

4. Provide a Legal Way to Break the Rules

Normal medication dispensing: physician prescribes → pharmacist verifies → nurse confirms → dispense. Three gates. Solid Safety-I.

Disaster reality: the physician is triaging twenty patients. The pharmacist is at another station. A patient is hemorrhaging. The nurse needs hemostatic medication now.

If the system says "no prescription, no dispensing," the nurse will find a way around it — paper notes, verbal orders, direct cabinet access. The medication gets administered. The system records nothing.

xGrid offers a 24-hour break-glass override. A nurse can activate emergency authorization. The system logs the action, tags it as emergency-authorized, and flags it for review. The operation proceeds with a complete audit trail.

This isn't a safety bypass. It's a safety feature. The most dangerous thing a system can do is force people to work outside it, because then you lose all visibility. A legitimate, recorded rule-break is infinitely safer than an invisible workaround.
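A sketch of the break-glass mechanism. The class name, field names, and review flag are illustrative assumptions; only the 24-hour window comes from the description above:

```python
from datetime import datetime, timedelta, timezone

BREAK_GLASS_WINDOW = timedelta(hours=24)

class BreakGlass:
    def __init__(self):
        self.audit_log = []
        self._expires = None

    def activate(self, operator_id: str, reason: str, now: datetime):
        # The activation itself is logged and flagged for after-action review.
        self._expires = now + BREAK_GLASS_WINDOW
        self.audit_log.append({
            "event": "break_glass_activated",
            "by": operator_id,
            "reason": reason,
            "at": now.isoformat(),
            "needs_review": True,
        })

    def authorize_dispense(self, medication: str, now: datetime) -> bool:
        if self._expires is not None and now <= self._expires:
            self.audit_log.append({
                "event": "dispense",
                "medication": medication,
                "tag": "emergency-authorized",
                "at": now.isoformat(),
            })
            return True
        return False  # outside the window: the normal three-gate flow applies

bg = BreakGlass()
bg.activate("nurse-07", "hemorrhage, prescriber unreachable",
            datetime.now(timezone.utc))
```

Logging the activation itself, not just the dispense, is what turns a rule-break into reviewable data instead of an invisible workaround.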

The Evidence: Patient I

Our E2E test suite includes Patient I — the Consolidation Test. It's designed to be brutal.

Setup: 8 stations, 3 forced to evacuate simultaneously. During evacuation:

  • Active blood transfusion (chain of custody must be maintained)
  • Surgery in progress (must interrupt, document via ISBAR handoff, transport, resume)
  • Medications in inter-station transit (must reallocate to surviving stations)

Results:

Verification                                                     Result
Patient identity preserved across transfer                       Pass
Blood chain of custody maintained, cross-match at new station    Pass
Surgery resumed with complete ISBAR handoff record               Pass
Inventory reconciled across all merged stations                  Pass
34 steps executed, 34 passed, zero data loss                     Pass

This is what Safety-II looks like in practice. The scenario isn't "prevent evacuation" — you can't. The scenario is "make evacuation succeed." Every data point survived. Every handoff was recorded. Every safety gate held where it mattered, and yielded where it needed to.

Not a Replacement. A Complement.

Safety-II doesn't argue against barriers. xGrid still has barcode scanning, identity verification, dosage checks, role-based access. These Safety-I mechanisms prevent the errors that are preventable.

Safety-II handles everything else — the cases where conditions are too degraded for normal procedures, where the protocol assumes resources you don't have, where the textbook answer doesn't fit the situation in front of you.

                      Safety-I                     Safety-II
Core question         What goes wrong?             What goes right?
Design strategy       Eliminate failure modes      Enhance adaptive capacity
View of people        Risk source to constrain     Resilience source to support
Learning trigger      After incidents              During normal operations
View of variation     Threat to control            Necessary adaptation

In stable environments, Safety-I is usually sufficient. In disaster environments, it's necessary but not sufficient.

The wall will be breached. The question that matters is what happens next.


Related: The Walkaway Test — Designing Software That Outlives Its Creators