Without bending the analogy past the breaking point, compare the CS/Archegos failure to, say, how the Three Mile Island nuclear plant could fail or how the Deepwater Horizon disaster could occur:
- Paul Weiss’s finding that there was no fraud or illegality would be equivalent to a finding that there was no sabotage or intentional misconduct by the plant operators.
- Similarly, the finding that “the architecture of risk controls and processes” at CS was fundamentally sound means the design of the power plant or the oil well was intrinsically sound and the embedded safeguards well designed and capable of functioning as expected.
But the Journal’s litany of relaxed, waived, neglected, and suspended safeguards, by a series of individuals doubtless well-intentioned at every seemingly innocent and inconsequential decision point, nails our hypothesis that CS is a complex and not “merely” a complicated system.
From the engineering profession, what can we learn about complex systems and their failure modes?
My source material for what immediately follows is Richard Cook, MD’s How Complex Systems Fail, published in 2000 by the Cognitive Technologies Laboratory at the University of Chicago. I hope Dr. Cook forgives my extensive excerpt, but, in consultation with a friend who is a lifelong engineer now at the top of his field and a true pro, I concluded that Cook’s is head and shoulders the most comprehensive and yet succinct treatment of complex systems out there, albeit, at first blush, counterintuitive (see our point above about this being an exotic and unfamiliar mode of thought for us humans).
I have edited Cook’s piece extensively but have not altered any of his words. As publisher of this column, I also have taken the liberty of highlighting issues that I think are especially germane to the lawyer audience (thus: ***).
- Complex systems are intrinsically hazardous systems. ***
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature. (Lawyers think “hazardous” means flawed and defective; it does not.)
- Complex systems are heavily and successfully defended against failure
The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
- Catastrophe requires multiple failures – single point failures are not enough ***
The array of defenses works. System operations are generally successful. Put another way, there are many more failure opportunities than overt system accidents. (Fingering individual “points of failure” for blame profoundly misapprehends how complex systems function; there is an almost infinite number of possible failure points, but the system safeguards itself against any one of them causing disaster.)
- Complex systems contain changing mixtures of failures latent within them. (See above)
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident.
- Post-accident attribution to a ‘root cause’ is fundamentally wrong. ***
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident; [identifying] the ‘root cause’ of an accident is impossible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
- Hindsight biases post-accident assessments of human performance.***
Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. This means that ex post facto accident analysis of human performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors. It seems that practitioners “should have known” that the factors would “inevitably” lead to an accident.
- Human operators have dual roles: as producers & as defenders against failure.
The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles. (Note CS’s desire to stay on Archegos’ good side and to accommodate its wishes–“[revenue] production mode.”)
- All practitioner actions are gambles.***
After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse, that successful outcomes are also the result of gambles, is not widely appreciated.
- Change introduces new forms of failure.
The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology. Because new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
- Views of ‘cause’ limit the effectiveness of defenses against future events.***
Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact, the likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
- Failure free operations require experience with failure.***
Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope. (Guardrails can help avoid catastrophe, but training drivers to deal with “edge” conditions is more universally powerful and preserves the system intact.)
Here endeth our reading.
Let’s bring it back to Paul Weiss’s autopsy of the CS/Archegos debacle. Once we understand the complexity of CS as a “system,” it appears more justifiable than ever that Paul Weiss found no illegality, no intentional misconduct, and a host of decisions and non-decisions that collectively produced a $5.5 billion loss.
I submit that Paul Weiss’s analysis, premised on categorizing CS as “complex” and not merely “complicated,” gave permission to the lawyers working on the autopsy to reject our oh-so-familiar assumptions about cause, responsibility, notice, and so forth. Applying those comfortable and deeply ingrained thought patterns would lead directly to a profound misapprehension of the essence of the CS/Archegos debacle.
It’s safe to wager that in our 21st Century world, the prevalence of complex systems across our society and economy is only going to grow. Be prepared to see them for what they are.
Excellent report, Bruce. In addition to the two valuable sources (HBR on p 2 and Cook’s report on p 3) you linked, there is a classic, accessible, book-length work, Charles Perrow’s Normal Accidents: Living with High-Risk Technologies (Princeton University Press, 1999), that some ASE readers may find interesting and useful. It turns out that there are solid grounds in the science of cognition for why even “the smartest guys in the room” cannot intuit how nonlinear systems with feedback and close coupling will work, nor readily assign blame when consequences move from risk to an actual event.
If, or more likely when, we act in a fiduciary role, do we need to inform ourselves about how complex systems may be involved and undertake to include our understanding and insights in our advice? Thanks to ASE for raising the issues and offering some useful background and guidance.