Reliability and Risk
The Challenge of Managing Interconnected Infrastructures
Emery Roe and Paul R. Schulman



Not long ago a California statewide emergency manager, with wide public- and private-sector experience in infrastructure operations, made a remarkable statement to us: “These infrastructures are more interconnected than we can imagine.” We have discovered the profundity and at the same time the practicality of his insight in writing this book.

Our argument here is that modern infrastructures, which are among our greatest sociotechnological achievements, providing vital services that underpin much of our conception of social modernity, are precarious in newer and more troubling ways than experienced in the past or explained by current theories, analytic frameworks, and advanced computer modeling.

We as a society have evolved a complexity in these systems—in particular their complex interconnections—that makes them difficult to understand. Even operators, risk analysts, regulators, and policy makers, both inside and outside infrastructure organizations, can themselves scarcely anticipate all system-to-system vulnerabilities and the likelihood of reciprocal failure. When such failure happens, we often discover causes only after the fact, and even then much debate attends what caused what.

This is the challenge addressed in this book: We, as citizens and consumers, have become extraordinarily dependent on large sociotechnical systems whose complexity is not adequately comprehended by experts, politicians and policy makers, or the public. Yet system performance and failure affect life and death more than ever before in modern society. In tackling this challenge, our book addresses the question, How can policy makers, specialists, and the informed public better understand the nature of both reliability and risk in interconnected systems whose interactions may be very difficult to foresee?

At the same time this book seeks to advance our understanding of organizational reliability in the management of complex sociotechnical systems. For quite some time an important part of the research into this subject has been polarized around two competing perspectives. One, reflected in what has been termed “normal accident” research, argues that, given the technical characteristics of large, complex, and tightly interconnected physical systems, their failure is inevitable, with potentially catastrophic social effects. On this view, the technical properties of these systems make organizational design and managerial strategy ineffective in preventing these failures. These complex technical systems are accidents waiting to happen, and they will happen given enough time for the systems to fully express themselves.

The competing perspective has been reflected in high reliability organization (HRO) research. This research, based on case studies in selected organizations—originally a nuclear aircraft carrier, a nuclear power plant, and an air traffic control center—argues that special organizational features and demanding managerial practices could forestall Murphy’s Law and prevent catastrophic events from happening.

The debate between these two views has been going on for over two decades and seems to have a life of its own. A number of attempts have been made to reconcile the arguments or provide alternatives to them (e.g., Rijpma 1997 or, more recently, Leveson et al. 2009; Shrivastava, Sonpar, and Pazzaglia 2009a; and Amalberti 2013). This book offers our own, third perspective. It is founded on the argument that the two main approaches have rested on an insufficient understanding of both reliability and risk. Both prior perspectives were focused on only two divergent conditions, or states, for technical systems: normal operations or major system failure and its disastrous effects. Reliability meant maintaining the first state while preventing the second. Many formal risk assessment methodologies primarily assume these two states.

But we demonstrate that interconnected infrastructures can assume system states in between and beyond normal operations and failure. We identify and provide examples of the states of normal operations, disruption, restoration, failure, recovery, and the establishment of a new normal. “Reliability” has different meanings not just for normal and failed operations but also for system disruption and its restoration, recovery, and the establishment of a new normal. This is so because each system state presents its own distinctive forms of risk. In our third perspective, ensuring reliability and managing risks are themselves interconnected because they vary across different system states. To insist that reliability and risk are in tension because the acceptance of risk too often degrades and rarely enhances reliability no longer captures the full picture. As we show, interconnected systems are too complex with respect to reliability and risk to be understood through that polarity.

Our descriptions and analysis, based on our years of firsthand observations, interviews, documents, and case studies across multiple infrastructures (see the appendix), establish the impact of interconnectivity and how the character of interconnectivity itself changes across different states. We highlight the role of infrastructure operators in managing this interconnectivity and promoting different forms of reliability across multiple system conditions. We also explain how current definitions of “the infrastructure crisis” go wrong and how policies and regulatory approaches following from this misperception intensify infrastructure vulnerability and undermine different forms of reliability. We conclude with an assessment of the future of infrastructure reliability, given increasing interconnectivity and the managerial, policy, and regulatory challenges that high reliability poses. Accordingly, our analysis starts as largely descriptive and explanatory with respect to present conditions and shifts later to the wider implications and suggestions for the future.