On Dec. 21, 2022, just as the peak holiday travel season was getting underway, Southwest Airlines suffered a cascading series of scheduling failures. Triggered by severe winter weather around Denver, the disruptions spread through the airline’s network, ultimately stranding more than 2 million passengers and costing the company $750 million over the following 10 days.
How did a localized weather event set off such a far-reaching failure? Researchers at MIT examined this widely reported breakdown as a case study of how a system that normally works well can suddenly collapse and trigger a domino effect of failures. They developed a computational method that combines sparse data about rare failure events with much richer data on normal operations to trace a failure back to its root causes, in hopes of learning how to adjust such systems to prevent failures like it in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore April 24-28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
“The motivation behind this work is the frustration of dealing with these complex systems, where it can be very hard to understand what is happening behind the scenes that is creating the issues or failures we observe,” says Dawson.
The new work builds on earlier research from Fan’s lab, which studied hypothetical failure-prediction scenarios, such as groups of robots working together on a task or complex systems like the power grid, to anticipate how those systems might fail. “The goal of this project,” Fan says, “was to turn that into a diagnostic tool that we could use on real-world systems.”
The idea was to enable someone to “give us data from a time when this real-world system had a problem or a failure,” Dawson says, “and we could try to identify the root causes, offering a look behind the complexity.”
The aim is for the methods they developed to “apply to a whole class of cyber-physical problems,” he says. These are problems in which “an automated decision-making system interacts with the messiness of the real world,” he explains. Tools already exist for testing software systems that operate on their own, but complexity emerges when that software has to interact with physical entities carrying out tasks in the real world, whether scheduling aircraft, directing autonomous vehicles, coordinating teams of robots, or managing loads on an electric grid. In such cases, he says, “the software might make a decision that looks fine at first, but then it leads to a cascade of knock-on effects that create more mess and uncertainty.”
One key difference, though, is that in systems such as robotic teams, unlike airline scheduling, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS). “We have a good understanding of the physics behind robotics, and we can write down a model” that accurately represents their behavior. Airline scheduling, by contrast, involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what lay behind the decisions using only the relatively sparse publicly available data, which amounted to little more than the actual arrival and departure times of each plane.
“We collected all this flight data, but there’s this whole scheduling system behind it, and we don’t know how that system works,” Fan says. And the data available on the actual failure covers only a few days, compared with years of data on normal flight operations.
The effects of the Denver weather during the week of Southwest’s scheduling meltdown showed up in the flight data as longer-than-usual turnaround times between landing and takeoff at the Denver airport. But how that impact cascaded through the network was less obvious and required deeper analysis. The key turned out to be the concept of reserve aircraft.
Airlines typically keep some planes in reserve at various airports, so that if problems arise with one scheduled flight’s aircraft, a substitute can be quickly slotted in. Southwest flies only one type of aircraft, which makes such substitutions easier. But while most airlines operate on a hub-and-spoke system that concentrates reserve aircraft at a few hub airports, Southwest has no hubs, so its reserves are spread more widely across its network. How those planes were deployed turned out to be crucial to the unfolding crisis.
“The problem is that there is no public data on where the aircraft are across the Southwest network,” Dawson says. “Our method lets us look at the public data on arrivals, departures, and delays and infer what the hidden parameters of those aircraft reserves could have been, to explain the observations we were seeing.”
What they found was that the deployment of reserves acted as a “leading indicator” of the problems that snowballed into a nationwide crisis. Some parts of the network that were directly affected by the weather recovered quickly and got back on schedule. “But when we looked at other areas of the network, we saw that these reserves just weren’t available, and things kept getting worse,” Dawson says.
For example, the data showed that reserves in Denver were rapidly being depleted by the weather delays, but it also allowed the team to trace the failure from Denver to Las Vegas. Though there was no severe weather in Las Vegas, “our method was still showing us a steady decline in the number of aircraft that were available for flights out of Las Vegas,” he says.
The analysis revealed, he says, “cycles of aircraft within the Southwest network, where an aircraft might start its day in California, fly to Denver, and end its day in Las Vegas.” When the storm hit, that cycle was broken. As a result, “this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which is not affected by the weather, start to deteriorate.”
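The cascade dynamics described above can be captured in a toy simulation. All numbers here are invented for illustration and do not reflect Southwest’s actual network: aircraft that would normally rotate from Denver into Las Vegas are held by a storm, so Las Vegas reserves drain even though its local weather is fine.

```python
# Toy illustration (invented numbers, not Southwest's actual network):
# aircraft normally rotate into Las Vegas from Denver each day. When a
# storm holds aircraft in Denver, the inbound flow stops while Las Vegas
# keeps dispatching its scheduled departures, so its reserves drain.
las_reserves = 6        # aircraft available in Las Vegas at the start
inbound_from_den = 3    # aircraft per day normally arriving from Denver
outbound_from_las = 3   # scheduled departures consuming aircraft each day
storm_days = {1, 2, 3}  # hypothetical days when the storm grounds Denver legs

trace = []
for day in range(6):
    arrivals = 0 if day in storm_days else inbound_from_den
    las_reserves = max(0, las_reserves + arrivals - outbound_from_las)
    trace.append(las_reserves)

print(trace)  # → [6, 3, 0, 0, 0, 0]
```

Notice that once the reserves hit zero they stay there even after the storm ends, because normal operations only replace what departs; restoring the buffer requires repositioning aircraft from elsewhere.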
In the end, Southwest was forced to take a drastic measure to resolve the problem: a “hard reset” of their entire system, canceling all flights and ferrying empty aircraft around the country to rebuild their reserves.
Working with experts in air transportation systems, the researchers built a model of how the scheduling system is supposed to work. Then, “what our method does is essentially try to reverse-engineer the model,” Dawson says. By looking at the observed outcomes, the model makes it possible to work backward and determine what kinds of initial conditions could have produced those outcomes.
While the data on the actual failures was sparse, the wealth of data on normal operations helped teach the computational model “what is feasible, what is possible, what’s within the realm of physical possibility,” Dawson says. “That gives us the domain knowledge to then say, in this extreme event, given the space of what’s possible, what’s the most likely explanation” for the failure.
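A rough sketch of this idea, with a made-up forward model and invented numbers rather than the researchers’ actual method: fit a prior over a hidden parameter (here, a notional reserve-aircraft level) from plentiful normal-operations data, then pick the value of that parameter that best explains a handful of failure-day observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical history: years of normal operations give a prior over the
# hidden parameter (reserve aircraft available on a given day).
normal_reserves = rng.normal(loc=10.0, scale=2.0, size=1000)
prior_mean, prior_std = normal_reserves.mean(), normal_reserves.std()

# Assumed (invented) forward model: mean delay grows as reserves shrink.
def forward_model(reserves):
    return 120.0 - 8.0 * reserves  # mean delay in minutes

# Sparse failure data: a few days of unusually high delays.
observed_delays = np.array([95.0, 102.0, 88.0])
obs_std = 10.0  # assumed observation noise

# Maximum a posteriori estimate over a grid of candidate reserve levels:
candidates = np.linspace(0.0, 20.0, 2001)
log_prior = -0.5 * ((candidates - prior_mean) / prior_std) ** 2
log_lik = sum(-0.5 * ((d - forward_model(candidates)) / obs_std) ** 2
              for d in observed_delays)
map_reserves = candidates[np.argmax(log_prior + log_lik)]
print(f"Most likely reserve level during the failure: {map_reserves:.1f}")
```

The prior learned from normal data rules out physically implausible explanations, while the failure observations pull the estimate toward the hidden reserve shortfall that best accounts for the delays.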
This could eventually lead to a real-time monitoring system, he suggests, in which data on normal operations are constantly compared with current data to detect emerging trends. “Are we trending back to normal, or are we trending toward extreme events?” Spotting signs of impending trouble could enable preemptive measures, such as proactively repositioning reserve aircraft to areas where problems are expected.
Work on developing such systems is ongoing in her lab, Fan says. In the meantime, the team has released an open-source tool for analyzing failures in such systems, called CalNF, which is available for anyone to use. Meanwhile, Dawson, who earned his doctorate last year, is working as a postdoc to apply the methods developed in this work to understanding failures in power networks.
The research team also included Max Li of the University of Michigan and Van Tran of Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.