Predicting Rare Kinds of Failures

Introduction to System Failures

On Dec. 21, 2022, just as peak holiday season travel was getting underway, Southwest Airlines went through a cascading series of failures in their scheduling, initially triggered by severe winter weather in the Denver area. But the problems spread through their network, and over the course of the next 10 days the crisis ended up stranding over 2 million passengers and causing losses of $750 million for the airline.

Understanding the Failure

How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of cases where systems that work smoothly most of the time suddenly break down and cause a domino effect of failures. They have now developed a computational system that combines sparse data about a rare failure event with much more extensive data on normal operations to work backward and pinpoint the root causes of the failure, in the hope of finding ways to adjust the systems to prevent such failures in the future.

The Research Findings

The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore from April 24-28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan. “The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating these issues or failures that we’re observing,” says Dawson.

The Goal of the Project

The new work builds on previous research from Fan’s lab, which looked at hypothetical failure prediction problems, she says, such as groups of robots working together on a task, or complex systems such as the power grid, and sought ways to predict how such systems may fail. “The goal of this project,” Fan says, “was really to turn that into a diagnostic tool that we could use on real-world systems.” The idea was to provide a way that someone could “give us data from a time when this real-world system had an issue or a failure,” Dawson says, “and we can try to diagnose the root causes, and provide a little bit of a look behind the curtain at this complexity.”

Cyber-Physical Problems

The intent is for the methods they developed “to work for a pretty general class of cyber-physical problems,” he says. These are problems in which “you have an automated decision-making component interacting with the messiness of the real world,” he explains. There are available tools for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether it be the scheduling of aircraft, the movements of autonomous vehicles, the interactions of a team of robots, or the control of the inputs and outputs on an electric grid.

Analyzing the Failure

One key difference, though, is that in systems like teams of robots, unlike the scheduling of airplanes, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS). “We do have some good understanding of the physics behind the robotics, and we do have ways of creating a model” that represents their activities with reasonable accuracy. But airline scheduling involves processes and systems that are proprietary business information, and so the researchers had to find ways to infer what was behind the decisions, using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.

The Role of Reserve Aircraft

The impact of the weather events in Denver during the week of Southwest’s scheduling crisis clearly showed up in the flight data, just from the longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious, and required more analysis. The key turned out to involve the concept of reserve aircraft. Airlines typically keep some planes in reserve at various airports, so that if problems are found with one plane that is scheduled for a flight, another plane can be quickly substituted.
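
To make the turnaround-time signal concrete, here is a minimal Python sketch of how a longer-than-normal turnaround could be flagged using only arrival and departure times. It is an illustration under stated assumptions, not the researchers' method and not part of CalNF: the function names, the timestamp format, and every data value below are hypothetical, and a simple z-score stands in for the far more sophisticated statistical model the team used.

```python
from datetime import datetime
from statistics import mean, stdev

# Assumed timestamp format for this sketch only.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def turnaround_minutes(arrival: str, departure: str) -> float:
    """Minutes between a plane's landing and its next takeoff."""
    landed = datetime.strptime(arrival, TIME_FORMAT)
    took_off = datetime.strptime(departure, TIME_FORMAT)
    return (took_off - landed).total_seconds() / 60.0


def z_score(value: float, baseline: list[float]) -> float:
    """How many standard deviations `value` lies above the baseline mean."""
    return (value - mean(baseline)) / stdev(baseline)


# Invented "normal operations" turnarounds at one airport (not real data).
normal_days = [
    ("2022-12-01 08:00", "2022-12-01 08:45"),
    ("2022-12-01 10:10", "2022-12-01 11:00"),
    ("2022-12-02 09:20", "2022-12-02 10:00"),
    ("2022-12-02 13:05", "2022-12-02 14:00"),
    ("2022-12-03 07:30", "2022-12-03 08:20"),
]
baseline = [turnaround_minutes(a, d) for a, d in normal_days]

# Invented turnaround during a storm day (not real data).
storm_turnaround = turnaround_minutes("2022-12-21 09:00", "2022-12-21 11:30")

# A large positive score marks the turnaround as unusually long.
score = z_score(storm_turnaround, baseline)
is_anomalous = score > 3.0
print(f"storm turnaround: {storm_turnaround:.0f} min, anomalous: {is_anomalous}")
```

A check like this only shows that something unusual happened at one airport. Tracing how the disruption then cascaded through the network is the harder problem the researchers set out to address.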

Conclusion

The research team has developed a method for analyzing failures in such systems, which could lead to a real-time monitoring system in which data on normal operations are constantly compared with current data to determine what the trend looks like. This could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas of anticipated problems. Work on developing such systems is ongoing in her lab, Fan says. In the meantime, they have produced an open-source tool for analyzing failure systems, called CalNF, which is available for anyone to use.

FAQs

Q: What triggered the Southwest Airlines scheduling crisis?
A: The crisis was initially triggered by severe winter weather in the Denver area.
Q: How many passengers were affected by the crisis?
A: Over 2 million passengers were stranded due to the crisis.
Q: What is the goal of the project developed by the MIT researchers?
A: The goal is to develop a diagnostic tool that can identify the root causes of failures in real-world complex systems.
Q: What type of problems do the methods developed by the researchers aim to solve?
A: The methods aim to solve cyber-physical problems, which involve automated decision-making components interacting with the physical world.
Q: What is the name of the open-source tool developed by the researchers?
A: The tool is called CalNF.