The 737 MAX Events

In 2018 and 2019, two Boeing 737 MAX aircraft crashed due to essentially the same failure mode (1, 2). In response, the 737 MAX fleet was grounded worldwide. As these were the world’s most significant multi-fatality accidents directly caused by an autonomous system, it’s worth reflecting on what went wrong.

The 737 MAX was the next generation of the 737 aircraft, and during development Boeing changed the structure of the aircraft in a way that critically altered its flying characteristics. To compensate for this, they employed a software mitigation called “MCAS” to automatically adjust the nose angle of the plane downward in specific, high-risk flight conditions. MCAS depended on a single sensor for input and only became active when the autopilot was off and the wing flaps were up; pilots typically retract the flaps shortly after takeoff. Pilots of the 737 were not trained on this autonomous assistant, or even informed that the new MAX version had it.
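
To make the failure mode concrete, here is a minimal, illustrative sketch of the kind of activation logic described above (in Python, with hypothetical names and made-up thresholds; it is not Boeing’s implementation): activation is gated on autopilot and flap state, and a single sensor reading is the only input.

    # Illustrative sketch only -- hypothetical names and values, not Boeing's code.
    # It shows the structural problem described above: activation gated on
    # autopilot and flap state, with a single sensor reading as the only input.

    def mcas_trim_command(sensor_reading_deg: float,
                          autopilot_on: bool,
                          flaps_up: bool,
                          threshold_deg: float = 15.0) -> float:
        """Return nose-down trim in degrees, or 0.0 when inactive.

        `threshold_deg` and the 2.5-degree command are illustrative values.
        """
        if autopilot_on or not flaps_up:
            return 0.0          # only active with autopilot off and flaps up
        if sensor_reading_deg > threshold_deg:
            return 2.5          # one bad reading is enough to command nose-down trim
        return 0.0

    # A sensor stuck at an implausibly high value keeps commanding nose-down trim
    # in otherwise normal flight:
    print(mcas_trim_command(sensor_reading_deg=70.0, autopilot_on=False, flaps_up=True))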

On Lion Air Flight 610, the single input sensor to MCAS failed on the ground before takeoff. After takeoff, when the crew retracted the flaps, MCAS received incorrect readings from the sensor and instructed the airplane to nosedive. The pilots did not know that MCAS existed, let alone how to turn it off, and the plane crashed, killing all aboard.

The same aircraft had experienced the same issue on its previous flight, though those pilots were able to stumble upon a remedy. They reported the problem, but it wasn’t communicated to the pilots of Lion Air Flight 610. Had this communication proceeded more smoothly, the crash of Lion Air 610 might have been avoided.

After the crash of Lion Air Flight 610, Boeing informed all 737 pilots about the existence of MCAS and instructed pilots to disable it if the plane started nosediving.

Then, on Ethiopian Airlines Flight 302, there was a fault in the sensor (possibly caused by damage sustained during takeoff). When the crew retracted the flaps, MCAS received incorrect readings from the sensor and instructed the airplane to nosedive. The pilots did know that MCAS existed and followed the procedure to turn it off by flipping the stabilizer trim cutout switches. Flipping these switches meant the pilots had to use the “manual trim wheel” to recover from the dive, but aerodynamic pressure made the wheel physically impossible to turn. So, in a way, the mitigation to the problem caused by MCAS made that problem impossible to recover from. All aboard died.

Lessons:

  • Communicate all autonomous systems, and changes to autonomous systems, to human co-operators
  • Implement redundant sensor inputs (see the sketch after this list)
  • Make sure systems that are necessary for air travel fail loudly on the ground
  • Assume worst-case responses from human co-operators in critical situations
  • Run simulations in impaired states (e.g., with a failed sensor)
  • Don’t use “bandaids” in safety-critical systems
  • Maintain vigilance in human reporting of urgent safety problems
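
As a sketch of the redundancy lesson above (hypothetical names and thresholds, not any certified avionics design), the idea is to cross-check two independent sensors and have the automation stand down and alert the crew when they disagree, rather than acting on, or averaging over, a possibly faulty reading:

    from dataclasses import dataclass

    @dataclass
    class SensorVote:
        value: float | None   # agreed reading, or None when the sensors disagree
        alert_crew: bool      # True: the automation stands down and warns the pilots

    def cross_check(left_deg: float, right_deg: float,
                    max_disagreement_deg: float = 5.0) -> SensorVote:
        """Only return a usable reading when both sensors roughly agree."""
        if abs(left_deg - right_deg) > max_disagreement_deg:
            # Disagreement: don't average, don't pick a side -- disable the
            # automatic function and tell the humans.
            return SensorVote(value=None, alert_crew=True)
        return SensorVote(value=(left_deg + right_deg) / 2.0, alert_crew=False)

    print(cross_check(70.0, 4.0))   # SensorVote(value=None, alert_crew=True)
    print(cross_check(4.5, 5.0))    # sensors agree: usable reading, no alert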

Prior Events

There were previous instances of autonomous systems indirectly leading to crashes. In 2009, Air France Flight 447 crashed after the pilot, seemingly inexplicably, put the nose up and did not put it back down (the plane climbed, slowed, stalled, then free-fell belly-first into the ocean). One explanation for this behavior is that iced-over airspeed sensors caused the autopilot to disconnect, handing control back to a crew that wasn’t expecting it. The pilots’ inputs caused a stall, which correctly sounded the stall alarm; however, confusion about the automated systems and conflicting instrument readings led to the crew failing to recover from the stall.

On Asiana Airlines Flight 214, the pilots weren’t aware that the autothrottle system had switched to an idle mode (“HOLD”) and would not automatically maintain the plane’s speed during the landing approach. In this case, confusion about the interaction between multiple competing autonomous systems was a causal factor in the crash.

The closest prior event to the 737 MAX accidents was a 2008 Qantas flight that experienced two sudden drops in altitude that severely injured some passengers. The two drops were caused by one of the air data computers sending faulty data, which led the flight control system to command a nosedive. There were redundant sensors, but when anomalies occurred in a particular pattern over time the system did not handle them properly. Particularly concerning, the anomaly-tolerance logic mostly involved averaging over past signals from multiple sensors instead of alerting the pilots. The nosedive occurred twice before the flight landed because, while the pilots turned off the autopilot, they needed to disable additional systems to fully stop the erroneous commands.

This kind of transient computer fault is called a “soft error” and happens all the time, sometimes due to cosmic rays disrupting electron flow in the CPU. Avionics CPUs use redundant lanes to minimize the effects of these events. Much of the memory deployed now (in 2025), especially in servers, is “ECC” (error correcting code) memory, which is 2-3% slower and more expensive but performs automatic error correction to mitigate “soft error” problems at the hardware level (so the user doesn’t need to write their own error-handling logic). The hardware mitigations are not sufficient, however, as they don’t cover all cases. Some level of software design is needed to additionally protect against these cases.
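
As a sketch of what such software-level protection might look like (illustrative only, with hypothetical names; not a pattern taken from the incident reports): store a critical value redundantly and cross-check the copies before use, so a single flipped bit fails loudly instead of silently feeding a control decision.

    class ProtectedValue:
        """Keep an integer twice, the second copy bit-inverted, and verify on read."""

        def __init__(self, value: int):
            self._primary = value
            self._shadow = ~value      # bitwise complement as the redundant copy

        def read(self) -> int:
            if self._primary != ~self._shadow:
                # One copy was corrupted (e.g., by a bit flip): fail loudly
                # instead of propagating a wrong value.
                raise RuntimeError("soft error detected in protected value")
            return self._primary

    target_pitch_deg = ProtectedValue(5)
    print(target_pitch_deg.read())     # 5

    # Simulate a bit flip in the primary copy and observe the loud failure:
    target_pitch_deg._primary ^= 1 << 3
    try:
        target_pitch_deg.read()
    except RuntimeError as err:
        print(err)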

Single Sensor

Why did Boeing opt for an autonomous system without sensor redundancy? Boeing assumed the pilots could mitigate any MCAS issue within 3 seconds by turning off the system.