Reflections on recent LTE outages

Ken Rehbehn, CritComm Insights

September 20, 2022

Rogers Communications Canada experienced a national service outage on July 8, 2022, that halted LTE communications for millions of Canadians, including public-safety agencies counting on mobile broadband for operations. In light of this jarring failure, should public-safety leaders rethink plans to collapse all wireless communications onto a single wireless technology platform operated by a service provider?

The root cause of the Rogers outage was relatively mundane. It was not a hack. It was not a disaster. The failure came from a routine, well-planned, early-morning configuration change to core internet router reachability. The faulty configuration change flooded Rogers’ network-routing equipment with invalid route information, halting all incoming and outgoing traffic until the national routing fabric could be restarted, a laborious process that took many hours. Restoration efforts continued into the night.

While Rogers’ national outage was significant in scope and impact, it was not a unique event. Modern cellular networks operate with a core IP network serving as the network’s central nervous system. Equipment failure, configuration errors, and security attacks can affect the core network subsystems that govern LTE network operation. Subsystems implementing the Border Gateway Protocol (BGP), Domain Name System (DNS), IP Multimedia Subsystem (IMS), and Home Subscriber Server (HSS) each can trigger national outages.

Recent history shows how prevalent these failures are. Just days before Rogers’ outage, the KDDI LTE network in Japan experienced a national two-day shutdown caused by congestion in its IP Multimedia Subsystem hosting Voice over LTE. In Europe, an attempted June 2021 upgrade to Orange’s network in France disrupted voice services across the country. Christmas 2020 brought about a mass failure of AT&T cellular operations, including FirstNet, across the Southeast U.S. following the Christmas-day explosion outside a significant MPLS routing hub in Nashville. And in June 2020, T-Mobile’s Voice over LTE network experienced an outage across the Southeast U.S. following an IP-routing misconfiguration. The list can go on.

LTE’s foundational role in public-safety operations is well established in many countries today. Mobile broadband communications based on LTE serve as the essential enabler of vital tools, ranging from e-mail to incident dispatch. The future will expand this role. With all mobile network infrastructure vendors offering support for 3GPP Mission Critical LTE specifications, some nations are planning to retire legacy Land Mobile Radio (LMR) systems that are based on analog radio or digital trunked radio transmission technology.

The argument for a transition from narrowband to broadband is logical. As the term implies, narrowband radio technology lacks channel capacity for extensive data flows beyond basic messaging. In contrast, LTE was architected to provide an IP-based super-set of functionality that can accommodate the needs of outdated technology silos. For cost-conscious governments, the argument is enticing. Rather than funding two network technologies – LMR for push-to-talk tactical communications and LTE for data-centric communications – they can fund a single converged mission-critical LTE network that handles all requirements.

Unfortunately, the recent massive network outages raise several issues that local and national public safety authorities must consider before taking the final step of powering down the LMR networks. Chief amongst those is the ability of a dedicated public safety core network to operate following widespread IP or transport (e.g., MPLS, optical) failure that impacts the radio access network. Resilience tools such as MOCN architectures, roaming, multi-SIM user devices, and satellite communications can be essential elements of an authority’s strategy. Likewise, the continued operation of a parallel LMR network provides significant redundancy. Specific points to consider include:

  • Mission-critical LTE does not mean bullet-proof LTE. Modern LTE networks support policy-based priority handling of voice and data traffic. Likewise, these networks have special-purpose signaling that helps the network meet transmission time requirements for push-to-talk communications. But these mechanisms only protect users and traffic from contention with non-public-safety users. Mission-critical LTE and mission-critical push-to-talk cannot function when the core network fails.

  • Commercial core network scale may be a factor. The mass failures that catch the public’s eye are in large-scale networks serving millions of users. That scale may be a contributing factor in the extended time that it takes to correct a fault. Restarting a complex national network requires a careful sequence of operations that takes time. Dedicated public-safety core networks are, by nature, smaller in scale. Faults in the public-safety core network may be easier to find and fix than faults in a large-scale mobile operator network.

  • Dedicated public-safety core networks are useless if there is no functional LTE radio access network. While a dedicated public safety core network may be less prone to failure thanks to its reduced scale, the radio access traffic still needs to find its way to the dedicated core. National public-safety communications authorities must ensure that their selected mobile-network operator can forward radio access network traffic to the dedicated public-safety core, even if the operator’s main commercial core network is not responding. Because some of the radio access network functionality is based on dips into the operator’s main core network for an initial subscriber and policy lookup, that assurance may be an impossible dream.

  • LMR networks can fail, but they degrade in layers. A point that sometimes gets glossed over when austerity-minded politicians think about converging all emergency communications on a single technology is that LMR technology is designed to fail gracefully. In contrast, LTE is based on the assumption that the core network is always available. Worse, if an LTE handset loses connectivity with the network, the handset cannot use LTE to communicate directly with other devices. While the 3GPP Proximity Services (ProSe) feature was intended to fill this gap, the ProSe capabilities fail to meet the requirements of firefighting and law enforcement teams operating deep in buildings. LMR failure is different. In the case of digital trunked systems, the individual radio tower sites will continue to function even if the LMR core network is lost. And end users can communicate directly if the radio tower site is not accessible. The layered degradation provides a time-proven approach to resilience that LTE cannot match.

  • LTE push-to-talk can work in tandem with LMR. As companies like ESChat and Catalyst Communications have proven, LTE-based support for push-to-talk communications works well in tandem with legacy Land Mobile Radio technology. 3GPP mission-critical push-to-talk (MCPTT) interworking gateways for TETRA and Project 25 are now bringing this mix-and-match capability to 3GPP-based approaches. This versatility means that public-safety officials can use MCPTT as a fallback for failed LMR and LMR as a fallback for failed LTE. The open question is expense.

  • Cost savings from LMR retirement may be overstated. Much of the austerity argument driving the push to retire legacy LMR technology comes from an assumption that subscriber handset costs will go down thanks to LTE economy-of-scale benefits. That handset argument is failing as the need for an effective ProSe alternative forces a move to dual-technology devices supporting both LTE and a DMR, TETRA, Project 25, or analog direct-mode radio. Of course, the austerity argument includes eliminating LMR network infrastructure and maintenance costs. But the cost of operating modern digital trunked networks may be more than offset by the value of having a simple, reliable alternative communications path.

  • A hidden cost of LMR retirement. A significant public-safety in-building communications footprint is based today on LMR, not LTE. In the U.S., fire-protection regulations mandate support for public-safety communications in new large buildings. Practices vary, but most developed nations worldwide also ensure that critical facilities get built with sufficient radio support for public-safety communications. Coverage is guaranteed for portions of the building that commercial cell operators do not cover. Builders must typically fund this coverage and install in-building systems that support LMR. A shift to LTE requires a substantial new investment for existing properties. Who pays for that upgrade?
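
The layered-degradation point above can be made concrete with a small sketch. It is a deliberate simplification under stated assumptions, not a model of any real network: the names Service, lmr_service, and lte_service are hypothetical, the LTE case ignores ProSe and resilience tools such as roaming or multi-SIM devices, and each failure is reduced to a yes/no flag.

```python
from enum import Enum


class Service(Enum):
    """Hypothetical service levels, named for illustration only."""
    WIDE_AREA = "wide-area service through the core network"
    SITE_ONLY = "local service from a standalone tower site"
    DIRECT = "direct radio-to-radio communication"
    NONE = "no service"


def lmr_service(core_up: bool, site_reachable: bool) -> Service:
    """Digital trunked LMR as described in the text: it degrades in layers."""
    if site_reachable and core_up:
        return Service.WIDE_AREA
    if site_reachable:
        # The tower site keeps working even if the LMR core is lost.
        return Service.SITE_ONLY
    # With no reachable site, users can still talk radio to radio.
    return Service.DIRECT


def lte_service(core_up: bool, site_reachable: bool) -> Service:
    """LTE as described in the text: it assumes the core is always there."""
    if site_reachable and core_up:
        return Service.WIDE_AREA
    # Assumption: no usable direct mode and no alternate network.
    return Service.NONE


if __name__ == "__main__":
    scenarios = [(True, True), (False, True), (False, False)]
    for core_up, site_reachable in scenarios:
        print(
            f"core_up={core_up}, site_reachable={site_reachable}: "
            f"LMR -> {lmr_service(core_up, site_reachable).value}; "
            f"LTE -> {lte_service(core_up, site_reachable).value}"
        )
```

In this toy model, every failure scenario leaves LMR users with some way to talk, while LTE users drop straight from full service to none. That gap is what the resilience tools and the parallel LMR network discussed above are meant to close.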

Perhaps the cadence of mass LTE network outages will slow down or stop. Emerging AI-assisted network management tools, automation, and operational experience may lead to an elimination of mass outages. But the complexity of interconnected IP networks dependent on fragile protocols and subsystems such as BGP, DNS, and IMS suggests that mass outage risk will remain a lingering factor for years to come.

Placing all communications requirements in a single technology basket is tempting. But public-safety communications planners must provide a robust PACE communications methodology for resilience that incorporates each PACE element: primary, alternate, contingency, and emergency communications functionality.
