Traditionally, public-safety agencies have relied on private networks for communications during emergencies. Industry standards have greatly facilitated the development and deployment of networks that meet the stringent availability, reliability and security requirements expected of emergency communication systems.

For example, Project 25, a standard supported by the Association of Public-Safety Communications Officials, provides a reliable and secure network for public-safety applications. Networks based on the Terrestrial Trunked Radio (TETRA) standard have been deployed by public-safety agencies in several European countries to provide reliable and secure communications to first responders and other emergency workers.

However, public-safety officials have recently expressed increasing concern about the ability of such aging technologies to meet the growing demands of emergency communication services. After Sept. 11, 2001, a great sense of urgency developed for different public-safety agencies to collaborate and coordinate their activities in emergencies.

These agencies often use disparate networks with proprietary systems and devices that do not interoperate, thus making collaboration difficult, if not impossible.

Furthermore, most of these systems are designed primarily for voice communications and lack other capabilities, such as high-speed data communications that can be used to send text or images. Compounding the issue is the concern that vendor support and innovation are insufficient, leaving such technologies capacity-limited and expensive to deploy.

These concerns are causing public-safety agencies to look at alternative technologies, specifically those being deployed in commercial wireless communication systems. Based on widely adopted standards, commercial technologies such as CDMA and GSM offer several advantages.

In addition to addressing the interoperability issues, such technologies offer high-speed data capabilities (1xEV-DO in CDMA networks, and GPRS and EDGE in GSM networks) that are being used for a range of enhanced services such as text messaging and push-to-talk, as well as the transmission of images and video.

Combined with location-identification technologies, such services offer a compelling value proposition to a wide variety of applications, including public safety. Internet protocol-based wireless networks that use commercial off-the-shelf (COTS) components, such as 802.11 technologies, are gaining significant traction in commercial local and metro networks and are attracting attention from public-safety agencies.

While there may be a role for commercial technologies in critical applications such as public safety, some issues must first be addressed. For instance, communication systems based on commercial technologies must meet the strict service availability requirements of mission-critical applications before they are used in any significant deployment. Most mission-critical applications require system uptime of 99.999% (roughly five minutes of downtime per year) or better (Figure 1). Public safety falls into a subcategory of applications that dictate even more stringent requirements for on-demand, continuous operation.
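The downtime figures behind each "number of 9s" follow from simple arithmetic. The short Python sketch below is illustrative only; it assumes a 365-day year, and the function name is hypothetical rather than taken from any standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Print the downtime implied by each availability level listed in Figure 1.
for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.2f} minutes of downtime per year")
```

At 99.999% the result is a little over five minutes per year, which is why "five 9s" is commonly rounded to five minutes.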

To ensure continuous operation and uninterrupted service, system designers must identify all potential single points of failure in the service path and implement ways to eliminate them.
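One way to see why single points of failure matter is to compare elements chained in series with elements deployed redundantly. The sketch below is a simplified illustration that assumes failures are independent and that switchover between redundant elements is instantaneous and always succeeds; real systems meet neither assumption perfectly. The function names and the 99.9% figure are chosen for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a service path in which every element must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a redundant set in which any one working element suffices."""
    return 1.0 - prod(1.0 - a for a in availabilities)


element = 0.999  # hypothetical availability of a single element (99.9%)

# Two elements in series: either one is a single point of failure.
print(f"series:    {series_availability([element, element]):.6f}")

# The same two elements as a redundant pair.
print(f"redundant: {redundant_availability([element, element]):.6f}")
```

Under these assumptions, chaining two 99.9% elements lowers availability, while pairing them redundantly raises it to roughly 99.9999%.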

A comprehensive approach to designing such highly available systems encompasses the following elements.

  • Establish key availability and performance requirements up front

  • Define a system architecture that supports key build-vs.-buy decisions and leverages COTS elements

  • Place special focus on key high-availability attributes.

For example, to support continuous service in a wireless network, recovery from any failure must be executed in less than 50 milliseconds. Restoring a system to an operational state following a failure is a multistage process that involves fault detection, diagnosis, isolation, recovery, repair and reconfiguration. Consequently, each of these stages must be allocated its own performance budget so that the complete recovery sequence finishes in less than 50 milliseconds.
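A budget of this kind can be checked with a few lines of code. In the Python sketch below, the stage names and the 50-millisecond deadline come from the text; the per-stage values are made-up numbers used only to show the bookkeeping, not measured or recommended figures.

```python
RECOVERY_DEADLINE_MS = 50.0  # end-to-end recovery target stated in the text

# Hypothetical per-stage budgets in milliseconds (illustrative values only).
stage_budgets_ms = {
    "detection": 10.0,
    "diagnosis": 5.0,
    "isolation": 5.0,
    "recovery": 15.0,
    "repair": 5.0,
    "reconfiguration": 5.0,
}


def check_budget(budgets_ms, deadline_ms=RECOVERY_DEADLINE_MS):
    """Return (total, slack, fits) for a set of per-stage time budgets."""
    total = sum(budgets_ms.values())
    slack = deadline_ms - total
    return total, slack, total < deadline_ms


total, slack, fits = check_budget(stage_budgets_ms)
print(f"total={total} ms, slack={slack} ms, within deadline={fits}")
```

Keeping some slack in the total leaves room for stages that overrun their individual allocations.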

Similarly, to support an uninterrupted wireless communication service, the messaging system must support anywhere from 10,000 to 50,000 real-time messages per second, depending on the message size. Such performance requirements often demand an efficient in-memory data store for storing and retrieving real-time information.
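To put those rates in perspective, the Python sketch below converts a message rate into the average time available per message and shows a deliberately minimal in-memory store. The class, its methods and the sample key are hypothetical; a production data store would also need replication, eviction and persistence policies that are omitted here.

```python
import threading


def per_message_budget_us(messages_per_second: int) -> float:
    """Average time available per message, in microseconds, at a given rate."""
    return 1_000_000 / messages_per_second


class InMemoryStore:
    """Minimal thread-safe key-value store held entirely in memory (illustrative)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)


for rate in (10_000, 50_000):
    print(f"{rate} msgs/s leaves {per_message_budget_us(rate):.0f} microseconds per message")

store = InMemoryStore()
store.put("unit-17/status", "en route")  # hypothetical real-time record
print(store.get("unit-17/status"))
```

With only tens of microseconds available per message on average, a disk access on the critical path would consume the entire budget, which is the motivation for keeping real-time data in memory.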

In the 1990s the computer industry made an important transition from vertically integrated systems, primarily offered by individual system vendors, to modular systems that let end users purchase COTS components (hardware, operating system and middleware) from different vendors and assemble an operational system with relatively little effort. The telecom industry is just beginning a similar transition, aided by the development of specifications intended to facilitate the portability of middleware and applications across multiple platforms.

The proliferation of such standards provides designers the flexibility to build systems by combining a set of interoperable COTS building blocks such as hardware platform, operating system and middleware from a variety of competing vendors (Figure 2). This enables the equipment vendors to minimize the cost and effort involved in building a system, while allowing them to focus their precious resources on their core competencies — communication applications.

Despite this flexibility, the building of systems intended to deliver uninterrupted service availability for critical applications — such as public safety — is a daunting task that involves complex hardware and software, and critical implementation decisions along the way. Experience shows that such systems evolve over multiple releases to a point at which they are able to meet the most stringent service availability requirements.

Some of the most critical attributes that characterize such highly available systems include the following: the ability to thoroughly model the system and its resources, comprehensive high-availability services and an efficient messaging engine.

These attributes, combined with effective system management, can yield a system that ensures uninterrupted service availability.

System modeling provides a mechanism to represent the physical and logical resources that make up the overall system. It also defines the relationships and dependencies among such resources in a hierarchical network of managed objects. These objects have states and attributes that represent the actual and desired state of the corresponding resource, such as hardware elements, software modules and nodes in a cluster.

The system model has the capability to group sets of like objects into service groups and then define respective recovery policies for each group that are executed in the event of a failure. Any change in the state of the objects that can affect service availability — such as a failed application, a hardware error or planned downtime — causes an update to the system model that in turn triggers the appropriate recovery action to maintain the overall service availability.
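The idea of managed objects, service groups and recovery policies can be sketched in a few dozen lines of Python. Everything below (class names, state strings, the failover policy and the component names) is hypothetical and intended only to illustrate the flow from a state change to a recovery action; it does not reflect any particular middleware product or standard API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ManagedObject:
    """A resource in the model, tracking its desired and actual state."""
    name: str
    desired_state: str = "healthy"
    actual_state: str = "healthy"


@dataclass
class ServiceGroup:
    """A set of like objects that share one recovery policy."""
    name: str
    recovery_policy: Callable[["ServiceGroup", ManagedObject], str]
    members: List[ManagedObject] = field(default_factory=list)


class SystemModel:
    def __init__(self):
        self._group_of: Dict[str, ServiceGroup] = {}
        self.actions: List[str] = []  # recovery actions triggered so far

    def add_group(self, group: ServiceGroup) -> None:
        for obj in group.members:
            self._group_of[obj.name] = group

    def report_state(self, obj: ManagedObject, new_state: str) -> None:
        """Update the model; trigger the group's policy if actual differs from desired."""
        obj.actual_state = new_state
        if obj.actual_state != obj.desired_state:
            group = self._group_of[obj.name]
            self.actions.append(group.recovery_policy(group, obj))


def failover_policy(group: ServiceGroup, failed: ManagedObject) -> str:
    """Switch to a healthy peer if one exists; otherwise restart the failed object."""
    for candidate in group.members:
        if candidate is not failed and candidate.actual_state == "healthy":
            return f"switch {group.name} from {failed.name} to {candidate.name}"
    return f"restart {failed.name}"


active = ManagedObject("cc-active")
standby = ManagedObject("cc-standby")
model = SystemModel()
model.add_group(ServiceGroup("call-control", failover_policy, [active, standby]))

model.report_state(active, "failed")  # e.g. a failed application is detected
print(model.actions)
```

Separating the policy from the model is a design choice made for this sketch: it lets each service group carry its own recovery behavior without changing the code that tracks state.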

High-availability services are used to manage resource failures without interrupting service. Such services are responsible for providing seamless switchover among redundant components, shielding the end user from any faults or resulting failures.

High-availability services must support a large number of nodes in a cluster, with the ability to collect, preserve and distribute application state information for stateful, seamless failover between the nodes in the cluster. Key functions provided by high-availability services include:

  • Creating and maintaining availability management information

  • Providing the availability engine that applies policies to proactively manage the system for high availability, often 99.999% or better

  • Managing various parts of the system, such as nodes in a cluster and redundant components

  • Providing checkpointing services to applications and other components.
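Checkpointing is what makes a failover stateful: the active component records its state as it changes, and the standby restores that state when it takes over. The Python sketch below illustrates the idea with hypothetical classes and an in-process store; a real checkpoint service would replicate the data across nodes, which is not shown here.

```python
import copy


class CheckpointStore:
    """Holds the latest checkpoint for each named component (illustrative only)."""

    def __init__(self):
        self._latest = {}

    def save(self, component: str, state: dict) -> None:
        # Copy the state so later changes by the writer do not alter the checkpoint.
        self._latest[component] = copy.deepcopy(state)

    def restore(self, component: str) -> dict:
        return copy.deepcopy(self._latest.get(component, {}))


class CallTracker:
    """Hypothetical application that tracks active calls and checkpoints them."""

    def __init__(self, name: str, store: CheckpointStore):
        self.name = name
        self.store = store
        self.active_calls = {}

    def start_call(self, call_id: str, talkgroup: str) -> None:
        self.active_calls[call_id] = talkgroup
        self.store.save("call-tracker", self.active_calls)  # checkpoint on every change

    def take_over(self) -> None:
        """Called on the standby during failover to resume from the last checkpoint."""
        self.active_calls = self.store.restore("call-tracker")


store = CheckpointStore()
active = CallTracker("node-a", store)
standby = CallTracker("node-b", store)

active.start_call("call-1", "fire-dispatch")
standby.take_over()  # simulate a failover from node-a to node-b
print(standby.active_calls)
```

Checkpointing on every change keeps the standby current at the cost of extra writes; checkpointing less often trades some lost state for lower overhead.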

Finally, a messaging engine addresses the need for communication between system elements. Such an engine provides an efficient mechanism for communicating a wide variety of information, such as application state information, event-and-error notification, and fault management information. A messaging service also provides an effective way for distributed components to communicate and coordinate their activities. Instead of requiring each resource to manage its own communication complexities, the messaging service handles them on its behalf.

A messaging service must be flexible, scalable and reliable. Flexibility requires that the messaging service be independent of the applications to which it provides services, so that the communication responsibilities can be offloaded from the applications. Scalability ensures that the messaging service is designed to support change and growth in the system components, and to support increased messaging activity. Applications do not need to be modified as the system and message volume change or grow. Lastly, the messaging service must be reliable to ensure message delivery, even in the event of failures in the primary network connection.
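The reliability requirement can be illustrated with a small Python sketch in which the messaging service, rather than the application, retries delivery over a backup connection when the primary one fails. The `Link` and `MessagingService` classes are hypothetical stand-ins for real network transports and middleware.

```python
class Link:
    """Stand-in for a network connection that can be marked up or down."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.delivered = []

    def send(self, message: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.delivered.append(message)


class MessagingService:
    """Tries each link in priority order so callers never handle failover themselves."""

    def __init__(self, links):
        self.links = list(links)

    def send(self, message: str) -> str:
        for link in self.links:
            try:
                link.send(message)
                return link.name  # report which link carried the message
            except ConnectionError:
                continue  # fall through to the next link
        raise ConnectionError("no link available")


primary = Link("primary")
backup = Link("backup")
service = MessagingService([primary, backup])

service.send("unit 17 dispatched")  # carried by the primary link
primary.up = False                  # simulate a primary connection failure
used = service.send("unit 17 on scene")
print(used, backup.delivered)
```

The application calls `send` the same way in both cases, which reflects the flexibility point as well: the communication responsibilities stay inside the messaging service.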

Dr. Asif Naseem is currently senior vice president and CTO for GoAhead Software. He has more than 18 years of experience in the computer and communications industries. Previously he served in senior-level positions at Motorola, where he established and managed a mobile applications business, and most recently at Iospan Wireless, a broadband wireless company that was acquired by Intel and L3. He started his career with AT&T Bell Laboratories where he held a variety of technical and management positions, and has an M.S. in electrical engineering and a Ph.D. in computer engineering from Michigan State University.

Defining high availability

Figure 1:
Availability (number of 9s)   Downtime/Year   Typical Application
99.9%                         ~9 hrs          Typical Desktop or Server
99.99%                        ~1 hr           Enterprise Server
99.999%                       ~5 mins         Carrier-Class Server
99.9999%                      ~31 secs        Carrier Switch Equipment

Anatomy of a Commercial Off-the-Shelf System

Figure 2 (layers listed from top to bottom):
  Application Interface (e.g., SAF AIS)
  COTS High-Availability and Management Middleware | Other Middleware
  Platform Interface (e.g., HPI)
  Operating Systems (e.g., CGL)
  Platform Hardware (e.g., cPCI, ATCA)