Researchers at the Università degli Studi di Napoli Federico II and at the Naples company SESM SCARL have developed a software tool that works at the operating system (OS) level to detect when a computer program “hangs”. Detecting the hang allows a safe exit from any given system without crashing the computer as a whole and forcing a reboot of important systems. Writing in the International Journal of Critical Computer-Based Systems, SESM’s Gabriella Carrozza and colleagues explain their detection framework. The framework allows the non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the OS level; the collected data are then combined to reveal hang failures automatically.
Faults in software represent a major threat to the smooth running of sophisticated computer systems, according to Carrozza and colleagues. Testing and static code analysis are used widely to help detect and remove “bugs” in a system during development. However, once a software system is in place and being used in a real-world application, any number of problems can still occur, some revealing bugs that were missed and others simply triggered by memory overloads and timing errors. Such problems can cause just one critical component of the system to “hang” without crashing the whole system, and without it being obvious to operators or users that there is a problem until it is too late.
Current software tools simply poll the health status of system components, or analyse system log files to uncover error messages and to correlate these with problematic memory or CPU component activity. However, they cannot spot “hangs” at the time they occur, because apart from the hung component the system may otherwise respond normally.
The new approach taken by the Italian team relies on several simple monitors that exploit OS support to trigger alarms when the behaviour of the system deviates from its nominal behaviour. “Our experimental results show that this framework increases the overall capacity of detecting hang failures, it exhibits a 100% coverage of observed failures, while keeping low the number of false positives, less than 6% in the worst case,” the team says. The latency between a hang occurring and its detection is about 0.1 seconds on average, while the impact on computer performance of running the hang-detection software is, they add, negligible.
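The idea of combining several simple monitors can be illustrated with a minimal Python sketch. This is not the team's implementation: the metric names, nominal ranges, weights and the weighted-sum combination rule are all hypothetical choices made for illustration. Each monitor raises an alarm when one observed value leaves its nominal range, and a hang is reported only when enough alarms coincide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Monitor:
    """One simple monitor: alarms when a metric leaves its nominal range."""
    name: str
    metric: str          # key into the sample dictionary (hypothetical name)
    low: float           # lower bound of nominal behaviour (assumed)
    high: float          # upper bound of nominal behaviour (assumed)
    weight: float = 1.0  # contribution of this alarm to the combined score

    def alarm(self, sample: Dict[str, float]) -> bool:
        value = sample[self.metric]
        return not (self.low <= value <= self.high)


def detect_hang(monitors: List[Monitor], sample: Dict[str, float],
                threshold: float) -> bool:
    """Combine individual alarms; report a hang if their weighted sum
    reaches the threshold. The combination rule is an assumption."""
    score = sum(m.weight for m in monitors if m.alarm(sample))
    return score >= threshold


# Hypothetical OS-level observations for one monitored process.
MONITORS = [
    Monitor("syscalls", "syscalls_per_s", low=10.0, high=10_000.0),
    Monitor("lock_wait", "max_lock_wait_s", low=0.0, high=2.0),
    Monitor("sched", "s_since_last_scheduled", low=0.0, high=1.0),
]

healthy = {"syscalls_per_s": 850.0, "max_lock_wait_s": 0.02,
           "s_since_last_scheduled": 0.01}
busy = {"syscalls_per_s": 900.0, "max_lock_wait_s": 3.5,
        "s_since_last_scheduled": 0.02}
hung = {"syscalls_per_s": 0.0, "max_lock_wait_s": 45.0,
        "s_since_last_scheduled": 0.01}

for label, sample in [("healthy", healthy), ("busy", busy), ("hung", hung)]:
    print(label, detect_hang(MONITORS, sample, threshold=2.0))
```

In this sketch a single alarm, as in the "busy" sample, is not enough to report a hang. Requiring agreement between monitors is one plausible way to keep false positives low while still catching a stalled process, though the actual combination used by the framework may differ.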