A new tool to manage slow faults allows real-time adjustment of computing systems

Slow faults are difficult to detect because performance degrades rather than failing. A study injected slow faults into a system to better understand the realistic conditions that impact slow faults. The research team then developed a new way to detect and address slow faults that dynamically adjusts in real time. Credit: Ruiming Lu

While computing systems are typically equipped to handle crashes, slow faults (situations in which a system component’s performance degrades rather than failing outright) can cause severe disruptions for applications like cloud computing, real-time video calls, streaming services and more.

New research led by the University of Michigan offers a solution. Adaptive Detection at Runtime (ADR) allows systems to adjust in real time, effectively addressing the sensitive and variable nature of slow faults.

“ADR is an elegant first step towards making adaptive fault tolerance even more practical. I am very excited to continue pushing the boundary of fault tolerance and handling with respect to novel, under-studied fault models like slow faults,” said Ruiming Lu, a graduate student of computer science at Shanghai Jiao Tong University and lead author of the study.

First, a new testing pipeline identified how slow faults impact distributed systems, in which a network of computers divides tasks among its members to complete a large processing job. The results guided the development of an adaptive library that dynamically adjusts responses to slow faults and reduces their negative effects.

“This work aims to enhance slow-fault detection and response mechanisms, offering valuable insights for developers striving to improve system resilience and robustness,” said Ryan Huang, a U-M associate professor of computer science and engineering and corresponding author of the study.

Until now, slow faults have been handled with static, overly conservative timeouts that rarely fire even under the most severe slow faults and fail to capture the nuances of how slow faults behave.

To better understand slow faults, the research team injected them into six widely used distributed systems, systematically varying their many dimensions, such as fault type, severity and location. This approach assessed a broader spectrum of realistic conditions than previous research, providing deeper insights into how different distributed systems manage slow faults.
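As a rough illustration of what varying fault type, severity and location can look like, the Python sketch below enumerates a fault matrix and wraps an operation so that each call is delayed. It is not the study's testing pipeline: the names `SlowFault`, `fault_matrix` and `with_slow_fault`, and the labels used for fault kinds and locations, are hypothetical.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class SlowFault:
    kind: str        # hypothetical label, e.g. "network" or "disk"
    location: str    # hypothetical label for the slowed node, e.g. "leader"
    delay_ms: float  # severity: extra latency added to every call


def fault_matrix(kinds, locations, delays_ms):
    """Enumerate every combination of fault type, location and severity."""
    return [
        SlowFault(kind, location, delay)
        for kind, location, delay in itertools.product(kinds, locations, delays_ms)
    ]


def with_slow_fault(operation, fault, sleep=time.sleep):
    """Wrap an operation so each call is delayed by the fault's severity.

    The sleep function is injectable so the wrapper can be tested
    without actually waiting.
    """
    def slowed(*args, **kwargs):
        sleep(fault.delay_ms / 1000.0)
        return operation(*args, **kwargs)
    return slowed
```

A test harness built this way would run the same workload once per entry in the matrix and record how much throughput or latency degrades for each combination.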

When analyzing the results from the testing pipeline, they found that nearly all systems have a “danger zone” where a slight increase in slow-fault severity results in a significant increase in performance degradation.
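One simple way to picture a danger zone is as the severity interval where measured degradation rises most steeply. The sketch below computes that interval from a list of measurements. The function name `find_danger_zone` is hypothetical, and this is only an assumed way of locating such a zone, not the analysis used in the study.

```python
def find_danger_zone(severities, degradations):
    """Return ((low, high), slope) for the steepest rise in degradation.

    severities must be strictly increasing; degradations[i] is the
    performance loss measured at severities[i].
    """
    if len(severities) != len(degradations) or len(severities) < 2:
        raise ValueError("need at least two matching measurements")
    best_interval, best_slope = None, float("-inf")
    for i in range(len(severities) - 1):
        step = severities[i + 1] - severities[i]
        if step <= 0:
            raise ValueError("severities must be strictly increasing")
        slope = (degradations[i + 1] - degradations[i]) / step
        if slope > best_slope:
            best_interval, best_slope = (severities[i], severities[i + 1]), slope
    return best_interval, best_slope


# Illustrative numbers only, not results from the study:
# injected delay in milliseconds and the fraction of throughput lost.
zone, slope = find_danger_zone([1, 10, 50, 100, 500], [0.01, 0.02, 0.05, 0.60, 0.65])
print(zone)
```

With these made-up measurements, the steepest rise falls between 50 ms and 100 ms of injected delay, which is where a static timeout tuned for much larger delays would offer no protection.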

“I was surprised to find that such subtle variations in fault severity could trigger dramatic changes in system behavior, underscoring the need for more adaptive and nuanced fault tolerance strategies,” said Huang.

With these findings in hand, the research team developed Adaptive Detection at Runtime to replace the static threshold mechanisms.

ADR works by monitoring the system’s response values and how often they change. Rather than using a hard cutoff, the library compares the current slowdown against historical values and flags as potential slow faults those that fall below the 99th percentile. As a final fail-safe, ADR cross-validates the flagged slow faults by making sure the response rate is continuously decreasing, which prevents false positives.
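The general idea of a history-based threshold followed by a trend check can be sketched in a few lines of Python. This is not the ADR library or its API. The class name, the parameters and the choice to track latency are all assumptions made for illustration: because the sketch measures latency rather than response rate, a sample counts as suspicious when it is slower than 99% of recent history, and it is confirmed only when several suspicious samples in a row keep getting worse.

```python
import math
from collections import deque


class AdaptiveSlowFaultDetector:
    """Hypothetical sketch of percentile-plus-trend slow-fault detection."""

    def __init__(self, window=1000, min_history=100, percentile=99.0, confirm=3):
        self.history = deque(maxlen=window)    # recent latency samples
        self.min_history = min_history         # samples needed before judging
        self.percentile = percentile
        self.confirm = confirm                 # worsening samples needed to confirm
        self.recent_suspects = deque(maxlen=confirm)

    def _threshold(self):
        """Nearest-rank percentile of the latency history."""
        ordered = sorted(self.history)
        rank = max(1, math.ceil(self.percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]

    def observe(self, latency_ms):
        """Record one sample; return True when a slow fault is confirmed."""
        if len(self.history) < self.min_history:
            self.history.append(latency_ms)
            return False  # warm-up: too little history to judge
        suspicious = latency_ms > self._threshold()
        self.history.append(latency_ms)
        if not suspicious:
            self.recent_suspects.clear()
            return False
        self.recent_suspects.append(latency_ms)
        if len(self.recent_suspects) < self.confirm:
            return False
        run = list(self.recent_suspects)
        # Cross-check: confirm only if every suspicious sample is worse than the last.
        return all(a < b for a, b in zip(run, run[1:]))
```

Because the threshold is recomputed from a sliding window, it follows the workload instead of relying on a fixed timeout. The warm-up branch also shows why a detector of this kind cannot judge samples right after startup: until enough history exists, there is nothing to compare against.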

Implementing ADR reduced performance degradation by 65% on average under varying slow-fault conditions and workloads, compared to baseline static thresholds. Slow faults were detected quickly, in 0.9 to 1.3 seconds on average.

Although successful, ADR has several blind spots: it cannot detect slow faults during system startup and may misidentify slow faults that occur during workload transitions. The researchers also note that using the tool requires some developer knowledge of where to check for slow faults.

Overall, ADR’s ability to dynamically adjust to changing conditions in real time could lead to more robust and efficient systems, reducing downtime and improving user experiences.

“This advancement opens up new possibilities for innovation in system design and fault tolerance, aligning with the growing demand for dependable digital infrastructure,” said Huang.

More information:
Full citation: “One-size-fits-none: Understanding and enhancing slow-fault tolerance in modern distributed systems,” Ruiming Lu, Yunchi Lu, Yuxuan Jiang, Guangtao Xue, and Peng Huang, USENIX Symposium on Networked Systems Design and Implementation (2025). www.usenix.org/conference/nsdi25/presentation/lu

Provided by
University of Michigan College of Engineering


Citation: A new tool to manage slow faults allows real-time adjustment of computing systems (2025, May 7), retrieved 7 May 2025.
