Conventional reliability improvement methods may not be efficient for Edge systems, where hardware and processing resources are limited. To address this gap, this paper proposes using the performance monitoring metrics of the central processing unit (CPU) in Edge devices to improve reliability. We use the PERF performance monitoring toolset together with the LLFI fault injection tool to inject a variety of fault models into a prototypical Edge processor while it runs MiBench benchmark programs. The injected faults yield a dataset that captures the behavior of the system under various reliability conditions. This dataset is then used to train machine learning models that support runtime monitoring and detection of possible fault situations on the Edge system. Our experiments show that the trained models can achieve a fault detection accuracy of 91.5%. Implementations of tiny machine-learning models show that accuracy can be kept above 90% while model summarization methods reduce the number of model parameters by more than 80%. © 2025 IEEE.
Funding: National Science Foundation (Grant Number: 2302537)
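As a rough illustration of the monitoring idea described in the abstract, the following Python sketch parses performance counter readings and flags runs whose counter-derived features deviate from fault-free behavior. It is a minimal sketch under stated assumptions, not the paper's implementation: the input is assumed to follow the comma-separated layout printed by `perf stat -x,` (count in the first field, event name in the third), the chosen events (`cycles`, `instructions`, `branch-misses`, `cache-misses`) are illustrative, and the z-score check is a much simpler stand-in for the trained machine learning models. All function names are hypothetical.

```python
import csv
import io
import statistics


def parse_perf_stat_csv(text):
    """Parse text in the assumed `perf stat -x,` layout into {event: count}."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comment lines, and rows too short to name an event.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0].strip(), row[2].strip()
        try:
            counts[event] = float(value)
        except ValueError:
            # Non-numeric placeholders such as "<not counted>" are ignored.
            continue
    return counts


def to_features(counts):
    """Turn raw counts into rates so runs of different length are comparable."""
    instructions = counts["instructions"]
    return [
        instructions / counts["cycles"],         # instructions per cycle
        counts["branch-misses"] / instructions,  # branch misses per instruction
        counts["cache-misses"] / instructions,   # cache misses per instruction
    ]


def fit_baseline(golden_runs):
    """Compute per-feature (mean, standard deviation) over fault-free runs."""
    columns = zip(*golden_runs)
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def is_suspect(features, baseline, k=3.0):
    """Flag a run if any feature lies more than k standard deviations from the mean."""
    return any(
        abs(x - mu) > k * max(sigma, 1e-12)
        for x, (mu, sigma) in zip(features, baseline)
    )
```

In use, `fit_baseline` would be called once on feature vectors from fault-free (golden) runs, and `is_suspect` on each new measurement window. A trained classifier of the kind the abstract describes would replace `is_suspect`, using labeled fault-injection runs instead of a fixed threshold `k`.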