Ensuring Federated Learning Reliability for Infrastructure-Enhanced Autonomous Driving

Acar, Benjamin; Sterling, Marius · 2023 · Crossref

DOI: 10.26599/jicv.2023.9210009

archive: archived · pipeline: cataloged · verified


Summary

This paper addresses reliability in Federated Learning (FL) systems in the context of infrastructure-enhanced autonomous driving. FL offers a decentralized approach to training machine learning models without compromising data privacy, but traditional server-based FL architectures have a single point of failure: if the central server that aggregates model updates fails, the entire network becomes inoperable and training halts. The authors propose an architecture that adds redundancy to the central server layer to provide high availability and fault tolerance. The solution is intended to support robust deployment of cooperative, connected, and automated mobility systems, such as those developed in the BeIntelli project.

To achieve this reliability, the authors use Kubernetes, an open-source orchestrator for container-based applications, together with Redis, an in-memory database. Rather than relying on a single instance, the method deploys multiple replicas of the global model across different nodes. The architecture uses Kubernetes StatefulSets to manage the stateful pods, so that each pod has its own persistent storage and a stable identity. The system consists of a principal pod with read-and-write access to the model storage and multiple backup pods with read-only access. These instances are kept synchronized by Redis’s built-in replication and are monitored by Redis Sentinel, which automatically selects a new principal pod if the current one fails. This setup removes the need for a custom synchronization algorithm such as Raft, because existing Kubernetes and Redis features maintain data consistency and availability.

The experimental evaluation used Kind, a tool for creating local Kubernetes clusters, to simulate the proposed architecture. The setup included three Redis instances (one principal, two backups), three Sentinel instances for monitoring, and REST API interfaces for communication. A simple multi-layer perceptron served as the test model. The experiments verified the system's ability to handle node failures and keep data synchronized. The redundant architecture kept the network available even when the principal server was intentionally removed: Sentinel identified the failure and promoted a backup instance to the principal role, so the FL process could continue without significant downtime or data loss.

The work contributes to the robustness of decentralized machine learning systems. By removing the single point of failure inherent in traditional server-based FL, the proposed architecture enables more dependable training environments, which matters most for safety-critical applications such as autonomous driving. According to the summary, building on widely adopted technologies such as Kubernetes and Redis makes the solution scalable, easy to integrate into existing infrastructure, and suitable for both small-scale and large-scale enterprise deployments. The findings suggest that orchestration-based redundancy is a viable strategy for improving the reliability of Federated Learning networks.
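The failover behavior described in the summary (one writable principal, read-only backups, promotion of a backup when the principal is lost) can be illustrated with a toy in-memory model. This is a minimal sketch, not Redis, Sentinel, or the authors' code: the names `Replica`, `ToyCluster`, `failover`, and the pod names `redis-0` to `redis-2` are hypothetical, the quorum of 2 is an assumption, and replication here is synchronous for simplicity, whereas Redis replication is asynchronous by default.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Replica:
    """Toy stand-in for one storage instance holding the global model."""
    name: str
    alive: bool = True
    store: Dict[str, List[float]] = field(default_factory=dict)


class ToyCluster:
    """One writable principal, read-only backups, quorum-based promotion."""

    def __init__(self, names: List[str], quorum: int = 2) -> None:
        self.replicas: Dict[str, Replica] = {n: Replica(n) for n in names}
        self.principal: str = names[0]  # the first instance starts as principal
        self.quorum = quorum            # monitors that must agree on a failure

    def write(self, key: str, weights: List[float]) -> None:
        """Write through the principal and copy the value to live backups."""
        node = self.replicas[self.principal]
        if not node.alive:
            raise ConnectionError("principal is down; failover required")
        for replica in self.replicas.values():
            if replica.alive:
                replica.store[key] = list(weights)

    def read(self, key: str, node: Optional[str] = None) -> List[float]:
        """Read from a named instance, or from the principal by default."""
        replica = self.replicas[node or self.principal]
        if not replica.alive:
            raise ConnectionError(f"{replica.name} is down")
        return replica.store[key]

    def failover(self, votes_down: int) -> Optional[str]:
        """Promote the first live backup if enough monitors report a failure."""
        if self.replicas[self.principal].alive or votes_down < self.quorum:
            return None
        for replica in self.replicas.values():
            if replica.alive:
                self.principal = replica.name
                return replica.name
        return None


# Three instances, mirroring the one-principal, two-backup setup in the summary.
cluster = ToyCluster(["redis-0", "redis-1", "redis-2"], quorum=2)
cluster.write("global_model", [0.1, 0.2])

cluster.replicas["redis-0"].alive = False   # remove the principal on purpose
ignored = cluster.failover(votes_down=1)    # below quorum: nothing happens
promoted = cluster.failover(votes_down=3)   # all three monitors agree

cluster.write("global_model", [0.3, 0.4])   # training continues on the new principal
```

After the promotion, writes go through the new principal and the remaining backup stays in sync, while the removed instance keeps only the value it held before it went down.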

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed.

Stage      Outcome  Tool               Model                    Prompt   Attempts  Completed
discover   success  Crossref           -                        -        1         2026-06-20
archive    success  unpaywall          -                        -        2         2026-06-26
extract    success  cached             -                        -        2         2026-06-26
clean      success  clean              -                        -        1         2026-06-20
chunk      success  chunk              -                        -        1         2026-06-20
embed      success  embed              Qwen/Qwen3-Embedding-8B  -        1         2026-06-20
enrich     success  openalex           -                        -        1         2026-06-20
promote    success  -                  -                        -        1         2026-06-20
summarize  success  llm                qwen3.6-27b-prismaquant  summ-v5  1         2026-06-26
tag        success  vector_similarity  -                        -        6         2026-06-20
verify     success  -                  -                        -        1         2026-06-26

Summary generated by qwen3.6-27b-prismaquant on 2026-06-26; verification: verified.

Topics

Ranked by relevance to this paper.