Saurav Kant Kumar

Title of the Talk

From Signals to Decisions: AI for Data Center Systems

Abstract

Data centers form the backbone of modern digital infrastructure, supporting large-scale cloud services, artificial intelligence workloads, and enterprise applications. These environments continuously generate vast and heterogeneous streams of operational data, including metrics, logs, and event signals. However, extracting meaningful insights from this data remains a significant challenge due to its scale, variability, and temporal complexity. Rather than treating system events as isolated occurrences, this work adopts a behavioral perspective in which reliability is understood through evolving patterns in system activity. It explores how artificial intelligence can be used to model and interpret behavior in large-scale data center environments. By analyzing deviations from learned baselines, AI models can identify emerging irregularities that often precede critical issues. The approach is demonstrated through real-world use cases, beginning with system-level anomaly detection, where infrastructure metrics and logs across compute, storage, and network layers are analyzed to capture deviations from normal operation. In addition, temporal analysis of alert sequences is used to uncover recurring patterns that characterize the progression of system-level events. Building on this system-level understanding, the work further examines component-level behavior through SSD modeling, where machine learning techniques are applied to SMART attributes, performance indicators, and operational metrics to capture patterns of degradation. Beyond detection, the work highlights the role of AI in enabling informed decision-making by providing early insights, prioritizing potential risks, and improving operational visibility. It also discusses practical challenges encountered in real-world deployments, including noisy and incomplete data, class imbalance, evolving system dynamics, and the need for interpretable and trustworthy models.
The work ultimately presents a shift toward intelligent data center systems, where AI-driven approaches enable continuous interpretation of system signals and support scalable, resilient, and adaptive infrastructure management.

Keywords: Data Center, System Behavior Modeling, Anomaly Detection, Temporal Sequence Analysis, Decision Intelligence, Explainable AI.

Brief Profile

I am currently pursuing a Ph.D. in Artificial Intelligence at the University of the Cumberlands. My research interests include predictive maintenance, explainable AI, generative AI, agentic AI, and scalable machine learning for high-performance computing and large-scale systems. I have over eight years of industry experience in artificial intelligence, working in roles ranging from Data Scientist to Machine Learning Engineer. My professional experience spans multiple domains, including Banking, Telecommunications, Healthcare, Manufacturing, Supply Chain, Oil & Gas, and High-Performance Computing. My research experience focuses on predictive maintenance in high-performance computing environments and the application of machine learning and explainable AI to large-scale infrastructure data.