Title of the Talk
AI Reliability Engineering: What Production Systems Teach Us That Benchmarks Don't
Abstract
Standard AI benchmarks evaluate model performance under controlled, stateless conditions that rarely reflect the complexity of real-world deployments. As LLMs move from research prototypes into production infrastructure, a significant class of reliability failures has emerged that benchmarks are structurally incapable of predicting. This talk draws on first-hand experience operating AI-augmented systems at enterprise scale to surface two categories of failure that the research community has underexplored: output inconsistency under compositional prompting, where LLM behavior degrades non-linearly as prompt complexity grows across pipelines; and cascading failures in multi-tool orchestration, where flaky tool calls and non-deterministic outputs compound across agentic workflows in ways that are difficult to observe and harder to recover from. We propose a framework for AI Reliability Engineering that adapts Site Reliability Engineering principles to AI system design. The framework includes defining error budgets for non-deterministic outputs, building observability pipelines for LLM behavior, and designing graceful degradation patterns calibrated to real failure modes rather than hypothetical ones. The goal is to offer the research community a practitioner's vocabulary for a problem that benchmarks have not yet learned to ask about.
Brief Profile
I am a Machine Learning Engineering Manager on the Machine Learning Platform team at Adobe, where I lead a team that architects and builds large-scale ML platforms supporting foundation model training at enterprise scale. With over a decade of experience spanning data infrastructure and machine learning platforms, I have led multiple initiatives from concept to production at both high-growth startups and Fortune 500 companies. I hold patents and have authored publications on ML frameworks, particularly around user-generated content. My expertise lies in designing scalable, resilient ML systems and advancing MLOps practices to streamline the entire lifecycle of machine learning applications. I am deeply passionate about building robust infrastructure that powers next-generation AI workloads.
