As AI systems transition from generative models to autonomous agents, the move from experimentation to production introduces significant challenges. Anannya Roy, Developer Advocate, Gen AI at Amazon Web Services, addressed these challenges at DevSparks Pune 2026, in a session titled ‘Getting Agentic Apps Ready for Production: Lessons in Observability and Evaluation.’ She highlighted why agentic applications often fail in production and outlined strategies to prevent such failures.
Roy emphasized the shift from generative AI to agentic AI, driven by the need for systems that can reason, plan, and act autonomously. “We wanted agents – systems that could reason, plan and act on our behalf,” she stated, noting the reduction in human oversight that comes with this shift.
However, this transition introduces complexities related to security, governance, scalability, and transparency. Roy pointed out that agentic systems’ non-deterministic nature can lead to inconsistent decision paths, misinterpretation of business rules, and exposure of sensitive data. These issues can cascade, resulting in hallucinations, faulty reasoning, poor response quality, and increased operational costs. Even minor adjustments, such as modifying a tool or switching models, can alter outcomes.
To address these challenges, Roy advocated for strong observability and evaluation frameworks that trace decisions, detect drift, and ensure agents remain reliable and transparent. She stressed that observability alone is insufficient; organizations must understand how to observe these systems and what to monitor.
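The tracing Roy describes can be illustrated with a minimal sketch. The schema below (session IDs, per-step tool names, rationales) is an assumption for illustration, not the structure of any specific AWS or AgentCore API: the point is simply that every agent decision is recorded in a replayable, auditable form.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class DecisionTrace:
    """One traced step in an agent's reasoning chain (hypothetical schema)."""
    session_id: str
    step: int
    tool: str        # which tool the agent selected at this step
    rationale: str   # the agent's stated reason for selecting it
    timestamp: float = field(default_factory=time.time)

class TraceLog:
    """Append-only log so decision paths can be replayed and audited later."""
    def __init__(self):
        self.entries = []

    def record(self, trace: DecisionTrace):
        self.entries.append(trace)

    def export(self) -> str:
        # Serialize for downstream evaluation pipelines or dashboards.
        return json.dumps([asdict(t) for t in self.entries], indent=2)

# Illustrative usage with made-up tool names.
log = TraceLog()
log.record(DecisionTrace("sess-1", 1, "flight_search", "user asked for flights"))
log.record(DecisionTrace("sess-1", 2, "budget_calc", "user gave a spend limit"))
```

In production, logs like this would typically flow to a tracing backend rather than an in-memory list, but the principle is the same: without a per-step record, drift and inconsistent decision paths cannot be diagnosed after the fact.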
Once agents are deployed, they generate vast volumes of logs that must be analyzed to understand why an agent took a particular action and whether the outcome was correct. Human oversight remains essential for evaluating agent behavior and guiding improvements, which makes structured evaluation critical: organizations must catch issues such as hallucinations or faulty reasoning before systems reach production.
Roy emphasized that evaluation must be continuous, involving setting evaluation parameters, identifying relevant logs, building test datasets, and re-running the cycle to monitor agent behavior in production. She demonstrated the use of multi-test agents to evaluate different use cases, including planning trips and recommending budgets, and showcased the Amazon Bedrock AgentCore platform for configuring evaluation metrics and monitoring agent behavior across multiple sessions.
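The cycle Roy outlines can be sketched in a few lines. Everything here is illustrative: the threshold, the token-overlap scorer (real systems typically use LLM judges or rubrics), and the stub agent standing in for a deployed one. The shape of the loop is what matters: set evaluation parameters, build a test dataset from relevant logs, run every case, and flag failures for the next iteration.

```python
RELEVANCE_THRESHOLD = 0.8  # evaluation parameter (assumed value)

def score_response(expected: str, actual: str) -> float:
    """Toy scorer: fraction of expected tokens present in the actual answer."""
    exp, act = set(expected.lower().split()), set(actual.lower().split())
    return len(exp & act) / len(exp) if exp else 0.0

def run_evaluation(test_dataset, agent_fn):
    """One pass of the cycle: run every case, flag those below threshold."""
    failures = []
    for case in test_dataset:
        actual = agent_fn(case["prompt"])
        score = score_response(case["expected"], actual)
        if score < RELEVANCE_THRESHOLD:
            failures.append({"prompt": case["prompt"], "score": score})
    return failures

# Hypothetical test dataset mirroring the demoed use cases.
dataset = [
    {"prompt": "plan a trip to Pune", "expected": "three day trip itinerary for Pune"},
    {"prompt": "recommend a budget", "expected": "recommended daily budget breakdown"},
]

def stub_agent(prompt: str) -> str:  # stands in for the deployed agent
    return "three day trip itinerary for Pune" if "trip" in prompt else "a playlist"

failures = run_evaluation(dataset, stub_agent)  # the budget case fails
```

Re-running this same loop on a schedule against fresh production logs is what turns a one-off test into the continuous monitoring Roy describes.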
In the production phase, Roy explained that readiness depends heavily on monitoring and evaluation. Teams must configure how the agent will be observed in real-world environments, selecting the deployed agent and defining multiple evaluators to test different scenarios and behavioral patterns. She suggested a hybrid evaluation approach, combining offline reviews by subject-matter experts with online analytics dashboards.
Monitoring depends on the optimization goal: whether the team is tuning the overall application or a particular event within it. For the agent itself, teams observe behavioral indicators such as tool selection and handling of multi-turn conversations, checking for context overload or incorrect contextual reasoning. At the application level, monitoring focuses on cost, latency, and response quality.
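The application-level signals mentioned above can be rolled up from per-call telemetry. This is a minimal sketch; the field names, the quality score, and the latency threshold are all assumptions, not metrics from any particular platform.

```python
from statistics import mean

ALERT_LATENCY_MS = 2000  # assumed service-level threshold

def summarize(events):
    """Aggregate per-call telemetry into the application-level signals
    from the talk: cost, latency, and response quality."""
    return {
        "total_cost_usd": round(sum(e["cost_usd"] for e in events), 4),
        "avg_latency_ms": mean(e["latency_ms"] for e in events),
        "avg_quality": mean(e["quality"] for e in events),
        "latency_alert": any(e["latency_ms"] > ALERT_LATENCY_MS for e in events),
    }

# Two illustrative calls, one of which breaches the latency threshold.
events = [
    {"cost_usd": 0.002, "latency_ms": 850, "quality": 0.9},
    {"cost_usd": 0.004, "latency_ms": 2300, "quality": 0.7},  # slow call
]
report = summarize(events)
```

A dashboard built on aggregates like these covers the application layer; the agent-level behavioral checks (tool choice, multi-turn handling) still require the trace-level logs discussed earlier.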
Roy underscored the importance of humans-in-the-loop, stating, “Sometimes humans are present, not by redundancy. They are there by choice, so use them.” Subject-matter experts review evaluation scores across different layers to identify the root causes behind failures. Re-running prompts and test cases helps detect performance drops or correctness changes.
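The re-run check Roy mentions amounts to a regression comparison: replay the same prompts, score the results, and compare against a baseline. The scores and the drop tolerance below are illustrative, not figures from the session.

```python
DROP_TOLERANCE = 0.05  # assumed acceptable score regression

def detect_regressions(baseline, current):
    """Compare per-prompt scores between two evaluation runs and
    return every prompt whose score dropped beyond the tolerance."""
    regressions = []
    for prompt, old_score in baseline.items():
        new_score = current.get(prompt, 0.0)
        if old_score - new_score > DROP_TOLERANCE:
            regressions.append((prompt, old_score, new_score))
    return regressions

# Hypothetical scores before and after a model swap.
baseline = {"plan a trip": 0.92, "recommend a budget": 0.88}
current  = {"plan a trip": 0.93, "recommend a budget": 0.61}

flagged = detect_regressions(baseline, current)  # the budget prompt is flagged
```

Flagged prompts are exactly what the subject-matter experts in the loop would triage, tracing each drop back to the layer (prompt, tool, or model change) responsible.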
Roy concluded by outlining a structured path to production: build the agent, deploy it, log every activity, and continuously monitor performance. By combining logs, structured evaluation frameworks, and expert oversight, organizations can refine agents and ensure they consistently take the right actions.