
A Weekend with OpenClaw : Musings and Learnings

April 26, 2026 by Ashwin

Maybe I’m late to the party. Maybe I missed the bus. But when I stumbled across this excellent mini-course, I couldn’t resist building my own “OpenClaw Assistant.”

This post is a journal of what I set out to do, what I actually did, how it went — and where I ended up.

What I set out to do with OpenClaw

My goal was to set up OpenClaw securely and build an “always-on assistant” I could reach through web chat and Telegram.

Oh, and the most important part: the whole experiment had to stay within my $30 budget.

What I actually did

I used this mini-course as my baseline and starting point.

Here’s what I procured:

Hostinger VPS account (~$15 for a one-month subscription)
OpenClaw setup on the VPS (included in the cost — follow this guide)
Anthropic Claude API credits (~$15)

The course is neatly broken into bite-sized steps, and here are the ones I chose to follow:

Install and secure OpenClaw so that only you and your authorized apps can access your instance (course link)
Give your OpenClaw assistant a personality, so it can recognize your style and personalize every interaction (course link)

First, I named my assistant — he will hereafter be known as “Garaka”

The personality is then shaped through 4 key files:

SOUL.md – defines the agent’s identity: his values, the way he communicates, and his hard limits
AGENT.md – the operating manual: checklists and rules the agent should follow for memory management and external content
USER.md – a briefing document: what the agent needs to know about you as a person, distinct from his own identity
MEMORY.md – the memory system: daily logs plus a curated long-term store that grows and improves over time

Before going further, take time to understand each of these files deeply — this step is crucial (course link)

Connect “Garaka” to a channel so you can communicate with him beyond the web chat (course link)

A Telegram bot is the fastest way to get started, and its APIs are quite mature

Enable the “always-on” part so you can schedule tasks, reminders, and more as cron jobs (course link)

In my case, I have Garaka send me a daily reflection reminder every evening at 9:30 PM.
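For readers new to cron: a daily 9:30 PM schedule corresponds to the standard five-field expression 30 21 * * *. How OpenClaw registers the job is not shown here; the sketch below only illustrates the timing logic in plain Python, and next_daily_run is a hypothetical helper of my own, not part of OpenClaw.

```python
from datetime import datetime, time, timedelta

# Standard cron expression for "every day at 21:30"
# (fields: minute hour day-of-month month day-of-week).
DAILY_REFLECTION_CRON = "30 21 * * *"


def next_daily_run(now: datetime, at: time = time(21, 30)) -> datetime:
    """Return the next datetime at which a daily job scheduled for `at` fires."""
    candidate = datetime.combine(now.date(), at)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2026, 4, 26, 20, 0)))
```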

Give Garaka skills so he has clear instructions to accomplish specific tasks (course link)

OpenClaw comes with a set of pre-defined skills
More skills are available on ClawHub and are quite straightforward to install
I gave him two skills — document-summary and quick-note — both handy and exactly what they sound like!

Give Garaka access to web search via the Brave API so he can research topics and deliver richer, more informed responses (course link)
Create sub-agents so Garaka can delegate tasks to specialist agents (course link)

Sub-agents are ephemeral workers — they operate in their own context, can have their own styles, and hand their outputs back to the main agent (the orchestrator)
This is a clean way to separate responsibilities and enable parallel execution
I created a “writer” sub-agent that specializes in researching a topic, drafting content, and passing it to the main agent for review and approval

I chose not to enable email access for now — I want to get more comfortable with Garaka before handing him the keys to my inbox. If you’d like to go further, this course link walks you through giving your assistant access to emails.

How it went

Overall, the entire setup took me about 3 hours. Here are a few observations along the way.

OpenClaw setup stalled and aborted a couple of times. I had to delete and recreate the Docker image before things ran smoothly
Expect significant model (LLM) usage during installation. I switched to a lower-tier model (Haiku 4.5) to keep costs down and stay within budget
⚠️ Watch out: Don’t make the mistake of selecting an OS when setting up your VPS. You must choose OpenClaw right from the start — otherwise you’ll need to configure it manually and lose access to the templated flow

Where I ended up with OpenClaw

Here is the outcome after the setup and configuration.

Working web interface to OpenClaw and chat interface to Garaka

Garaka rephrasing his working style after initial configuration

Telegram bot interface to Garaka

Daily scheduled job to send me a reflection reminder

Skills enabled for Garaka

To conclude…

Setting up a personalized agent in a controlled environment was a genuinely enriching experience — and the possibilities feel endless!
The OpenClaw Mini-course by Aishwarya is a lifesaver for anyone who doesn’t want to get lost in a sea of tutorials or wrestle with OpenClaw’s somewhat unintuitive interface as a beginner
Having a team of agents that learns your style, improves with every interaction, and integrates with your tools is a meaningful upgrade over plain LLM interactions!

Have you gone through this journey yourself? If so, I’d love to hear your thoughts and compare notes!

Agent Evals 101 : Building a Strong Evaluation Framework for AI Agents

February 18, 2026 by Ashwin

AI agents are taking enterprises by storm. Organizations are racing to build autonomous systems that reason, plan, and execute complex tasks, whether to solve real problems or to keep pace with competitors.

The promise is compelling: productivity gains and capabilities that seemed impossible months ago. But beneath the excitement lies a critical challenge: How do you actually evaluate and test AI agents?

This isn’t an engineering afterthought. It’s the difference between a reliable system and an unpredictable black box that works brilliantly one moment and fails catastrophically the next. Unlike traditional software, AI agents are probabilistic and context-dependent—our old testing playbooks don’t apply.

Yet evaluation remains the unattractive, ignored piece of the puzzle. Teams prototype excitedly while pushing “How do we know this works?” to later sprints—or ignoring it entirely.

This is a mistake organizations can’t afford to make.

How are Agent Evals different from Traditional Software Testing?

Here are 5 key differences between AI Agent testing and traditional software testing:

Non-deterministic outputs: Traditional software produces the same output for the same input every time. AI agents can generate different responses to identical prompts, making reproducible test cases nearly impossible and requiring statistical evaluation across multiple runs instead of single pass/fail checks.
Emergent behavior and edge cases: While traditional software fails in predictable ways based on code paths, AI agents can exhibit unexpected behaviors in novel situations they weren’t explicitly trained for. The failure modes are often subtle, contextual, and impossible to enumerate in advance.
Multi-step reasoning and chain-of-thought: Traditional tests verify individual functions or API calls. AI agents execute complex, multi-step workflows where each decision influences the next, requiring evaluation of the entire reasoning chain—not just the final output—to understand where and why things went wrong.
Subjective quality metrics: Traditional software testing relies on objective criteria like “does this API return a 200 status code?” AI agent outputs often require human judgment to evaluate quality, relevance, tone, and appropriateness—metrics that are inherently subjective and context-dependent.
Tool use and external interactions: Traditional software testing typically mocks external dependencies for isolation. AI agents dynamically choose which tools to use, when to use them, and how to chain them together, making it critical to test not just if they can use tools correctly, but whether they make good decisions about when and how to use them in the first place.
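To make the first difference concrete, here is a minimal sketch of what "statistical evaluation across multiple runs" can look like: repeat the same check many times and report a pass rate with a confidence interval instead of a single pass/fail. The function name, the number of runs, and the simulated agent are all assumptions for illustration; the interval is a standard Wilson score interval.

```python
import math
import random
from typing import Callable


def pass_rate_with_interval(run_once: Callable[[], bool],
                            n_runs: int = 20,
                            z: float = 1.96) -> tuple[float, float, float]:
    """Run a non-deterministic check n_runs times.

    Returns (pass_rate, low, high), where [low, high] is a Wilson score
    interval (z = 1.96 corresponds to roughly 95% confidence).
    """
    passes = sum(1 for _ in range(n_runs) if run_once())
    p = passes / n_runs
    denom = 1 + z**2 / n_runs
    centre = (p + z**2 / (2 * n_runs)) / denom
    half = z * math.sqrt(p * (1 - p) / n_runs + z**2 / (4 * n_runs**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Simulated stand-in for "call the agent and grade its answer".
rng = random.Random(0)  # seeded so the sketch is repeatable
rate, low, high = pass_rate_with_interval(lambda: rng.random() < 0.8, n_runs=50)
print(f"pass rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

A release decision can then be phrased against the lower bound (for example, "the lower bound must stay above the agreed threshold") rather than against one lucky run.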

Building Blocks of Agent Evals

An Agent Evaluation system typically consists of these core building blocks:

Agents – The system under test that works toward specific objectives using tools and LLMs. While production systems often involve multiple agents, we treat them as a single entity for evaluation purposes.

Evaluation Suite – A comprehensive set of tasks or test cases that assess agents across various parameters and metrics. This is your complete testing framework.

Tasks – Individual test cases within the evaluation suite, each targeting specific agent capabilities or goals. Common metrics include task success rate, accuracy, latency, and cost per task.

Outcomes – The measurable outputs produced by agents during task execution. These represent what the agent actually accomplished and form the basis for evaluation—analogous to system test results in traditional software.

Graders – Automated or human evaluators that score outcomes against predefined rubrics, ground truth data, or reference outputs. Graders determine whether the agent succeeded and how well it performed.

While your evaluation system may include additional components depending on your specific testing needs, these building blocks provide a solid foundation for assessing any agentic system.
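One way to picture how these building blocks fit together is the small sketch below. It is not taken from any particular framework: the type names, the exact-match grader, and the toy agent are all assumptions chosen to keep the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable

Agent = Callable[[str], str]          # system under test: prompt in, outcome out
Grader = Callable[[str, str], float]  # (outcome, reference) -> score in [0, 1]


@dataclass
class Task:
    name: str
    prompt: str
    reference: str
    grader: Grader


@dataclass
class EvalSuite:
    tasks: list[Task] = field(default_factory=list)

    def run(self, agent: Agent) -> dict[str, float]:
        """Execute every task against the agent and return one score per task."""
        scores = {}
        for task in self.tasks:
            outcome = agent(task.prompt)
            scores[task.name] = task.grader(outcome, task.reference)
        return scores


def exact_match(outcome: str, reference: str) -> float:
    """Simplest possible automated grader: case-insensitive string equality."""
    return 1.0 if outcome.strip().lower() == reference.strip().lower() else 0.0


def toy_agent(prompt: str) -> str:
    # Toy agent, present only so the sketch runs without an LLM.
    return "paris" if "France" in prompt else "unknown"


suite = EvalSuite([
    Task("capital-fr", "What is the capital of France?", "Paris", exact_match),
    Task("capital-jp", "What is the capital of Japan?", "Tokyo", exact_match),
])
print(suite.run(toy_agent))
```

In a real system the grader slot is where rubric-based, LLM-based, or human scoring would plug in.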

Eval-Driven Development (EDD) for Agentic Systems

Inspired by Test-Driven Development (TDD), which has improved software engineering for decades, Eval-Driven Development (EDD) places systematic evaluations at the core of the AI development lifecycle—not as an afterthought, but as the foundation.

Here are 4 core principles for a robust EDD approach:

1. Define success upfront – Establish clear, quantifiable metrics aligned with business goals before writing a single line of code. Define what “good” looks like: accuracy thresholds, acceptable latency, hallucination rates, task completion criteria.

2. Build comprehensive eval suites – Create test datasets covering common scenarios, edge cases, and adversarial examples. Your evals should stress-test your agents the way real-world usage will.

3. Evaluate continuously – Run your evaluation suite with every prompt change, model upgrade, or system modification. Treat evals like CI/CD—continuous measurement is the only way to catch regressions early and validate improvements.

4. Use multiple evaluation methods – Combine human grading for nuanced judgment, code-based assertions for objective metrics, and LLM-as-judge for scalable quality assessment. No single evaluation method captures the full picture.
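Principle 3 can be sketched as a simple regression gate that a CI job might run after each eval pass. The metric names, the tolerance, and the "higher is better" convention are assumptions for illustration, not part of any specific tool.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the metrics on which the candidate regressed.

    Assumes every metric is "higher is better". A metric missing from the
    candidate run is treated as a regression.
    """
    failures = []
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None or new_value < base_value - tolerance:
            failures.append(metric)
    return failures


baseline = {"task_success": 0.90, "groundedness": 0.80}
candidate = {"task_success": 0.91, "groundedness": 0.70}
regressed = regression_gate(baseline, candidate)
if regressed:
    print("Blocking release, regressed metrics:", regressed)
```

The tolerance exists because agent scores fluctuate between runs; without it, the gate would fail on noise.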

Guided by these core principles, here’s a practical workflow for designing an evaluation system for your agentic application:

Understand the business problem – Start with clarity on what success looks like for your stakeholders. What specific tasks must the agent accomplish? What quality bar must it meet? What failures are unacceptable?
Assemble representative examples – Gather real data related to the problem: actual user queries, historical cases, common workflows, and known failure modes. Your eval dataset should mirror production reality.
Build an end-to-end V0 system – Create a working prototype quickly, even if crude or incomplete. You need something functional to evaluate before you can improve it.
Label data and build initial evals – Create ground truth labels for your examples and write your first evaluation metrics. Start simple: basic success/failure criteria and a handful of quality measures.
Align evals to business metrics – This is critical. Map your technical metrics (accuracy, latency, tool usage) to actual business outcomes (customer satisfaction, cost savings, error reduction). If your evals don’t predict business value, they’re measuring the wrong things.
Iterate on both system and evals – Improve your agent based on eval results, but also refine your evals as you discover what matters. Your evaluation framework should evolve alongside your system.
Integrate into your development workflow – Embed evals into CI/CD, code reviews, and release processes. Make evaluation a continuous practice, not a pre-launch gate.

This creates a virtuous flywheel: better evals surface real problems, fixes improve performance, and improved performance validates your eval framework.

Agent Eval Frameworks in Practice

The ecosystem of agent evaluation tools is rapidly maturing. Here are three leading frameworks that can accelerate your evaluation efforts:

1. LangSmith (by LangChain)

LangSmith provides end-to-end observability and evaluation for LLM applications with excellent tracing of multi-step agent workflows. Its strength lies in visualizing complex agent reasoning chains and tight integration with the LangChain ecosystem, supporting human-in-the-loop evaluation, automated scoring, and A/B testing.

2. Braintrust

Braintrust is an enterprise-grade evaluation platform focused on production AI systems with strong CI/CD integration and powerful analytics. It excels at experiment tracking and comparing different agent architectures side-by-side, supporting code-based assertions, LLM-as-judge, and human review workflows.

3. Promptfoo

Promptfoo is an open-source, configuration-driven framework that emphasizes simplicity and flexibility for teams wanting full control. Its lightweight, local-first design and YAML-based test definitions make it ideal for rapid iteration and easy integration into CI/CD pipelines.

Key Takeaways

Evaluation is non-negotiable. As AI agents become embedded in enterprise systems, a robust evaluation framework isn’t optional—it’s the foundation for reliability, trust, and measurable business impact.
Adopt Eval-Driven Development. Just as TDD transformed software engineering, EDD must become your default approach for AI development. Build evaluations first, then build agents that pass them.
Connect evals to business outcomes. Technical metrics matter only if they predict business value. Continuously align your evaluation framework with what stakeholders actually care about, and refine it as you gather more production data.
Leverage the ecosystem. The evaluation tooling landscape is evolving rapidly. Use open-source and commercial frameworks to accelerate your evaluation maturity rather than building everything from scratch.
The organizations that master agent evaluation today will be the ones deploying reliable, high-value AI systems tomorrow. Start building your evaluation practice now.

Disclaimer: The ideas, opinions, and recommendations expressed in this blog post are solely my own and do not reflect the views, policies, or positions of my employer.

Enterprise AI Agents: Why Governance is Your Competitive Advantage

December 13, 2025 by Ashwin

Building AI agents in an enterprise extends far beyond prototypes and POCs. When designing agents for real users who drive revenue, governance becomes critical.

This post explores the essential governance pillars for integrating AI agents into enterprise production applications.

In the world of Agent Management Platforms, four governance pillars strengthen the successful rollout of AI agents in your organization.


Pillars of AI Agents Governance

Evaluation – How do you evaluate the correctness and completeness of agent outputs?
Security – How do you protect agents against malicious attacks and breaches?
Guardrails – How do you set boundaries and checkpoints, often tied to your business and compliance rules?
Auditability – How do you reconstruct everything that happened before an agent produced its outputs?

For this article, I will draw examples of each pillar from the open-source CrewAI agentic framework. The same ideas can be implemented with similar frameworks such as LangGraph or Mastra.

AI Agents Evaluation

Evaluation in AI agent systems serves a similar purpose to testing in traditional software development—validating that the system performs as intended.

However, AI agents built on large language models are non-deterministic systems. Unlike conventional software, where the same input consistently produces the same output, AI agents may generate different responses across multiple runs. This variability requires specialized evaluation approaches.

Understanding Non-Deterministic Behavior

The non-deterministic nature of LLM-based agents stems from temperature settings, sampling methods, and the probabilistic nature of language generation.

While this enables creative problem-solving and natural interactions, it means traditional unit tests with exact output matching are insufficient. Instead, evaluation must assess whether outputs meet quality standards and functional requirements rather than matching predetermined values.

Two Complementary Evaluation Approaches

Effective AI agent evaluation employs two methodologies: subjective and objective evaluation.

Subjective Evaluation: LLM-as-Judge

Subjective evaluation leverages other LLMs to review, critique, and score agent outputs. This approach assesses qualities difficult to measure programmatically, such as relevance, coherence, helpfulness, and tone.

An LLM judge evaluates whether an agent’s response appropriately addresses the user’s intent, maintains a consistent persona, or demonstrates sound reasoning.

The judge receives the original task, the agent’s output, and specific evaluation criteria. It then generates a critique and assigns scores, revealing subtle issues like logical inconsistencies or inappropriate confidence levels that traditional metrics might miss.
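A minimal sketch of that judge flow is shown below. It is framework-agnostic and does not use CrewAI's API: the prompt wording, the 1 to 5 scale, and the call_llm parameter (any function that sends a prompt to a model and returns its text) are assumptions. A canned fake judge keeps the example runnable offline.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI agent's response.
Task: {task}
Response: {output}
Criteria: {criteria}
Reply with JSON only: {{"score": <integer 1-5>, "critique": "<one or two sentences>"}}"""


def judge_output(task: str, output: str, criteria: str,
                 call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model to critique and score an agent output."""
    prompt = JUDGE_TEMPLATE.format(task=task, output=output, criteria=criteria)
    verdict = json.loads(call_llm(prompt))
    score = int(verdict["score"])
    # Reject malformed verdicts instead of silently recording them.
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return {"score": score, "critique": str(verdict.get("critique", ""))}


def fake_judge(prompt: str) -> str:
    # Canned reply standing in for a real model call.
    return '{"score": 4, "critique": "Addresses the request; tone is slightly informal."}'


result = judge_output(
    task="Summarize the refund policy for a customer.",
    output="You can get your money back within 30 days, no worries!",
    criteria="Accuracy, completeness, professional tone.",
    call_llm=fake_judge,
)
print(result)
```

Parsing the verdict into a validated structure matters in practice, because judge models occasionally return text that does not follow the requested format.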

Objective Evaluation: Gold Standard Comparison

Objective evaluation relies on gold standard datasets—curated collections of inputs paired with known correct outputs. The process compares agent outputs against these benchmarks using quantifiable metrics such as accuracy, precision, recall, or task-specific success criteria.

For example, if an agent extracts structured information from documents, objective evaluation measures how accurately it identifies required fields compared to human-annotated examples.

For multi-step workflows, evaluation verifies that the agent completes all necessary steps in the correct sequence.
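Both checks can be sketched in a few lines. The field names and step names below are made-up examples; the first function computes field-level precision and recall against a human-annotated record, and the second verifies that the required steps occurred in the right relative order (extra steps in between are allowed).

```python
def field_precision_recall(predicted: dict, gold: dict) -> tuple[float, float]:
    """Compare extracted fields against a gold record.

    precision = correct fields / fields the agent produced
    recall    = correct fields / fields in the gold record
    """
    correct = sum(1 for key, value in predicted.items() if gold.get(key) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def steps_in_order(executed: list[str], required: list[str]) -> bool:
    """True if every required step appears in `executed`, in the same relative order."""
    remaining = iter(executed)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in required)


gold = {"invoice_no": "A-17", "total": "120.00", "currency": "EUR"}
predicted = {"invoice_no": "A-17", "total": "125.00"}
print(field_precision_recall(predicted, gold))
print(steps_in_order(["fetch", "log", "parse", "save"], ["fetch", "parse", "save"]))
```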

Implementation in Practice

Modern agent frameworks like CrewAI provide built-in evaluation capabilities combining both approaches.

A typical workflow includes running the agent multiple times with the same test cases, applying LLM judges for subjective quality assessment, comparing outputs against gold standards, and aggregating results to identify patterns and issues.

Evaluation results inform iterative improvement. Teams can adjust agent instructions, refine prompts, modify tool configurations, or retrain components based on performance gaps. This continuous evaluation and refinement process is essential for maintaining reliable AI agent systems in production.

AI Agents Security

AI agents face unique security challenges that differ from traditional software vulnerabilities. As agents interact with users, access tools, and process data autonomously, they become targets for malicious attacks designed to manipulate their behavior or extract sensitive information.

Common Security Threats

Prompt Injection occurs when attackers craft inputs that override an agent’s original instructions, causing it to ignore safety guidelines or perform unintended actions. These attacks exploit the agent’s natural language interface to introduce malicious directives disguised as legitimate requests.

Tool Misuse happens when agents are manipulated into using their capabilities inappropriately—such as accessing unauthorized data, executing harmful commands, or making API calls that violate business rules. Since agents have direct access to tools and systems, compromised decision-making can have immediate consequences.

Data Poisoning involves corrupting the training data or knowledge bases that agents rely on, causing them to learn incorrect patterns or biased behaviors. Attackers may inject false information into retrieval systems or databases that agents query.

Context Poisoning targets the agent’s working memory by inserting misleading information into conversation history or retrieved documents. This can cause the agent to make decisions based on false premises or manipulated context.

Identity Spoofing exploits weak authentication mechanisms, allowing attackers to impersonate legitimate users or systems. Agents may then grant unauthorized access or perform actions on behalf of fake identities.

Security Countermeasures

Protecting AI agents requires multiple defensive layers. Input validation and sanitization filters suspicious patterns before they reach the agent’s core reasoning. Output monitoring detects when responses deviate from expected behavior or contain sensitive information leaks.

Tool access controls implement permission boundaries, ensuring agents can only invoke authorized functions with appropriate parameters. Context isolation separates user inputs from system instructions, making it harder for malicious prompts to override core directives.
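As an illustration of a permission boundary, the sketch below wraps tools in a registry that checks the caller's role before every invocation. The class, role names, and tools are hypothetical and not drawn from any framework; real platforms expose their own mechanisms for this.

```python
from typing import Any, Callable


class ToolPermissionError(Exception):
    """Raised when a role tries to invoke a tool it is not authorized for."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self._allowed: dict[str, set[str]] = {}  # role -> permitted tool names

    def register(self, name: str, fn: Callable[..., Any], roles: list[str]) -> None:
        self._tools[name] = fn
        for role in roles:
            self._allowed.setdefault(role, set()).add(name)

    def invoke(self, role: str, name: str, **kwargs: Any) -> Any:
        # Deny by default: unknown roles and unknown tools are both rejected.
        if name not in self._allowed.get(role, set()):
            raise ToolPermissionError(f"role {role!r} may not call tool {name!r}")
        return self._tools[name](**kwargs)


registry = ToolRegistry()
registry.register("lookup_order", lambda order_id: {"id": order_id, "status": "shipped"},
                  roles=["support", "admin"])
registry.register("issue_refund", lambda order_id: f"refunded {order_id}", roles=["admin"])

print(registry.invoke("support", "lookup_order", order_id="42"))
```

The key design choice is that the check lives outside the model: even if a prompt injection convinces the agent to request issue_refund, the call fails unless the session's role permits it.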

Authentication and authorization frameworks verify user identities and enforce role-based access controls before agents process requests. Audit logging tracks all agent interactions, tool usage, and decision points for forensic analysis and compliance.

Adversarial testing proactively simulates attacks to identify vulnerabilities before deployment. Security-focused evaluation frameworks can test agent resilience against known attack patterns and edge cases.

Modern agent platforms are increasingly incorporating these security primitives as built-in features, but organizations must still configure them appropriately and maintain vigilance as new attack vectors emerge.

AI Agents Guardrails

AI agents operating autonomously in production environments require guardrails—protective boundaries that ensure safe, compliant, and aligned behavior. Unlike traditional software with hardcoded logic paths, agents make dynamic decisions that need real-time constraints to prevent harmful or inappropriate actions.

Types of Guardrails

Input Guardrails filter and validate incoming requests before they reach the agent’s reasoning engine. These prevent processing of prohibited content, malicious prompts, or queries outside the agent’s intended scope. Input guardrails can reject requests containing personally identifiable information (PII), offensive language, or topics the agent shouldn’t address.
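A rule-based input guardrail can be as simple as the sketch below. The two regular expressions and the blocked-topic list are deliberately simplistic assumptions; production PII detection needs far broader coverage than this.

```python
import re

# Illustrative patterns only; real PII detection covers many more formats.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
BLOCKED_TOPICS = ("medical diagnosis", "legal advice")  # hypothetical scope rules


def check_input(text: str) -> list[str]:
    """Return the reasons a request should be rejected (empty list means allowed)."""
    reasons = [f"contains {name}" for name, pattern in PII_PATTERNS.items()
               if pattern.search(text)]
    lowered = text.lower()
    reasons += [f"out-of-scope topic: {topic}" for topic in BLOCKED_TOPICS
                if topic in lowered]
    return reasons


print(check_input("What is your refund policy?"))
print(check_input("My SSN is 123-45-6789 and I need legal advice."))
```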

Output Guardrails scan agent responses before delivery to users, blocking content that violates policies or quality standards. These catch hallucinations, inappropriate recommendations, leaked sensitive data, or responses that contradict business rules. Output guardrails ensure the agent doesn’t make unauthorized commitments or provide advice beyond its mandate.

Behavioral Guardrails constrain how agents use tools and make decisions during execution. These include spending limits on API calls, restrictions on which databases can be accessed, approval requirements for certain actions, and constraints on autonomous decision-making scope. For example, an agent might require human approval before financial transactions exceed a threshold.
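The approval example above might be sketched like this. The threshold, the daily limit, and the Decision structure are assumed values for illustration; the point is that the agent proposes an action and a deterministic policy decides whether it runs, waits for a human, or is refused.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    needs_human_approval: bool
    reason: str


def review_payment(amount: float, spent_today: float,
                   approval_threshold: float = 500.0,
                   daily_limit: float = 2000.0) -> Decision:
    """Decide whether an agent-initiated payment may proceed."""
    # Hard stop: the daily budget can never be exceeded, approved or not.
    if spent_today + amount > daily_limit:
        return Decision(False, False, "daily spending limit exceeded")
    # Soft stop: large single payments are routed to a human.
    if amount > approval_threshold:
        return Decision(True, True, "amount above approval threshold")
    return Decision(True, False, "within autonomous limits")


print(review_payment(amount=120.0, spent_today=0.0))
print(review_payment(amount=800.0, spent_today=0.0))
print(review_payment(amount=800.0, spent_today=1500.0))
```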

Contextual Guardrails adapt constraints based on user roles, conversation context, or operational conditions. An agent might have different permissions for internal employees versus external customers, or operate under stricter rules during high-risk scenarios.

Implementation Approaches

Guardrails can be implemented through rule-based systems that enforce explicit policies, LLM-based classifiers that evaluate content against nuanced criteria, or hybrid approaches combining both. Modern agent frameworks provide guardrail APIs that intercept agent workflows at key checkpoints—before tool execution, after response generation, and during multi-step reasoning chains.

Effective guardrails balance safety with functionality. Overly restrictive guardrails frustrate users and limit agent utility, while insufficient guardrails expose organizations to risk. Continuous monitoring and adjustment ensure guardrails remain calibrated to organizational needs.

AI Agents Auditability

Auditability establishes transparency and accountability for AI agent operations. As agents make autonomous decisions that impact business outcomes, organizations need comprehensive records of what agents did, why they did it, and what information influenced their decisions.

Essential Audit Components

Component	Why	What to capture?
Decision Trails	Understand why an agent reached a particular conclusion or took a specific action.	Capture the agent’s reasoning process, including which tools were invoked, what information was retrieved, how the agent weighted different factors, and what alternatives were considered.
Interaction Logs	Understand the complete context of each interaction, supporting compliance reviews, quality assurance, and issue resolution.	Capture all user inputs, agent outputs, and conversation flows. Logs should include timestamps, user identifiers, session metadata, and any relevant business context.
Tool Usage Records	Support security monitoring, cost management, and understanding of the agent’s operational footprint across enterprise systems.	Document every external system call, API request, database query, and file access performed by the agent.
Performance Metrics	Identify patterns that indicate degraded performance, emerging issues, or opportunities for improvement.	Track success rates, response times, error frequencies, and user satisfaction scores.
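To make the table more tangible, here is one possible shape for a structured audit record, emitted as one JSON object per line. The field names and event types are assumptions, not a standard schema; adapt them to whatever your observability stack expects.

```python
import json
import uuid
from datetime import datetime, timezone


def audit_record(session_id: str, user_id: str, event_type: str, payload: dict) -> str:
    """Build one audit log line as a JSON string."""
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "user_id": user_id,
        "event_type": event_type,  # e.g. "user_input", "tool_call", "agent_output"
        "payload": payload,
    }
    # sort_keys keeps lines stable and easy to diff or index.
    return json.dumps(record, sort_keys=True)


line = audit_record(
    session_id="sess-001",
    user_id="user-42",
    event_type="tool_call",
    payload={"tool": "web_search", "query": "refund policy", "latency_ms": 310},
)
print(line)
```

Because each line is self-describing, the same records can feed decision trails, tool usage reports, and performance dashboards without separate logging paths.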

Compliance and Governance

Auditability supports regulatory compliance by providing evidence of proper agent behavior. Industries with strict oversight—such as finance, healthcare, and legal services—require detailed records demonstrating that agents operated within approved parameters and didn’t make unauthorized decisions.

Audit data also enables retrospective analysis when issues arise. If an agent provides incorrect information or makes a problematic decision, audit trails allow teams to reconstruct the event, identify root causes, and implement corrective measures.

Implementation Considerations

Effective audit systems must balance comprehensiveness with storage and privacy concerns. Logs should capture sufficient detail for meaningful analysis while protecting sensitive information. Retention policies should align with regulatory requirements and business needs.

Modern agent platforms provide structured logging frameworks that automatically capture key events in standardized formats. Organizations should integrate these logs with existing observability and compliance systems, enabling centralized monitoring and analysis across all AI agent deployments.

Conclusion

AI agents represent a fundamental shift in how enterprises deliver value—moving from deterministic software to autonomous systems that reason, decide, and act on behalf of organizations. This power comes with responsibility. Without robust governance, agents can make costly mistakes, expose sensitive data, violate policies, or erode user trust.

The four governance pillars—evaluation, security, guardrails, and auditability—form an integrated framework for deploying AI agents safely and effectively in production environments.

Evaluation ensures agents perform reliably despite their non-deterministic nature, combining subjective LLM-based assessment with objective benchmarking to validate quality before and after deployment.

Security protects against emerging threats unique to AI systems, from prompt injection to identity spoofing, implementing defensive layers that safeguard both the agent and the systems it accesses.

Guardrails establish boundaries that keep agents aligned with organizational policies and ethical standards, constraining behavior without sacrificing the flexibility that makes agents valuable.

Auditability provides transparency into agent operations, creating decision trails that support compliance, enable root cause analysis, and build stakeholder confidence in autonomous systems.

Together, these pillars transform AI agents from experimental prototypes into trustworthy enterprise tools. Organizations that invest in governance upfront accelerate adoption, reduce risk, and unlock the full potential of AI agents to drive business value. As agents become more capable and autonomous, governance won’t be optional—it will be the foundation of successful AI implementation.