Making my Own Agent, Part 2: The Agentic Loop

Prelude

  In the previous post I talked about the basics of an agent and why I decided to create my own. While I encourage you to go back to read the previous post, here's a refresher: an agent allows an LLM to interact with the outside world. I'm doing this for two main reasons, the first being that there is no great open source AI troubleshooting agent today and the second being that I find it helpful for my day-to-day activities.

  I talked a little about the technical parts in the earlier post but it was all generic and pertained to all agents, in this post I am going to talk about how my AI agent works, from how it receives a signal to how it investigates to how it escalates or presents its RCA.

Chain of Thought

Traditionally, LLMs worked that you asked a question and it answered immediately by just spitting out an answer. Then in January of 2022 a paper came out detailing chain of thought (CoT), where you prompt the LLM to reason. For example instead of saying "Joe had 3 sodas and buys 3 more, each pack has 12 soda cans, how many soda cans does Joe have?", the LLM might answer 50. Whereas, if you prompt it to think and reason through the steps like with a one-shot:


Problem: Liam had 5 cold sodas in his fridge. Later, he went to the store
    and grabbed 2 cases of soda. Since every case holds 12 sodas, how many total
    sodas does Liam have now?
    
A: First, we need to find out how many
    new sodas he bought: 2 packs multiplied by 12 sodas per pack equals 24
    sodas. Then we add those to the ones he already had: 5 + 24 = 29. He has 29
    sodas in total.
    
    
Problem: Joe had 3 sodas and buys 3 more, each pack has 12 soda
    cans, how many soda cans does Joe have?"

it'll answer:


    First, we need to find out how many new soda cans he bought: 3 packs
    multiplied by 12 soda cans per pack equals 36 sodas. Then we add those to
    the ones he already had:
    
    3+36=39. He has 39 soda cans in total.

The research has shown the longer it thinks the better the results, but reasoning time grows exponentially and returns diminish.

ReAct: Reasoning and Action

Then in October 2022, a paper came out detailing a framework called ReAct (Reasoning and Action). We already talked about the reasoning part above, the second part (Action) is exactly as it sounds like — an LLM can do and take certain actions to retrieve knowledge. ReAct allows the LLM to enrich its knowledge like retrieving the weather. This is done by giving LLMs tools to retrieve information. For example, if you were to ask an LLM if you should wear a jacket:

  • User Prompt: "Should I wear a jacket right now?"

  • Thought 1: I need to know the user's location to check the weather.

  • Action 1: get_current_location()

  • Observation 1: {"location": "Seattle, WA"}

  • Thought 2: Now I need to check the weather for Seattle.

  • Action 2: get_weather(location="Seattle, WA")

  • Observation 2: {"temperature": "52°F", "condition": "Light rain"}

  • Thought 3: 52°F and raining is chilly. A jacket is definitely needed.

  • Final Answer: Yes, you should wear a jacket. It's currently 52°F and raining in Seattle.


This ReAct loop is the foundation my agent is built on. Here's how it plays out with a real alert.

How my agent works

Receiving the alert

In order for the agent to be aware of the problem, you first need a receiver. My agent currently has two receivers, one from Prometheus and one from Google Cloud alerts. I have a basic HTTP listener that receives JSON from either Prometheus or GCP to a certain path. These receivers are protected by HTTP basic auth to help prevent any probers from accidentally triggering it. Here is what a GCP alert looks like:


  {
  "version": "1.2",
  "incident": {
    "incident_id": "incident-123",
    "state": "open",
    "started_at": 1609459200,
    "ended_at": 0,
    "summary": "CPU usage is above threshold",
    "policy_name": "High CPU Policy",
    "condition_name": "CPU > 80%",
    "severity": "CRITICAL",
    "url": "https://console.cloud.google.com/...",
    "resource": {
      "type": "gce_instance",
      "labels": {
        "instance_id": "1234567890",
        "zone": "us-central1-a"
      }
    },
    "metric": {
      "type": "compute.googleapis.com/instance/cpu/utilization",
      "displayName": "CPU utilization",
      "labels": {
        "instance_name": "web-server-01"
      }
    },
    "observed_value": "0.95",
    "threshold_value": "0.8",
    "metadata": {
      "system_labels": {},
      "user_labels": {
        "env": "prod",
        "team": "platform"
      }
    }
  }
}

Agent skills

The real brains for the agent is that I give it agent skills to query Google Cloud logs and metrics. This gives the agent a script to query the API and instructions on how to use it properly. For example in the instructions I have guidance to use smaller time frames like 30m instead of 2h to help limit context window usage and pulling unnecessary metrics for an event that has only happened recently. If you have a custom metric system or a system that there isn't enough training data on, you can create a skill and the agent can now interact with it and get the necessary information. You can see how I load agent skills in my repo and view my cloud metrics and cloud logs skills on my GitHub

Keeping it safe

We have all heard frightening stories about agents wiping production databases and deleting production infrastructure. I wanted to make it hard for this to happen for me, there are two approaches I use: the first being that I give the agent the least amount of permissions, so this means giving it read only permission. The second and more important is having an allowlist or blocklist (the agent supports both) — with a blocklist you can block the most dangerous commands, with an allowlist you can allow only the verified/non-dangerous commands. I also added a third option that allows the LLM to mark a command as dangerous, if it is, it's automatically blocked. This option is safer than just letting the agent run wild but there are still dangers, we've seen horrifying stories with a similar option in Claude Code, perhaps this is a prompting problem and fine tuning the prompt may make it safer but only time will tell. For reference, here's what an allowlist may look like:


  export AGENT_SAFETY_COMMAND_WHITELIST_JSON='[
    {
      "pattern": "^terraform\\s+plan",
      "description": "terraform plan (read-only)"
    },
    {
      "pattern": "^helm\\s+(list|status|get)",
      "description": "helm read commands",
      "case_insensitive": true
    },
    {
      "pattern": "^make\\s+",
      "exclude_pattern": "(clean|deploy|destroy)",
      "description": "make targets except destructive ones"
    }
  ]'

The full safety implementation is in the repo if you want to dig into it.

Putting it all together

Okay now that we have the inner workings let's trace how it works with an alert:
1. An alert comes in from GCP authenticated with basic auth
2. The LLM analyzes the alert seeing it's a CPU usage alert.
3. The LLM calls the bash tool to run gcloud describe on the node
4. The LLM sees it has 32 cores.
5. The LLM calls the Google Cloud metrics skill looking at the CPU utilization metric
    a. it filters based on node name and time of alert
6. The LLM calls the Google Cloud logs skill looking at the node
7. It sees the CPU started jumping 10 minutes ago and various logs
8. The agent calls the Google Cloud logs at the time CPU utilization started jumping, noticing a jump in Apache logs.
9. The agent calls Google Cloud logs filtering by the Apache process.
10. The agent creates a quick Python script to extract IPs and sort by count
11. The agent calls complete investigation with findings and an RCA of a certain IP hitting the website thousands of times per minute

As you can see the agent keeps collecting information, reasoning on it then refining what information it gathers. It loops through this process until it gets to the root cause, or if it hits a decision that it can't find the root cause it may escalate the investigation with its current findings. We also cap the amount of actions it can take so it doesn't get in an infinite loop just eating up tokens, and once it hits the max amount of actions/turns it escalates with partial findings.

Conclusion

As you can see an agent is basically just a for loop doing actions and reasoning about the output from those actions, the agent continues doing this until it feels confident that it found the root cause providing the operator with findings and the root cause. I believe this agent is a much needed improvement and tool in the world of troubleshooting, taking pressure off from front line troubleshooters. If you want to view my code and see the features you can look at it on my github here.

Comments

Popular posts from this blog

Beyond Static Logging: Introducing Dynamic Log Levels in Go

My website Services stack.

How we sped up a Postgresql procedure 50X