ARES Core Concepts
ARES is a reinforcement learning framework that trains the LLM inside the agent, not the whole agent. Observations are LLM requests; actions are LLM responses. The concepts below explain how those pieces fit together.
Key distinction
Two different things go by “agent” in ARES:
- Code Agent (static) — The orchestration logic that uses a Container and LLM to solve tasks (e.g. MiniSWECodeAgent). This is part of the environment and stays fixed during training. It’s the scaffold that defines how the LLM interacts with code.
- Agent / Policy (trained) — What you actually train: a function that maps LLMRequest → LLMResponse. This can be a fine-tuned LLM, a prompt optimizer, or any policy that produces better responses. This is what improves via RL.
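As a rough sketch (not ARES source), the two roles reduce to these shapes: the code agent exposes run() and lives inside the environment, while the policy is any async callable from LLMRequest to LLMResponse that your training loop updates.

class CodeAgent:  # static orchestration, part of the environment
    async def run(self, task: str) -> None:
        ...  # build prompts, call the LLM client, run commands in the container

async def policy(request: "LLMRequest") -> "LLMResponse":  # trained, lives in your RL loop
    ...  # returns better responses as training progresses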
System architecture
Your training loop and the ARES environment interact like this:
- Your RL policy receives an observation (an LLMRequest), produces an action (an LLMResponse), and passes it to env.step(action).
- CodeEnvironment holds a QueueMediatedLLMClient that intercepts LLM calls from the code agent. Requests go onto a queue and become observations; your policy’s responses are fed back and unblock the agent.
- CodeAgent (e.g. MiniSWECodeAgent) reasons about the task, calls the LLM (and blocks on the queue until your policy responds), and runs commands in the container. It loops until done.
- Container (Docker or Daytona) is the isolated environment where the agent runs bash, edits files, and executes code.
So: the code agent is inside the environment; you train a policy that supplies the LLM’s answers at each step.
Environment
CodeEnvironment wraps the task, container, and code agent into one RL environment. It:
- Manages a Container for isolated execution.
- Manages a CodeAgent that does the orchestration.
- Exposes LLM requests as observations (by intercepting calls from the agent).
- Treats LLM responses as actions (your policy provides them).
Standard RL loop:
async with env:
    timestep = await env.reset()
    while not timestep.last():
        action = await your_policy(timestep.observation)  # observation = LLMRequest
        timestep = await env.step(action)                 # action = LLMResponse
    # timestep.reward is the final step reward
Each reset() or step() returns a TimeStep with: step_type (FIRST / MID / LAST), observation (an LLMRequest, or None on termination), reward, and discount.
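A sketch of consuming those TimeStep fields during a rollout, assuming the loop above; collect_return() and your_policy() are placeholders for your own training code.

async def collect_return(env, your_policy):
    # Roll out one episode and accumulate the discounted return from TimeSteps.
    async with env:
        timestep = await env.reset()                          # step_type == FIRST
        ret, scale = 0.0, 1.0
        while not timestep.last():
            action = await your_policy(timestep.observation)  # LLMRequest
            timestep = await env.step(action)                 # LLMResponse
            ret += scale * timestep.reward
            scale *= timestep.discount
        return ret                                            # LAST: observation is None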
CodeAgent
A CodeAgent implements the orchestration: it has a Container (to run shell commands) and an LLMClient (to talk to the model). Minimal interface:
- async def run(self, task: str) -> None — Run the agent for the given task.
Typical flow: receive a task → build an LLM request → call await llm_client(request) → parse the response (e.g. commands) → run them in the container → repeat until done.
The agent doesn’t know it’s in an RL loop. When it calls await llm_client(request), it blocks. The client is actually a QueueMediatedLLMClient: it puts the request in a queue, the environment exposes it as an observation, your policy returns an LLMResponse via env.step(action), and that response is given back to the agent. So you train the LLM while the agent code stays simple and linear.
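A conceptual illustration of that hand-off follows; it is a sketch assuming an asyncio queue and a per-request future, not the real QueueMediatedLLMClient implementation.

import asyncio

class QueueMediatedClientSketch:
    # Stands in for QueueMediatedLLMClient to show the mechanism only.
    def __init__(self):
        self.requests: asyncio.Queue = asyncio.Queue()

    async def __call__(self, request):
        # Called by the code agent; parks the request and blocks until answered.
        response_future = asyncio.get_running_loop().create_future()
        await self.requests.put((request, response_future))  # env turns this into an observation
        return await response_future                         # env.step(action) resolves this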
Available agents include MiniSWECodeAgent (mini-swe-agent, Jinja2 prompts, markdown command parsing). You can implement your own by wiring your logic to Container and QueueMediatedLLMClient.
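A minimal custom agent might look like the sketch below. The run() signature, the LLM client call, and exec_run() follow the interfaces described here; imports are omitted, build_messages(), extract_command(), and the output attribute on ExecResult are hypothetical placeholders for your own prompting and parsing, and exec_run() is shown as a synchronous call (adjust if your container implementation is async).

class MyCodeAgent:
    def __init__(self, container, llm_client):
        self.container = container    # a Container
        self.llm_client = llm_client  # a QueueMediatedLLMClient during training

    async def run(self, task: str) -> None:
        messages = build_messages(task)                      # hypothetical prompt builder
        while True:
            response = await self.llm_client(LLMRequest(messages=messages))
            command = extract_command(response)              # hypothetical parser
            if command is None:                              # model signalled it is done
                return
            result = self.container.exec_run(command, workdir="/workspace")
            messages.append({"role": "user", "content": result.output})  # assumed ExecResult field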
Container
A Container is the isolated environment where the code agent runs commands, edits files, and runs code. Interface:
- start(env), stop()
- exec_run(command, workdir, env, timeout_s) → ExecResult
- upload_files(local_paths, remote_paths), download_files(remote_paths, local_paths)
Implementations: DockerContainer (local Docker, good for dev and single-machine runs) and DaytonaContainer (cloud, resource limits, auto-cleanup, good for production). The environment creates and manages the container; you usually don’t touch it directly.
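For completeness, here is a sketch of driving the Container interface by hand, e.g. to debug an image outside a training run. The method names match the interface above; the DockerContainer constructor arguments and the file paths are assumptions.

container = DockerContainer(image="python:3.12")  # constructor args assumed
container.start(env={"PYTHONUNBUFFERED": "1"})
container.upload_files(["./solve.py"], ["/workspace/solve.py"])
result = container.exec_run("python /workspace/solve.py", workdir="/workspace", timeout_s=60)
print(result)  # an ExecResult
container.download_files(["/workspace/out.txt"], ["./out.txt"])
container.stop()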
LLMClient
An LLMClient gives a single interface for LLM calls: async __call__(request: LLMRequest) -> LLMResponse.
- LLMRequest: messages (OpenAI-style chat messages), optional temperature.
- LLMResponse: chat_completion_response, cost.
In the RL loop, timestep.observation is the LLMRequest the code agent wants to send (the “state” your policy sees). The action you pass to env.step() is an LLMResponse (how your policy controls the agent).
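Putting the two types together, a policy might look like the sketch below; my_model is a placeholder, the import of LLMResponse is omitted, and the OpenAI-style dict is an assumption about the exact shape of chat_completion_response.

async def your_policy(request):                   # request: LLMRequest
    text = await my_model.generate(request.messages)  # hypothetical model call
    return LLMResponse(
        chat_completion_response={
            "choices": [{"message": {"role": "assistant", "content": text}}],
        },
        cost=0.0,
    )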
Implementations: ChatCompletionCompatibleLLMClient (real API calls, OpenAI-compatible, retries, cost tracking), QueueMediatedLLMClient (the one that enables the RL abstraction; see How It Works), and MockLLMClient (fixed responses for tests).
Full documentation
The full ARES docs (API details, more examples, implementing your own agents) are on Read the Docs: