Setup Agent

Open Source · ICSE-NIER '26 · Rio de Janeiro

Automating The First Step.

Not a Dockerfile generator. An autonomous dual-model agent that reasons, acts, and verifies inside isolated Docker containers — turning hours of Java project setup into minutes of autonomous execution.

1. Install
git clone https://github.com/Codegass/Setup-Agent.git && cd Setup-Agent && uv sync
2. Setup a Project
sag project https://github.com/apache/commons-cli.git
84.4%
Build Success Rate
o4-mini + GPT-4.1-mini configuration
5.5×
Cost Reduction
vs. GPT-5 + GPT-5, with higher accuracy
12.09
Think-to-Act Ratio
Validates the dual-model separation
48.4%
Test Execution Rate
Average across all configurations
"Role specialization beats raw capacity — pairing a chain-of-thought Thinking Model with a smaller Action Model achieves higher success at 5.5× lower cost than uniformly applying powerful models."
Core Finding · SAG Paper · ICSE-NIER '26
Setup Agent
Dual-Model ReAct Engine

Separates reasoning from execution. A chain-of-thought Thinking Model plans while a smaller Action Model translates plans into tool invocations via function calling. The 12.09 thinking-to-action ratio validates this separation.
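The split can be sketched in a few lines. This is an illustrative Python skeleton, not SAG's actual API; the `thinking_model` / `action_model` callables and the tool-call shape are assumptions:

```python
# Minimal sketch of one dual-model ReAct turn (names are illustrative,
# not SAG's real interfaces): the Thinking Model plans in prose, the
# smaller Action Model translates the plan into a concrete tool call.

def react_turn(thinking_model, action_model, tools, observation):
    # 1. Reason: chain-of-thought model analyzes the latest observation
    plan = thinking_model(f"Observation:\n{observation}\nWhat next?")
    # 2. Act: smaller model maps the plan onto a tool invocation
    call = action_model(plan, tools=list(tools))
    # 3. Execute the chosen tool; its output becomes the next observation
    return tools[call.name](**call.args)
```

The point of the separation is that step 2 is cheap: translating an already-worked-out plan into a function call needs far less capacity than producing the plan.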

Container-Native Isolation

SAG operates entirely within Docker containers — not generating Dockerfiles, but interactively exploring and configuring inside them. Zero host pollution, fully reproducible.

File-Based Validation

Examines concrete evidence on disk — .class files after compilation, surefire XML test reports, build artifacts. Prevents hallucinations where the agent believes it succeeded without actual completion.

Trunk & Branch Context

Trunk Context maintains global state and task lists. Branch contexts capture subtask details. When a subtask ends, results merge back to Trunk — enabling complex chains without context loss.
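The pattern can be illustrated with a small sketch (the data shapes are assumptions, not SAG's internals): a Branch accumulates a subtask's verbose history, and only a compact summary merges back into the Trunk.

```python
# Illustrative Trunk/Branch context pattern: detail stays in the
# branch; the trunk keeps global state plus compact results.

class TrunkContext:
    def __init__(self, goal):
        self.goal = goal
        self.task_results = {}          # global state survives subtasks

    def open_branch(self, task):
        return BranchContext(self, task)

class BranchContext:
    def __init__(self, trunk, task):
        self.trunk, self.task = trunk, task
        self.messages = []              # verbose detail stays here

    def record(self, msg):
        self.messages.append(msg)

    def merge_back(self, summary):
        # detail is discarded; the trunk keeps only the compact result
        self.trunk.task_results[self.task] = summary
```

Because only summaries flow upward, the Thinking Model's context stays small even across long subtask chains.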

Hierarchical Tool-Belt

Layered tools from low-level (Bash) through build specialists (Maven, Gradle) to high-level orchestrators (Project Analyzer). Graceful fallback when specialized tools encounter unexpected scenarios.
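The fallback behavior amounts to trying the most specialized layer first and descending the stack. A sketch (the dispatch logic is an illustration, not SAG's implementation):

```python
# Graceful fallback across tool layers: a tool that cannot handle
# the scenario signals so, and the next (lower-level) tool is tried.

def run_with_fallback(task, layers):
    """Try the most specialized tool first; fall back down the stack."""
    for tool in layers:                 # e.g. [analyzer, maven, bash]
        try:
            return tool(task)
        except NotImplementedError:     # tool can't handle this scenario
            continue
    raise RuntimeError(f"no tool could handle: {task}")
```

In the worst case everything degrades to raw Bash, which can always attempt the task even without specialized knowledge.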

Agent State Evaluator

Analyzes Agent Memory to detect problematic patterns — repeated identical actions indicating the agent is stuck in a loop. Provides corrective feedback to break cycles autonomously.
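The simplest such pattern, a run of identical actions, can be detected with a sliding window. A sketch (the window size and message format are assumptions):

```python
# Stuck-loop detection over agent memory: flag `window` identical
# consecutive actions and emit a corrective signal.

def detect_loop(actions, window=3):
    """True if the last `window` actions are all identical."""
    recent = actions[-window:]
    return len(recent) == window and len(set(recent)) == 1

def corrective_feedback(actions):
    if detect_loop(actions):
        return ("You have repeated the same action several times "
                "without progress; try a different approach.")
    return None
```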

How SAG Works.
User Input
  • Repository URL
  • Goal description
when all tasks finished
Setup Report
  • Build & test outcome
  • Per-task summary
SAG Agent · ReAct Loop
Thinking Model
  • Reasoning
  • Plan next action
o4-mini / GPT-5
action plan →
Action Model
  • Function calling
  • Tool invocation
GPT-4.1-mini / o4-mini
All tasks finished?
  • Yes → emit Setup Report
  • No → back to Thinking
← observation
Observation
  • Tool Execution Results
  • Project Running Status
  • + Validation feedback ↓
use tool
tool output
Docker Container
Tool Sets
  • Bash · Maven · Gradle
  • Sys Package · Analyzer
File System
Agent Memory
  • Trunk: general, tasks
  • Branch: detail context
Java Project Files
  • src/ · pom.xml · build.xml
  • target/ · surefire reports
feedback closes the loop → written back into Observation
Validation System · Rule-Based, Out-of-Band
File Validator
  • Reads Java project files directly
  • .class files, surefire XML, JARs
  • Ground truth over model claims
Agent State Evaluator
  • Reads Agent Memory directly
  • Detects stuck / repeating loops
  • Produces corrective signals

SAG has three tightly coupled subsystems. The ReAct Loop alternates between reasoning (Thinking) and tool use (Action); on each turn it checks whether all tasks are finished and either emits the Setup Report or continues. The Docker Container isolates execution and hosts a shared file system containing both the agent's hierarchical memory and the Java project files. The Validation System runs out of band: it reads files and memory directly from the container and writes an independent observation back into the ReAct loop, so each iteration is anchored to ground truth rather than the model's own claims.

Thinking Model
  • Chain-of-thought reasoning
  • Analyze running status
  • Examine tool results
  • Read validation feedback
  • Plan recovery on failure
o4-mini / GPT-5 · 16,584 tokens avg
plan →
ReAct Loop
← observe
Action Model
  • Translate plan → tools
  • Execute in container
  • Capture output
GPT-4.1-mini / o4-mini · 1,372 tokens avg
tool calls
Termination
  • All tasks done? → validate & report
  • Not done? → next iteration

The extreme 12.09 thinking-to-action token ratio validates the architectural separation — the Action Model primarily serves as a command translator, using only 1,372 tokens while the Thinking Model uses 16,584 for complex reasoning. This asymmetry means smaller action models achieve higher success rates at 5.5× lower cost.
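The ratio follows directly from the per-turn averages quoted above:

```python
# The 12.09 thinking-to-action ratio is just the quotient of the
# reported average token counts per iteration.
thinking_tokens = 16_584   # Thinking Model, avg tokens per iteration
action_tokens = 1_372      # Action Model, avg tokens per iteration
ratio = thinking_tokens / action_tokens
print(round(ratio, 2))     # 12.09
```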

Host Machine
  • SAG Agent Process
  • ReAct Engine
  • Context Manager
  • Tool Belt
  • Validator
Your machine is never modified.
Nothing installed on host.
Docker API
Docker Container · /workspace
Project Files
  • src/ pom.xml build.xml
  • target/ reports/
Tool Sets · ↕ fallback
  • Bash (low)
  • Maven / Gradle (mid)
  • Sys Package (mid)
  • Analyzer (high)
Agent Memory
  • Trunk: goal, task list, progress
  • Branch: subtask detail
  • ↕ merge back on done

SAG operates within Docker containers rather than synthesizing Dockerfiles. The agent interactively explores the project, installs dependencies, resolves version conflicts, and builds, all inside the container. Disposable · Reproducible · Safe for untrusted code.

Tool Results
  • exit codes
  • stdout / stderr
  • build logs
Physical Evidence
  • .class files
  • surefire/*.xml
  • compiled JARs
File Validator
  • Reads actual files, not model claims
  • "Model says success" vs "XML shows 3 fail"
Agent State Evaluator
  • Reads agent memory
  • Detects stuck loops
  • Breaks repeat cycles
combined
Observations
  • = tool results
  • + file validation
  • + state evaluation
Fed back to Thinking Model each iteration

LLMs hallucinate — especially about success. The File Validator uses the file system as ground truth, not the model's interpretation of command output. This dual-source feedback means even smaller, cheaper models drive the action loop reliably, enabling the 84.4% success rate with mini models.

Hallucination-Resistant by Design

Chenhao Wei

Stevens Institute of Technology
Google Scholar ↗
First Author

Gengwu Zhao

Stevens Institute of Technology
Google Scholar ↗

Xinyi Li

Stevens Institute of Technology

Billy Ye

Stevens Institute of Technology

Lu Xiao

Stevens Institute of Technology
Google Scholar ↗
Corresponding Author
Setup AGent (SAG): A Dual-Model LLM Agent for Autonomous End-to-End Java Project Configuration
Chenhao Wei, Gengwu Zhao, Xinyi Li, Billy Ye, and Lu Xiao
ICSE-NIER '26 · IEEE/ACM 48th International Conference on Software Engineering · April 2026 · Rio de Janeiro, Brazil
DOI: 10.1145/3786582.3786818 ↗
@inproceedings{wei2026sag,
  title     = {Setup AGent (SAG): A Dual-Model LLM Agent for Autonomous End-to-End Java Project Configuration},
  author    = {Wei, Chenhao and Zhao, Gengwu and Li, Xinyi and Ye, Billy and Xiao, Lu},
  booktitle = {ICSE-NIER '26},
  year      = {2026},
  doi       = {10.1145/3786582.3786818}
}