The First Step.
Not a Dockerfile generator. An autonomous dual-model agent that reasons, acts, and verifies inside isolated Docker containers — turning hours of Java project setup into minutes of autonomous execution.
git clone https://github.com/Codegass/Setup-Agent.git && cd Setup-Agent && uv sync
sag project https://github.com/apache/commons-cli.git
"Role specialization beats raw capacity: pairing a chain-of-thought Thinking Model with a smaller Action Model achieves higher success at 5.5× lower cost than uniformly applying powerful models."
Separates reasoning from execution. A chain-of-thought Thinking Model plans while a smaller Action Model translates plans into tool invocations via function calling. The 12.09:1 thinking-to-action token ratio supports this separation.
SAG operates entirely within Docker containers — not generating Dockerfiles, but interactively exploring and configuring inside them. Zero host pollution, fully reproducible.
Examines concrete evidence on disk — .class files after compilation, surefire XML test reports, build artifacts. Prevents hallucinations where the agent believes it succeeded without actual completion.
Trunk Context maintains global state and task lists. Branch contexts capture subtask details. When a subtask ends, results merge back to Trunk — enabling complex chains without context loss.
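The Trunk/Branch split described above can be pictured with a minimal sketch. The class names, fields, and the choice to merge only a short summary back into the trunk are assumptions made for illustration, not SAG's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class BranchContext:
    """Detailed working notes for one subtask (hypothetical structure)."""
    task: str
    notes: list[str] = field(default_factory=list)

    def log(self, note: str) -> None:
        self.notes.append(note)


@dataclass
class TrunkContext:
    """Global goal, task list, and per-task result summaries (hypothetical)."""
    goal: str
    tasks: list[str]
    results: dict[str, str] = field(default_factory=dict)

    def open_branch(self, task: str) -> BranchContext:
        if task not in self.tasks:
            raise ValueError(f"unknown task: {task}")
        return BranchContext(task=task)

    def merge(self, branch: BranchContext, summary: str) -> None:
        # Assumption: only a short summary returns to the trunk, so the
        # global context stays small while the branch holds the detail.
        self.results[branch.task] = summary

    def pending(self) -> list[str]:
        return [t for t in self.tasks if t not in self.results]


trunk = TrunkContext(
    goal="build and test commons-cli",
    tasks=["clone repo", "install JDK", "run mvn test"],
)
branch = trunk.open_branch("clone repo")
branch.log("git clone exited with code 0")
trunk.merge(branch, "repository cloned")
print(trunk.pending())  # ['install JDK', 'run mvn test']
```

Keeping the task list on the trunk means the next subtask can be chosen without rereading any branch's detailed notes.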
Layered tools from low-level (Bash) through build specialists (Maven, Gradle) to high-level orchestrators (Project Analyzer). Graceful fallback when specialized tools encounter unexpected scenarios.
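One way to read "graceful fallback" is a chain that tries the most specialized tool first and drops to Bash when nothing else applies. The sketch below assumes that reading; the tool functions and their return strings are hypothetical stand-ins, not SAG's tool API.

```python
from typing import Callable, Optional

# A tool takes a request string and returns a result, or None when it
# cannot handle the request. All tool names here are hypothetical.
Tool = Callable[[str], Optional[str]]


def analyzer_tool(request: str) -> Optional[str]:
    return "analyzer: detected Maven project" if request == "analyze" else None


def maven_tool(request: str) -> Optional[str]:
    return f"maven: ran '{request}'" if request.startswith("mvn ") else None


def bash_tool(request: str) -> Optional[str]:
    # Lowest layer: accepts anything, so the chain always terminates.
    return f"bash: ran '{request}'"


def dispatch(request: str, layers: list[Tool]) -> str:
    """Try tools from most to least specialized; the first result wins."""
    for tool in layers:
        result = tool(request)
        if result is not None:
            return result
    raise RuntimeError("no tool could handle the request")


LAYERS = [analyzer_tool, maven_tool, bash_tool]
print(dispatch("mvn test", LAYERS))    # maven: ran 'mvn test'
print(dispatch("ls target/", LAYERS))  # bash: ran 'ls target/'
```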
Analyzes Agent Memory to detect problematic patterns — repeated identical actions indicating the agent is stuck in a loop. Provides corrective feedback to break cycles autonomously.
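A simple form of the stuck-loop check can be sketched by counting how often the same action appears in a recent window of the action history. The window size, the threshold, and the feedback wording are illustrative assumptions. This version counts repeats anywhere in the window rather than requiring them to be consecutive.

```python
from collections import Counter
from typing import Optional


def detect_loop(actions: list[str], window: int = 6, threshold: int = 3) -> Optional[str]:
    """Return the action seen at least `threshold` times in the last `window` steps."""
    recent = actions[-window:]
    if not recent:
        return None
    action, count = Counter(recent).most_common(1)[0]
    return action if count >= threshold else None


def corrective_feedback(actions: list[str]) -> Optional[str]:
    """Turn a detected loop into a message for the next reasoning step."""
    repeated = detect_loop(actions)
    if repeated is None:
        return None
    return (f"'{repeated}' has been repeated without progress; "
            "try a different approach.")


history = ["mvn test", "mvn test", "cat pom.xml", "mvn test"]
print(corrective_feedback(history))
```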
User Input
- Repository URL
- Goal description
Setup Report
- Build & test outcome
- Per-task summary
Thinking Model
- Reasoning
- Plan next action
Action Model
- Function calling
- Tool invocation
All tasks finished?
- Yes → emit Setup Report
- No → back to Thinking
Observation
- Tool Execution Results
- Project Running Status
- Validation feedback
Tool Sets
- Bash · Maven · Gradle
- Sys Package · Analyzer
Agent Memory
- Trunk: general, tasks
- Branch: detail context
Java Project Files
- src/ · pom.xml · build.xml
- target/ · surefire reports
File Validator
- Reads Java project files directly
- .class files, surefire XML, JARs
- Ground truth over model claims
Agent State Evaluator
- Reads Agent Memory directly
- Detects stuck / repeating loops
- Produces corrective signals
SAG has three tightly coupled subsystems. The ReAct Loop alternates between reasoning (Thinking) and tool use (Action); on each turn it checks whether all tasks are finished and either emits the Setup Report or continues. The Docker Container isolates execution and hosts a shared file system containing both the agent's hierarchical memory and the Java project files. The Validation System runs out of band: it reads files and memory directly from the container and writes an independent observation back into the ReAct loop, so each iteration is anchored to ground truth rather than the model's own claims.
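The control flow of the loop can be sketched as follows, with the two models and the validator replaced by plain functions. Every name and signature here is a hypothetical stand-in; the point is only the ordering: check completion, think, act, then record a result only after an independent check confirms it.

```python
from typing import Callable


def react_loop(
    tasks: list[str],
    think: Callable[[list[str], str], str],
    act: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_steps: int = 10,
) -> dict[str, str]:
    """Alternate think -> act -> validate until every task is verified."""
    report: dict[str, str] = {}
    observation = "no actions taken yet"
    for _ in range(max_steps):
        pending = [t for t in tasks if t not in report]
        if not pending:                     # all tasks finished: emit the report
            return report
        task = think(pending, observation)  # stand-in for the Thinking Model
        result = act(task)                  # stand-in for the Action Model
        if validate(task, result):          # independent check, not the model's claim
            report[task] = result
            observation = f"{task}: verified"
        else:
            observation = f"{task}: validation failed, result was {result!r}"
    if all(t in report for t in tasks):
        return report
    raise RuntimeError("step budget exhausted before all tasks finished")


# Stub behaviour: the compile step fails once, then succeeds on retry.
attempts = {"compile": 0}


def think(pending: list[str], observation: str) -> str:
    return pending[0]


def act(task: str) -> str:
    if task == "compile":
        attempts["compile"] += 1
        return "ok" if attempts["compile"] > 1 else "error"
    return "ok"


def validate(task: str, result: str) -> bool:
    return result == "ok"


report = react_loop(["clone", "compile", "test"], think, act, validate)
print(report)
```

Because a task enters the report only after `validate` returns true, a failed step is retried on the next iteration instead of being counted as done.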
Thinking Model
- Chain-of-thought reasoning
- Analyze running status
- Examine tool results
- Read validation feedback
- Plan recovery on failure
Action Model
- Translate plan → tools
- Execute in container
- Capture output
Termination
- All tasks done? → validate & report
- Not done? → next iteration
The 12.09:1 thinking-to-action token ratio supports the architectural separation: the Action Model serves primarily as a command translator, using only 1,372 tokens against the Thinking Model's 16,584 for complex reasoning. Because the Action Model carries so little of the reasoning load, a smaller model suffices for it, and the pairing achieves higher success at 5.5× lower cost than uniformly applying powerful models.
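The ratio follows directly from the two token counts quoted above. The short check below also derives the Action Model's share of the combined tokens, which is a figure computed here rather than one reported on the page.

```python
thinking_tokens = 16_584
action_tokens = 1_372

ratio = thinking_tokens / action_tokens
action_share = action_tokens / (thinking_tokens + action_tokens)

print(f"thinking-to-action ratio: {ratio:.2f}")         # 12.09
print(f"action share of all tokens: {action_share:.1%}")  # 7.6%
```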
Host Machine
- SAG Agent Process
- ReAct Engine
- Context Manager
- Tool Belt
- Validator
Project Files
- src/ pom.xml build.xml
- target/ reports/
Tool Sets (with fallback between layers)
- Bash (low)
- Maven / Gradle (mid)
- Sys Package (mid)
- Analyzer (high)
Agent Memory
- Trunk: goal, task list, progress
- Branch: subtask detail
- Branch results merge back into Trunk when a subtask completes
SAG operates inside Docker containers rather than synthesizing Dockerfiles. The agent interactively explores the project, installs dependencies, resolves version conflicts, and builds, all inside the container. Disposable · Reproducible · Safe for untrusted code.
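Running a command inside an existing container, instead of writing a Dockerfile, can be sketched with the standard `docker exec` CLI. The container name `sag-demo` and the working directory `/workspace` are hypothetical, and this is not a claim about how SAG itself talks to Docker.

```python
import shlex
import subprocess


def docker_exec_argv(container: str, command: str, workdir: str = "/workspace") -> list[str]:
    """Build the argv for running a shell command inside a running container."""
    return ["docker", "exec", "--workdir", workdir, container, "bash", "-lc", command]


def run_in_container(container: str, command: str) -> subprocess.CompletedProcess:
    # Requires a local Docker daemon and a running container with this name.
    return subprocess.run(
        docker_exec_argv(container, command),
        capture_output=True,
        text=True,
        check=False,  # the caller inspects returncode, stdout, and stderr
    )


argv = docker_exec_argv("sag-demo", "mvn -q test")
print(shlex.join(argv))
```

Passing an argument list to `subprocess.run` avoids a second layer of shell quoting on the host; only the final string is interpreted by `bash` inside the container.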
Tool Results
- exit codes
- stdout / stderr
- build logs
Physical Evidence
- .class files
- surefire/*.xml
- compiled JARs
File Validator
- Reads actual files, not model claims
- "Model says success" vs "XML shows 3 fail"
Agent State Evaluator
- Reads agent memory
- Detects stuck loops
- Breaks repeat cycles
Observations = tool results + file validation + state evaluation
LLMs hallucinate — especially about success. The File Validator uses the file system as ground truth, not the model's interpretation of command output. This dual-source feedback means even smaller, cheaper models drive the action loop reliably, enabling the 84.4% success rate with mini models.
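A file-based check of this kind can be sketched by counting `.class` files and summing the `tests`, `failures`, and `errors` attributes of the surefire `TEST-*.xml` reports. The directory layout follows Maven defaults (`target/classes`, `target/surefire-reports`); the function name and the success rule are assumptions for illustration, not SAG's validator.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def validate_build(project_dir: Path) -> dict[str, int]:
    """Count compiled classes and aggregate surefire results found on disk."""
    summary = {"class_files": 0, "tests": 0, "failures": 0, "errors": 0}
    classes_dir = project_dir / "target" / "classes"
    reports_dir = project_dir / "target" / "surefire-reports"
    if classes_dir.is_dir():
        summary["class_files"] = sum(1 for _ in classes_dir.rglob("*.class"))
    if reports_dir.is_dir():
        for report in reports_dir.glob("TEST-*.xml"):
            suite = ET.parse(report).getroot()
            for key in ("tests", "failures", "errors"):
                summary[key] += int(suite.get(key, "0"))
    return summary


# Demo: a fake project where the XML report records 3 failing tests.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    classes = project / "target" / "classes"
    reports = project / "target" / "surefire-reports"
    classes.mkdir(parents=True)
    reports.mkdir(parents=True)
    (classes / "App.class").write_bytes(b"\xca\xfe\xba\xbe")
    (reports / "TEST-AppTest.xml").write_text(
        '<testsuite name="AppTest" tests="5" failures="3" errors="0"/>'
    )
    summary = validate_build(project)

succeeded = (
    summary["class_files"] > 0
    and summary["tests"] > 0
    and summary["failures"] + summary["errors"] == 0
)
print(summary, succeeded)
```

In the demo the verdict is a failure regardless of what any model reported, because it is derived only from the files on disk.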