Setup Agent

Open Source · ICSE-NIER '26 · Rio de Janeiro

Automating The First Step.

Not a Dockerfile generator. An autonomous dual-model agent that reasons, acts, and verifies inside isolated Docker containers — turning hours of Java project setup into minutes of autonomous execution.

1. Install
git clone https://github.com/Codegass/Setup-Agent.git && cd Setup-Agent && uv sync
2. Setup a Project
sag project https://github.com/apache/commons-cli.git
84.4%
Build Success Rate
o4-mini + GPT-4.1-mini configuration
5.5×
Cost Reduction
vs. GPT-5 + GPT-5, with higher accuracy
12.09
Think-to-Act Ratio
Validates the dual-model separation
48.4%
Test Execution Rate
Average across all configurations
"Role specialization beats raw capacity — pairing a chain-of-thought Thinking Model with a smaller Action Model achieves higher success at 5.5× lower cost than uniformly applying powerful models."
Core Finding · SAG Paper · ICSE-NIER '26
Setup Agent
Dual-Model ReAct Engine

Separates reasoning from execution. A chain-of-thought Thinking Model plans while a smaller Action Model translates plans into tool invocations via function calling. The 12.09 thinking-to-action ratio validates this separation.
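The split can be sketched in a few lines. This is an illustrative Python skeleton, not SAG's actual API; the `thinking_model` / `action_model` callables and the tool-call shape are assumptions:

```python
# Minimal sketch of one dual-model ReAct turn (names are illustrative,
# not SAG's real interfaces): the Thinking Model plans in prose, the
# smaller Action Model translates the plan into a concrete tool call.

def react_turn(thinking_model, action_model, tools, observation):
    # 1. Reason: chain-of-thought model analyzes the latest observation
    plan = thinking_model(f"Observation:\n{observation}\nWhat next?")
    # 2. Act: smaller model maps the plan onto a tool invocation
    call = action_model(plan, tools=list(tools))
    # 3. Execute the chosen tool; its output becomes the next observation
    return tools[call.name](**call.args)
```

The point of the separation is that step 2 is cheap: translating an already-worked-out plan into a function call needs far less capacity than producing the plan.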

Container-Native Isolation

SAG operates entirely within Docker containers — not generating Dockerfiles, but interactively exploring and configuring inside them. Zero host pollution, fully reproducible.

File-Based Validation

Examines concrete evidence on disk — .class files after compilation, surefire XML test reports, build artifacts. Prevents hallucinations where the agent believes it succeeded without actual completion.

Trunk & Branch Context

Trunk Context maintains global state and task lists. Branch contexts capture subtask details. When a subtask ends, results merge back to Trunk — enabling complex chains without context loss.
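The pattern can be illustrated with a small sketch (the data shapes are assumptions, not SAG's internals): a Branch accumulates a subtask's verbose history, and only a compact summary merges back into the Trunk.

```python
# Illustrative Trunk/Branch context pattern: detail stays in the
# branch; the trunk keeps global state plus compact results.

class TrunkContext:
    def __init__(self, goal):
        self.goal = goal
        self.task_results = {}          # global state survives subtasks

    def open_branch(self, task):
        return BranchContext(self, task)

class BranchContext:
    def __init__(self, trunk, task):
        self.trunk, self.task = trunk, task
        self.messages = []              # verbose detail stays here

    def record(self, msg):
        self.messages.append(msg)

    def merge_back(self, summary):
        # detail is discarded; the trunk keeps only the compact result
        self.trunk.task_results[self.task] = summary
```

Because only summaries flow upward, the Thinking Model's context stays small even across long subtask chains.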

Hierarchical Tool-Belt

Layered tools from low-level (Bash) through build specialists (Maven, Gradle) to high-level orchestrators (Project Analyzer). Graceful fallback when specialized tools encounter unexpected scenarios.
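The fallback behavior amounts to trying the most specialized layer first and descending the stack. A sketch (the dispatch logic is an illustration, not SAG's implementation):

```python
# Graceful fallback across tool layers: a tool that cannot handle
# the scenario signals so, and the next (lower-level) tool is tried.

def run_with_fallback(task, layers):
    """Try the most specialized tool first; fall back down the stack."""
    for tool in layers:                 # e.g. [analyzer, maven, bash]
        try:
            return tool(task)
        except NotImplementedError:     # tool can't handle this scenario
            continue
    raise RuntimeError(f"no tool could handle: {task}")
```

In the worst case everything degrades to raw Bash, which can always attempt the task even without specialized knowledge.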

Agent State Evaluator

Analyzes Agent Memory to detect problematic patterns — repeated identical actions indicating the agent is stuck in a loop. Provides corrective feedback to break cycles autonomously.
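The simplest such pattern, a run of identical actions, can be detected with a sliding window. A sketch (the window size and message format are assumptions):

```python
# Stuck-loop detection over agent memory: flag `window` identical
# consecutive actions and emit a corrective signal.

def detect_loop(actions, window=3):
    """True if the last `window` actions are all identical."""
    recent = actions[-window:]
    return len(recent) == window and len(set(recent)) == 1

def corrective_feedback(actions):
    if detect_loop(actions):
        return ("You have repeated the same action several times "
                "without progress; try a different approach.")
    return None
```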

How SAG Works.
User Input
  • Repository URL
  • Goal description
when all tasks finished
Setup Report
  • Build & test outcome
  • Per-task summary
SAG Agent · ReAct Loop
Thinking Model
  • Reasoning
  • Plan next action
o4-mini / GPT-5
action plan →
Action Model
  • Function calling
  • Tool invocation
GPT-4.1-mini / o4-mini
All tasks finished?
  • Yes → emit Setup Report
  • No → back to Thinking
← observation
Observation
  • Tool Execution Results
  • Project Running Status
  • + Validation feedback ↓
use tool
tool output
Docker Container
Tool Sets
  • Bash · Maven · Gradle
  • Sys Package · Analyzer
File System
Agent Memory
  • Trunk: general, tasks
  • Branch: detail context
Java Project Files
  • src/ · pom.xml · build.xml
  • target/ · surefire reports
feedback closes the loop → written back into Observation
Validation System · Rule-Based, Out-of-Band
File Validator
  • Reads Java project files directly
  • .class files, surefire XML, JARs
  • Ground truth over model claims
Agent State Evaluator
  • Reads Agent Memory directly
  • Detects stuck / repeating loops
  • Produces corrective signals

SAG has three tightly coupled subsystems. The ReAct Loop alternates between reasoning (Thinking) and tool use (Action); on each turn it checks whether all tasks are finished and either emits the Setup Report or continues. The Docker Container isolates execution and hosts a shared file system containing both the agent's hierarchical memory and the Java project files. The Validation System runs out of band: it reads files and memory directly from the container and writes an independent observation back into the ReAct loop, so each iteration is anchored to ground truth rather than the model's own claims.

Thinking Model
  • Chain-of-thought reasoning
  • Analyze running status
  • Examine tool results
  • Read validation feedback
  • Plan recovery on failure
o4-mini / GPT-5 · 16,584 tokens avg
plan →
ReAct Loop
← observe
Action Model
  • Translate plan → tools
  • Execute in container
  • Capture output
GPT-4.1-mini / o4-mini · 1,372 tokens avg
tool calls
Termination
  • All tasks done? → validate & report
  • Not done? → next iteration

The extreme 12.09 thinking-to-action token ratio validates the architectural separation — the Action Model primarily serves as a command translator, using only 1,372 tokens while the Thinking Model uses 16,584 for complex reasoning. This asymmetry means smaller action models achieve higher success rates at 5.5× lower cost.
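The ratio follows directly from the per-turn averages quoted above:

```python
# The 12.09 thinking-to-action ratio is just the quotient of the
# reported average token counts per iteration.
thinking_tokens = 16_584   # Thinking Model, avg tokens per iteration
action_tokens = 1_372      # Action Model, avg tokens per iteration
ratio = thinking_tokens / action_tokens
print(round(ratio, 2))     # 12.09
```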

Host Machine
  • SAG Agent Process
  • ReAct Engine
  • Context Manager
  • Tool Belt
  • Validator
Your machine is never modified.
Nothing installed on host.
Docker API
Docker Container · /workspace
Project Files
  • src/ pom.xml build.xml
  • target/ reports/
Tool Sets · ↕ fallback
  • Bash (low)
  • Maven / Gradle (mid)
  • Sys Package (mid)
  • Analyzer (high)
Agent Memory
  • Trunk: goal, task list, progress
  • Branch: subtask detail
  • ↕ merge back on done

SAG operates within Docker containers rather than synthesizing Dockerfiles. The agent interactively explores the project, installs dependencies, resolves version conflicts, and builds, all inside the container. Disposable · Reproducible · Safe for untrusted code.

Tool Results
  • exit codes
  • stdout / stderr
  • build logs
Physical Evidence
  • .class files
  • surefire/*.xml
  • compiled JARs
File Validator
  • Reads actual files, not model claims
  • "Model says success" vs "XML shows 3 fail"
Agent State Evaluator
  • Reads agent memory
  • Detects stuck loops
  • Breaks repeat cycles
combined
Observations
  • = tool results
  • + file validation
  • + state evaluation
Fed back to Thinking Model each iteration

LLMs hallucinate — especially about success. The File Validator uses the file system as ground truth, not the model's interpretation of command output. This dual-source feedback means even smaller, cheaper models drive the action loop reliably, enabling the 84.4% success rate with mini models.

Hallucination-Resistant by Design

Chenhao Wei

Stevens Institute of Technology
Google Scholar ↗
First Author

Gengwu Zhao

Stevens Institute of Technology
Google Scholar ↗

Xinyi Li

Stevens Institute of Technology

Billy Ye

Stevens Institute of Technology

Lu Xiao

Stevens Institute of Technology
Google Scholar ↗
Corresponding Author
Setup AGent (SAG): A Dual-Model LLM Agent for Autonomous End-to-End Java Project Configuration
Chenhao Wei, Gengwu Zhao, Xinyi Li, Billy Ye, and Lu Xiao
ICSE-NIER '26 · IEEE/ACM 48th International Conference on Software Engineering · April 2026 · Rio de Janeiro, Brazil
DOI: 10.1145/3786582.3786818 ↗
@inproceedings{wei2026sag,
  title     = {Setup AGent (SAG): A Dual-Model LLM Agent for Autonomous End-to-End Java Project Configuration},
  author    = {Wei, Chenhao and Zhao, Gengwu and Li, Xinyi and Ye, Billy and Xiao, Lu},
  booktitle = {ICSE-NIER '26},
  year      = {2026},
  doi       = {10.1145/3786582.3786818}
}