Rogue Open-Sourced: End-to-End Framework for Testing and Auditing Agentic AI
What Rogue is and why it matters
Agentic systems behave differently from deterministic software: they are stochastic, context-dependent, and constrained by policies. Traditional QA (unit tests, static prompts, or single-score LLM judgments) misses multi-turn vulnerabilities and often leaves poor audit trails. Development teams need protocol-accurate conversations, explicit policy checks, and machine-readable evidence to gate releases with confidence.
Qualifire AI has open-sourced Rogue, a Python framework built to evaluate AI agents over the Agent-to-Agent (A2A) protocol. Rogue transforms business policies into executable scenarios, runs multi-turn interactions against a target agent, and produces deterministic reports suitable for CI/CD pipelines and compliance reviews.
Quick start and prerequisites
Before you run Rogue, make sure you have:
- uvx installed (it ships with uv; follow the uv installation guide if needed)
- Python 3.10+
- An API key for an LLM provider (OpenAI, Google, Anthropic, or other providers supported via LiteLLM)
Installation (recommended quick path)
Use the uvx automated installer to get up and running quickly:
# TUI
uvx rogue-ai
# Web UI
uvx rogue-ai ui
# CLI / CI/CD
uvx rogue-ai cli
Manual installation
(a) Clone the repository:
git clone https://github.com/qualifire-dev/rogue.git
cd rogue
(b) Install dependencies:
If you are using uv:
uv sync
Or, if you are using pip:
pip install -e .
(c) OPTIONAL: Set up your environment variables. Create a .env file in the project root and add your API keys. Rogue uses LiteLLM and can accept keys for multiple providers:
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-..."
GOOGLE_API_KEY="..."
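Before starting an evaluation it can help to confirm that at least one provider key is actually present. The following is a minimal sketch, not part of Rogue: the helper names read_dotenv and configured_providers are hypothetical, and the parser only handles simple KEY="value" lines like the ones shown above.

```python
# Key names copied from the .env example above; which ones you need
# depends on the provider you choose.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def read_dotenv(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(env):
    """Return the known key names that are set to a non-empty value."""
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]
```

To check a real file, pass its contents through both helpers, for example configured_providers(read_dotenv(open(".env").read())), and stop early if the result is empty.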
Running Rogue
Rogue uses a client-server architecture: the core evaluation logic runs on a backend server while multiple clients can connect to it (TUI, Web UI, CLI).
Running the default uvx command without a mode will start the server in the background and launch the TUI client:
uvx rogue-ai
Available modes for different use cases:
- Default (Server + TUI): uvx rogue-ai (starts server + TUI client)
- Server: uvx rogue-ai server (runs the backend server only)
- TUI: uvx rogue-ai tui (runs the terminal client; requires server)
- Web UI: uvx rogue-ai ui (runs the Gradio web interface client; requires server)
- CLI: uvx rogue-ai cli (runs non-interactive evaluations; good for CI/CD; requires server)
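Because the TUI, Web UI, and CLI modes all require a running server, an automated job has to wait for the server before launching a client. The sketch below shows one way to do that under stated assumptions: it polls the default address 127.0.0.1:8000 over TCP, and it calls the cli mode with no options, which a real run will most likely need (check the command's help output). The function names are hypothetical.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


def run_cli_with_server(host="127.0.0.1", port=8000):
    """Start the server, wait until it accepts connections, run the CLI."""
    server = subprocess.Popen(["uvx", "rogue-ai", "server"])
    try:
        if not wait_for_port(host, port):
            raise RuntimeError("Rogue server did not come up in time")
        # CLI options are omitted here; a real evaluation will need them.
        return subprocess.run(["uvx", "rogue-ai", "cli"]).returncode
    finally:
        server.terminate()
        server.wait()
```

A successful TCP connection only shows that something is listening on the port; it does not prove the server has finished initializing.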
Example command signatures:
uvx rogue-ai server [OPTIONS]
Options for server mode typically include host/port and debug flags:
- --host HOST: Host to run the server on (default: 127.0.0.1 or HOST env var)
- --port PORT: Port to run the server on (default: 8000 or PORT env var)
- --debug: Enable debug logging
uvx rogue-ai tui [OPTIONS]
uvx rogue-ai ui [OPTIONS]
Common UI options include --rogue-server-url, --port, --workdir, and --debug.
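The documented defaults suggest a simple resolution order for the server address. The sketch below illustrates one plausible order (explicit flag, then the HOST/PORT environment variable, then the built-in default) and builds the URL a client would pass as --rogue-server-url. The precedence and the http scheme are assumptions, not confirmed Rogue behavior, and the function names are hypothetical.

```python
import os


def resolve_server_address(host_flag=None, port_flag=None, env=None):
    """Pick host and port: flag first, then env var, then the default."""
    env = os.environ if env is None else env
    host = host_flag or env.get("HOST") or "127.0.0.1"
    port = int(port_flag or env.get("PORT") or 8000)
    return host, port


def server_url(host, port):
    """Build the URL a client would use to reach the server (assumes http)."""
    return f"http://{host}:{port}"
```

With no flags and no environment variables set, this yields http://127.0.0.1:8000, matching the defaults listed above.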
Example: testing the T-Shirt store agent
The repository includes a simple example agent (a T-shirt store) that you can use to see Rogue in action.
Install example dependencies:
If you are using uv:
uv sync --group examples
or, if you are using pip:
pip install -e ".[examples]"
(a) Start the example agent server in a separate terminal:
If you are using uv:
uv run examples/tshirt_store_agent
Or, without uv:
python examples/tshirt_store_agent
This will start the agent on http://localhost:10001.
(b) Configure Rogue in the UI to point to the example agent. Example settings:
- Agent URL: http://localhost:10001
- Authentication: no-auth
(c) Run the evaluation and watch Rogue test the T-Shirt agent’s policies. You can use either the TUI (uvx rogue-ai) or the Web UI (uvx rogue-ai ui).
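If the evaluation cannot reach the example agent, a quick reachability check helps separate agent problems from Rogue configuration problems. The sketch below tries to fetch an A2A agent card from the agent URL. The two well-known paths are assumptions based on different revisions of the A2A specification, and the example agent may serve only one of them (or neither); the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Assumed well-known locations for an A2A agent card; the exact path
# depends on the A2A version the agent implements.
CARD_PATHS = ("/.well-known/agent-card.json", "/.well-known/agent.json")


def card_urls(agent_url):
    """Return the candidate agent-card URLs for a base agent URL."""
    base = agent_url.rstrip("/") + "/"
    return [urljoin(base, path.lstrip("/")) for path in CARD_PATHS]


def fetch_agent_card(agent_url, timeout=5.0):
    """Return the first agent card that loads as JSON, or None."""
    for url in card_urls(agent_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except (OSError, ValueError):
            # Connection errors, HTTP errors, or invalid JSON: try the next path.
            continue
    return None
```

Calling fetch_agent_card("http://localhost:10001") while the example agent is running should return a dictionary if one of the assumed paths is served; None means the agent is down or uses a different path.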
Where Rogue fits in your workflow
Rogue provides practical testing for multiple domains:
- Safety & compliance hardening: validate PII/PHI handling, refusal behavior, secret-leak prevention, and regulated-domain policies with transcript-anchored evidence.
- E-commerce & support agents: enforce OTP-gated discounts, refund rules, SLA-aware escalation, and tool-use correctness under adversarial conditions.
- Developer/DevOps agents: assess workspace confinement, rollback semantics, rate-limit/backoff behavior, and prevention of unsafe commands.
- Multi-agent systems: verify planner/executor contracts, capability negotiation, and schema conformance over A2A.
- Regression & drift monitoring: run nightly suites against new model versions or prompt changes and detect behavioral drift before release.
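For the regression and drift use case, the core comparison is between a baseline run and a run against the new model or prompt. The sketch below assumes each run has already been reduced to a mapping from scenario name to pass/fail; that mapping and the function name are hypothetical and do not reflect Rogue's report format.

```python
def find_regressions(baseline, candidate):
    """List scenarios that passed in baseline but fail or vanish in candidate."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
```

A scenario missing from the candidate run is treated as a regression on purpose, so a silently dropped test cannot hide a behavioral change.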
How Rogue works
Rogue synthesizes business context and risk into structured tests with clear objectives, tactics, and success criteria. The EvaluatorAgent runs protocol-correct conversations in single-turn or deep multi-turn adversarial modes. Use your own model or let Rogue employ Qualifire’s SLM judges to drive tests. Rogue produces streaming observability and deterministic artifacts: live transcripts, pass/fail verdicts, rationales tied to transcript spans, timing information, and model/version lineage metadata.
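To make the idea of structured tests and transcript-anchored rationales concrete, here is an illustrative data model. It is a sketch only: the class names, field names, and the turn-index representation of a transcript span are all hypothetical and are not taken from Rogue's code.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    policy: str            # the written business policy being tested
    objective: str         # what the evaluator tries to achieve
    tactics: list[str] = field(default_factory=list)
    success_criteria: str = ""


@dataclass
class Verdict:
    scenario: Scenario
    passed: bool
    rationale: str
    transcript_span: tuple[int, int]  # (first_turn, last_turn), inclusive


def evidence(verdict, transcript):
    """Return the transcript turns that the rationale points at."""
    first, last = verdict.transcript_span
    return transcript[first:last + 1]
```

Tying each verdict to a span rather than to the whole conversation is what lets a reviewer jump straight to the turns that justify a pass or fail.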
Architecture and interfaces
- Rogue Server: core evaluation engine
- Client interfaces: TUI (Go + Bubble Tea), Web UI (Gradio), and CLI for automated CI/CD runs
This separation enables flexible deployments where the server runs independently and multiple clients connect concurrently.
Practical outcome
Rogue lets developer teams test agent behavior as it runs in production. It turns written policies into concrete scenarios, exercises those scenarios over A2A, and records auditable transcripts. The output provides a repeatable signal you can use in CI/CD to catch policy breaks and regressions before shipping.
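A CI/CD gate built on this signal can be very small. The sketch below assumes a JSON report with a "results" list whose items carry "scenario" and "passed" fields; this schema is hypothetical, so adapt the field names to the report your Rogue run actually produces.

```python
import json
import sys


def gate(report):
    """Return (exit_code, failed_names) for a parsed report dictionary."""
    results = report.get("results", [])
    failed = [r["scenario"] for r in results if not r.get("passed", False)]
    # An empty report also fails, so a run that tested nothing cannot pass.
    code = 1 if failed or not results else 0
    return code, failed


def main(path):
    """Load a report file, print failures to stderr, return the exit code."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    code, failed = gate(report)
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return code
```

In a pipeline step, calling sys.exit(main("report.json")) makes the job fail whenever any scenario fails; the file name here is only a placeholder.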
Where to find it
Rogue is available on GitHub under the Qualifire organization. Thanks to the Qualifire team for their leadership and resources supporting this project.