Skip to content

# Testing

OpenSoul has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners.

This doc is a “how we test” guide:

  • What each suite covers (and what it deliberately does not cover)
  • Which commands to run for common workflows (local, pre-push, debugging)
  • How live tests discover credentials and select models/providers
  • How to add regressions for real-world model/provider issues

# Quick start

Most days:

  • Full gate (expected before push): pnpm build && pnpm check && pnpm test

When you touch tests or want extra confidence:

  • Coverage gate: pnpm test:coverage
  • E2E suite: pnpm test:e2e

When debugging real providers/models (requires real creds):

  • Live suite (models + gateway tool/image probes): pnpm test:live

Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.

# Test suites (what runs where)

Think of the suites as “increasing realism” (and increasing flakiness/cost):

# Unit / integration (default)

  • Command: pnpm test
  • Config: vitest.config.ts
  • Files: src/**/*.test.ts
  • Scope:
    • Pure unit tests
    • In-process integration tests (gateway auth, routing, tooling, parsing, config)
    • Deterministic regressions for known bugs
  • Expectations:
    • Runs in CI
    • No real keys required
    • Should be fast and stable

# E2E (gateway smoke)

  • Command: pnpm test:e2e
  • Config: vitest.e2e.config.ts
  • Files: src/**/*.e2e.test.ts
  • Scope:
    • Multi-instance gateway end-to-end behavior
    • WebSocket/HTTP surfaces, node pairing, and heavier networking
  • Expectations:
    • Runs in CI (when enabled in the pipeline)
    • No real keys required
    • More moving parts than unit tests (can be slower)

# Live (real providers + real models)

  • Command: pnpm test:live
  • Config: vitest.live.config.ts
  • Files: src/**/*.live.test.ts
  • Default: enabled by pnpm test:live (sets OPENSOUL_LIVE_TEST=1)
  • Scope:
    • “Does this provider/model actually work today with real creds?”
    • Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
  • Expectations:
    • Not CI-stable by design (real networks, real provider policies, quotas, outages)
    • Costs money / uses rate limits
    • Prefer running narrowed subsets instead of “everything”
    • Live runs will source ~/.profile to pick up missing API keys
    • Anthropic key rotation: set OPENSOUL_LIVE_ANTHROPIC_KEYS="sk-...,sk-..." (or OPENSOUL_LIVE_ANTHROPIC_KEY=sk-...) or multiple ANTHROPIC_API_KEY* vars; tests will retry on rate limits

# Which suite should I run?

Use this decision table:

  • Editing logic/tests: run pnpm test (and pnpm test:coverage if you changed a lot)
  • Touching gateway networking / WS protocol / pairing: add pnpm test:e2e
  • Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed pnpm test:live

# Live: model smoke (profile keys)

Live tests are split into two layers so we can isolate failures:

  • “Direct model” tells us the provider/model can answer at all with the given key.
  • “Gateway smoke” tells us the full gateway+agent pipeline works for that model (sessions, history, tools, sandbox policy, etc.).

# Layer 1: Direct model completion (no gateway)

  • Test: src/agents/models.profiles.live.test.ts
  • Goal:
    • Enumerate discovered models
    • Use getApiKeyForModel to select models you have creds for
    • Run a small completion per model (and targeted regressions where needed)
  • How to enable:
    • pnpm test:live (or OPENSOUL_LIVE_TEST=1 if invoking Vitest directly)
  • Set OPENSOUL_LIVE_MODELS=modern (or all, alias for modern) to actually run this suite; otherwise it skips to keep pnpm test:live focused on gateway smoke
  • How to select models:
    • OPENSOUL_LIVE_MODELS=modern to run the modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.1, Grok 4)
    • OPENSOUL_LIVE_MODELS=all is an alias for the modern allowlist
    • or OPENSOUL_LIVE_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,..." (comma allowlist)
  • How to select providers:
    • OPENSOUL_LIVE_PROVIDERS="google,google-antigravity,google-gemini-cli" (comma allowlist)
  • Where keys come from:
    • By default: profile store and env fallbacks
    • Set OPENSOUL_LIVE_REQUIRE_PROFILE_KEYS=1 to enforce profile store only
  • Why this exists:
    • Separates “provider API is broken / key is invalid” from “gateway agent pipeline is broken”
    • Contains small, isolated regressions (example: OpenAI Responses/Codex Responses reasoning replay + tool-call flows)

# Layer 2: Gateway + dev agent smoke (what “@opensoul” actually does)

  • Test: src/gateway/gateway-models.profiles.live.test.ts
  • Goal:
    • Spin up an in-process gateway
    • Create/patch a agent:dev:* session (model override per run)
    • Iterate models-with-keys and assert:
      • “meaningful” response (no tools)
      • a real tool invocation works (read probe)
      • optional extra tool probes (exec+read probe)
      • OpenAI regression paths (tool-call-only → follow-up) keep working
  • Probe details (so you can explain failures quickly):
    • read probe: the test writes a nonce file in the workspace and asks the agent to read it and echo the nonce back.
    • exec+read probe: the test asks the agent to exec-write a nonce into a temp file, then read it back.
    • image probe: the test attaches a generated PNG (cat + randomized code) and expects the model to return cat <CODE>.
    • Implementation reference: src/gateway/gateway-models.profiles.live.test.ts and src/gateway/live-image-probe.ts.
  • How to enable:
    • pnpm test:live (or OPENSOUL_LIVE_TEST=1 if invoking Vitest directly)
  • How to select models:
    • Default: modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.1, Grok 4)
    • OPENSOUL_LIVE_GATEWAY_MODELS=all is an alias for the modern allowlist
    • Or set OPENSOUL_LIVE_GATEWAY_MODELS="provider/model" (or comma list) to narrow
  • How to select providers (avoid “OpenRouter everything”):
    • OPENSOUL_LIVE_GATEWAY_PROVIDERS="google,google-antigravity,google-gemini-cli,openai,anthropic,zai,minimax" (comma allowlist)
  • Tool + image probes are always on in this live test:
    • read probe + exec+read probe (tool stress)
    • image probe runs when the model advertises image input support
    • Flow (high level):
      • Test generates a tiny PNG with “CAT” + random code (src/gateway/live-image-probe.ts)
      • Sends it via agent attachments: [{ mimeType: "image/png", content: "<base64>" }]
      • Gateway parses attachments into images[] (src/gateway/server-methods/agent.ts + src/gateway/chat-attachments.ts)
      • Embedded agent forwards a multimodal user message to the model
      • Assertion: reply contains cat + the code (OCR tolerance: minor mistakes allowed)

Tip: to see what you can test on your machine (and the exact provider/model ids), run:

bash
opensoul models list
opensoul models list --json

# Live: Anthropic setup-token smoke

  • Test: src/agents/anthropic.setup-token.live.test.ts
  • Goal: verify Claude Code CLI setup-token (or a pasted setup-token profile) can complete an Anthropic prompt.
  • Enable:
    • pnpm test:live (or OPENSOUL_LIVE_TEST=1 if invoking Vitest directly)
    • OPENSOUL_LIVE_SETUP_TOKEN=1
  • Token sources (pick one):
    • Profile: OPENSOUL_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test
    • Raw token: OPENSOUL_LIVE_SETUP_TOKEN_VALUE=sk-ant-oat01-...
  • Model override (optional):
    • OPENSOUL_LIVE_SETUP_TOKEN_MODEL=anthropic/claude-opus-4-6

Setup example:

bash
opensoul models auth paste-token --provider anthropic --profile-id anthropic:setup-token-test
OPENSOUL_LIVE_SETUP_TOKEN=1 OPENSOUL_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test pnpm test:live src/agents/anthropic.setup-token.live.test.ts

# Live: CLI backend smoke (Claude Code CLI or other local CLIs)

  • Test: src/gateway/gateway-cli-backend.live.test.ts
  • Goal: validate the Gateway + agent pipeline using a local CLI backend, without touching your default config.
  • Enable:
    • pnpm test:live (or OPENSOUL_LIVE_TEST=1 if invoking Vitest directly)
    • OPENSOUL_LIVE_CLI_BACKEND=1
  • Defaults:
    • Model: claude-cli/claude-sonnet-4-5
    • Command: claude
    • Args: ["-p","--output-format","json","--dangerously-skip-permissions"]
  • Overrides (optional):
    • OPENSOUL_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-opus-4-6"
    • OPENSOUL_LIVE_CLI_BACKEND_MODEL="codex-cli/gpt-5.3-codex"
    • OPENSOUL_LIVE_CLI_BACKEND_COMMAND="/full/path/to/claude"
    • OPENSOUL_LIVE_CLI_BACKEND_ARGS='["-p","--output-format","json","--permission-mode","bypassPermissions"]'
    • OPENSOUL_LIVE_CLI_BACKEND_CLEAR_ENV='["ANTHROPIC_API_KEY","ANTHROPIC_API_KEY_OLD"]'
    • OPENSOUL_LIVE_CLI_BACKEND_IMAGE_PROBE=1 to send a real image attachment (paths are injected into the prompt).
    • OPENSOUL_LIVE_CLI_BACKEND_IMAGE_ARG="--image" to pass image file paths as CLI args instead of prompt injection.
    • OPENSOUL_LIVE_CLI_BACKEND_IMAGE_MODE="repeat" (or "list") to control how image args are passed when IMAGE_ARG is set.
    • OPENSOUL_LIVE_CLI_BACKEND_RESUME_PROBE=1 to send a second turn and validate resume flow.
  • OPENSOUL_LIVE_CLI_BACKEND_DISABLE_MCP_CONFIG=0 to keep Claude Code CLI MCP config enabled (default disables MCP config with a temporary empty file).

Example:

bash
OPENSOUL_LIVE_CLI_BACKEND=1 \
  OPENSOUL_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-sonnet-4-5" \
  pnpm test:live src/gateway/gateway-cli-backend.live.test.ts

Narrow, explicit allowlists are fastest and least flaky:

  • Single model, direct (no gateway):

    • OPENSOUL_LIVE_MODELS="openai/gpt-5.2" pnpm test:live src/agents/models.profiles.live.test.ts
  • Single model, gateway smoke:

    • OPENSOUL_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
  • Tool calling across several providers:

    • OPENSOUL_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,google/gemini-3-flash-preview,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
  • Google focus (Gemini API key + Antigravity):

    • Gemini (API key): OPENSOUL_LIVE_GATEWAY_MODELS="google/gemini-3-flash-preview" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts
    • Antigravity (OAuth): OPENSOUL_LIVE_GATEWAY_MODELS="google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-pro-high" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts

Notes:

  • google/... uses the Gemini API (API key).
  • google-antigravity/... uses the Antigravity OAuth bridge (Cloud Code Assist-style agent endpoint).
  • google-gemini-cli/... uses the local Gemini CLI on your machine (separate auth + tooling quirks).
  • Gemini API vs Gemini CLI:
    • API: OpenSoul calls Google’s hosted Gemini API over HTTP (API key / profile auth); this is what most users mean by “Gemini”.
    • CLI: OpenSoul shells out to a local gemini binary; it has its own auth and can behave differently (streaming/tool support/version skew).

# Live: model matrix (what we cover)

There is no fixed “CI model list” (live is opt-in), but these are the recommended models to cover regularly on a dev machine with keys.

# Modern smoke set (tool calling + image)

This is the “common models” run we expect to keep working:

  • OpenAI (non-Codex): openai/gpt-5.2 (optional: openai/gpt-5.1)
  • OpenAI Codex: openai-codex/gpt-5.3-codex (optional: openai-codex/gpt-5.3-codex-codex)
  • Anthropic: anthropic/claude-opus-4-6 (or anthropic/claude-sonnet-4-5)
  • Google (Gemini API): google/gemini-3-pro-preview and google/gemini-3-flash-preview (avoid older Gemini 2.x models)
  • Google (Antigravity): google-antigravity/claude-opus-4-6-thinking and google-antigravity/gemini-3-flash
  • Z.AI (GLM): zai/glm-4.7
  • MiniMax: minimax/minimax-m2.1

Run gateway smoke with tools + image: OPENSOUL_LIVE_GATEWAY_MODELS="openai/gpt-5.2,openai-codex/gpt-5.3-codex,anthropic/claude-opus-4-6,google/gemini-3-pro-preview,google/gemini-3-flash-preview,google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-flash,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts

# Baseline: tool calling (Read + optional Exec)

Pick at least one per provider family:

  • OpenAI: openai/gpt-5.2 (or openai/gpt-5-mini)
  • Anthropic: anthropic/claude-opus-4-6 (or anthropic/claude-sonnet-4-5)
  • Google: google/gemini-3-flash-preview (or google/gemini-3-pro-preview)
  • Z.AI (GLM): zai/glm-4.7
  • MiniMax: minimax/minimax-m2.1

Optional additional coverage (nice to have):

  • xAI: xai/grok-4 (or latest available)
  • Mistral: mistral/… (pick one “tools” capable model you have enabled)
  • Cerebras: cerebras/… (if you have access)
  • LM Studio: lmstudio/… (local; tool calling depends on API mode)

# Vision: image send (attachment → multimodal message)

Include at least one image-capable model in OPENSOUL_LIVE_GATEWAY_MODELS (Claude/Gemini/OpenAI vision-capable variants, etc.) to exercise the image probe.

# Aggregators / alternate gateways

If you have keys enabled, we also support testing via:

  • OpenRouter: openrouter/... (hundreds of models; use opensoul models scan to find tool+image capable candidates)
  • OpenCode Zen: opencode/... (auth via OPENCODE_API_KEY / OPENCODE_ZEN_API_KEY)

More providers you can include in the live matrix (if you have creds/config):

  • Built-in: openai, openai-codex, anthropic, google, google-vertex, google-antigravity, google-gemini-cli, zai, openrouter, opencode, xai, groq, cerebras, mistral, github-copilot
  • Via models.providers (custom endpoints): minimax (cloud/API), plus any OpenAI/Anthropic-compatible proxy (LM Studio, vLLM, LiteLLM, etc.)

Tip: don’t try to hardcode “all models” in docs. The authoritative list is whatever discoverModels(...) returns on your machine + whatever keys are available.

# Credentials (never commit)

Live tests discover credentials the same way the CLI does. Practical implications:

  • If the CLI works, live tests should find the same keys.

  • If a live test says “no creds”, debug the same way you’d debug opensoul models list / model selection.

  • Profile store: ~/.opensoul/credentials/ (preferred; what “profile keys” means in the tests)

  • Config: ~/.opensoul/opensoul.json (or OPENSOUL_CONFIG_PATH)

If you want to rely on env keys (e.g. exported in your ~/.profile), run local tests after source ~/.profile, or use the Docker runners below (they can mount ~/.profile into the container).

# Deepgram live (audio transcription)

  • Test: src/media-understanding/providers/deepgram/audio.live.test.ts
  • Enable: DEEPGRAM_API_KEY=... DEEPGRAM_LIVE_TEST=1 pnpm test:live src/media-understanding/providers/deepgram/audio.live.test.ts

# Docker runners (optional “works in Linux” checks)

These run pnpm test:live inside the repo Docker image, mounting your local config dir and workspace (and sourcing ~/.profile if mounted):

  • Direct models: pnpm test:docker:live-models (script: scripts/test-live-models-docker.sh)
  • Gateway + dev agent: pnpm test:docker:live-gateway (script: scripts/test-live-gateway-models-docker.sh)
  • Onboarding wizard (TTY, full scaffolding): pnpm test:docker:onboard (script: scripts/e2e/onboard-docker.sh)
  • Gateway networking (two containers, WS auth + health): pnpm test:docker:gateway-network (script: scripts/e2e/gateway-network-docker.sh)
  • Plugins (custom extension load + registry smoke): pnpm test:docker:plugins (script: scripts/e2e/plugins-docker.sh)

Useful env vars:

  • OPENSOUL_CONFIG_DIR=... (default: ~/.opensoul) mounted to /home/node/.opensoul
  • OPENSOUL_WORKSPACE_DIR=... (default: ~/.opensoul/workspace) mounted to /home/node/.opensoul/workspace
  • OPENSOUL_PROFILE_FILE=... (default: ~/.profile) mounted to /home/node/.profile and sourced before running tests
  • OPENSOUL_LIVE_GATEWAY_MODELS=... / OPENSOUL_LIVE_MODELS=... to narrow the run
  • OPENSOUL_LIVE_REQUIRE_PROFILE_KEYS=1 to ensure creds come from the profile store (not env)

# Docs sanity

Run docs checks after doc edits: pnpm docs:list.

# Offline regression (CI-safe)

These are “real pipeline” regressions without real providers:

  • Gateway tool calling (mock OpenAI, real gateway + agent loop): src/gateway/gateway.tool-calling.mock-openai.test.ts
  • Gateway wizard (WS wizard.start/wizard.next, writes config + auth enforced): src/gateway/gateway.wizard.e2e.test.ts

# Agent reliability evals (skills)

We already have a few CI-safe tests that behave like “agent reliability evals”:

  • Mock tool-calling through the real gateway + agent loop (src/gateway/gateway.tool-calling.mock-openai.test.ts).
  • End-to-end wizard flows that validate session wiring and config effects (src/gateway/gateway.wizard.e2e.test.ts).

What’s still missing for skills (see Skills):

  • Decisioning: when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
  • Compliance: does the agent read SKILL.md before use and follow required steps/args?
  • Workflow contracts: multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.

Future evals should stay deterministic first:

  • A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
  • A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
  • Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.

# Adding regressions (guidance)

When you fix a provider/model issue discovered in live:

  • Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
  • If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
  • Prefer targeting the smallest layer that catches the bug:
    • provider request conversion/replay bug → direct models test
    • gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test

Released under the MIT License.