Conductor Guide¶

Marianne uses a conductor process to manage job execution, similar to how Docker requires dockerd. The conductor centralizes resource management, maintains a persistent job registry, and provides health monitoring with backpressure.

Why a Conductor?¶

Running jobs through a long-lived conductor process provides several advantages over direct CLI execution:

Centralized resource management. The conductor monitors memory and child process counts, applying backpressure when the system is under load. Jobs submitted during high pressure are rejected rather than allowed to destabilize the system.
Persistent job registry. A SQLite-backed registry tracks all submitted jobs across conductor restarts. Jobs left running when a conductor crashes are automatically marked as failed on the next startup.
Cross-job learning. A single GlobalLearningStore instance is shared across all concurrent jobs. Pattern discoveries in one job are immediately available to others — no cross-process SQLite locking contention.
Health monitoring. Liveness and readiness probes are available over the IPC socket, enabling external monitoring tools to check conductor health.

Quick Start¶

# 1. Start the conductor
mzt start --foreground    # Foreground (logs to console)
mzt start                 # Background (double-fork, logs to file)

# 2. Run a job (routed through conductor)
mzt run my-score.yaml

# 3. Check conductor status
mzt conductor-status

# 4. Stop the conductor
mzt stop

The Conductor Requirement¶

mzt run requires a running conductor. If no conductor is detected, the command exits with an error:

Error: Marianne conductor is not running.
Start it with: mzt start

Exceptions that work without a conductor:

Command	Conductor Required?	Why
`mzt run config.yaml`	Yes	Job execution goes through conductor
`mzt run --dry-run config.yaml`	No	Only validates and displays the job plan
`mzt validate config.yaml`	No	Static config validation, no execution

How Auto-Detection Works¶

When mzt run is invoked:

The CLI calls is_daemon_available() from marianne.daemon.detect
This resolves the socket path (default: /tmp/marianne.sock)
A DaemonClient attempts to open a Unix socket connection
If the connection succeeds, the job is submitted via JSON-RPC
If the connection fails (socket missing, refused, timeout), the CLI reports the conductor is not running

The detection is fail-safe: any exception during detection returns False, ensuring conductor bugs never break the CLI. The CLI simply falls through to the "conductor not running" error message.

Architecture¶

The conductor is composed of several layers:

┌──────────────────────────────────────────────────┐
│                   DaemonProcess                   │
│  Lifecycle owner: boot, signal handling, cleanup  │
├──────────────────────────────────────────────────┤
│                    DaemonServer                   │
│  Unix socket + JSON-RPC 2.0 (NDJSON protocol)    │
├────────────────────┬─────────────────────────────┤
│   RequestHandler   │      HealthChecker          │
│  Method dispatch   │  Liveness + readiness       │
├────────────────────┴─────────────────────────────┤
│                    JobManager                     │
│  Task tracking, concurrency semaphore, shutdown   │
├──────────────┬───────────────┬───────────────────┤
│  JobRegistry │  JobService   │  BackpressureCtrl │
│  SQLite DB   │  Run/resume   │  Memory-based     │
│  persistence │  pause/status │  load management  │
├──────────────┴───────────────┴───────────────────┤
│                     EventBus                      │
│  Async pub/sub for runner callback events         │
├──────────────────────────────────────────────────┤
│         ResourceMonitor + LearningHub             │
│  Periodic checks   │  Cross-job pattern sharing   │
├──────────────────────────────────────────────────┤
│               SemanticAnalyzer                    │
│  LLM-based sheet analysis → learning insights     │
└──────────────────────────────────────────────────┘

Process Model¶

DaemonProcess — Owns the lifecycle. In background mode, performs a classic double-fork to detach from the terminal. Installs signal handlers for SIGTERM and SIGINT.
DaemonServer — Listens on a Unix domain socket. Each client connection is handled as an asyncio task. Messages are newline-delimited JSON (NDJSON), each containing a JSON-RPC 2.0 request or response.
JobManager — Tracks jobs as asyncio.Task instances. Uses a Semaphore to enforce the max_concurrent_jobs limit. Jobs exceeding job_timeout_seconds are cancelled.
JobService — Decoupled execution engine (no CLI dependencies). Handles the full run/resume/pause/status lifecycle for individual jobs.
JobRegistry — SQLite-backed persistent storage. Survives conductor restarts. On startup, orphaned jobs (status queued or running from a previous conductor) are marked as failed.
EventBus — Async pub/sub that routes runner callback events (sheet.started, sheet.completed, sheet.failed, sheet.retrying, sheet.validation_passed/failed, job.cost_update, job.iteration) to downstream consumers. Bounded deques per subscriber prevent slow consumers from blocking publishers.
SemanticAnalyzer — Subscribes to sheet.completed and sheet.failed events via the EventBus. On each event, captures a snapshot of the sheet's execution context (prompt, output, validation results) and sends it to an LLM (configurable model, default Sonnet) for analysis. The LLM response is parsed into structured insights and stored as SEMANTIC_INSIGHT patterns in the global learning store. These patterns are automatically picked up by the existing pattern injection pipeline for future sheets. Concurrency is limited by a semaphore (max_concurrent_analyses, default 3). Analysis failures never affect running jobs.

IPC Protocol¶

The conductor uses JSON-RPC 2.0 over a Unix domain socket with newline-delimited JSON framing:

{"jsonrpc":"2.0","method":"job.submit","params":{"config_path":"/path/to/config.yaml"},"id":1}

Registered RPC methods:

Method	Description
`job.submit`	Submit a new job for execution
`job.status`	Get status of a specific job
`job.pause`	Pause a running job
`job.resume`	Resume a paused job
`job.cancel`	Cancel a running job
`job.list`	List all known jobs
`job.diagnose`	Get diagnostic report for a job
`job.errors`	Get error details for a job
`job.history`	Get execution history for a job
`job.recover`	Recover failed sheets for a job
`daemon.status`	Get conductor status summary (PID, uptime, memory, version)
`daemon.shutdown`	Initiate conductor shutdown
`daemon.health`	Liveness probe
`daemon.ready`	Readiness probe
`daemon.clear_rate_limits`	Clear cached rate limits (all or by instrument)

Backpressure¶

The BackpressureController assesses system pressure based on memory usage (as a percentage of max_memory_mb) and child process count:

Pressure Level	Memory Threshold	Effect
NONE	< 50%	Normal operation
LOW	50–70%	2s delay between sheet dispatches
MEDIUM	70–85%	10s delay between sheet dispatches
HIGH	> 85% memory or high process count	New job submissions rejected
CRITICAL	> 95% memory, > 80% process count, or monitor degraded	Emergency: oldest job (by submission time) may be cancelled

When should_accept_job() returns False (HIGH or CRITICAL pressure), the score is rejected outright with "System under high pressure — try again later."

Rate limits do not cause job rejection. They are per-instrument concerns handled at the sheet dispatch level by the baton — a rate limit on one instrument does not block jobs targeting different instruments.

Configuration¶

The conductor is configured via a YAML file passed to mzt start --config <path>. Without a config file, all defaults are used.

DaemonConfig Fields¶

Essential fields (what you'll typically configure):

Field	Type	Default	Description
`max_concurrent_jobs`	`int`	`15`	Max simultaneous jobs (1–50)
`log_level`	`str`	`"info"`	One of: `debug`, `info`, `warning`, `error`
`job_timeout_seconds`	`float`	`86400.0`	Max wall-clock time per job (24 hours)
`learning.enabled`	`bool`	`true`	Semantic learning on/off
`profiler.enabled`	`bool`	`true`	Resource monitoring on/off

Advanced fields (rarely need changing):

Field	Type	Default	Description
`socket.path`	`Path`	`/tmp/marianne.sock`	Unix domain socket path
`socket.permissions`	`int`	`0o660`	Socket file permissions (octal)
`socket.backlog`	`int`	`5`	Max pending connections
`pid_file`	`Path`	`/tmp/marianne.pid`	PID file path
`shutdown_timeout_seconds`	`float`	`300.0`	Max wait for graceful shutdown (5 min)
`monitor_interval_seconds`	`float`	`15.0`	Resource check interval
`max_job_history`	`int`	`1000`	Terminal jobs kept in memory before eviction
`log_file`	`Path \\| None`	`None`	Log file path (`None` = stderr only)
`learning.*`	`SemanticLearningConfig`	(see below)	Semantic learning via LLM analysis

Semantic Learning (nested under `learning`)¶

Controls the SemanticAnalyzer — LLM-based analysis of sheet completions that produces insights stored in the learning database.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable semantic learning. Defaults to on when the conductor is running.
`backend`	`BackendConfig`	(nested)	Backend configuration for the analysis LLM. Supports any backend type: `claude_cli` (uses Claude Code, no API key needed), `anthropic_api`, `ollama` (free local models), or `recursive_light`. See BackendConfig fields below.
`analyze_on`	`list`	`["success", "failure"]`	Which outcomes to analyze: `success`, `failure`, or both
`max_concurrent_analyses`	`int`	`3`	Max concurrent LLM analysis tasks (1–20)
`entropy_threshold`	`float`	`0.1`	Entropy threshold for triggering diversity injection. When pattern diversity index drops below this value (10% of max entropy by default), the learning system automatically boosts exploration budget to escape local optima.
`exploration_budget`	`float`	`0.15`	Exploration budget boost amount when entropy response triggers. Controls how aggressively the system explores alternatives when pattern diversity collapses.

BackendConfig fields (nested under learning.backend.*):

The backend field accepts the same BackendConfig used by job execution. Common fields include:

Field	Example	Description
`type`	`"anthropic_api"`	Backend type: `anthropic_api`, `claude_cli`, `ollama`, etc.
`model`	`"claude-sonnet-4-5-20250929"`	Model ID for analysis LLM calls
`api_key_env`	`"ANTHROPIC_API_KEY"`	Environment variable containing the API key (not needed for `claude_cli`)
`max_tokens`	`4096`	Max response tokens (256–32768)
`temperature`	`0.3`	Sampling temperature for analytical precision (0.0–1.0)

See the Configuration Reference for all available backend options.

Resource Limits (nested under `resource_limits`)¶

Field	Type	Default	Description
`max_memory_mb`	`int`	`8192`	Max RSS memory before backpressure triggers
`max_processes`	`int`	`50`	Max child processes
`max_api_calls_per_minute`	`int`	`60`	Not yet enforced — reserved for future rate limiting

Additional Fields¶

Field	Type	Default	Description
`state_db_path`	`Path`	`~/.marianne/daemon-state.db`	Job registry database path. Overridden by `--conductor-clone` for clone isolation.
`max_concurrent_sheets`	`int`	`10`	Max concurrent sheets across all jobs (used by the baton when `use_baton: true`)
`use_baton`	`bool`	`true`	Enable the baton execution engine. The baton is the default since Phase 2. Set to `false` to fall back to the legacy monolithic runner.
`preflight.token_warning_threshold`	`int`	`50000`	Token count above which to warn during preflight checks. Set higher for large-context instruments. `0` to disable.
`preflight.token_error_threshold`	`int`	`150000`	Token count above which to error during preflight checks. Set higher for large-context instruments (e.g., `800000` for 1M-context models). `0` to disable.

Example Config File¶

Most users need very little configuration. The defaults are sensible:

# ~/.marianne/conductor.yaml — minimal config
max_concurrent_jobs: 10
log_level: info

For more control:

# ~/.marianne/conductor.yaml — full example
max_concurrent_jobs: 10
job_timeout_seconds: 43200  # 12 hours for long jobs
log_level: debug
log_file: ~/.marianne/marianne.log

resource_limits:
  max_memory_mb: 4096
  max_processes: 30

learning:
  enabled: true
  backend:
    type: anthropic_api
    model: claude-sonnet-4-5-20250929
    temperature: 0.3
  analyze_on: [success, failure]
  max_concurrent_analyses: 3

Daemon Operational Profiles¶

Instead of configuring individual fields, use --profile to apply a preset:

mzt start                        # sensible defaults
mzt start --profile dev          # debug logging, strace on
mzt start --profile intensive    # 48h timeout, high resource limits
mzt start --profile minimal      # profiler + learning off

Profiles are partial overrides applied on top of your config file. Resolution order:

DaemonConfig defaults
~/.marianne/conductor.yaml (if exists)
--profile (if specified)
CLI flags (--log-level, etc.) override everything

Built-in profiles:

Profile	Use Case	Key Overrides
`dev`	Debugging Marianne itself	`log_level: debug`, `max_concurrent_jobs: 2`, `strace_enabled: true`
`intensive`	Long-running production work	`job_timeout: 48h`, `max_memory: 16GB`, `max_processes: 100`
`minimal`	Low-resource environments	`profiler: off`, `learning: off`, `max_concurrent_jobs: 2`

Profiles also work with restart:

mzt restart --profile dev

Live Config Reload (SIGHUP)¶

The conductor supports hot-reloading configuration without a restart. Send SIGHUP to the conductor process and it will re-read the config file from disk:

# Reload config after editing ~/.marianne/conductor.yaml
kill -SIGHUP $(cat /tmp/marianne.pid)

Reloadable fields (take effect immediately): - max_concurrent_jobs — concurrency semaphore is rebuilt - resource_limits.* — memory/process limits updated - log_level — logging reconfigured - job_timeout_seconds, shutdown_timeout_seconds, monitor_interval_seconds

Non-reloadable fields (require conductor restart): - socket.* (path, permissions, backlog) - pid_file

If a non-reloadable field differs from the running config, the conductor logs a warning but continues with the old value.

Live Config Display¶

mzt config show automatically queries the running conductor for its in-memory config. This reflects any SIGHUP reloads. When the conductor is not running, it falls back to reading from disk.

# Shows [live] source when conductor is running
mzt config show

# Validate a config file without starting the conductor
mzt config check
mzt config check --config /path/to/custom.yaml

Monitoring¶

`mzt conductor-status`¶

The primary monitoring command. Queries the conductor via IPC and displays:

Marianne conductor is running (PID 12345)
  Uptime: 2h 15m 30s
  [+] Readiness: ready
  Running jobs: 2
  Memory: 1024.5 MB
  Child processes: 8
  Accepting work: True
  Version: 0.1.0

Readiness shows [+] ready when the conductor is accepting jobs, or [-] not_ready when under pressure, shutting down, experiencing elevated failure rates, or notification delivery is degraded.

Conductor Logs¶

In foreground mode, logs go to stderr in console format. In background mode, logs are structured JSON. If log_file is configured, logs are written there:

# View conductor logs (if log_file is set)
tail -f ~/.marianne/marianne.log

Programmatic Health Probes¶

The daemon.health and daemon.ready IPC methods are available for integration with external monitoring. They are used internally by mzt conductor-status:

Liveness (daemon.health): Returns OK if the conductor can execute the handler — minimal cost, no resource checks.
Readiness (daemon.ready): Returns ready when memory is within limits, failure rate is normal, notifications are functional, and the conductor is not shutting down.

Conductor Clones¶

A clone conductor is an isolated conductor instance used for testing. Clones have their own socket, PID file, state database, and log — so you can test scores and CLI behavior without risking your production conductor.

# Start a clone
mzt --conductor-clone start

# Submit a score to the clone
mzt --conductor-clone run my-test-score.yaml

# Check the clone's status
mzt --conductor-clone conductor-status

# Named clones for parallel testing
mzt --conductor-clone=staging start
mzt --conductor-clone=staging run staging-test.yaml

# Stop when done
mzt --conductor-clone stop

Key behaviors: - The clone inherits your production ~/.marianne/conductor.yaml config. - Clone paths: /tmp/marianne-clone.sock (socket), /tmp/marianne-clone.pid (PID). - Named clones: /tmp/marianne-clone-staging.sock, etc. - Clone names are sanitized (64-character limit, safe characters only). - Commands that don't interact with the conductor (validate, --help) ignore the flag.

See the CLI Reference for full details.

The Baton (Event-Driven Execution Engine)¶

The baton (daemon/baton/) is Marianne's execution engine, using event-driven per-sheet dispatch instead of the legacy monolithic sequential runner. It is the default execution path (use_baton: true) with 1,900+ tests.

Key capabilities: - Event-driven dispatch — sheets dispatch when their dependencies are met and their instrument has capacity, rather than running sequentially in one task - Per-instrument concurrency — each instrument has independent slots, so rate limits on one instrument don't block others - Timer-based retry — backoff delays use a timer wheel instead of asyncio.sleep() - Rate limit auto-resume — when a rate limit is hit, the baton schedules a timer to automatically clear the limit when it expires and resume WAITING sheets. Without this, rate-limited sheets stay blocked until manually cleared via mzt clear-rate-limits. - Restart recovery — baton state persists and reconciles with CheckpointState on restart - Cost enforcement — per-sheet and per-job cost limits enforced after every attempt - Full prompt assembly — the baton renders prompts through the complete pipeline (preamble, template variables, prelude/cadenza injection, validation requirements) - Cross-sheet context — previous_outputs and previous_files are populated from completed sheets' stdout and workspace files, matching the legacy runner's behavior - Checkpoint sync — sheet status changes from all event types (escalation, cancellation, timeout, rate limit expiry, shutdown) are synchronized back to CheckpointState with deduplication to prevent redundant callbacks

The baton is active by default. No configuration needed. To fall back to the legacy runner, set use_baton: false in ~/.marianne/conductor.yaml.

Transition Plan¶

The baton is the mandatory path to multi-instrument execution. Without it, the conductor delegates to the legacy monolithic runner, which silently ignores per-sheet instrument assignments and runs everything through a single backend.

The transition has three phases:

Phase 1: Prove the baton works (complete) - Baton tested with --conductor-clone alongside the production conductor - Per-sheet instrument assignment, rate limits, timeouts verified

Phase 2: Baton as default (complete — D-027) - use_baton default flipped to true in DaemonConfig - Legacy runner remains as fallback for scores that explicitly opt out (use_baton: false) - All new features target the baton path only

Phase 3: Remove the toggle (future) - Delete the legacy runner execution path - Remove use_baton from DaemonConfig - The baton is how the conductor runs — no toggle, no fallback

Important: If you set use_baton: false, per-sheet instrument: assignments will not take effect at runtime — the legacy runner uses a single backend regardless of per-sheet configuration.

Legacy Components¶

These components are still in use through the current execution path:

GlobalSheetScheduler (daemon/scheduler.py) — Priority-based cross-job scheduling. Built but not yet wired into the execution path.
RateLimitCoordinator (daemon/rate_coordinator.py) — Cross-job rate limit state sharing. The write path is active; the read path feeds the scheduler. Stale limits can be cleared manually with mzt clear-rate-limits.

Migration from Pre-Conductor Usage¶

If you previously ran Marianne without a conductor:

Before (direct execution, no longer supported for mzt run):

mzt run my-score.yaml

After (conductor required):

# One-time: start the conductor
mzt start

# Then run jobs as before
mzt run my-score.yaml

# The conductor stays running — no need to restart per job

setup.sh¶

The setup.sh script includes a --daemon flag that installs conductor dependencies (psutil for resource monitoring):

./setup.sh --daemon          # Install with conductor support
./setup.sh --dev --daemon    # Dev + conductor

This installs the daemon extras group, which includes psutil for the ResourceMonitor and SystemProbe components.

Troubleshooting¶

"Marianne conductor is not running"¶

The conductor is not reachable at the default socket path (/tmp/marianne.sock).

# Start the conductor
mzt start

# Verify it's running
mzt conductor-status

"Conductor is already running"¶

A conductor instance is already active. Check with:

mzt conductor-status

If the conductor is unresponsive but the PID file exists, it may have crashed:

# Stop cleans up stale PID files
mzt stop
mzt start

Stale PID File¶

If mzt stop reports "conductor is not running" but a PID file exists, the file is automatically cleaned up. If it persists:

rm /tmp/marianne.pid
mzt start

Permission Errors on Socket¶

The socket is created with 0o660 permissions by default (owner + group read/write). If another user needs access, adjust socket.permissions in the conductor config. The conductor also rejects symlinked socket paths as a security measure.

"Conductor does not support '...'. Restart the conductor"¶

The CLI is newer than the running conductor. This happens when you update Marianne code but don't restart the conductor — the running daemon still has the old code loaded and doesn't recognize new IPC methods.

# Pick up code changes
pip install -e ".[dev]"
mzt restart

"--escalation incompatible with conductor"¶

The --escalation flag requires interactive console prompts which are not available in conductor mode. Escalation is not currently supported when running through the conductor.

Orphan Recovery on Restart¶

When the conductor starts, it checks the persistent registry for jobs that were queued or running during the previous conductor session. These orphaned jobs are automatically marked as failed with the error message "Conductor restarted while job was active." Check for recovered orphans in the conductor logs:

manager.orphans_recovered count=2

Advisory Lock Conflicts¶

The conductor uses fcntl.flock() on the PID file to prevent concurrent starts. If you see "PID file locked" errors, another mzt start is in progress. Wait a moment and retry.

Force Stop¶

If the conductor is unresponsive to graceful shutdown:

mzt stop --force    # Sends SIGKILL instead of SIGTERM

This kills the process immediately without waiting for running jobs to complete. Use as a last resort — running jobs will not checkpoint cleanly.