checkpoint
Checkpoint and state management models.
Defines the state that gets persisted between runs for resumable orchestration.
Classes
ValidationDetailDict
Bases: TypedDict
Schema for individual validation result entries in SheetState.validation_details.
Most keys are optional (total=False) to support partial dicts from
legacy data and simplified test fixtures. passed is required
because a validation result without a pass/fail status is meaningless.
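One way to get "everything optional except `passed`" with only the standard library is to split the schema into a required base and a `total=False` subclass. This is a minimal sketch of that technique, not the module's actual definition: the class names and the `rule` and `message` keys are hypothetical, and only `passed` comes from the description above.

```python
from typing import TypedDict


class _ValidationDetailRequired(TypedDict):
    # The one key every entry must carry.
    passed: bool


class ValidationDetailSketch(_ValidationDetailRequired, total=False):
    # Hypothetical optional keys, for illustration only.
    rule: str
    message: str


# A partial entry (e.g. from legacy data) is valid as long as `passed` is set.
legacy_entry: ValidationDetailSketch = {"passed": False}
full_entry: ValidationDetailSketch = {
    "passed": True,
    "rule": "file_exists",
    "message": "ok",
}

required_keys = ValidationDetailSketch.__required_keys__
optional_keys = ValidationDetailSketch.__optional_keys__
```

A static type checker would reject a dict literal that omits `passed` for this type, while still accepting any subset of the optional keys.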
PromptMetricsDict
Bases: TypedDict
Schema for prompt analysis metrics in SheetState.prompt_metrics.
All keys are optional (total=False) to support partial metrics from legacy data or simplified test fixtures.
ProgressSnapshotDict
Bases: TypedDict
Schema for execution progress snapshots in SheetState.progress_snapshots.
All keys are optional (total=False) because snapshots may contain varying subsets depending on when they were captured.
ErrorContextDict
Bases: TypedDict
Schema for error context in CheckpointErrorRecord.context.
All keys are optional because the context varies by error type. A value may be None when the information is not available.
AppliedPatternDict
Bases: TypedDict
Schema for a single applied pattern in SheetState.applied_patterns.
Replaces the parallel applied_pattern_ids / applied_pattern_descriptions
lists with a single structured list for safety and clarity.
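The safety argument is that two parallel lists can drift out of step, while a list of structured entries cannot. The sketch below illustrates the idea under assumptions: the key names `id` and `description` and both helper functions are hypothetical, since the real schema is not shown on this page.

```python
from typing import TypedDict


class AppliedPatternSketch(TypedDict):
    # Hypothetical key names; the real schema may differ.
    id: str
    description: str


def merge_parallel_lists(
    ids: list[str], descriptions: list[str]
) -> list[AppliedPatternSketch]:
    """Combine legacy parallel lists into one structured list."""
    if len(ids) != len(descriptions):
        # This mismatch is exactly the failure mode parallel lists allow.
        raise ValueError("pattern id and description lists differ in length")
    return [{"id": i, "description": d} for i, d in zip(ids, descriptions)]


def pattern_ids(patterns: list[AppliedPatternSketch]) -> list[str]:
    """Derive the legacy id list from the structured list."""
    return [p["id"] for p in patterns]


patterns = merge_parallel_lists(["p1", "p2"], ["retry early", "split prompt"])
```

Backward-compatible accessors such as `applied_pattern_ids` could then be derived views over the structured list, in the spirit of `pattern_ids` here.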
OutcomeDataDict
Bases: TypedDict
Schema for structured outcome data in SheetState.outcome_data.
All keys are optional because this schema is meant to be extended for learning and pattern recognition.
SynthesisResultDict
Bases: TypedDict
Schema for synthesis result entries in CheckpointState.synthesis_results.
All keys are optional (total=False) to support partial data from legacy state files and test fixtures.
SheetStatus
Bases: str, Enum
Status of a single sheet.
The baton tracks 11 scheduling states. Status display and persistence
use all 11. Consumers that only care about terminal/non-terminal can
check is_terminal.
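A `str`-based enum with an `is_terminal` property might look like the sketch below. The member names and the choice of which states count as terminal are assumptions made for illustration; the real enum has 11 members that are not listed on this page.

```python
from enum import Enum


class SheetStatusSketch(str, Enum):
    # Hypothetical subset of states, for illustration only.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"

    @property
    def is_terminal(self) -> bool:
        """True when no further scheduling is expected for the sheet."""
        return self in _TERMINAL_STATES


# Assumed terminal set; defined after the class so the members exist.
_TERMINAL_STATES = frozenset(
    {
        SheetStatusSketch.COMPLETED,
        SheetStatusSketch.FAILED,
        SheetStatusSketch.SKIPPED,
    }
)
```

Because the enum also subclasses `str`, a member compares equal to its plain string value, which keeps persisted state files readable.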
OutcomeCategory
Bases: str, Enum
Classification of sheet execution outcome (#7).
JobStatus
Bases: str, Enum
Status of an entire job run.
CheckpointErrorRecord
Bases: BaseModel
Record of a single error occurrence during sheet execution.
Stores structured error information for debugging and pattern analysis. Error history is trimmed to MAX_ERROR_HISTORY records per sheet to prevent unbounded state growth.
SheetState
Bases: BaseModel
State for a single sheet.
Attributes
applied_pattern_ids
property
writable
Backward-compatible accessor for pattern IDs.
applied_pattern_descriptions
property
writable
Backward-compatible accessor for pattern descriptions.
has_fallback_available
property
Whether there is another instrument in the fallback chain to try.
Functions
record_attempt
Record an attempt result and update tracking state.
Only non-successful, non-rate-limited attempts increment
normal_attempts. Successes and rate-limited attempts are
recorded in attempt_results for history but don't consume
retry budget.
Source code in src/marianne/core/checkpoint.py
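The budget rule described above can be sketched as follows. The class name, the method parameters, and the shape of the stored result are hypothetical, since the real signature is not shown here; only `normal_attempts` and `attempt_results` come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class AttemptTrackerSketch:
    normal_attempts: int = 0
    attempt_results: list[dict] = field(default_factory=list)

    def record_attempt(self, success: bool, rate_limited: bool = False) -> None:
        # Every attempt is kept for history, whatever its outcome.
        self.attempt_results.append(
            {"success": success, "rate_limited": rate_limited}
        )
        # Only genuine failures consume retry budget.
        if not success and not rate_limited:
            self.normal_attempts += 1


tracker = AttemptTrackerSketch()
tracker.record_attempt(success=False)                     # counts
tracker.record_attempt(success=False, rate_limited=True)  # history only
tracker.record_attempt(success=True)                      # history only
```

After these three calls the history holds three entries but only one unit of retry budget has been spent.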
advance_fallback
Advance to the next instrument in the fallback chain.
Records the transition, switches instrument_name, resets retry budget, and increments current_instrument_index.
Returns the new instrument name, or None if the chain is exhausted.
Source code in src/marianne/core/checkpoint.py
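The four steps above (record, switch, reset, increment) can be sketched as below. This is an illustration under assumptions, not the real implementation: the `fallback_chain` field, the `reason` parameter, and the way `current_instrument_index` indexes into the chain are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FallbackSketch:
    instrument_name: str
    fallback_chain: list[str]  # hypothetical: instruments to try after the primary
    current_instrument_index: int = 0
    normal_attempts: int = 0
    instrument_fallback_history: list[dict[str, str]] = field(default_factory=list)

    @property
    def has_fallback_available(self) -> bool:
        return self.current_instrument_index < len(self.fallback_chain)

    def advance_fallback(self, reason: str) -> Optional[str]:
        if not self.has_fallback_available:
            return None  # chain exhausted
        next_name = self.fallback_chain[self.current_instrument_index]
        # 1. Record the transition.
        self.instrument_fallback_history.append(
            {
                "from": self.instrument_name,
                "to": next_name,
                "reason": reason,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
        )
        # 2. Switch instrument, 3. reset retry budget, 4. advance the index.
        self.instrument_name = next_name
        self.normal_attempts = 0
        self.current_instrument_index += 1
        return next_name


state = FallbackSketch("primary", ["backup-a", "backup-b"], normal_attempts=3)
first = state.advance_fallback("retries exhausted")
```

Resetting the retry budget on each switch gives every instrument in the chain a full set of attempts.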
to_dict
from_dict
classmethod
Restore from a dict. Provided for compatibility with the dataclass SheetExecutionState.
capture_output
Capture tail of stdout/stderr for debugging.
Stores the last max_bytes of each output stream. Sets output_truncated
to True if either stream was larger than the limit.
Credential scanning (F-003): Before storing, both streams are scanned for API key patterns (sk-ant-, AKIA, AIzaSy, Bearer tokens), and matches are replaced with [REDACTED_] placeholders. This prevents leaked credentials from propagating to the learning store, dashboard, diagnostics, and MCP resources.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| stdout | str | Full stdout string from execution. | required |
| stderr | str | Full stderr string from execution. | required |
| max_bytes | int | Maximum bytes to capture per stream (default 10KB). | MAX_OUTPUT_CAPTURE_BYTES |
Source code in src/marianne/core/checkpoint.py
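The redact-then-truncate flow could look roughly like this. It is a sketch under stated assumptions: the regular expressions are illustrative guesses at the listed prefixes, the placeholder text and the returned dict keys are hypothetical, and the 10KB constant follows the description in the table.

```python
import re

MAX_OUTPUT_CAPTURE_BYTES = 10 * 1024  # 10KB, per the parameter description

# Illustrative patterns only; the real scanner's expressions are not shown here.
_CREDENTIAL_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_\-]+"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"AIzaSy[A-Za-z0-9_\-]{33}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),
]
PLACEHOLDER = "[REDACTED]"  # hypothetical placeholder text


def redact(text: str) -> str:
    """Replace anything that looks like a credential."""
    for pattern in _CREDENTIAL_PATTERNS:
        text = pattern.sub(PLACEHOLDER, text)
    return text


def tail_bytes(text: str, max_bytes: int) -> tuple[str, bool]:
    """Return the last max_bytes of text and whether it was truncated."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text, False
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[-max_bytes:].decode("utf-8", errors="ignore"), True


def capture_output(
    stdout: str, stderr: str, max_bytes: int = MAX_OUTPUT_CAPTURE_BYTES
) -> dict:
    # Redact first so truncation cannot leave half a key behind.
    out_tail, out_cut = tail_bytes(redact(stdout), max_bytes)
    err_tail, err_cut = tail_bytes(redact(stderr), max_bytes)
    return {
        "stdout_tail": out_tail,
        "stderr_tail": err_tail,
        "output_truncated": out_cut or err_cut,
    }
```

Scanning before truncation is a deliberate ordering in this sketch: cutting first could slice a key in half and leave a fragment that no pattern matches.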
add_error_to_history
Append an error record and enforce the history size limit.
All callers that add errors to error_history should use this
method instead of appending directly so that the list never exceeds
MAX_ERROR_HISTORY entries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | CheckpointErrorRecord | The error record to add. | required |
Source code in src/marianne/core/checkpoint.py
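Routing every append through one helper is what keeps the list bounded. A minimal sketch of that pattern follows; the limit value and the choice to drop the oldest entries first are assumptions, and the helper name is hypothetical.

```python
MAX_ERROR_HISTORY = 10  # hypothetical value; the real constant is not shown here


def add_to_bounded_history(history: list, record: object, limit: int) -> None:
    """Append a record, then drop the oldest entries beyond the limit."""
    history.append(record)
    overflow = len(history) - limit
    if overflow > 0:
        # Trim in place so other references to the list see the change.
        del history[:overflow]


error_history: list = []
for n in range(MAX_ERROR_HISTORY + 3):
    add_to_bounded_history(error_history, f"error-{n}", MAX_ERROR_HISTORY)
```

The same helper shape would serve `instrument_fallback_history` with its own limit.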
add_fallback_to_history
Append a fallback record and enforce the history size limit.
All callers that add entries to instrument_fallback_history
should use this method instead of appending directly so the list
never exceeds MAX_INSTRUMENT_FALLBACK_HISTORY entries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| record | dict[str, str] | Dict with keys from, to, reason, timestamp. | required |
Source code in src/marianne/core/checkpoint.py
CheckpointState
Bases: BaseModel
Complete checkpoint state for a job run.
This is the primary state object that gets persisted and restored for resumable job execution.
Zombie Detection
A job is considered a "zombie" when the state shows RUNNING status
but the associated process (tracked by pid) is no longer alive.
This can happen when:
- External timeout wrapper sends SIGKILL
- System crash or forced termination
- WSL shutdown while job running
Use is_zombie() to detect this state, and mark_zombie_detected()
to recover from it.
Worktree Isolation
When isolation is enabled, jobs execute in a separate git worktree. The worktree tracking fields record the worktree state for:
- Resume operations (reuse existing worktree)
- Cleanup on completion (remove or preserve based on outcome)
- Debugging (know which worktree was used)
Functions
record_hook_result
Append a hook result to the checkpoint state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | dict[str, Any] | Serialized HookResult dict from hook execution. | required |
Source code in src/marianne/core/checkpoint.py
record_circuit_breaker_change
Record a circuit breaker state transition.
Persists circuit breaker state changes so that mzt status
can display ground-truth CB state instead of inferring it from
failure patterns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| state | str | Current CB state after transition ("closed", "open", "half_open"). | required |
| trigger | str | What caused the transition (e.g., "failure_recorded", "success_recorded"). | required |
| consecutive_failures | int | Number of consecutive failures at time of transition. | required |
Source code in src/marianne/core/checkpoint.py
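A persisted transition record might be built as in the sketch below. The record layout, the timestamp field, and the up-front validation of `state` are assumptions for illustration; only the three parameters and the three state names come from the table above.

```python
from datetime import datetime, timezone

VALID_CB_STATES = frozenset({"closed", "open", "half_open"})


def make_cb_record(state: str, trigger: str, consecutive_failures: int) -> dict:
    """Build one circuit breaker transition record (hypothetical layout)."""
    if state not in VALID_CB_STATES:
        raise ValueError(f"unknown circuit breaker state: {state!r}")
    return {
        "state": state,
        "trigger": trigger,
        "consecutive_failures": consecutive_failures,
        # Timestamp added so a status display can show when the change happened.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


cb_history: list[dict] = []
cb_history.append(make_cb_record("open", "failure_recorded", 5))
```

With records like this in the checkpoint, a status command can read the last entry rather than guessing the breaker state from recent failures.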
add_synthesis
Add or update a synthesis result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_id | str | The batch identifier. | required |
| result | SynthesisResultDict | Synthesis result as dict (from SynthesisResult.to_dict()). | required |
Source code in src/marianne/core/checkpoint.py
get_next_sheet
Determine the next sheet to process.
Returns None if all sheets are complete.
Source code in src/marianne/core/checkpoint.py
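The selection rule can be sketched as a scan for the first unfinished sheet. This assumes 1-based sheet numbers, strictly sequential processing, and that "completed" and "skipped" both count as done; none of these details is confirmed on this page, and the function signature is hypothetical.

```python
from typing import Optional

_DONE = frozenset({"completed", "skipped"})  # assumed "finished" statuses


def get_next_sheet(statuses: dict[int, str], total_sheets: int) -> Optional[int]:
    """Return the lowest-numbered sheet that is not done, or None."""
    for sheet_num in range(1, total_sheets + 1):
        # A sheet with no recorded state is treated as pending.
        if statuses.get(sheet_num, "pending") not in _DONE:
            return sheet_num
    return None


next_sheet = get_next_sheet({1: "completed", 2: "skipped", 3: "failed"}, 4)
```

On resume, a scan like this naturally restarts at the first sheet that failed or never ran.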
mark_sheet_started
Mark a sheet as started.
Source code in src/marianne/core/checkpoint.py
mark_sheet_completed
mark_sheet_completed(sheet_num, validation_passed=True, validation_details=None, execution_duration_seconds=None)
Mark a sheet as completed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sheet_num | int | Sheet number that completed. | required |
| validation_passed | bool | Whether validation checks passed. | True |
| validation_details | list[ValidationDetailDict] \| None | Detailed validation results. | None |
| execution_duration_seconds | float \| None | How long the sheet execution took. | None |
Source code in src/marianne/core/checkpoint.py
mark_sheet_failed
mark_sheet_failed(sheet_num, error_message, error_category=None, exit_code=None, exit_signal=None, exit_reason=None, execution_duration_seconds=None, error_code=None)
Mark a sheet as failed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sheet_num | int | Sheet number that failed. | required |
| error_message | str | Human-readable error description. | required |
| error_category | ErrorCategory \| str \| None | Error category from ErrorClassifier (e.g., "signal", "timeout"). | None |
| exit_code | int \| None | Process exit code (None if killed by signal). | None |
| exit_signal | int \| None | Signal number if killed by signal (e.g., 9=SIGKILL, 15=SIGTERM). | None |
| exit_reason | ExitReason \| None | Why execution ended ("completed", "timeout", "killed", "error"). | None |
| execution_duration_seconds | float \| None | How long the sheet execution took. | None |
| error_code | str \| None | Structured error code (e.g., "E001", "E006"). More specific than error_category; distinguishes stale (E006) from timeout (E001). | None |
Source code in src/marianne/core/checkpoint.py
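The `exit_code` and `exit_signal` parameters are mutually exclusive: one is None whenever the other is set. If the caller runs the sheet through Python's `subprocess` module on POSIX, where a negative `returncode` of `-N` means the child was terminated by signal `N`, the two values could be derived as in this sketch. The helper name is hypothetical and not part of the documented API.

```python
from typing import Optional


def split_returncode(returncode: int) -> tuple[Optional[int], Optional[int]]:
    """Map a subprocess returncode to (exit_code, exit_signal)."""
    if returncode < 0:
        # Killed by a signal: no exit code, signal number is the magnitude.
        return None, -returncode
    return returncode, None


exit_code, exit_signal = split_returncode(-9)  # child received SIGKILL
```

The resulting pair can then be passed as `exit_code=` and `exit_signal=` alongside the required `sheet_num` and `error_message`.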
mark_sheet_skipped
Mark a sheet as skipped.
v21 Evolution: Proactive Checkpoint System - supports skipping sheets via checkpoint response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sheet_num | int | Sheet number to skip. | required |
| reason | str \| None | Optional reason for skipping (stored in error_message field). | None |
Source code in src/marianne/core/checkpoint.py
mark_job_failed
Mark the entire job as failed.
Source code in src/marianne/core/checkpoint.py
mark_job_paused
Mark the job as paused.
Source code in src/marianne/core/checkpoint.py
get_progress
Get progress as (completed, total).
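The `(completed, total)` pair and a percentage derived from it could be computed as below. This is a sketch, assuming that only sheets with a "completed" status count toward progress and that an empty job reports 0 percent; the function signatures are hypothetical.

```python
def get_progress(statuses: dict[int, str], total_sheets: int) -> tuple[int, int]:
    """Return (completed, total) for a job."""
    completed = sum(1 for status in statuses.values() if status == "completed")
    return completed, total_sheets


def get_progress_percent(statuses: dict[int, str], total_sheets: int) -> float:
    """Return completion as a percentage, guarding against division by zero."""
    completed, total = get_progress(statuses, total_sheets)
    if total == 0:
        return 0.0
    return 100.0 * completed / total


progress = get_progress({1: "completed", 2: "failed", 3: "completed"}, 4)
percent = get_progress_percent({1: "completed", 2: "failed", 3: "completed"}, 4)
```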
get_progress_percent
is_zombie
Check if this job is a zombie (RUNNING but process dead).
A zombie state occurs when:
1. Status is RUNNING
2. PID is set
3. Process with that PID is no longer alive
Note: This only checks if the PID is dead. It does NOT use time-based stale detection, as jobs can legitimately run for hours or days.
Returns:
| Type | Description |
|---|---|
| bool | True if job appears to be a zombie, False otherwise. |
Source code in src/marianne/core/checkpoint.py
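The three conditions map directly onto a small predicate. The sketch below uses the POSIX convention that `os.kill(pid, 0)` tests for a process without delivering a signal; whether the real method probes the PID this way is an assumption, and the function names and the injectable `alive` parameter are hypothetical.

```python
import os
from typing import Callable, Optional


def pid_is_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def is_zombie(
    status: str,
    pid: Optional[int],
    alive: Callable[[int], bool] = pid_is_alive,
) -> bool:
    # Conditions 1 and 2: the job claims to be running and has a PID.
    if status != "running" or pid is None:
        return False
    # Condition 3: that process is gone.
    return not alive(pid)


# The current process is certainly alive, so this is not a zombie.
self_check = is_zombie("running", os.getpid())
```

Passing the liveness probe in as a parameter keeps the predicate testable without having to kill a real process.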
mark_zombie_detected
Mark this job as recovered from zombie state.
Changes status from RUNNING to PAUSED, clears PID, and records the zombie recovery in the error message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reason | str \| None | Optional additional context about why zombie was detected. | None |
Source code in src/marianne/core/checkpoint.py
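The recovery step amounts to three field updates. In this sketch the class name, the field names, and the exact wording of the recorded message are hypothetical; only the RUNNING to PAUSED transition, the cleared PID, and the use of the error message come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobStateSketch:
    status: str = "running"
    pid: Optional[int] = None
    error_message: Optional[str] = None

    def mark_zombie_detected(self, reason: Optional[str] = None) -> None:
        dead_pid = self.pid
        # Paused rather than failed, so the job stays resumable.
        self.status = "paused"
        self.pid = None
        message = f"Zombie recovery: process {dead_pid} was no longer alive"
        if reason:
            message += f" ({reason})"
        self.error_message = message


job = JobStateSketch(status="running", pid=4242)
job.mark_zombie_detected(reason="detected on resume")
```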
set_running_pid
Set the PID of the running orchestrator process.
Call this when starting job execution to enable zombie detection. If pid is None, uses the current process PID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pid | int \| None | Process ID to record. Defaults to current process. | None |