retry_strategy
Adaptive retry strategy with intelligent pattern detection.
Analyzes error history to make smart retry decisions, detecting patterns like:
- Rapid consecutive failures → longer exponential backoff
- Same error code repeated → different strategy (may be persistent issue)
- Rate limits → use rate limit delay from error classification
- Transient errors → standard retry with jitter
Example usage

    import asyncio

    from marianne.execution.retry_strategy import (
        AdaptiveRetryStrategy,
        ErrorRecord,
        RetryRecommendation,
    )

    strategy = AdaptiveRetryStrategy()

    # Record errors as they occur
    error_history: list[ErrorRecord] = []
    error_history.append(ErrorRecord.from_classified_error(classified_error))

    # Get retry recommendation
    recommendation = strategy.analyze(error_history)
    if recommendation.should_retry:
        await asyncio.sleep(recommendation.delay_seconds)
        # Retry the operation
    else:
        # Give up or escalate
        logger.error(f"Not retrying: {recommendation.reason}")
Classes
RetryBehavior
Bases: NamedTuple
Precise retry behavior recommendation for a specific error code.
Unlike ErrorCategory which provides broad retry guidelines, RetryBehavior encodes error-code-specific knowledge about optimal retry strategies.
Attributes:
| Name | Type | Description |
|---|---|---|
| delay_seconds | float | Recommended delay before retrying (0 = no delay). |
| is_retriable | bool | Whether this error is generally retriable. |
| reason | str | Human-readable explanation for the retry behavior. |
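As a rough illustration of how a NamedTuple with these three fields might be declared and consumed, here is a minimal sketch. The class below is a stand-in that only mirrors the documented fields; the `EXAMPLE_BEHAVIORS` table, its string keys, and the delay values are hypothetical and not taken from the library.

```python
from typing import NamedTuple


class RetryBehavior(NamedTuple):
    """Stand-in mirroring the three documented fields."""

    delay_seconds: float
    is_retriable: bool
    reason: str


# Hypothetical per-code table; the real codes and delays live in the library.
EXAMPLE_BEHAVIORS: dict[str, RetryBehavior] = {
    "rate_limit": RetryBehavior(60.0, True, "Wait for the rate limit window"),
    "bad_config": RetryBehavior(0.0, False, "Configuration must be fixed first"),
}

# NamedTuple instances unpack like plain tuples.
delay, retriable, reason = EXAMPLE_BEHAVIORS["rate_limit"]
```

Because the type is a tuple, it stays immutable and cheap to share across callers.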
DelayOutcome
dataclass
Record of a delay used and its outcome for learning.
Captures the relationship between delay duration and subsequent success/failure, enabling the system to learn optimal delays for each error type.
Attributes:
| Name | Type | Description |
|---|---|---|
| error_code | ErrorCode | The ErrorCode that triggered the retry. |
| delay_seconds | float | The delay that was actually used before retrying. |
| succeeded_after | bool | Whether the retry succeeded after this delay. |
| timestamp | datetime | When this delay was recorded. |
DelayHistory
Tracks historical delay outcomes for learning optimal delays.
Maintains a record of (error_code, delay, success) tuples to enable the system to learn which delays work best for each error type.
Thread-safe: Uses a threading.Lock to protect all mutable operations. Pruning maintains chronological order by sorting retained outcomes by timestamp after grouping by error code.
Initialize delay history.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_history | int | Maximum number of outcomes to retain per error code. | 100 |
Source code in src/marianne/execution/retry_strategy.py
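The locking and pruning behavior described above can be sketched roughly as follows. This is not the library's implementation: `Outcome` and `SketchDelayHistory` are hypothetical stand-ins, error codes are plain strings, and the exact pruning order is an assumption based on the description (group by error code, keep the newest `max_history` per code, then re-sort by timestamp).

```python
import threading
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Outcome:
    """Hypothetical stand-in for DelayOutcome (error_code is a plain str)."""

    error_code: str
    delay_seconds: float
    succeeded_after: bool
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class SketchDelayHistory:
    """Illustrative only; assumes max_history >= 1."""

    def __init__(self, max_history: int = 100) -> None:
        self._max_history = max_history
        self._outcomes: list[Outcome] = []
        self._lock = threading.Lock()

    def record(self, outcome: Outcome) -> None:
        if outcome is None or outcome.delay_seconds < 0:
            raise ValueError("outcome must be set and delay_seconds >= 0")
        with self._lock:  # append and prune happen atomically
            self._outcomes.append(outcome)
            self._prune()

    def _prune(self) -> None:
        # Group by error code and keep the newest max_history per code.
        by_code: dict[str, list[Outcome]] = defaultdict(list)
        for item in self._outcomes:
            by_code[item.error_code].append(item)
        kept: list[Outcome] = []
        for items in by_code.values():
            items.sort(key=lambda o: o.timestamp)
            kept.extend(items[-self._max_history:])
        # Restore chronological order across all codes.
        kept.sort(key=lambda o: o.timestamp)
        self._outcomes = kept

    def query_for_error_code(self, code: str) -> list[Outcome]:
        with self._lock:
            return [o for o in self._outcomes if o.error_code == code]
```

Holding one lock around both the append and the prune keeps readers from ever seeing a history that exceeds the per-code limit.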
Functions
record
Record a delay outcome.
Thread-safe: Uses lock to protect append and pruning operations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| outcome | DelayOutcome | The delay outcome to record. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If outcome is None or outcome.delay_seconds is negative. |
query_for_error_code
Query outcomes for a specific error code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| code | ErrorCode | The error code to query. | required |
Returns:
| Type | Description |
|---|---|
| list[DelayOutcome] | List of DelayOutcome for this error code. |
get_average_successful_delay
Get average delay that led to success for an error code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| code | ErrorCode | The error code to query. | required |
Returns:
| Type | Description |
|---|---|
| float \| None | Average successful delay in seconds, or None if no successful samples. |
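A minimal sketch of the averaging rule, assuming the outcomes for one error code have already been reduced to `(delay_seconds, succeeded_after)` pairs. The helper name `average_successful_delay` and the tuple representation are hypothetical simplifications, not the library's API.

```python
from statistics import fmean
from typing import Optional


def average_successful_delay(outcomes: list[tuple[float, bool]]) -> Optional[float]:
    """Mean of delays whose retry succeeded, or None when none succeeded."""
    successful = [delay for delay, succeeded in outcomes if succeeded]
    return fmean(successful) if successful else None


samples = [(10.0, False), (20.0, True), (40.0, True)]
average = average_successful_delay(samples)
```

Failed delays are ignored entirely here, so a single long successful delay can dominate the average when samples are scarce.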
RetryPattern
Bases: str, Enum
Detected error patterns that influence retry strategy.
Each pattern triggers a different retry behavior to maximize the chance of recovery while minimizing wasted attempts.
Attributes
- NONE: No clear pattern detected - use default retry behavior.
- RAPID_FAILURES: Multiple failures in quick succession - needs longer cooldown.
- REPEATED_ERROR_CODE: Same error code appearing repeatedly - may be persistent issue.
- RATE_LIMITED: Rate limiting detected - use rate limit wait time.
- CASCADING_FAILURES: Errors are getting worse/different - system may be degrading.
- INTERMITTENT: Errors are spread out with successes in between - normal transient.
- RECOVERY_IN_PROGRESS: Recent success after failures - system may be recovering.
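Since the enum derives from both `str` and `Enum`, its members behave like strings when compared or serialized. The sketch below shows that behavior with a hypothetical two-member stand-in; the lower-case member values are assumed for illustration and are not confirmed by this page.

```python
import json
from enum import Enum


class Pattern(str, Enum):
    """Hypothetical stand-in; member values are assumed, not the library's."""

    NONE = "none"
    RATE_LIMITED = "rate_limited"


# Members compare equal to their string values and serialize as plain strings.
payload = json.dumps({"pattern": Pattern.RATE_LIMITED})

# A stored string can be turned back into the member by value lookup.
restored = Pattern("none")
```

This is what makes fields such as `detected_pattern` convenient to emit in structured logs.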
ErrorRecord
dataclass
ErrorRecord(timestamp, error_code, category, message, exit_code=None, exit_signal=None, retriable=True, suggested_wait=None, sheet_num=None, attempt_num=1, monotonic_time=monotonic(), root_cause_confidence=None, secondary_error_count=0)
Record of a single error occurrence for pattern analysis.
Captures all relevant information about an error to enable intelligent pattern detection across multiple errors.
Attributes:
| Name | Type | Description |
|---|---|---|
| timestamp | datetime | When the error occurred (UTC). |
| error_code | ErrorCode | Structured error code (e.g., E001, E101). |
| category | ErrorCategory | High-level error category (rate_limit, transient, etc.). |
| message | str | Human-readable error description. |
| exit_code | int \| None | Process exit code if applicable. |
| exit_signal | int \| None | Signal number if killed by signal. |
| retriable | bool | Whether this specific error is retriable. |
| suggested_wait | float \| None | Classifier's suggested wait time in seconds. |
| sheet_num | int \| None | Sheet number where error occurred. |
| attempt_num | int | Which attempt number this was (1-indexed). |
| monotonic_time | float | Monotonic timestamp for precise timing calculations. |
| root_cause_confidence | float \| None | Confidence in root cause identification (0.0-1.0). |
| secondary_error_count | int | Number of secondary errors detected. |
Functions
from_classified_error
classmethod
Create an ErrorRecord from a ClassifiedError.
This is the primary factory method for creating ErrorRecords in the retry flow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | ClassifiedError | ClassifiedError from the error classifier. | required |
| sheet_num | int \| None | Optional sheet number for context. | None |
| attempt_num | int | Which retry attempt this represents. | 1 |
Returns:
| Type | Description |
|---|---|
| ErrorRecord | ErrorRecord populated from the classified error. |
from_classification_result
classmethod
Create an ErrorRecord from a ClassificationResult.
This factory method captures root cause information from the multi-error classification, including confidence in root cause identification and the count of secondary errors. This enables the retry strategy to consider root cause confidence when making retry decisions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | ClassificationResult | ClassificationResult from classify_execution(). | required |
| sheet_num | int \| None | Optional sheet number for context. | None |
| attempt_num | int | Which retry attempt this represents. | 1 |
Returns:
| Type | Description |
|---|---|
| ErrorRecord | ErrorRecord with root cause confidence and secondary error count. |
Raises:
| Type | Description |
|---|---|
| ValueError | If confidence is not in valid range [0.0, 1.0]. |
to_dict
Convert to dictionary for logging/serialization.
Returns:
| Type | Description |
|---|---|
| dict[str, object] | Dictionary representation with all fields. |
RetryRecommendation
dataclass
RetryRecommendation(should_retry, delay_seconds, reason, confidence, detected_pattern=NONE, strategy_used='default', root_cause_confidence=None)
Recommendation from the adaptive retry strategy.
Encapsulates the decision of whether to retry, how long to wait, and the reasoning behind the decision for observability.
Attributes:
| Name | Type | Description |
|---|---|---|
| should_retry | bool | Whether a retry should be attempted. |
| delay_seconds | float | Recommended delay before retrying. |
| reason | str | Human-readable explanation of the decision. |
| confidence | float | Confidence in this recommendation (0.0-1.0). |
| detected_pattern | RetryPattern | The pattern that influenced this decision. |
| strategy_used | str | Name of the strategy/heuristic that was applied. |
| root_cause_confidence | float \| None | Confidence in root cause identification (0.0-1.0, None if N/A). |
Functions
__post_init__
Validate confidence is in valid range.
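A range check in `__post_init__` might look roughly like the sketch below. `SketchRecommendation` is a hypothetical, reduced stand-in for the dataclass; raising `ValueError` is an assumption, since this page does not state which exception the real validation raises.

```python
from dataclasses import dataclass


@dataclass
class SketchRecommendation:
    """Hypothetical reduced version of RetryRecommendation."""

    should_retry: bool
    delay_seconds: float
    reason: str
    confidence: float

    def __post_init__(self) -> None:
        # Reject confidence values outside the documented 0.0-1.0 range.
        # ValueError is assumed here; the real exception type is not documented.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(
                f"confidence must be in [0.0, 1.0], got {self.confidence}"
            )


ok = SketchRecommendation(True, 10.0, "transient error", 0.8)
```

Validating at construction time means downstream code can rely on the range without re-checking it.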
to_dict
Convert to dictionary for logging/serialization.
Returns:
| Type | Description |
|---|---|
| dict[str, object] | Dictionary representation with all fields. |
RetryStrategyConfig
dataclass
RetryStrategyConfig(base_delay=10.0, max_delay=API_RATE_LIMIT, exponential_base=2.0, rapid_failure_window=60.0, rapid_failure_threshold=3, rapid_failure_multiplier=2.0, repeated_error_threshold=2, repeated_error_strategy_change_threshold=3, min_confidence=0.3, jitter_factor=0.25)
Configuration for the adaptive retry strategy.
All timing values are in seconds. Thresholds are tuned for typical Claude CLI execution patterns.
Attributes:
| Name | Type | Description |
|---|---|---|
| base_delay | float | Starting delay for exponential backoff. |
| max_delay | float | Maximum delay cap. |
| exponential_base | float | Multiplier for exponential backoff. |
| rapid_failure_window | float | Window (seconds) to detect rapid failures. |
| rapid_failure_threshold | int | Number of failures in window to trigger. |
| rapid_failure_multiplier | float | Extra delay multiplier for rapid failures. |
| repeated_error_threshold | int | Same error code count before flagging. |
| repeated_error_strategy_change_threshold | int | Count before strategy change. |
| min_confidence | float | Minimum confidence for retry recommendation. |
| jitter_factor | float | Random jitter to add (0.0-1.0 of delay). |
Functions
__post_init__
Validate configuration values.
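To show how `base_delay`, `exponential_base`, `max_delay`, and `jitter_factor` could combine, here is a hedged sketch of capped exponential backoff with additive jitter. The function name `backoff_delay`, the cap of 3600 seconds (the real default is `API_RATE_LIMIT`, whose value is not shown on this page), and the choice to apply jitter after the cap are all assumptions.

```python
import random
from typing import Optional


def backoff_delay(
    attempt_num: int,
    base_delay: float = 10.0,
    exponential_base: float = 2.0,
    max_delay: float = 3600.0,  # assumed cap; the real default is API_RATE_LIMIT
    jitter_factor: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a delay in seconds for a 1-indexed attempt number."""
    rng = rng or random.Random()
    # Exponential growth: base, base*2, base*4, ... for attempts 1, 2, 3, ...
    raw = base_delay * exponential_base ** (attempt_num - 1)
    capped = min(raw, max_delay)
    # Additive jitter of up to jitter_factor * delay (applied after the cap).
    jitter = capped * jitter_factor * rng.random()
    return capped + jitter


third_attempt = backoff_delay(3, jitter_factor=0.0)
```

Jitter spreads out retries from concurrent workers so they do not all hit the service at the same instant.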
LearnedDelayCircuitBreaker
Protects learned delays from bad outcomes via a 3-strike rule.
Tracks consecutive failures per error code when using learned delays. After 3+ consecutive failures, the breaker "opens" and the system reverts to static delays for that error code.
State is intentionally ephemeral (not persisted) — a fresh AdaptiveRetryStrategy gets a clean breaker so it can re-evaluate.
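The 3-strike rule can be sketched as a small per-code counter. `SketchBreaker` is a hypothetical stand-in: the method names follow this page, but the signatures, the use of plain strings for error codes, and the reset-on-success behavior (implied by "consecutive") are assumptions.

```python
from collections import defaultdict


class SketchBreaker:
    """Illustrative 3-strike breaker keyed by error code (plain str here)."""

    def __init__(self, threshold: int = 3) -> None:
        self._threshold = threshold
        self._failures: dict[str, int] = defaultdict(int)
        self._open: set[str] = set()

    def is_enabled(self, code: str) -> bool:
        """True while learned delays may still be used for this code."""
        return code not in self._open

    def record_outcome(self, code: str, succeeded: bool) -> None:
        if succeeded:
            self._failures[code] = 0  # a success breaks the failure streak
            return
        self._failures[code] += 1
        if self._failures[code] >= self._threshold:
            self._open.add(code)  # breaker opens: fall back to static delays

    def reset(self, code: str) -> None:
        self._failures[code] = 0
        self._open.discard(code)


breaker = SketchBreaker()
for _ in range(3):
    breaker.record_outcome("E101", succeeded=False)
```

Because nothing here is written to disk, constructing a new breaker starts every error code with a clean slate, matching the ephemeral design described above.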
Functions
is_enabled
record_outcome
Record a retry outcome; trip breaker after consecutive failures.
reset
Reset breaker for an error code, re-enabling learned delays.
AdaptiveRetryStrategy
Intelligent retry strategy that analyzes error patterns.
The strategy examines error history to detect patterns and make informed retry decisions. Key features:
- Rapid Failure Detection: If multiple errors occur in a short window, applies longer backoff to avoid overwhelming the system.
- Repeated Error Detection: If the same error code appears repeatedly, may recommend different strategies or lower confidence.
- Rate Limit Handling: Uses suggested wait times from rate limit errors, with additional buffer.
- Cascading Failure Detection: If errors are getting different/worse, may recommend stopping to prevent further damage.
- Recovery Detection: If recent attempts succeeded after failures, uses shorter delays to capitalize on recovery.
- Delay Learning with Circuit Breaker: When a DelayHistory is provided, the strategy learns optimal delays from past outcomes. A circuit breaker protects against bad learned delays by reverting to static delays after 3 consecutive failures.
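As one concrete illustration of the features listed above, rapid failure detection could be approximated by counting errors inside a trailing time window. The helper `is_rapid_failure` is hypothetical; the defaults mirror `rapid_failure_window=60.0` and `rapid_failure_threshold=3`, while anchoring the window at the most recent error is an assumption of this sketch.

```python
def is_rapid_failure(
    monotonic_times: list[float],
    window: float = 60.0,
    threshold: int = 3,
) -> bool:
    """True when at least `threshold` errors fall in the trailing window.

    `monotonic_times` holds the monotonic_time of each error in
    chronological order. The window is measured back from the latest error.
    """
    if not monotonic_times:
        return False
    latest = monotonic_times[-1]
    recent = [t for t in monotonic_times if latest - t <= window]
    return len(recent) >= threshold


burst = is_rapid_failure([0.0, 10.0, 20.0])
spread_out = is_rapid_failure([0.0, 100.0, 200.0])
```

Using monotonic timestamps rather than wall-clock time keeps the window stable across system clock adjustments.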
Circuit Breaker State Design
The circuit breaker state (_learned_delay_failures, _use_learned_delay) is intentionally ephemeral and NOT persisted. This is a deliberate design choice with the following trade-offs:
Benefits:
- After restart, the system gets a "fresh start" to try learned delays
- Avoids persisting potentially stale circuit breaker state
- Simple implementation without additional state management
Trade-offs:
- After restart, may retry with a previously-failed learned delay once
- Circuit breaker will re-trigger after 3 failures if the learned delay is still problematic
The DelayHistory itself CAN be persisted (it's just delay outcomes), but the circuit breaker resets on each AdaptiveRetryStrategy instantiation. Use reset_circuit_breaker() to manually reset circuit breaker state for a specific error code during runtime.
Thread-safe: No mutable state; all analysis is based on input history.
Example

    strategy = AdaptiveRetryStrategy()

    # Analyze error history
    recommendation = strategy.analyze(error_history)

    # Log the decision
    logger.info(
        "retry_decision",
        should_retry=recommendation.should_retry,
        delay=recommendation.delay_seconds,
        pattern=recommendation.detected_pattern.value,
        reason=recommendation.reason,
    )
Initialize the adaptive retry strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | RetryStrategyConfig \| None | Optional configuration. Uses defaults if not provided. | None |
| delay_history | DelayHistory \| None | Optional delay history for learning. If not provided, learning features are disabled (purely static delays). | None |
| global_learning_store | GlobalLearningStore \| None | Optional global learning store for cross-workspace learned delays (Evolution #3: Learned Wait Time Injection). If provided, blend_historical_delay() will query global store for cross-workspace learned delays when in-memory history is insufficient. | None |
Functions
analyze
Analyze error history and recommend retry behavior.
This is the main entry point for the adaptive retry strategy. It examines the error history to detect patterns and returns a recommendation with reasoning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error_history | list[ErrorRecord] | List of ErrorRecords in chronological order. | required |
| max_retries | int \| None | Optional maximum retries to consider (for confidence). | None |
Returns:
| Type | Description |
|---|---|
| RetryRecommendation | RetryRecommendation with decision, delay, and reasoning. |
blend_historical_delay
Blend learned delay with static delay for an error code.
Priority order:
1. Circuit breaker override → static
2. In-memory delay history (job-specific learning)
3. Global learning store (cross-workspace learned delays)
4. Static delay (fallback)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error_code | ErrorCode | The error code to get delay for. | required |
| static_delay | float | The static delay from ErrorCode.get_retry_behavior(). | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, str] | Tuple of (blended_delay, strategy_name). |
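The priority order for this method can be sketched as a simple fallback chain. `pick_delay` is a hypothetical helper, and the strategy names it returns are invented for illustration. The sketch only selects a source; any weighting the real method applies when blending learned and static delays is not documented here and is left out.

```python
from typing import Optional


def pick_delay(
    static_delay: float,
    breaker_open: bool,
    local_average: Optional[float],
    global_average: Optional[float],
) -> tuple[float, str]:
    """Walk the documented priority order and return (delay, strategy_name)."""
    if breaker_open:
        # 1. Circuit breaker override: learned delays are distrusted.
        return static_delay, "static_circuit_breaker"
    if local_average is not None:
        # 2. In-memory, job-specific learning.
        return local_average, "learned_local"
    if global_average is not None:
        # 3. Cross-workspace learned delay.
        return global_average, "learned_global"
    # 4. Nothing learned yet: fall back to the static delay.
    return static_delay, "static"


choice = pick_delay(30.0, breaker_open=False, local_average=None, global_average=45.0)
```

Checking the breaker first ensures a tripped code never reaches the learned sources, regardless of how much history exists.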
record_delay_outcome
Record the outcome of a retry delay for learning.
Should be called after each retry attempt to update the delay history. Also updates circuit breaker state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error_code | ErrorCode | The error code that was being retried. | required |
| delay_used | float | The delay in seconds that was used. | required |
| succeeded | bool | Whether the retry succeeded after this delay. | required |
reset_circuit_breaker
Reset circuit breaker for an error code, re-enabling learned delays.
Call this method when you want to give learned delays another chance after the circuit breaker has tripped. Common scenarios:
- After manual intervention that fixed the underlying issue
- After a cooling-off period with successful static delays
- At the start of a new batch/job where conditions may have changed
Note: The circuit breaker state is ephemeral (not persisted), so it automatically resets when a new AdaptiveRetryStrategy is instantiated. This method is for resetting during runtime without reinstantiation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error_code | ErrorCode | The error code to reset circuit breaker for. | required |
Example

    # After manual fix, give learned delays another chance
    strategy.reset_circuit_breaker(ErrorCode.E101)