profiler¶
Profiler package — system resource collection, storage, and anomaly detection.
Collects per-process metrics, GPU stats, strace summaries, and stores
time-series data in SQLite for consumption by mzt top and the
anomaly/correlation analyzers.
Classes¶
AnomalyDetector
¶
Detects resource anomalies by comparing snapshots against thresholds.
Runs on each new snapshot collected by ProfilerCollector. Stateless
except for the configuration — all history is passed in via the
detect method.
Source code in src/marianne/daemon/profiler/anomaly.py
Functions¶
detect
¶
Run all anomaly checks against current snapshot and history.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `current` | `SystemSnapshot` | The most recent system snapshot. | *required* |
| `history` | `list[SystemSnapshot]` | Recent snapshots (oldest-first) for trend analysis. Should cover at least the configured spike window. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[Anomaly]` | List of detected anomalies. |
Source code in src/marianne/daemon/profiler/anomaly.py
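The stateless shape described above can be sketched as follows. `Snapshot` and `Anomaly` here are minimal stand-ins for the real models, and the field names and the 500 MB spike threshold are illustrative assumptions, not the package's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    rss_mb: float
    zombie_pids: list = field(default_factory=list)

@dataclass
class Anomaly:
    kind: str
    detail: str

def detect(current: Snapshot, history: list[Snapshot],
           spike_mb: float = 500.0) -> list[Anomaly]:
    """Stateless check: all history arrives as an argument, none is kept."""
    anomalies = []
    # MEMORY_SPIKE: RSS grew more than the threshold over the window.
    if history and current.rss_mb - history[0].rss_mb > spike_mb:
        delta = current.rss_mb - history[0].rss_mb
        anomalies.append(Anomaly("memory_spike", f"RSS +{delta:.0f} MB"))
    # ZOMBIE: any zombie children in the current snapshot.
    if current.zombie_pids:
        anomalies.append(Anomaly("zombie", f"{len(current.zombie_pids)} zombie(s)"))
    return anomalies
```

Because nothing is stored between calls, the same detector instance can safely be reused across snapshots.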
ProfilerCollector
¶
Central orchestrator for the daemon profiler subsystem.
Coordinates:
- Periodic metric collection (system + per-process + GPU + strace)
- SQLite + JSONL persistence via MonitorStorage
- Heuristic anomaly detection via AnomalyDetector
- EventBus integration for monitor.anomaly events
- Process lifecycle tracking via sheet.started/completed/failed
Parameters¶
config:
Profiler configuration (interval, storage paths, thresholds).
monitor:
The daemon's ResourceMonitor for system-level metrics.
pgroup:
The daemon's ProcessGroupManager for child process enumeration.
event_bus:
The daemon's EventBus for publishing anomaly events and
subscribing to sheet lifecycle events.
manager:
Optional JobManager for mapping PIDs to job_id/sheet_num
and reading running job / active sheet counts.
Source code in src/marianne/daemon/profiler/collector.py
Functions¶
start
async
¶
Initialize storage, subscribe to events, start collection loop.
Source code in src/marianne/daemon/profiler/collector.py
stop
async
¶
Stop collection loop, detach strace, unsubscribe from events.
Source code in src/marianne/daemon/profiler/collector.py
collect_snapshot
async
¶
Gather all metrics into a single SystemSnapshot.
Steps:
1. System memory via SystemProbe
2. Per-process metrics via psutil (with PID → job mapping)
3. GPU metrics via GpuProbe
4. Load average via os.getloadavg()
5. Strace summaries for attached PIDs
6. Pressure level from BackpressureController
7. Running jobs / active sheets from JobManager
Source code in src/marianne/daemon/profiler/collector.py
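A stdlib-only sketch of steps 2 and 4 above (the real collector uses SystemProbe, psutil, and GpuProbe; the `/proc` read here is a Linux-only simplification and the dict shape is an assumption):

```python
import os

def collect_snapshot(pids: list[int]) -> dict:
    """Illustrative subset of the collection steps; not the real API."""
    snapshot = {
        # Step 4: load average via os.getloadavg() (Unix-only stdlib call).
        "load_avg": os.getloadavg(),
        # Step 2 (simplified): RSS per PID in MB, read from /proc on Linux.
        "processes": {},
    }
    page_size = os.sysconf("SC_PAGE_SIZE")
    for pid in pids:
        try:
            with open(f"/proc/{pid}/statm") as fh:
                resident_pages = int(fh.read().split()[1])
            snapshot["processes"][pid] = resident_pages * page_size // (1024 * 1024)
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited between enumeration and read
    return snapshot
```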
get_resource_context_for_pid
¶
Get current resource context for a specific PID.
Returns a dict suitable for embedding in sheet event data:
rss_mb, cpu_pct, syscall_hotspot, anomalies_active.
If the PID is not found in the latest snapshot, returns a dict with all values set to None/empty.
Source code in src/marianne/daemon/profiler/collector.py
get_resource_context
¶
Get general resource context (not PID-specific).
Useful when no specific PID is available for the event.
Source code in src/marianne/daemon/profiler/collector.py
get_latest_snapshot
¶
Return the latest snapshot as a JSON-serializable dict.
Used by the daemon.top IPC method.
Source code in src/marianne/daemon/profiler/collector.py
get_jsonl_path
¶
Return the JSONL streaming log path.
Used by the daemon.top.stream IPC method.
Source code in src/marianne/daemon/profiler/collector.py
get_recent_events
¶
Return recent process events as JSON-serializable dicts.
Used by the daemon.events IPC method.
Source code in src/marianne/daemon/profiler/collector.py
CorrelationAnalyzer
¶
Periodic statistical analysis of resource usage vs. job outcomes.
Cross-references profiler snapshots (peak memory, CPU, syscall distributions, anomalies) with job success/failure outcomes from the learning store to identify predictive patterns.
Lifecycle:
analyzer = CorrelationAnalyzer(storage, learning_hub, config)
await analyzer.start(event_bus)
# ... periodic analysis runs automatically ...
await analyzer.stop()
Source code in src/marianne/daemon/profiler/correlation.py
Functions¶
start
async
¶
Start the periodic analysis loop.
The event_bus parameter is accepted for interface consistency with other daemon components but is not currently used by the correlation analyzer (it reads from storage, not events).
Source code in src/marianne/daemon/profiler/correlation.py
stop
async
¶
Stop the periodic analysis loop.
Source code in src/marianne/daemon/profiler/correlation.py
analyze
async
¶
Run correlation analysis on completed jobs.
Steps:
1. Query completed jobs from storage (last 7 days)
2. For each job: get peak memory, total CPU, syscall distribution
3. Cross-reference with job outcomes from learning store
4. Statistical analysis:
   - Memory vs failure rate (binned histogram)
   - Syscall hotspots vs failure rate
   - Anomaly presence vs failure rate
   - Execution duration vs failure rate
5. Generate RESOURCE_CORRELATION patterns for confidence > 0.6
6. Store in LearningHub

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of generated correlation dicts (for testing/logging). |
Source code in src/marianne/daemon/profiler/correlation.py
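Step 4's binned memory-vs-failure-rate histogram can be sketched as below; the function name and the 256 MB bin width are illustrative assumptions, not the analyzer's actual parameters:

```python
def binned_failure_rate(jobs: list[tuple[float, bool]],
                        bin_mb: int = 256) -> dict[int, float]:
    """Bin jobs by peak memory and compute the failure rate per bin.

    jobs: (peak_memory_mb, failed) pairs cross-referenced from storage
    and the learning store.
    """
    bins: dict[int, list[int]] = {}
    for peak_mb, failed in jobs:
        floor = int(peak_mb // bin_mb) * bin_mb  # bin lower edge in MB
        bins.setdefault(floor, []).append(1 if failed else 0)
    return {floor: sum(flags) / len(flags) for floor, flags in sorted(bins.items())}

# Per step 5, only high-confidence bins would become patterns, e.g.:
# {floor: rate for floor, rate in rates.items() if rate > 0.6}
```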
GpuMetric
dataclass
¶
Snapshot of a single GPU's current state.
GpuProbe
¶
GPU resource probes following the SystemProbe pattern.
Each method tries pynvml first, then falls back to nvidia-smi. Returns empty list when no GPU is available — callers treat that as "no GPU present" (not an error).
Functions¶
get_gpu_metrics
staticmethod
¶
Get current metrics for all GPUs.
Priority:
- pynvml (fast, in-process)
- nvidia-smi subprocess fallback
- Empty list (no GPU / no drivers)
Returns:

| Type | Description |
|---|---|
| `list[GpuMetric]` | List of GpuMetric, one per GPU. Empty if no GPU available. |
Source code in src/marianne/daemon/profiler/gpu_probe.py
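The fallback chain can be sketched as follows. `GpuMetric` is a minimal stand-in for the real dataclass, the NVML query is elided, and the helper names are assumptions; the `nvidia-smi` flags shown are real:

```python
import shutil
import subprocess
from dataclasses import dataclass

@dataclass
class GpuMetric:  # minimal stand-in for the real dataclass
    index: int
    util_pct: float
    mem_used_mb: float

def parse_smi_csv(output: str) -> list[GpuMetric]:
    """Parse `nvidia-smi --query-gpu=index,utilization.gpu,memory.used
    --format=csv,noheader,nounits` output."""
    metrics = []
    for line in output.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        metrics.append(GpuMetric(int(idx), float(util), float(mem)))
    return metrics

def get_gpu_metrics() -> list[GpuMetric]:
    """Fallback chain from the docs: pynvml -> nvidia-smi -> empty list."""
    try:
        import pynvml  # third-party, frequently absent
    except ImportError:
        pynvml = None
    if pynvml is not None:
        # Real code would query NVML here (nvmlInit, nvmlDeviceGetCount, ...);
        # omitted in this sketch, which falls through to nvidia-smi.
        pass
    if shutil.which("nvidia-smi") is None:
        return []  # no GPU / no drivers: "no GPU present", not an error
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    return parse_smi_csv(proc.stdout) if proc.returncode == 0 else []
```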
get_gpu_metrics_async
async
staticmethod
¶
Async variant of get_gpu_metrics.
Uses asyncio.create_subprocess_exec for the nvidia-smi fallback
so it doesn't block the event loop.
Returns:

| Type | Description |
|---|---|
| `list[GpuMetric]` | List of GpuMetric, one per GPU. Empty if no GPU available. |
Source code in src/marianne/daemon/profiler/gpu_probe.py
is_available
staticmethod
¶
Check whether any GPU probing method is available.
Returns True if pynvml is importable OR nvidia-smi is on PATH.
Source code in src/marianne/daemon/profiler/gpu_probe.py
Anomaly
¶
Bases: BaseModel
A detected resource anomaly.
Produced by AnomalyDetector when heuristic thresholds are exceeded.
Published to EventBus as monitor.anomaly events and stored as
RESOURCE_ANOMALY patterns in the learning system.
AnomalyConfig
¶
Bases: BaseModel
Thresholds for anomaly detection.
AnomalySeverity
¶
Bases: str, Enum
Severity level for detected anomalies.
AnomalyType
¶
Bases: str, Enum
Types of resource anomalies the detector can identify.
Attributes¶
MEMORY_SPIKE
class-attribute
instance-attribute
¶
RSS increased >threshold in recent window.
RUNAWAY_PROCESS
class-attribute
instance-attribute
¶
Child process consuming >threshold CPU for extended duration.
ZOMBIE
class-attribute
instance-attribute
¶
One or more zombie child processes found.
FD_EXHAUSTION
class-attribute
instance-attribute
¶
Process approaching file descriptor limits.
CorrelationConfig
¶
Bases: BaseModel
Configuration for the periodic correlation analyzer.
EventType
¶
Bases: str, Enum
Process lifecycle event types.
ProcessEvent
¶
Bases: BaseModel
Lifecycle event for a child process (spawn, exit, signal, kill, oom).
ProcessMetric
¶
Bases: BaseModel
Resource metrics for a single process in a snapshot.
ProfilerConfig
¶
Bases: BaseModel
Top-level profiler configuration for DaemonConfig.profiler.
Controls data collection, storage, anomaly detection thresholds, and correlation analysis frequency.
ResourceEstimate
¶
Bases: BaseModel
Scheduling hint based on learned resource correlations.
Returned by BackpressureController.estimate_job_resource_needs() to inform job admission and scheduling decisions.
RetentionConfig
¶
Bases: BaseModel
Data retention policy for profiler storage.
SystemSnapshot
¶
Bases: BaseModel
Point-in-time system resource snapshot.
Collected periodically by ProfilerCollector, stored in SQLite + JSONL, and consumed by AnomalyDetector and CorrelationAnalyzer.
MonitorStorage
¶
Async SQLite + JSONL storage for profiler time-series data.
Uses aiosqlite for non-blocking database access and WAL mode
for safe concurrent reads (mzt top) while the daemon writes.
Parameters¶
db_path:
Path to the SQLite database file. Parent directories are created automatically.
jsonl_path:
Optional path for the NDJSON streaming log. When provided, each snapshot is also appended as a single JSON line.
jsonl_max_bytes:
Maximum JSONL file size before rotation (default 50 MB).
Source code in src/marianne/daemon/profiler/storage.py
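The setup described above can be sketched with the stdlib `sqlite3` module (the real class uses aiosqlite; the table schema here is a placeholder, not the package's actual schema):

```python
import sqlite3
from pathlib import Path

def open_db(db_path: Path) -> sqlite3.Connection:
    """Sketch of the storage setup: auto-created parents + WAL mode."""
    db_path.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs
    conn = sqlite3.connect(db_path)
    # WAL mode allows safe concurrent reads (mzt top) while the daemon writes.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " id INTEGER PRIMARY KEY, ts REAL, payload TEXT)"
    )
    return conn
```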
Functions¶
close
async
¶
initialize
async
¶
Create tables and indexes if they don't exist.
Source code in src/marianne/daemon/profiler/storage.py
write_snapshot
async
¶
Insert a snapshot and its process metrics into the database.
Returns the snapshot row ID for cross-referencing.
Source code in src/marianne/daemon/profiler/storage.py
write_event
async
¶
Insert a process lifecycle event.
Source code in src/marianne/daemon/profiler/storage.py
read_snapshots
async
¶
Read snapshots since the given unix timestamp.
Returns snapshots in chronological order, most recent last. Process metrics are reconstructed for each snapshot.
Source code in src/marianne/daemon/profiler/storage.py
read_events
async
¶
Read process events since the given unix timestamp.
Source code in src/marianne/daemon/profiler/storage.py
read_process_history
async
¶
Read historical metrics for a specific process.
Source code in src/marianne/daemon/profiler/storage.py
read_job_resource_profile
async
¶
Aggregate resource profile for a specific job.
Returns a dict with peak memory, total CPU-time, process spawn
count, and syscall hotspots — useful for scheduling hints and
mzt diagnose --resources.
Source code in src/marianne/daemon/profiler/storage.py
cleanup
async
¶
Apply retention policy by deleting old data.
- Snapshots + process_metrics older than full_resolution_hours
- Process events older than events_days
Source code in src/marianne/daemon/profiler/storage.py
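A minimal sketch of the retention pass, assuming hypothetical `snapshots` and `events` tables with a `ts` unix-timestamp column (the real table and config names may differ):

```python
import sqlite3
import time

def cleanup(conn: sqlite3.Connection,
            full_resolution_hours: float = 48,
            events_days: float = 14) -> None:
    """Delete data older than the two retention windows described above."""
    now = time.time()
    # Snapshots (and their process metrics) past the full-resolution window.
    conn.execute("DELETE FROM snapshots WHERE ts < ?",
                 (now - full_resolution_hours * 3600,))
    # Process events keep a longer retention, measured in days.
    conn.execute("DELETE FROM events WHERE ts < ?",
                 (now - events_days * 86400,))
    conn.commit()
```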
append_jsonl
¶
Append one NDJSON line for the given snapshot.
Synchronous I/O — callers should wrap in run_in_executor
if strict non-blocking is required. In practice the writes are
small and fast enough for the daemon's collection loop.
Performs size-based rotation: when the file exceeds
jsonl_max_bytes, renames it with a .1 suffix (keeping
at most 2 rotated files) and starts a new file.
Source code in src/marianne/daemon/profiler/storage.py
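The rotation scheme described above can be sketched as follows (a standalone function, not the method's actual signature):

```python
import json
from pathlib import Path

def append_jsonl(path: Path, record: dict,
                 max_bytes: int = 50 * 1024 * 1024) -> None:
    """Append one NDJSON line, rotating when the file exceeds max_bytes.

    Keeps at most two rotated files: .1 shifts to .2 (dropping any
    older .2), the current file becomes .1, and a fresh file starts.
    """
    one = Path(str(path) + ".1")
    two = Path(str(path) + ".2")
    if path.exists() and path.stat().st_size > max_bytes:
        if one.exists():
            one.replace(two)  # shift .1 -> .2, discarding the old .2
        path.replace(one)     # current file becomes .1
    with path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```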
StraceManager
¶
Manages strace attachment to child processes.
Typical lifecycle:
mgr = StraceManager(enabled=True)
await mgr.attach(pid) # spawns ``strace -c -p PID``
... # time passes, child does work
summary = await mgr.detach(pid) # SIGINT -> parse summary
await mgr.detach_all() # cleanup on shutdown
Source code in src/marianne/daemon/profiler/strace_manager.py
Attributes¶
Functions¶
is_available
staticmethod
¶
attach
async
¶
Attach strace -c -p <pid> for syscall summary collection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pid` | `int` | Target process PID to trace. | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | True if strace was successfully spawned, False otherwise. |
Source code in src/marianne/daemon/profiler/strace_manager.py
detach
async
¶
Detach strace from a process and parse the summary output.
Sends SIGINT to the strace process (which causes it to print its
-c summary table to stderr), then parses the output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pid` | `int` | Target process PID to stop tracing. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any] \| None` | Dict with the parsed syscall summary, or None if the pid was not being traced. |
Source code in src/marianne/daemon/profiler/strace_manager.py
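Parsing the `-c` summary table that strace prints on SIGINT can be sketched like this (the function name and returned `{syscall: call_count}` shape are assumptions about the real parser):

```python
def parse_strace_summary(text: str) -> dict[str, int]:
    """Parse an `strace -c` summary table into {syscall: call_count}.

    Data rows look like:  45.07  0.001200  12  100  [errors]  read
    Header, separator, and the trailing "total" row are skipped.
    """
    counts: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[-1] == "total":
            continue
        try:
            float(fields[0])  # data rows start with a numeric %-time column
        except ValueError:
            continue  # header or "------" separator row
        counts[fields[-1]] = int(fields[3])  # calls column
    return counts
```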
attach_full_trace
async
¶
Attach a full strace (strace -f -t -p PID -o file).
This is the on-demand deep-trace triggered by mzt top --trace PID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pid` | `int` | Target process PID. | *required* |
| `output_file` | `Path` | Path to write the full trace output. | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | True if strace was successfully spawned, False otherwise. |
Source code in src/marianne/daemon/profiler/strace_manager.py
detach_all
async
¶
Detach and terminate all strace processes.
Called during daemon shutdown for cleanup.
Source code in src/marianne/daemon/profiler/strace_manager.py
get_strace_pids
¶
Return PIDs of all running strace subprocesses.
Useful for registering with ProcessGroupManager so they get cleaned up on daemon shutdown.
Source code in src/marianne/daemon/profiler/strace_manager.py
Functions¶
generate_resource_report
async
¶
Generate comprehensive resource report for a job.
Produces a text report designed for AI consumption (mzt diagnose
--resources). Aggregates peak memory per sheet, total CPU-time,
process spawn count, signal/kill events, zombie/OOM events, syscall
hotspots, and anomaly history.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `job_id` | `str` | The job ID to generate the report for. | *required* |
| `storage` | `MonitorStorage` | An initialized MonitorStorage instance. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Multi-line text report. Returns a short "no data" message if the job has no profiler data. |
Source code in src/marianne/daemon/profiler/storage.py