pgroup

pgroup ¶

Process group management for orphan prevention.

Addresses issue #38: cascading crashes from orphaned MCP server processes. When the daemon is the process group leader, all child processes (backends, validators, MCP servers) belong to the same pgid, enabling clean group-wide shutdown via os.killpg().

Lifecycle

setup() — create new process group (daemon becomes leader)
cleanup_orphans() — scan daemon's child tree for leaked processes
reap_orphaned_backends() — system-wide scan for leaked backend children
kill_all_children() — shutdown: signal entire group
atexit handler — last-resort cleanup if shutdown is interrupted

The distinction between cleanup_orphans() and reap_orphaned_backends(): cleanup_orphans() walks the daemon's own child tree via psutil. This misses processes that were reparented to init (PID 1) after their parent (e.g. Claude CLI) exited — which is exactly how MCP/LSP server leaks happen. reap_orphaned_backends() scans ALL processes owned by the current user for known orphan patterns.

Classes¶

ProcessGroupManager ¶

ProcessGroupManager()

Manages the daemon's process group to prevent orphans.

The daemon calls setup() early in its lifecycle to become the process group leader. During shutdown, kill_all_children() sends SIGTERM to the entire group, ensuring no child process (including deeply nested MCP servers) survives the daemon.

An atexit handler provides last-resort cleanup even if the normal shutdown path is skipped.

Source code in src/marianne/daemon/pgroup.py

def __init__(self) -> None:
    self._original_pgid: int = os.getpgrp()
    self._is_leader: bool = False
    self._atexit_registered: bool = False
    # PIDs of processes spawned by backends (claude, etc.).
    # Used to identify orphaned children: if a tracked PID is dead
    # but its children survive, those children are orphans.
    self._tracked_backend_pids: set[int] = set()

Attributes¶

is_leader `property` ¶

is_leader

Whether the daemon is the process group leader.

original_pgid `property` ¶

original_pgid

Process group ID before setup() was called.

Functions¶

track_backend_pid ¶

track_backend_pid(pid)

Register a backend process PID for orphan tracking.

When a backend (claude, gemini-cli, etc.) spawns a process, call this with the process PID. On cleanup, any surviving children of dead tracked PIDs are killed as orphans — regardless of what they're called. This replaces cmdline pattern matching with ancestry-based detection.

Source code in src/marianne/daemon/pgroup.py

def track_backend_pid(self, pid: int) -> None:
    """Register a backend process PID for orphan tracking.

    When a backend (claude, gemini-cli, etc.) spawns a process, call
    this with the process PID. On cleanup, any surviving children of
    dead tracked PIDs are killed as orphans — regardless of what
    they're called. This replaces cmdline pattern matching with
    ancestry-based detection.
    """
    self._tracked_backend_pids.add(pid)
    _logger.debug("pgroup.track_backend", pid=pid)

untrack_backend_pid ¶

untrack_backend_pid(pid)

Remove a backend PID from tracking after clean exit.

Source code in src/marianne/daemon/pgroup.py

def untrack_backend_pid(self, pid: int) -> None:
    """Remove a backend PID from tracking after clean exit."""
    self._tracked_backend_pids.discard(pid)
    _logger.debug("pgroup.untrack_backend", pid=pid)

setup ¶

setup()

Create a new process group with the daemon as leader.

Must be called early in daemon startup, before spawning any child processes. Idempotent — safe to call multiple times.

Source code in src/marianne/daemon/pgroup.py

def setup(self) -> None:
    """Create a new process group with the daemon as leader.

    Must be called early in daemon startup, before spawning any
    child processes.  Idempotent — safe to call multiple times.
    """
    if self._is_leader:
        return

    try:
        os.setpgrp()
        self._is_leader = True
        _logger.info(
            "pgroup.setup_complete",
            pid=os.getpid(),
            pgid=os.getpgrp(),
            original_pgid=self._original_pgid,
        )
    except OSError as exc:
        # May fail if already the group leader (e.g. after daemonize)
        if os.getpid() == os.getpgrp():
            self._is_leader = True
            _logger.debug(
                "pgroup.already_leader",
                pid=os.getpid(),
            )
        else:
            _logger.warning(
                "pgroup.setup_failed",
                error=str(exc),
                pid=os.getpid(),
            )

    # Register last-resort atexit cleanup
    if self._is_leader and not self._atexit_registered:
        atexit.register(self._atexit_cleanup)
        self._atexit_registered = True

kill_all_children ¶

kill_all_children(sig=SIGTERM)

Send signal to all processes in our group except ourselves.

Parameters:

Name	Type	Description	Default
`sig`	`int`	Signal number to send (default SIGTERM for graceful stop).	`SIGTERM`

Returns:

Type	Description
`int`	The process group ID that was signaled, or 0 if no signal sent.

Source code in src/marianne/daemon/pgroup.py

def kill_all_children(self, sig: int = signal.SIGTERM) -> int:
    """Send signal to all processes in our group except ourselves.

    Args:
        sig: Signal number to send (default SIGTERM for graceful stop).

    Returns:
        The process group ID that was signaled, or 0 if no signal sent.
    """
    if not self._is_leader:
        _logger.debug("pgroup.not_leader_skip_kill")
        return 0

    pgid = os.getpgrp()
    my_pid = os.getpid()

    # Count children before signaling for logging
    child_count = self._count_group_members(pgid, exclude_pid=my_pid)

    if child_count == 0:
        _logger.debug("pgroup.no_children_to_signal")
        return pgid

    try:
        # Temporarily ignore the signal in our own process so killpg
        # doesn't terminate us along with the children.
        old_handler = signal.signal(sig, signal.SIG_IGN)
        try:
            sent = _safe_killpg(pgid, sig, context="pgroup.kill_all_children")
        finally:
            signal.signal(sig, old_handler)

        if sent:
            _logger.info(
                "pgroup.signaled_children",
                signal=signal.Signals(sig).name,
                pgid=pgid,
                child_count=child_count,
            )
        return pgid
    except ProcessLookupError:
        _logger.debug("pgroup.no_processes_in_group", pgid=pgid)
        return 0
    except PermissionError:
        _logger.warning("pgroup.permission_denied", pgid=pgid)
        return 0

cleanup_orphans ¶

cleanup_orphans()

Find and clean up orphaned child processes in the daemon's tree.

Detects two categories: 1. Zombie children — reaped via waitpid 2. Orphaned MCP servers — processes whose parent has died (reparented to init/PID 1) that still match MCP patterns

Note: This only scans the daemon's own child tree. For processes that escaped the tree entirely (reparented to init), use reap_orphaned_backends() which does a system-wide scan.

Returns:

Type	Description
`list[int]`	List of PIDs that were cleaned up.

Source code in src/marianne/daemon/pgroup.py

def cleanup_orphans(self) -> list[int]:
    """Find and clean up orphaned child processes in the daemon's tree.

    Detects two categories:
    1. Zombie children — reaped via waitpid
    2. Orphaned MCP servers — processes whose parent has died
       (reparented to init/PID 1) that still match MCP patterns

    Note: This only scans the daemon's own child tree.  For processes
    that escaped the tree entirely (reparented to init), use
    reap_orphaned_backends() which does a system-wide scan.

    Returns:
        List of PIDs that were cleaned up.
    """
    orphans: list[int] = []

    # Strategy 1: psutil-based deep scan (preferred)
    try:
        import psutil

        current = psutil.Process(os.getpid())

        for child in current.children(recursive=True):
            try:
                if not child.is_running():
                    continue

                status = child.status()

                # Reap zombies
                if status == psutil.STATUS_ZOMBIE:
                    try:
                        os.waitpid(child.pid, os.WNOHANG)
                    except ChildProcessError:
                        pass
                    orphans.append(child.pid)
                    _logger.debug(
                        "pgroup.reaped_zombie",
                        pid=child.pid,
                    )
                    continue

                # NOTE: Ancestry-based orphan killing disabled.
                # The F-481 rewrite removed cmdline filtering, making
                # this kill ANY child not parented by us when dead
                # backends exist — including Chrome, pytest, pyright,
                # and other legitimate processes in the daemon's tree.
                # The per-job PID tracking in the conductor DB will
                # replace this.  Until then, only zombie reaping above
                # is active.  Orphaned MCP servers accumulate but
                # don't crash the system.

            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

    except ImportError:
        # Strategy 2: /proc fallback (Linux only)
        orphans.extend(self._cleanup_orphans_proc())

    if orphans:
        _logger.info(
            "pgroup.orphan_cleanup",
            cleaned_count=len(orphans),
            pids=orphans,
        )

    return orphans

reap_orphaned_backends ¶

reap_orphaned_backends()

System-wide scan for orphaned backend child processes.

.. warning:: DISABLED — This method is a no-op.

The F-481 rewrite removed cmdline pattern filtering and replaced it with ancestry-only detection (ppid in {0, 1}). Without filtering, this kills EVERY user-owned process parented by init/systemd — including the user's systemd session manager, terminal emulators, and dbus. On WSL2, killing systemd --user cascades into systemd-poweroff.service and shuts down the entire VM (observed 9 times, exit code 9, all terminals dead).

The replacement is per-job PID tracking in the conductor DB (see composer-notes.yaml "PROCESS CLEANUP SIMPLIFICATION"). Until that's implemented, orphaned MCP/LSP servers from dead backends accumulate but don't crash the system.

Returns:

Type	Description
`list[int]`	Empty list (no-op).

Source code in src/marianne/daemon/pgroup.py

def reap_orphaned_backends(self) -> list[int]:
    """System-wide scan for orphaned backend child processes.

    .. warning:: DISABLED — This method is a no-op.

       The F-481 rewrite removed cmdline pattern filtering and replaced
       it with ancestry-only detection (ppid in {0, 1}).  Without
       filtering, this kills EVERY user-owned process parented by
       init/systemd — including the user's systemd session manager,
       terminal emulators, and dbus.  On WSL2, killing ``systemd
       --user`` cascades into ``systemd-poweroff.service`` and shuts
       down the entire VM (observed 9 times, exit code 9, all
       terminals dead).

       The replacement is per-job PID tracking in the conductor DB
       (see composer-notes.yaml "PROCESS CLEANUP SIMPLIFICATION").
       Until that's implemented, orphaned MCP/LSP servers from dead
       backends accumulate but don't crash the system.

    Returns:
        Empty list (no-op).
    """
    # Drain dead PIDs from the tracking set so it doesn't grow
    # unboundedly — but do NOT act on them.
    if self._tracked_backend_pids:
        dead: set[int] = set()
        for bpid in self._tracked_backend_pids:
            try:
                os.kill(bpid, 0)
            except OSError:
                dead.add(bpid)
        if dead:
            self._tracked_backend_pids -= dead
            _logger.debug(
                "pgroup.drained_dead_backend_pids",
                count=len(dead),
                pids=list(dead),
            )
    return []

pgroup

pgroup ¶

Classes¶

ProcessGroupManager ¶

Attributes¶

is_leader property ¶

original_pgid property ¶

Functions¶

track_backend_pid ¶

untrack_backend_pid ¶

setup ¶

kill_all_children ¶

cleanup_orphans ¶

reap_orphaned_backends ¶

Functions¶

is_leader `property` ¶

original_pgid `property` ¶