Skip to content

pgroup

pgroup

Process group management for orphan prevention.

Addresses issue #38: cascading crashes from orphaned MCP server processes. When the daemon is the process group leader, all child processes (backends, validators, MCP servers) belong to the same pgid, enabling clean group-wide shutdown via os.killpg().

Lifecycle
  1. setup() — create new process group (daemon becomes leader)
  2. cleanup_orphans() — scan daemon's child tree for leaked processes
  3. reap_orphaned_backends() — system-wide scan for leaked backend children
  4. kill_all_children() — shutdown: signal entire group
  5. atexit handler — last-resort cleanup if shutdown is interrupted

The distinction between cleanup_orphans() and reap_orphaned_backends(): cleanup_orphans() walks the daemon's own child tree via psutil. This misses processes that were reparented to init (PID 1) after their parent (e.g. Claude CLI) exited — which is exactly how MCP/LSP server leaks happen. reap_orphaned_backends() scans ALL processes owned by the current user for known orphan patterns.

Classes

ProcessGroupManager

ProcessGroupManager()

Manages the daemon's process group to prevent orphans.

The daemon calls setup() early in its lifecycle to become the process group leader. During shutdown, kill_all_children() sends SIGTERM to the entire group, ensuring no child process (including deeply nested MCP servers) survives the daemon.

An atexit handler provides last-resort cleanup even if the normal shutdown path is skipped.

Source code in src/marianne/daemon/pgroup.py
def __init__(self) -> None:
    self._original_pgid: int = os.getpgrp()
    self._is_leader: bool = False
    self._atexit_registered: bool = False
    # PIDs of processes spawned by backends (claude, etc.).
    # Used to identify orphaned children: if a tracked PID is dead
    # but its children survive, those children are orphans.
    self._tracked_backend_pids: set[int] = set()
Attributes
is_leader property
is_leader

Whether the daemon is the process group leader.

original_pgid property
original_pgid

Process group ID before setup() was called.

Functions
track_backend_pid
track_backend_pid(pid)

Register a backend process PID for orphan tracking.

When a backend (claude, gemini-cli, etc.) spawns a process, call this with the process PID. On cleanup, any surviving children of dead tracked PIDs are killed as orphans — regardless of what they're called. This replaces cmdline pattern matching with ancestry-based detection.

Source code in src/marianne/daemon/pgroup.py
def track_backend_pid(self, pid: int) -> None:
    """Register a backend process PID for orphan tracking.

    When a backend (claude, gemini-cli, etc.) spawns a process, call
    this with the process PID. On cleanup, any surviving children of
    dead tracked PIDs are killed as orphans — regardless of what
    they're called. This replaces cmdline pattern matching with
    ancestry-based detection.
    """
    self._tracked_backend_pids.add(pid)
    _logger.debug("pgroup.track_backend", pid=pid)
untrack_backend_pid
untrack_backend_pid(pid)

Remove a backend PID from tracking after clean exit.

Source code in src/marianne/daemon/pgroup.py
def untrack_backend_pid(self, pid: int) -> None:
    """Remove a backend PID from tracking after clean exit."""
    self._tracked_backend_pids.discard(pid)
    _logger.debug("pgroup.untrack_backend", pid=pid)
setup
setup()

Create a new process group with the daemon as leader.

Must be called early in daemon startup, before spawning any child processes. Idempotent — safe to call multiple times.

Source code in src/marianne/daemon/pgroup.py
def setup(self) -> None:
    """Create a new process group with the daemon as leader.

    Must be called early in daemon startup, before spawning any
    child processes.  Idempotent — safe to call multiple times.
    """
    if self._is_leader:
        return

    try:
        os.setpgrp()
        self._is_leader = True
        _logger.info(
            "pgroup.setup_complete",
            pid=os.getpid(),
            pgid=os.getpgrp(),
            original_pgid=self._original_pgid,
        )
    except OSError as exc:
        # May fail if already the group leader (e.g. after daemonize)
        if os.getpid() == os.getpgrp():
            self._is_leader = True
            _logger.debug(
                "pgroup.already_leader",
                pid=os.getpid(),
            )
        else:
            _logger.warning(
                "pgroup.setup_failed",
                error=str(exc),
                pid=os.getpid(),
            )

    # Register last-resort atexit cleanup
    if self._is_leader and not self._atexit_registered:
        atexit.register(self._atexit_cleanup)
        self._atexit_registered = True
kill_all_children
kill_all_children(sig=SIGTERM)

Send signal to all processes in our group except ourselves.

Parameters:

Name Type Description Default
sig int

Signal number to send (default SIGTERM for graceful stop).

SIGTERM

Returns:

Type Description
int

The process group ID that was signaled, or 0 if no signal sent.

Source code in src/marianne/daemon/pgroup.py
def kill_all_children(self, sig: int = signal.SIGTERM) -> int:
    """Send signal to all processes in our group except ourselves.

    Args:
        sig: Signal number to send (default SIGTERM for graceful stop).

    Returns:
        The process group ID that was signaled, or 0 if no signal sent.
    """
    if not self._is_leader:
        _logger.debug("pgroup.not_leader_skip_kill")
        return 0

    pgid = os.getpgrp()
    my_pid = os.getpid()

    # Count children before signaling for logging
    child_count = self._count_group_members(pgid, exclude_pid=my_pid)

    if child_count == 0:
        _logger.debug("pgroup.no_children_to_signal")
        return pgid

    try:
        # Temporarily ignore the signal in our own process so killpg
        # doesn't terminate us along with the children.
        old_handler = signal.signal(sig, signal.SIG_IGN)
        try:
            sent = _safe_killpg(pgid, sig, context="pgroup.kill_all_children")
        finally:
            signal.signal(sig, old_handler)

        if sent:
            _logger.info(
                "pgroup.signaled_children",
                signal=signal.Signals(sig).name,
                pgid=pgid,
                child_count=child_count,
            )
        return pgid
    except ProcessLookupError:
        _logger.debug("pgroup.no_processes_in_group", pgid=pgid)
        return 0
    except PermissionError:
        _logger.warning("pgroup.permission_denied", pgid=pgid)
        return 0
cleanup_orphans
cleanup_orphans()

Find and clean up orphaned child processes in the daemon's tree.

Detects two categories: 1. Zombie children — reaped via waitpid 2. Orphaned MCP servers — processes whose parent has died (reparented to init/PID 1) that still match MCP patterns

Note: This only scans the daemon's own child tree. For processes that escaped the tree entirely (reparented to init), use reap_orphaned_backends() which does a system-wide scan.

Returns:

Type Description
list[int]

List of PIDs that were cleaned up.

Source code in src/marianne/daemon/pgroup.py
def cleanup_orphans(self) -> list[int]:
    """Find and clean up orphaned child processes in the daemon's tree.

    Detects two categories:
    1. Zombie children — reaped via waitpid
    2. Orphaned MCP servers — processes whose parent has died
       (reparented to init/PID 1) that still match MCP patterns

    Note: This only scans the daemon's own child tree.  For processes
    that escaped the tree entirely (reparented to init), use
    reap_orphaned_backends() which does a system-wide scan.

    Returns:
        List of PIDs that were cleaned up.
    """
    orphans: list[int] = []

    # Strategy 1: psutil-based deep scan (preferred)
    try:
        import psutil

        current = psutil.Process(os.getpid())

        for child in current.children(recursive=True):
            try:
                if not child.is_running():
                    continue

                status = child.status()

                # Reap zombies
                if status == psutil.STATUS_ZOMBIE:
                    try:
                        os.waitpid(child.pid, os.WNOHANG)
                    except ChildProcessError:
                        pass
                    orphans.append(child.pid)
                    _logger.debug(
                        "pgroup.reaped_zombie",
                        pid=child.pid,
                    )
                    continue

                # NOTE: Ancestry-based orphan killing disabled.
                # The F-481 rewrite removed cmdline filtering, making
                # this kill ANY child not parented by us when dead
                # backends exist — including Chrome, pytest, pyright,
                # and other legitimate processes in the daemon's tree.
                # The per-job PID tracking in the conductor DB will
                # replace this.  Until then, only zombie reaping above
                # is active.  Orphaned MCP servers accumulate but
                # don't crash the system.

            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

    except ImportError:
        # Strategy 2: /proc fallback (Linux only)
        orphans.extend(self._cleanup_orphans_proc())

    if orphans:
        _logger.info(
            "pgroup.orphan_cleanup",
            cleaned_count=len(orphans),
            pids=orphans,
        )

    return orphans
reap_orphaned_backends
reap_orphaned_backends()

System-wide scan for orphaned backend child processes.

.. warning:: DISABLED — This method is a no-op.

The F-481 rewrite removed cmdline pattern filtering and replaced it with ancestry-only detection (ppid in {0, 1}). Without filtering, this kills EVERY user-owned process parented by init/systemd — including the user's systemd session manager, terminal emulators, and dbus. On WSL2, killing systemd --user cascades into systemd-poweroff.service and shuts down the entire VM (observed 9 times, exit code 9, all terminals dead).

The replacement is per-job PID tracking in the conductor DB (see composer-notes.yaml "PROCESS CLEANUP SIMPLIFICATION"). Until that's implemented, orphaned MCP/LSP servers from dead backends accumulate but don't crash the system.

Returns:

Type Description
list[int]

Empty list (no-op).

Source code in src/marianne/daemon/pgroup.py
def reap_orphaned_backends(self) -> list[int]:
    """System-wide scan for orphaned backend child processes.

    .. warning:: DISABLED — This method is a no-op.

       The F-481 rewrite removed cmdline pattern filtering and replaced
       it with ancestry-only detection (ppid in {0, 1}).  Without
       filtering, this kills EVERY user-owned process parented by
       init/systemd — including the user's systemd session manager,
       terminal emulators, and dbus.  On WSL2, killing ``systemd
       --user`` cascades into ``systemd-poweroff.service`` and shuts
       down the entire VM (observed 9 times, exit code 9, all
       terminals dead).

       The replacement is per-job PID tracking in the conductor DB
       (see composer-notes.yaml "PROCESS CLEANUP SIMPLIFICATION").
       Until that's implemented, orphaned MCP/LSP servers from dead
       backends accumulate but don't crash the system.

    Returns:
        Empty list (no-op).
    """
    # Drain dead PIDs from the tracking set so it doesn't grow
    # unboundedly — but do NOT act on them.
    if self._tracked_backend_pids:
        dead: set[int] = set()
        for bpid in self._tracked_backend_pids:
            try:
                os.kill(bpid, 0)
            except OSError:
                dead.add(bpid)
        if dead:
            self._tracked_backend_pids -= dead
            _logger.debug(
                "pgroup.drained_dead_backend_pids",
                count=len(dead),
                pids=list(dead),
            )
    return []

Functions