gpu_probe

GPU probing for the Marianne daemon profiler.

Provides a GpuProbe class that follows the same "try primary → fallback → graceful None" pattern as SystemProbe, except that the final step returns an empty list rather than None. Priority chain:

  1. pynvml — NVIDIA Management Library Python bindings (fast, no subprocess)
  2. nvidia-smi — shell subprocess fallback
  3. No GPU data — silent skip, returns empty list

All methods are static. _pynvml_available is set once at import time.
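
The priority chain above can be sketched as a small generic helper. This is an illustration only, not the module's actual code: `first_working`, `broken_primary`, and `working_fallback` are hypothetical names, the logger is the standard library's rather than the module's own, and the stand-in probes return plain dicts instead of GpuMetric objects.

```python
import logging
from typing import Callable, Sequence

_logger = logging.getLogger("gpu_probe_sketch")  # hypothetical logger name


def first_working(probes: Sequence[Callable[[], list]]) -> list:
    """Return the result of the first probe that does not raise.

    Falls through to an empty list when every probe fails, so callers
    can treat "no data" as a normal outcome rather than an error.
    """
    for probe in probes:
        try:
            return probe()
        except Exception:
            # Log at debug level and move on to the next probe in the chain.
            _logger.debug("probe_failed: %s", probe.__name__, exc_info=True)
    return []


def broken_primary() -> list:
    # Stand-in for a primary backend whose library or driver is missing.
    raise RuntimeError("driver not loaded")


def working_fallback() -> list:
    # Stand-in for a fallback backend that succeeds.
    return [{"index": 0, "utilization_pct": 12.0}]


result = first_working([broken_primary, working_fallback])
no_gpu = first_working([broken_primary])
```

Here `result` holds the fallback's data and `no_gpu` is an empty list, which mirrors the "silent skip" behavior described in step 3.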

Classes

GpuMetric dataclass

GpuMetric(index, utilization_pct, memory_used_mb, memory_total_mb, temperature_c)

Snapshot of a single GPU's current state.
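
The field types are not shown on this page, so the following is a sketch under assumptions: `index` is taken to be an integer and the remaining fields floats. `GpuMetricSketch` and `memory_used_pct` are hypothetical names introduced only to show how a caller might derive a value from one snapshot.

```python
from dataclasses import dataclass


@dataclass
class GpuMetricSketch:
    """Stand-in with the same field names as GpuMetric (types are assumed)."""

    index: int
    utilization_pct: float
    memory_used_mb: float
    memory_total_mb: float
    temperature_c: float


def memory_used_pct(metric: GpuMetricSketch) -> float:
    """Memory in use as a percentage of total, guarding against a zero total."""
    if metric.memory_total_mb <= 0:
        return 0.0
    return 100.0 * metric.memory_used_mb / metric.memory_total_mb


sample = GpuMetricSketch(
    index=0,
    utilization_pct=37.5,
    memory_used_mb=2048.0,
    memory_total_mb=8192.0,
    temperature_c=61.0,
)
used_pct = memory_used_pct(sample)  # 2048 / 8192 of total memory
```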

GpuProbe

GPU resource probes following the SystemProbe pattern.

Each method tries pynvml first, then falls back to nvidia-smi. The metric methods return an empty list when no GPU is available; callers treat that as "no GPU present", not as an error.

Functions
get_gpu_metrics staticmethod
get_gpu_metrics()

Get current metrics for all GPUs.

Priority
  1. pynvml (fast, in-process)
  2. nvidia-smi subprocess fallback
  3. Empty list (no GPU / no drivers)

Returns:

Type             Description
list[GpuMetric]  List of GpuMetric, one per GPU. Empty if no GPU available.

Source code in src/marianne/daemon/profiler/gpu_probe.py
@staticmethod
def get_gpu_metrics() -> list[GpuMetric]:
    """Get current metrics for all GPUs.

    Priority:
        1. pynvml (fast, in-process)
        2. nvidia-smi subprocess fallback
        3. Empty list (no GPU / no drivers)

    Returns:
        List of GpuMetric, one per GPU.  Empty if no GPU available.
    """
    if _pynvml_available:
        try:
            return GpuProbe._probe_pynvml()
        except Exception:
            _logger.debug("pynvml_probe_failed", exc_info=True)
    # Fallback to nvidia-smi
    try:
        return GpuProbe._probe_nvidia_smi_sync()
    except Exception:
        _logger.debug("nvidia_smi_probe_failed", exc_info=True)
    return []
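
The private `_probe_nvidia_smi_sync` helper is not shown on this page. As a rough sketch of what such a fallback could look like, the code below assumes nvidia-smi is queried with `--query-gpu` and `--format=csv,noheader,nounits`, and that each output line is parsed into one record. The names `NVIDIA_SMI_CMD`, `parse_nvidia_smi_csv`, and `probe_nvidia_smi` are hypothetical, and the records are plain dicts rather than GpuMetric objects.

```python
import subprocess

# Assumed query; the real helper may request different fields.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu",
    "--format=csv,noheader,nounits",
]

FIELDS = (
    "index",
    "utilization_pct",
    "memory_used_mb",
    "memory_total_mb",
    "temperature_c",
)


def parse_nvidia_smi_csv(output: str) -> list[dict]:
    """Parse one CSV line per GPU; skip lines that are malformed."""
    rows = []
    for line in output.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != len(FIELDS):
            continue
        try:
            values = [float(part) for part in parts]
        except ValueError:
            # Non-numeric placeholders such as "[N/A]" make the row unusable here.
            continue
        row = dict(zip(FIELDS, values))
        row["index"] = int(row["index"])
        rows.append(row)
    return rows


def probe_nvidia_smi(timeout: float = 5.0) -> list[dict]:
    """Run nvidia-smi and parse its output; raises if the command fails."""
    completed = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, timeout=timeout, check=True
    )
    return parse_nvidia_smi_csv(completed.stdout)


# Parsing demonstrated on canned text, so no GPU is needed to try it.
sample_output = "0, 12, 1024, 8192, 55\n1, 98, 7900, 8192, 83\n"
parsed = parse_nvidia_smi_csv(sample_output)
```

Dropping a whole row when any field is non-numeric is a simplification chosen for this sketch; a real probe might keep the row and record the missing value instead.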
get_gpu_metrics_async async staticmethod
get_gpu_metrics_async()

Async variant of get_gpu_metrics.

Uses asyncio.create_subprocess_exec for the nvidia-smi fallback so it doesn't block the event loop.

Returns:

Type             Description
list[GpuMetric]  List of GpuMetric, one per GPU. Empty if no GPU available.

Source code in src/marianne/daemon/profiler/gpu_probe.py
@staticmethod
async def get_gpu_metrics_async() -> list[GpuMetric]:
    """Async variant of get_gpu_metrics.

    Uses ``asyncio.create_subprocess_exec`` for the nvidia-smi fallback
    so it doesn't block the event loop.

    Returns:
        List of GpuMetric, one per GPU.  Empty if no GPU available.
    """
    if _pynvml_available:
        try:
            return GpuProbe._probe_pynvml()
        except Exception:
            _logger.debug("pynvml_probe_failed", exc_info=True)
    try:
        return await GpuProbe._probe_nvidia_smi_async()
    except Exception:
        _logger.debug("nvidia_smi_async_probe_failed", exc_info=True)
    return []
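
The non-blocking subprocess call mentioned above can be illustrated as follows. This is a minimal sketch, not the module's `_probe_nvidia_smi_async`: `run_capture` is a hypothetical helper, the timeout is an assumption, and the Python interpreter stands in for nvidia-smi so the example runs on machines without a GPU.

```python
import asyncio
import sys


async def run_capture(*cmd: str, timeout: float = 5.0) -> str:
    """Run a command without blocking the event loop and return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a hung child process behind.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with status {proc.returncode}")
    return stdout.decode()


async def main() -> str:
    # Stand-in command that prints a line shaped like nvidia-smi CSV output.
    return await run_capture(
        sys.executable, "-c", "print('0, 12, 1024, 8192, 55')"
    )


output = asyncio.run(main())
```

While the child process runs, other tasks on the event loop keep making progress, which is the reason for preferring this over a blocking `subprocess.run` call inside a coroutine.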
is_available staticmethod
is_available()

Check whether any GPU probing method is available.

Returns True if pynvml is importable OR nvidia-smi is on PATH.

Source code in src/marianne/daemon/profiler/gpu_probe.py
@staticmethod
def is_available() -> bool:
    """Check whether any GPU probing method is available.

    Returns True if pynvml is importable OR nvidia-smi is on PATH.
    """
    if _pynvml_available:
        return True
    return shutil.which("nvidia-smi") is not None
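
Because `is_available` only checks that a probing method exists, a True result does not guarantee that any metrics come back: pynvml can be importable on a machine with no NVIDIA driver, for example. A caller-side sketch of handling both cases, using a hypothetical `summarize` helper and stand-in inputs rather than the real GpuProbe:

```python
def summarize(available: bool, metrics: list) -> str:
    """Describe the GPU state from an availability flag and a metrics list."""
    if not available:
        return "gpu probing unavailable"
    if not metrics:
        # A backend exists, but it reported no devices (or the probe failed).
        return "no gpu present"
    return f"{len(metrics)} gpu(s) reporting"


unavailable = summarize(False, [])
empty = summarize(True, [])
two_gpus = summarize(True, [{"index": 0}, {"index": 1}])
```

In real use the two arguments would come from `GpuProbe.is_available()` and `GpuProbe.get_gpu_metrics()`.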

Functions