
Docker failed to initialize NVML unknown error

by | Jan 30, 2023

Wondering how to resolve the Docker “Failed to initialize NVML: Unknown Error” issue? Our Docker Support team is here to lend a hand with your queries and issues.

How to resolve the Docker “Failed to initialize NVML: Unknown Error”?

After a random amount of time, the GPUs become unavailable inside all running containers and nvidia-smi returns the error below:

“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.

Today, let us see the methods our Support Techs follow to resolve this Docker error.

Method 1:

1) Kernel parameter
The easiest way to check for the presence of the systemd.unified_cgroup_hierarchy=false parameter is to look at /proc/cmdline:

cat /proc/cmdline

This parameter is set through the boot loader, so /proc/cmdline only reflects what the kernel was booted with. If the parameter is missing, add it to the boot loader's kernel command line and reboot, as in the sketch below.
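A minimal sketch, assuming GRUB is the boot loader (file locations and the existing options in GRUB_CMDLINE_LINUX_DEFAULT will differ per system):

# /etc/default/grub: keep your existing options and append the parameter
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=false"

Then regenerate the GRUB configuration and reboot (on Debian/Ubuntu, update-grub does the same):

sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot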

2) nvidia-container configuration
In the file

/etc/nvidia-container-runtime/config.toml

set the parameter

no-cgroups = false
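If you prefer to make the change from the shell, a sed one-liner such as the following can flip the value, assuming the key is already present and uncommented in config.toml:

sudo sed -i 's/^no-cgroups = true/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml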

After that, restart Docker and run a test container:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Method 2:

Alternatively, you can try to bypass cgroups v2 by setting, in the file mentioned above,

no-cgroups = true

Then you must manually pass all GPU devices to the container.
For debugging purposes, just run:

sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi

When nvidia-container-runtime or nvidia-container-toolkit runs with cgroups enabled, it automatically exposes the required GPU device nodes to the container.

So when you bypass this option, you have to pass the devices yourself. Here is an example.
A single docker run

docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
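The exact set of /dev/nvidia* nodes depends on the driver version and which kernel modules are loaded, so it is worth listing what actually exists on the host before writing the --device flags:

ls -l /dev/nvidia*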

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
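To try the compose file, save it (for example as docker-compose.yml, a file name assumed here) and start the service:

docker compose up

(or docker-compose up on older Compose installations)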

Or you can run Docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run

docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]


You can also use

nvidia-container-runtime

installed, on Arch-based systems, by

yay -S nvidia-container-runtime

which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json (daemon.json is strict JSON, so it cannot contain comments; the data-root entry below is just the author's personal setting and is not required for this fix):

{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

sudo systemctl restart docker

There is no need to edit the Docker systemd unit. When you run your container with the additional runtime option, Docker looks up the runtime in daemon.json.
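You can confirm that Docker picked up the new runtime with:

sudo docker info | grep -i runtimes

The output should list nvidia among the available runtimes.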

For a single docker run, with or without privileged mode, just replace

--gpus all

with

--runtime nvidia
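For example, the earlier privileged one-liner becomes:

docker run --rm --runtime nvidia --privileged nvidia/cuda:11.0-base nvidia-smi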

For a compose file

Without privileged mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]

With privileged mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]

Alternatively, run a privileged container with docker run's --privileged option to get access to the GPU:

docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi

[Need assistance with a different issue? Our team is available 24/7.]

Conclusion

In conclusion, our Support Engineers demonstrated how to resolve the Docker “Failed to initialize NVML: Unknown Error”.
