
Docker failed to initialize NVML unknown error

Jan 30, 2023

Wondering how to resolve the Docker “Failed to initialize NVML: Unknown Error”? Our Docker Support team is here to lend a hand with your queries and issues.

How to resolve the Docker “Failed to initialize NVML: Unknown Error”?

After a random amount of time, the GPUs become unavailable inside all running containers and nvidia-smi returns the following error:

“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.

Today, let us see the methods our Support Techs follow to resolve this Docker error:

Method 1:

1) Kernel parameter
The easiest way to verify that the systemd.unified_cgroup_hierarchy=false parameter is present is to check /proc/cmdline:

cat /proc/cmdline

This parameter is set via the boot loader, so if it is missing, add it to the kernel command line in your boot loader configuration and reboot.
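
As an illustration, here is a minimal sketch of how the parameter could be added on a GRUB-based system (the file path and the regeneration command differ per distribution and boot loader):

# /etc/default/grub (GRUB example; keep your existing parameters in place)
GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=false"

# Regenerate the GRUB configuration and reboot:
sudo update-grub
sudo reboot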

2) nvidia-container configuration
In the file

/etc/nvidia-container-runtime/config.toml

set the parameter

no-cgroups = false
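
For reference, this option usually sits under the [nvidia-container-cli] section of that file; a minimal sketch of the relevant part:

# /etc/nvidia-container-runtime/config.toml (relevant section only)
[nvidia-container-cli]
no-cgroups = false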

After that, restart Docker and run a test container:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Method 2:

Alternatively, you can bypass cgroups v2 by setting the following in the file mentioned above:

no-cgroups = true

Then you must manually pass all GPU devices to the container.
For debugging purposes, just run:

sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
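
To see which device nodes you will have to pass explicitly in the next step, you can list them on the host; the exact set varies with the driver version and the machine:

ls -l /dev/nvidia*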

When nvidia-container-runtime or nvidia-container-toolkit manages cgroups, it automatically makes the GPU device nodes available to the container.

So when you bypass this option, you have to pass the devices yourself. Here is an example.
A single docker run:

docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi

A compose file:

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
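
Assuming the compose file above is saved as docker-compose.yml (the filename is only illustrative), it can be tested with:

docker compose up

The container should print the nvidia-smi table and exit.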

Or you can run Docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run:

docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi

A compose file:

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]


You can also use

nvidia-container-runtime

which can be installed (on Arch Linux, for example) with

yay -S nvidia-container-runtime

This is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json:

{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Note that JSON does not allow comments; the data-root entry above is just a personal setting and can be omitted. The important part is the runtimes entry. Then restart Docker:

sudo systemctl restart docker
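
Optionally, you can confirm that the nvidia runtime is now registered with the daemon; the runtime list printed by docker info should include nvidia:

docker info | grep -i runtime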

There is no need to edit the Docker systemd unit. When you run your container with the additional runtime option, Docker looks up the runtime in daemon.json.

For a single docker run, with or without privileged mode, just replace

--gpus all

with

--runtime nvidia
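
For illustration, a non-privileged run with the nvidia runtime and the same explicit device mappings as in the earlier example might look like this:

docker run --rm --runtime nvidia --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-caps --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi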

For a compose file:

Without privileged mode:

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]

With privileged mode:

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]

You can also run a privileged container with docker run's --privileged option to have access to the GPU:

docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi

[Need assistance with a different issue? Our team is available 24/7.]

Conclusion

In conclusion, our Support Engineers demonstrated how to resolve the Docker “Failed to initialize NVML: Unknown Error”.
