Wondering how to resolve the Docker “Failed to initialize NVML: Unknown Error” issue? Our Docker Support team is here to lend a hand with your queries and issues.
How to resolve docker failed to initialize NVML unknown error?
After a random amount of time, the GPUs become unavailable inside all running containers, and nvidia-smi returns the error below:
“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.
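To confirm which containers are affected, you can run nvidia-smi inside each running container; the container name below is only an example:
docker exec -it my-gpu-container nvidia-smi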
Today, let us see the methods our Support Techs follow to resolve this Docker error:
Method 1:
1) Kernel parameter
The easiest way to check whether the systemd.unified_cgroup_hierarchy=false parameter is present is to look at /proc/cmdline:
cat /proc/cmdline
This parameter is set on the kernel command line by the boot loader, so if it is missing, add it there and reboot.
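For example, assuming the GRUB boot loader (adjust the steps for your distribution and boot loader), you could append the parameter in /etc/default/grub and regenerate the configuration:
# /etc/default/grub: append the parameter to your existing options
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> systemd.unified_cgroup_hierarchy=false"
# Regenerate the GRUB configuration and reboot
# (on Debian/Ubuntu, sudo update-grub does the same)
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot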
2) nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
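For reference, the setting lives in the [nvidia-container-cli] section of that file; this is only a trimmed excerpt and other settings are omitted:
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
no-cgroups = false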
After that, restart Docker and run a test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Method 2:
Alternatively, you can try to bypass cgroups v2 by setting the following in the file mentioned above:
no-cgroups = true
Then you must manually pass all GPU devices to the container.
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
When nvidia-container-runtime or nvidia-container-toolkit manages cgroups, it automatically exposes the required GPU devices to the container.
So when you bypass this option, you have to pass the devices yourself. Here is an example.
A single docker run:
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
Or you can run the container in privileged mode, which allows it to access all devices on your machine.
A single docker run:
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
privileged: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
You can also use
nvidia-container-runtime
which is another option alongside nvidia-container-toolkit. On Arch Linux, you can install it from the AUR with:
yay -S nvidia-container-runtime
Edit your /etc/docker/daemon.json to register the nvidia runtime (keep any existing settings you already have, such as a custom data-root; note that JSON does not allow comments):
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl restart docker
There is no need to edit the Docker systemd unit. When you run a container with the additional runtime option, Docker looks up the runtime in daemon.json.
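You can verify that Docker picked up the new runtime with, for example:
docker info | grep -i runtimes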
For a single docker run, with or without privileged mode, just replace
--gpus all
with
--runtime nvidia
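For example, the earlier privileged test command would become (a sketch using the same image as above):
sudo docker run --rm --runtime nvidia --privileged nvidia/cuda:11.0-base nvidia-smi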
For a compose file:
Without privileged mode:
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
entrypoint: ["nvidia-smi"]
With privileged mode:
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
privileged: true
entrypoint: ["nvidia-smi"]
Alternatively, you can run a privileged container with docker run’s --privileged option to get access to the GPU:
docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
Conclusion
In conclusion, our Support Engineers demonstrated how to resolve the Docker “Failed to initialize NVML: Unknown Error” issue.