Wondering how to resolve the Docker “Failed to initialize NVML: Unknown Error” issue? Our Docker Support team is here to lend a hand with your queries and issues.
How to resolve docker failed to initialize NVML unknown error?
After a random amount of time, the GPUs become unavailable inside all running containers, and nvidia-smi returns the error below:
“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.
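To confirm which containers are affected, you can run nvidia-smi inside each running container; the container name below is only an example:
docker exec -it my-gpu-container nvidia-smi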
Today, let us see the methods our Support Techs follow to resolve this Docker error:
Method 1:
1) Kernel parameter
The easiest way to check whether the systemd.unified_cgroup_hierarchy=false parameter is present is to look at /proc/cmdline:
cat /proc/cmdline
This parameter is set on the kernel command line by the boot loader, so if it is missing, add it there and reboot.
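For example, assuming the GRUB boot loader (adjust the steps for your distribution and boot loader), you could append the parameter in /etc/default/grub and regenerate the configuration:
# /etc/default/grub: append the parameter to your existing options
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> systemd.unified_cgroup_hierarchy=false"
# Regenerate the GRUB configuration and reboot
# (on Debian/Ubuntu, sudo update-grub does the same)
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot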
2) nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
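For reference, the setting lives in the [nvidia-container-cli] section of that file; this is only a trimmed excerpt and other settings are omitted:
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
no-cgroups = false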
After that, restart Docker and run a test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Method 2:
Alternatively, you can try to bypass cgroups v2 by setting the following in the file mentioned above:
no-cgroups = true
Then you must manually pass all GPU devices to the container.
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
When nvidia-container-runtime or nvidia-container-toolkit manages cgroups, it automatically exposes the required GPU devices to the container.
So when you bypass this option, you have to pass the devices yourself. Here is an example.
A single docker run:
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
Or you can run the container in privileged mode, which allows it to access all devices on your machine.
A single docker run:
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
privileged: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
You can also use
nvidia-container-runtime
which is another option alongside nvidia-container-toolkit. On Arch Linux, you can install it from the AUR with:
yay -S nvidia-container-runtime
Edit your /etc/docker/daemon.json to register the nvidia runtime (keep any existing settings you already have, such as a custom data-root; note that JSON does not allow comments):
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl restart docker
There is no need to edit the Docker systemd unit. When you run a container with the additional runtime option, Docker looks up the runtime in daemon.json.
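You can verify that Docker picked up the new runtime with, for example:
docker info | grep -i runtimes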
For a single docker run, with or without privileged mode, just replace
--gpus all
with
--runtime nvidia
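For example, the earlier privileged test command would become (a sketch using the same image as above):
sudo docker run --rm --runtime nvidia --privileged nvidia/cuda:11.0-base nvidia-smi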
For a compose file:
Without privileged mode:
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
entrypoint: ["nvidia-smi"]
With privileged mode:
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
privileged: true
entrypoint: ["nvidia-smi"]
Alternatively, you can run a privileged container with docker run’s --privileged option to get access to the GPU:
docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
Conclusion
In conclusion, our Support Engineers demonstrated how to resolve the Docker “Failed to initialize NVML: Unknown Error” issue.