Wondering how to resolve the Docker "Failed to initialize NVML: Unknown Error" issue? Our Docker Support team is here to lend a hand with your queries and issues.
How to resolve the Docker "Failed to initialize NVML: Unknown Error" issue?
After a random amount of time, the GPUs become unavailable inside all running containers and nvidia-smi returns the error below:
“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.
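The symptom can be confirmed from inside any running GPU container; for example (the container name here is just a placeholder):
# Replace my-gpu-container with one of your running GPU containers
sudo docker exec my-gpu-container nvidia-smi
# On an affected host this prints: Failed to initialize NVML: Unknown Error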
Today, let us see the methods followed by our Support Techs to resolve this Docker error:
Method 1:
1) Kernel parameter
The easiest way to ensure the systemd.unified_cgroup_hierarchy=false parameter is present is to check /proc/cmdline:
cat /proc/cmdline
The parameter is normally set through the boot loader, so /proc/cmdline only reflects what the kernel was actually booted with. If the parameter is missing, add it to the kernel command line and reboot.
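As a rough sketch, on a GRUB-based system the parameter can be added like this (the file locations and regenerate command vary by distribution):
# Append the parameter to the kernel command line in /etc/default/grub
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=false /' /etc/default/grub
# Regenerate the GRUB configuration (use "sudo update-grub" on Debian/Ubuntu), then reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot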
2) nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
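One quick way to apply and verify the setting from the shell (editing the file by hand works just as well):
# Set no-cgroups = false, whether or not the line is currently commented out
sudo sed -i 's/^#\?no-cgroups = .*/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
grep no-cgroups /etc/nvidia-container-runtime/config.toml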
After that, restart Docker and run a test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Method 2
Alternatively, you can try to bypass cgroups v2 by setting the following in the file mentioned above:
no-cgroups = true
Then you must manually pass all GPU devices to the container.
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
When nvidia-container-runtime or nvidia-container-toolkit is used with the cgroup option enabled, it automatically exposes the required GPU device nodes to the container.
So when you bypass this option, you have to pass the devices yourself. Here is an example.
A single docker run
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
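The exact set of device nodes depends on the driver version and GPU count, so check the host before writing them into a compose file:
# List the NVIDIA device nodes present on the host
ls -l /dev/nvidia*
# Capability devices live in a separate directory on newer drivers
ls -l /dev/nvidia-caps/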
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
Copy Code
Or you can run Docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
privileged: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
Copy Code
———————————————————————–
You can also use
nvidia-container-runtime
which is another option alongside nvidia-container-toolkit. On Arch Linux it can be installed with:
yay -S nvidia-container-runtime
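On Debian or Ubuntu the same package can be installed from NVIDIA's container repository, assuming that repository is already configured:
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime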
Edit your /etc/docker/daemon.json (the data-root entry below is specific to this example setup and can be omitted):
{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
sudo systemctl restart docker
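To confirm that Docker picked up the new runtime, list the registered runtimes:
# "nvidia" should appear in the Runtimes line
docker info | grep -i runtime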
There is no need to edit the Docker systemd unit. When you run your container with the additional runtime option, Docker looks the runtime up in daemon.json.
For a single docker run, with or without privileged mode, just replace
--gpus all
with
--runtime nvidia
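For example, the debugging command from Method 2 becomes (a sketch; the device or privileged flags are still needed while no-cgroups = true):
sudo docker run --rm --runtime nvidia --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi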
For a compose file
Without privileged mode
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
entrypoint: ["nvidia-smi"]
Copy Code
With privileged mode
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
privileged: true
entrypoint: ["nvidia-smi"]
Copy Code
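Either compose file can then be started in the usual way, assuming it is saved as docker-compose.yml in the current directory:
docker compose up
# or, with the standalone binary
docker-compose up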
Alternatively, you can run a privileged container with docker run's --privileged option to get access to the GPU:
docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
[Need assistance with a different issue? Our team is available 24/7.]
Conclusion
In conclusion, our Support Engineers demonstrated how to resolve the Docker "Failed to initialize NVML: Unknown Error" issue.