After using ollama to run local LLMs on my computer, I recently switched to llama.cpp. That was because ollama, when running a model, would use a lot of CPU power despite the entire model sitting 100% on the GPU. The whole laptop sounded like a plane taking off anytime it tried to produce anything coherent. llama.cpp is faster (it doesn't use much CPU) and uses Vulkan as the inference backend, so it's also compatible with the Intel integrated GPU (although you can compile it with CUDA in mind).
I tested llama.cpp using various backends:
| Backend | Tokens/s |
|---|---|
| CPU only (i7-9750H) | 5 |
| Intel UHD Graphics 630 (Vulkan) | 2 |
| Nvidia GeForce GTX 1660 Ti (Vulkan) | 32 |
| Nvidia GeForce GTX 1660 Ti (CUDA) | 37 |
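Numbers like these can be measured with llama-bench, which ships with llama.cpp (the model path below is a placeholder):

```sh
# -ngl 99 offloads all layers to the GPU; drop it for the CPU-only run.
./build/bin/llama-bench -m models/model.gguf -ngl 99
```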
Looks like CUDA surpassed Vulkan. But good luck setting that up in Docker, when the development image weighs over 9GB, the CUDA version in the container has to exactly match the CUDA version nvidia-smi reports, AND you have to compile for 20 minutes with -j6 enabled. Just stick to Vulkan.
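For comparison, the two builds look roughly like this, assuming current llama.cpp CMake flags:

```sh
# Vulkan backend: works across GPU vendors, small image.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j6

# CUDA backend: needs the matching CUDA toolkit inside the image.
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j6
```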
While toying with an OCR (optical character recognition) model, LightOnOCR, I came up with an idea. What if I could do my homework on paper, then have a model transcribe it into text on the computer? But taking pictures and passing them through a chat window is a lot of work. Maybe there should be a dedicated page that does this one thing. So I got to work on my idea, ocr-web. I whipped up a Dockerfile from ubuntu and got llama-cpp-python to download a model in the container. If I ran llama.cpp at this point, it would only use the CPU.
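As a rough sketch of that Dockerfile (the real one had more in it; I'm assuming the usual build dependencies here):

```dockerfile
FROM ubuntu:24.04
# llama-cpp-python compiles llama.cpp from source at install time,
# so it needs a compiler and CMake. This builds CPU-only by default.
RUN apt-get update && apt-get install -y \
    python3 python3-pip build-essential cmake \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --break-system-packages llama-cpp-python
```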
Following online guides on how to get an Nvidia GPU working with Docker, I installed the Container Toolkit from Nvidia and ran the demo:

```sh
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

Success. It printed the graphics card and its utilization percentage.
Next, I went to get it working with Docker Compose. I followed this guide, but it was out of date in the capabilities part. Still, I used it at the beginning, and llama.cpp couldn't find any GPU. If I put

```yaml
devices:
  - /dev/dri:/dev/dri
```

in, llama.cpp would find the Intel GPU. The Nvidia Container Toolkit was installed and nvidia-smi found the GPU, but llama.cpp couldn't find the Nvidia GPU. This was after I had already installed libvulkan1.
I installed vulkan-tools and used vulkaninfo | vi - to find out what happened. It only found llvmpipe as GPU0, which is software-based Vulkan, but no Nvidia. ChatGPT suggested adding runtime: nvidia to the Compose yaml. Previously, after running the Docker Compose yaml, I would recreate the container with that runtime in its settings. Now Docker Compose would deploy with the nvidia runtime in mind, but it still didn't work.
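At this point the relevant part of my Compose file looked roughly like this (service and image names are placeholders):

```yaml
services:
  ocr-web:
    image: ocr-web
    runtime: nvidia
    devices:
      - /dev/dri:/dev/dri
```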
There was a part of the Container Toolkit's docs I hadn't read: Specialized Configurations with Docker. That page showed many things missing from the Docker page. capabilities now has to include graphics to use Vulkan. I tried capabilities: all, but that just got an error from Docker Compose. I also added NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=all. Still not working, as llama.cpp refused to see the dedicated (Nvidia) GPU.
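The service then looked roughly like this, following the Specialized Configurations page (still not enough on its own):

```yaml
services:
  ocr-web:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              # graphics is needed for Vulkan; capabilities: all is rejected
              capabilities: [gpu, graphics]
```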
Another suggestion from ChatGPT was using vulkaninfo | grep "GPU id". So I did, and got this result:
```
ERROR: [Loader Message] Code 0 : libXext.so.6: cannot open shared object file: No such file or directory
ERROR: [Loader Message] Code 0 : loader_icd_scan: Failed loading library associated with ICD JSON libGLX_nvidia.so.0. Ignoring this JSON
ERROR: [Loader Message] Code 0 : libXext.so.6: cannot open shared object file: No such file or directory
ERROR: [Loader Message] Code 0 : loader_icd_scan: Failed loading library associated with ICD JSON libGLX_nvidia.so.0. Ignoring this JSON
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR is invalid or not set in the environment.
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
```
For libXext.so.6, the missing package is libxext6. I added that library through the Docker Compose setup. After that, this was the new result:
```
ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR is invalid or not set in the environment.
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
GPU id = 0 (llvmpipe (LLVM 20.1.2, 256 bits))
```
So, almost there. Just a bit more. For this error, ChatGPT suggested looking for /usr/share/vulkan/icd.d/nvidia_icd.json. And for some stupid reason, it didn't exist in the container, so I had to map it in from the host in read-only mode (see the snippet below). Of course, that didn't work. This time, ChatGPT said the file referenced OpenGL, not Vulkan, and kept calling out libnvidia-vulkan.so.1, even after I combed through my host machine looking for it. That file didn't exist there, yet the host ran Vulkan fine. Typical hallucination.
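That bind mount looked like this (the host path is where the driver package puts the ICD manifest on my machine):

```yaml
volumes:
  # Map the Vulkan ICD manifest from the host, read-only.
  - /usr/share/vulkan/icd.d/nvidia_icd.json:/usr/share/vulkan/icd.d/nvidia_icd.json:ro
```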
Off to Google. Searching for libnvidia-vulkan.so.1, I found this article on Stack Overflow. It led to this Github issue. Seems to be a common problem, but there was no solution. Undaunted, I searched once more, this time for docker ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0. That turned up a Github issue about this exact problem. We were once again getting somewhere: the container was missing the libegl1 library. After installing that library, perfecto. The container recognized the Nvidia GPU:
```
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR is invalid or not set in the environment.
GPU id = 0 (NVIDIA GeForce GTX 1660 Ti)
GPU id = 1 (NVIDIA GeForce GTX 1660 Ti)
GPU id = 2 (llvmpipe (LLVM 20.1.2, 256 bits))
```
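For the record, the full set of packages the container ended up needing for Vulkan on the Nvidia GPU, as a Dockerfile sketch:

```dockerfile
# libvulkan1: the Vulkan loader; libxext6 and libegl1: the X/EGL
# libraries the Nvidia ICD (libGLX_nvidia.so.0) links against.
RUN apt-get update && apt-get install -y \
    libvulkan1 libxext6 libegl1 \
    && rm -rf /var/lib/apt/lists/*
```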
Now why did the container need EGL? According to an Nvidia forum moderator:

> For any kind of rendering, even headless or off-screen rendering, you will need to have a render context, which is provided for example by EGL, not GLX. GLX are just GL extensions to be able to use X11 at all.
>
> - MarkusHoHo
Ah. I should have paid more attention to my Graphics class.
While testing llama-cli inside Docker, I ran into another problem. It turns out llama.cpp uses OpenMP, even with Vulkan as the inference backend. ldd libggml-cpu-haswell.so revealed that it needs libgomp.so.1, which comes from the libgomp1 package, as in this issue.
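The check and fix, roughly (run from the directory containing the llama.cpp shared libraries):

```sh
# Show which shared libraries are unresolved, then install the OpenMP runtime.
ldd libggml-cpu-haswell.so | grep "not found"
apt-get update && apt-get install -y libgomp1
```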
In summary:
- ChatGPT is only your first diagnostic tool. It alone is not enough; you still need Stack Overflow, Github issues, and Google too. But LLMs are good for brainstorming. Trust, but verify.
- This entire wild goose chase would never have happened if libegl1 were mentioned at all in Nvidia's official guide. Stupid. With an Intel/AMD GPU, I just imported /dev/dri + libvulkan1 = worked deliciously.
- Now I know why people hate Nvidia's closed-source driver.