Build ggml-sycl from upstream llama.cpp (commit a5bb8ba4, matching ollama's
vendored ggml) using Intel oneAPI 2025.1.1 in a multi-stage Docker build.
Patch two ollama-specific API divergences via patch-sycl.py: added batch_size
parameter to graph_compute, removed GGML_TENSOR_FLAG_COMPUTE skip-check that
caused all compute nodes to be bypassed.
Tested: gemma3:1b — 27/27 layers on GPU, 10.2 tok/s gen, 65.3 tok/s prompt eval.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the IPEX-LLM portable zip (bundling a patched ollama 0.9.3 with SYCL)
with the official ollama 0.15.6 release using the Vulkan backend for Intel GPU
acceleration. The official ollama project does not ship a SYCL backend; Vulkan
is their supported path for Intel GPUs.
- Use official ollama binary with Vulkan runner (OLLAMA_VULKAN=1)
- Strip CUDA/MLX runners from image to save space
- Add mesa-vulkan-drivers for Intel ANV Vulkan ICD
- Remove all IPEX-LLM env vars and wrapper scripts
- Simplify entrypoint to /usr/bin/ollama serve directly
- Clean up docker-compose.yml: remove IPEX build args and env vars
Tested: Intel Arc Graphics (MTL) detected, 17/17 layers offloaded to Vulkan0
Co-authored-by: Cursor <cursoragent@cursor.com>
Revised `IPEXLLM_RELEASE_REPO` value and adjusted file and path references for consistency. Updated `docker-compose.yml` with refined environment variables, device mapping, restart policies, and added necessary port bindings for better functionality and maintainability.