ARC A730M (PI_ERROR_BUILD_PROGRAM_FAILURE) #12765

Open

cyear opened this issue Feb 3, 2025 · 12 comments

Comments

@cyear

cyear commented Feb 3, 2025

Following the documentation "Run Ollama with IPEX-LLM on Intel GPU":

export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1

source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# [optional] if you want to run on a single GPU, use the command below to limit the GPU, which may improve performance
export ONEAPI_DEVICE_SELECTOR=level_zero:0

./ollama serve

:: initializing oneAPI environment ...
   zsh: ZSH_VERSION = 5.9
   args: Using "$@" for setvars.sh arguments: advisor=latest ccl=latest compiler=latest dal=latest debugger=latest dev-utilities=latest dnnl=latest dpcpp-ct=latest dpl=latest ipp=latest ippcp=latest mkl=latest mpi=latest tbb=latest vtune=latest
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
 
2025/02/03 16:24:18 routes.go:1194: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/nian/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:localhost,127.0.0.1]"
time=2025-02-03T16:24:18.303+08:00 level=INFO source=images.go:753 msg="total blobs: 11"
time=2025-02-03T16:24:18.303+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-03T16:24:18.304+08:00 level=INFO source=routes.go:1245 msg="Listening on 127.0.0.1:11434 (version 0.5.1-ipexllm-20250123)"
time=2025-02-03T16:24:18.304+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3579327106/runners
time=2025-02-03T16:24:18.396+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/02/03 - 16:24:23 | 200 |      55.274µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/02/03 - 16:24:23 | 200 |   33.486176ms |       127.0.0.1 | POST     "/api/show"
time=2025-02-03T16:24:23.332+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-02-03T16:24:23.332+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:24:23.335+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:24:23.335+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:24:23.347+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:24:23.411+08:00 level=INFO source=server.go:105 msg="system memory" total="30.9 GiB" free="13.0 GiB" free_swap="13.3 GiB"
time=2025-02-03T16:24:23.412+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[13.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.0 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[6.0 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2025-02-03T16:24:23.416+08:00 level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama3579327106/runners/ipex_llm/ollama_llama_server --model /home/nian/.ollama/models/blobs/sha256-6340dc3229b0d08ea9cc49b75d4098702983e17b4c096d57afbbf2ffc813f2be --ctx-size 8192 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 4 --port 33109"
time=2025-02-03T16:24:23.416+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-03T16:24:23.416+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-02-03T16:24:23.417+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-02-03T16:24:23.515+08:00 level=INFO source=runner.go:963 msg="starting go runner"
time=2025-02-03T16:24:23.516+08:00 level=INFO source=runner.go:964 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=6
time=2025-02-03T16:24:23.516+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:33109"
llama_model_loader: loaded meta data with 28 key-value pairs and 292 tensors from /home/nian/.ollama/models/blobs/sha256-6340dc3229b0d08ea9cc49b75d4098702983e17b4c096d57afbbf2ffc813f2be (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                       llama.context_length u32              = 131072
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   8:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  15:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-02-03T16:24:23.668+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW) 
llm_load_print_meta: general.name     = DeepSeek R1 Distill Llama 8B
llm_load_print_meta: BOS token        = 128000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
Abort was called at 1073 line in file:
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
/var/tmp/portage/dev-libs/intel-compute-runtime-24.35.30872.32/work/compute-runtime-24.35.30872.32/shared/source/os_interface/linux/drm_neo.cpp
SIGABRT: abort
PC=0x7f6de04a5bcc m=5 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 19 gp=0xc000186540 m=5 mp=0xc000100008 [syscall]:
runtime.cgocall(0x55cab7f0fab0, 0xc000094b90)
	runtime/cgocall.go:157 +0x4b fp=0xc000094b68 sp=0xc000094b30 pc=0x55cab7c904eb
ollama/llama/llamafile._Cfunc_llama_load_model_from_file(0x7f6d78000b70, {0x3e7, 0x1, 0x0, 0x0, 0x0, 0x55cab7f0f4a0, 0xc0001a4258, 0x0, 0x0, ...})
	_cgo_gotypes.go:692 +0x50 fp=0xc000094b90 sp=0xc000094b68 pc=0x55cab7d8e390
ollama/llama/llamafile.LoadModelFromFile.func1({0x7ffe75b905fb?, 0x0?}, {0x3e7, 0x1, 0x0, 0x0, 0x0, 0x55cab7f0f4a0, 0xc0001a4258, 0x0, ...})
	ollama/llama/llamafile/llama.go:228 +0xfa fp=0xc000094c78 sp=0xc000094b90 pc=0x55cab7d90c9a
ollama/llama/llamafile.LoadModelFromFile({0x7ffe75b905fb, 0x67}, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc000192160, ...})
	ollama/llama/llamafile/llama.go:228 +0x2d5 fp=0xc000094db8 sp=0xc000094c78 pc=0x55cab7d909d5
main.(*Server).loadModel(0xc0001c2120, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc000192160, 0x0}, ...)
	ollama/llama/runner/runner.go:861 +0xc5 fp=0xc000094f10 sp=0xc000094db8 pc=0x55cab7f0cfe5
main.main.gowrap1()
	ollama/llama/runner/runner.go:997 +0xda fp=0xc000094fe0 sp=0xc000094f10 pc=0x55cab7f0ea1a
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000094fe8 sp=0xc000094fe0 pc=0x55cab7cf8f01
created by main.main in goroutine 1
	ollama/llama/runner/runner.go:997 +0xc6c

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0xc00004ca08?, 0x0?, 0xc0?, 0x61?, 0xc000041898?)
	runtime/proc.go:402 +0xce fp=0xc000041860 sp=0xc000041840 pc=0x55cab7cc712e
runtime.netpollblock(0xc0000418f8?, 0xb7c8fc46?, 0xca?)
	runtime/netpoll.go:573 +0xf7 fp=0xc000041898 sp=0xc000041860 pc=0x55cab7cbf377
internal/poll.runtime_pollWait(0x7f6de0c99770, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0000418b8 sp=0xc000041898 pc=0x55cab7cf3bc5
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0000418e0 sp=0xc0000418b8 pc=0x55cab7d43ae7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0001fe080)
	internal/poll/fd_unix.go:611 +0x2ac fp=0xc000041988 sp=0xc0000418e0 pc=0x55cab7d44fac
net.(*netFD).accept(0xc0001fe080)
	net/fd_unix.go:172 +0x29 fp=0xc000041a40 sp=0xc000041988 pc=0x55cab7db3bc9
net.(*TCPListener).accept(0xc0001c41c0)
	net/tcpsock_posix.go:159 +0x1e fp=0xc000041a68 sp=0xc000041a40 pc=0x55cab7dc48fe
net.(*TCPListener).Accept(0xc0001c41c0)
	net/tcpsock.go:327 +0x30 fp=0xc000041a98 sp=0xc000041a68 pc=0x55cab7dc3c50
net/http.(*onceCloseListener).Accept(0xc0001c21b0?)
	<autogenerated>:1 +0x24 fp=0xc000041ab0 sp=0xc000041a98 pc=0x55cab7eeae64
net/http.(*Server).Serve(0xc00021a000, {0x55cab8216540, 0xc0001c41c0})
	net/http/server.go:3260 +0x33e fp=0xc000041be0 sp=0xc000041ab0 pc=0x55cab7ee1c7e
main.main()
	ollama/llama/runner/runner.go:1022 +0x10cd fp=0xc000041f50 sp=0xc000041be0 pc=0x55cab7f0e68d
runtime.main()
	runtime/proc.go:271 +0x29d fp=0xc000041fe0 sp=0xc000041f50 pc=0x55cab7cc6cfd
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000041fe8 sp=0xc000041fe0 pc=0x55cab7cf8f01

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000084fa8 sp=0xc000084f88 pc=0x55cab7cc712e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.forcegchelper()
	runtime/proc.go:326 +0xb8 fp=0xc000084fe0 sp=0xc000084fa8 pc=0x55cab7cc6fb8
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x55cab7cf8f01
created by runtime.init.6 in goroutine 1
	runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000085780 sp=0xc000085760 pc=0x55cab7cc712e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.bgsweep(0xc000038070)
	runtime/mgcsweep.go:278 +0x94 fp=0xc0000857c8 sp=0xc000085780 pc=0x55cab7cb1c74
runtime.gcenable.gowrap1()
	runtime/mgc.go:203 +0x25 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x55cab7ca67a5
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x55cab7cf8f01
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000038070?, 0x55cab7f8e308?, 0x1?, 0x0?, 0xc000007340?)
	runtime/proc.go:402 +0xce fp=0xc000085f78 sp=0xc000085f58 pc=0x55cab7cc712e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.(*scavengerState).park(0x55cab83e0680)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000085fa8 sp=0xc000085f78 pc=0x55cab7caf669
runtime.bgscavenge(0xc000038070)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc000085fc8 sp=0xc000085fa8 pc=0x55cab7cafbfc
runtime.gcenable.gowrap2()
	runtime/mgc.go:204 +0x25 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x55cab7ca6745
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x55cab7cf8f01
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc000186380 m=nil [finalizer wait]:
runtime.gopark(0xc000084648?, 0x55cab7c9a0a5?, 0xa8?, 0x1?, 0x55cab82107a0?)
	runtime/proc.go:402 +0xce fp=0xc000084620 sp=0xc000084600 pc=0x55cab7cc712e
runtime.runfinq()
	runtime/mfinal.go:194 +0x107 fp=0xc0000847e0 sp=0xc000084620 pc=0x55cab7ca57e7
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x55cab7cf8f01
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:164 +0x3d

goroutine 20 gp=0xc000186700 m=nil [semacquire]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x60?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000080e08 sp=0xc000080de8 pc=0x55cab7cc712e
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.semacquire1(0xc0001c2128, 0x0, 0x1, 0x0, 0x12)
	runtime/sema.go:160 +0x22c fp=0xc000080e70 sp=0xc000080e08 pc=0x55cab7cd954c
sync.runtime_Semacquire(0x0?)
	runtime/sema.go:62 +0x25 fp=0xc000080ea8 sp=0xc000080e70 pc=0x55cab7cf5385
sync.(*WaitGroup).Wait(0x0?)
	sync/waitgroup.go:116 +0x48 fp=0xc000080ed0 sp=0xc000080ea8 pc=0x55cab7d13e08
main.(*Server).run(0xc0001c2120, {0x55cab8216b80, 0xc0001800a0})
	ollama/llama/runner/runner.go:315 +0x47 fp=0xc000080fb8 sp=0xc000080ed0 pc=0x55cab7f096a7
main.main.gowrap2()
	ollama/llama/runner/runner.go:1002 +0x28 fp=0xc000080fe0 sp=0xc000080fb8 pc=0x55cab7f0e908
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x55cab7cf8f01
created by main.main in goroutine 1
	ollama/llama/runner/runner.go:1002 +0xd3e

goroutine 21 gp=0xc0001868c0 m=nil [IO wait]:
runtime.gopark(0x94?, 0xc0001ef958?, 0x40?, 0xf9?, 0xb?)
	runtime/proc.go:402 +0xce fp=0xc0001ef910 sp=0xc0001ef8f0 pc=0x55cab7cc712e
runtime.netpollblock(0x55cab7d2d678?, 0xb7c8fc46?, 0xca?)
	runtime/netpoll.go:573 +0xf7 fp=0xc0001ef948 sp=0xc0001ef910 pc=0x55cab7cbf377
internal/poll.runtime_pollWait(0x7f6de0c99678, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0001ef968 sp=0xc0001ef948 pc=0x55cab7cf3bc5
internal/poll.(*pollDesc).wait(0xc0001fe100?, 0xc000296000?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001ef990 sp=0xc0001ef968 pc=0x55cab7d43ae7
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001fe100, {0xc000296000, 0x1000, 0x1000})
	internal/poll/fd_unix.go:164 +0x27a fp=0xc0001efa28 sp=0xc0001ef990 pc=0x55cab7d4463a
net.(*netFD).Read(0xc0001fe100, {0xc000296000?, 0xc0001efa98?, 0x55cab7d43fa5?})
	net/fd_posix.go:55 +0x25 fp=0xc0001efa70 sp=0xc0001efa28 pc=0x55cab7db2ac5
net.(*conn).Read(0xc000194090, {0xc000296000?, 0x0?, 0xc00028a038?})
	net/net.go:185 +0x45 fp=0xc0001efab8 sp=0xc0001efa70 pc=0x55cab7dbcd85
net.(*TCPConn).Read(0xc00028a030?, {0xc000296000?, 0xc0001fe100?, 0xc0001efaf0?})
	<autogenerated>:1 +0x25 fp=0xc0001efae8 sp=0xc0001efab8 pc=0x55cab7dc8765
net/http.(*connReader).Read(0xc00028a030, {0xc000296000, 0x1000, 0x1000})
	net/http/server.go:789 +0x14b fp=0xc0001efb38 sp=0xc0001efae8 pc=0x55cab7ed7a8b
bufio.(*Reader).fill(0xc000294000)
	bufio/bufio.go:110 +0x103 fp=0xc0001efb70 sp=0xc0001efb38 pc=0x55cab7e94383
bufio.(*Reader).Peek(0xc000294000, 0x4)
	bufio/bufio.go:148 +0x53 fp=0xc0001efb90 sp=0xc0001efb70 pc=0x55cab7e944b3
net/http.(*conn).serve(0xc0001c21b0, {0x55cab8216b48, 0xc000196db0})
	net/http/server.go:2079 +0x749 fp=0xc0001effb8 sp=0xc0001efb90 pc=0x55cab7edd7e9
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3290 +0x28 fp=0xc0001effe0 sp=0xc0001effb8 pc=0x55cab7ee2068
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0001effe8 sp=0xc0001effe0 pc=0x55cab7cf8f01
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3290 +0x4b4

rax    0x0
rbx    0x1e6e92
rcx    0x7f6de04a5bcc
rdx    0x6
rdi    0x1e6e8e
rsi    0x1e6e92
rbp    0x7f6d82dfc6c0
rsp    0x7f6d82dfa370
r8     0x0
r9     0x1
r10    0x8
r11    0x246
r12    0x7f6d5d7e9b10
r13    0x6
r14    0x7f6d82dfa668
r15    0x0
rip    0x7f6de04a5bcc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-02-03T16:24:24.686+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 2"
[GIN] 2025/02/03 - 16:24:24 | 500 |  1.382866606s |       127.0.0.1 | POST     "/api/generate"

Executing only this one line:

source /opt/intel/oneapi/setvars.sh 

(llm-cpp) nian:llama-cpp/ $ ./ollama serve                       [16:27:28]
2025/02/03 16:27:31 routes.go:1194: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/nian/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-03T16:27:31.243+08:00 level=INFO source=images.go:753 msg="total blobs: 11"
time=2025-02-03T16:27:31.244+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-03T16:27:31.245+08:00 level=INFO source=routes.go:1245 msg="Listening on 127.0.0.1:11434 (version 0.5.1-ipexllm-20250123)"
time=2025-02-03T16:27:31.246+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4209733493/runners
time=2025-02-03T16:27:31.342+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/02/03 - 16:27:33 | 200 |     405.912µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/02/03 - 16:27:33 | 200 |   32.927453ms |       127.0.0.1 | POST     "/api/show"
time=2025-02-03T16:27:33.128+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-02-03T16:27:33.128+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:27:33.130+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:27:33.130+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:27:33.140+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-02-03T16:27:33.205+08:00 level=INFO source=server.go:105 msg="system memory" total="30.9 GiB" free="7.1 GiB" free_swap="12.9 GiB"
time=2025-02-03T16:27:33.206+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[7.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.0 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[6.0 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2025-02-03T16:27:33.214+08:00 level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama4209733493/runners/ipex_llm/ollama_llama_server --model /home/nian/.ollama/models/blobs/sha256-6340dc3229b0d08ea9cc49b75d4098702983e17b4c096d57afbbf2ffc813f2be --ctx-size 8192 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 4 --port 36591"
time=2025-02-03T16:27:33.216+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-03T16:27:33.216+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-02-03T16:27:33.216+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-02-03T16:27:33.297+08:00 level=INFO source=runner.go:963 msg="starting go runner"
time=2025-02-03T16:27:33.298+08:00 level=INFO source=runner.go:964 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=6
time=2025-02-03T16:27:33.299+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:36591"
llama_model_loader: loaded meta data with 28 key-value pairs and 292 tensors from /home/nian/.ollama/models/blobs/sha256-6340dc3229b0d08ea9cc49b75d4098702983e17b4c096d57afbbf2ffc813f2be (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                       llama.context_length u32              = 131072
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   8:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  15:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  20:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
time=2025-02-03T16:27:33.469+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW) 
llm_load_print_meta: general.name     = DeepSeek R1 Distill Llama 8B
llm_load_print_meta: BOS token        = 128000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1263.13 MiB
llm_load_tensors:      SYCL1 buffer size =  3140.37 MiB
llm_load_tensors:  SYCL_Host buffer size =   281.81 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A730M Graphics|    1.5|    384|    1024|   32| 12160M|            1.3.30872|
| 1| [level_zero:gpu:1]|                 Intel Iris Xe Graphics|    1.5|     96|     512|   32| 30750M|            1.3.30872|
llama_kv_cache_init:      SYCL0 KV buffer size =   320.00 MiB
llama_kv_cache_init:      SYCL1 KV buffer size =   704.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    96.00 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =   258.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 3
time=2025-02-03T16:27:46.129+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A730M Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl.cpp, line:2927
time=2025-02-03T16:27:46.281+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 1"
[GIN] 2025/02/03 - 16:27:46 | 500 | 13.190232474s |       127.0.0.1 | POST     "/api/generate"

@qiuxin2012
Contributor

This is a known issue; we will add a better error message for this case. You can use the oneAPI device selector to restrict ollama to the A730M before you run it, e.g. export ONEAPI_DEVICE_SELECTOR=level_zero:0 on Linux.
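
For reference, a minimal sketch of doing this in a Linux shell (assuming the A730M enumerates as Level Zero device 0 on this machine; the index is machine-specific, so check it with sycl-ls first):

# List the devices the SYCL runtime can see and note the A730M's level_zero index.
sycl-ls

# Restrict the runtime to that single device, then start ollama from the same shell
# so the server process inherits the variable.
export ONEAPI_DEVICE_SELECTOR=level_zero:0
./ollama serve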

@cyear
Author

cyear commented Feb 5, 2025

This is a known issue; we will add a better error message for this case. You can use the oneAPI device selector to restrict ollama to the A730M before you run it, e.g. export ONEAPI_DEVICE_SELECTOR=level_zero:0 on Linux.

I had already specified the GPU in the first run.

@qiuxin2012
Contributor

found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A730M Graphics|    1.5|    384|    1024|   32| 12160M|            1.3.30872|
| 1| [level_zero:gpu:1]|                 Intel Iris Xe Graphics|    1.5|     96|     512|   32| 30750M|            1.3.30872|
llama_kv_cache_init:      SYCL0 KV buffer size =

But your log shows that both the iGPU and the A730M are still visible; if export ONEAPI_DEVICE_SELECTOR=level_zero:0 were set correctly, only the A730M would be found here.
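
For reference, a quick sanity check (a sketch, assuming a POSIX shell) that the selector is actually exported in the shell that launches the server:

# Should print level_zero:0; an empty line means the variable is not set in this shell.
echo "$ONEAPI_DEVICE_SELECTOR"

# Start ollama from the same shell so it inherits the variable.
./ollama serve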

@cyear
Author

cyear commented Feb 5, 2025

found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A730M Graphics|    1.5|    384|    1024|   32| 12160M|            1.3.30872|
| 1| [level_zero:gpu:1]|                 Intel Iris Xe Graphics|    1.5|     96|     512|   32| 30750M|            1.3.30872|
llama_kv_cache_init:      SYCL0 KV buffer size =

But your log shows that both the iGPU and the A730M are still visible; if export ONEAPI_DEVICE_SELECTOR=level_zero:0 were set correctly, only the A730M would be found here.

That was the second run, where I only sourced the oneAPI environment. Please refer to the code block above for the first run.

@cyear
Author

cyear commented Feb 5, 2025

found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A730M Graphics|    1.5|    384|    1024|   32| 12160M|            1.3.30872|
| 1| [level_zero:gpu:1]|                 Intel Iris Xe Graphics|    1.5|     96|     512|   32| 30750M|            1.3.30872|
llama_kv_cache_init:      SYCL0 KV buffer size =

But your log shows that both the iGPU and the A730M are still visible; if export ONEAPI_DEVICE_SELECTOR=level_zero:0 were set correctly, only the A730M would be found here.

export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1

source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# [optional] if you want to run on a single GPU, use the command below to limit the GPU, which may improve performance
export ONEAPI_DEVICE_SELECTOR=level_zero:0

./ollama serve

An incomprehensible error occurred:

ggml_sycl_init: found 1 SYCL devices:
/var/tmp/portage/dev-libs/intel-compute-runtime-24.35.30872.32/work/compute-runtime-24.35.30872.32/shared/source/os_interface/linux/drm_neo.cpp
SIGABRT: abort
PC=0x7f6de04a5bcc m=5 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 19 gp=0xc000186540 m=5 mp=0xc000100008 [syscall]:
runtime.cgocall(0x55cab7f0fab0, 0xc000094b90)

@qiuxin2012
Contributor

We haven't seen this error before; it may be caused by your environment. How about export ONEAPI_DEVICE_SELECTOR=level_zero:1? And could you show me the output of clinfo | grep "Device Name"?
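
For reference, two commands that can help cross-check the device indices (a sketch; clinfo shows the OpenCL view, while sycl-ls shows the Level Zero ordering that ONEAPI_DEVICE_SELECTOR=level_zero:N refers to):

# OpenCL view: device names only.
clinfo | grep "Device Name"

# SYCL/Level Zero view: lists backend:index entries such as [level_zero:gpu:0].
sycl-ls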

@cyear
Author

cyear commented Feb 5, 2025

We haven't seen this error before; it may be caused by your environment. How about export ONEAPI_DEVICE_SELECTOR=level_zero:1? And could you show me the output of clinfo | grep "Device Name"?

nian:~/ $ clinfo | grep "Device Name"                                                                                                            [10:31:12]
  Device Name                                     Intel(R) Arc(TM) A730M Graphics
  Device Name                                     Intel(R) Iris(R) Xe Graphics
  Device Name                                     12th Gen Intel(R) Core(TM) i7-12700H
    Device Name                                   Intel(R) Arc(TM) A730M Graphics
    Device Name                                   Intel(R) Arc(TM) A730M Graphics
    Device Name                                   Intel(R) Arc(TM) A730M Graphics

Setting level_zero:1 gives the same error as level_zero:0:

export ONEAPI_DEVICE_SELECTOR=level_zero:1

found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.5|     96|     512|   32| 15139M|            1.3.30872|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
time=2025-02-05T10:33:29.416+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Xe Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl.cpp, line:2927
time=2025-02-05T10:33:30.046+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 1"

@qiuxin2012
Contributor

I just tried deepseek-r1:8b on an A770 and the model works fine. Could you uninstall your Intel driver and oneAPI, then follow https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver to reinstall the environment?

@cyear
Author

cyear commented Feb 5, 2025

I just tried deepseek-r1:8b on an A770 and the model works fine. Could you uninstall your Intel driver and oneAPI, then follow https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver to reinstall the environment?

But I am using Gentoo Linux. On Windows I have to disable the integrated graphics to make it work properly; if I use oneAPI to specify the GPU there, I get the same error. Currently I am running the model on Windows.

@cyear
Author

cyear commented Feb 5, 2025

I just tried deepseek-r1:8b on an A770 and the model works fine. Could you uninstall your Intel driver and oneAPI, then follow https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver to reinstall the environment?

But I am using Gentoo Linux. On Windows I have to disable the integrated graphics to make it work properly; if I use oneAPI to specify the GPU there, I get the same error. Currently I am running the model on Windows.

Disabling the Xe integrated graphics and keeping only the A730M discrete GPU lets it run normally on Windows.

@qiuxin2012
Contributor

On Ubuntu, we can use ONEAPI_DEVICE_SELECTOR to choose the GPU correctly.
Gentoo is not a system we support, so we have no idea what will happen on it.

@cyear
Author

cyear commented Feb 5, 2025

On Ubuntu, we can use ONEAPI_DEVICE_SELECTOR to choose the GPU correctly.
Gentoo is not a system we support, so we have no idea what will happen on it.

On Windows I can also specify the GPU this way, but the error is the same, so I gave up and just disabled the Xe integrated graphics on Windows to make it work.
