Is there a possibility to support SYCL natively as a backend for llamafile? #498
-
I think someone from Intel was working on this four months ago, but it was a weekend project and he ran into difficulties getting Windows to work. Integrating foreign code into Cosmopolitan Libc presents unique challenges with a nontrivial learning curve. I'm the person most qualified to integrate something like SYCL, but I don't have any devices that need SYCL, so there's no reason for me to do it. Contributions are very much welcome if someone outside the project is up for the challenge of maintaining llamafile SYCL support; otherwise it's unlikely to happen. We obviously support Intel very much as a project. However, I believe our resources are best devoted to outstanding support for Intel's CPUs, which is where Intel has always shined the most. It surprises me that the focus has been shifting to novel GPU architectures. Perhaps in the future things will change, but without the resources we can't be of much help.
-
I agree, it'd be nice to see. Not just for this project but in general, it's unfortunate that there is so much fragmentation around GPUs acting as compute accelerators for ML applications like this one. In that role they essentially perform fast matrix / vector / tensor operations, which are conceptually and architecturally quite portable, which is why BLAS and similar libraries have been core infrastructure across CPU and even GPU platforms for decades. Yet despite the modest "I'm just a compute accelerator" use case, we've ended up in a Tower of Babel where you have to be very lucky if a given software project supports any GPU at all, and if it does, it's probably one specific platform (the NVIDIA / CUDA ecosystem, Apple / Metal, etc.) and nothing else. We wouldn't tolerate that kind of non-portability for CPU code, which is usually fairly platform-portable across many languages, but for GPUs we're still in the tar pit.

For GPUs (and even CPUs and other kinds of accelerators) there are relatively open, cross-platform middleware / runtime / framework options like OpenCL, Vulkan compute, SYCL, OpenMP, and OpenACC, but it's still rare to see these more target-portable compute layers supported at all, let alone as first-class backends; most projects natively support zero or one GPU type and nothing else. Maybe GPU acceleration will eventually be made obsolete by folding the relevant capabilities into CPUs themselves, at which point we'd be back to cross-platform, vector-accelerated code in standard languages using layers like LLVM for targeting. In the meantime, it's almost as if anything but NVIDIA GPUs is considered irrelevant in much of the ML space, not because other GPUs are less usable or particularly hard to support for core use cases like BLAS.

The point about first-class support for Intel CPUs vs. Intel GPUs is an interesting juxtaposition, though, since SYCL / oneAPI originated at Intel and are Intel's own solution for a platform-independent HPC / parallel development and runtime framework: write the code once and run it efficiently on a variety of Intel CPUs, Intel GPUs, and other vendors' targets that also support SYCL (NVIDIA, I think, is one).
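Not part of the original reply, but to make the "write it once, run it on whatever device is available" point concrete, here is a minimal SYCL 2020 vector-add sketch (the array size and variable names are arbitrary). In principle the same source compiles with a SYCL toolchain such as Intel's DPC++ and can be dispatched to a CPU, an Intel iGPU, an Arc card, or other vendors' GPUs where a backend plugin exists:

```cpp
// Minimal illustrative sketch (not from the thread): a SYCL 2020 vector add
// that runs unchanged on whichever device the runtime selects.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // default_selector_v picks the "best" available device; the same binary
    // can end up on a CPU, an integrated GPU, or a discrete GPU.
    sycl::queue q{sycl::default_selector_v};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    {
        sycl::buffer<float> ab(a.data(), sycl::range<1>{n});
        sycl::buffer<float> bb(b.data(), sycl::range<1>{n});
        sycl::buffer<float> cb(c.data(), sycl::range<1>{n});
        q.submit([&](sycl::handler& h) {
            sycl::accessor A(ab, h, sycl::read_only);
            sycl::accessor B(bb, h, sycl::read_only);
            sycl::accessor C(cb, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    } // buffer destructors copy the result back into the host vector c

    std::cout << "c[0] = " << c[0] << "\n"; // expect 3
    return 0;
}
```

The buffer/accessor model lets the runtime manage host-device data movement; unified shared memory pointers are the other common style, but either way the kernel itself stays device-agnostic.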
-
The simplicity of llamafile is incredible, as all of us users know. I'm just curious whether there are any plans to support SYCL as a backend for llamafile. With Intel's current-gen Meteor Lake and next-generation Lunar Lake platforms, the iGPU gets at least half of the system RAM (so on a 32 GB machine we get around 16 GB of VRAM), and these iGPUs are based on the Arc architecture. It would be awesome to get native support in llamafile so that applied-AI folks can use whichever hardware they want on Intel platforms (CPU, GPU, or even NPU in the near future). (SYCL code can also run on GPUs from other vendors via Codeplay's plugins.) Although there is Vulkan and OpenCL support, the SYCL backend in runtimes like llama.cpp (which already supports SYCL) has given me the best performance on my Meteor Lake machine and on my Arc A770.

For people who are truly GPU-poor, SYCL support in llamafile would let them run models fast on Arc GPUs (an Arc A770 with 16 GB of VRAM costs around 300 USD). That makes an excellent LLM server for home automation, while the CPU cores stay free for other things like databases and non-matrix-multiply work.
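Purely as an illustration of "use whichever hardware they want", and not anything from llamafile or llama.cpp itself, here is a small hypothetical SYCL snippet that enumerates the visible platforms and devices (CPU, iGPU, discrete Arc, etc.) and prefers a GPU queue, falling back to the CPU if none is present:

```cpp
// Illustrative sketch: list SYCL devices, then prefer a GPU queue with a
// CPU fallback. Device names shown depend entirely on the local machine.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    for (const auto& p : sycl::platform::get_platforms()) {
        std::cout << "Platform: "
                  << p.get_info<sycl::info::platform::name>() << "\n";
        for (const auto& d : p.get_devices()) {
            std::cout << "  Device: "
                      << d.get_info<sycl::info::device::name>()
                      << (d.is_gpu() ? " [GPU]" : d.is_cpu() ? " [CPU]" : "")
                      << "\n";
        }
    }

    // gpu_selector_v throws if no GPU is visible, so fall back to the CPU.
    sycl::queue q = [] {
        try {
            return sycl::queue{sycl::gpu_selector_v};
        } catch (const sycl::exception&) {
            return sycl::queue{sycl::cpu_selector_v};
        }
    }();
    std::cout << "Using: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}
```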