@mtlprintf #418
@maleadt Any idea how we can implement the version check for the Also can we get rid of the |
Would it be worth benchmarking the performance difference between having logging active vs not?
@christiangnrd Sure. I don't expect there to be much overhead besides allocation of the log buffer and checking it for logs after running a kernel. But we might want to look into only conditionally adding |
Given that the macro expands way too early, I don't think there's anything we can do but check in the kernel. Why are you opposed to that? GPUCompiler.jl has infrastructure to optimize those checks away; see e.g. how CUDA.jl exposes the device capability and PTX ISA version to the kernel.
We could also wrap the macro and accompanying functions in |
If we do that we should have definitions in both cases and give an informative error if |
Actually, looks like I provided the run-time queries already: Metal.jl/src/device/intrinsics/version.jl Lines 64 to 65 in 6c82916
So we can just use that in the generated code, generating an I'd rather not simply check based on the macOS version during macro expansion, since we might want to target older Metal versions than the system supports. |
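The idea discussed above, emitting the version check into the generated code rather than deciding during macro expansion, can be sketched in plain Julia. This is only an illustration under stated assumptions: `TARGET_VERSION`, `target_version` and `@guarded_print` are hypothetical names invented for this example and are not part of Metal.jl or GPUCompiler.jl. In a real GPU compiler the query would be a compile-time constant, so the untaken branch could be optimized away.

```julia
# Hypothetical stand-in for a run-time (device-side) version query.
# Not a Metal.jl API; a real implementation would expose a constant that
# the compiler can fold, removing the dead branch.
const TARGET_VERSION = Ref(v"3.2")
target_version() = TARGET_VERSION[]

# Hypothetical macro: the check lives in the expanded code, so it is
# evaluated for the version actually targeted, not for whatever system
# happened to expand the macro.
macro guarded_print(msg)
    quote
        if target_version() >= v"3.2"
            print($(esc(msg)))
        else
            error("printing requires Metal 3.2 (macOS 15) or higher")
        end
    end
end

function demo()
    @guarded_print "Hello, World\n"
    return
end

demo()
```

Setting `TARGET_VERSION[] = v"3.1"` before calling `demo()` takes the error branch instead, which mirrors targeting an older Metal version than the host system supports.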
4ee3467 to b43bcb1
Looks good! However do you know what's causing the tests to hang?
@christiangnrd The hangs are caused by this one line: @print_and_throw "@mtlprintf requires Metal 3.2 (macOS 15) or higher"
@maleadt Could we have one of the Apple Silicon runners upgraded to Sequoia so the output tests don't get ignored? Edit: All the runners are running 13.3.1. Should we also have one on macOS 14? I would also like to see #420 merged first (with benchmarks run on macOS 15) to see how big the impact of enabling logging is.
@christiangnrd I recently made changes so that logging (e.g. MTLLogState and friends) is only enabled whenever we actually use the feature.
Just pushed a whitespace-only formatting commit.
In that case I still think we should be able to test on macOS 15, but I think we should merge this as soon as it's ready.
How did this get fixed?
I assume by no longer running when |
Right; but that's not great. It means that any kernel using logging output will first generate a non-fatal error message on the host, and then hang in the kernel? Or, when we on macOS 15 use (the hypothetical, but useful) EDIT: suggested capability implemented here: #430 |
The following code hangs in the REPL, but not when run using
|
Isn't that because in the REPL we force synchronization via an AST transform hook? What happens if you synchronize manually?
@maleadt When I add |
Metal Benchmarks
Benchmark suite | Current: 0840aa4 | Previous: 8652754 | Ratio |
---|---|---|---|
latency/precompile | 4599693584 ns | 4401680834 ns | 1.04 |
latency/ttfp | 6702643541.5 ns | 6678542687 ns | 1.00 |
latency/import | 722647167 ns | 721498042 ns | 1.00 |
integration/metaldevrt | 715958 ns | 708167 ns | 1.01 |
integration/byval/slices=1 | 1498958.5 ns | 1530625 ns | 0.98 |
integration/byval/slices=3 | 11746791 ns | 11010542 ns | 1.07 |
integration/byval/reference | 1489417 ns | 1585084 ns | 0.94 |
integration/byval/slices=2 | 2602291.5 ns | 2472708 ns | 1.05 |
kernel/indexing | 464895.5 ns | 454333 ns | 1.02 |
kernel/indexing_checked | 466812.5 ns | 455667 ns | 1.02 |
kernel/launch | 8417 ns | 8459 ns | 1.00 |
array/construct | 27659.666666666668 ns | 27638.916666666664 ns | 1.00 |
array/broadcast | 460729.5 ns | 464625 ns | 0.99 |
array/random/randn/Float32 | 804708 ns | 813083 ns | 0.99 |
array/random/randn!/Float32 | 610958 ns | 634041 ns | 0.96 |
array/random/rand!/Int64 | 552250 ns | 552750 ns | 1.00 |
array/random/rand!/Float32 | 581958.5 ns | 577083 ns | 1.01 |
array/random/rand/Int64 | 795125 ns | 800833.5 ns | 0.99 |
array/random/rand/Float32 | 599209 ns | 583709 ns | 1.03 |
array/copyto!/gpu_to_gpu | 639042 ns | 643166.5 ns | 0.99 |
array/copyto!/cpu_to_gpu | 585875.5 ns | 600020.5 ns | 0.98 |
array/copyto!/gpu_to_cpu | 736041.5 ns | 777166.5 ns | 0.95 |
array/accumulate/1d | 1332458 ns | 1334916 ns | 1.00 |
array/accumulate/2d | 1420438 ns | 1419167 ns | 1.00 |
array/iteration/findall/int | 2084291.5 ns | 2072542 ns | 1.01 |
array/iteration/findall/bool | 1812750 ns | 1854833 ns | 0.98 |
array/iteration/findfirst/int | 1687750 ns | 1674333 ns | 1.01 |
array/iteration/findfirst/bool | 1644416.5 ns | 1643833 ns | 1.00 |
array/iteration/scalar | 3675458.5 ns | 3625334 ns | 1.01 |
array/iteration/logical | 3255666 ns | 3281021 ns | 0.99 |
array/iteration/findmin/1d | 1615416 ns | 1572104 ns | 1.03 |
array/iteration/findmin/2d | 1319125 ns | 1325292 ns | 1.00 |
array/reductions/reduce/1d | 1048770.5 ns | 1055583 ns | 0.99 |
array/reductions/reduce/2d | 691041.5 ns | 690959 ns | 1.00 |
array/reductions/mapreduce/1d | 1052625 ns | 1057604.5 ns | 1.00 |
array/reductions/mapreduce/2d | 694708 ns | 700416.5 ns | 0.99 |
array/permutedims/4d | 836583 ns | 846917 ns | 0.99 |
array/permutedims/2d | 846937.5 ns | 856979.5 ns | 0.99 |
array/permutedims/3d | 922750 ns | 916917 ns | 1.01 |
array/copy | 610166 ns | 610041 ns | 1.00 |
metal/synchronization/stream | 14208 ns | 14667 ns | 0.97 |
metal/synchronization/context | 14500 ns | 14916 ns | 0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
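For reading the table, the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1.00 mean the current commit is slower. That interpretation is an assumption, but it is consistent with the reported rows. A minimal Julia sketch (the helper name `ratio` is hypothetical):

```julia
# Hypothetical helper: current / previous, rounded to two decimals.
# Assumes lower timings are better, so a ratio above 1 means a slowdown.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits = 2)

# Values taken from the benchmark table.
precompile = ratio(4599693584, 4401680834)   # latency/precompile
reference  = ratio(1489417, 1585084)         # integration/byval/reference

println("latency/precompile: ", precompile)
println("integration/byval/reference: ", reference)
```

With the table's numbers this reproduces the reported 1.04 and 0.94 for those two rows.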
I opened an issue for the hang: #433
f70ccac to 95e47f1
In the assumption that the conditional
julia> using Metal
julia> function kernel()
           @mtlprint("Hello, World\n")
           return
       end
kernel (generic function with 1 method)
julia> Metal.@sync @metal kernel()
Hello, World
# hang The |
5a8daaa to b9610e3
Rebased, however, this has regressed and fails to compile now:
Something with the IR downgrader not correctly handling the |
Alright, the downgrader fix was twofold, and will be bumped in JuliaPackaging/Yggdrasil#10053:
With that, we're back at the hang from #433 |
Implement @mtlprintf and friends using os_log
TODO:
depends on: JuliaGPU/GPUCompiler.jl#630
notify: #226