Undeterministically task==NULL at runtime when finetuning GPT-2 #12529
Comments
To triage a nondeterministic issue like this, it would be very helpful to be able to run the reproduction steps with sanitizers: AddressSanitizer and, separately, ThreadSanitizer. This page says:
Since you write above that this reproduces in Simulator, let's focus on that. In particular, even a negative outcome (the sanitizer doesn't see anything) would be useful information in itself, as it would help rule out classes of issues. We have sanitizers docs here, but I wrote that a while ago and it's not optimal. Here are the important steps:
Then re-build the IREE runtime (
If the reproducing program is your own ( Then re-run your
If the reproducing program is your own ( |
Thanks @bjacob ! I rebuilt the IREE compiler and runtime for macOS/M1 with the following additional CMake flags
The build went fine, except that I had to patch libyaml slightly (yaml/libyaml#267). Then, I compiled the
iree-compile /tmp/gpt2.mlir \
--iree-input-type=mhlo \
--iree-hal-target-backends=llvm-cpu \
-o /tmp/gpt2-san.vmfb \
--iree-llvm-sanitize=thread --iree-llvm-link-embedded=false 2>&1 | tee /tmp/log
It gave me errors like the following. (A more complete error message is at https://gist.github.com/wangkuiyi/b4ef1a867e6f129fe3287a0ef0e1d600. The complete one is too big to upload to GitHub.)
It works if I remove |
I don't know the fix for these linking errors, but, FYI:
The |
Interesting! The linker command line from your gist is
and it is itself generated by this code: https://github.com/openxla/iree/blob/1148f720be7e267f248e034b3cfb488633884980/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/internal/UnixLinkerTool.cpp#L82-L92 This is as if on the Apple platform, the TSan instrumentation library needed to be explicitly linked in (?) We need someone with Apple experience here.... maybe @powderluv ? |
Maybe try adding That is, at UnixLinkerTool.cpp:90 (above linked code), add unconditionally
If that works, we'll figure out how to do that conditionally. |
clang -fsanitize=thread /tmp/a.c -o /tmp/a is equivalent to the following two commands:
clang /tmp/a.c -c -o /tmp/a.o
and
ld /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.tsan_osx_dynamic.dylib \
-rpath @executable_path \
-rpath /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin \
/tmp/a.o -o /tmp/a \
-lSystem -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk |
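The expanded `ld` invocation above suggests which extra arguments a linker wrapper would have to append on Apple platforms when TSan is enabled. The following is only a sketch of that idea: `TsanLinkerArgs` and its parameters are hypothetical names, not part of IREE or of UnixLinkerTool.cpp, and the runtime directory is assumed to be passed in by the caller rather than discovered automatically.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not IREE code: returns the extra arguments a linker
// invocation would need so that a TSan-instrumented object links on macOS.
// `clang_rt_dir` is assumed to be the directory that holds the clang runtime
// dylibs (the .../lib/darwin directory seen in the ld command above).
std::vector<std::string> TsanLinkerArgs(bool target_is_apple,
                                        bool tsan_enabled,
                                        const std::string& clang_rt_dir) {
  std::vector<std::string> args;
  // Only add anything for the Apple + TSan combination.
  if (!target_is_apple || !tsan_enabled) return args;
  // Link the TSan runtime explicitly, as the expanded ld command does.
  args.push_back(clang_rt_dir + "/libclang_rt.tsan_osx_dynamic.dylib");
  // Let the loader find the runtime next to the binary or in the toolchain.
  args.push_back("-rpath");
  args.push_back("@executable_path");
  args.push_back("-rpath");
  args.push_back(clang_rt_dir);
  return args;
}
```

A caller would append the returned arguments to its existing flag list. Whether this is sufficient for IREE's generated modules is not established in this thread.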
I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the IREE runtime with sanitizers and not fiddling with the (It would obviously be good if this all worked better on Apple platforms, so I'm just offering an option that might lead through the maze faster. It is still useful to figure out how to fully enable sanitizers.) |
Other things that can be done to bisect the area that is having the problem:
I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity. |
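For context on what ThreadSanitizer would be looking for here: a pointer that is sometimes observed as NULL is a typical symptom of an unsynchronized hand-off between threads. The sketch below is an illustration only. `Task` and `RunHandoff` are hypothetical names, this is not IREE's task scheduler, and nothing in this thread establishes that the crash has this shape. It shows a correctly synchronized hand-off; replacing the atomic with a plain pointer would turn it into the kind of data race that TSan reports.

```cpp
#include <atomic>
#include <thread>

// Hypothetical type for illustration only; not IREE's task system.
struct Task {
  int payload;
};

// Publishes a Task pointer from one thread and consumes it on another.
// The release store paired with the acquire load guarantees that once the
// consumer observes a non-null pointer, it also sees the initialized Task.
int RunHandoff() {
  std::atomic<Task*> slot{nullptr};
  Task task{42};

  std::thread producer([&] {
    slot.store(&task, std::memory_order_release);
  });

  Task* seen = nullptr;
  // Spin until the producer has published the pointer.
  while ((seen = slot.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();
  }
  int result = seen->payload;
  producer.join();
  return result;
}
```

Building a reproduction like this with `-fsanitize=thread` is a cheap way to confirm that the sanitizer toolchain itself works before pointing it at the real program.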
Agree that this issue does not look like it comes from the generated code.... but TSan specifically (as opposed to other sanitizers) does not allow taking advantage of that in that way, because a TSan-enabled IREE runtime can only call TSan-enabled module code (TSan is an ABI break). Well, it will run, but it will crash.
Ah good idea, that does enable running a TSan-enabled IREE-runtime without having to get TSan to work in module code. My above objection is specific to llvm-cpu target backend.
+1 |
@bjacob @wangkuiyi Looks like this went a bit stale, any further update? |
Deferring to @wangkuiyi . |
@allieculp and @bjacob - I got GPT-2 fine-tuning working a month ago, but via @antiagainst 's Metal GPU backend. This issue occurs with the CPU backend, but not with the Metal GPU one. |
What happened?
After we fixed #12369, I can make GPT-2 generate text well, so I'm moving on to fine-tuning GPT-2.
In iree-org/iree-jax#58, I added a loss function to the file iree-jax/models/gpt2/model.py. In JAX-Python, the fine-tuning works well.
Then, in iree-org/iree-jax#59, I added the fine-tuning feature as an MLIR function. The compilation went well, and I got the file /tmp/gpt2.vmfb.
I can run the module using iree-run-module
Because the finetune function only updates the parameters and does not return anything, the above run prints only EXEC @finetune.
To check if the fine-tuning really works on macOS, I wrote a C++ program to run this vmfb file. Sometimes it works well, but sometimes it crashes with Bus error: 10.
By putting the C++ program into an iOS app written in Objective-C, I can run the app on my iPhone 13 or the iOS Simulator. On these two platforms, the program crashes with EXC_BAD_ACCESS almost every time. I am attaching a stack trace from Xcode.
Steps to reproduce your issue
gpt2.vmfb
gpt2.vmfb on macOS/M1.
gpt2.vmfb on the iOS Simulator or an iPhone.
What component(s) does this issue relate to?
Runtime
Version information
IREE da22c84
Additional context
macOS
M1 Max