[WIP] kmeans.daph #876

Open
wants to merge 5 commits into
base: main

Garic152
Contributor

I have now reached a state where the code's functionality should be ready very soon, but I had to include many type conversions and small workarounds to make it work.

Some things I noticed were:

  • Having different data types (like an si64 and an f64) inside the ln or log function leads to errors
  • The same goes for the ifelse construct (x = y == 1 ? 1 : 2.0): having outputs of different types leads to errors
  • There is no way to omit the sparsity or the min/max values in rand(), which is possible in SystemDS and is unfortunately used twice in the kmeans algorithm. For line 237 of the original algorithm, for example, I assumed a sparsity of 1.0 to make the code work, which might not actually be the case when the matrix is filled truly randomly

After my last adjustment of the file in lines 188-190, which changed the values to be inserted from si64 to f64 so that insert_col works, I now get another error message of this type:

eric@daphne-container:/daphne$ ./bin/daphne  tools/dml2daph/translated_files/test_kmeans.daph
BEGIN K-MEANS SCRIPT
dim X=
5
x
100
Taking data samples for initialization...
Initializing the centroids for all runs...
Performing k-means iterations for all runs...
Run 1, At Start-Up:  Centroid WCSS = 6457.81
./bin/daphne(+0x148b8e2)[0x5fa05da8d8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x79716d045320]
/daphne/bin/../lib/libAllKernels.so(_ZN11DenseMatrixIdE17getValuesInternalEPK21IAllocationDescriptorPK5Range+0x412)[0x797161dcf8d2]
/daphne/bin/../lib/libAllKernels.so(_transpose__DenseMatrix_double__DenseMatrix_double+0x8e)[0x797161c9405e]
[0x79716d699477]
[0x79716d69d109]
[0x79716d69d16d]
[0x79716d69d2dd]
./bin/daphne(+0x1e04e15)[0x5fa05e406e15]
./bin/daphne(+0x14ad1b3)[0x5fa05daaf1b3]
./bin/daphne(+0x14b262d)[0x5fa05dab462d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x79716d02a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x79716d02a28b]
./bin/daphne(+0x148a3a5)[0x5fa05da8c3a5]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 11

When using the --explain function, all passes up to llvm work fine, but the mlir_codegen pass doesn't seem to work. I have included a small test file in this PR that reproduces the error message.

After fixing the remaining errors I will first test the functionality of the translated algorithm and then work on proper formatting and commenting.

@philipportner
Collaborator

philipportner commented Oct 24, 2024

Hi @Garic152 , thanks for working on this.

Our code generation pipeline is currently not run by default and only executed by adding --mlir-codegen when executing daphne. As the code generation pipeline does not support all workloads at the moment, it should be sufficient if you run daphne without the --mlir-codegen flag.

When you run --explain, the last explain output we provide is llvm, so that's at the end of our pipeline, and if that one printed something, then the lowering pipeline has completed.

Looking at the stack trace you provided, if you pipe it through c++filt you get demangled names.

cat stacktrace.txt | c++filt

./bin/daphne(+0x148b8e2)[0x5fa05da8d8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x79716d045320]
/daphne/bin/../lib/libAllKernels.so(DenseMatrix<double>::getValuesInternal(IAllocationDescriptor const*, Range const*)+0x412)[0x797161dcf8d2]
/daphne/bin/../lib/libAllKernels.so(_transpose__DenseMatrix_double__DenseMatrix_double+0x8e)[0x797161c9405e]
[0x79716d699477]
[0x79716d69d109]
[0x79716d69d16d]
[0x79716d69d2dd]
./bin/daphne(+0x1e04e15)[0x5fa05e406e15]
./bin/daphne(+0x14ad1b3)[0x5fa05daaf1b3]
./bin/daphne(+0x14b262d)[0x5fa05dab462d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x79716d02a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x79716d02a28b]
./bin/daphne(+0x148a3a5)[0x5fa05da8c3a5]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 11

Seems like the problem is triggered by the _transpose__DenseMatrix_double__DenseMatrix_double kernel at one of the calls to DenseMatrix::getValues.
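
Before demangling, it can help to pull out only the frames that actually name a symbol. The following is a minimal Python sketch of that step, not part of DAPHNE. The helper name symbol_frames and the regular expression are assumptions made for illustration, and the pattern only targets the raw `module(symbol+0xoffset)[address]` frame format shown in the traces above.

```python
import re

# Matches raw frames such as: /path/lib.so(_ZSymbol+0x412)[0x7971...]
# Frames without a symbol name, e.g. ./bin/daphne(+0x148b8e2)[...],
# and bare addresses such as [0x79716d699477] do not match.
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(]+)\((?P<symbol>[^+()]+)\+0x[0-9a-fA-F]+\)\[0x[0-9a-fA-F]+\]$"
)


def symbol_frames(trace):
    """Return (module, symbol) pairs for every frame that names a symbol."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("module"), match.group("symbol")))
    return frames


# Excerpt of the first stack trace posted in this conversation.
TRACE = """\
./bin/daphne(+0x148b8e2)[0x5fa05da8d8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x79716d045320]
/daphne/bin/../lib/libAllKernels.so(_ZN11DenseMatrixIdE17getValuesInternalEPK21IAllocationDescriptorPK5Range+0x412)[0x797161dcf8d2]
/daphne/bin/../lib/libAllKernels.so(_transpose__DenseMatrix_double__DenseMatrix_double+0x8e)[0x797161c9405e]
[0x79716d699477]
"""

for module, symbol in symbol_frames(TRACE):
    # Print the library file name and the (still mangled) symbol.
    print(module.rsplit("/", 1)[-1], symbol)
```

The extracted symbols can then be passed to c++filt; the kernel entry point `_transpose__DenseMatrix_double__DenseMatrix_double` is not a mangled C++ name, so it stays unchanged.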

I'd suggest compiling with --debug and starting daphne with a debugger to figure out what's going on:

gdb --args ./bin/daphne tools/dml2daph/translated_files/test_kmeans.daph

Otherwise, it would be good if you can create a minimal reproducible example that triggers this problem and create an issue :)

@philipportner
Collaborator

My bad, I missed that you already included a reproducible example.

Here's the backtrace:

(gdb) bt
#0  std::__uniq_ptr_impl<Range, std::default_delete<Range> >::_M_ptr (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:154
#1  std::unique_ptr<Range, std::default_delete<Range> >::get (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:361
#2  std::unique_ptr<Range, std::default_delete<Range> >::operator bool (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:375
#3  std::operator==<Range, std::default_delete<Range> >(std::unique_ptr<Range, std::default_delete<Range> > const&, decltype(nullptr)) (__x=std::unique_ptr<Range> = {...}) at /usr/include/c++/9/bits/unique_ptr.h:722
#4  DenseMatrix<double>::getValuesInternal (this=0x555559cef0b0, alloc_desc=<optimized out>, range=<optimized out>) at /home/philipportner/daphne/src/runtime/local/datastructures/DenseMatrix.cpp:194
#5  0x00007fffee2730a6 in DenseMatrix<double>::getValues (range=0x0, alloc_desc=0x0, this=<optimized out>) at /home/philipportner/daphne/src/runtime/local/datastructures/DenseMatrix.h:221
#6  Transpose<DenseMatrix<double>, DenseMatrix<double> >::apply (ctx=0x555559c15c60, arg=<optimized out>, res=@0x7fffffffb1d0: 0x555559dac1b0) at /home/philipportner/daphne/src/runtime/local/kernels/Transpose.h:63
#7  transpose<DenseMatrix<double>, DenseMatrix<double> > (ctx=0x555559c15c60, arg=<optimized out>, res=@0x7fffffffb1d0: 0x555559dac1b0) at /home/philipportner/daphne/src/runtime/local/kernels/Transpose.h:40
#8  _transpose__DenseMatrix_double__DenseMatrix_double (res=0x7fffffffb1d0, arg=0x555559cef0b0, kId=245, ctx=0x555559c15c60) at /home/philipportner/daphne/build/src/runtime/local/kernels/kernels_62.cpp:15
#9  0x00007ffff7fc4477 in m_kmeans-2-1 ()
#10 0x00007ffff7fc8109 in main ()
#11 0x00007ffff7fc816d in _mlir_ciface_main ()
#12 0x00007ffff7fc82dd in _mlir__mlir_ciface_main ()
#13 0x00005555575544ab in mlir::ExecutionEngine::invokePacked(llvm::StringRef, llvm::MutableArrayRef<void*>) ()
#14 0x0000555556cdb041 in mlir::ExecutionEngine::invoke<>(llvm::StringRef) (funcName=..., this=0x555559d325f0) at /usr/include/c++/9/bits/basic_string.h:940
#15 startDAPHNE (argc=2, argv=0x7fffffffd878, daphneLibRes=0x0, id=<optimized out>, user_config=...) at /home/philipportner/daphne/src/api/internal/daphne_internal.cpp:613
#16 0x0000555556ce0806 in mainInternal (argc=2, argv=0x7fffffffd878, daphneLibRes=0x0) at /home/philipportner/daphne/src/api/internal/daphne_internal.cpp:668
#17 0x00007ffff75cb083 in __libc_start_main (main=0x555556cae2f0 <main(int, char const**)>, argc=2, argv=0x7fffffffd878, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd868) at ../csu/libc-start.c:308
#18 0x0000555556cae22e in _start ()

@Garic152
Contributor Author

After diving into some more debugging, I now came across another very weird error, this time in line 123. This error occurs during the execution of an element-wise multiplication operation within the EWMin function.
Here the code crashes with the following error message:

eric@daphne-container:/daphne$ ./bin/daphne tools/dml2daph/translated_files/test_kmeans.daph 
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/daphne/bin/../lib/libAllKernels.so(_ZNK14MetaDataObject9getLatestEv+0xf)[0x7d89e75e61cf]
/daphne/bin/../lib/libAllKernels.so(_ZN11DenseMatrixIdE17getValuesInternalEPK21IAllocationDescriptorPK5Range+0x3c0)[0x7d89e75cf770]
/daphne/bin/../lib/libAllKernels.so(+0x77651e)[0x7d89e737651e]
/daphne/bin/../lib/libAllKernels.so(_ewMin__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double+0x4e)[0x7d89e7377f7e]
[0x7d89f3759e2d]
[0x7d89f375ad94]
[0x7d89f375addd]
[0x7d89f375af3d]
./bin/daphne(+0x1e04e15)[0x5e54cb933e15]
./bin/daphne(+0x14ad1b3)[0x5e54cafdc1b3]
./bin/daphne(+0x14b262d)[0x5e54cafe162d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7d89f2e2a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7d89f2e2a28b]
./bin/daphne(+0x148a3a5)[0x5e54cafb93a5]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 11
corrupted double-linked list
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d89f2e9eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d89f2e4526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d89f2e288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7d89f2e297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0xa8fe5)[0x7d89f2ea8fe5]
/lib/x86_64-linux-gnu/libc.so.6(+0xa9b6c)[0x7d89f2ea9b6c]
/lib/x86_64-linux-gnu/libc.so.6(+0xa9d1b)[0x7d89f2ea9d1b]
/lib/x86_64-linux-gnu/libc.so.6(+0xaad95)[0x7d89f2eaad95]
/lib/x86_64-linux-gnu/libc.so.6(+0xab42a)[0x7d89f2eab42a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x7d89f2eadd9e]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43atn12ATNConfigSetD2Ev+0x91)[0x7d89f36b5351]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43atn12ATNConfigSetD0Ev+0xd)[0x7d89f36b543d]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43dfa8DFAStateD0Ev+0xd)[0x7d89f3704d0d]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43dfa3DFAD1Ev+0x53)[0x7d89f37015c3]
./bin/daphne(+0x152512c)[0x5e54cb05412c]
/lib/x86_64-linux-gnu/libc.so.6(+0x47a66)[0x7d89f2e47a66]
/lib/x86_64-linux-gnu/libc.so.6(+0x47bae)[0x7d89f2e47bae]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1d1)[0x7d89f2e2a1d1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7d89f2e2a28b]
./bin/daphne(+0x148a3a5)[0x5e54cafb93a5]
*** longjmp causes uninitialized stack frame ***: terminated
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d89f2e9eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d89f2e4526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d89f2e288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7d89f2e297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0x136c19)[0x7d89f2f36c19]
/lib/x86_64-linux-gnu/libc.so.6(+0x135c21)[0x7d89f2f35c21]
/lib/x86_64-linux-gnu/libc.so.6(__longjmp_chk+0x32)[0x7d89f2f37302]
./bin/daphne(+0x148b909)[0x5e54cafba909]

As the matrices themselves work totally fine when printed and the EWMin operation on its own also works great, I did another analysis with gdb like @philipportner recommended.

This led to the following backtrace:

Thread 1 "daphne" received signal SIGSEGV, Segmentation fault.
0x00007c21b1eabe87 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007c21b1eabe87 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007c21b1ead6e4 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007c21b22bb904 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007c21a61bd658 in std::__new_allocator<unsigned long>::allocate (this=0x7fff97eeb430, __n=1) at /usr/include/c++/13/bits/new_allocator.h:151
#4  0x00007c21a61bd2cc in std::allocator<unsigned long>::allocate (__n=1, this=0x7fff97eeb430) at /usr/include/c++/13/bits/allocator.h:198
#5  std::allocator_traits<std::allocator<unsigned long> >::allocate (__n=1, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:482
#6  std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_allocate (this=0x7fff97eeb430, __n=1) at /usr/include/c++/13/bits/stl_vector.h:381
#7  0x00007c21a64bb3b3 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_create_storage (this=0x7fff97eeb430, __n=1)
    at /usr/include/c++/13/bits/stl_vector.h:398
#8  0x00007c21a64ba849 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_Vector_base (this=0x7fff97eeb430, __n=1, __a=...)
    at /usr/include/c++/13/bits/stl_vector.h:335
#9  0x00007c21a64fd005 in std::vector<unsigned long, std::allocator<unsigned long> >::vector (this=0x7fff97eeb430, 
    __x=std::vector of length 1, capacity 1 = {...}) at /usr/include/c++/13/bits/stl_vector.h:603
#10 0x00007c21a64fc3ca in MetaDataObject::getLatest (this=0x5d50a500d600) at /daphne/src/runtime/local/datastructures/MetaDataObject.cpp:91
#11 0x00007c21a64cc573 in DenseMatrix<double>::getValuesInternal (this=0x5d50a5133770, alloc_desc=0x0, range=0x0)
    at /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:188
#12 0x00007c21a6088ad9 in DenseMatrix<double>::getValues (this=0x5d50a5133770, alloc_desc=0x0, range=0x0)
    at /daphne/src/runtime/local/datastructures/DenseMatrix.h:222
#13 0x00007c21a6176ce1 in EwBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> >::apply (opCode=BinaryOpCode::MUL, 
    res=@0x7fff97eeb940: 0x5d50a50de580, lhs=0x5d50a5133770, rhs=0x5d50a56ec970, ctx=0x5d50a58ef3a0) at /daphne/src/runtime/local/kernels/EwBinaryMat.h:66
#14 0x00007c21a61760d2 in ewBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> > (opCode=BinaryOpCode::MUL, 
    res=@0x7fff97eeb940: 0x5d50a50de580, lhs=0x5d50a5133770, rhs=0x5d50a56ec970, ctx=0x5d50a58ef3a0) at /daphne/src/runtime/local/kernels/EwBinaryMat.h:43
#15 0x00007c21a616086e in _ewMul__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double (res=0x7fff97eeb940, lhs=0x5d50a5133770, rhs=0x5d50a56ec970, 
    kId=82, ctx=0x5d50a58ef3a0) at /daphne/build/src/runtime/local/kernels/kernels_21.cpp:186
#16 0x00007c21b3c055c6 in m_kmeans-2-1 ()
#17 0x00007c21b3c06d94 in main ()
#18 0x00007c21b3c06ddd in _mlir_ciface_main ()
#19 0x00007c21b3c06f3d in _mlir__mlir_ciface_main ()
#20 0x00005d50747eab45 in mlir::ExecutionEngine::invokePacked(llvm::StringRef, llvm::MutableArrayRef<void*>) ()
#21 0x00005d50738b3d93 in mlir::ExecutionEngine::invoke<>(llvm::StringRef) (this=0x5d50a5129d00, funcName=...)
    at /usr/local/include/mlir/ExecutionEngine/ExecutionEngine.h:180
#22 0x00005d5073891b00 in startDAPHNE (argc=2, argv=0x7fff97eede98, daphneLibRes=0x0, id=0x7fff97eedb68, user_config=...)
    at /daphne/src/api/internal/daphne_internal.cpp:613
#23 0x00005d50738935e8 in mainInternal (argc=2, argv=0x7fff97eede98, daphneLibRes=0x0) at /daphne/src/api/internal/daphne_internal.cpp:668
#24 0x00005d507388c552 in main (argc=2, argv=0x7fff97eede98) at /daphne/src/api/cli/daphne.cpp:19

There were several things I noticed when analyzing the backtrace:

In frame 13 inside of EwBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> >, while valuesRhs is a valid address, valuesLhs is a null pointer, which should not be the case.

In frame 12 (DenseMatrix<double>::getValues), all local variables (isLatest, id, ptr) report memory access errors:

(gdb) info locals
isLatest = <error reading variable: Cannot access memory at address 0x5d00a5073810>
id = <error reading variable: Cannot access memory at address 0x5>
ptr = <error reading variable: Cannot access memory at address 0x1>

This could indicate that the error originates in frame 11 (getValuesInternal()) or even earlier.

This is the *this I get from frame 11:

#11 0x00007c21a64cc573 in DenseMatrix<double>::getValuesInternal (this=0x5d50a5133770, alloc_desc=0x0, range=0x0) at /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:188
188                 auto latest = this->mdo->getLatest();
(gdb) print this
$23 = (DenseMatrix<double> * const) 0x5d50a5133770
(gdb) print *this
$24 = {<Matrix<double>> = {<Structure> = {_vptr.Structure = 0x7c21a739b8f0, refCounter = 1, refCounterMutex = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
            __size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}, row_offset = 0, col_offset = 0, numRows = 1, numCols = 5, mdo = std::shared_ptr<MetaDataObject> (use count 1, weak count 0) = {get() = 0x5d50a500d600}}, <No data fields>}, is_view = false, rowSkip = 5, 
  values = std::shared_ptr<double []> (use count 1, weak count 0) = {get() = 0x5d50a52fe9e0}, bufferSize = 40, lastAppendedRowIdx = 0, lastAppendedColIdx = 0}

Unfortunately, this is the point where I am no longer sure whether the Structure behaves normally, because so far I have not come into contact with the data structure generation part of DAPHNE at all.

Four side notes that may help to identify the source of the error:

  1. Just like in [WIP] Dev issue #766 #847, I tried removing the for construct and executing the loop manually a few times, which worked perfectly without any crashes. Is this problem maybe related to the other issue?
  2. Also, the code does work when instead of using the ifelse construct at the end you only use one of the equal statements which would be applied depending on the ifelse.
  3. I have installed SystemDS to cross-validate the matrix dimensions and outputs. Except for some differences in matrix values due to the different seeds, the dimensions are the same in DAPHNE and SystemDS up to the segfault.
  4. Executing the code multiple times can lead to different error logs; one of them is the one I included at the top of this message, and the other ones I put into this gist

I will push another reproducible error file kmeans_second_error.daph in a second.

@philipportner
Collaborator

Also, the code does work when instead of using the ifelse construct at the end you only use one of the equal statements which would be applied depending on the ifelse.

I tried running it locally on the pr #758 as this PR refactors code concerning the getValues functions. Doesn't fix the problem, and as you already pointed out, if lhs is a nullptr we messed up somewhere before.

The ternary at kmeans_final.daph:123 doesn't look right to me. Looks like you are trying to assign a f64 to a DenseMatrix in one of the branches.

When adding this diff, the type is still DenseMatrix(375, 1) , but trying to print(min_distances) itself fails.

min_distances = i == 1 ? as.matrix<f64>(is_row_in_samples * distances) : as.f64(min(min_distances, distances));
+ print(typeOf(min_distances));

@Garic152
Contributor Author

Garic152 commented Nov 7, 2024

Translating the script turned out to be more of a challenge than expected.

After going through the code and comparing the variables of the auto-translated code line by line with the kmeans.dml output, I found many extremely sneaky errors. Most were caused by incorrect translation between 0-based and 1-based indexing, especially in the ctable function. Another came from the reshape function, which reshapes row by row rather than column by column as in SystemDS (maybe an option to choose between these two methods would be useful in DAPHNE?).

After fixing these errors and making sure every operation does what it is supposed to when compared to SystemDS, I once again arrived at the segfault I initially mentioned, in the k-means iteration step in line 153.

I tried reproducing this error in another file:

C = rand(5, 5, 0.0, 1.0, 1.0, 1234);

a = 10.0;

counter = 1.0;

while (counter < 5) {
  print(C);
  C_new = C + C;

  b = a;
  a = 1.0;

  if (a < b) {
    print("false statement");
  } else {
    counter = counter + 1;
    C = C_new;
  }
}

I replaced all variables and calculations with much simpler terms and also removed the parts of the original code that had nothing to do with the error, to make the code easier to debug.

When running the code, the while loop runs once, and after assigning C_new to C like in the kmeans algorithm, the contents of C are somehow corrupted.

I took a look at the IR, but didn't notice anything particularly wrong, the code also runs fine in numpy so there shouldn't be any logic errors. I also checked all the types, but there weren't any visible mistakes here either.
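
For reference, the intended semantics of the reproduction script can be written down in plain Python. This is only a sketch under assumptions: nested lists stand in for the DAPHNE matrix, Python's `random` module stands in for `rand()` (so the values will not match DAPHNE's for seed 1234), and the function name run_reference is made up for this illustration.

```python
import random


def run_reference(seed=1234, size=5):
    """Mirror the control flow of the reproduction script with nested lists."""
    rng = random.Random(seed)
    initial = [[rng.random() for _ in range(size)] for _ in range(size)]
    c = [row[:] for row in initial]

    a = 10.0
    counter = 1.0
    iterations = 0

    while counter < 5:
        iterations += 1
        # C_new = C + C (element-wise)
        c_new = [[v + v for v in row] for row in c]

        b = a
        a = 1.0

        if a < b:
            # Only taken in the first iteration, where 1.0 < 10.0.
            pass
        else:
            counter = counter + 1
            c = c_new

    return initial, c, iterations


initial, final, iterations = run_reference()
print(iterations)
print(final[0][0] == 16 * initial[0][0])
```

Under these semantics the loop body runs five times: the first pass takes the if branch, and the remaining four passes each replace C by C + C, so every element ends up at 16 times its initial value. Any other result in DAPHNE points at the runtime rather than at the script's logic.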

I would really appreciate some advice here, as this bug is unfortunately something I just cannot figure out.

Edit: This problem seems similar to the one in #558, where nested if statements also caused problems. (I came across this while looking at the multiLogReg.daph file I was also supposed to be working on).

@pdamme
Collaborator

pdamme commented Nov 13, 2024

Thanks for putting all this effort into translating the kmeans script, @Garic152! I know from my own experience that it can be a very cumbersome process. That's why it's good to identify all those points that the dml2daph tool doesn't handle correctly yet; with that we can make the translator better over time.

  1. We already discovered some of the tricky 0/1-based indexing-related translation issues before. You could find them by comparing the decision tree script in DaphneDSL to the original version in DML from the SystemDS repo.

  2. I wasn't aware that reshape() works differently in DAPHNE and SystemDS. Can you give an example?

  3. I had a look at the example in your latest comment and simplified it further to isolate the error. It turned out the problem is a double-free due to a bug in object reference management in combination with the arith.select op. It can be seen with --explain obj_ref_mgnt. For more details, see issue Double-free related to arith.select and reference counters #911. I've prepared a fix in PR [DAPHNE-#911] Correct reference counter for arith.select result. #912, but it's still in draft state, since it might create memory leaks (I will investigate this further). However, the change is essentially one line in src/compiler/lowering/ManageObjRefsPass.cpp, so you could apply it to your clone and check whether your scripts work then.

    In very rare cases, such memory corruptions can happen through bugs in DAPHNE's reference counter management. It can be helpful to try running DAPHNE with --no-obj-ref-mgnt, which switches off garbage collection, i.e., no data object will be freed. If a crashing DaphneDSL script succeeds with that flag, then the problem is likely related to DAPHNE's garbage collection.
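
To make the failure mode concrete, here is a toy Python model of how a select-style operation can lead to a double-free when its result is not counted as an extra reference. It is an illustration only and does not reproduce DAPHNE's ManageObjRefsPass or the actual fix in #912; the names Obj, incref, decref, select, and run are hypothetical, and the assumption that the result aliases one of the inputs is part of the toy model.

```python
class DoubleFree(Exception):
    """Raised when a reference is dropped on an object that was already freed."""


class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = 1       # one reference held by the creating variable
        self.freed = False


def incref(obj):
    obj.refs += 1


def decref(obj):
    if obj.freed:
        raise DoubleFree(obj.name)
    obj.refs -= 1
    if obj.refs == 0:
        obj.freed = True    # stands in for releasing the buffer


def select(cond, a, b, count_result):
    """Return one of the inputs; the result aliases an existing object."""
    result = a if cond else b
    if count_result:
        incref(result)      # the result is one more reference to that object
    return result


def run(count_result):
    c = Obj("C")
    c_new = Obj("C_new")
    chosen = select(False, c, c_new, count_result)
    decref(c)               # inputs go out of scope
    decref(c_new)
    decref(chosen)          # result goes out of scope
    return c.freed and c_new.freed


print(run(count_result=True))
try:
    run(count_result=False)
except DoubleFree as err:
    print("double free of", err)
```

In the toy model, skipping the extra reference means the object behind the result is released once through the input and once through the result, which matches the kind of heap corruption reported above.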

@pdamme
Collaborator

pdamme commented Nov 13, 2024

The ternary at kmeans_final.daph:123 doesn't look right to me. Looks like you are trying to assign a f64 to a DenseMatrix in one of the branches.

@philipportner The conditional op looks good to me. The as.f64(...) only sets the value type of the result, while retaining the data type (see the DaphneDSL language reference). I.e., if the input to the cast is a matrix, then the result is a matrix of f64.

@Garic152
Contributor Author

Thanks for the support, the changes in #912 make the code work!
The kmeans algorithm now runs to the end and works for some simple test cases I created.
I will now add the commenting from systemds and create some more test cases to see if the algorithm works correctly in all cases.

Regarding the DAPHNE reshape() vs. the SystemDS matrix(): the difference was actually due to the byrow argument (which I just noticed), which was set to false in the kmeans algorithm, so I had to transpose the matrix before applying the reshape function in the DAPHNE translation.
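
The difference between the two fill orders can be shown with a small Python sketch on a flat list. The helper names reshape_by_row, reshape_by_col, and transpose are made up for this illustration and are not DAPHNE or SystemDS functions; the exact sequence of transposes needed in kmeans.daph depends on the shape of the input and may differ from this identity.

```python
def reshape_by_row(values, rows, cols):
    """Fill a rows x cols matrix row by row."""
    if len(values) != rows * cols:
        raise ValueError("number of values must equal rows * cols")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def reshape_by_col(values, rows, cols):
    """Fill a rows x cols matrix column by column."""
    if len(values) != rows * cols:
        raise ValueError("number of values must equal rows * cols")
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


data = [1, 2, 3, 4, 5, 6]

print(reshape_by_row(data, 2, 3))  # [[1, 2, 3], [4, 5, 6]]
print(reshape_by_col(data, 2, 3))  # [[1, 3, 5], [2, 4, 6]]

# A column-wise fill into 2 x 3 equals a row-wise fill into 3 x 2, transposed.
print(transpose(reshape_by_row(data, 3, 2)) == reshape_by_col(data, 2, 3))  # True
```

This is why a row-by-row reshape combined with a transpose can stand in for a column-by-column one.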

I made some changes to dml2daph.py as well (mostly adding new functions), which I could push together with the other smaller algorithms I translated so far.

@Garic152
Contributor Author

I added the missing comments to the kmeans.daph file and added new tests.
Inside the test file, I had to do a little workaround and first reassign the found clusters to the real ones using pairwise distance calculations, because there is no guarantee that the algorithm will find the clusters in the same order as they were generated.
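
The idea behind this reassignment can be sketched in Python as follows. This is not the code from the test file, which is not shown here; match_centroids is a hypothetical helper that greedily pairs each found centroid with the closest still-unmatched true centroid, and the sample coordinates are invented.

```python
import math


def match_centroids(found, truth):
    """Return a list m where m[i] is the index of the true centroid matched to found[i].

    Greedy matching on ascending pairwise Euclidean distance; each true
    centroid is used at most once.
    """
    if len(found) != len(truth):
        raise ValueError("found and truth must contain the same number of centroids")

    pairs = sorted(
        (math.dist(f, t), i, j)
        for i, f in enumerate(found)
        for j, t in enumerate(truth)
    )

    mapping = [None] * len(found)
    used_truth = set()
    for _, i, j in pairs:
        if mapping[i] is None and j not in used_truth:
            mapping[i] = j
            used_truth.add(j)
    return mapping


# Invented example: the algorithm found the clusters in a different order.
truth = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
found = [(9.8, 0.1), (0.2, 9.9), (0.1, -0.1)]

mapping = match_centroids(found, truth)
print(mapping)  # [1, 2, 0]
```

A greedy match is enough when the clusters are well separated. When two found centroids sit close to the same true centroid, the greedy choice is not guaranteed to minimise the total distance, which is worth keeping in mind when interpreting borderline test failures.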

The tests work for most cases, but I noticed 2 things:

  • When there is no noise, the algorithm will work up to a certain number of points, but after a certain threshold it will assign 2 clusters to the same points that are in the middle of 2 other real clusters.
  • The same behavior occurs with multiple clusters and moderate noise; for this, I left a failing test (which should theoretically pass) in the test file for reference.

Since the algorithm works well in almost all cases except these special ones, I suspect that either my settings for the kmeans algorithm are not optimal or the generated random data is at fault.

runs = 25;
max_iter = 200;
eps = 0.001;
avg_sample_size_per_centroid = max_points;
found_C, found_Y = kmeans.m_kmeans(as.matrix<f64>(data), num_centroids, runs, max_iter, eps, as.bool(0), avg_sample_size_per_centroid, seed);

If there were major logic errors in the code, I doubt any test case would succeed at all.
