Race-conditions with scratchspaces #381

Closed
lgoettgens opened this issue Dec 21, 2021 · 4 comments

@lgoettgens
Member

lgoettgens commented Dec 21, 2021

I want to run some Oscar code in parallel (e.g. on 12 cores).
When I put @everywhere using Oscar at the start of my script, it fails most of the time, with slightly different errors on each run. I have collected a few of them below:

ERROR: LoadError: InitError: IOError: symlink("/home/data/goettgens/.julia/artifacts/8a00b49558b2899cd70a789945e05d41723e3931", "/home/data/goettgens/.julia/scratchspaces/d720cf60-89b5-51f5-aff5-213f193123e7/polymake_12943900445023608064_1.6_depstree/deps/GMP_jll"): file already exists (EEXIST)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] symlink(target::String, link::String; dir_target::Bool)
    @ Base.Filesystem ./file.jl:1046
  [3] symlink(target::String, link::String)
    @ Base.Filesystem ./file.jl:1010
  [4] prepare_deps_tree(targetdir::String)
    @ Polymake ~/.julia/artifacts/337994e1146a8a47c31b601eaa87acf37794bcb9/share/polymake/generate_deps_tree.jl:44
  [5] __init__()
    @ Polymake ~/.julia/packages/Polymake/qzDKS/src/Polymake.jl:98
  [6] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base ./loading.jl:674
  [7] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base ./loading.jl:760
  [8] _tryrequire_from_serialized(modkey::Base.PkgId, build_id::UInt64, modpath::String)
    @ Base ./loading.jl:689
  [9] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base ./loading.jl:749
 [10] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:998
 [11] require(uuidkey::Base.PkgId)
    @ Base ./loading.jl:914
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:901
during initialization of module Polymake
in expression starting at /home/data/goettgens/MyScript.jl:17
ERROR: LoadError: InitError: IOError: symlink("/home/data/goettgens/.julia/artifacts/538dcd9c7494dec1f231a958a12e1da9d61c2c40", "/home/data/goettgens/.julia/scratchspaces/d720cf60-89b5-51f5-aff5-213f193123e7/polymake_12943900445023608064_1.6_depstree/deps/FLINT_jll"): file already exists (EEXIST)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] symlink(target::String, link::String; dir_target::Bool)
    @ Base.Filesystem ./file.jl:1046
  [3] symlink(target::String, link::String)
    @ Base.Filesystem ./file.jl:1010
  [4] prepare_deps_tree(targetdir::String)
    @ Polymake ~/.julia/artifacts/337994e1146a8a47c31b601eaa87acf37794bcb9/share/polymake/generate_deps_tree.jl:44
  [5] __init__()
    @ Polymake ~/.julia/packages/Polymake/qzDKS/src/Polymake.jl:98
  [6] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base ./loading.jl:674
  [7] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base ./loading.jl:760
  [8] _tryrequire_from_serialized(modkey::Base.PkgId, build_id::UInt64, modpath::String)
    @ Base ./loading.jl:689
  [9] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base ./loading.jl:749
 [10] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:998
 [11] require(uuidkey::Base.PkgId)
    @ Base ./loading.jl:914
 [12] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:901
during initialization of module Polymake
in expression starting at /home/data/goettgens/MyScript.jl:17

I guess this arises from several processes regenerating the scratchspace at the same time, which leads to race conditions there.

Minimal working example (probability for occurrence of the error is higher on machines with more cores):

$ julia -p 10
julia> using Polymake
polymake version 4.5
Copyright (c) 1997-2021
Ewgenij Gawrilow, Michael Joswig, and the polymake team
Technische Universität Berlin, Germany
https://polymake.org

This is free software licensed under GPL; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ERROR: On worker 4:
InitError: IOError: symlink("/home/data/goettgens/.julia/artifacts/8a00b49558b2899cd70a789945e05d41723e3931", "/home/data/goettgens/.julia/scratchspaces/d720cf60-89b5-51f5-aff5-213f193123e7/polymake_12943900445023608064_1.6_depstree/deps/GMP_jll"): file already exists (EEXIST)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] #symlink#30
    @ ./file.jl:1046
  [3] symlink
    @ ./file.jl:1010
  [4] prepare_deps_tree
    @ ~/.julia/artifacts/337994e1146a8a47c31b601eaa87acf37794bcb9/share/polymake/generate_deps_tree.jl:44
  [5] __init__
    @ ~/.julia/packages/Polymake/qzDKS/src/Polymake.jl:98
  [6] _include_from_serialized
    @ ./loading.jl:674
  [7] _require_search_from_serialized
    @ ./loading.jl:760
  [8] _require
    @ ./loading.jl:998
  [9] require
    @ ./loading.jl:914
 [10] #1
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/Distributed.jl:79
 [11] #103
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:274
 [12] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
 [13] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:72
 [14] #96
    @ ./task.jl:411
during initialization of module Polymake

...and 4 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] _require_callback(mod::Base.PkgId)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/Distributed.jl:76
 [4] #invokelatest#2
   @ ./essentials.jl:708 [inlined]
 [5] invokelatest
   @ ./essentials.jl:706 [inlined]
 [6] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:920
 [7] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:901
 [8] top-level scope
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/macros.jl:204

My only workaround for now is a downgrade to v0.6.1.

Version information:
julia: 1.6.1 and 1.7.0
Oscar: 0.7.1
Polymake.jl: 0.6.2

Edit: added version information

@fingolfin
Member

I guess we either need some kind of locking, so that only one process creates the scratch (but it must be race-free, which can be tricky, esp. if the scratch is stored in ~/.julia and the homedir is on NFS...); alternatively, one can do this:

  • if scratch is missing:
    • create it in another dir
    • once done, check if scratch is still missing
      • if not missing, discard what we did above and use the scratch space that's now there (presumably generated by another process)
      • if missing, then atomically insert the new scratch

Alas, that won't work if the data in the scratch space hardcodes paths (which I have a nagging suspicion happens here??)
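
The steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ensure_scratch` and the `build!` callback are hypothetical names, not part of Polymake.jl or Scratch.jl. It also assumes the staging directory lives on the same filesystem as the target, and it relies on `mv` refusing to replace an existing, non-empty directory. On NFS the rename may not give the same guarantees, and the caveat about hardcoded paths still applies.

```julia
# Hypothetical helper: build the scratch in a staging directory, then move it into place.
function ensure_scratch(build!, target::AbstractString)
    isdir(target) && return target              # someone already created it
    parent = dirname(target)
    mkpath(parent)
    # Stage next to the target so the final move is a plain rename.
    staging = mktempdir(parent; cleanup=false)
    try
        build!(staging)
        # Re-check: another process may have finished while we were building.
        if !ispath(target)
            mv(staging, target)                 # throws if target appeared meanwhile
        end
    catch
        # Losing the race is fine as long as a usable scratch now exists.
        isdir(target) || rethrow()
    finally
        # Discard our copy if it was not moved (no-op if the move succeeded).
        rm(staging; force=true, recursive=true)
    end
    return target
end
```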

@fingolfin
Member

I think this will affect other packages using scratch spaces, so I also filed JuliaPackaging/Scratch.jl#26

@benlorenz
Member

I have also seen a few Julia issues about similar problems in the .julia folder during some Pkg operations.
For precompilation this was solved by atomically overwriting the target file, which is something we could also do for single symlinks. Unfortunately this operation is not possible for directories: they cannot be overwritten atomically. So the problem is this line:

> if missing, then atomically insert the new scratch

One way to make that atomic is to make that insert operation just a symlink, but then we will end up with many unneeded such trees (and that extra indirection might cause other problems). Or we can probably generate the same tree in parallel (with mkpath) and for each symlink use the rename+overwrite approach.
I will try to experiment a bit with this second approach.
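
A minimal sketch of the rename+overwrite idea for a single symlink might look like this. It is not the code that ended up in Polymake.jl. `replace_symlink` is a hypothetical name, `link` is assumed to be a path with a directory component, and `Base.Filesystem.rename` is an internal, unexported Julia function (present in Julia 1.6 and 1.7, the versions in this report) whose behaviour may differ in other releases.

```julia
# Hypothetical helper: create the symlink under a temporary name, then rename it over `link`.
function replace_symlink(target::AbstractString, link::AbstractString)
    dir = dirname(link)
    mkpath(dir)                                  # safe to call from several processes
    # Unique name in the same directory, so the rename stays on one filesystem.
    tmp = tempname(dir; cleanup=false)
    symlink(target, tmp)
    try
        # Assumption: internal API. On POSIX this maps to rename(2), which
        # replaces an existing symlink at `link` in a single step.
        Base.Filesystem.rename(tmp, link)
    catch
        rm(tmp; force=true)                      # do not leave the temporary link behind
        rethrow()
    end
    return link
end
```

Because every process writes its own temporary name and only the final rename touches `link`, two processes creating the same link should no longer hit EEXIST.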

> Alas, that won't work if the data in the scratch space hardcodes paths (which I have a nagging suspicion happens here??)

This scratchspace contains symlinks to files and directories in a few .julia/artifacts/<somehash> directories so I don't think these paths should cause any problems (since all processes should be creating exactly the same symlinks).
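
Since all processes are expected to create identical symlinks, a different and simpler option (not the rename approach discussed above) would be to treat EEXIST as success when the existing link already points to the same target. The sketch below assumes exactly that. `symlink_idempotent` is a hypothetical name, and the check relies on `Base.IOError` carrying the libuv error code in its `code` field.

```julia
# Hypothetical helper: tolerate a symlink that another process created first.
function symlink_idempotent(target::AbstractString, link::AbstractString)
    try
        symlink(target, link)
    catch err
        # Accept the failure only if an identical link is already in place.
        if err isa Base.IOError && err.code == Base.UV_EEXIST &&
           islink(link) && readlink(link) == target
            return link
        end
        rethrow()
    end
    return link
end
```

A link that exists but points somewhere else still raises the original error, so a stale tree is not silently accepted.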

@benlorenz
Member

This should be fixed in Polymake 0.7. In my experiments with up to 24 parallel processes I did not see any errors when running this script:

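using Distributed

# Activate and instantiate the project environment on every process.
@everywhere begin
    using Pkg
    Pkg.activate(@__DIR__)
    Pkg.instantiate()
end

# Load Polymake on all processes at once; this is the step that used to race.
@everywhere begin
    using Polymake
    println(polytope.cube(3).F_VECTOR)
end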
