Cannot install async_io op even if it's compatible flag is displaying OK by ds_report cmd! #6920

Closed
LZhengguo opened this issue Dec 31, 2024 · 3 comments
Labels: bug, build

Comments

LZhengguo commented Dec 31, 2024

When I run

    DS_BUILD_AIO=1 CFLAGS="-I$CONDA_PREFIX/include/ -I/usr/include/" LDFLAGS="-L$CONDA_PREFIX/lib/ -L/usr/lib/x86_64-linux-gnu/" pip install -e .

to install the async_io op, I get a misleading success message. The install does print "Successfully installed deepspeed", but when I run ds_report I only get this: [image]

I then printed the stderr output and found this: [image]

To figure out what causes this, I read the source code, starting with "setup.py". I found the problem at "setup.py" line 182:

    for op_name, builder in ALL_OPS.items():
        op_compatible = builder.is_compatible()

When op_name is "async_io", builder.is_compatible() returns False. In "DeepSpeed/deepspeed/ops/op_builder/async_io.py", line 93 defines def is_compatible(self, verbose=False), and its result depends on line 99:

    aio_compatible = self.has_function('io_submit', ('aio', ))

has_function() is defined in "DeepSpeed/deepspeed/ops/op_builder/builder.py" at line 308, and I confirmed that it raises a LinkError at line 362, in this call made through "distutils.unixccompiler.UnixCCompiler":

    compiler.link_executable(objs,
                             os.path.join(tempdir, 'a.out'),
                             extra_preargs=self.strip_empty_entries(ldflags),
                             libraries=libraries,
                             library_dirs=library_dirs)

I don't know why this happens. To work around it, I had to change class AsyncIOBuilder in "DeepSpeed/deepspeed/ops/op_builder/async_io.py" as shown in this picture: [image]

After installing again, I get the correct result: [image]

I hope you can figure out what caused the link error. I also don't know whether my change will disable AIO when I use offload.
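
The link check described above can be reproduced outside the DeepSpeed build, which makes the raw linker error visible. The following is a minimal sketch of the same idea, not DeepSpeed's actual has_function() code: the helper name can_link_function and the use of the system "cc" driver are assumptions made for illustration. It honors the CFLAGS and LDFLAGS environment variables, so it can be run with the same flags as the failing install command.

```python
import os
import shlex
import shutil
import subprocess
import tempfile


def can_link_function(func_name, libraries=(), cc="cc"):
    """Try to link a tiny C program that calls func_name.

    Hypothetical helper: returns (ok, message), where message is the
    compiler/linker stderr (or a note that no compiler was found).
    """
    compiler = shutil.which(cc)
    if compiler is None:
        return False, f"compiler {cc!r} not found on PATH"

    # Declare the symbol by hand so no header is required; only linking is tested.
    source = (
        f"void {func_name}(void);\n"
        f"int main(void) {{ {func_name}(); return 0; }}\n"
    )
    cflags = shlex.split(os.environ.get("CFLAGS", ""))
    ldflags = shlex.split(os.environ.get("LDFLAGS", ""))

    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "probe.c")
        out = os.path.join(tmpdir, "a.out")
        with open(src, "w") as f:
            f.write(source)
        cmd = [compiler, *cflags, src, "-o", out, *ldflags]
        cmd += [f"-l{lib}" for lib in libraries]
        result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr.strip()


ok, message = can_link_function("io_submit", ("aio",))
print("io_submit links against libaio:", ok)
if not ok:
    print(message)
```

If this probe fails with the same CFLAGS and LDFLAGS as the install command but succeeds without them, the extra search paths are a likely place to look; if it fails in both cases, the linker probably cannot find a linkable libaio at all.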

LZhengguo changed the title from "{{ env.GITHUB_WORKFLOW }} Cannot install async_io op even if it's compatible flag is displaying OK by ds_report cmd!" to "Cannot install async_io op even if it's compatible flag is displaying OK by ds_report cmd!" on Dec 31, 2024
loadams (Contributor) commented Jan 2, 2025

Hi @LZhengguo, can you please share your pip list, the version of libaio that you have installed, and your OS?
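
One way to gather most of the requested details in a single step is sketched below. The function name collect_env_report and the default package list are assumptions for illustration. Note that ctypes.util.find_library only locates the runtime shared library; it says nothing about whether the development headers or the linker-visible libaio.so are present, and it does not replace the full pip list output.

```python
import ctypes.util
import platform
from importlib import metadata


def collect_env_report(packages=("deepspeed", "torch")):
    """Collect OS, Python, libaio, and selected package versions (hypothetical helper)."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        # A library file name if the runtime library is found, otherwise None.
        "libaio": ctypes.util.find_library("aio"),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report["packages"][name] = None
    return report


for key, value in collect_env_report().items():
    print(f"{key}: {value}")
```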

loadams added the bug and build labels and removed the ci-failure label on Jan 2, 2025
loadams self-assigned this on Jan 4, 2025
loadams (Contributor) commented Jan 10, 2025

@LZhengguo, I wasn't able to repro this on my side:

    sudo apt-get install libaio1
    ds_report
    pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    git clone https://github.com/microsoft/deepspeed
    cd deepspeed/
    DS_BUILD_AIO=1 pip install .
    ds_report
[2025-01-10 10:56:34,249] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------

Perhaps the async_io path isn't being found? It's odd that there are no errors in the build log. Could you share the output of the "DS_BUILD_AIO=1 pip install ." command?
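
When comparing reports from two machines, it can help to reduce the op table above to the two flags that matter here: installed and compatible. The sketch below assumes the line layout shown in the report in this thread (op name, dots, then two bracketed flags); the names OP_LINE and parse_op_table are hypothetical, and the pattern may need adjusting for other DeepSpeed versions.

```python
import re

# Matches lines such as "async_io ............... [YES] ...... [OKAY]".
OP_LINE = re.compile(
    r"^(?P<op>\w+) \.+ \[(?P<installed>YES|NO)\] \.+ \[(?P<compatible>OKAY|NO)\]$"
)


def parse_op_table(report_text):
    """Map each op name to an (installed, compatible) pair of booleans."""
    ops = {}
    for line in report_text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            ops[match["op"]] = (
                match["installed"] == "YES",
                match["compatible"] == "OKAY",
            )
    return ops


sample = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [NO]
"""

for op, (installed, compatible) in parse_op_table(sample).items():
    print(f"{op}: installed={installed}, compatible={compatible}")
```

The case reported in this issue corresponds to an async_io entry whose pair is (False, True): compatible according to ds_report, but not installed.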

loadams (Contributor) commented Jan 13, 2025

Closing as stale due to no reply; we can reopen and investigate if needed.

loadams closed this as completed on Jan 13, 2025