
Segmentation error in Polaris for SOC-QMC calculations #5283

Open
kayahans opened this issue Jan 15, 2025 · 11 comments

@kayahans
Contributor

kayahans commented Jan 15, 2025

Describe the bug
DMC-SOC run produces segmentation error on Polaris, does not give any feedback to the user what might be wrong about the input. QMCPACK output does not give any indication of insufficient memory from the output, and I have tested the same run with reduced meshfactors to see if the job will go through, but none of those jobs succeeded. The same job with more demanding parameters run successfully on Baseline (OLCF) with no issues.

Here are the last few lines from the output file with meshfactor=1.0, using the debug queue (1 node, 4 MPI ranks, minimal setup) on Polaris:

=========================================================
  Start VMCBatched
  File Root dmc.s000
=========================================================
==============================================================
--- Memory usage report : VMCBatched before initialization ---
==============================================================
Available memory on node 0, free + buffers :  178134 MiB
Memory footprint by rank 0 on node 0       :   61632 MiB
Device memory allocated via OpenMP offload :       0 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38648 MiB
==============================================================
VMCBatched Driver running with
             total_walkers     = 512
             walkers_per_rank  = [128(x4)]
             num_crowds        = 8
  on rank 0, walkers_per_crowd = [16(x8)]

                         steps = 1
                        blocks = 5

===================================================================
--- Memory usage report : VMCBatched after initialLogEvaluation ---
===================================================================
Available memory on node 0, free + buffers :  176221 MiB
Memory footprint by rank 0 on node 0       :   62045 MiB
Device memory allocated via OpenMP offload :     127 MiB
Device memory allocated via CUDA allocator :       0 MiB
Free memory on the default device          :   38366 MiB
===================================================================

I have tried reducing the meshfactor of the splines, but it had no effect on the outcome.

The dmc.err file contains the following:


Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.28

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75
QMCPACK ERROR Primitive cell ion 1 vs supercell ion 1 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 2 vs supercell ion 2 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 3 vs supercell ion 3 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 4 vs supercell ion 4 atomic number not matching: 0 vs 17
QMCPACK ERROR Primitive cell ion 5 vs supercell ion 5 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 6 vs supercell ion 6 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 7 vs supercell ion 7 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 8 vs supercell ion 11 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 9 vs supercell ion 8 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 10 vs supercell ion 12 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 11 vs supercell ion 9 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 12 vs supercell ion 13 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 13 vs supercell ion 10 atomic number not matching: 0 vs 6
QMCPACK ERROR Primitive cell ion 14 vs supercell ion 14 atomic number not matching: 0 vs 1
QMCPACK ERROR Primitive cell ion 15 vs supercell ion 15 atomic number not matching: 0 vs 16
QMCPACK ERROR Primitive cell ion 16 vs supercell ion 16 atomic number not matching: 0 vs 16
x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core

The manual and the workshop materials say that the primitive/supercell ERROR lines are expected, because a file converted using convertpw4qmc does not contain the ionic species information. I get the same ERROR lines on Baseline, and they do not affect the calculation there.

To Reproduce
Steps to reproduce the behavior:

  1. Using the precompiled executable in: /soft/applications/qmcpack/develop-20241118/build_polaris_Clang18_offload_cuda_cplx/bin
  2. Using the job script in /soft/applications/qmcpack/develop-20241118/qmcpack-polaris.job (a sketch of a script along these lines is shown after this list)
  3. The wavefunction was produced with QMCPACK on Baseline (via a short job with save_coefs="yes") and then transferred to Polaris. It is available upon request.
  4. Smaller input/output files are attached for review.
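
For reference, a minimal sketch of a single-node Polaris job script along the lines of step 2. The project name is a placeholder, the launch flags follow the generic ALCF Polaris examples rather than the actual qmcpack-polaris.job, and the input file name dmc.xml is inferred from the dmc.s000 file root:

#!/bin/bash
#PBS -q debug
#PBS -l select=1
#PBS -l walltime=00:30:00
#PBS -l filesystems=home:grand
#PBS -A MY_PROJECT                 # placeholder project name

cd "${PBS_O_WORKDIR}"

# Path to the executable from step 1 (exact binary name not shown in the issue).
QMCPACK_BIN=/soft/applications/qmcpack/develop-20241118/build_polaris_Clang18_offload_cuda_cplx/bin/qmcpack

# 4 MPI ranks on one node, one per GPU, matching the minimal setup above.
mpiexec -n 4 --ppn 4 --depth=8 --cpu-bind depth "${QMCPACK_BIN}" dmc.xml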

System:

  • Polaris and Baseline (OLCF)
  • modules loaded: available in the attached files.
  • other systems where this is reproducible [e.g. "my laptop", "none"]

soc_baseline.zip
soc_error_polaris.zip

@ye-luo
Contributor

ye-luo commented Jan 15, 2025

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75

"Primitive cell ion 0" comes from h5.
"supercell ion 0" comes from xml.
It seems your hdf5 doesn't contain atomic info. Which converter did you use?
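
(A quick way to confirm what the h5 contains is to inspect its atoms group with the standard HDF5 command-line tools; a sketch, assuming the wavefunction file is named dmc.h5 and follows the usual ESHDF layout:)

# List everything stored under /atoms.
h5ls -r dmc.h5 | grep -i atom

# Dump the atomic number recorded for the first species; a stored value of 0
# would explain the "0 vs 75" message above.
h5dump -d /atoms/species_0/atomic_number dmc.h5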

@prckent
Contributor

prckent commented Jan 15, 2025

Is this reproducible on one node? Have you tried to narrow down the issue? Are there other similar runs that work for you?

This looks like a problem with the inputs. If so, you would get a failure for all similar runs, even on one CPU core or one GPU.
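
(One way to narrow it down: an interactive debug job with a single rank pinned to a single GPU; a sketch, reusing the QMCPACK_BIN path and dmc.xml input assumed in the job-script sketch above:)

# Expose only one GPU and launch a single rank; if this also segfaults,
# the problem is not an MPI or multi-GPU interaction.
export CUDA_VISIBLE_DEVICES=0
mpiexec -n 1 --ppn 1 "${QMCPACK_BIN}" dmc.xml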

@camelto2
Contributor

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75

"Primitive cell ion 0" comes from h5. "supercell ion 0" comes from xml. It seems your hdf5 doesn't contain atomic info. Which converter did you use?

This warning is expected: convertpw4qmc doesn't carry the ion information. Another reason it would be nice to have pw2qmcpack handle spinors.

@ye-luo
Contributor

ye-luo commented Jan 15, 2025

Actually, the QMCPACK ERROR is a false alarm; it is a separate issue from the segfault.
The QMCPACK ERROR comes out of the atomic number checks, and the code doesn't abort on this failed check when SOC is used.

@ye-luo
Contributor

ye-luo commented Jan 15, 2025

Since it is a one-node run, is the failure reproducible? Could you try running it again?

@kayahans
Contributor Author

The error that terminates the job is this line: x3005c0s13b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core

Yes, the QMCPACK ERROR is a false alarm; as I said, I get it in the job I ran on Baseline as well, and it does not affect the calculation.
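
(Since rank 2 dumped core, a backtrace from the core file would pinpoint the faulting routine; a minimal sketch, assuming gdb is available, using a hypothetical core-file name and the QMCPACK_BIN path from the job-script sketch above:)

# Open the core against the exact binary that produced it, then get a
# backtrace at the (gdb) prompt with "bt" or "thread apply all bt".
gdb "${QMCPACK_BIN}" core.12345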

@kayahans
Contributor Author

Since it is a one-node run, is the failure reproducible? Could you try running it again?

I also ran using multiple nodes; those failed as well. I don't have the output from those jobs now, but I can try resubmitting with multiple nodes.

@kayahans
Contributor Author

QMCPACK ERROR Primitive cell ion 0 vs supercell ion 0 atomic number not matching: 0 vs 75

"Primitive cell ion 0" comes from h5. "supercell ion 0" comes from xml. It seems your hdf5 doesn't contain atomic info. Which converter did you use?

I used convertpw4qmc.

@Hyeondeok-Shin
Contributor

Is it giving the same error with the CPU build?

@kayahans
Contributor Author

kayahans commented Jan 15, 2025

Is it giving the same error with the CPU build?

@Hyeondeok-Shin The CPU build gives no error; it executed successfully.
dmc.out.zip
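
(Given that the CPU build is clean, one more data point could come from running the offload build with its OpenMP target regions forced onto the host via the standard OMP_TARGET_OFFLOAD environment variable; a sketch, with the caveat that code using the CUDA allocator directly may still touch the device:)

# Force OpenMP target regions to execute on the host; if the segfault
# disappears, the OpenMP-offload code path is the likely culprit.
export OMP_TARGET_OFFLOAD=DISABLED
mpiexec -n 4 --ppn 4 "${QMCPACK_BIN}" dmc.xml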

@kayahans
Contributor Author

kayahans commented Jan 22, 2025

@prckent @ye-luo I have collected the files I used at /lus/grand/projects/PSFMat_2/shared/MAE on Polaris. Thank you for looking into this.
