
Reduce testing data #671

Open · 3 tasks done · forsyth2 opened this issue Feb 4, 2025 · 12 comments

forsyth2 (Collaborator) commented Feb 4, 2025

Request criteria

  • I searched the zppy GitHub Discussions to find a similar question and didn't find it.
  • I searched the zppy documentation.
  • This issue does not match the other templates (i.e., it is not a bug report, documentation request, feature request, or a question).

Issue description

For initial explanation, see #634 (reply in thread).

Currently (2025-02-03), zppy's v2 testing data is 18T and its v3 testing data is 24T, for a total of 42T. This poses a problem with quotas on Compy: /qfs home directories are limited to 400 GB, and /compyfs home directories are limited to 30T. Obviously, 30T < 42T, so I currently can't transfer the v3 data to Compy for testing Unified 1.11.0.

forsyth2 self-assigned this Feb 4, 2025
forsyth2 (Collaborator) commented Feb 4, 2025

The best path forward is to determine which subdirectories/files are actually used and store only those. The difficulty is that I'm not sure how to go about determining that. I suppose a good starting point would be removing all files with year numbers greater than anything used in the test cfg files.
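For example, something like this could list candidates to prune (a rough sketch; the data root and cutoff here are placeholders for whatever the test cfg files actually use):

# List history files whose 4-digit year exceeds a cutoff (placeholder values;
# the real cutoff would come from the largest end year in the test cfg files).
data_root="/lcrc/group/e3sm/ac.forsyth2/zppy_test_data"
cutoff=1995
find "${data_root}" -name '*.nc' | while read -r f; do
    # Pull the year out of names like <case>.<model>.<stream>.YYYY-MM*.nc
    year=$(echo "${f}" | grep -oE '[0-9]{4}-[0-9]{2}' | head -n 1 | cut -d- -f1)
    if [ -n "${year}" ] && [ "${year}" -gt "${cutoff}" ]; then
        echo "${f}" # Candidate for removal; review before piping to rm
    fi
done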

xylar (Contributor) commented Feb 4, 2025

> I suppose a good starting point would be removing all files with year numbers greater than anything used in the test cfg files.

Yes, for sure. I can help with the ocean and sea-ice output. But my suggestion would be to start with a mostly empty new location on Chrysalis somewhere and to add output files (i.e. all files within the year range you're testing with a given prefix) only when analysis breaks without it. This way, you end up with a minimum set.

xylar (Contributor) commented Feb 4, 2025

If you had to choose between v2 and v3 data, it seems like you should pick v3 data.

xylar (Contributor) commented Feb 4, 2025

Below is what I am pretty confident you need for MPAS-Ocean and -Seaice (shown here with v2 file names; the v3 data should be similar).

I would only copy these, preserving the directory structure, of course (see the sketch after this list):

  • the namelist and streams files from run
    • run/mpaso_in
    • run/mpassi_in
    • run/streams.ocean
    • run/streams.seaice
  • a restart file for each:
    • run/v2.LR.historical_0201.mpaso.rst.2015-01-01_00000.nc
    • run/v2.LR.historical_0201.mpassi.rst.2015-01-01_00000.nc
  • history files for only the required range of years (e.g. for year in $(seq 2000 2004); do ...):
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthly.${year}-*.nc
    • archive/ice/hist/v2.LR.historical_0201.mpassi.hist.am.timeSeriesStatsMonthly.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthlyMin.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthlyMax.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.meridionalHeatTransport.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.oceanHeatContent.${year}-*.nc

forsyth2 (Collaborator) commented Feb 5, 2025

Thanks so much @xylar. I think I actually have a decent minimal test data set now! I'll transfer that to Compy for testing.

It's 859G, which is a very welcome reduction from 24T!

Script to generate v3 test data:

#!/bin/bash
# 2025-02-04

version="v3" # Options: v3, v2

if [ "${version}" == "v3" ]; then
    case_name="v3.LR.historical_0051"
    # This is the path to the complete simulation output.
    # This has a lot of data. We don't want to copy over everything.
    # So, this script will copy over only the necessary files.
    complete_simulation_output="/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051"
    restart_year="0051"
    start_year=1985
    end_year_short=1988
    end_year_long=1994
    end_year_closed_interval=1995
fi
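
# NOTE: only the v3 case is filled in here. A hypothetical v2 branch (placed
# before the "fi" above) would mirror it, e.g. with the case name and restart
# year from the earlier file list; the complete-output path is a placeholder:
#   elif [ "${version}" == "v2" ]; then
#       case_name="v2.LR.historical_0201"
#       complete_simulation_output="/path/to/complete/v2/simulation/output"
#       restart_year="2015"
#       ... # year settings analogous to the v3 case
#   fi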

case_prefix="/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SM${version}/${case_name}"
rm -rf ${case_prefix} # Start fresh
echo "Creating reduced data set: ${case_prefix}"
mkdir -p ${case_prefix}/archive/atm/hist
mkdir -p ${case_prefix}/archive/ice/hist
mkdir -p ${case_prefix}/archive/lnd/hist
mkdir -p ${case_prefix}/archive/ocn/hist
mkdir -p ${case_prefix}/archive/rof/hist
mkdir -p ${case_prefix}/run

for year in $(seq ${start_year} ${end_year_closed_interval}); do
    cd ${complete_simulation_output}/archive/ice/hist
    # For mpas_analysis
    cp ${case_name}.mpassi.hist.am.timeSeriesStatsMonthly.${year}*.nc ${case_prefix}/archive/ice/hist/

    cd ${complete_simulation_output}/archive/ocn/hist
    # For mpas_analysis, global_time_series
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthly.${year}*.nc ${case_prefix}/archive/ocn/hist/
    # For mpas_analysis only
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthlyMin.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthlyMax.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.meridionalHeatTransport.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.oceanHeatContent.${year}*.nc ${case_prefix}/archive/ocn/hist/
done

for year in $(seq ${start_year} ${end_year_long}); do
    cd ${complete_simulation_output}/archive/atm/hist
    # For climo_atm_monthly, ts_atm_monthly (end_year_short)
    # For ts_atm_monthly_glb (end_year_long)
    cp ${case_name}.eam.h0.${year}-*.nc ${case_prefix}/archive/atm/hist/

    cd ${complete_simulation_output}/archive/lnd/hist
    # For climo_land_monthly, ts_land_monthly (end_year_short)
    # For ts_lnd_monthly_glb (end_year_long)
    cp ${case_name}.elm.h0.${year}-*.nc ${case_prefix}/archive/lnd/hist/
done

for year in $(seq ${start_year} ${end_year_short}); do
    cd ${complete_simulation_output}/archive/atm/hist
    # For ts_atm_daily
    cp ${case_name}.eam.h1.${year}-*.nc ${case_prefix}/archive/atm/hist/
    # For climo_atm_monthly_diurnal
    cp ${case_name}.eam.h3.${year}-*.nc ${case_prefix}/archive/atm/hist/

    cd ${complete_simulation_output}/archive/rof/hist
    # For ts_rof_monthly
    cp ${case_name}.mosart.h0.${year}-*.nc ${case_prefix}/archive/rof/hist/
done

cd ${complete_simulation_output}/run
cp ${case_name}.mpaso.rst.${restart_year}-01-01_00000.nc ${case_prefix}/run/
cp ${case_name}.mpassi.rst.${restart_year}-01-01_00000.nc ${case_prefix}/run/
cp mpaso_in ${case_prefix}/run/
cp mpassi_in ${case_prefix}/run/
cp streams.ocean ${case_prefix}/run/
cp streams.seaice ${case_prefix}/run/

echo "Complete simulation output: ${complete_simulation_output}"
echo "Reduced data set: ${case_prefix}"
echo "Size:"
du -sh ${case_prefix}
# 859G	/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051

forsyth2 (Collaborator) commented Feb 5, 2025

I'm running into a data transfer issue.

I used Globus to transfer data:

Source Collection: LCRC Improv DTN
Source Path: /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/

Destination Collection: pic#compy-dtn
Destination Path: /compyfs/fors729/zppy_test_data

But I get differing data set sizes:

# Chrysalis:
du -sh /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3
# 860G	/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3
# Compy:
du -sh /compyfs/fors729/zppy_test_data/E3SMv3/
# 572G	/compyfs/fors729/zppy_test_data/E3SMv3/

Globus says the transfer succeeded, though. So why is the Compy copy ~300G smaller?
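
One thing I could try (a sketch, assuming the globus-cli package; SRC_EP and DST_EP are placeholder endpoint UUIDs for LCRC Improv DTN and pic#compy-dtn) is re-running the transfer with checksum syncing, so only files whose checksums differ get re-copied:

# Re-sync, re-copying only files whose checksums differ between endpoints.
globus transfer "${SRC_EP}:/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/" \
    "${DST_EP}:/compyfs/fors729/zppy_test_data/" \
    --recursive --sync-level checksum --verify-checksum \
    --label "zppy test data re-sync"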

xylar (Contributor) commented Feb 5, 2025

That seems worth double-checking. At least make sure all the files are there. Are the sizes of individual files at least the same if you pick a few at random?
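
For example (a sketch; running the same command on both machines and comparing the output is more conclusive than sizes alone):

# Run from inside the data set root on each system, then diff the results.
# sort + head makes the file selection deterministic on both sides.
cd zppy_test_data/E3SMv3/v3.LR.historical_0051
find . -name '*.nc' | sort | head -n 5 | xargs md5sum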

forsyth2 (Collaborator) commented Feb 5, 2025

In /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051/run:

> du -sh *
48K	mpaso_in
32K	mpassi_in
48K	streams.ocean
16K	streams.seaice
3.7G	v3.LR.historical_0051.mpaso.rst.0051-01-01_00000.nc
1.4G	v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc

In /compyfs/fors729/zppy_test_data/E3SMv3/v3.LR.historical_0051/run:

> du -sh *
40K	mpaso_in
40K	mpassi_in
40K	streams.ocean
30K	streams.seaice
2.9G	v3.LR.historical_0051.mpaso.rst.0051-01-01_00000.nc
439M	v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc

This is pretty strange. Some files get bigger and some get smaller!

forsyth2 (Collaborator) commented Feb 5, 2025

In /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051:

> ls -R | wc -l
1221

In /compyfs/fors729/zppy_test_data/E3SMv3/v3.LR.historical_0051:

> ls -R | wc -l
1221

Well, they appear to have the same number of files; the problem seems to be that the files change size haphazardly.

forsyth2 (Collaborator) commented Feb 5, 2025

Files in archive/rof/hist seem universally smaller, down from ~30M to ~6.5M.

xylar (Contributor) commented Feb 5, 2025

Can you vim the text files and see if the bottom is the same? Can you ncdump the NetCDF files and at least verify that they are dumpable? It may just be that Compy's file system compresses things or something. That might explain why it's the slowest file system I have ever had the misfortune to work with.
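
For example (a sketch using GNU du; --apparent-size reports the logical file size rather than the blocks allocated on disk, so it should match across machines if transparent compression is the explanation):

# Compare allocated vs. logical size for one file on each system.
f=v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc
du -sh "${f}"                  # allocated size (what differs above)
du -sh --apparent-size "${f}"  # logical size (should match across machines)
ncdump -h "${f}" > /dev/null && echo "header readable"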

forsyth2 (Collaborator) commented Feb 5, 2025

> It may just be that Compy's file system compresses things or something.

Oh interesting, I was wondering that myself.

> Can you vim the text files and see if the bottom is the same? Can you ncdump the NetCDF files and at least verify that they are dumpable?

A preliminary look does seem to suggest this is true. So maybe Compy is just compressing things.

> slowest file system

I agree; commands take much longer to run on Compy for me.
