
Reduce testing data #671

Open · 3 tasks done · forsyth2 opened this issue Feb 4, 2025 · 12 comments

forsyth2 (Collaborator) commented Feb 4, 2025

Request criteria

  • I searched the zppy GitHub Discussions to find a similar question and didn't find it.
  • I searched the zppy documentation.
  • This issue does not match the other templates (i.e., it is not a bug report, documentation request, feature request, or a question).

Issue description

For initial explanation, see #634 (reply in thread).

Currently (2025-02-03), zppy's v2 testing data is 18T and its v3 testing data is 24T, for a total of 42T. This poses a problem with quotas on Compy: /qfs home directories are limited to 400 GB, and /compyfs home directories are limited to 30T. Obviously, 30T < 42T, so I currently can't transfer the v3 data to Compy for testing Unified 1.11.0.

forsyth2 self-assigned this Feb 4, 2025
forsyth2 (Collaborator) commented Feb 4, 2025

The best path forward is to determine which subdirectories/files are actually used and store only those. The difficulty is that I'm not sure how to go about determining that. I suppose a good starting point would be removing all files with year numbers greater than anything used in the test cfg files.
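For example, something like this could list candidates to prune (a rough sketch; the data root and cutoff here are placeholders for whatever the test cfg files actually use):

# List history files whose 4-digit year exceeds a cutoff (placeholder values;
# the real cutoff would come from the largest end year in the test cfg files).
data_root="/lcrc/group/e3sm/ac.forsyth2/zppy_test_data"
cutoff=1995
find "${data_root}" -name '*.nc' | while read -r f; do
    # Pull the year out of names like <case>.<model>.<stream>.YYYY-MM*.nc
    year=$(echo "${f}" | grep -oE '[0-9]{4}-[0-9]{2}' | head -n 1 | cut -d- -f1)
    if [ -n "${year}" ] && [ "${year}" -gt "${cutoff}" ]; then
        echo "${f}" # Candidate for removal; review before piping to rm
    fi
done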

xylar (Contributor) commented Feb 4, 2025

> I suppose a good starting point would be removing all files with year numbers greater than anything used in the test cfg files.

Yes, for sure. I can help with the ocean and sea-ice output. But my suggestion would be to start with a mostly empty new location on Chrysalis somewhere and to add output files (i.e. all files within the year range you're testing with a given prefix) only when analysis breaks without it. This way, you end up with a minimum set.

xylar (Contributor) commented Feb 4, 2025

If you had to choose between v2 and v3 data, it seems like you should pick v3 data.

xylar (Contributor) commented Feb 4, 2025

Below is what I am pretty confident you need for MPAS-Ocean and -Seaice (shown here with v2 file names; the v3 data should be similar).

I would only copy these, preserving the directory structure, of course (see the sketch after this list):

  • the namelist and streams files from run
    • run/mpaso_in
    • run/mpassi_in
    • run/streams.ocean
    • run/streams.seaice
  • a restart file for each:
    • run/v2.LR.historical_0201.mpaso.rst.2015-01-01_00000.nc
    • run/v2.LR.historical_0201.mpassi.rst.2015-01-01_00000.nc
  • history files for only the required range of years (e.g. for year in $(seq 2000 2004); do ...):
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthly.${year}-*.nc
    • archive/ice/hist/v2.LR.historical_0201.mpassi.hist.am.timeSeriesStatsMonthly.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthlyMin.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.timeSeriesStatsMonthlyMax.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.meridionalHeatTransport.${year}-*.nc
    • archive/ocn/hist/v2.LR.historical_0201.mpaso.hist.am.oceanHeatContent.${year}-*.nc

forsyth2 (Collaborator) commented Feb 5, 2025

Thanks so much @xylar. I think I actually have a decent minimal test data set now! I'll transfer that to Compy for testing.

It's 859G, which is a very welcome reduction from 24T!

Script to generate v3 test data:

#!/bin/bash
# 2025-02-04

version="v3" # Options: v3, v2

if [ "${version}" == "v3" ]; then
    case_name="v3.LR.historical_0051"
    # This is the path to the complete simulation output.
    # This has a lot of data. We don't want to copy over everything.
    # So, this script will copy over only the necessary files.
    complete_simulation_output="/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051"
    restart_year="0051"
    start_year=1985
    end_year_short=1988
    end_year_long=1994
    end_year_closed_interval=1995
fi
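
# NOTE: only the v3 case is filled in here. A hypothetical v2 branch (placed
# before the "fi" above) would mirror it, e.g. with the case name and restart
# year from the earlier file list; the complete-output path is a placeholder:
#   elif [ "${version}" == "v2" ]; then
#       case_name="v2.LR.historical_0201"
#       complete_simulation_output="/path/to/complete/v2/simulation/output"
#       restart_year="2015"
#       ... # year settings analogous to the v3 case
#   fi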

case_prefix="/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SM${version}/${case_name}"
rm -rf ${case_prefix} # Start fresh
echo "Creating reduced data set: ${case_prefix}"
mkdir -p ${case_prefix}/archive/atm/hist
mkdir -p ${case_prefix}/archive/ice/hist
mkdir -p ${case_prefix}/archive/lnd/hist
mkdir -p ${case_prefix}/archive/ocn/hist
mkdir -p ${case_prefix}/archive/rof/hist
mkdir -p ${case_prefix}/run

for year in $(seq ${start_year} ${end_year_closed_interval}); do
    cd ${complete_simulation_output}/archive/ice/hist
    # For mpas_analysis
    cp ${case_name}.mpassi.hist.am.timeSeriesStatsMonthly.${year}*.nc ${case_prefix}/archive/ice/hist/

    cd ${complete_simulation_output}/archive/ocn/hist
    # For mpas_analysis, global_time_series
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthly.${year}*.nc ${case_prefix}/archive/ocn/hist/
    # For mpas_analysis only
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthlyMin.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.timeSeriesStatsMonthlyMax.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.meridionalHeatTransport.${year}*.nc ${case_prefix}/archive/ocn/hist/
    cp ${case_name}.mpaso.hist.am.oceanHeatContent.${year}*.nc ${case_prefix}/archive/ocn/hist/
done

for year in $(seq ${start_year} ${end_year_long}); do
    cd ${complete_simulation_output}/archive/atm/hist
    # For climo_atm_monthly, ts_atm_monthly (end_year_short)
    # For ts_atm_monthly_glb (end_year_long)
    cp ${case_name}.eam.h0.${year}-*.nc ${case_prefix}/archive/atm/hist/

    cd ${complete_simulation_output}/archive/lnd/hist
    # For climo_land_monthly, ts_land_monthly (end_year_short)
    # For ts_lnd_monthly_glb (end_year_long)
    cp ${case_name}.elm.h0.${year}-*.nc ${case_prefix}/archive/lnd/hist/
done

for year in $(seq ${start_year} ${end_year_short}); do
    cd ${complete_simulation_output}/archive/atm/hist
    # For ts_atm_daily
    cp ${case_name}.eam.h1.${year}-*.nc ${case_prefix}/archive/atm/hist/
    # For climo_atm_monthly_diurnal
    cp ${case_name}.eam.h3.${year}-*.nc ${case_prefix}/archive/atm/hist/

    cd ${complete_simulation_output}/archive/rof/hist
    # For ts_rof_monthly
    cp ${case_name}.mosart.h0.${year}-*.nc ${case_prefix}/archive/rof/hist/
done

cd ${complete_simulation_output}/run
cp ${case_name}.mpaso.rst.${restart_year}-01-01_00000.nc ${case_prefix}/run/
cp ${case_name}.mpassi.rst.${restart_year}-01-01_00000.nc ${case_prefix}/run/
cp mpaso_in ${case_prefix}/run/
cp mpassi_in ${case_prefix}/run/
cp streams.ocean ${case_prefix}/run/
cp streams.seaice ${case_prefix}/run/

echo "Complete simulation output: ${complete_simulation_output}"
echo "Reduced data set: ${case_prefix}"
echo "Size:"
du -sh ${case_prefix}
# 859G	/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051

forsyth2 (Collaborator) commented Feb 5, 2025

I'm running into a data transfer issue.

I used Globus to transfer data:

Source Collection: LCRC Improv DTN
Source Path: /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/

Destination Collection: pic#compy-dtn
Destination Path: /compyfs/fors729/zppy_test_data

But I get differing data set sizes:

# Chrysalis:
du -sh /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3
# 860G	/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3
# Compy:
du -sh /compyfs/fors729/zppy_test_data/E3SMv3/
# 572G	/compyfs/fors729/zppy_test_data/E3SMv3/

Globus says the transfer succeeded, though. So why is the Compy copy ~300G smaller?
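
One thing I could try (a sketch, assuming the globus-cli package; SRC_EP and DST_EP are placeholder endpoint UUIDs for LCRC Improv DTN and pic#compy-dtn) is re-running the transfer with checksum syncing, so only files whose checksums differ get re-copied:

# Re-sync, re-copying only files whose checksums differ between endpoints.
globus transfer "${SRC_EP}:/lcrc/group/e3sm/ac.forsyth2/zppy_test_data/" \
    "${DST_EP}:/compyfs/fors729/zppy_test_data/" \
    --recursive --sync-level checksum --verify-checksum \
    --label "zppy test data re-sync"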

xylar (Contributor) commented Feb 5, 2025

That seems worth double-checking. At least make sure all the files are there. Are the sizes of individual files at least the same if you pick a few at random?
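
For example (a sketch; running the same command on both machines and comparing the output is more conclusive than sizes alone):

# Run from inside the data set root on each system, then diff the results.
# sort + head makes the file selection deterministic on both sides.
cd zppy_test_data/E3SMv3/v3.LR.historical_0051
find . -name '*.nc' | sort | head -n 5 | xargs md5sum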

forsyth2 (Collaborator) commented Feb 5, 2025

In /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051/run:

> du -sh *
48K	mpaso_in
32K	mpassi_in
48K	streams.ocean
16K	streams.seaice
3.7G	v3.LR.historical_0051.mpaso.rst.0051-01-01_00000.nc
1.4G	v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc

In /compyfs/fors729/zppy_test_data/E3SMv3/v3.LR.historical_0051/run:

> du -sh *
40K	mpaso_in
40K	mpassi_in
40K	streams.ocean
30K	streams.seaice
2.9G	v3.LR.historical_0051.mpaso.rst.0051-01-01_00000.nc
439M	v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc

This is pretty strange. Some files get bigger and some get smaller!

forsyth2 (Collaborator) commented Feb 5, 2025

In /lcrc/group/e3sm/ac.forsyth2/zppy_test_data/E3SMv3/v3.LR.historical_0051:

> ls -R | wc -l
1221

In /compyfs/fors729/zppy_test_data/E3SMv3/v3.LR.historical_0051:

> ls -R | wc -l
1221

Well, they appear to have the same number of files; the problem seems to be that the files change size haphazardly.

forsyth2 (Collaborator) commented Feb 5, 2025

Files in archive/rof/hist seem universally smaller, down from ~30M to ~6.5M.

xylar (Contributor) commented Feb 5, 2025

Can you vim the text files and see if the bottom is the same? Can you ncdump the NetCDF files and at least verify that they are dumpable? It may just be that Compy's file system compresses things or something. That might explain why it's the slowest file system I have ever had the misfortune to work with.
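
For example (a sketch using GNU du; --apparent-size reports the logical file size rather than the blocks allocated on disk, so it should match across machines if transparent compression is the explanation):

# Compare allocated vs. logical size for one file on each system.
f=v3.LR.historical_0051.mpassi.rst.0051-01-01_00000.nc
du -sh "${f}"                  # allocated size (what differs above)
du -sh --apparent-size "${f}"  # logical size (should match across machines)
ncdump -h "${f}" > /dev/null && echo "header readable"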

forsyth2 (Collaborator) commented Feb 5, 2025

> It may just be that Compy's file system compresses things or something.

Oh interesting, I was wondering that myself.

> Can you vim the text files and see if the bottom is the same? Can you ncdump the NetCDF files and at least verify that they are dumpable?

A preliminary look does seem to suggest this is true. So maybe Compy is just compressing things.

> slowest file system

I agree; commands take much longer to run on Compy for me.
