
Converting MFP CSV to YAML schedule #111

Open · wants to merge 8 commits into main

Conversation

iuryt
Contributor

@iuryt iuryt commented Jan 28, 2025

@ammedd @VeckoTheGecko

The file does not contain the expected time for each of the stations. How should we guess this?

Also, how should we organize this code? Should we embed it in the virtualship init command?
e.g. virtualship init --mfp_file ./CoordinatesExport-Filled.xlsx

So far, this is just a script that creates the YAML file. We can delete it after deciding where to implement this.

I am also using the CSV version to avoid installing openpyxl, but we can also install it, no problem.

@iuryt iuryt changed the title from "draft script for converting MFP CSV to YAML schedule" to "Converting MFP CSV to YAML schedule" on Jan 28, 2025
@ammedd
Collaborator

ammedd commented Jan 29, 2025

I'll ask the students to add a column with station time. I hope it will also become available directly in later downloads from the MFP website; it is included online.

virtualship init --mfp_file ./CoordinatesExport-Filled.xlsx sounds great! Maybe change "file" in the flag name to "csv" or "DD"?
When you download from MFP, you choose Export Coordinates - DD (decimal degrees). There are other options, but I would not cater for those at the moment.

@iuryt
Contributor Author

iuryt commented Jan 29, 2025

@ammedd
Can you give me a version of this file with the times using the template you plan to use?

@VeckoTheGecko
Collaborator

I was discussing with @ammedd, and I think the best thing to do would be to use the CSV to populate as many fields as it can in the schedule, and leave the rest up to the user to populate in the YAML. This saves the user from adding columns to the CSV (which I think would be more error prone).

I'll be able to give this a proper review on Friday - just have some other items to clear before then :)

Comment on lines 18 to 27
- instrument: CTD
location:
latitude: 55.124089
longitude: 5.156524
time: "2023-01-01 01:00:00"
- instrument: DRIFTER
location:
latitude: 55.124089
longitude: 5.156524
time: "2023-01-01 01:00:00"
Contributor Author


> I was discussing with @ammedd, and I think the best thing to do would be to use the CSV to populate as many fields as it can in the schedule, and leave the rest up to the user to populate in the YAML. This saves the user from adding columns to the CSV (which I think would be more error prone).
>
> I'll be able to give this a proper review on Friday - just have some other items to clear before then :)

But that means we need to either leave the dates blank or guess them, right? There are also stations that will have more than one instrument, and it will be hard to tell they are the same if we leave the dates blank. Of course, they have the same location, but if longitudes and latitudes are too close, matching them might be harder than just adding a column to the CSV. Although I understand that adding a column manually is more error prone.

@VeckoTheGecko
Collaborator

VeckoTheGecko commented Jan 31, 2025

Okay, I've managed to look through the code. Thanks for picking this up @iuryt ! I definitely see how this will streamline things for users - especially for the waypoints and the bounding box.

> But that means we need to either leave the dates blank or guess them, right? There are also stations that will have more than one instrument, and it will be hard to tell they are the same if we leave the dates blank. Of course, they have the same location, but if longitudes and latitudes are too close, matching them might be harder than just adding a column to the CSV. Although I understand that adding a column manually is more error prone.

Correct, we'll have to leave the dates blank; it's just information that is not provided by the MFP export, though the order of points in the YAML should match the CSV. If the MFP export changes in the future, we can adapt. So yes, it would also be difficult to guess whether stations are the same. Honestly, I think we should avoid making that guess: make it clear that virtualship init --from-mfp will only populate some of the fields, and require the user to take it from there (e.g., merging waypoints at the same location that were initially spaced a bit further apart in MFP, or filling in the dates from the MFP UI, since they weren't included in the export).

It's not optimal, but given that MFP -> virtualship isn't necessarily 1-to-1, and given the limited export, I think it's all we can do. I also think this is the clearer approach from a maintenance POV: we are explicit about what is supported, we don't make additional assumptions that may not be right, and we hand control to the user to modify the schedule file.

> Also, how should we organize this code? Should we embed it in the virtualship init command?
> e.g. virtualship init --mfp_file ./CoordinatesExport-Filled.xlsx

Sounds like a good plan! My vote is for --from-mfp, since that is more in line with CLI conventions. Something like this in the help message would be good:

--from-mfp    Partially initialise a project from an exported xlsx or csv file from NIOZ' Marine Facilities Planning tool (specifically the "Export Coordinates > DD" option). User edits are required after initialisation.

> I am also using the CSV version to avoid installing openpyxl, but we can also install it, no problem.

Let's install openpyxl


Other notes:

  • Could we add an error message for users if the CSV headers are not as expected?
    • Incorrect headers -> Error: "Found columns [...], but expected columns [...]. Are you sure that you're using the correct export from MFP?"
    • Additional headers -> Warning: "Found additional unexpected columns [...]. Manually added columns have no effect. If the MFP export format changed, please submit an issue https://github.com/OceanParcels/virtualship/issues."
  • It would be nice to have a message at the end saying that it's up to the user to populate the rest of the YAML
  • Would you be comfortable writing some unit tests for this functionality as well?
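To make the header checks above concrete, here is a minimal sketch. The column names in EXPECTED_COLUMNS and the function name are assumptions for illustration, not the actual MFP export format or the PR's implementation:

```python
# Hypothetical header validation for the MFP CSV. EXPECTED_COLUMNS is an
# assumed set of column names, not the real MFP export schema.
EXPECTED_COLUMNS = {"Station Type", "Name", "Latitude", "Longitude", "Instrument"}


def validate_headers(columns):
    found = set(columns)
    missing = EXPECTED_COLUMNS - found
    extra = found - EXPECTED_COLUMNS
    if missing:
        # Incorrect headers are a hard error.
        raise ValueError(
            f"Found columns {sorted(found)}, but expected columns "
            f"{sorted(EXPECTED_COLUMNS)}. Are you sure that you're using "
            "the correct export from MFP?"
        )
    if extra:
        # Additional headers only warn, since they have no effect.
        print(
            f"Warning: Found additional unexpected columns {sorted(extra)}. "
            "Manually added columns have no effect. If the MFP export format "
            "changed, please submit an issue at "
            "https://github.com/OceanParcels/virtualship/issues."
        )
```

Keeping the check as a small pure function like this also makes the requested unit tests trivial to write.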

Let me know what your thoughts are on all this ^, and if there's anything I can do to help :)

@iuryt
Contributor Author

iuryt commented Feb 3, 2025

@ammedd and @VeckoTheGecko

I tested it, and the current version is working.
I also added a maximum_depth and a buffer for longitude and latitude that depend on the instruments.

    # Define maximum depth and buffer for each instrument
    instrument_properties = {
        "CTD": {"depth": 5000, "buffer": 1},
        "DRIFTER": {"depth": 1, "buffer": 5},
        "ARGO_FLOAT": {"depth": 2000, "buffer": 5},
    }
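As a rough illustration of what the buffer does (my assumption: it is a margin in decimal degrees added around the stations' bounding box when building the space-time region; the helper name below is hypothetical, not code from this PR):

```python
# Hypothetical sketch: widen the stations' bounding box by a per-instrument
# buffer, in decimal degrees. Not the PR's exact implementation.
def station_bounds(latitudes, longitudes, buffer_deg):
    return {
        "minimum_latitude": min(latitudes) - buffer_deg,
        "maximum_latitude": max(latitudes) + buffer_deg,
        "minimum_longitude": min(longitudes) - buffer_deg,
        "maximum_longitude": max(longitudes) + buffer_deg,
    }
```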

Any comments on this, @erikvansebille ?

Let me know what you think.

I have not yet added:

  • Error message to users if the CSV headers are not as expected
  • Additional headers -> Warning
  • Message at the end that it's up to the user to populate the rest of the YAML
  • Unit tests for this functionality

@erikvansebille
Member

Looks good, but perhaps explain what buffer means in the dictionary above?

Collaborator

@VeckoTheGecko VeckoTheGecko left a comment


Good progress! Left some review comments. I think there is some duplication of information between the comments and the code. In general, I recommend "Don't Write Comments": rewrite the code instead (a really good YouTube channel for code style in general; highly recommend). The only time I write comments is to explain why something was done. E.g., # Keeping this legacy format for backward compatibility with older datasets

There are many ways to make things self documenting. E.g.,

mfp_to_yaml(
            mfp_file, str(path)
        )  # Pass the path to save in the correct directory

could become

mfp_to_yaml(
            mfp_file, save_directory=str(path)
        )


It's possible to have clean, readable code with 0 duplication, but also 0 comments ;)

Collaborator


Do you think it would be possible to use the Pydantic objects directly here (Waypoint, SpaceTimeRegion, then building up to Schedule)? I think it would be less error prone than manually creating the dictionary/YAML.
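To illustrate the pattern being suggested (the field names below are hypothetical stand-ins, not virtualship's actual models, which may differ):

```python
from typing import List, Optional

from pydantic import BaseModel


# Hypothetical models for illustration only; virtualship's real
# Waypoint/SpaceTimeRegion/Schedule classes may have different fields.
class Location(BaseModel):
    latitude: float
    longitude: float


class Waypoint(BaseModel):
    instrument: str
    location: Location
    time: Optional[str] = None  # left blank when MFP does not provide it


class Schedule(BaseModel):
    waypoints: List[Waypoint]


# Invalid or misspelled fields fail loudly at construction time,
# instead of silently producing a malformed YAML file.
schedule = Schedule(
    waypoints=[
        Waypoint(
            instrument="CTD",
            location=Location(latitude=55.124089, longitude=5.156524),
        )
    ]
)
```

From there, dumping the validated model (e.g. with yaml.safe_dump over the model's dict form) would give the schedule body, rather than hand-assembling nested dictionaries.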

Comment on lines +75 to +78
# Extract unique instruments from dataset
unique_instruments = np.unique(
np.hstack(coordinates_data["Instrument"].apply(lambda a: a.split(", ")).values)
)
Collaborator


I think this would just be clearer as a for loop.

unique_instruments = set()
for row in ....:
    instruments = ...split(", ")
    unique_instruments |= set(instruments)  # or just `unique_instruments = unique_instruments | set(instruments)`

That way there's no need to write the comment, since it's evident from the code.
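Filling in the ellipses, the sketch above could look like this (the function name and the iterable of "Instrument" cells are placeholders for the dataframe column in the PR):

```python
# Runnable version of the loop sketched above. `instrument_cells` stands in
# for the "Instrument" column, where each cell is a comma-separated string.
def collect_unique_instruments(instrument_cells):
    unique_instruments = set()
    for cell in instrument_cells:
        unique_instruments |= set(cell.split(", "))
    return unique_instruments
```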

Comment on lines +57 to +62
if mfp_file:
# Generate schedule.yaml from the MPF file
click.echo(f"Generating schedule from {mfp_file}...")
mfp_to_yaml(
mfp_file, str(path)
) # Pass the path to save in the correct directory
Collaborator


If mfp_to_yaml returned a string representing the YAML instead of doing the file modifications, then we could use schedule.write_text(...) like in the else clause. So something like:

if mfp_file:
    # don't repeat the comment here :) the log message is comment enough
    click.echo(f"Generating schedule from {mfp_file}...")
    schedule_body = mfp_to_yaml(
        mfp_file,
    )
else:
    schedule_body = utils.get_example_schedule()
schedule.write_text(...)

@@ -12,6 +12,7 @@ dependencies:
- pip
- pyyaml
- copernicusmarine >= 2
- openpyxl >= 3.1.5
Collaborator


Suggested change
- openpyxl >= 3.1.5
- openpyxl

I don't think we need to specify the lower bound. It's best to stay flexible unless a specific version has an incompatibility with our codebase.

Comment on lines +68 to +73
# Define maximum depth and buffer for each instrument
instrument_properties = {
"CTD": {"depth": 5000, "buffer": 1},
"DRIFTER": {"depth": 1, "buffer": 5},
"ARGO_FLOAT": {"depth": 2000, "buffer": 5},
}
Collaborator


Suggested change
# Define maximum depth and buffer for each instrument
instrument_properties = {
"CTD": {"depth": 5000, "buffer": 1},
"DRIFTER": {"depth": 1, "buffer": 5},
"ARGO_FLOAT": {"depth": 2000, "buffer": 5},
}
instrument_properties = {
"CTD": {"max_depth": 5000, "buffer": 1},
"DRIFTER": {"max_depth": 1, "buffer": 5},
"ARGO_FLOAT": {"max_depth": 2000, "buffer": 5},
}

mfp_file, str(path)
) # Pass the path to save in the correct directory
else:
# Create a default example schedule
Collaborator


Suggested change
# Create a default example schedule

Successfully merging this pull request may close these issues.

YAML schedule from MFP CSV