Input files having >1 extra empty rows at the end breaks CSVReader #1023

Open
astjoephysics opened this issue Sep 26, 2024 · 6 comments
Labels
version 1.0 tasks that need to be completed before the first software release

Comments

@astjoephysics
Collaborator

If an input file has two or more blank lines after the data rows, CSVReader breaks: the skiprows handling cannot cope with the extra rows. Possibly we need to make users aware that input files must be formatted correctly?

@astronomerritt
Contributor

Saving the traceback here:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/sorcha/bin/sorcha-run", line 8, in <module>
    sys.exit(main())
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha_cmdline/run.py", line 121, in main
    return execute(args)
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha_cmdline/run.py", line 183, in execute
    runLSSTSimulation(args, configs)
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha/sorcha.py", line 190, in runLSSTSimulation
    orbits_df = reader.read_aux_block(block_size=configs["size_serial_chunk"])
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha/readers/CombinedDataReader.py", line 241, in read_aux_block
    current_df = reader.read_objects(obj_ids)
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha/readers/ObjectDataReader.py", line 125, in read_objects
    res_df = self._read_objects_internal(obj_ids, **kwargs)
  File "/Users/stephaniemerritt/Projects/sorcha/src/sorcha/readers/CSVReader.py", line 229, in _read_objects_internal
    res_df = pd.read_csv(
  File "/opt/miniconda3/envs/sorcha/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/miniconda3/envs/sorcha/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/opt/miniconda3/envs/sorcha/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/opt/miniconda3/envs/sorcha/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2050, in pandas._libs.parsers.raise_parser_error
IndexError: list index out of range
Error: Command 'sorcha-run' failed with exit code 1.

@astronomerritt
Contributor

This is a bit of a pain, because the code is breaking BEFORE it does the validation checks on the input table. See lines 125-126 of ObjectDataReader.py:

res_df = self._read_objects_internal(obj_ids, **kwargs)  # code breaks here
res_df = self._process_and_validate_input_table(res_df, **kwargs)

My only suggestion for fixing this is to wrap one of the failing lines in a try/except IndexError statement and throw an error message that suggests the user might want to check for empty lines at the end of their input files.

I'd suggest doing this to the line I highlighted above, as it's in the parent class so all sub-classes would inherit the behaviour.
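A minimal sketch of that suggestion, for illustration only. The classes below are stand-ins rather than the real Sorcha readers, and `InputFormatError`, `ObjectDataReaderSketch`, and `FailingReader` are hypothetical names. Only the two method names quoted above come from the actual code.

```python
class InputFormatError(Exception):
    """Hypothetical error type for malformed input files."""


class ObjectDataReaderSketch:
    """Stand-in for the parent reader class; not the real Sorcha code."""

    def _read_objects_internal(self, obj_ids, **kwargs):
        raise NotImplementedError

    def _process_and_validate_input_table(self, table, **kwargs):
        # Placeholder: the real method validates the table.
        return table

    def read_objects(self, obj_ids, **kwargs):
        try:
            res = self._read_objects_internal(obj_ids, **kwargs)
        except IndexError as err:
            # Re-raise with a hint, keeping the original error as the cause.
            raise InputFormatError(
                "Failed to read the input file. "
                "Check for empty lines at the end of the file."
            ) from err
        return self._process_and_validate_input_table(res, **kwargs)


class FailingReader(ObjectDataReaderSketch):
    """Subclass that mimics the parser failure seen in the traceback."""

    def _read_objects_internal(self, obj_ids, **kwargs):
        raise IndexError("list index out of range")
```

Because the try/except sits in the parent's `read_objects`, any subclass that only overrides `_read_objects_internal` picks up the friendlier message without further changes.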

@mschwamb
Collaborator

mschwamb commented Oct 20, 2024

Can you check whether this is fixed by PR #1042?

@mschwamb
Collaborator

@astronomerritt ☝️

@astronomerritt
Contributor

It is not fixed by that PR, no, sorry!

It's specifically the _read_objects_internal() methods on the reader classes causing the problem - even more specifically, it's something about either the skiprows argument on the Pandas read_csv() call or the way it's being utilised.

I can stop the code from breaking entirely by adding skip_blank_lines=True to the Pandas read_csv() calls, but:

  • this means the user is getting away with bad behaviour, as the code would no longer fail if there were ANY random blank lines in their input files
  • I haven't fully checked for unexpected behaviour
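On the first concern, one possible middle ground (not something tried in this thread) would be to report blank lines explicitly before parsing rather than silently skipping them. The sketch below uses only the standard library. The helper names and the sample column names are illustrative assumptions.

```python
import io


def find_blank_lines(lines):
    """Return 1-based line numbers of lines that contain only whitespace."""
    return [n for n, line in enumerate(lines, start=1) if not line.strip()]


def count_trailing_blank_lines(lines):
    """Count the whitespace-only lines at the very end of the input."""
    count = 0
    for line in reversed(lines):
        if line.strip():
            break
        count += 1
    return count


# Illustrative input: one blank line mid-file and two at the end.
sample = "ObjID,a,e\nobj1,1.0,0.1\n\nobj2,2.0,0.2\n\n\n"
sample_lines = io.StringIO(sample).readlines()

blank_line_numbers = find_blank_lines(sample_lines)
trailing_blank_count = count_trailing_blank_lines(sample_lines)
```

A check like this could feed an error message that names the offending line numbers, so the input is still rejected but the user knows where to look.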

The other option is to do what I suggested above, I guess, unless anyone has a better idea.

@mschwamb mschwamb added the version 1.0 tasks that need to be completed before the first software release label Nov 19, 2024
@jeremykubica
Contributor

@astronomerritt can you share the data file on which this is failing? I created a fake data file with blank lines and it is not giving me a problem.
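For anyone else attempting to reproduce this, a small sketch for generating a test file with a controlled number of trailing blank lines. The column names and values are made up, and the helper name is hypothetical. Since the failing file has not been shared here, a file generated this way may well not trigger the error.

```python
import os
import tempfile


def write_csv_with_trailing_blanks(path, n_blank):
    """Write a tiny CSV followed by n_blank empty lines."""
    rows = ["ObjID,a,e", "obj1,1.0,0.1", "obj2,2.0,0.2"]
    # newline="" prevents any platform-specific newline translation.
    with open(path, "w", newline="") as f:
        f.write("\n".join(rows) + "\n")
        f.write("\n" * n_blank)


# Example: create a file with two trailing blank lines in a temp directory.
tmp_dir = tempfile.mkdtemp()
test_path = os.path.join(tmp_dir, "orbits_trailing_blanks.csv")
write_csv_with_trailing_blanks(test_path, n_blank=2)
```

Varying `n_blank` (0, 1, 2, ...) makes it easy to check whether the failure depends on the number of trailing blank lines, as the issue title suggests.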
