Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify the Reader classes to support getting files from somewhere other than a local disk #135

Open
alexwlchan opened this issue Jul 30, 2019 · 2 comments

Comments

@alexwlchan
Copy link

The classes in reader take a path to a file on disk, read that file and then parse the contents. For example:

public final class KeyValueReader {
  /**
   * Generic method to read key value pairs from the bagit files, like bagit.txt or bag-info.txt
   * 
   * @param file the file to read
   * @param splitRegex how to split the key from the value
   * @param charset the encoding of the file
   * 
   * @return a list of key value pairs
   */
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromFile(final Path file, final String splitRegex, final Charset charset) throws IOException, InvalidBagMetadataException{
    final List<SimpleImmutableEntry<String, String>> keyValues = new ArrayList<>();
    
    try(final BufferedReader reader = Files.newBufferedReader(file, charset)){
       ...
    }

    return keyValues;
  }
}

For the Wellcome storage service (https://github.com/wellcometrust/storage-service), we aren’t keeping bags on the local disk, but in S3. If we want to read a file, we make a GetObject call to the S3 SDK, which returns an InputStream.

We could download the bag files to disk, and read them from there, but that seems a bit icky – would you be open to some pull requests that add allow parsing files even if they aren’t local files? Something like:

public final class KeyValueReader {
  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromReader(
    final BufferedReader reader,
    final String splitRegex) throws IOException, InvalidBagMetadataException{
    final List<SimpleImmutableEntry<String, String>> keyValues = new ArrayList<>();

    ...    

    return keyValues;
  }

  public static List<SimpleImmutableEntry<String, String>> readKeyValuesFromFile(
    final Path file,
    final String splitRegex,
    final Charset charset) throws IOException, InvalidBagMetadataException{
    try(final BufferedReader reader = Files.newBufferedReader(file, charset)){
       return readKeyValuesFromReader(reader, splitRegex)
    }
  }
}

So the existing API is preserved, and calls into the new method that takes any BufferedReader – and now we can call that rather than round-tripping to the filesystem first.

Thoughts?

@jscancella
Copy link
Contributor

I checked out your links, but I'm afraid I still don't understand your use case. Why do you need to read the bag-info.txt from S3?
It is my opinion that bagit is best suited to the transfer of large batches (which is why it was originally created), and ensure they are complete (all the files are there) and correct (none of the files have changed). I don't recommend it as a long term storage system because it does not deal with the common case of bit rot (random bits flipping and creating errors), nor does it handle storing and handling multiple copys (Lots Of Copies Keeps Stuff Safe - LOCKSS).

@kenoir
Copy link

kenoir commented Jul 31, 2019

@jscancella

It is my opinion that bagit is best suited to the transfer of large batches

That is exactly what our cloud storage service is doing - especially during replication for multiple copies "Lots Of Copies Keeps Stuff Safe - LOCKSS".

To store the things, you have to move them into storage - and to move the things between replication locations between object stores you have to move them in large batches.

It might be useful for us to communicate about this not via GitHub issues - can we invite you to our Slack perhaps?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants