Better storage and export for usage events #15

Open
mwoodiupui opened this issue Jul 11, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
Solr is not a good primary store for irreproducible data such as usage events: it is designed to cache data that can be reloaded from elsewhere, and it is awkward to use with external statistical tools. Usage data are a sequence of structured records with multivalued fields whose contents cannot be recreated.

The statistics core grows without bound (unless old records are deleted), quickly becoming the largest of DSpace's Solr cores, and the oldest records are perhaps not worth keeping online.

Describe the solution you'd like
A simple log of usage events. We could use Log4j2 (the same logging framework used for other logging in DSpace) to manage concerns such as file rollover. Each event can be represented in JSON as a single complex object on one line. External statistical tools should be able to ingest such files either directly or with minimal reformatting. Tools such as jq exist to select or transform JSON records, and files can be readily combined using ordinary file tools. Older records can be compressed, and perhaps archived offline.
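A minimal sketch of what the writing side might look like, assuming a dedicated Log4j2 logger named "usage-events" (rollover and compression would be configured on its appender in log4j2.xml) and illustrative event fields; the object is built with JSON-P and emitted as one line:

```java
import java.time.Instant;
import java.util.List;

import javax.json.Json;
import javax.json.JsonArrayBuilder;
import javax.json.JsonObject;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

/** Sketch: emit each usage event as a single-line JSON object. */
public class UsageEventLogger {

    // Hypothetical logger name; rollover and compression would be
    // configured on a RollingFileAppender in log4j2.xml.
    private static final Logger LOG = LogManager.getLogger("usage-events");

    public void logEvent(String type, String objectId, List<String> subjects) {
        // Multivalued fields map naturally onto JSON arrays.
        JsonArrayBuilder subjectArray = Json.createArrayBuilder();
        for (String subject : subjects) {
            subjectArray.add(subject);
        }
        JsonObject event = Json.createObjectBuilder()
                .add("timestamp", Instant.now().toString())
                .add("type", type)
                .add("object", objectId)
                .add("subjects", subjectArray)
                .build();
        // JsonObject.toString() yields compact JSON text on one line.
        LOG.info(event.toString());
    }
}
```

Because each line is a complete JSON document, a command like `jq 'select(.type == "VIEW")' usage-events.log` can filter events directly, and files can be concatenated or split with ordinary tools.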

Describe alternatives or workarounds you've considered
XML is unsuitable because an XML file must be a single document with one top-level element. It is not a good fit for a conceptually unending stream of records; we would need rules for reading a file of events as millions of tiny separate "documents".

YAML would work, but YAML does too much and would probably require a third-party parser such as the endlessly buggy Jackson. JSON is about the right level of complexity, and we can use JSON-P (JSR 353).
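As a rough sketch of the reading side, assuming the one-object-per-line format above and a hypothetical file name, each line parses as an independent document with JSON-P:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonReader;

/** Sketch: each line of the log parses as a standalone JSON document. */
public class UsageEventDump {

    public static void main(String[] args) throws IOException {
        // Hypothetical log file name.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("usage-events.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                try (JsonReader reader = Json.createReader(new StringReader(line))) {
                    JsonObject event = reader.readObject();
                    System.out.println(event.getString("timestamp")
                            + " " + event.getString("type"));
                }
            }
        }
    }
}
```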

A relational database is a poor fit due to multivalued fields. We'd need either a forest of tables and foreign keys or tricky encoding rules (and have to build parsers for the rule set).

A graph database would work well, but we don't need a DBMS for this. It's just a time series.
