Merge remote-tracking branch 'IQSS/develop' into DANS-performance
qqmyers committed Dec 3, 2024
2 parents d6f03bf + 1b5a1ea commit 2a679b6
Showing 22 changed files with 1,521 additions and 986 deletions.
6 changes: 6 additions & 0 deletions doc/release-notes/10688_whitespace_trimming.md
@@ -0,0 +1,6 @@
### Added whitespace trimming to uploaded custom metadata TSV files

When loading custom metadata blocks using the `api/admin/datasetfield/load` API, whitespace can be introduced into field names.
This change trims whitespace at the beginning and end of all values read into the API before persisting them.

For more information, see #10688.
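
The trimming described in this note can be pictured with a minimal Java sketch. It is not the actual loader code: the class and method names (`TsvTrimSketch`, `splitAndTrim`) are hypothetical, and it assumes one tab-separated record per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Dataverse implementation.
final class TsvTrimSketch {

    // Splits one TSV line on tabs and trims leading/trailing whitespace from every value.
    static List<String> splitAndTrim(String line) {
        List<String> values = new ArrayList<>();
        // A limit of -1 keeps trailing empty columns instead of silently dropping them.
        for (String raw : line.split("\t", -1)) {
            values.add(raw.trim());
        }
        return values;
    }
}
```

With this sketch, a field name such as `" title "` would be stored as `"title"`, so later lookups by name would not fail on invisible padding.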
6 changes: 6 additions & 0 deletions doc/release-notes/10977-globus-filesize-lookup.md
@@ -0,0 +1,6 @@
## A new Globus optimization setting

An optimization has been added for the Globus upload workflow, with a corresponding new database setting: `:GlobusBatchLookupSize`


See the [Database Settings](https://guides.dataverse.org/en/6.5/installation/config.html#GlobusBatchLookupSize) section of the Guides for more information.
16 changes: 16 additions & 0 deletions doc/release-notes/220-harvard-edu-audit-files.md
@@ -0,0 +1,16 @@
### New API to Audit Datafiles across the database

This is a superuser-only API endpoint to audit Datasets with DataFiles where the physical files are missing or the file metadata is missing.
The Datasets scanned can be limited by optional firstId and lastId query parameters, or a given CSV list of Dataset Identifiers.
Once the audit report is generated, a superuser can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).

The JSON response includes:
- List of files in each Dataset where the file exists in the database but the physical file is not in the file store.
- List of DataFiles where the FileMetadata is missing.
- Other failures found when trying to process the Datasets.

curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/RVNT9Q,doi:10.5072/FK2/RVNT9Q"

For more information, see [the docs](https://dataverse-guide--11016.org.readthedocs.build/en/11016/api/native-api.html#datafile-audit), #11016, and [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220)
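
For callers that prefer Java over curl, the requests above could be assembled roughly as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, and only the endpoint path, the query parameter names, and the `X-Dataverse-key` header are taken from this note.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side helper; only the endpoint path, query parameter names,
// and the X-Dataverse-key header come from the release note above.
final class AuditFilesRequestSketch {

    // Builds the audit URL; pass null (or an empty list) to omit an optional filter.
    static String buildAuditUrl(String serverUrl, Long firstId, Long lastId,
                                List<String> datasetIdentifiers) {
        List<String> params = new ArrayList<>();
        if (firstId != null) {
            params.add("firstId=" + firstId);
        }
        if (lastId != null) {
            params.add("lastId=" + lastId);
        }
        if (datasetIdentifiers != null && !datasetIdentifiers.isEmpty()) {
            // Identifiers are passed as-is, comma separated, mirroring the curl examples.
            params.add("datasetIdentifierList=" + String.join(",", datasetIdentifiers));
        }
        String url = serverUrl + "/api/admin/datafiles/auditFiles";
        return params.isEmpty() ? url : url + "?" + String.join("&", params);
    }

    // Wraps the URL in a GET request carrying the superuser API token.
    static HttpRequest buildAuditRequest(String url, String apiToken) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-Dataverse-key", apiToken)
                .GET()
                .build();
    }
}
```

The resulting request can be sent with the standard `java.net.http.HttpClient`, for example `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the JSON body inspected for the `datasets` and `failures` arrays.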
66 changes: 66 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -6300,6 +6300,72 @@ Note that if you are attempting to validate a very large number of datasets in y
  asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600

Datafile Audit
~~~~~~~~~~~~~~

Produce an audit report of missing files and FileMetadata for Datasets.
Scans the Datasets in the database and verifies that the stored files exist. If the files are missing or if the FileMetadata is missing, this information is returned in a JSON response.
The call will return a status code of 200 if the report was generated successfully. Issues found will be documented in the report and will not return a failure status code unless the report could not be generated::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles"

Optional Parameters are available for filtering the Datasets scanned.

For auditing the Datasets in a paged manner (firstId and lastId)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"

Auditing specific Datasets (comma separated list)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/JXYBJS,doi:10.7910/DVN/MPU019"

Sample JSON Audit Response::

  {
    "status": "OK",
    "data": {
      "firstId": 0,
      "lastId": 100,
      "datasetIdentifierList": [
        "doi:10.5072/FK2/XXXXXX",
        "doi:10.5072/FK2/JXYBJS",
        "doi:10.7910/DVN/MPU019"
      ],
      "datasetsChecked": 100,
      "datasets": [
        {
          "id": 6,
          "pid": "doi:10.5072/FK2/JXYBJS",
          "persistentURL": "https://doi.org/10.5072/FK2/JXYBJS",
          "missingFileMetadata": [
            {
              "storageIdentifier": "local://1930cce4f2d-855ccc51fcbb",
              "dataFileId": "7"
            }
          ]
        },
        {
          "id": 47731,
          "pid": "doi:10.7910/DVN/MPU019",
          "persistentURL": "https://doi.org/10.7910/DVN/MPU019",
          "missingFiles": [
            {
              "storageIdentifier": "s3://dvn-cloud:298910",
              "directoryLabel": "trees",
              "label": "trees.png"
            }
          ]
        }
      ],
      "failures": [
        {
          "datasetIdentifier": "doi:10.5072/FK2/XXXXXX",
          "reason": "Not Found"
        }
      ]
    }
  }

Workflows
~~~~~~~~~
7 changes: 7 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -4849,6 +4849,13 @@ The URL where the `dataverse-globus <https://github.com/scholarsportal/dataverse

The interval in seconds between Dataverse calls to Globus to check on upload progress. Defaults to 50 seconds (or to 10 minutes, when the ``globus-use-experimental-async-framework`` feature flag is enabled). See :ref:`globus-support` for details.

.. _:GlobusBatchLookupSize:

:GlobusBatchLookupSize
++++++++++++++++++++++

In the initial implementation, when files were added to the dataset upon completion of a Globus upload task, Dataverse made a separate Globus API call to look up the size of every new file. This proved to be a significant bottleneck at Harvard Dataverse, where users transfer batches of many thousands of files (which was in turn made possible by the Globus improvements in v6.4). In response, an optimized lookup mechanism was added: the Globus Service makes a listing API call on the entire remote folder, then populates the file sizes for all the new file entries before passing them to the Ingest service. This approach, however, can slow things down when the Globus folder for the dataset already contains thousands of files and only a small number of new files are being added. To address this, the number of files in a batch at which this method is used was made configurable. If not set, it defaults to 50 (a completely arbitrary number). Setting it to 0 means this method is always used for Globus uploads. Setting it to a very large number disables it completely. This was made a database setting, as opposed to a JVM option, so that it can be changed in real time.
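
The threshold semantics described above can be sketched in a few lines of Java. This is an illustration only, not the actual Globus service code: the class and method names are hypothetical, and whether the comparison at the boundary is inclusive is an assumption, since the text does not say.

```java
// Hypothetical illustration of the threshold semantics described above;
// not the actual Globus service code.
final class GlobusBatchLookupSketch {

    // Default documented for :GlobusBatchLookupSize when the setting is absent.
    static final int DEFAULT_BATCH_LOOKUP_SIZE = 50;

    // Returns true when the folder-listing lookup should be used for this upload.
    // Assumption: the comparison is "at least as many new files as the threshold";
    // the source does not state whether the boundary is inclusive.
    static boolean useFolderListing(int newFileCount, Integer configuredBatchLookupSize) {
        int threshold = (configuredBatchLookupSize != null)
                ? configuredBatchLookupSize
                : DEFAULT_BATCH_LOOKUP_SIZE;
        return newFileCount >= threshold;
    }
}
```

Under these assumptions, a setting of 0 selects the folder listing for every upload, while a very large value means per-file lookups are always used.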

:GlobusSingleFileTransfer
+++++++++++++++++++++++++

@@ -283,7 +283,7 @@ public Boolean sendNotificationEmail(UserNotification notification, String comme
         if (objectOfNotification != null){
             String messageText = getMessageTextBasedOnNotification(notification, objectOfNotification, comment, requestor);
             String subjectText = MailUtil.getSubjectTextBasedOnNotification(notification, objectOfNotification);
-            if (!(messageText.isEmpty() || subjectText.isEmpty())){
+            if (!(StringUtils.isEmpty(messageText) || StringUtils.isEmpty(subjectText))){
                 retval = sendSystemEmail(emailAddress, subjectText, messageText, isHtmlContent);
             } else {
                 logger.warning("Skipping " + notification.getType() + " notification, because couldn't get valid message");
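
The practical effect of this one-line change is presumably null safety: calling `isEmpty()` on a null `messageText` or `subjectText` throws a `NullPointerException`, whereas a static helper can treat null as empty. The sketch below uses a hand-written stand-in so that it has no dependencies. It assumes the `StringUtils` imported by the file behaves like Apache Commons Lang's `StringUtils.isEmpty`, which the hunk does not show, and the class and method names are hypothetical.

```java
// Stand-in for a null-safe emptiness check; assumes the StringUtils in the hunk
// behaves like Apache Commons Lang's StringUtils.isEmpty (true for null or "").
final class NullSafeEmptySketch {

    // Null-safe replacement for calling s.isEmpty() directly.
    static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // Mirrors the guard in the hunk: send only when both texts are present.
    static boolean shouldSend(String messageText, String subjectText) {
        return !(isEmpty(messageText) || isEmpty(subjectText));
    }
}
```

With this guard, a notification whose message text could not be generated falls through to the warning branch instead of failing with an exception.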