Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/4031 journals added sheet duplicated #2444

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

philipkcl
Copy link
Contributor

@philipkcl philipkcl commented Dec 27, 2024


Journals added sheet is showing all journals

  • improve datalog_journal_added_update background job to make it more robust avoid missing records in datalog
  • schedule of datalog changed to one pre day instead of every 30 minutes
  • the root cause of duplicate records should be some records are lost in datalog_journal_added, even the record have been added to index datalog_journal_added successful

How to deploy

  • run migration script to remove all records in datalog_journal_added
  • consider remove all records in google sheet

This PR...

  • has scripts to run
  • has migrations to run
  • adds new infrastructure
  • changes the CI pipeline
  • affects the public site
  • affects the editorial area
  • affects the publisher area
  • affects the monitoring

Developer Checklist

Developers should review and confirm each of these items before requesting review

  • Code meets acceptance criteria from issue
  • Unit tests are written and all pass
  • User Test Scripts (if required) are written and have been run through
  • Project's coding standards are met
    • No deprecated methods are used
    • No magic strings/numbers - all strings are in constants or messages files
    • ES queries are wrapped in a Query object rather than inlined in the code
    • Where possible our common library functions have been used (e.g. dates manipulated via dates)
    • Cleaned up commented out code, etc
    • Urls are constructed with url_for not hard-coded
  • Code documentation and related non-code documentation has all been updated
  • Migation has been created and tested
  • There is a recent merge from develop

Reviewer Checklist

Reviewers should review and confirm each of these items before approval
If there are multiple reviewers, this section should be duplicated for each reviewer

  • Code meets acceptance criteria from issue
  • Unit tests are written and all pass
  • User Test Scripts (if required) are written and have been run through
  • Project's coding standards are met
    • No deprecated methods are used
    • No magic strings/numbers - all strings are in constants or messages files
    • ES queries are wrapped in a Query object rather than inlined in the code
    • Where possible our common library functions have been used (e.g. dates manipulated via dates)
    • Cleaned up commented out code, etc
    • Urls are constructed with url_for not hard-coded
  • Code documentation and related non-code documentation has all been updated
  • Migation has been created and tested
  • There is a recent merge from develop

Testing

List user test scripts that need to be run

List any non-unit test scripts that need to be run by reviewers

Deployment

What deployment considerations are there? (delete any sections you don't need)

Configuration changes

What configuration changes are included in this PR, and do we need to set specific values for production

Scripts

What scripts need to be run from the PR (e.g. if this is a report generating feature), and when (once, regularly, etc).

Migrations

What migrations need to be run to deploy this

  • run migration script to remove all records in datalog_journal_added
  • consider remove all records in google sheet

Monitoring

What additional monitoring is required of the application as a result of this feature

New Infrastructure

What new infrastructure does this PR require (e.g. new services that need to run on the back-end).

Continuous Integration

What CI changes are required for this

@richard-jones richard-jones requested review from RK206 and removed request for richard-jones January 7, 2025 13:54
Copy link
Contributor

@RK206 RK206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the duplication in google sheet. May be we need check the function find_new_xlsx_rows.
Every time I run the script, it copying the data again to google sheets.

As per criteria 1, journals added starts with 1st Feb 2024 instead of from 2014.

Can you please check again?

if record is None:
return datetime.datetime.now()
return datetime.datetime(2014, 3, 19)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear why this date should be Mar 19 2024. Should it be 1st Feb 2024?
Also I think it would be better if this date is added in setting file instead of hard coding it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is 2014-03-19, most early date of journal that showed in excel. I don't think that value is configurable and always changed in different environment.

@RK206
Copy link
Contributor

RK206 commented Jan 14, 2025

@philipkcl I done the following to reproduce the issue.

  1. Created new google sheet
  2. Run the migration script python portality/migrate/4031_journals_added_sheet_duplicated/cleanup_4031_journals_added_sheet_duplicated.py
  3. No records in doaj-datalog_journal_added after executing migration script.
  4. Run the script python portality/tasks/datalog_journal_added_update.py
  5. I see 18350 records inserted in doaj-datalog_journal_added
  6. I see the same number of records in the google sheet.
  7. Now the issue starts when I run the script again. No change in the index records doaj-datalog_journal_added
    But Google sheet records doubles.
    When ever I run the script again, the sheet records increases nearly 18350
  8. My guess is that it is writing the records to the sheet every time I run the script.
  9. As I mentioned, there could be an issue in find_new_xlsx_rows method.

Hope this helps. Let me know if I need to test in different way or if I am missing something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants