update csv importexport tool. ##Godoy0722/stable-3_3_0 #1606

Merged
merged 7 commits into from
Oct 21, 2024

Conversation

Godoy0722
Contributor

@Godoy0722 Godoy0722 commented Jun 24, 2024

Updates:

  • fixes bugs that prevented the tool from running as a CLI command;
  • adds support for additional fields: abstract, keywords, subjects, book cover image, book cover image alt text, and categories;
  • adds a document with guidance on how to use this tool in CLI mode.

Issues

This PR is intended to solve the following issues:

  1. [OMP] General errors using the CSV importexport on CLI mode pkp-lib#10116
  2. [OMP] Add a sample CSV file for the CSV importexport tool pkp-lib#10117
  3. [OMP] Create a document guiding the user how does the CSV import export tool works pkp-lib#10118
  4. [OMP] Add support for additional fields on CSV importexport tool pkp-lib#10119

@CLAassistant

CLAassistant commented Jun 24, 2024

CLA assistant check
All committers have signed the CLA.

@kaitlinnewson kaitlinnewson self-requested a review June 25, 2024 15:41
plugins/importexport/csv/README.md
Member

I'm not sure what the best place is currently for plugin documentation - I wonder if this would be more suited to the Docs Hub, e.g. in the Admin Guide's import/export section? That way it would be easier for users to find.

@jonasraoni
Contributor

One extra comment... I've just noticed that the number of modified lines is large, and it looks like many files have had their line breaks changed from \n to \r\n; we need to revert them to \n.

I'll create an issue to decrease the chances of it happening again.
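
For anyone hitting the same problem, here is a minimal sketch of detecting and reverting CRLF line endings in a file's contents. The helper names `normalizeLineEndings` and `hasCrlf` are made up for illustration and are not part of the plugin or of pkp-lib.

```php
<?php
// Hypothetical helper: convert CRLF (and any stray CR) line endings to LF.
function normalizeLineEndings(string $contents): string
{
	// "\r\n" is replaced first, so the second pass only sees lone "\r" characters.
	return str_replace(["\r\n", "\r"], "\n", $contents);
}

// Hypothetical helper: true when the text contains at least one CRLF pair.
function hasCrlf(string $contents): bool
{
	return strpos($contents, "\r\n") !== false;
}

$sample = "first line\r\nsecond line\r\n";
$fixed = normalizeLineEndings($sample);
```

Running `hasCrlf()` over the changed files before committing is one way to catch an editor that silently rewrites line endings.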

Member

@asmecher asmecher left a comment

Thanks, @Henrique523! This was a little tricky to review as there was quite a large changeset. When working with a stable branch, I'd suggest trying to avoid introducing changes that will result in a large changeset -- code reformatting, extra refactoring, etc. That way the reviewer will have an easy time pinpointing the exact changes being proposed, and there's less chance of regressions. There's much more leeway for making bigger changes on the main branch during a major dev cycle. I've made a few minor suggestions for change.

@Godoy0722
Contributor Author

Hello folks, and sorry for the delay on this PR! @kaitlinnewson, I read your comments and will push further commits to bring everything in line with the expected patterns. The same goes for most of the comments from @asmecher and @jonasraoni. My next commit will undo my auto-formatting so that the changes are clearer.

@asmecher
Member

asmecher commented Jul 2, 2024

@Henrique523, don't spend too much time undoing the formatting; I don't mind if it gets included here. Just a note for the future, though, so that it's easier for reviewers next time.

@Godoy0722
Contributor Author

Godoy0722 commented Jul 4, 2024

Hello again! After making some updates, I have some news about this PR which I'd like to share with you.

Undos:

Patterns and Structure:

  • Updated the array syntax (#307fd9f);
  • Removed unnecessary comments (#f14fc52);
  • Explained the reason for array_shift (#4c0ec5b);
  • Switched to single quotes wherever possible (#ce10c3a);
  • Fixed the spelling of the localized message (#6738cc9);

Bug fixes

  • Fixed submission and removed unnecessary title (#ab1c367);
  • Added an author as the primary contact (#5e65efa);
  • Made the submission PDF mime-type dynamic (#5eee836);
  • Stop the script if the file is not present (#f64f4ea);
  • Fixed the filename for the submission PDF (#c1721ed);
  • Asset paths now use the same directory as the CSV file (#e798a7a)(#e63561d);

Docs and Version

I believe these commits cover all of your requests. I have one more comment, about the documentation: I'll also propose adding guidance for this tool to the main PKP documentation (https://docs.pkp.sfu.ca/admin-guide/3.3/en/), so that it is available both alongside the code and in the docs. If you need anything else from me, or if you find more to fix before merging this PR, please let me know!

@jonasraoni
Contributor

I didn't review the PR, but just so you're prepared: the updates will need to be forwarded to the stable-3_4_0 and main branches. It's better to wait for the initial review to be completed, though, or you'll have to update too many branches.

Member

@asmecher asmecher left a comment

Looks good to me, @Godoy0722, I've just added a couple of comments. There are a few others that still need resolution, though.

@Godoy0722
Contributor Author

Hello folks. I've gone through all the comments from @asmecher, @kaitlinnewson, and @jonasraoni, and I think all your requests and suggestions are now covered. Could you please take one last look at the code? Thank you so much for your patience!

@asmecher
Member

Thanks, @Godoy0722, I've taken a last skim over it and added just a couple of comments -- we could easily go back and forth over small details like this forever, so take what you like from those and ignore the rest. But @kaitlinnewson, a last test and look over it from you would be much appreciated!

@kaitlinnewson
Member

@Godoy0722 I've added a few additional comments for things that came up when I re-tested. Looking good to me otherwise!

@Godoy0722
Contributor Author

Hello again!

I just pushed more updates. I rebased onto stable-3_3_0 to keep all the updates in one commit, removed the permission-related code that @asmecher pointed out, and fixed the last issues you raised. Because of the rebase, I had to force-push to my fork branch.

If you need anything from me because of it please let me know! And thank you again.

@Godoy0722
Contributor Author

Hello folks!

Any news about my updates from here? @asmecher @kaitlinnewson

Contributor

@jonasraoni jonasraoni left a comment

Hi @Godoy0722! I've left some comments too 😁

@Godoy0722
Contributor Author

Hello again,

here are the updates from my latest commit:

  • I rolled back the plugin version;
  • Updated the locale file and created new keys for the new usage;
  • Removed some imports and static variables from the CSV loop;
  • Cached some values as @jonasraoni suggested;
  • For the CSV sample file, I removed the "https://doi.org/" prefix as suggested by @kaitlinnewson, updated the author example, and updated the readme file to match the new author information pattern.

New Tool Structure/Behavior

The tool now works as follows:

  1. For each row, all error and inconsistency checks run before processing starts. If any check fails, the tool skips to the next row. Processing itself behaves as before.
  2. After the whole CSV file has been processed, every invalid or inconsistent row is shown on the terminal, followed by the reason.
  3. The tool generates a CSV file containing only the inconsistent rows, as explained in the readme.md file. I thought this would make it easier for users to finish their work.
  4. I changed the pattern for author information in the CSV file. The readme file describes it in more detail; I believe it also makes it easier for users to manage all author information in the CSV file.
  5. Finally, I'm using the HTML purifier on the abstract text to avoid suspicious content there.
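
The validate-first flow in steps 1 and 2 can be sketched roughly as below. This is a simplified illustration, not the plugin's code: the `REQUIRED_FIELDS` list, `validateRow`, and `partitionRows` are hypothetical names, and the real tool checks many more conditions.

```php
<?php
// Hypothetical required columns; the real tool defines its own header list.
const REQUIRED_FIELDS = ['pressPath', 'title', 'locale'];

/**
 * Returns a reason string when the row is invalid, or null when it is valid.
 */
function validateRow(array $row): ?string
{
	foreach (REQUIRED_FIELDS as $field) {
		if (!isset($row[$field]) || trim($row[$field]) === '') {
			return "Missing required field: {$field}";
		}
	}
	return null;
}

/**
 * Splits rows into valid ones and a [rowNumber => reason] map of failures.
 */
function partitionRows(array $rows): array
{
	$valid = [];
	$failures = [];
	foreach ($rows as $index => $row) {
		$reason = validateRow($row);
		if ($reason !== null) {
			$failures[$index + 1] = $reason; // 1-based row number for the report
			continue; // Move on to the next row without importing anything
		}
		$valid[] = $row;
	}
	return [$valid, $failures];
}

$rows = [
	['pressPath' => 'press', 'title' => 'Book A', 'locale' => 'en_US'],
	['pressPath' => 'press', 'title' => '', 'locale' => 'en_US'],
];
[$valid, $failures] = partitionRows($rows);

// Report every failure only after the whole input has been processed.
foreach ($failures as $rowNumber => $reason) {
	echo "Row {$rowNumber}: {$reason}\n";
}
```

Keeping validation separate from processing means a bad row never leaves a half-imported submission behind.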

I hope this explanation helps with the code review! If you have any questions for me, please let me know! Thanks, folks.

Contributor

@jonasraoni jonasraoni left a comment

Review Done 😁

Contributor

@jonasraoni jonasraoni left a comment

Next round 😁

$authorDao->insertObject($author);
} // Authors done.

$sanitizedAbstract = PKPString::stripUnsafeHtml($data->abstract);
Contributor

It's better to insert the content as it is, the sanitization will be done at the presentation layer.

Contributor Author

That was a request from Michael, because there will be other use cases for the abstract beyond the presentation layer. I asked him about this as well.

Contributor

And what was the outcome of your conversation? The entire codebase assumes the cleanup will happen in the UI.

@jonasraoni
Contributor

Ah, a general comment that I forgot to leave... If you have free time, it would be interesting to break that huge method into pieces.

@Godoy0722
Contributor Author

Hello @jonasraoni, @kaitlinnewson, and @asmecher,

I wanted to inform you of some significant updates to the codebase:

I've refactored the code, implementing several improvements including private attributes for caching and static variables. The tool is now thoroughly documented with comments, and each submission process has been organized into small sections for clarity. Additionally, I've updated some deprecated methods to their current equivalents.

To enhance reliability, all verifications now occur before processing begins, ensuring field correctness and consistency.

As for error handling, any rows that fail validation are now recorded in a separate CSV file. This file is generated in the same directory as the client's CSV and includes an additional column detailing the reason for each row's failure.
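
As a rough sketch of the error file described above: each failed row is written to a second CSV next to the source file, padded to the expected width, with the reason appended as a final column. The header list, the `invalid_` file name prefix, and `invalidRowsPath` are assumptions for illustration only; the plugin's actual naming may differ.

```php
<?php
// Hypothetical header list; the real tool has its own expected headers.
$expectedHeaders = ['pressPath', 'title', 'locale'];

/**
 * Builds the path of the invalid-rows file in the same directory as the source CSV.
 * The "invalid_" prefix is an assumption made for this sketch.
 */
function invalidRowsPath(string $sourceCsvPath): string
{
	return dirname($sourceCsvPath) . DIRECTORY_SEPARATOR . 'invalid_' . basename($sourceCsvPath);
}

$sourcePath = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'books.csv';
$invalidPath = invalidRowsPath($sourcePath);

$out = new SplFileObject($invalidPath, 'w');
// Header row: the original columns plus one extra "error" column.
// Separator, enclosure, and escape are passed explicitly to avoid relying on defaults.
$out->fputcsv(array_merge($expectedHeaders, ['error']), ',', '"', '\\');

// A short row is padded first, so the reason always lands in the last column.
$shortRow = ['press', 'Book A'];
$padded = array_pad($shortRow, count($expectedHeaders), null);
$out->fputcsv(array_merge($padded, ['Missing locale']), ',', '"', '\\');
$out = null; // Drop the object so the file handle is flushed and closed

$lines = file($invalidPath, FILE_IGNORE_NEW_LINES);
unlink($invalidPath); // Clean up the demo file
```

Because the error file keeps the original column order, a user can correct the rows, delete the last column, and feed the file back to the tool.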

I would greatly appreciate it if you could review these changes when you have the opportunity. 😄

@kaitlinnewson
Member

@Godoy0722 I'll aim to give this another test run later today

Contributor

@jonasraoni jonasraoni left a comment

@Godoy0722 I've left more comments, but they are not important, just suggestions/nitpicks... Also, I didn't test anything, but looks like @kaitlinnewson will do it 🎉

I think it's looking cleaner than before, also working better than other tools (e.g. having a file just with the errored items, almost ready to be re-imported is great), so I'll approve it.

For the stable-3_4_0 some updates will be needed (I didn't mark the comment of Alec regarding the submission_progress as resolved as a reminder).

Comment on lines 475 to 477
private function getCachedDao($daoType) {
	return $this->_daos[$daoType] ?? $this->_daos[$daoType] = DAORegistry::getDAO($daoType);
}
Contributor

There's no need to cache these DAOs; they are cheap to get. But since it's already implemented, I don't think it's worth reverting. So, this is just a note :)
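
For readers unfamiliar with the `??` plus assignment idiom used in `getCachedDao`, here is a standalone sketch of the same lookup-or-store pattern. `FakeRegistry` and `Importer` are stand-ins invented for this example; they only mimic the shape of a registry lookup and are not PKP classes.

```php
<?php
// Stand-in for a registry lookup; counts how often an object is built.
class FakeRegistry
{
	public static $lookups = 0;

	public static function get(string $name): stdClass
	{
		self::$lookups++;
		$object = new stdClass();
		$object->name = $name;
		return $object;
	}
}

class Importer
{
	/** @var array<string, stdClass> Objects already fetched, keyed by name */
	private $cache = [];

	public function getCached(string $name): stdClass
	{
		// Return the cached entry; on a miss, fetch it, store it, and return it.
		return $this->cache[$name] ?? $this->cache[$name] = FakeRegistry::get($name);
	}
}

$importer = new Importer();
$first = $importer->getCached('GenreDAO');
$second = $importer->getCached('GenreDAO');
```

The second call returns the stored object without touching the registry again, which is the whole effect of the cache; whether that saving matters depends on how expensive the lookup is.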

Comment on lines 460 to 467
$this->_dirNames = Application::getFileDirectories();
$this->_format = trim($this->_dirNames['context'], '/') . '/%d/' . trim($this->_dirNames['submission'], '/') . '/%d';

$this->_fileManager = new FileManager();
$this->_publicFileManager = new PublicFileManager();

$this->_fileService = Services::get('file');
$this->_publicationService = Services::get('publication');
Contributor

These things are cheap to initialize, but I don't see any problem with caching them (especially if you're using them in a lot of places), so I'm just leaving a note.

Comment on lines 525 to 528
private function processFailedRow($invalidRowsFile, $fields, $reason, $failedRows) {
$invalidRowsFile->fputcsv(array_merge(array_pad($fields, $this->_expectedRowSize, null), [$reason]));

return $failedRows + 1;
Contributor

If you receive the $failedRows by reference, then you can replace the occurrences of $failedRows = $this->processFailedRow(...) by $this->processFailedRow(...)

Suggested change
- private function processFailedRow($invalidRowsFile, $fields, $reason, $failedRows) {
- $invalidRowsFile->fputcsv(array_merge(array_pad($fields, $this->_expectedRowSize, null), [$reason]));
- return $failedRows + 1;
+ private function processFailedRow($invalidRowsFile, $fields, $reason, &$failedRows) {
+ $invalidRowsFile->fputcsv(array_merge(array_pad($fields, $this->_expectedRowSize, null), [$reason]));
+ ++$failedRows;

Contributor

But as you've created a couple of private variables, you could also create one for the $failedRows and the $invalidRowsFile, then you won't need to pass them by arguments.
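
The two options discussed here can be compared side by side in a small sketch. `recordFailure` and `FailureTracker` are hypothetical names used only to illustrate the difference between a by-reference counter and state kept on the object.

```php
<?php
// Option 1: the counter is passed by reference, so the caller's variable is updated.
function recordFailure(array &$log, string $reason, int &$failedRows): void
{
	$log[] = $reason;
	++$failedRows;
}

// Option 2: the counter and log live on the object, so nothing is threaded through arguments.
class FailureTracker
{
	private $failedRows = 0;
	private $log = [];

	public function record(string $reason): void
	{
		$this->log[] = $reason;
		++$this->failedRows;
	}

	public function count(): int
	{
		return $this->failedRows;
	}
}

$log = [];
$failedRows = 0;
recordFailure($log, 'Missing title', $failedRows);
recordFailure($log, 'Unknown press', $failedRows);

$tracker = new FailureTracker();
$tracker->record('Missing title');
```

Both versions remove the `$failedRows = ...` reassignment at each call site; the property-based one also shortens the method signature.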

* @param array $args
* @return string[]
*/
private function parseCommandLineArguments($scriptName, $args) {
Contributor

I don't think it's important, but if you want to follow the pattern on stable-3_3_0, private methods should be prefixed with _ too. (On stable-3_4_0 and main, the pattern was changed and the _ is no longer used.)

Suggested change
- private function parseCommandLineArguments($scriptName, $args) {
+ private function _parseCommandLineArguments($scriptName, $args) {

*/
private function createAndValidateCSVFileInvalidRows($csvForInvalidRowsName) {
$invalidRowsFile = $this->createNewFile($csvForInvalidRowsName, 'w');
$invalidRowsFile->fputcsv(array_merge(array_pad($this->_expectedHeaders, $this->_expectedRowSize, null), ['error']));
Contributor

Suggested change
- $invalidRowsFile->fputcsv(array_merge(array_pad($this->_expectedHeaders, $this->_expectedRowSize, null), ['error']));
+ $invalidRowsFile->fputcsv(array_merge($this->_expectedHeaders, ['error']));
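
A quick illustration of why the padding can be dropped for the header row, assuming the expected row size is simply the number of expected headers (the thread does not show how `$_expectedRowSize` is set, so treat that as an assumption). The header values below are placeholders.

```php
<?php
$expectedHeaders = ['pressPath', 'title', 'locale']; // placeholder headers
$expectedRowSize = count($expectedHeaders);          // assumed relationship

// Padding a list that already has the target length returns it unchanged.
$paddedHeaders = array_pad($expectedHeaders, $expectedRowSize, null);

// Padding only matters for data rows that may be shorter than expected.
$shortRow = ['press', 'Book A'];
$paddedRow = array_pad($shortRow, $expectedRowSize, null);
```

So `array_pad` is a no-op on the headers but still does useful work for incomplete data rows in `processFailedRow`.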

Comment on lines 245 to 246
// Format is:
// Press Path, Author string, title, abstract, series path, year, is_edited_volume, locale, URL to Asset, doi (optional), keywords list, subjects list, book cover image path, book cover image alt text, categories list
Contributor

I think this can be removed, we have this description on the $_expectedHeaders

}

// All requirements passed. Start processing from here.
$this->initializeStaticVariables();
Contributor

This should stay out of the loop or it ends up not caching anything.

Comment on lines 569 to 579
* Retrieves a Genre ID by Press ID. If the Genre doesn't exist, the result
* will be false
*
* @param int $pressId
*/
private function getGenreIdByPressId($pressId) {
/** @var GenreDAO $genreDao */
$genreDao = $this->getCachedDao('GenreDAO');
$genre = $genreDao->getByKey('MANUSCRIPT', $pressId);

return !$genre ? false : $genre->getId();
Contributor

@jonasraoni jonasraoni Aug 12, 2024

Hmm, I think the genre name should also be a field in the CSV file, and if it's not found, a default one could be passed as a command-line argument (if nothing is passed, we can use "MANUSCRIPT" as the default).

Suggested change
- * Retrieves a Genre ID by Press ID. If the Genre doesn't exist, the result
- * will be false
- *
- * @param int $pressId
- */
- private function getGenreIdByPressId($pressId) {
- /** @var GenreDAO $genreDao */
- $genreDao = $this->getCachedDao('GenreDAO');
- $genre = $genreDao->getByKey('MANUSCRIPT', $pressId);
- return !$genre ? false : $genre->getId();
+ * Retrieves the "Manuscript" genre's ID of a given Press ID
+ *
+ * @param int $pressId
+ * @return ?int Null if not found
+ */
+ private function getManuscriptGenreId($pressId) {
+ /** @var GenreDAO $genreDao */
+ $genreDao = $this->getCachedDao('GenreDAO');
+ $genre = $genreDao->getByKey('MANUSCRIPT', $pressId);
+ return $genre ? $genre->getId() : null;
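
One possible reading of the fallback order proposed in this comment (CSV field first, then a command-line default, then "MANUSCRIPT") is sketched below. `resolveGenreKey` and its parameters are hypothetical, and the sketch treats an empty value as "not provided"; the plugin does not necessarily implement it this way.

```php
<?php
/**
 * Picks the genre key for a row: the CSV value first, then the CLI default,
 * then "MANUSCRIPT". Both parameter names are made up for this sketch.
 */
function resolveGenreKey(?string $csvValue, ?string $cliDefault): string
{
	foreach ([$csvValue, $cliDefault] as $candidate) {
		// Treat null, empty, and whitespace-only values as "not provided".
		if ($candidate !== null && trim($candidate) !== '') {
			return trim($candidate);
		}
	}
	return 'MANUSCRIPT';
}
```

The resolved key would then be passed to the genre lookup in place of the hard-coded 'MANUSCRIPT' string.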

*
* @param int $pressId
*/
private function getUserGroupIdByPressId($pressId) {
Contributor

Not important, but as we don't have other ways to retrieve (e.g. ByX, ByY, ByZ), I think this can be dropped from the name (there are other occurrences).

$userGroupDao = $this->getCachedDao('UserGroupDAO');
$userGroup = $userGroupDao->getDefaultByRoleId($pressId, ROLE_ID_AUTHOR);

return !$userGroup ? false : $userGroup->getId();
Contributor

There are other cases returning false, I think null fits better.

Suggested change
- return !$userGroup ? false : $userGroup->getId();
+ return $userGroup ? $userGroup->getId() : null;

$coverImage['altText'] = $bookCoverImageAltText ?? '';

$destFilePath = $this->_publicFileManager->getContextFilesPath($press->getId()) . '/' . $coverImage['uploadName'];
copy($srcFilePath, $destFilePath);
Contributor

@jonasraoni jonasraoni Aug 12, 2024

It might be interesting to check the return of the copy() and perhaps other places that might fail.

p.s.: We're not using transactions nor checking if the database operations worked fine, so such things don't need to be checked 🙈

Contributor

Second comment... I think it's better to use FileManager::copy() to set up the expected permissions.
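
A minimal sketch of checking the result of a copy instead of ignoring it. It uses PHP's built-in `copy()` rather than PKP's FileManager, and `copyOrFail` is a hypothetical helper, so it shows only the return-value check and not the permission handling mentioned above.

```php
<?php
/**
 * Copies a file and reports failure instead of silently continuing.
 * Hypothetical helper built on PHP's copy(); it is not PKP's FileManager.
 */
function copyOrFail(string $source, string $destination): void
{
	if (!is_file($source)) {
		throw new RuntimeException("Source file not found: {$source}");
	}
	// copy() returns false on failure; @ hides the warning that is turned into an exception here.
	if (!@copy($source, $destination)) {
		throw new RuntimeException("Could not copy {$source} to {$destination}");
	}
}

// Demo with a temporary file standing in for a cover image.
$source = tempnam(sys_get_temp_dir(), 'cover');
file_put_contents($source, 'image-bytes');
$destination = $source . '.copy';
copyOrFail($source, $destination);

$missingError = null;
try {
	copyOrFail($source . '.missing', $destination);
} catch (RuntimeException $e) {
	$missingError = $e->getMessage();
}
```

In the import tool, a failure like this could be routed to the invalid-rows file with the exception message as the reason.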

* @param string $srcFilePath
* @param Publication $publication
*/
private function processBookCoverImage($data, $srcFilePath, $publication) {
Contributor

It's OK to type the inputs and returns of the methods; just be careful with types that can be null. But maybe it's better to leave this for stable-3_4_0.
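
To illustrate the point about nullable types: a `?` prefix on a parameter or return type accepts null in addition to the named type. `coverAltText` below is a made-up function that loosely mirrors the alt-text default seen earlier in this review; it is not the plugin's method.

```php
<?php
/**
 * Returns the alt text for a cover image, or null when no image is given.
 * The ?string declarations mark values that may legitimately be null.
 */
function coverAltText(?string $imagePath, ?string $altText): ?string
{
	if ($imagePath === null || $imagePath === '') {
		return null; // No cover image, so there is nothing to describe
	}
	// Fall back to an empty string when the CSV cell was left blank.
	return $altText ?? '';
}
```

Without the `?`, passing null for a missing CSV cell would raise a TypeError, which is the case to watch for when adding types to existing methods.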

@Godoy0722 Godoy0722 changed the title from "update csv importexport tool. ##Henrique523/stable-3_3_0" to "update csv importexport tool. ##Godoy0722/stable-3_3_0" on Aug 22, 2024
@kaitlinnewson
Member

@Godoy0722 Is this ready for another test or are there more changes planned? I wasn't able to get to it before my vacation but can do another test when it's ready!

@Godoy0722
Contributor Author

Hi there. Just pinging @kaitlinnewson: an additional review is still needed for this tool. I'd really appreciate it if you could take a look and see whether everything is in place now.

Best,
Guilherme Godoy

@kaitlinnewson
Member

@Godoy0722 I've made a few additional comments - looking good otherwise!

@Godoy0722
Contributor Author

Hello, @kaitlinnewson!

Yes, you were right. I forgot to pass this variable as a parameter to the method that processes this information. I've updated and re-tested the tool, and it seems to be working as it should. I'd appreciate it if you could take another look!

@Godoy0722
Contributor Author

P.S.: I just updated the docs and the "tab delimited" occurrences. The "submissionPdfs" and "coverImages" parts of the path were removed since the tool's behavior changed: all assets (both PDFs and cover images) must now be placed in the same directory as the CSV file. This makes the tool easier and clearer for the user to manage.
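
Resolving an asset against the CSV file's directory could look roughly like this. `resolveAssetPath` is a hypothetical helper, and stripping any directory part from the CSV value with `basename()` is an assumption of this sketch rather than documented behavior of the tool.

```php
<?php
/**
 * Resolves an asset file name against the directory that holds the CSV file.
 * Hypothetical helper; basename() drops any directory part given in the CSV value.
 */
function resolveAssetPath(string $csvPath, string $assetName): string
{
	return dirname($csvPath) . '/' . basename($assetName);
}

$pdfPath = resolveAssetPath('/data/import/books.csv', 'book-a.pdf');
$coverPath = resolveAssetPath('/data/import/books.csv', 'covers/book-a.png');
```

With this layout a user only needs to drop the PDFs and cover images next to the CSV file before running the import.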

One final note: as far as I can tell, the tool is ready, and there's nothing to add for now unless more bugs turn up. So, if everything looks correct to you, I think it's ready to be merged. Thanks for the review, @kaitlinnewson!

@kaitlinnewson
Member

Looking good to me @Godoy0722! I think @asmecher will need to do the final merge due to the previous requested changes.

@asmecher asmecher merged commit cf45b31 into pkp:stable-3_3_0 Oct 21, 2024
12 checks passed
@asmecher
Member

Merged -- thanks, all!

5 participants