Metadata injection in epub3-to-epub3 does not update dc:identifier and title in content files #22

Open
martinpub opened this issue May 20, 2021 · 11 comments

@martinpub
Collaborator

Currently, the metadata injection is able to update dc:identifier and dc:title in package.opf; however, it does nothing to the corresponding <title> and <meta name="dc:identifier" content="xxx"/> values in the content files. This causes a mismatch between the package.opf metadata and the content file metadata, and it also produces EPUB 3 files that are invalid according to the Nordic Guidelines.

Could this be easily added @bertfrees? I thought we discussed it at the spec stage, but perhaps we didn't. Also, it apparently hasn't been tested enough until now.

@bertfrees
Collaborator

This is because the EPUB specification doesn't say anything about metadata in content documents.

It could be a bit tricky to implement this because you need to detect that the metadata fields in the content and package documents do in fact correspond. For example, we could say that a title element in a content document corresponds to a dc:title element in the package document if

  • there is exactly one title element (not more than one is allowed in HTML)
  • there is exactly one dc:title element
  • their text matches

and something similar for dc:identifier vs. <meta name="dc:identifier">. We would then only update metadata fields in content documents that correspond to metadata fields in the package document that were updated (and maybe it should only be done when enabled with an option).
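As a rough illustration of the correspondence rule above, here is a minimal Python sketch using only the standard library. It is not how Pipeline implements anything: the function names are hypothetical, and the rule for dc:identifier (compare the single `<meta name="dc:identifier">` with the dc:identifier referenced by the package's unique-identifier attribute) is one assumed reading of "something similar". Presumably the check would have to run against the package document as it was before the injection, since afterwards the texts no longer match.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "h": "http://www.w3.org/1999/xhtml",
}


def _text(element: ET.Element) -> str:
    """Return the element's text with surrounding whitespace removed."""
    return (element.text or "").strip()


def title_corresponds(package: ET.Element, content: ET.Element) -> bool:
    """True if there is exactly one title, exactly one dc:title, and their text matches."""
    dc_titles = package.findall("opf:metadata/dc:title", NS)
    html_titles = content.findall("h:head/h:title", NS)
    return (
        len(dc_titles) == 1
        and len(html_titles) == 1
        and _text(dc_titles[0]) == _text(html_titles[0])
    )


def identifier_corresponds(package: ET.Element, content: ET.Element) -> bool:
    """Assumed analogue for identifiers: one meta, one unique identifier, same value."""
    uid = package.get("unique-identifier")
    identifiers = [
        e for e in package.findall("opf:metadata/dc:identifier", NS)
        if uid is not None and e.get("id") == uid
    ]
    metas = [
        m for m in content.findall("h:head/h:meta", NS)
        if m.get("name") == "dc:identifier"
    ]
    return (
        len(identifiers) == 1
        and len(metas) == 1
        and _text(identifiers[0]) == (metas[0].get("content") or "").strip()
    )
```

Both functions take already parsed root elements (for example from `ET.fromstring`), so the caller decides how the package and content documents are loaded.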

Alternatively, we could have an option similar to update-lang-attributes that synchronizes the metadata in the content documents with the package document:

  • derive <meta name="dc:identifier"> element from the dc:identifier element marked as the "unique identifier", and remove any existing <meta name="dc:identifier"> elements
  • derive title element from the dc:title element if there is exactly one (or from the first)
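The two synchronization steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the Pipeline implementation: `sync_content_metadata` is a hypothetical name, the documents are assumed to be parsed with the standard library's ElementTree, and the first dc:title is used whenever there is more than one.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"
XHTML = "http://www.w3.org/1999/xhtml"
NS = {"opf": OPF, "dc": DC, "h": XHTML}


def sync_content_metadata(package: ET.Element, content: ET.Element) -> None:
    """Overwrite title and identifier metadata in `content`, in place."""
    head = content.find("h:head", NS)
    if head is None:
        return  # nothing to synchronize without a <head>

    # The package's unique-identifier attribute holds the id of one dc:identifier.
    uid = package.get("unique-identifier")
    identifier = next(
        (e for e in package.findall("opf:metadata/dc:identifier", NS)
         if uid is not None and e.get("id") == uid),
        None,
    )
    if identifier is not None and identifier.text:
        # Remove every existing identifier meta before adding the derived one.
        for meta in head.findall("h:meta", NS):
            if meta.get("name") == "dc:identifier":
                head.remove(meta)
        ET.SubElement(
            head,
            f"{{{XHTML}}}meta",
            {"name": "dc:identifier", "content": identifier.text.strip()},
        )

    # find() returns the first dc:title, which also covers the "exactly one" case.
    dc_title = package.find("opf:metadata/dc:title", NS)
    if dc_title is not None and dc_title.text:
        title = head.find("h:title", NS)
        if title is None:
            title = ET.SubElement(head, f"{{{XHTML}}}title")
        title.text = dc_title.text.strip()
```

Unlike the correspondence check, this variant never compares old and new values; it simply forces the content document to agree with the package document, which is why it is simpler to reason about.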

@martinpub
Collaborator Author

Thanks @bertfrees, yes, I see the implementation design issues here. I like the second option for its simplicity and because it works as a kind of "force" option, like the "update-lang-attributes" toggle. Could it be named "update-contentdoc-meta"?

@bertfrees
Collaborator

Yes I think I agree.

@martinpub
Collaborator Author

That's great. Could you prioritise this over #18? This one is earlier in our production line. Also, an estimate of when it could be done would be very much appreciated, even though I know that's not always possible.

@bertfrees self-assigned this May 21, 2021

@martinpub
Collaborator Author

Hi @bertfrees, any progress here? Just checking in.

@bertfrees
Collaborator

Not yet.

@bertfrees
Collaborator

I have implemented this. It's available on the master branch of daisy/pipeline-modules: daisy/pipeline-modules@78a2e6d.

@martinpub
Collaborator Author

Great, thank you very much @bertfrees!

@kalaspuffar can I kindly request a cherry-pick to the fork?

@kalaspuffar

Hi @bertfrees and @martinpub

Of course, I could help with this, but wouldn't it be safer if Bert created a PR with this change, as usual, ensuring that all code will be merged?

Best regards
Daniel

@martinpub
Collaborator Author

Sure @kalaspuffar. @bertfrees is that possible?

@bertfrees
Collaborator

It is possible, and I'll do it, but I wish I didn't have to. I think we need to talk about how to organize this in the future.

> ensuring that all code will be merged.

IMO, the best way to make sure that everything is there is to always base your fork on the latest upstream. My suggestion is to maintain a branch that you always keep up to date with the latest upstream version of Pipeline (releases or development version) by doing git rebases. The rebases can be done at the level of the "super project" to avoid the extra technical burden of working with git-subrepo. A consequence of this is that some of the git tags that you create may end up pointing at dangling commits, but that should be fine.


3 participants