Proposal to make output of git-commit-id-maven-plugin reproducible #825

Open
algomaster99 opened this issue Feb 19, 2025 · 8 comments

@algomaster99

algomaster99 commented Feb 19, 2025

Describe the idea (required)

For the past few months, I have been playing around with the Reproducible Central dataset and I am trying to understand what causes differences. I have roughly found the following attributes (source: chains-project/reproducible-central#19) embedded by this plugin that may be unreproducible by a rebuilder.

  1. git.commit.time - there is a timezone difference.
  2. git.remote.origin.url
  3. git.tags - number of tags at a later time could be different
  4. git.build.host - build host can vary if some third party builds the project
  5. git.build.time
  6. git.build.user.email
  7. git.build.user.name
  8. git.branch - could change if the refs change and main keeps moving ahead
  9. git.local.branch.ahead
  10. git.local.branch.behind
  11. git.total.commit.count

Tell us about the expected behaviour (required)

Even though all of these attributes are important and well justified by their use cases, they can hinder a reproducibility check by a third-party builder, because some of the attributes depend on the build environment.

I can propose three solutions for this:

  1. We can make all of these attributes fixed for a particular commit. For example, git.total.commit.count can be the number of commits up until HEAD.
  2. I think this plugin generates the git.json and git.properties files. We could strip all the above properties (or maybe the entire files), as we have done for attributes in MANIFEST, e.g. SCM-Revision. What do you think @msuozzo?
  3. Another solution could be to skip generating these files while rebuilding using -Dmaven.gitcommitid.skip=true. @hboutemy we can add this to any buildspec like generation of SPDX is skipped right now.

For past releases, 2 is best.
For future releases, 3 is most convenient. 1 is also good but it may not apply to all attributes.
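
To make solution 2 concrete, here is a minimal sketch of what such a stripping step might look like. It is not the rebuilder's actual stabilizer: the class and method names are hypothetical, the key list is simply the one from this issue, and it assumes plain `key=value` lines without escaped separators. It deliberately avoids `java.util.Properties.store`, which by default writes a comment line containing the current date.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the plugin or of any rebuilder.
final class GitPropertiesStabilizer {
    // Keys listed in this issue as depending on the environment or on later repository state.
    static final Set<String> UNSTABLE_KEYS = Set.of(
        "git.commit.time", "git.remote.origin.url", "git.tags", "git.build.host",
        "git.build.time", "git.build.user.email", "git.build.user.name", "git.branch",
        "git.local.branch.ahead", "git.local.branch.behind", "git.total.commit.count");

    // Assumption: simple "key=value" lines. Comments and blank lines are dropped,
    // unstable keys are removed, and the rest is sorted for a deterministic order.
    static List<String> stabilize(List<String> lines) {
        return lines.stream()
            .map(String::trim)
            .filter(l -> !l.isEmpty() && !l.startsWith("#"))
            .filter(l -> !UNSTABLE_KEYS.contains(keyOf(l)))
            .sorted()
            .collect(Collectors.toList());
    }

    private static String keyOf(String line) {
        int eq = line.indexOf('=');
        return (eq < 0 ? line : line.substring(0, eq)).trim();
    }

    public static void main(String[] args) {
        // Illustrative input only; the values are made up.
        List<String> input = List.of(
            "# example header comment",
            "git.build.host=ci-42",
            "git.commit.id=abc123",
            "git.branch=main");
        stabilize(input).forEach(System.out::println);
    }
}
```

Removing the whole file, as discussed in the comments, is the coarser variant of the same idea and needs no parsing at all.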

Context (optional)

No response

@msuozzo

msuozzo commented Feb 19, 2025

Option 2 sounds viable to add as a stabilizer.

@algomaster99
Author

Should I PR for that? I could remove the complete file. Similar to how stripping of signatures is implemented.

@msuozzo

msuozzo commented Feb 19, 2025

Yeah I think until the implementation changes, we'd need to take the file out. I think there are different ways to include tag/branch info that's more prescriptive and avoids drift. I also don't think this would be the right place to embed the build host. But I think this is a good place to discuss those concerns while separately pursuing the stabilizer.

Thanks for digging into this!

@TheSnoozer
Collaborator

Hello,
thanks for creating this issue.
I think you raise a valid point, but I'm not quite sure what you now expect the plugin to change.

From the plugin side, I'm not responsible for the user's plugin configuration, and I also don't want to dictate how users can/should/can't use the plugin.

As a side note, from my perspective this plugin already has various options to account for better reproducibility:

  • You could apply an option to just generate the options that you consider "reproducible" by filtering all unwanted properties out. So for example you could just generate the git-commit-hash. See includeOnlyProperties. A similar behaviour could be achieved by using excludeProperties by specifically listing all unwanted options.
  • Differences in time zone could be accounted for by using dateFormatTimeZone, where one could specify a time zone which would then be used as the basis for the generated properties (e.g. 'America/Los_Angeles').
  • If you don't like the properties file the plugin generates, you don't need to use it. You could simply resort to Maven's filtering mechanism to generate your own properties file that contains only reproducible properties, even though the plugin may generate properties that do not fit your desired standard (e.g. see https://github.com/git-commit-id/git-commit-id-maven-plugin/blob/master/docs/using-the-plugin-in-more-depth.md#maven-resource-filtering)
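
The allow-list idea behind includeOnlyProperties can be illustrated with a short sketch. This is not the plugin's implementation: the class and method names are hypothetical, and it only assumes that each entry is a regular expression matched against the whole property key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of allow-list filtering; not the plugin's own code.
final class PropertyAllowList {
    // Keeps only entries whose key fully matches at least one pattern.
    // A TreeMap is used so the surviving keys come out in a stable order.
    static Map<String, String> includeOnly(Map<String, String> props, List<String> regexes) {
        List<Pattern> patterns = regexes.stream()
            .map(Pattern::compile)
            .collect(Collectors.toList());
        Map<String, String> kept = new TreeMap<>();
        props.forEach((key, value) -> {
            if (patterns.stream().anyMatch(p -> p.matcher(key).matches())) {
                kept.put(key, value);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> all = Map.of(
            "git.commit.id", "abc123",
            "git.build.host", "ci-42",
            "git.build.time", "2025-02-19T10:00:00+0100");
        System.out.println(includeOnly(all, List.of("^git\\.commit\\.id$")));
    }
}
```

An allow-list fails safe: a property added in a later plugin version stays out of the file until someone opts in, whereas an exclude list has to be kept up to date.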

With that being mentioned:
What should the plugin now do differently to offer users more "reproducible builds"?

@hboutemy
Contributor

hboutemy commented Feb 22, 2025

I recently used this plugin to get the last Git commit timestamp https://github.com/ollama4j/ollama4j/pull/93/files

as you can see, by default, the plugin is good: you just need to override the timestamp format and timezone

but no file is generated, so there is nothing with reproducibility issues

if your objective is to get more reproducible content when people activate file generation, you'll have to ask people why they generate a file, which is a completely different discussion

@hboutemy
Contributor

notice: perhaps just documenting this "enable reproducible builds with last Git commit timestamp" use case is all we need, as it's reasonable (of course, if no config was necessary, that would be even better)

@algomaster99
Author

algomaster99 commented Feb 24, 2025

but I'm not quite sure what you now expect from the plugin to change.

I should give constructive suggestions :)

git.commit.time

I suggest changing the default to UTC because java.util.TimeZone.getDefault().getID() returns the timezone of the system.
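
A small sketch of why the default matters: the same commit instant renders as different strings depending on the zone used for formatting. The class name, the pattern, and the epoch value below are illustrative assumptions, not the plugin's defaults.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustration only; the pattern is an assumption, not taken from the plugin.
final class CommitTimeZoneDemo {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssZ");

    // Formats one fixed instant in the given zone.
    static String format(long epochSeconds, ZoneId zone) {
        return FORMAT.format(Instant.ofEpochSecond(epochSeconds).atZone(zone));
    }

    public static void main(String[] args) {
        long commitTime = 1739923200L; // arbitrary instant chosen for the example
        // Pinned zone: identical output on every machine.
        System.out.println(format(commitTime, ZoneOffset.UTC));
        // Another pinned zone: same instant, different text.
        System.out.println(format(commitTime, ZoneId.of("America/Los_Angeles")));
        // System default: the text depends on where the build runs.
        System.out.println(format(commitTime, ZoneId.systemDefault()));
    }
}
```

Only the third line varies between build machines, which is the difference a rebuilder sees when the original build ran in another time zone.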

git.tags
git.total.commit.count

These two have the issue that they change for future rebuilds as the number of tags and commits changes. I suggest fixing their values for a specific commit.

git.tags until a specific commit can be computed by git tag --merged HEAD.
git.total.commit.count until a specific commit can be computed by git rev-list --count HEAD.
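
The two commands above could be wrapped as follows. This is a rough sketch that shells out to the git CLI through ProcessBuilder; it is not how the plugin reads repository data, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; assumes a git executable is available on the PATH.
final class PinnedGitFacts {
    // Runs "git <args>" inside the given working tree and returns the trimmed output.
    static String git(Path repo, String... args) throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>(List.of("git"));
        cmd.addAll(Arrays.asList(args));
        Process p = new ProcessBuilder(cmd)
            .directory(repo.toFile())
            .redirectErrorStream(true)
            .start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8).trim();
        if (p.waitFor() != 0) {
            throw new IOException("git failed: " + out);
        }
        return out;
    }

    // One tag per output line; sorted so the result does not depend on git's ordering.
    static List<String> parseTags(String output) {
        return output.lines()
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .sorted()
            .collect(Collectors.toList());
    }

    static int parseCount(String output) {
        return Integer.parseInt(output.trim());
    }

    public static void main(String[] args) throws Exception {
        Path repo = Path.of(args.length > 0 ? args[0] : ".");
        List<String> tags = parseTags(git(repo, "tag", "--merged", "HEAD"));
        int count = parseCount(git(repo, "rev-list", "--count", "HEAD"));
        System.out.println("git.tags=" + String.join(",", tags));
        System.out.println("git.total.commit.count=" + count);
    }
}
```

Two caveats apply. On a shallow clone, `git rev-list --count HEAD` counts only the commits that are present locally. And a tag created later on an ancestor of HEAD still shows up in `git tag --merged HEAD`, so the tag list is pinned only as long as no new tags are added to older commits.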

git.branch
git.local.branch.ahead
git.local.branch.behind

These also change, but pinning them for each commit seems to be overkill so it is really up to the users of your plugin. If they use it, their software won't be reproducible out of the box.

git.remote.origin.url
git.build.host
git.build.time
git.build.user.email
git.build.user.name

All of this information can also be recorded in MANIFEST.MF using some apache.maven plugins. But again I agree with you that we should let users decide.

Sorry for not posting them before and if you don't see value in them, that's also okay. We are working on the 2nd solution above. :)

@TheSnoozer
Collaborator

Thanks for the follow-up! Greatly appreciated :-)

I see your point regarding the time formats, but that can already be controlled using dateFormatTimeZone.

Let me maybe quote from https://reproducible-builds.org - "How does it work?"

First, the build system needs to be made entirely deterministic: transforming a given source must always create the same result. For example, the current date and time must not be recorded and output always has to be written in the same order.

All information generated by the plugin comes from git as the "source". IMHO you could claim that the plugin generates reproducible results when built with the same git repository [on the same machine/environment].
It seems to me that you are now asking the plugin to be deterministic/reproducible even when you "change" the underlying source (e.g. by adding commits/branches/tags, or removing branches). When you change the underlying source, the derived information may ultimately change as well. Otherwise, where should this information be stored or come from? It needs to come from somewhere! Reproducible may also mean that you can reproduce a build from 20 years ago. By then, however, I would assume you have made a lot more commits (so git.total.commit.count would by definition not be the same). So I think that, by definition, the "source" that the plugin relies on has changed, hence the result can't be expected to be the same.

Second, the set of tools used to perform the build and more generally the build environment should either be recorded or pre-defined.
Third, users should be given a way to recreate a close enough build environment, perform the build process, and validate that the output matches the original build.

Isn't that exactly what properties like git.build.host / git.build.time et al. try to achieve?

As mentioned above, the plugin will not even generate a git-properties file by default. So by default I would claim that the plugin can produce reproducible builds. Given that a user has found the option to enable generation of a git-properties file, would it be too much to ask users to review what is being generated and adjust the configuration as needed (e.g. to only generate the git-commit-hash-id)?

With that being mentioned I don't think there is anything that the plugin will or can change. The git-properties file is the user's choice already, so why should the plugin be responsible for things the user may want/need?
The only thing I see that the plugin could do is issue a warning if the generated git-properties file is non-reproducible... but assume your build log is 1000 lines or longer: is anyone actually checking for warnings? I doubt it...
