Hash changes if we change our metadata #1152

Closed
schomatis opened this issue May 16, 2022 · 4 comments
Assignees
Labels
dif/medium Prior experience is likely helpful effort/hours Estimated to take one or several hours P2 Medium: Good to have, but can wait until someone steps up subtask Issue w/ parent GH issue

Comments

@schomatis

Spawned from ipfs/kubo#8974.

tl;dr: The CID is not the hash of your file; do not rely on it being one. The normal learning path can leave you with the wrong impression that the relation between user data and the CID/hash representing it is stable.

Brief outline:

  • New users are introduced to the IPFS world through the content-based paradigm: forget where you store it, all that counts is the data itself, which we identify through its hash. In contrast with location, your (the user's) data doesn't change, and neither will its hash.
  • New users experiment with this paradigm by adding files to the IPFS system (CLI, HTTP, web, whatever) and get a CID/hash in return.
  • There is now a discrepancy in what "data" means:
    • In the theory/docs, users visualize a single block (a string of bits) of their data: what was contained in the FS file they're adding, nothing more.
    • In practice, through the UnixFS abstraction, the file is formatted into a DAG of many chunks (blocks) of the user's data. The DAG structure is supported by IPFS (not user) metadata, which is also part of the block of data being hashed and thus affects its CID.
  • The metadata leaks into the CID, whether the user cares about it or not. The same file added with different parameters (or even with the same parameters but on a newer IPFS version with different defaults) may be represented by different CIDs/hashes.
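
The outline above can be illustrated with a rough sketch. This is a toy model, not the UnixFS format or the real CID encoding: `toy_dag_root` and its "root node" layout (a chunk size followed by the chunk digests) are hypothetical and only stand in for "IPFS metadata that gets hashed together with your data". It shows that identical bytes give one plain file hash but different root hashes once the chunking parameter changes.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Plain SHA-256 of the raw bytes: what most users picture as 'the hash'."""
    return hashlib.sha256(data).hexdigest()


def toy_dag_root(data: bytes, chunk_size: int) -> str:
    """Toy Merkle-style root (hypothetical layout, NOT UnixFS).

    Each chunk is hashed on its own, then a 'root node' made of layout
    metadata (the chunk size) plus the chunk digests is hashed again.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    root_node = chunk_size.to_bytes(8, "big") + b"".join(leaf_digests)
    return hashlib.sha256(root_node).hexdigest()


# The same user data, byte for byte, in every call below.
data = b"the same user data, byte for byte " * 1000

print("raw file hash       :", file_hash(data))
print("toy root, 256 KiB   :", toy_dag_root(data, 256 * 1024))
print("toy root, 1 KiB     :", toy_dag_root(data, 1024))
```

All three printed values differ even though `data` never changes; only the layout parameter does. The same parameters always reproduce the same root, which is the sense in which the result is still immutable.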

I think this happens to a lot of people (myself included). The simplest example, a single "neutral" block of my data, is what I first think of when immutability comes up. At some point we silently jump from that single block to a file without mentioning UnixFS. UnixFS is ugly, and I get why it is not in the foreground, but you normally read that neutral, single block of yours as your file, and therefore read the immutability of the data as the immutability of its tag (the CID) too. I'm not sure when, but at some point we need to break it to you, maybe without even mentioning UnixFS and just speaking of generic metadata: we process your data and add some of our own to better organize and transmit it, and even though that metadata is also immutable, we may (very rarely) change our minds about what the best organization is. When we do, you'll see a different hash reflecting it. Kind of sucks, but that's life, and it's still much better than httping all the time. (We can omit this last remark 😬.)

@schomatis schomatis added the need/triage Needs initial labeling and prioritization label May 16, 2022
@johnnymatthews johnnymatthews moved this to Needs triage in Protocol Docs May 17, 2022
@Annamarie2019 Annamarie2019 added dif/expert Extensive knowledge (implications, ramifications) required P2 Medium: Good to have, but can wait until someone steps up effort/days Estimated to take multiple days, but less than a week and removed need/triage Needs initial labeling and prioritization labels May 22, 2022
@Annamarie2019
Contributor

Summary: Explain that we process data and add some of our own (metadata) to better organize and transmit it, and that even though this metadata is also immutable, we may (very rarely) change our minds about what the best organization is, in which case you'll see a different hash reflecting it. A good location may be the bottom of "Hashing", right before "SHA hashes WON'T match CID".

@Annamarie2019 Annamarie2019 added dif/medium Prior experience is likely helpful effort/hours Estimated to take one or several hours and removed dif/expert Extensive knowledge (implications, ramifications) required effort/days Estimated to take multiple days, but less than a week labels May 22, 2022
@Annamarie2019 Annamarie2019 changed the title Block vs DAG vs UnixFS layers Hash changes if we change our metadata May 22, 2022
@Annamarie2019 Annamarie2019 moved this from Needs triage to Backlog in Protocol Docs May 22, 2022
@lidel
Member

lidel commented Jun 10, 2022

I think this is part of a bigger "import parameters that impact the final CID" story described in #1176

@ElPaisano
Contributor

ElPaisano commented Aug 21, 2023

Reviving this as a subtask in #1674

@ElPaisano ElPaisano self-assigned this Aug 21, 2023
@ElPaisano ElPaisano added the subtask Issue w/ parent GH issue label Aug 22, 2023
@ElPaisano
Contributor

This appears to be addressed in https://docs.ipfs.tech/concepts/content-addressing/#cids-are-not-file-hashes, so I am closing this issue.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Protocol Docs Sep 25, 2023