Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent checking out discarded vault commit IDs with version #252

Closed
joshuakarp opened this issue Oct 20, 2021 · 11 comments
Closed

Prevent checking out discarded vault commit IDs with version #252

joshuakarp opened this issue Oct 20, 2021 · 11 comments
Assignees
Labels
bug Something isn't working development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management

Comments

@joshuakarp
Copy link
Contributor

Describe the bug

The version command for the vaults can currently move to commit IDs that are expected to be discarded. When checked out at an earlier commit ID and a mutation is performed, we expect to discard the later commit IDs (respective to the earlier checked out commit) such that they are no longer valid and inaccessible.

To Reproduce

  1. Perform some commits on a vault to produce a commit log that appears like c1 -> c2 -> c3
  2. Save the commit ID for c3
  3. Use version to move to c1: expected commit log now c1
  4. Perform another commit: expected commit log now c1 -> c4
  5. Save the commit ID for c4
  6. Use version to move to c3 (with its old commit ID): commit log now back to c1 -> c2 -> c3
  7. Use version to move to c4: commit log now back to c1 -> c4

Expected behavior

In step 5, instead of reverting to the earlier commit history, we would instead expect this to be an invalid operation. We want to restrict the "branching" potential for our vault log (for simplicity purposes), such that any mutations performed on a vault when checked out at an earlier commit will discard all commits later than the current one.

Additional context

@joshuakarp joshuakarp added bug Something isn't working development Standard development labels Oct 20, 2021
@joshuakarp joshuakarp changed the title Prevent checking out discarded commit IDs with version Prevent checking out discarded vault commit IDs with version Oct 21, 2021
@CMCDragonkai
Copy link
Member

In terms of GC of unreachable commits. This appears to require a reachability algorithm. I can think of a basic way by building up a set and removing elements from the set when traversing the canonical branch pointer but I'm wondering if there's a faster way that's already implemented in git https://stackoverflow.com/a/21157751. Also see https://en.m.wikipedia.org/wiki/Reachability.

Additionally because we are updating both HEAD and branch pointer on every commit, we can also eliminate unreachable commits everytime we do a commit op by walking the branch pointer commit ID to the HEAD commit ID before switching the branch pointer. This is done on-line but is exposed to transitory failures. So it's still good to have a git gc implemented internally.

@CMCDragonkai
Copy link
Member

After reviewing fsck-objects.c in upstream git. It appears that ultimately we have to scan all commit objects that exist, then mark the ones that are reachable from the canonical branch pointer and then delete the remaining unreachable objects.

There are several ways to do this:

According to this benchmark:

import b from 'benny';
import packageJson from '../package.json';

async function main () {
  let map = new Map();
  let obj = {};
  let arr = [];
  let set = new Set();
  const summary = await b.suite(
    'gitgc',
    b.add('map', async () => {
      map = new Map();
      return async () => {
        for (let i = 0; i < 1000; i++) {
          map.set(i, undefined);
        }
        for (let i = 0; i < 1000; i++) {
          map.delete(i);
        }
        for (const i of map) {
          // NOOP
        }
      }
    }),
    b.add('obj', async () => {
      obj = {};
      return async () => {
        for (let i = 0; i < 1000; i++) {
          obj[i] = undefined;
        }
        for (let i = 0; i < 1000; i++) {
          delete obj[i];
        }
        for (const i in obj) {
          // NOOP
        }
      };
    }),
    b.add('arr', async () => {
      arr = [];
      return async () => {
        for (let i = 0; i < 1000; i++) {
          arr.push({ id: i, mark: false});
        }
        for (let i = 0; i < 1000; i++) {
          arr[i].mark = true;
        }
        for (let i = 0; i < 1000; i++) {
          if (arr[i].mark === false) {
            // NOOP
          }
        }
      };
    }),
    b.add('set', async () => {
      set = new Set();
      return async () => {
        for (let i = 0; i < 1000; i++) {
          set.add(i);
        }
        for (let i = 0; i < 1000; i++) {
          set.delete(i);
        }
        for (const i of set) {
          // NOOP
        }
      };
    }),
    b.cycle(),
    b.complete(),
    b.save({
      file: 'gitgc',
      folder: 'benches/results',
      version: packageJson.version,
      details: true,
    }),
    b.save({
      file: 'gitgc',
      folder: 'benches/results',
      format: 'chart.html',
    }),
  );
  return summary;
}

if (require.main === module) {
  (async () => {
    await main();
  })();
}

export default main;

The fastest one is is obj style:

Progress: 100%

  map:
    15 782 ops/s, ±0.85%   | 7.55% slower

  obj:
    17 071 ops/s, ±1.35%   | fastest

  arr:
    8 765 ops/s, ±75.78%    | 48.66% slower

  set:
    2 766 ops/s, ±115.43%    | slowest, 83.8% slower

Finished 4 cases!
  Fastest: obj
  Slowest: set

So we just get an obj as record of all commit objects, which is O(n), then delete them as we are traversing which is O(m), which gives us back the remaining unreachable objects which can be O(o) where n = o + m. The maximum possible time spent is therefore O(2n) with memory space usage of O(n) where n is the number of commits.

@CMCDragonkai
Copy link
Member

This would only need to be run if the regular commit failed to delete old commit objects when switching the canonical branch pointer. We could achieve this by first indicating that the vault is "dirty", and then switching it back to "clean" state if everything succeeded. This can be done with the new vault sublevel, and can also help when the working directory is also "dirty". The dirty boolean then allows us to know whether we successfully complete a mutation of the vault.

@CMCDragonkai
Copy link
Member

Should redo the array benchmark with pre-declared size of the array.

@CMCDragonkai
Copy link
Member

After redoing it with arr = new Array(1000) it shows the array significantly faster:

Running "gitgc" suite...
Progress: 100%

  map:
    12 546 ops/s, ±1.49%   | slowest, 84.74% slower

  obj:
    17 644 ops/s, ±0.43%   | 78.53% slower

  arr:
    82 191 ops/s, ±0.83%   | fastest

  set:
    16 630 ops/s, ±1.25%   | 79.77% slower

Finished 4 cases!
  Fastest: arr
  Slowest: map

However the problem with this is that you have to know how many git objects at any moment, this requires a quick count I guess initially. We could do something like C++ vector, by doubling size upon resizing the array. And resizing arrays in JS is easy by just doing arr.length = ?; which will resize according to the beginning of the array.

@CMCDragonkai
Copy link
Member

One issue is that iterating over the git object database, we have to keep track of objects that are not commits.

It appears that isogit does have a function for this: https://isomorphic-git.org/docs/en/walk, however the walkers are predefined, and one of the TREE walkers defaults to start walking from HEAD. Whereas we need to consider all possible commits, so we may just need to walk the FS ourselves, and then try to read those objects using https://isomorphic-git.org/docs/en/readObject.

@CMCDragonkai
Copy link
Member

Example of walking the filesystem: https://gist.github.com/lovasoa/8691344, can be used in case the TREE walker is not useful or it can't just run from no reference.

@CMCDragonkai
Copy link
Member

@tegefaulkes what we talked about regarding garbage collection concept. Can you spec out an algorithm? You'd have to start with applying the dirty boolean.

@tegefaulkes
Copy link
Contributor

This was fixed by #335 when #266 was merged.

@CMCDragonkai
Copy link
Member

@tegefaulkes is there a test for this behaviour?

@tegefaulkes
Copy link
Contributor

tests that cover this is;

  • tests/vaults/VaultInternal.test.ts:731 garbage collection
  • tests/vaults/VaultInternal.test.ts:513 cannot checkout old commits after branching commit

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Jul 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management
Development

No branches or pull requests

4 participants