Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds transfers between stores to external attachments #1358

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented Jan 4, 2025

Currently a draft, pending final changes. Opened here for initial feedback.

Context

This PR follows on from #1320, and adds support for moving attachments between stores.

This allows a user to change the document's default store, then transfer all of the attachments from their current store(s) to the new default.

This includes transfers from internal (SQLite) storage to external storage, external to internal, and external to external (e.g MinIO to filesystem).

User-facing changes:

  • Adds an API endpoint to migrate all current attachments to the default store set for that document

Internal changes:

  • Adds methods to AttachmentFileManager to facilitate transfers from one storage to another.
  • Exposes new methods on ActiveDoc for triggering transfers and retrieving transfer status.

All of the logic behind these changes should be documented in the source code with comments - if anything is unclear, that would be great feedback.

There a few large block comments (particularly in AttachmentFileManager) that might be a good place to start when reading this code.

Related issues

#1320
#1021

Has this been tested?

  • 👍 yes, I added tests to the test suite
  • 💭 no, because this PR is a draft and still needs work (some tests missing)

@Spoffy Spoffy changed the base branch from main to spoffy/external-attachments-prototype January 4, 2025 04:23
@@ -76,6 +76,9 @@ export interface IAttachmentStore {
// implementation and gives them control over local buffering.
download(docPoolId: DocPoolId, fileId: FileId, outputStream: stream.Writable): Promise<void>;

// Remove attachment from the store
delete(docPoolId: DocPoolId, fileId: FileId): Promise<void>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually used anywhere. I've left it in for now (since it's already implemented), but we could remove it?

@berhalak berhalak self-assigned this Jan 7, 2025
@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from 9231ba5 to 70f707c Compare January 8, 2025 02:44
Base automatically changed from spoffy/external-attachments-prototype to main January 8, 2025 16:43
@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from 70f707c to cbaf592 Compare January 8, 2025 16:45
Copy link
Collaborator

@fflorent fflorent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that looks promising :).

Here are my remarks so far (I need to continue reading the code and start looking at your tests).

const causeDescriptor = causeError ? `: ${cause.message}` : '';
super(`Unable to retrieve '${fileId}' from '${storeId}'${causeDescriptor}`);
const reason = (causeError ? causeError.message : cause) ?? "";
const storeName = storeId? `'${storeId}'` : "internal storage";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(pure nitpicking)

Suggested change
const storeName = storeId? `'${storeId}'` : "internal storage";
const storeName = storeId ? `'${storeId}'` : "internal storage";

}

const fileIdents = filesToTransfer.map(file => file.ident);
for(const fileIdent of fileIdents) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also nitpicking:

Suggested change
for(const fileIdent of fileIdents) {
for (const fileIdent of fileIdents) {

* @returns {Promise[Boolean]} True if the file got attached; false if this ident already exists.
*/
public findOrAttachFile(
fileIdent: string,
fileData: Buffer | undefined,
storageId?: string,
shouldUpdate: boolean = false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am usually not very comfortable with boolean flag params, due to their lack of expressivity.

But am I right to say the shouldUpdate may be the default? I only see the flag is kept to false only in tests.

In any case, I would suggest for adding more expressivity the following, if that makes sense to you too:

  • make this function private;
  • introducing 2 new functions named updateOrAttachFile (where you call the private function with shouldUpdate = true) and attachFileIfNew (with shouldUpdate = false);
  • maybe (or maybe not, that's arguable if this is now private) change shouldUpdate: boolean into { shouldUpdate = false }: {shouldUpdate: boolean } = {} or similar;

let isNewFile = true;

try {
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment that used to be line 793 gave some reason that is still useful, I think. So I would add some more context:

Suggested change
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
// Even when attempting to attach a new file exclusively (and do nothing when it exists),
// it's a good idea to first check the existence of the fileIdent and if not insert the file and its data

await db.run('UPDATE _gristsys_Files SET data=?, storageId=? WHERE ident=?', fileData, storageId, fileIdent);
}

return isNewFile;
Copy link
Collaborator

@fflorent fflorent Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is acceptable as it is for me, so please skip if you don't agree much about my idea.

I feel like, even it introduces a supplementary SQL query to maintain, we would increase the readability by separating the query to check the existence of the file and do the INSERT/UPDATE. Am I right to say the below code would be equivalent?

const exists = await db.run('SELECT 1 from _gristsys_Files where ident = ?', fileIdent); // Not very sure what is returned here
if (exists && shouldUpdate) {
  await db.run('UPDATE _gristsys_Files SET data=?, storageId=? WHERE ident=?', fileData, storageId, fileIdent);
} else {
  await db.run('INSERT INTO _gristsys_Files(ident, data, storageId) VALUES (?, ?, ?)', fileIdent, fileData, storageId);
}
return exists;

}
finally {
// If a transfer request comes in mid-transfer, it will need re-running.
if (this._pendingFileTransfers.get(fileIdent) == targetStoreId) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would strict equality work?

Suggested change
if (this._pendingFileTransfers.get(fileIdent) == targetStoreId) {
if (this._pendingFileTransfers.get(fileIdent) === targetStoreId) {

throw new StoreNotAvailableError(storeId);

const fileInfoNoData = await this._docStorage.getFileInfo(fileIdent, false);
const fileExists = fileInfoNoData != null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would strict equality work?

Suggested change
const fileExists = fileInfoNoData != null;
const fileExists = fileInfoNoData !== null;

*/
public getFileInfo(fileIdent: string): Promise<FileInfo | null> {
return this.get('SELECT ident, storageId, data FROM _gristsys_Files WHERE ident=?', fileIdent)
public getFileInfo(fileIdent: string, includeData: boolean = true): Promise<FileInfo | null> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of the following to avoid the boolean flag param:

  public getFileInfo(fileIdent: string): Promise<FileInfo | null> {
    return this._getFileInfo(fileIdent, 'ident, storageId');
  }
  public getFileInfoWithData(fileIdent: string): Promise<FileInfo | null> {
    return this._getFileInfo(fileIdent, 'ident, storageId, data');
  }
  // ...
  private _getFileInfo(fileIdent: string, columns: string): Promise<FileInfo | null> {
    return this.get(`SELECT ${columns} FROM _gristsys_Files WHERE ident=?`, fileIdent)
      .... // just like what you did
   }

return {
fileIdent,
isNewFile: !fileExists,
};
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a bit hard to read due to all of the conditions entangled. The good news is that it is private and you have a public method to call it, so you can rework it serenely.

I feel like it would be less complex to split this function into two different ones. I made the following, but have not tested:

  public async addFile(
    storeId: AttachmentStoreId | undefined,
    fileExtension: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const fileIdent = await this._getFileIdentifier(fileExtension, Readable.from(fileData));
    return storeId ? 
      this._addFileToExternalStore(storeId, fileIdent, fileData) : 
      this._addFileToLocalStore(fileIdent, fileData);
  }


  // ...

  private async _addFileToExternalStore(
    destStoreId: AttachmentStoreId,
    fileIdent: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const destStore = await this._getStore(destStoreId);
    if (!destStore) {
      this._log.warn({ fileIdent, storeId: destStoreId }, "tried to fetch attachment from an unavailable store");
      throw new StoreNotAvailableError(destStoreId);
    }
    const fileInfoNoData = await this._docStorage.getFileInfo(fileIdent, false);
    const fileExists = fileInfoNoData !== null;
    if (fileExists) {
      // File is already stored in a different store (e.g because store has changed and no migration has happened
      const isFileInTargetStore = destStoreId === fileInfoNoData.storageId;

      // Only exit early if the file is stored elsewhere of if the file exists in the store, 
      // otherwise we should allow users to fix any missing files
      // by proceeding to the normal upload logic.
      const fileAlreadyExistsInStore = isFileInTargetStore && await destStore.exists(this._getDocPoolId(), fileIdent);

      if (!isFileInTargetStore || fileAlreadyExistsInStore) {
        return {
          fileIdent,
          isNewFile: false,
        };
      }
    }
    // There's a possible race condition if anything changed the record between the initial checks
    // in this method, and the database being updated below - any changes will be overwritten.
    // However, the database will always end up referencing a valid file, and the pool-based file
    // deletion guarantees any files in external storage will be cleaned up eventually.
    await this._storeFileInAttachmentStore(destStore, fileIdent, fileData);

    return {
      fileIdent,
      isNewFile: !fileExists,
    };
  }

  private async _addFileToLocalStore(
    fileIdent: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const fileInfoNoData = await this._docStorage.getFileInfo(fileIdent, false);
    const fileExists = fileInfoNoData !== null;
    if (!fileExists) {
      await this._storeFileInLocalStorage(fileIdent, fileData);
    }
    return {
      fileIdent,
      isNewFile: !fileExists,
    };
}

}
}

// There's a possible race condition if anything changed the record between the initial checks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
// There's a possible race condition if anything changed the record between the initial checks
// There's a possible race condition in anything changed the record between the initial checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants