Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Further refactor and cleanup SnapshotManager #4090

Merged

Conversation

scottsand-db
Copy link
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR further refactors and cleans up SnapshotManager. We consolidate all of our getLogSegment methods to just one. It has a deep implementation. This PR adds nice, clean header blocks to describe the steps during log segment construction.

How was this patch tested?

Refactor. Update existing UTs.

Does this PR introduce any user-facing changes?

No.

@scottsand-db scottsand-db self-assigned this Jan 26, 2025
final CheckpointInstance maxCheckpoint =
versionToLoadOpt.map(CheckpointInstance::new).orElse(CheckpointInstance.MAX_VALUE);
logger.debug("lastCheckpoint: {}", maxCheckpoint);
final List<FileStatus> listedCheckpointFileStatuses = listedCheckpointAndDeltaFileStatuses._1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think we need the listed prefix for all these variables (these plus those above); seems clear enough and can shorten them all a bit

Copy link
Collaborator Author

@scottsand-db scottsand-db Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually disagree. There are the checkpoint file statuses we have listed and then there's the latest+complete checkpoint file statuses that we end up selecting. I think emphasizing that these are the listed ones -- but that doesn't mean they are the ones we are selecting to be in the log segmnt -- is important.

See below:

////////////////////////////////////////////////////////////////////////////////////////////
// Step 10: Grab the actual checkpoint file statuses for latestCompleteCheckpointVersion. //
////////////////////////////////////////////////////////////////////////////////////////////

final List<FileStatus> latestCompleteCheckpointFileStatuses =

Comment on lines +112 to +126
def createMockFSAndJsonEngineForLastCheckpoint(
contents: Seq[FileStatus], lastCheckpointVersion: Optional[java.lang.Long]): Engine = {
mockEngine(
fileSystemClient = new MockListFromFileSystemClient(listFromProvider(contents)),
jsonHandler = if (lastCheckpointVersion.isPresent) {
new MockReadLastCheckpointFileJsonHandler(
s"$logPath/_last_checkpoint",
lastCheckpointVersion.get()
)
} else {
null
}
)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to put this here instead of with the MockReadLastCheckpointFileJsonHandler?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is next to all the other createMockFS____ utilities, thought it would be best here

@scottsand-db scottsand-db force-pushed the kernel_snapshot_manager_refactor_4 branch from 6ccd651 to 1f489b3 Compare January 28, 2025 00:48
Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scottsand-db scottsand-db merged commit c04cb06 into delta-io:master Jan 29, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants