Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File locking module data store #3139

Merged
merged 45 commits into from
Aug 21, 2024
Merged

File locking module data store #3139

merged 45 commits into from
Aug 21, 2024

Conversation

doriable
Copy link
Member

@doriable doriable commented Jul 8, 2024

This adds file locking to the cache implementation of ModuleDataStore, using a
module.lock file. This fixes a race condition we had with concurrent
reads/writes for module data by concurrent invocations of the buf CLI.

In this implementation, we are taking a shared lock on <module_key>/module.lock
when reading from the cache, and taking an exclusive lock for writes.

When clearing out the module directory, the module.lock file is deleted last
to ensure that we hold the lock while everything else is being cleared out.

When writing data to the cache, we first take a shared lock and check module.yaml. If
module.yaml contains valid data, then we do not need to do an extraneous write. If not,
then we can upgrade to an exclusive lock, check module.yaml again, and if it is valid, then
we do not write, otherwise we proceed to writing the data.

A separate module.lock file is used instead of locking on an existing file, e.g. module.yaml
since that is the intended use of the flock package, since the file handler is not exposed
from the lock and Windows prevents the same process from accessing a region on a file it
has taken an exclusive lock on: (https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex)

If the locking process opens the file a second time, it cannot access the specified region through this second handle until it unlocks the region.

This PR also adds support for setting a default lock timeout and retry delay on filelock.Locker using LockerOptions when constructing the Locker. If any timeouts and/or retry delays are passed to Lock through LockOptions, then those will take precedence.

doriable added 5 commits July 4, 2024 15:13
This adds a test for concurrent cache reads/writes from _different_
caches. This is meant to capture the race condition that occurs when
concurrent, independent invocations of the CLI interacts with the cache.
Copy link
Contributor

github-actions bot commented Jul 8, 2024

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedAug 21, 2024, 9:15 PM

.golangci.yml Outdated Show resolved Hide resolved
private/buf/bufcli/cache.go Outdated Show resolved Hide resolved
@doriable doriable requested a review from pkwarren July 9, 2024 15:35
private/pkg/filelock/locker.go Outdated Show resolved Hide resolved
private/pkg/filelock/locker.go Outdated Show resolved Hide resolved
@pkwarren
Copy link
Member

We might consider landing #3123 before this so we can update to the latest filelock version. It requires Go 1.21+ in all of the more recent releases.

doriable added 2 commits July 11, 2024 09:32
This adds some module.lock files to tests that read `testdata` directly
as a cache.
@doriable doriable requested review from pkwarren and bufdev July 24, 2024 17:04
private/pkg/filelock/filelock.go Outdated Show resolved Hide resolved
private/buf/bufcli/cache.go Outdated Show resolved Hide resolved
make/buf/all.mk Outdated Show resolved Hide resolved
private/buf/bufcli/cache.go Outdated Show resolved Hide resolved
private/buf/bufcli/cache.go Outdated Show resolved Hide resolved
data, err := storage.ReadPath(ctx, moduleCacheBucket, externalModuleDataFileName)
p.logDebugModuleKey(
moduleKey,
fmt.Sprintf("module data store put read check %s", externalModuleDataFileName),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this copied from previous code? I don't understand what this is, perhaps it was my fault though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not -- this is part of the comment above:

		// Before writing to the module directory, first get a shared lock and check module.yaml

We are reading the module.yaml preemptively before attempting to write new data to the cache.

}
return p.bucket.DeleteAll(ctx, dirPath)
}

func (p *moduleDataStore) putModuleData(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really difficult to understand what the code in here is trying to accomplish now. We need specific documentation as we go, and comment within explaining what this code block is trying to accomplish. Something like "we're going to first check for X, if that's not the case, we're going to do this overwrite like this, etc etc".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So getModuleDataForModuleKey currently only does read operations, and each step of that read is documented... we can add more documentation, but I need some clarification on which parts.

As the last comment in getModuleDataForModuleKey addresses, validity of content is always determined by module.yaml, and that is always expected to be written last.

Any error and or invalid module.yaml is returned as an error, and then handled by fetching new data and putting it into the cache. There is nothing being overwritten/changed by getModuleDataForModuleKey, it is only reading the cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to the documentation? I can't follow what is going on here.

}
return p.bucket.DeleteAll(ctx, dirPath)
}

func (p *moduleDataStore) putModuleData(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anywhere where invalid data is cleaned up anymore. Perhaps I missed it. What happens if I have invalid data, say file "a.proto", and the valid data is only the file "b.proto", both written to same cache directory...will "a.proto" not be there anymore? Basically, I don't understand how the deleteInvalidModuleData case is now handled. It appears to only be handled for tar files (getModuleDataForModuleKey calls bucket.Delete).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the validity of the contents of v3/modules/<digest_type>/<module_key> is determined by the presence of a valid module.yaml file. As documented at the end of getModuleDataForModuleKey, we rely on module.yaml to be written last.

So the case you're describing, where there is an invalid a.proto but a valid b.proto, this occurs because of some interruption to the process while writing a.proto. In which case, no module.yaml would be written, this would not be considered valid, and then new data is fetched, etc.

// This directory is used to store lock files for synchronizing reading and writing module data from the cache.
//
// Normalized.
v3CacheModuleLockRelDirPath = normalpath.Join("v3", "module_locks")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wellknowntypes doesn't use underscores, this shouldn't either, for consistency - modulelocks

if err := p.bucket.Delete(ctx, tarPath); err != nil {
return nil, err
}
// Return a path error indicating the module data was not found
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be a PR comment - this should explained in the code for future readers who may question it. It should be clear to a reader at first glance why we are returning a fs.PathError instead of the error itself from reading the code comments - perhaps this wasn't the case in the past, but it needs to be going forward.

}
}()
// Only attempt to get a file lock when storing individual files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what this means - what is an "individual file"? It should be clear from the code comments as to what this code does. This may take a paragraph or two of comments.

}
return p.bucket.DeleteAll(ctx, dirPath)
}

func (p *moduleDataStore) putModuleData(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to the documentation? I can't follow what is going on here.

@bufdev bufdev merged commit cb33efb into main Aug 21, 2024
12 checks passed
@bufdev bufdev deleted the file-locking-module-data-store branch August 21, 2024 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants