main: add option to ignore rule cache #1898
first off, i'm sorry that you were bitten by this! i can only imagine that was pretty annoying to waste time on. i'm a little hesitant that we should add a new cli argument for this, since (ideally) no capa user would ever provide the flag. the cache detects changes to rule content but not source code content. the flag would only be relevant to capa developers that change capa logic (such as rule parsing). could we instead disable the cache when running from source (e.g. when installed by …)?
This also got me before, so the idea is good. I agree with Willi that another CLI argument should be avoided (plus I don't think I would necessarily remember it anyway). So, some automatic handling like also inspecting the hash of rule-related files sounds good.
Maybe we could introduce a new environment variable (e.g. …). @williballenthin's suggestion is also good. We could modify … This way, whenever the capa source code changes, the cache identifier will be different, and the existing cache will be invalidated. A new cache will be created the next time …
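A minimal sketch of the environment-variable approach; the variable name `CAPA_IGNORE_RULE_CACHE` is hypothetical (the actual name was elided above), and this is an illustration rather than capa's implementation:

```python
import os


def should_use_rule_cache() -> bool:
    """Return False when the (hypothetical) CAPA_IGNORE_RULE_CACHE
    environment variable is set to a truthy value, so developers can
    opt out of the rule cache without a new CLI flag."""
    value = os.environ.get("CAPA_IGNORE_RULE_CACHE", "")
    return value.strip().lower() not in ("1", "true", "yes")
```

A developer would export the variable once in their shell and forget about it, while ordinary users never see it.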
I'm not sure how to compute the set of file names that are used as source code, and I'm hesitant about getting bogged down figuring that out. If it's easy, then I'm ok exploring this a bit more. I wonder if there's some way to interact with the Python interpreter's cache (pyc files) and derive the info that way. Or could we use …? Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?
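For the pyc idea above, here is a hedged sketch: CPython's timestamp-based `.pyc` files (3.7+) record the source file's mtime and size in their 16-byte header, so we can detect whether a source file has changed since the interpreter last compiled it, without hashing anything. This is an illustration, not capa code:

```python
import importlib.util
import struct
from pathlib import Path


def pyc_records_current_source(source_path: Path) -> bool:
    """Return True if the cached .pyc for source_path matches the
    source file's current mtime and size, i.e. the source has not
    changed since the interpreter last compiled it."""
    pyc_path = Path(importlib.util.cache_from_source(str(source_path)))
    if not pyc_path.exists():
        return False
    header = pyc_path.read_bytes()[:16]
    # CPython 3.7+ .pyc layout: magic (4 bytes), flags (4), mtime (4), size (4)
    _magic, flags, mtime, size = struct.unpack("<IIII", header)
    if flags & 0b11:
        # hash-based pyc: the header stores a source hash, not an mtime;
        # a fuller sketch would compare that hash instead.
        return False
    st = source_path.stat()
    return mtime == (int(st.st_mtime) & 0xFFFFFFFF) and size == (st.st_size & 0xFFFFFFFF)
```

The obvious caveat is that pyc files only exist for modules the interpreter has actually imported and been able to cache.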
Basically for source code, I was thinking about focusing on the …. Here is an example:

```python
import hashlib
from pathlib import Path
from typing import List

import capa.version

CacheIdentifier = str  # str alias, mirroring capa's cache module


def compute_cache_identifier(rule_content: List[bytes]) -> CacheIdentifier:
    hasher = hashlib.sha256()

    # note that this changes with each release,
    # so cache identifiers will never collide across releases.
    version = capa.version.__version__
    hasher.update(version.encode("utf-8"))
    hasher.update(b"\x00")

    # add the hash of the source files.
    # sort for a deterministic order, since rglob order is not guaranteed.
    source_dir = Path(__file__).parent.parent
    source_files = sorted(source_dir.rglob("*.py"))
    for source_file in source_files:
        source_content = source_file.read_bytes()
        hasher.update(hashlib.sha256(source_content).digest())

    rule_hashes = sorted([hashlib.sha256(buf).hexdigest() for buf in rule_content])
    for rule_hash in rule_hashes:
        hasher.update(rule_hash.encode("ascii"))
        hasher.update(b"\x00")

    return hasher.hexdigest()
```

I believe this will introduce unnecessary overhead each time a user edits a file and re-runs capa; it will be noticeable.
We can. We just need to compute the hash using one of the aforementioned methods and alert. Users can then choose to ignore the warning, and generate the cache on-demand when needed.
I understand the case we're trying to handle is that devs change source code in a way that invalidates the rules cache and it confuses them. So we can assume that this scenario involves a dev, and therefore git is present. And furthermore, we can rely on git to report the files that are tracked and have been modified, and only hash those ones. This avoids the problem of inadvertently including irrelevant files in the hash.
See https://github.com/mandiant/capa-rules/blob/master/.github/scripts/create_releases.py for an example usage of git in one of our scripts.
I find this command to be suitable to our need:

```
$ git ls-files --deleted --modified --exclude-standard --full-name --deduplicate -v
R removed.txt           <- file was removed (rm removed.txt)
R renamed.txt           <- file was renamed (mv tracked.txt renamed.txt)
C capa/rules/cache.py   <- file was modified (vim cache.py)
```

Then we can filter out the deletions (marked as R) and hash only the remaining files.
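A sketch of that filtering step, under the assumption that we invoke the command above via subprocess and keep only the `C` (changed) entries; the function names here are illustrative, not capa's:

```python
import subprocess
from pathlib import Path
from typing import List


def parse_changed_files(listing: str) -> List[str]:
    """Parse `git ls-files ... -v` output, keeping modified files
    (tag C) and dropping deletions/renames (tag R)."""
    changed = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        tag, _, path = line.partition(" ")
        if tag == "C":
            changed.append(path)
    return changed


def modified_source_files(repo_root: Path) -> List[str]:
    """Ask git for tracked-and-modified files under repo_root."""
    out = subprocess.run(
        ["git", "ls-files", "--deleted", "--modified",
         "--exclude-standard", "--full-name", "--deduplicate", "-v"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    return parse_changed_files(out)
```

Since deleted files can't be hashed anyway, dropping the `R` entries also avoids a crash when opening them.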
Looks great! We'll also want to incorporate the git commit hash. This is shaping up well.
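A sketch of folding the commit hash in: `git rev-parse HEAD` is the usual way to obtain it, and mixing it into the existing identifier (rather than replacing it) keeps the rule-content component intact. Names here are illustrative:

```python
import hashlib
import subprocess
from pathlib import Path


def current_commit(repo_root: Path) -> str:
    """Return the current git commit hash, or "" outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_root, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return ""


def mix_commit_into_identifier(base_identifier: str, commit: str) -> str:
    """Derive a cache identifier that also varies with the commit hash."""
    hasher = hashlib.sha256()
    hasher.update(base_identifier.encode("ascii"))
    hasher.update(b"\x00")
    hasher.update(commit.encode("ascii"))
    return hasher.hexdigest()
```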
Yeah, this should help track committed changes which may affect rules / the cache.
As another alternative, can we compare the timestamps of …?
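The timestamp comparison could look like this minimal sketch, assuming we can enumerate the relevant source files; names are illustrative:

```python
from pathlib import Path
from typing import Iterable


def cache_is_fresh(cache_path: Path, source_files: Iterable[Path]) -> bool:
    """Return True if the cache file is newer than every source file,
    i.e. no source has been edited since the cache was written."""
    if not cache_path.exists():
        return False
    cache_mtime = cache_path.stat().st_mtime
    return all(src.stat().st_mtime <= cache_mtime for src in source_files)
```

This is cheap (one `stat` per file, no hashing), though mtimes can lie after checkouts or copies, which is presumably why the hash-based approaches were raised first.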
Regardless of the solution discussed after my initial message, it appears that we'll still need to introduce a CLI argument, environment variable, etc. to control when the solution is executed. Otherwise, we'll be introducing overhead to all future invocations of capa just to handle the narrow case where a developer doesn't want the cache to confuse their debugging?
I think we can guess that we might be running in a dev environment very quickly: …

If these pass, we can then look for the …. Therefore, I think it may still be possible to enable this for all runs, assuming we order the checks correctly.
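The exact checks were elided above, but one cheap heuristic, stated purely as an assumption here, is the presence of a `.git` directory next to the source tree, which ordinary pip installs won't have:

```python
from pathlib import Path


def looks_like_dev_environment(source_root: Path) -> bool:
    """Heuristic: a .git directory next to the source tree suggests
    capa is running from a development checkout, not an installed
    package. A single stat call, so it's cheap enough for every run."""
    return (source_root / ".git").exists()
```

Because the check is a single filesystem lookup, gating the expensive hashing behind it keeps the common (non-dev) invocation essentially free.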
Please note that the auto-cache generation approach will leave users with a lot of stale cache files in the cache dir.
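If stale files become a problem, a small pruning pass could cap the cache directory. The `*.cache` naming and the keep count below are assumptions for illustration, not capa's actual layout:

```python
from pathlib import Path


def prune_stale_caches(cache_dir: Path, keep: int = 3) -> int:
    """Delete all but the `keep` most recently modified cache files;
    return the number of files removed."""
    files = sorted(
        cache_dir.glob("*.cache"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    removed = 0
    for stale in files[keep:]:
        stale.unlink()
        removed += 1
    return removed
```

Running this right after writing a fresh cache would bound the directory's growth without any user-visible behavior change.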
capa's rule caching is great but not obvious. This caused a huge headache when debugging #1897, as the problem code was skipped entirely when capa used its local rule cache. I suggest we add a command-line option like `--no-rule-cache` to make it easier to disable the cache for situations like this. Otherwise, debugging code related to rule parsing requires finding (via the `--debug` option) and deleting the rule cache between subsequent executions.