main: add option to ignore rule cache #1898

mike-hunhoff · 2023-12-09T00:36:04Z

capa's rule caching is great but not obvious. This caused a huge headache when debugging #1897 as the problem code was skipped entirely when capa used its local rule cache. I suggest we add a command-line option like --no-rule-cache to make it easier to disable the cache for situations like this. Otherwise, debugging code related to rule parsing requires finding (via the --debug option) and deleting the rule cache between subsequent executions.


williballenthin · 2023-12-09T06:50:08Z

first off, i'm sorry that you were bitten by this! i can only imagine that was pretty annoying to waste time on.

i'm a little hesitant that we should add a new cli argument for this, since (ideally) no capa user would ever provide the flag. the cache detects changes to rule content but not source code content. the flag would only be relevant to capa developers that change capa logic (such as rule parsing).

could we instead disable the cache when running from source (eg. when installed by pip install -e .) and/or when run with --debug? or, if in source mode, use a hash of the capa source to derive the cache key?

mr-tz · 2023-12-09T06:56:40Z

This also got me before so the idea is good. I agree with Willi that another CLI argument should be avoided (plus I don't think I necessarily would remember it anyway). So, some automatic handling like also inspecting the hash of rule-related files sounds good.

fariss · 2024-05-28T00:47:06Z

Maybe we could introduce a new environment variable (e.g. DISABLE_CAPA_CACHE=1) instead of the CLI argument?
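
A minimal sketch of how such an environment variable could be checked. The name DISABLE_CAPA_CACHE comes from the suggestion above; the helper `is_cache_disabled` and the set of accepted values are hypothetical and not part of capa.

```python
import os
from typing import Mapping, Optional

# Variable name taken from the suggestion in this thread; hypothetical, not an existing capa setting.
DISABLE_CACHE_ENV_VAR = "DISABLE_CAPA_CACHE"


def is_cache_disabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the environment asks to bypass the rule cache."""
    if environ is None:
        environ = os.environ
    value = environ.get(DISABLE_CACHE_ENV_VAR, "")
    # Assumption: accept a few common "truthy" spellings, case-insensitively.
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

The caller would consult this once before loading the cached ruleset and fall back to parsing the rules from disk when it returns True.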

@williballenthin's suggestion is also good. We could modify compute_cache_identifier to compute the cache ID not only based on the capa version and rules content, but also by including the hash of the source files.

This way, whenever the capa source code changes, the cache identifier will be different, and the existing cache will be invalidated. A new cache will be created the next time cache_ruleset is called. The only caveat (i.e. performance downgrade) here could be that we have to read in the source files to compute their hash. What do you think? I can draft a PR to test this out.

williballenthin · 2024-05-28T04:43:26Z

I'm not sure how to compute the set of file names that are used as source code, and I'm hesitant about getting bogged down figuring that out. If it's easy, then I'm ok exploring this a bit more.

I wonder if there's some way to interact with the Python interpreter's cache (pyc files) and derive the info that way.

Or could we use git status of the source repository?? Maybe this is simplest.

Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

fariss · 2024-05-29T02:17:21Z

Basically for source code, I was thinking about focusing on the *.py files.

Here is an example:

import hashlib
from typing import List
from pathlib import Path

import capa.version

# CacheIdentifier is assumed to be the existing type alias in capa/rules/cache.py

def compute_cache_identifier(rule_content: List[bytes]) -> CacheIdentifier:
    hash = hashlib.sha256()

    # note that this changes with each release,
    # so cache identifiers will never collide across releases.
    version = capa.version.__version__

    hash.update(version.encode("utf-8"))
    hash.update(b"\x00")

    # add the hash of each source file.
    # sort the paths so the identifier doesn't depend on directory traversal order.
    source_dir = Path(__file__).parent.parent
    for source_file in sorted(source_dir.rglob("*.py")):
        hash.update(hashlib.sha256(source_file.read_bytes()).digest())

    rule_hashes = sorted([hashlib.sha256(buf).hexdigest() for buf in rule_content])
    for rule_hash in rule_hashes:
        hash.update(rule_hash.encode("ascii"))
        hash.update(b"\x00")

    return hash.hexdigest()

I believe this will introduce unnecessary overhead each time a user edits a file and re-runs capa, and it will be noticeable.

> Or could we use git status of the source repository?? Maybe this is simplest.

git sounds like a good way to track changes, just unsure about how practical it is.

> Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

We can. We just need to compute the hash using one of the aforementioned methods and alert. Users can then choose to ignore the warning, and generate the cache on-demand when needed.

williballenthin · 2024-05-29T07:25:57Z

> git sounds like a good way to track changes, just unsure about how practical it is.

I understand the case we're trying to handle is that devs change source code in a way that invalidates the rules cache and it confuses them. So we can assume that this scenario involves a dev, and therefore git is present. And furthermore, we can rely on git to report the files that are tracked and have been modified, and only hash those ones.

This avoids the problem of inadvertently including irrelevant files in the hash.

mr-tz · 2024-05-29T09:50:32Z

See https://github.com/mandiant/capa-rules/blob/master/.github/scripts/create_releases.py for an example usage of git in one of our scripts.

fariss · 2024-06-04T00:20:22Z

I find this command suitable for our needs:

git ls-files --deleted --modified --exclude-standard --full-name --deduplicate -v
R removed.txt                                   <- file was removed (rm removed.txt)
R renamed.txt                                   <- file was renamed (mv tracked.txt renamed.txt)
C capa/rules/cache.py                           <- file was modified (vim cache.py)

Then we can filter out the deletions (marked as R). This will leave us with only the tracked, modified files.
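
The filtering step could be sketched roughly as follows. The function names are hypothetical, and the sketch assumes the output format shown above (a one-letter tag, a space, then the path). Since a deleted file may be listed under both the R and C tags, any path that also appears as R is dropped.

```python
import subprocess
from pathlib import Path
from typing import List


def parse_modified_paths(ls_files_output: str) -> List[str]:
    """Keep paths tagged C (modified) and drop anything that is tagged R (removed)."""
    modified = set()
    removed = set()
    for line in ls_files_output.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<tag> <path>"; split on the first space only.
        tag, _, path = line.partition(" ")
        if tag == "C":
            modified.add(path)
        elif tag == "R":
            removed.add(path)
    return sorted(modified - removed)


def list_modified_tracked_files(repo_root: Path) -> List[str]:
    """Run the git command from the comment above and return the modified, still-present paths."""
    result = subprocess.run(
        ["git", "ls-files", "--deleted", "--modified", "--exclude-standard",
         "--full-name", "--deduplicate", "-v"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_modified_paths(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the filter easy to test without a repository.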

williballenthin · 2024-06-04T04:54:33Z

Looks great!

We'll also want to incorporate the git commit hash.

This is shaping up well.
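
One way to combine the two inputs, as a sketch only: mix the commit hash (for example, the output of `git rev-parse HEAD`) with the contents of the modified tracked files. The function name `compute_source_fingerprint` and its signature are hypothetical and not part of capa.

```python
import hashlib
from typing import Mapping


def compute_source_fingerprint(commit_hash: str, modified_files: Mapping[str, bytes]) -> str:
    """Derive one digest from the current commit and the contents of modified tracked files.

    modified_files maps a repository-relative path to that file's current bytes.
    """
    h = hashlib.sha256()

    # Committed changes are captured by the commit hash.
    h.update(commit_hash.encode("ascii"))
    h.update(b"\x00")

    # Uncommitted changes are captured by hashing each modified file.
    # Iterate in sorted order so the result does not depend on mapping order.
    for path in sorted(modified_files):
        h.update(path.encode("utf-8"))
        h.update(b"\x00")
        h.update(hashlib.sha256(modified_files[path]).digest())

    return h.hexdigest()
```

The resulting digest could then be fed into the cache identifier alongside the capa version and the rule hashes.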

mr-tz · 2024-06-04T09:59:29Z

> We'll also want to incorporate the git commit hash.

Yeah, this should help track committed changes which may affect rules / the cache.

mr-tz · 2024-06-04T16:56:27Z

As another alternative, could we compare the timestamps of capa/rules/cache.py vs. the most recent cache and print a warning that this may result in unexpected behavior? We should keep this simple and minimally intrusive.
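
A rough sketch of that timestamp comparison, assuming the caller already knows the path of the most recent cache file. Both function names and the warning text are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def source_is_newer_than_cache(source_file: Path, cache_file: Path) -> bool:
    """Return True when source_file was modified after cache_file was written."""
    if not cache_file.exists():
        # No cache yet, so there is nothing that could be stale.
        return False
    return source_file.stat().st_mtime > cache_file.stat().st_mtime


def warn_if_cache_may_be_stale(source_file: Path, cache_file: Path) -> bool:
    """Log a warning when the cache predates the source file; return whether it warned."""
    if source_is_newer_than_cache(source_file, cache_file):
        logger.warning(
            "%s changed after the rule cache %s was written; results may be unexpected",
            source_file,
            cache_file,
        )
        return True
    return False
```

This only needs two stat calls, so it stays cheap, but it only notices edits to the one source file it is given.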

mike-hunhoff · 2024-06-04T17:06:15Z

Regardless of the solution discussed after my initial message it appears that we'll still need to introduce a CLI argument, environment variable, etc. to control when the solution is executed. Otherwise, we'll be introducing overhead to all future invocations of capa just to handle a small use case where a developer may not want the cache to confuse their development?

williballenthin · 2024-06-04T18:22:35Z

> Otherwise, we'll be introducing overhead to all future invocations of capa just to handle a small use case where a developer may not want the cache to confuse their development?

I think we can guess that we might be running in a dev environment very quickly:

- we're not running under PyInstaller, which is by far the most common way to invoke capa, and
- capa.main.__file__ doesn't contain site-packages, which should be very quick to check. (this might take some real world testing but i think the idea will work)

If these pass, we can then look for the .git directory (slightly slower) and then do the strategies already discussed (which will be fairly slow, but still only like 0.25s or so).

Therefore, I think it may still be possible to enable this for all runs, assuming we order the checks correctly.
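
The ordering of these checks might look like the sketch below. It assumes PyInstaller bundles can be recognized via `sys.frozen` and `sys._MEIPASS`, and that a source checkout has a `.git` entry in some parent directory of the module file. The function names are hypothetical.

```python
import sys
from pathlib import Path
from typing import Optional


def is_running_frozen() -> bool:
    """Return True inside a PyInstaller bundle, which sets sys.frozen and sys._MEIPASS."""
    return bool(getattr(sys, "frozen", False)) and hasattr(sys, "_MEIPASS")


def is_dev_environment(module_file: str, frozen: Optional[bool] = None) -> bool:
    """Guess whether module_file (e.g. capa.main.__file__) lives in a git checkout.

    The checks run from cheapest to most expensive, as proposed above.
    """
    if frozen is None:
        frozen = is_running_frozen()
    # 1. PyInstaller builds are never dev environments.
    if frozen:
        return False

    # 2. Installed packages live under site-packages.
    path = Path(module_file).resolve()
    if "site-packages" in path.parts:
        return False

    # 3. Slightly slower: look for a .git entry in any parent directory.
    return any((parent / ".git").exists() for parent in path.parents)
```

Only when this returns True would the slower git-based strategies need to run, so regular invocations pay for at most a couple of attribute and string checks.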

fariss · 2024-06-05T21:40:06Z

Please note that the auto-cache generation approach will leave users with a lot of stale cache files in the cache dir.

mike-hunhoff added enhancement New feature or request question Further information is requested labels Dec 9, 2023

mr-tz added gsoc Work related to Google Summer of Code project. usability Related to using capa and displaying results (CLI/GUI) labels May 22, 2024

mr-tz added this to @s-ff GSoC 2024 May 22, 2024

fariss moved this to In progress in @s-ff GSoC 2024 May 26, 2024

fariss moved this from In progress to Backlog in @s-ff GSoC 2024 May 26, 2024

fariss moved this from Backlog to In progress in @s-ff GSoC 2024 May 28, 2024

fariss moved this from In progress to Backlog in @s-ff GSoC 2024 May 28, 2024

fariss moved this from Backlog to Ready in @s-ff GSoC 2024 May 30, 2024

mr-tz mentioned this issue Jun 3, 2024

tighten rule pre-selection #2080

Closed

6 tasks

fariss mentioned this issue Jun 7, 2024

feat: auto-generate ruleset cache on source change #2133

Merged

fariss moved this from Ready to In review in @s-ff GSoC 2024 Jun 10, 2024

mr-tz closed this as completed in #2133 Aug 26, 2024

github-project-automation bot moved this from In review to Done in @s-ff GSoC 2024 Aug 26, 2024
