main: add option to ignore rule cache #1898

Closed
mike-hunhoff opened this issue Dec 9, 2023 · 14 comments · Fixed by #2133
Labels
enhancement New feature or request gsoc Work related to Google Summer of Code project. question Further information is requested usability Related to using capa and displaying results (CLI/GUI)

Comments

@mike-hunhoff
Collaborator

capa's rule caching is great but not obvious. This caused a huge headache when debugging #1897 as the problem code was skipped entirely when capa used its local rule cache. I suggest we add a command-line option like --no-rule-cache to make it easier to disable the cache for situations like this. Otherwise, debugging code related to rule parsing requires finding (via the --debug option) and deleting the rule cache between subsequent executions.

@mike-hunhoff mike-hunhoff added enhancement New feature or request question Further information is requested labels Dec 9, 2023
@williballenthin
Collaborator

williballenthin commented Dec 9, 2023

first off, i'm sorry that you were bitten by this! i can only imagine that was pretty annoying to waste time on.

i'm a little hesitant that we should add a new cli argument for this, since (ideally) no capa user would ever provide the flag. the cache detects changes to rule content but not source code content. the flag would only be relevant to capa developers that change capa logic (such as rule parsing).

could we instead disable the cache when running from source (eg. when installed by pip install -e .) and/or when run with --debug? or, if in source mode, use a hash of the capa source to derive the cache key?

@mr-tz
Collaborator

mr-tz commented Dec 9, 2023

This also got me before so the idea is good. I agree with Willi that another CLI argument should be avoided (plus I don't think I necessarily would remember it anyway). So, some automatic handling like also inspecting the hash of rule-related files sounds good.

@mr-tz mr-tz added gsoc Work related to Google Summer of Code project. usability Related to using capa and displaying results (CLI/GUI) labels May 22, 2024
@fariss fariss moved this to In progress in @s-ff GSoC 2024 May 26, 2024
@fariss fariss moved this from In progress to Backlog in @s-ff GSoC 2024 May 26, 2024
@fariss
Collaborator

fariss commented May 28, 2024

Maybe we could introduce a new environment variable (e.g. DISABLE_CAPA_CACHE=1) instead of the CLI argument?
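A minimal sketch of how such an environment variable could be checked. The variable name comes from the suggestion above and the helper name `is_cache_disabled` is hypothetical; neither is an existing capa API.

```python
import os

# Hypothetical variable name taken from the suggestion above;
# capa does not necessarily define or honor it.
ENV_DISABLE_CACHE = "DISABLE_CAPA_CACHE"


def is_cache_disabled(environ=None) -> bool:
    """Return True when the environment asks to bypass the rule cache."""
    if environ is None:
        environ = os.environ
    value = environ.get(ENV_DISABLE_CACHE, "")
    # treat unset, empty, "0", "false", and "no" as "cache stays enabled"
    return value.strip().lower() not in ("", "0", "false", "no")
```

The caller would consult this once at startup and skip both loading and writing the cache when it returns True.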

@williballenthin's suggestion is also good. We could modify compute_cache_identifier to compute the cache ID not only based on the capa version and rules content, but also by including the hash of the source files.

This way, whenever the capa source code changes, the cache identifier will be different, and the existing cache will be invalidated. A new cache will be created the next time cache_ruleset is called. The only caveat (i.e. performance downgrade) here could be that we have to read in the source files to compute their hash. What do you think? I can draft a PR to test this out.

@fariss fariss moved this from Backlog to In progress in @s-ff GSoC 2024 May 28, 2024
@fariss fariss moved this from In progress to Backlog in @s-ff GSoC 2024 May 28, 2024
@williballenthin
Collaborator

I'm not sure how to compute the set of file names that are used as source code, and I'm hesitant about getting bogged down figuring that out. If it's easy, then I'm ok exploring this a bit more.

I wonder if there's some way to interact with the Python interpreter's cache (pyc files) and derive the info that way.

Or could we use git status of the source repository?? Maybe this is simplest.

Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

@fariss
Collaborator

fariss commented May 29, 2024

Basically for source code, I was thinking about focusing on the *.py files.

Here is an example:
import hashlib
from typing import List
from pathlib import Path

import capa.version

# CacheIdentifier is assumed to be the str alias already in scope in capa/rules/cache.py

def compute_cache_identifier(rule_content: List[bytes]) -> CacheIdentifier:
    h = hashlib.sha256()

    # note that this changes with each release,
    # so cache identifiers will never collide across releases.
    version = capa.version.__version__

    h.update(version.encode("utf-8"))
    h.update(b"\x00")

    # add the hash of each source file under the capa package directory;
    # sort the paths so the identifier doesn't depend on directory traversal order
    source_dir = Path(__file__).parent.parent
    for source_file in sorted(source_dir.rglob("*.py")):
        h.update(hashlib.sha256(source_file.read_bytes()).digest())

    # sort the rule hashes so the identifier doesn't depend on rule order
    rule_hashes = sorted(hashlib.sha256(buf).hexdigest() for buf in rule_content)
    for rule_hash in rule_hashes:
        h.update(rule_hash.encode("ascii"))
        h.update(b"\x00")

    return h.hexdigest()

I believe this will introduce unnecessary overhead each time a user edits a file and re-runs capa, and it will be noticeable.

> Or could we use git status of the source repository?? Maybe this is simplest.

git sounds like a good way to track changes, just unsure about how practical it is.

> Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

We can. We just need to compute the hash using one of the aforementioned methods and alert. Users can then choose to ignore the warning, and generate the cache on-demand when needed.

@williballenthin
Collaborator

williballenthin commented May 29, 2024

> git sounds like a good way to track changes, just unsure about how practical it is.

I understand the case we're trying to handle is that devs change source code in a way that invalidates the rules cache and it confuses them. So we can assume that this scenario involves a dev, and therefore git is present. And furthermore, we can rely on git to report the files that are tracked and have been modified, and only hash those ones.

This avoids the problem of inadvertently including irrelevant files in the hash.

@mr-tz
Collaborator

mr-tz commented May 29, 2024

See https://github.com/mandiant/capa-rules/blob/master/.github/scripts/create_releases.py for an example usage of git in one of our scripts.

@fariss fariss moved this from Backlog to Ready in @s-ff GSoC 2024 May 30, 2024
@mr-tz mr-tz mentioned this issue Jun 3, 2024
6 tasks
@fariss
Collaborator

fariss commented Jun 4, 2024

I find this command suits our needs:

git ls-files --deleted --modified --exclude-standard --full-name --deduplicate -v               
R removed.txt                                   <- file was removed (rm removed.txt)
R renamed.txt                                   <- file was renamed (mv tracked.txt renamed.txt)
C capa/rules/cache.py                           <- file was modified (vim cache.py)

Then we can filter out the deletions (marked as R). This will leave us with only the tracked, modified files.
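The filtering step could be sketched as follows. This assumes the output format shown above (a one-letter tag, a space, then the path); the function names are hypothetical and not part of capa. Paths containing unusual characters may be quoted by git, which this sketch does not handle.

```python
import subprocess
from pathlib import Path
from typing import List


def parse_modified_files(ls_files_output: str) -> List[str]:
    """Keep paths tagged "C" (changed) and drop any path tagged "R" (removed)."""
    changed, removed = set(), set()
    for line in ls_files_output.splitlines():
        tag, _, path = line.partition(" ")
        if not path:
            continue
        if tag == "C":
            changed.add(path)
        elif tag == "R":
            removed.add(path)
    # a removed file is dropped even if it is also listed as changed
    return sorted(changed - removed)


def list_modified_files(repo: Path) -> List[str]:
    """Run the command quoted above inside `repo` and return modified, tracked paths."""
    result = subprocess.run(
        ["git", "ls-files", "--deleted", "--modified", "--exclude-standard",
         "--full-name", "--deduplicate", "-v"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return parse_modified_files(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the filter easy to test without a git checkout.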

@williballenthin
Collaborator

Looks great!

We'll also want to incorporate the git commit hash.

This is shaping up well.

@mr-tz
Collaborator

mr-tz commented Jun 4, 2024

> We'll also want to incorporate the git commit hash.

Yeah, this should help track committed changes which may affect rules / the cache.
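One way the commit hash and the modified files might be combined into a single value is sketched below. `git rev-parse HEAD` is a standard git command; the function names and the exact way the digest is built are assumptions for illustration, not the approach taken in capa.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Dict


def get_commit_hash(repo: Path) -> str:
    """Return the hash of the checked-out commit via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def source_fingerprint(commit: str, modified: Dict[str, bytes]) -> str:
    """Mix the commit hash and the contents of modified files into one digest."""
    h = hashlib.sha256()
    h.update(commit.encode("ascii"))
    h.update(b"\x00")
    # iterate in sorted order so the digest is stable across runs
    for path in sorted(modified):
        h.update(path.encode("utf-8"))
        h.update(b"\x00")
        h.update(hashlib.sha256(modified[path]).digest())
    return h.hexdigest()
```

The commit hash covers committed changes, while the per-file digests cover uncommitted edits to tracked files.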

@mr-tz
Collaborator

mr-tz commented Jun 4, 2024

As another alternative, can we compare the timestamps of capa/rules/cache.py vs. the most recent cache and print out a warning that this may result in unexpected behavior? We should keep this simple and minimally intrusive.
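A rough sketch of the timestamp comparison, assuming both paths are known to the caller. The function names and the warning text are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def is_cache_older_than_source(source_file: Path, cache_file: Path) -> bool:
    """True when the source file was modified after the cache file was written."""
    return source_file.stat().st_mtime > cache_file.stat().st_mtime


def warn_if_cache_stale(source_file: Path, cache_file: Path) -> bool:
    """Log a warning when the cache predates the source file; return the verdict."""
    stale = is_cache_older_than_source(source_file, cache_file)
    if stale:
        logger.warning(
            "%s changed after rule cache %s was written; results may be unexpected",
            source_file,
            cache_file,
        )
    return stale
```

This only reads two file timestamps, so it adds very little work per run, but it watches a single source file rather than everything involved in rule parsing.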

@mike-hunhoff
Collaborator Author

Regardless of the solution discussed after my initial message, it appears that we'll still need to introduce a CLI argument, environment variable, etc. to control when the solution is executed. Otherwise, we'll be introducing overhead to all future invocations of capa just to handle a small use case where a developer may not want the cache to confuse their development?

@williballenthin
Collaborator

> Otherwise, we'll be introducing overhead to all future invocations of capa just to handle a small use case where a developer may not want the cache to confuse their development?

I think we can guess that we might be running in a dev environment very quickly:

  • we're not running under PyInstaller, which is by far the most common way to invoke capa, and
  • capa.main.__file__ doesn't contain site-packages, which should be very quick to check. (this might take some real world testing but i think the idea will work)

If these pass, we can then look for the .git directory (slightly slower) and then do the strategies already discussed (which will be fairly slow, but still only like 0.25s or so).

Therefore, I think it may still be possible to enable this for all runs, assuming we order the checks correctly.
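The ordering of checks described above might look like this sketch. The PyInstaller test relies on the `sys.frozen` and `sys._MEIPASS` attributes that PyInstaller sets in bundled executables; the function names are hypothetical, and the site-packages test is the heuristic from the comment, which may miss layouts such as Debian's dist-packages.

```python
import sys
from pathlib import Path
from typing import Optional


def is_running_from_pyinstaller() -> bool:
    # PyInstaller sets both attributes inside a bundled executable
    return bool(getattr(sys, "frozen", False)) and hasattr(sys, "_MEIPASS")


def is_installed_in_site_packages(module_file: str) -> bool:
    return "site-packages" in Path(module_file).parts


def find_git_root(module_file: str) -> Optional[Path]:
    """Walk up from the module's directory looking for a .git entry."""
    for parent in Path(module_file).resolve().parents:
        if (parent / ".git").exists():
            return parent
    return None


def is_dev_environment(module_file: str) -> bool:
    """Run the cheapest checks first; touch the filesystem only if they pass."""
    if is_running_from_pyinstaller():
        return False
    if is_installed_in_site_packages(module_file):
        return False
    return find_git_root(module_file) is not None
```

A caller would pass something like `capa.main.__file__` and run the slower git-based steps only when this returns True.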

@fariss
Collaborator

fariss commented Jun 5, 2024

Please note that the auto-cache generation approach will leave users with a lot of stale cache files in the cache dir.

@fariss fariss moved this from Ready to In review in @s-ff GSoC 2024 Jun 10, 2024
@github-project-automation github-project-automation bot moved this from In review to Done in @s-ff GSoC 2024 Aug 26, 2024
Projects
Status: Done
4 participants