
add lint to avoid duplicate features #2250

Closed
mr-tz opened this issue Aug 2, 2024 · 13 comments · Fixed by #2573
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

@mr-tz (Collaborator) commented Aug 2, 2024:

e.g., duplicate API features such as those removed in mandiant/capa-rules#916

mr-tz added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 2, 2024
@v1bh475u (Contributor) commented:

Hello @mr-tz, I want to work on this issue. Can you please assign it to me?
Also, I wanted to ask which Python libraries I cannot use in the lint (e.g., can I use collections, or is that not desired?).

@williballenthin (Collaborator) commented:

you can use anything in the standard library (that is, anything that doesn't require installation from PyPI), as well as anything we already rely upon (see pyproject.toml or requirements.txt). we can also add dependencies as necessary, though we'd prefer a good reason to do so, since they add a little risk and maintenance burden.

@williballenthin (Collaborator) commented:

also, go ahead and work on it!

@v1bh475u (Contributor) commented:

To what extent do we want to remove duplicate features? We can detect duplicate features under the same statement (and, or, not, some, etc.). But do we want it to analyze complex cases, e.g., simplifying the boolean expression a|(a&<some_other_expression>) into a, where a is some feature? Do we also want to add a lint for ors nested under one or, and something similar for the other such statements?
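(For reference, the complex case mentioned here is the boolean absorption law, a ∨ (a ∧ b) = a. A quick, purely illustrative check with sympy, which is not a capa dependency:)

```python
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

# absorption law: a | (a & b) reduces to just a
a, b = symbols("a b")
print(simplify_logic(a | (a & b)))  # prints: a
```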

@williballenthin (Collaborator) commented:

let's start with the most trivial duplicates, such as the same value under AND or OR. that's what we fixed in the linked PR.
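(A minimal sketch of the trivial check described here, assuming the lint receives each statement's direct children as (line number, feature text) pairs; the function name and input shape are hypothetical, not capa's actual lint API:)

```python
from collections import defaultdict

def find_duplicate_features(children):
    """Map each feature to the line numbers where it appears under a
    single `and`/`or` statement, and report features seen more than once.

    `children` is a hypothetical (line number, feature text) list; the
    real lint would walk capa's parsed statement tree instead.
    """
    occurrences = defaultdict(list)
    for lineno, feature in children:
        occurrences[feature].append(lineno)
    return {feature: lines for feature, lines in occurrences.items() if len(lines) > 1}

# mirrors the duplicate import reported in the lint output below;
# the middle entry is made-up filler
children = [
    (17, "import: mscoree._cordllmain"),
    (18, "import: kernel32.LoadLibraryA"),
    (19, "import: mscoree._cordllmain"),
]
print(find_duplicate_features(children))  # {'import: mscoree._cordllmain': [17, 19]}
```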

@v1bh475u (Contributor) commented:

INFO     lint: collecting potentially referenced samples                                                                                                   lint.py:1034
⠹ linting rule: compiled to the .NET platform        0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 rule • 0:00:00 < -:--:-- • 00.0 rule/s  
 compiled to the .NET platform
  FAIL: rule contains a duplicate feature under `or`/`and` statement: remove the duplicate features
        duplicate line: "      - import: mscoree._cordllmain" : line numbers: 17, 19


rules with FAIL:
  - compiled to the .NET platform

My current implementation will print in this format; is it fine? It takes care of duplicate features under a single statement (one of or, and, not, optional, and some) and can be extended to other statements.

@williballenthin (Collaborator) commented:

that's great!

@v1bh475u (Contributor) commented:

@williballenthin I have made the PR, but there are some issues with it. We cannot use the rule passed as an argument for detecting duplicate features: when the Rule class retrieves features from the YAML file, it simply ignores the redundancies. Hence, we have to rely on definition, which contains the entire unparsed YAML file as a string. What I have done is iterate through the lines, collect the features under the same statement, and, if there is any duplicate feature, record its line number in a list and add it to the recommendation for fixing. It works and has detected quite a few duplicate features in some rules. However, it fails for certain multi-line features. For example:

        - string: /dbghelp\.dll/i
          description: WindBG
        - string: /dbghelp\.dll/i
          description: WINE

This is because the feature spans multiple lines. So I wanted to ask what your suggestions are for proceeding with this issue.
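(One possible way around this, sketched under the assumption that we scan the raw definition string line by line: match only dashed `key: value` lines as features, skip description entries, both the indented property form above and `- description:` list items, and compare feature values only, so two features differing solely in description still collide. This is illustrative, not the PR's actual code:)

```python
import re
from collections import defaultdict

# matches dashed feature lines like `- string: /dbghelp\.dll/i` or
# `- number: 0x65 = e`; statement lines such as `- and:` carry no
# value after the colon and therefore do not match
FEATURE = re.compile(r"^\s*- (\w[\w ]*): *(.+?)\s*$")

def find_duplicates(definition: str) -> dict[str, list[int]]:
    occurrences: dict[str, list[int]] = defaultdict(list)
    for lineno, line in enumerate(definition.splitlines(), start=1):
        m = FEATURE.match(line)
        if not m:
            # comments and continuation lines like `description: WINE`
            # are skipped, so features differing only in description
            # still compare as duplicates
            continue
        key, value = m.groups()
        if key == "description":
            continue  # `- description: ...` list items are not features
        # naive normalization: drop a trailing ` = X` annotation so that
        # `number: 0x65 = e` and `number: 0x65` compare as equal
        value = value.split(" = ")[0]
        occurrences[f"{key}: {value}"].append(lineno)
    # note: for brevity this ignores statement scoping (indentation);
    # the real lint must group features per enclosing statement
    return {feat: lines for feat, lines in occurrences.items() if len(lines) > 1}

definition = """\
- string: /dbghelp\\.dll/i
  description: WinDbg
- string: /dbghelp\\.dll/i
  description: WINE
"""
print(find_duplicates(definition))  # {'string: /dbghelp\\.dll/i': [1, 3]}
```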

@v1bh475u (Contributor) commented:

In create-tcp-socket-via-raw-afd-driver.yml,

        # in the example code, in debug mode, the array is constructed bytewise on the stack
        - basic block:
          - and:
            - description: bExtendedAttributes for IPv4 TCP on stack, bytewise
            # i've kept the values approximately in order while removing some duplicates for clarity
            - number: 0x00
            - number: 0x0F
            - number: 0x1E
            - number: 0x41 = A
            - number: 0x66 = f
            - number: 0x64 = d
            - number: 0x4F = O
            - number: 0x70 = p
            - number: 0x65 = e
            - number: 0x6E = n
            - number: 0x50 = P
            - number: 0x61 = a
            - number: 0x63 = c
            - number: 0x6B = k
            - number: 0x65 = e
            - number: 0x74 = t
            - number: 0x58 = X
            - number: 0x02
            - number: 0x01
            - number: 0x06
            - number: 0x00
            - number: 0x60
            - number: 0xEF
            - number: 0x3D
            - number: 0x47
            - number: 0xFE

Are `- number: 0x65 = e` and `- number: 0x00` to be considered redundant?

@mr-tz (Collaborator, Author) commented Jan 25, 2025:

Yes

@v1bh475u (Contributor) commented:

OK, then. My updated lint has found some duplicate features in the current set of rules, and because of that my commits are unable to pass the rule_linter check. So, shall I make a PR on capa-rules with the duplicate features removed?

@mr-tz (Collaborator, Author) commented Jan 25, 2025:

Yes, that would be great 👍🏼

@v1bh475u (Contributor) commented:

I have marked my PR for this issue as ready for review and also opened a PR in capa-rules that removes the duplicate features. Please review them and suggest changes if needed.
