VMRay and dynamic improvements #2537

Merged
merged 5 commits into from
Dec 17, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,9 @@

 ### Bug Fixes

+- vmray: load more analysis archives @mr-tz
+- dynamic: only check file limitations for static file formats @mr-tz
+
 ### capa Explorer Web

 ### capa Explorer IDA Pro plugin
20 changes: 17 additions & 3 deletions capa/features/extractors/vmray/__init__.py
@@ -34,9 +34,10 @@ class VMRayMonitorProcess:
     pid: int  # process ID assigned by OS
     ppid: int  # parent process ID assigned by OS
     monitor_id: int  # unique ID assigned to process by VMRay
+    origin_monitor_id: int  # unique VMRay ID of parent process
     image_name: str
-    filename: str
-    cmd_line: str
+    filename: Optional[str] = ""
+    cmd_line: Optional[str] = ""


 class VMRayAnalysis:
@@ -165,6 +166,7 @@ def _compute_monitor_processes(self):
                 process.os_pid,
                 ppid,
                 process.monitor_id,
+                process.origin_monitor_id,
                 process.image_name,
                 process.filename,
                 process.cmd_line,
@@ -176,6 +178,7 @@ def _compute_monitor_processes(self):
                 monitor_process.os_pid,
                 monitor_process.os_parent_pid,
                 monitor_process.process_id,
+                monitor_process.parent_id,
                 monitor_process.image_name,
                 monitor_process.filename,
                 monitor_process.cmd_line,
@@ -185,7 +188,18 @@ def _compute_monitor_processes(self):
                 self.monitor_processes[monitor_process.process_id] = vmray_monitor_process
             else:
                 # we expect monitor processes recorded in both SummaryV2.json and flog.xml to equal
-                assert self.monitor_processes[monitor_process.process_id] == vmray_monitor_process
+                # to ensure this, we compare the pid, monitor_id, and origin_monitor_id
+                # for the other fields we've observed cases with slight deviations, e.g.,
+                # the ppid for a process in flog.xml is not set correctly, all other data is equal
+                sv2p = self.monitor_processes[monitor_process.process_id]
+                if self.monitor_processes[monitor_process.process_id] != vmray_monitor_process:
+                    logger.debug("processes differ: %s (sv2) vs. %s (flog)", sv2p, vmray_monitor_process)
+
+                assert (sv2p.pid, sv2p.monitor_id, sv2p.origin_monitor_id) == (
+                    vmray_monitor_process.pid,
+                    vmray_monitor_process.monitor_id,
+                    vmray_monitor_process.origin_monitor_id,
+                )
Comment on lines +198 to +202
Collaborator Author

thoughts on being even more lenient here and not asserting this but just reporting it?
I'm encountering more inconsistencies between the two files, e.g., the monitor_id not being set

Collaborator Author

so we could do the check and just log it

Collaborator

I'm hesitant to be lenient here because we rely on sane process and thread monitor IDs for indexing. I'd consider differences between SummaryV2.json and flog.xml to be VMRay bugs and, if true, can we trust capa's results at that point?

Collaborator Author

Right, let's leave as is until we find more samples that fail.
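
For readers following the thread, the compromise in this hunk (log any difference, but assert only on the identifying fields) can be sketched roughly as below. This is a minimal illustration, not capa's code: `MonitorProcess`, `key_fields`, and `check_consistent` are hypothetical names, and the sample values are made up.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class MonitorProcess:
    # hypothetical stand-in mirroring the fields of VMRayMonitorProcess in the diff
    pid: int
    ppid: int
    monitor_id: int
    origin_monitor_id: int
    image_name: str
    filename: Optional[str] = ""
    cmd_line: Optional[str] = ""


def key_fields(p: MonitorProcess) -> tuple[int, int, int]:
    # the fields used for indexing, so they must agree between both sources
    return (p.pid, p.monitor_id, p.origin_monitor_id)


def check_consistent(sv2p: MonitorProcess, flogp: MonitorProcess) -> bool:
    """Return True if both records are fully equal.

    Differences in non-key fields (e.g. ppid) are only logged;
    differences in key fields raise AssertionError.
    """
    identical = sv2p == flogp
    if not identical:
        logger.debug("processes differ: %s (sv2) vs. %s (flog)", sv2p, flogp)
    assert key_fields(sv2p) == key_fields(flogp)
    return identical


# made-up records: same process, but flog reports ppid 0
sv2 = MonitorProcess(1234, 1000, 1, 0, "sample.exe", "c:\\sample.exe", "sample.exe")
flog = MonitorProcess(1234, 0, 1, 0, "sample.exe", "c:\\sample.exe", "sample.exe")
result = check_consistent(sv2, flog)  # False: tolerated, only logged
```

A mismatch in `monitor_id` would still fail the assertion, which matches the reviewer's point that the monitor IDs are relied on for indexing.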


     def _compute_monitor_threads(self):
         for monitor_thread in self.flog.analysis.monitor_threads:
7 changes: 4 additions & 3 deletions capa/features/extractors/vmray/models.py
@@ -276,7 +276,7 @@ class ElfFileHeader(BaseModel):

 class ElfFile(BaseModel):
     # file_header: ElfFileHeader
-    sections: list[ElfFileSection]
+    sections: list[ElfFileSection] = []
Collaborator

incidentally, is this the correct way to set the default value, particularly as a list? i see this pattern used throughout the file.

my worry is that the default value = [] uses the same instance of a mutable list, rather than copies of it. sorta like when you have a kwarg parameter def foo(bar=[]).

in the past, i've used pydantic.Field for these. but maybe pydantic is extra smart and doesn't require this. @mr-tz @mike-hunhoff

Collaborator Author

hm, great question, this is how mypy accepted the change and I saw the pattern throughout. Other files use Optional[list[<foo>]] = None or Field; we should clean up the inconsistencies (separately).

Collaborator

My understanding is that pydantic handles this correctly (i.e. deep copy) for non-hashable default values (e.g. lists). source: https://docs.pydantic.dev/latest/concepts/models/#fields-with-non-hashable-default-values

Collaborator

whoa that's cool!

so, looks good to me. and maybe we can update our remaining code to use this pattern.
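
As background for the worry raised above, here is a small stdlib-only sketch of the shared-default pitfall with `def foo(bar=[])`, plus the standard-library analogue of a per-instance default. It deliberately does not use pydantic, so it says nothing about pydantic's behaviour beyond what the linked docs state; `append_shared`, `append_fresh`, and `ElfFileSketch` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


def append_shared(item, bucket=[]):
    # pitfall: the default list is created once, when the function is defined
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # common workaround: create a new list on every call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


@dataclass
class ElfFileSketch:
    # hypothetical stand-in, not the pydantic model from models.py;
    # default_factory builds a new list for each instance
    sections: list[str] = field(default_factory=list)


first = append_shared("a")
second = append_shared("b")
same_object = first is second  # both calls returned the very same list

a, b = ElfFileSketch(), ElfFileSketch()
a.sections.append(".text")  # does not affect b.sections
```

Plain dataclasses reject `sections: list[str] = []` outright at class definition time, which is why `default_factory` is used in the sketch.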



class StaticData(BaseModel):
@@ -314,10 +314,11 @@ class Process(BaseModel):
     # is_ioc: bool
     monitor_id: int
     # monitor_reason: str
+    origin_monitor_id: int  # VMRay ID of parent process
     os_pid: int
-    filename: SanitizedString
+    filename: Optional[SanitizedString] = ""
     image_name: str
-    cmd_line: SanitizedString
+    cmd_line: Optional[SanitizedString] = ""
     ref_parent_process: Optional[GenericReference] = None
13 changes: 7 additions & 6 deletions capa/main.py
@@ -748,15 +748,13 @@ def find_file_limitations_from_cli(args, rules: RuleSet, file_extractors: list[F
     args:
       args: The parsed command line arguments from `install_common_args`.

+    Dynamic feature extractors can handle packed samples and do not need to be considered here.
+
     raises:
       ShouldExitError: if the program is invoked incorrectly and should exit.
     """
     found_file_limitation = False
     for file_extractor in file_extractors:
-        if isinstance(file_extractor, DynamicFeatureExtractor):
-            # Dynamic feature extractors can handle packed samples
-            continue
-
         try:
             pure_file_capabilities, _ = find_file_capabilities(rules, file_extractor, {})
         except PEFormatError as e:
@@ -962,8 +960,11 @@ def main(argv: Optional[list[str]] = None):
         ensure_input_exists_from_cli(args)
         input_format = get_input_format_from_cli(args)
         rules = get_rules_from_cli(args)
-        file_extractors = get_file_extractors_from_cli(args, input_format)
-        found_file_limitation = find_file_limitations_from_cli(args, rules, file_extractors)
+        found_file_limitation = False
+        if input_format in STATIC_FORMATS:
+            # only static extractors have file limitations
+            file_extractors = get_file_extractors_from_cli(args, input_format)
+            found_file_limitation = find_file_limitations_from_cli(args, rules, file_extractors)
     except ShouldExitError as e:
         return e.status_code