-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New matching and event reporting module #84
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Giving you a first batch of comments from what I noticed when reading the code. Will try and play around with it, then give some more feedback. Looks very promising! I especially like the improved testability and the report abstraction.
reccmp/isledecomp/compare/core.py
Outdated
batch.match(fun.offset, recomp_addr) | ||
batch.set_recomp( | ||
recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to unify these two functions? Not sure how much they are used independent of each other (maybe at least one of them can be removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be better to have set_orig
here, as with all the other imports from annotations that follow. I don't remember why I put set_recomp
here. The final result to the database will be the same unless this match fails.
reccmp/isledecomp/compare/core.py
Outdated
recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip() | ||
) | ||
|
||
with self._db.batch() as batch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not immediately obvious to me why you use separate batches for some, but not all types of insertions. If there is a good reason, I'd add a comment explaining why. If not, why don't we use one big batch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had these in different batches at one point and using insert_orig
to match the previous code better. I'll switch this around tomorrow.
The reason to use separate batches is so that staging data with insert_orig
would succeed once (in the first batch) and then not change the data in subsequent batches. If you keep calling insert_orig
on the same address in the same batch, we modify the pending changes. This is by design so you can add attributes in stages (or only if certain conditions are met) as we do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do it all in one batch if the annotations read from the DecompCodebase
were guaranteed to have a unique orig address. I don't remember if that's true or not. We detect if you repeat the same addr in two different annotations (in the linter) but I don't think we remove the dupes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a valid reason to do multiple batches. I'd just add that to the code as a comment since the pattern was not obvious to me.
return value | ||
|
||
|
||
def match_symbols(db: EntityDb, report: ReccmpReportProtocol = reccmp_report_nop): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused about the dependencies here (this was likely the case before your changes):
- Do both
match_functions
andmatch_static_variables
depend on this running first? If so, I'd document that in the respective docstrings. - If not: What kind of symbols would
match_symbols
be good for, if it isn't among the previous two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to match using the symbol is that it should be unique across the entire program. We don't need to check the type or any other attributes. Right now, the only place we expect a symbol is in a // FUNCTION
or similar annotation, but we could add something like this:
// SYMBOL: TEST 0x1234
// ??_7PizzaMissionState@@6B@
// aka PizzaMissionState::`vftable`
It's probably most useful for this to run first, but match_functions
doesn't depend on it. match_static_variables
depends on having an orig entity (the function with the variable) with a symbol. You could match using either this or match_functions
, or not at all if your annotation on the function uses the symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I expected. I'd add `"Depends on match_symbols to run first" to the docstrings of all the other functions (where applicable).
I tried a bit and failed to observe the diffs you mentioned. I see the same issue with |
# Max symbol length in MSVC is 255 chars. See also: Warning C4786. | ||
symbol_index.add(symbol[:255], recomp_addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct for MSVC4 toolchain, but not for current MSVC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a parameter that decides whether to truncate? We don't have any way to enable or disable it (or detect the MSVC version) but these could be added later to the yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also thinking about making it configurable.
Or, and this I don't know for sure: perhaps it only truncates in the old-style PDB file format.
So we can make it dependent on the file format.
Today's changes:
I think that covers all the outstanding items so please give it another look and let me know. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! We can resolve the few regressions afterwards. I really like the improvements away from OFFSET1
as well.
In #81 we changed the database to support merging two entities during a match. The natural progression is to separate loading of entity data (i.e. orig annotations) and matching. This is the first step towards doing that.
The modularization now makes it possible to test the effect of each match function. This was not really feasible before because everything was enclosed in the
Compare
structure. As a result, we now have a boatload of tests that demonstrate the idiosyncrasies for each type of match -- particularly vtables and static variables.Until now, if we failed to match an orig annotation to a recomp entity, the orig address would not exist in the database. With these entities now there, we can use the orig name during asm sanitize even if we failed to match. I think this is better because it exposes a diff that would be hidden by the
<OFFSET>
placeholder. (Also: your annotation could be wrong, so this will show where it creates problems.) The impact toLEGO1
is minimal because we have just about everything mapped out, but there are two new diffs.1If an orig vtable annotation fails to match, its name will be just the class name. It would appear in the asm output as just
'Pizza (VTABLE)'
instead ofPizza:'vftable' (VTABLE)
. We could mock up a name here if the vtable fails to match, but I left it alone for the moment. This happens because we had been overloading the "name" attribute in the database for a lot of different purposes. Now that we are free to use any attribute for matching, we should change this.I also added a simple event reporting protocol in
event.py
and each of the match functions can call it to report a failed match. We had relied on thelogging
module for this which left no way to react to a failed match in code or store it for later. (We can now also test when a match should fail.) For now we just redirect the events back to the main logger fromcore.py
. I changed the error text slightly in a few places but the information should be the same.Lastly, there are two new SQL views to help keep the queries concise and a wrapper around the
COUNT
query. When matching we now always read both addresses in order. This will keep matching consistent for cases (like function name match) where there are multiple options. (i.e. it does not depend on database insertion order or filename order when reading annotations.)Footnotes
We don't properly grab the
g_skeletonKickPhases
variable from the PDB so it's not possible to match it right now. The orig annotation now sets it in the database so there now appears to be a diff inLegoRaceCar::HandleSkeletonKicks
. The other diff is from an MSVC library function which might be a typo in the annotation. ↩