New matching and event reporting module #84

Merged
merged 8 commits into from
Feb 9, 2025

disinvite
Collaborator

In #81 we changed the database to support merging two entities during a match. The natural progression is to separate loading of entity data (i.e. orig annotations) and matching. This is the first step towards doing that.

The modularization now makes it possible to test the effect of each match function. This was not really feasible before because everything was enclosed in the Compare structure. As a result, we now have a boatload of tests that demonstrate the idiosyncrasies for each type of match -- particularly vtables and static variables.

Until now, if we failed to match an orig annotation to a recomp entity, the orig address would not exist in the database. Now that these entities are stored, we can use the orig name during asm sanitize even if the match failed. I think this is better because it exposes a diff that would otherwise be hidden by the <OFFSET> placeholder. (Also: your annotation could be wrong, so this will show where it creates problems.) The impact on LEGO1 is minimal because we have just about everything mapped out, but there are two new diffs.[1]

If an orig vtable annotation fails to match, its name will be just the class name. It would appear in the asm output as Pizza (VTABLE) instead of Pizza::`vftable' (VTABLE). We could mock up a name here if the vtable fails to match, but I left it alone for the moment. This happens because we had been overloading the "name" attribute in the database for a lot of different purposes. Now that we are free to use any attribute for matching, we should change this.

I also added a simple event reporting protocol in event.py and each of the match functions can call it to report a failed match. We had relied on the logging module for this which left no way to react to a failed match in code or store it for later. (We can now also test when a match should fail.) For now we just redirect the events back to the main logger from core.py. I changed the error text slightly in a few places but the information should be the same.
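
A minimal sketch of what such a reporting callback could look like. The names ReccmpReportProtocol and reccmp_report_nop appear in the diff under review, but the event kinds, the call signature, and the toy match function below are assumptions made for illustration, not the contents of event.py.

```python
import enum
from typing import Protocol


class ReccmpEvent(enum.Enum):
    # Hypothetical event kinds; the real module may define different ones.
    NO_MATCH = enum.auto()
    NON_UNIQUE_SYMBOL = enum.auto()


class ReccmpReportProtocol(Protocol):
    # Assumed signature: event kind, orig address, optional message.
    def __call__(self, event: ReccmpEvent, orig_addr: int, msg: str = "") -> None: ...


def reccmp_report_nop(event: ReccmpEvent, orig_addr: int, msg: str = "") -> None:
    """Default reporter: ignore every event."""


def match_by_name(orig, recomp, report: ReccmpReportProtocol = reccmp_report_nop):
    """Toy match function: pair orig addresses with recomp addresses by name."""
    matches = {}
    for addr, name in sorted(orig.items()):
        if name in recomp:
            matches[addr] = recomp[name]
        else:
            # Report the failure instead of logging it directly.
            report(ReccmpEvent.NO_MATCH, addr, msg=f"Failed to match {name}")
    return matches


# A test (or core.py) can pass its own reporter to collect or redirect events.
events = []


def collect(event, orig_addr, msg=""):
    events.append((event, orig_addr, msg))


result = match_by_name(
    {0x1000: "Pizza::Start", 0x2000: "Pizza::Stop"},
    {"Pizza::Start": 0x5000},
    report=collect,
)
```

Because the reporter is an ordinary callable, a test can assert on the collected events, and the caller can decide whether to forward them to the logger.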

Lastly, there are two new SQL views to help keep the queries concise and a wrapper around the COUNT query. When matching we now always read both addresses in order. This will keep matching consistent for cases (like function name match) where there are multiple options. (i.e. it does not depend on database insertion order or filename order when reading annotations.)
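
To illustrate why reading both sides in address order keeps matching deterministic, here is a small sqlite3 sketch. The table layout, the view names, and the pairing rule (lowest unmatched orig address with lowest unmatched recomp address) are assumptions for the example and may differ from the actual schema in db.py.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE entities (orig_addr INTEGER, recomp_addr INTEGER, name TEXT);
    -- Hypothetical views: entities still waiting for a partner on each side.
    CREATE VIEW orig_unmatched AS
        SELECT orig_addr, name FROM entities
        WHERE orig_addr IS NOT NULL AND recomp_addr IS NULL;
    CREATE VIEW recomp_unmatched AS
        SELECT recomp_addr, name FROM entities
        WHERE recomp_addr IS NOT NULL AND orig_addr IS NULL;
    """
)
# Rows are inserted deliberately out of address order.
con.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [
        (0x2000, None, "Helper"),
        (0x1000, None, "Helper"),
        (None, 0x6000, "Helper"),
        (None, 0x5000, "Helper"),
    ],
)


def unmatched(view, column, name):
    # view and column are fixed strings chosen by this sketch, never user input.
    sql = f"SELECT {column} FROM {view} WHERE name = ? ORDER BY {column}"
    return [row[0] for row in con.execute(sql, (name,))]


orig_addrs = unmatched("orig_unmatched", "orig_addr", "Helper")
recomp_addrs = unmatched("recomp_unmatched", "recomp_addr", "Helper")
# Pairing in address order does not depend on insertion order.
pairs = list(zip(orig_addrs, recomp_addrs))
```

With the explicit ORDER BY, the two functions named Helper pair up the same way no matter which annotation file or PDB record was read first.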

Footnotes

  1. We don't properly grab the g_skeletonKickPhases variable from the PDB so it's not possible to match it right now. The orig annotation now sets it in the database so there now appears to be a diff in LegoRaceCar::HandleSkeletonKicks. The other diff is from an MSVC library function which might be a typo in the annotation.

Collaborator

@jonschz jonschz left a comment

Giving you a first batch of comments from what I noticed when reading the code. Will try and play around with it, then give some more feedback. Looks very promising! I especially like the improved testability and the report abstraction.

Comment on lines 283 to 286
batch.match(fun.offset, recomp_addr)
batch.set_recomp(
    recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip()
)
Collaborator

Do we want to unify these two functions? Not sure how much they are used independent of each other (maybe at least one of them can be removed)

Collaborator Author

It would probably be better to have set_orig here, as with all the other imports from annotations that follow. I don't remember why I put set_recomp here. The final result to the database will be the same unless this match fails.

recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip()
)

with self._db.batch() as batch:
Collaborator

It is not immediately obvious to me why you use separate batches for some, but not all types of insertions. If there is a good reason, I'd add a comment explaining why. If not, why don't we use one big batch?

Collaborator Author

Yes, I had these in different batches at one point and was using insert_orig to match the previous code better. I'll switch this around tomorrow.

The reason to use separate batches is so that staging data with insert_orig would succeed once (in the first batch) and then not change the data in subsequent batches. If you keep calling insert_orig on the same address in the same batch, we modify the pending changes. This is by design so you can add attributes in stages (or only if certain conditions are met) as we do here.
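
The staging behavior described above can be sketched with a toy batch object. The method names insert_orig and set_orig come from this thread, but ToyDb, ToyBatch, and their internals are hypothetical and only model the semantics as described here: inserts apply to new addresses only, while sets always apply.

```python
class ToyDb:
    def __init__(self):
        self.rows = {}  # orig address -> attribute dict

    def batch(self):
        return ToyBatch(self)


class ToyBatch:
    def __init__(self, db):
        self._db = db
        self._insert = {}  # staged rows that only apply to new addresses
        self._set = {}  # staged rows that always apply

    def insert_orig(self, addr, **kwargs):
        # Repeated calls in one batch accumulate attributes for that address.
        self._insert.setdefault(addr, {}).update(kwargs)

    def set_orig(self, addr, **kwargs):
        self._set.setdefault(addr, {}).update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for addr, attrs in self._insert.items():
                # Insert only if the address is not in the database yet.
                self._db.rows.setdefault(addr, dict(attrs))
            for addr, attrs in self._set.items():
                # Set creates the row if needed and overrides existing values.
                self._db.rows.setdefault(addr, {}).update(attrs)
        return False


db = ToyDb()
with db.batch() as batch:
    batch.insert_orig(0x1000, name="Pizza")
    batch.insert_orig(0x1000, size=8)  # same batch: merged into the pending row
with db.batch() as batch:
    batch.insert_orig(0x1000, name="Other")  # later batch: row exists, ignored
after_insert = dict(db.rows[0x1000])
with db.batch() as batch:
    batch.set_orig(0x1000, name="Override")  # set always wins
after_set = dict(db.rows[0x1000])
```

In this model a second insert for the same address only has an effect while the first is still pending, which is why splitting the work into batches changes the outcome.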

Collaborator Author

We could do it all in one batch if the annotations read from the DecompCodebase were guaranteed to have a unique orig address. I don't remember if that's true or not. We detect if you repeat the same addr in two different annotations (in the linter) but I don't think we remove the dupes here.

Collaborator

@jonschz jonschz Feb 3, 2025

That is a valid reason to do multiple batches. I'd just add that to the code as a comment since the pattern was not obvious to me.

return value


def match_symbols(db: EntityDb, report: ReccmpReportProtocol = reccmp_report_nop):
Collaborator

I am a bit confused about the dependencies here (this was likely the case before your changes):

  • Do both match_functions and match_static_variables depend on this running first? If so, I'd document that in the respective docstrings.
  • If not: What kind of symbols would match_symbols be good for, if it isn't among the previous two?

Collaborator Author

The reason to match using the symbol is that it should be unique across the entire program. We don't need to check the type or any other attributes. Right now, the only place we expect a symbol is in a // FUNCTION or similar annotation, but we could add something like this:

// SYMBOL: TEST 0x1234
// ??_7PizzaMissionState@@6B@
// aka PizzaMissionState::`vftable`

It's probably most useful for this to run first, but match_functions doesn't depend on it. match_static_variables depends on having an orig entity (the function with the variable) with a symbol. You could match using either this or match_functions, or not at all if your annotation on the function uses the symbol.

Collaborator

That's what I expected. I'd add "Depends on match_symbols to run first" to the docstrings of all the other functions (where applicable).

Collaborator

jonschz commented Feb 2, 2025

I tried a bit and failed to observe the diffs you mentioned. I see the same issue with g_skeletonKickPhases on the current master commit. Weirdly, HandleSkeletonKicks still shows a 100% match.

Comment on lines 44 to 45
# Max symbol length in MSVC is 255 chars. See also: Warning C4786.
symbol_index.add(symbol[:255], recomp_addr)
Collaborator

That's correct for the MSVC4 toolchain, but not for current MSVC.

Collaborator Author

How about a parameter that decides whether to truncate? We don't have any way to enable or disable it (or detect the MSVC version) but these could be added later to the yml.

Collaborator

I was also thinking about making it configurable.
Or, and this I don't know for sure: perhaps it only truncates in the old-style PDB file format.
So we can make it dependent on the file format.
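
A rough sketch of how a truncation switch could affect symbol lookup. The 255-character limit comes from the comment in the diff above; build_symbol_index, lookup, and the sample symbols are hypothetical and not the code in match_msvc.py.

```python
MSVC_SYMBOL_LIMIT = 255  # limit quoted in the diff comment (see Warning C4786)


def build_symbol_index(recomp_symbols, truncate=True):
    """Map each (possibly truncated) symbol to its recomp addresses, in address order."""
    index = {}
    for addr, symbol in sorted(recomp_symbols.items()):
        key = symbol[:MSVC_SYMBOL_LIMIT] if truncate else symbol
        index.setdefault(key, []).append(addr)
    return index


def lookup(index, orig_symbol, truncate=True):
    """Return candidate recomp addresses for an orig symbol."""
    key = orig_symbol[:MSVC_SYMBOL_LIMIT] if truncate else orig_symbol
    return index.get(key, [])


# Two made-up symbols that differ only after the 255th character.
long_a = "?" + "A" * 300 + "@@QAEXXZ"
long_b = "?" + "A" * 300 + "@@QAEHXZ"
recomp = {0x5000: long_a, 0x6000: long_b}

truncated_index = build_symbol_index(recomp, truncate=True)
full_index = build_symbol_index(recomp, truncate=False)

# An orig symbol that was already cut to 255 characters by an old toolchain.
cut_symbol = long_a[:MSVC_SYMBOL_LIMIT]
with_truncation = lookup(truncated_index, cut_symbol, truncate=True)
without_truncation = lookup(full_index, cut_symbol, truncate=False)
exact = lookup(full_index, long_a, truncate=False)
```

In this toy setup truncation lets a shortened orig symbol find candidates at the cost of ambiguity, while the untruncated index only matches full symbols, which is the trade-off a config option or PDB-format check would decide.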

@disinvite
Collaborator Author

Today's changes:

  • Added truncate param for symbol and function name match. This is just a bool for now. I don't know if there would be a need to truncate to a specific character length depending on the compiler. Defaults to True but we can control this later with a config option or by autodetecting when it is needed (PDB version).

  • Small change to DecompCodebase to eliminate annotations that reuse an address in the current module. This combined with the new batch code gets the same behavior that we had before: only the first occurrence of the address is used/matched, and any duplicates are dropped. New error message to alert when this happens.

  • With the above change: we can now load all annotations in a single batch because none of them will step on each other. I kept set_orig here because we want to override any existing information with the annotation data. Right now there isn't any data to replace (on the orig side) but that might change later.

  • I moved the match functions out of load_markers and into the __init__ function. We can move this to a helper function if that makes more sense, but the point is to separate loading from matching.

  • Used report call in load_markers for the three error messages.

  • Added a comment for match_static_variables. It is most useful to call it after matching functions.
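
The duplicate-address handling from the second bullet can be sketched as follows. This is only an illustration of "first occurrence wins, later ones are dropped and reported"; the Annotation type, the function name, and the message text are made up and do not reflect the actual DecompCodebase code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    # Hypothetical stand-in for an annotation already filtered to one module.
    offset: int
    name: str


def drop_duplicate_addresses(annotations, report=lambda msg: None):
    """Keep the first annotation for each address and report the rest."""
    seen = set()
    unique = []
    for item in annotations:
        if item.offset in seen:
            report(f"Dropped duplicate address 0x{item.offset:x} ({item.name})")
            continue
        seen.add(item.offset)
        unique.append(item)
    return unique


messages = []
kept = drop_duplicate_addresses(
    [
        Annotation(0x1000, "Pizza::Start"),
        Annotation(0x2000, "Pizza::Stop"),
        Annotation(0x1000, "Pizza::Restart"),  # reuses 0x1000, so it is dropped
    ],
    report=messages.append,
)
```

Once every address appears at most once, staging all annotations in a single batch can no longer cause one annotation to overwrite another.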

I think that covers all the outstanding items so please give it another look and let me know. Thanks!

@disinvite disinvite requested review from madebr and jonschz February 7, 2025 15:45
Collaborator

@jonschz jonschz left a comment

Looks good! We can resolve the few regressions afterwards. I really like the improvements away from OFFSET1 as well.

@disinvite disinvite merged commit 7b924df into isledecomp:master Feb 9, 2025
11 checks passed
@disinvite disinvite deleted the match-module branch February 9, 2025 01:09