New matching and event reporting module #84

disinvite · 2025-02-01T18:14:02Z

In #81 we changed the database to support merging two entities during a match. The natural progression is to separate loading of entity data (i.e. orig annotations) and matching. This is the first step towards doing that.

The modularization now makes it possible to test the effect of each match function. This was not really feasible before because everything was enclosed in the Compare structure. As a result, we now have a boatload of tests that demonstrate the idiosyncrasies for each type of match -- particularly vtables and static variables.

Until now, if we failed to match an orig annotation to a recomp entity, the orig address would not exist in the database. With these entities now there, we can use the orig name during asm sanitize even if we failed to match. I think this is better because it exposes a diff that would be hidden by the <OFFSET> placeholder. (Also: your annotation could be wrong, so this will show where it creates problems.) The impact to LEGO1 is minimal because we have just about everything mapped out, but there are two new diffs.¹

If an orig vtable annotation fails to match, its name will be just the class name. It would appear in the asm output as just Pizza (VTABLE) instead of Pizza::`vftable' (VTABLE). We could mock up a name here if the vtable fails to match, but I left it alone for the moment. This happens because we had been overloading the "name" attribute in the database for a lot of different purposes. Now that we are free to use any attribute for matching, we should change this.

I also added a simple event reporting protocol in event.py and each of the match functions can call it to report a failed match. We had relied on the logging module for this which left no way to react to a failed match in code or store it for later. (We can now also test when a match should fail.) For now we just redirect the events back to the main logger from core.py. I changed the error text slightly in a few places but the information should be the same.
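A minimal sketch of how a report callback like this might be wired up. ReccmpReportProtocol and reccmp_report_nop are names that appear in the diff for this PR; the ReccmpEvent members, the call signature, and match_by_name are assumptions made for illustration and are not the actual contents of event.py.

```python
import enum
from typing import Protocol


class ReccmpEvent(enum.Enum):
    # Hypothetical event kinds; the real enum in event.py may differ.
    NO_MATCH = enum.auto()
    NON_UNIQUE_SYMBOL = enum.auto()


class ReccmpReportProtocol(Protocol):
    # Assumed signature: event kind, orig address, optional message.
    def __call__(self, event: ReccmpEvent, orig_addr: int, msg: str = "") -> None:
        ...


def reccmp_report_nop(*_, **__) -> None:
    """Default reporter: ignore every event."""


def match_by_name(orig, recomp, report: ReccmpReportProtocol = reccmp_report_nop):
    """Toy match function: pair orig addresses with recomp addresses by name."""
    matches = {}
    for orig_addr, name in sorted(orig.items()):
        if name in recomp:
            matches[orig_addr] = recomp[name]
        else:
            # The caller decides what to do with a failed match.
            report(ReccmpEvent.NO_MATCH, orig_addr, msg=f"Failed to match {name}")
    return matches


# A test (or core.py) can collect events instead of parsing log output.
events = []
result = match_by_name(
    {0x1000: "Pizza::Start", 0x2000: "Pizza::Stop"},
    {"Pizza::Start": 0x5000},
    report=lambda event, addr, msg="": events.append((event, addr, msg)),
)
```

Passing the reporter as a parameter with a no-op default keeps the match functions free of logging setup, and a test can assert on the collected events directly.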

Lastly, there are two new SQL views to help keep the queries concise and a wrapper around the COUNT query. When matching we now always read both addresses in order. This will keep matching consistent for cases (like function name match) where there are multiple options. (i.e. it does not depend on database insertion order or filename order when reading annotations.)
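To illustrate why ordered reads matter, here is a small sketch using sqlite3. The table layout, the view names (orig_unmatched, recomp_unmatched), and the count helper are hypothetical and only stand in for the real schema in db.py.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE entities (orig_addr INTEGER, recomp_addr INTEGER, name TEXT);
    -- Hypothetical views: entities that still lack a partner on the other side.
    CREATE VIEW orig_unmatched AS
        SELECT orig_addr, name FROM entities
        WHERE orig_addr IS NOT NULL AND recomp_addr IS NULL;
    CREATE VIEW recomp_unmatched AS
        SELECT recomp_addr, name FROM entities
        WHERE recomp_addr IS NOT NULL AND orig_addr IS NULL;
    """
)

# Insert out of address order on purpose.
con.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [
        (0x2000, None, "Helper"),
        (None, 0x9000, "Helper"),
        (0x1000, None, "Helper"),
        (None, 0x8000, "Helper"),
    ],
)


def count(con, table):
    """Small wrapper around COUNT(*). The table name must be trusted input."""
    return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Reading both sides sorted by address makes the pairing repeatable,
# regardless of the order in which the rows were inserted.
orig = con.execute("SELECT orig_addr FROM orig_unmatched ORDER BY orig_addr").fetchall()
recomp = con.execute(
    "SELECT recomp_addr FROM recomp_unmatched ORDER BY recomp_addr"
).fetchall()
pairs = [(o, r) for (o,), (r,) in zip(orig, recomp)]
```

With several functions sharing one name, the lowest orig address is always paired with the lowest recomp address, so two runs over the same data give the same result.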

¹ We don't properly grab the g_skeletonKickPhases variable from the PDB so it's not possible to match it right now. The orig annotation now sets it in the database so there now appears to be a diff in LegoRaceCar::HandleSkeletonKicks. The other diff is from an MSVC library function which might be a typo in the annotation.

jonschz

Giving you a first batch of comments from what I noticed when reading the code. Will try and play around with it, then give some more feedback. Looks very promising! I especially like the improved testability and the report abstraction.

jonschz · 2025-02-02T10:53:55Z

reccmp/isledecomp/compare/core.py

+                    batch.match(fun.offset, recomp_addr)
+                    batch.set_recomp(
+                        recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip()
+                    )


Do we want to unify these two functions? Not sure how much they are used independent of each other (maybe at least one of them can be removed)

It would probably be better to have set_orig here, as with all the other imports from annotations that follow. I don't remember why I put set_recomp here. The final result to the database will be the same unless this match fails.

jonschz · 2025-02-02T10:56:23Z

reccmp/isledecomp/compare/core.py

+                        recomp_addr, type=EntityType.FUNCTION, stub=fun.should_skip()
+                    )
+
+        with self._db.batch() as batch:


It is not immediately obvious to me why you use separate batches for some, but not all types of insertions. If there is a good reason, I'd add a comment explaining why. If not, why don't we use one big batch?

Yes, I had these in different batches at one point and was using insert_orig to match the previous code more closely. I'll switch this around tomorrow.

The reason to use separate batches is so that staging data with insert_orig would succeed once (in the first batch) and then not change the data in subsequent batches. If you keep calling insert_orig on the same address in the same batch, we modify the pending changes. This is by design so you can add attributes in stages (or only if certain conditions are met) as we do here.

We could do it all in one batch if the annotations read from the DecompCodebase were guaranteed to have a unique orig address. I don't remember if that's true or not. We detect if you repeat the same addr in two different annotations (in the linter) but I don't think we remove the dupes here.

That is a valid reason to do multiple batches. I'd just add that to the code as a comment since the pattern was not obvious to me.
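The staging behavior discussed in this thread can be modeled with a toy class. ToyBatch is hypothetical and is not the real EntityDb batch; it only mirrors the semantics described above: insert_orig calls merge within one batch but never change an address that an earlier batch already committed, while set_orig always overrides.

```python
class ToyBatch:
    """Toy model of the staging behavior described above (not the real batch)."""

    def __init__(self, db: dict):
        self._db = db      # committed rows: orig_addr -> attributes
        self._insert = {}  # staged insert_orig calls
        self._set = {}     # staged set_orig calls

    def insert_orig(self, addr: int, **attrs):
        # Repeated calls in the same batch merge into the pending row.
        self._insert.setdefault(addr, {}).update(attrs)

    def set_orig(self, addr: int, **attrs):
        self._set.setdefault(addr, {}).update(attrs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for addr, attrs in self._insert.items():
                # Insert only if the address is new; existing rows are untouched.
                self._db.setdefault(addr, attrs)
            for addr, attrs in self._set.items():
                # set_orig always overrides existing attributes.
                self._db.setdefault(addr, {}).update(attrs)
        return False


db = {}
with ToyBatch(db) as batch:
    batch.insert_orig(0x1000, name="Pizza")
    batch.insert_orig(0x1000, type="VTABLE")  # same batch: merged into pending row
with ToyBatch(db) as batch:
    batch.insert_orig(0x1000, name="Other")  # later batch: address exists, ignored
after_insert = dict(db[0x1000])
with ToyBatch(db) as batch:
    batch.set_orig(0x1000, name="Override")  # set_orig replaces the name
```

Under this model, a duplicate address in a later batch is harmless with insert_orig, which is the reason given above for splitting the work into several batches.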

reccmp/isledecomp/compare/event.py

reccmp/isledecomp/compare/db.py

reccmp/isledecomp/compare/core.py

reccmp/isledecomp/compare/match_msvc.py

jonschz · 2025-02-02T11:19:17Z

reccmp/isledecomp/compare/match_msvc.py

+        return value
+
+
+def match_symbols(db: EntityDb, report: ReccmpReportProtocol = reccmp_report_nop):


I am a bit confused about the dependencies here (this was likely the case before your changes):

Do both match_functions and match_static_variables depend on this running first? If so, I'd document that in the respective docstrings.

If not: What kind of symbols would match_symbols be good for, if it isn't among the previous two?

The reason to match using the symbol is that it should be unique across the entire program. We don't need to check the type or any other attributes. Right now, the only place we expect a symbol is in a // FUNCTION or similar annotation, but we could add something like this:

// SYMBOL: TEST 0x1234 // ??_7PizzaMissionState@@6B@ // aka PizzaMissionState::`vftable`

It's probably most useful for this to run first, but match_functions doesn't depend on it. match_static_variables depends on having an orig entity (the function with the variable) with a symbol. You could match using either this or match_functions, or not at all if your annotation on the function uses the symbol.
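As a rough sketch of the idea, assuming both sides are given as plain address-to-symbol dictionaries (a hypothetical input shape, not the EntityDb interface), symbol matching reduces to a single lookup per orig entity with no type check:

```python
def match_symbols_sketch(orig_symbols, recomp_symbols):
    """Pair addresses whose decorated symbol is identical on both sides.

    Both arguments map address -> symbol. This is an illustration only and
    does not reflect the signature of the real match_symbols.
    """
    # Index recomp addresses by symbol. A symbol should be unique across the
    # program, so one lookup is enough.
    index = {symbol: addr for addr, symbol in sorted(recomp_symbols.items())}
    matches = {}
    for orig_addr, symbol in sorted(orig_symbols.items()):
        # pop() ensures each recomp entity is matched at most once.
        recomp_addr = index.pop(symbol, None)
        if recomp_addr is not None:
            matches[orig_addr] = recomp_addr
    return matches


matched = match_symbols_sketch(
    {0x1234: "??_7PizzaMissionState@@6B@", 0x2000: "?Missing@@YAXXZ"},
    {0x8000: "??_7PizzaMissionState@@6B@"},
)
```

Only the vtable symbol is present on both sides, so only 0x1234 is paired; the second orig entity stays unmatched and would be left for a later match function.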

That's what I expected. I'd add "Depends on match_symbols to run first" to the docstrings of all the other functions (where applicable).

tests/test_match_msvc.py

jonschz · 2025-02-02T11:37:35Z

I tried a bit and failed to observe the diffs you mentioned. I see the same issue with g_skeletonKickPhases on the current master commit. Weirdly, HandleSkeletonKicks still shows a 100% match.

madebr · 2025-02-02T21:29:02Z

reccmp/isledecomp/compare/match_msvc.py

+        # Max symbol length in MSVC is 255 chars. See also: Warning C4786.
+        symbol_index.add(symbol[:255], recomp_addr)


That's correct for the MSVC4 toolchain, but not for current MSVC.

How about a parameter that decides whether to truncate? We don't have any way to enable or disable it (or detect the MSVC version) but these could be added later to the yml.

I was also thinking about making it configurable.
Or, and this I don't know for sure: perhaps it only truncates in the old-style PDB file format.
So we can make it dependent on the file format.
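A possible shape for the optional truncation, as a sketch only. The 255-character limit comes from the comment in the diff; symbol_key and build_symbol_index are hypothetical helpers, and how the flag would be driven (yml option or PDB format detection) is left open here.

```python
MAX_SYMBOL_LENGTH = 255  # limit quoted in the diff comment (see Warning C4786)


def symbol_key(symbol: str, truncate: bool = True) -> str:
    """Return the key used for symbol lookup, optionally cut to 255 chars."""
    return symbol[:MAX_SYMBOL_LENGTH] if truncate else symbol


def build_symbol_index(recomp_symbols, truncate: bool = True):
    """Map lookup key -> recomp address (hypothetical helper)."""
    return {symbol_key(sym, truncate): addr for addr, sym in recomp_symbols.items()}


# With truncation on, an orig symbol that was stored cut to 255 chars still
# finds its full-length recomp partner. With truncation off, it does not.
long_symbol = "?" + "A" * 300
index = build_symbol_index({0x5000: long_symbol}, truncate=True)
found = index.get(symbol_key(long_symbol[:255]))
not_found = build_symbol_index({0x5000: long_symbol}, truncate=False).get(
    long_symbol[:255]
)
```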

disinvite · 2025-02-03T22:34:48Z

Today's changes:

- Added truncate param for symbol and function name match. This is just a bool for now. I don't know if there would be a need to truncate to a specific character length depending on the compiler. Defaults to True but we can control this later with a config option or by autodetecting when it is needed (PDB version).
- Small change to DecompCodebase to eliminate annotations that reuse an address in the current module. This combined with the new batch code gets the same behavior that we had before: only the first occurrence of the address is used/matched, and any duplicates are dropped. New error message to alert when this happens.
- With the above change: we can now load all annotations in a single batch because none of them will step on each other. I kept set_orig here because we want to override any existing information with the annotation data. Right now there isn't any data to replace (on the orig side) but that might change later.
- I moved the match functions out of load_markers and into the __init__ function. We can move this to a helper function if that makes more sense, but the point is to separate loading from matching.
- Used report call in load_markers for the three error messages.
- Added a comment for match_static_variables. It is most useful to call it after matching functions.
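The duplicate handling described in the list above can be sketched as follows. The tuple layout, the function name, and the message text are assumptions for illustration; the real change lives in DecompCodebase and may look different.

```python
def drop_duplicate_addresses(annotations, report=print):
    """Keep the first annotation per (module, address) and report the rest.

    annotations is an iterable of (module, address, name) tuples, which is a
    hypothetical stand-in for the annotation objects read from the codebase.
    """
    seen = set()
    unique = []
    for module, addr, name in annotations:
        key = (module, addr)
        if key in seen:
            report(f"Dropped duplicate address {module} 0x{addr:x} ({name})")
            continue
        seen.add(key)
        unique.append((module, addr, name))
    return unique


messages = []
kept = drop_duplicate_addresses(
    [
        ("LEGO1", 0x1000, "Pizza::Start"),
        ("LEGO1", 0x1000, "Pizza::Stop"),  # same module and address: dropped
        ("ISLE", 0x1000, "WinMain"),  # same address, other module: kept
    ],
    report=messages.append,
)
```

Once every (module, address) pair is unique, later annotations can no longer overwrite earlier ones, which is what makes loading everything in a single batch safe.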

I think that covers all the outstanding items so please give it another look and let me know. Thanks!

jonschz

Looks good! We can resolve the few regressions afterwards. I really like the move away from the <OFFSET1> placeholders as well.

New matching and event reporting module

3b0df42

jonschz reviewed Feb 2, 2025


disinvite added 2 commits February 2, 2025 11:47

Remove deprecated code

c5b5d28

Text cleanup

0af8394

madebr reviewed Feb 2, 2025


disinvite added 4 commits February 3, 2025 12:37

Symbol and function name truncate is optional

63f93c5

Remove annotations with duplicate (module, address)

f0e7bd3

Move matching out of load_markers, use report function more

c429c17

Add notice to match_static_variables

7677d88

disinvite requested review from madebr and jonschz February 7, 2025 15:45

madebr approved these changes Feb 7, 2025


jonschz approved these changes Feb 8, 2025


Merge branch 'master' into match-module

687f0f4

disinvite merged commit 7b924df into isledecomp:master Feb 9, 2025
11 checks passed

disinvite deleted the match-module branch February 9, 2025 01:09
