Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider SymbolType enum #51

Open
disinvite opened this issue Jan 2, 2025 · 3 comments · May be fixed by #65
Open

Reconsider SymbolType enum #51

disinvite opened this issue Jan 2, 2025 · 3 comments · May be fixed by #65
Labels
enhancement New feature or request

Comments

@disinvite
Copy link
Collaborator

We currently have this:

class SymbolType(IntEnum):
    """Broadly tells us what kind of comparison is required for this symbol."""

    FUNCTION = 1
    DATA = 2
    POINTER = 3
    STRING = 4
    VTABLE = 5
    FLOAT = 6

We set this field for an entity based on assumptions from the PDB. If we read a match marker like // VTABLE we will set or replace this value based on the type of comparison that is requested.

We use the value to decide how to compare each entity. We also use it in the match_name function that presents a string describing the entity. This is used in asm sanitize and takes the form of: {name | string text | float number} ({symbol type name})

The problems with this are:

  • Cannot insert the enum directly into the database, so it must be converted to a primitive type. This was solved this by extending IntEnum instead of just Enum.
  • To write a custom query to (for example) "select all strings" the user has to know this internal value or use the enum as a parameter.
  • Similarly, when loading data from a text file, there isn't a way to declare that an entity is a particular SymbolType without knowing the internal value (See Load CSV files to populate reccmp metadata #32)
  • For proper type hygiene we should expect SymbolType anywhere the "type" field is needed. One reason for not doing this is that enums are slow. For what they are, there's more overhead than you would expect. (Again, we have sort of dodged this one by using IntEnum)
  • On the topic of performance, to get a string representation of the SymbolType in match_name() it is much faster to use a dict than the "correct" way of SymbolType(value).name.
@disinvite disinvite added the enhancement New feature or request label Jan 2, 2025
@jonschz
Copy link
Collaborator

jonschz commented Jan 3, 2025

Just a few thoughts:

  • I definitely see a problem with the way the enum is used now. If we decide to stick with the enum, we should already map it in the ORM and never use the integer representation outside the database file.
  • Not sure how big of a difference in performance we will see here. While I agree that we shouldn't waste performance unecessarily, I do think that an enum is a good choice here conceptually and in terms of code readability. Also, Python isn't exactly famous for its performance, so I wouldn't invest too much into small gains, especially if they decrease redability or maintainability.
  • What alternatives would you propose? Just what I can think of from the top of my head:
    • Stick with integers, compare to constants (the C/C++ way): Not something I'd like - we'd need to see a massive gain in performance before I'd consider that
    • Use strings instead: Can we properly protect them from invalid values using the type system? Might be worth considering once we have mypy in the pipeline
      • Also: Does sqlite support string enums? I haven't checked yet

Looking forward to your thoughts!

@disinvite
Copy link
Collaborator Author

No enum column type in sqlite. You could create a table with the values you want and use a foreign key constraint to enforce the possible values in the entities table. That may work with the JSON approach we're now using, but it might not.

Strings are probably the "simplest" way forward. It's no worse than what we have now. You're still using magic values but they at least make sense to users. You can also stringify the value for free which is nice. We could maybe add a "cleanup" step to check all type fields and shunt the invalid ones to EntityType.UNKNOWN or something. Evaluating the performance hit should be as simple as changing the parent type from IntEnum to StrEnum.

The other option is using a bunch of boolean fields, since our key value pairs can now be anything. This is sort of intuitive: e.g. function=True or const=True. However... you could argue that it just shifts the magic values into the JSON keys instead. You also gain a new problem of data hygiene because it's probably not valid to have both function=True and string=True on an entity.

@jonschz
Copy link
Collaborator

jonschz commented Jan 4, 2025

In that case I'd say going for strings as the underlying type makes the most sense. I'd like to have some protection by the type system against invalid enum values (e.g. typos) - not sure if there's a better option than StrEnum.

@disinvite disinvite linked a pull request Jan 7, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants