Why is encoding not in the type? #3
Yeah, more statically available information is always better. Wondering how we can provide it, though, given that Wasm modules and Wasm hosts are composable in many ways: a module may run in a browser or, say, in Wasmtime, and an import may be a JS module or a Rust module. As such, this mostly focuses on caching so far, but I am certainly interested in your ideas. Regarding additional encodings, I currently would not expect that we'd need more anytime soon, but I may be wrong. Afaict, the bulk of languages use either W/UTF-8 or W/UTF-16, and supporting anything else (in browsers) may turn out to be impractical. Do you have any specific encoding in mind?
I don't have a concrete additional encoding in mind. But anyway, now I think you would want indirection in the string object for both slots, because the alternative is preallocating space for a potentially unneeded encoding, which is ~2x memory consumption in the worst case, which is really bad. Based on what I understood from your proposal, a universal string would look like:
Which means:
I'm assuming that when you pass a JS string to Wasm, you would allocate a WasmString and attach the JS string to slot2. You would need to cache the JS string -> Wasm string mapping in order to avoid allocating a different WasmString for the same JS string (assuming you don't want to force JS to use this 2-slot structure). It turns out there is a cost attached to JS->Wasm string interop, which is unfortunate. Languages where String is not a primitive type but a subtype of other classes, like Object, would need custom fields in String for "type info" or a v-table. They would have to choose from:
So, if universal strings turned out to be inefficient, languages might choose to use a Wasm array if they need fast strings within a Wasm module. Or they might choose to keep compiling to JS for fast JS and DOM interop. Universal strings might become less than universal. There might be a third compelling alternative with great JS interop and less overhead compared to "universal" strings:
Alternatively, you could attach a runtime $vtable to strings only when a String loses its static type (e.g., when cast to Object). But this approach has trade-offs and is still worse than attaching the v-table directly to the char array. If you have an API that wants to be compatible across embeddings, it may choose a concrete encoding (or provide multiple overloads for different string encodings). Languages that want to use these APIs would adapt their strings at the boundaries if needed. I'm afraid I don't see a way to make a universal string type that supports multiple encodings without sacrificing use cases within a single known encoding.
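The lazy v-table idea could be sketched like this in TypeScript; all names here are hypothetical, and a side `WeakMap` stands in for engine-internal metadata attached at the upcast point:

```typescript
// Sketch of "attach the v-table only when the string loses its static type":
// a side table associates runtime type info with a bare char array only at
// the point where it is upcast to Object. Names are illustrative.
const vtables = new WeakMap<object, { typeName: string }>();

// Statically typed code passes the raw char array around with no header.
function makeChars(s: string): Uint16Array {
  const a = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) a[i] = s.charCodeAt(i);
  return a;
}

// The upcast point: attach the v-table lazily, paying the cost only here.
function upcastToObject(chars: Uint16Array): object {
  if (!vtables.has(chars)) vtables.set(chars, { typeName: "String" });
  return chars;
}

// Dynamically typed code looks the v-table up through the side table,
// which is exactly the extra indirection mentioned above.
function typeNameOf(o: object): string {
  return vtables.get(o)?.typeName ?? "Object";
}
```

The trade-off is visible in `typeNameOf`: every dynamically typed access goes through the side table, whereas a v-table field prefixed directly onto the char array would be one load.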
Do you know how the JVM solves this?
I guess analogous to arrays with custom fields, there may as well be strings with custom fields, with the encoding slots being implicit.
Yeah, that's one of the alternative ideas brought up in a related discussion, using type pre-imports. The hope expressed was that if enough languages (running in the browser) go for it, these languages will be interoperable (in the browser), but I worry that non-WTF-16 languages will not adopt it, or non-WTF-16 hosts won't provide it, again leading to copying overhead, ecosystem fragmentation, or the like. If everything else fails, I do consider it the most viable alternative, though. Regarding efficiency in general, my mental model is that dynamic checks are only necessary if an encoding cannot be statically determined; for example, only the first instruction in a code path needs to do a dynamic check, and subsequent code becomes just indirection. It becomes even easier when a module uses one encoding scheme exclusively (the common case), where the check can be performed at the boundary, so engines may be able to avoid indirection. There's more written down in this paragraph. Do you see problems with the assumptions made there?
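The "check once at the boundary" model can be sketched in TypeScript; the tagged union and function names are made up for illustration, with the tag standing in for the dynamic encoding check:

```typescript
// Sketch of performing the single dynamic encoding check at the module
// boundary. A module that internally works only with UTF-16 checks once;
// code after the boundary is encoding-check-free.
type AnyString =
  | { encoding: "utf8"; bytes: Uint8Array }
  | { encoding: "utf16"; value: string };

// The one dynamic check, at the boundary.
function intoUtf16(s: AnyString): string {
  if (s.encoding === "utf16") return s.value; // common fast path: no conversion
  return new TextDecoder("utf-8").decode(s.bytes); // slow path: convert once
}

// Internal code sees only the concrete UTF-16 representation.
function shout(s: AnyString): string {
  const v = intoUtf16(s);
  return v.toUpperCase();
}
```

When the whole module uses one encoding exclusively, `intoUtf16` collapses to the fast path and the engine need not keep the tag around internally.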
Major JVM implementations use an array with a custom field prefix inherited from Object, without indirection. JVM languages have to use this String class and the Java object model if they want to be efficient and compatible with the enormous number of Java libraries and legacy codebases. Sadly, nobody uses this new Wasm GC string type we are designing yet, so it would be hard to convince languages to adopt it in cases where the alternative is better for their non-universal use cases, for example targeting browsers and interacting with JS and the DOM, where the only currently relevant language (ecosystem-wise) is JavaScript.
Custom fields would make strings incompatible with JS, the DOM, and across languages, requiring copies, am I right?
Yes, there are a lot of optimization opportunities, but it's hard to predict how well they compare to concrete string types. Thinking about it a little more, it feels like your universal Wasm type can be represented as a Wasm GC struct of two concrete string types, and the check-avoiding optimizations can be done by language toolchains. So languages can use concrete types internally where possible. And the hard problem would be convincing languages to agree to use this GC struct at interface boundaries. But I might be missing something here.
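A minimal TypeScript model of that "struct of two concrete string types" idea, with both slots nullable and filled lazily; the class and field names (`utf8`, `utf16`) are hypothetical stand-ins, and a JS `string` models the WTF-16 representation:

```typescript
// Sketch: a universal string as a struct of two concrete representations.
// Either slot may be empty; a missing slot is materialized on first use,
// which avoids preallocating ~2x memory for an encoding nobody asks for.
class UniversalString {
  utf8: Uint8Array | null;
  utf16: string | null; // JS strings are UTF-16-like under the hood

  constructor(utf8: Uint8Array | null, utf16: string | null) {
    if (utf8 === null && utf16 === null) {
      throw new Error("at least one encoding slot must be set");
    }
    this.utf8 = utf8;
    this.utf16 = utf16;
  }

  // A toolchain that statically wants UTF-16 calls this once at the
  // boundary, then works with the concrete representation.
  asUtf16(): string {
    if (this.utf16 === null) {
      this.utf16 = new TextDecoder("utf-8").decode(this.utf8!);
    }
    return this.utf16;
  }

  asUtf8(): Uint8Array {
    if (this.utf8 === null) {
      this.utf8 = new TextEncoder().encode(this.utf16!);
    }
    return this.utf8;
  }
}
```

Note this models exactly the two-slot indirection cost discussed earlier: each slot is a reference, and a conversion happens at most once per object per encoding.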
Wasm engines that know the available encoding statically would sometimes be able to generate more efficient code, especially when compiling without dynamic profile information.
Then we could have an instruction to convert between different encodings. The conversion result could also be cached by the engine in an external hash table (or in an object slot) to avoid repeating the same conversion multiple times.
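The external-table caching could look roughly like this in TypeScript; `WasmStringRef` is a hypothetical wrapper (a `WeakMap` needs object keys, which stands in for the engine keying the table by object identity):

```typescript
// Sketch of engine-side conversion caching in an external weak table:
// the first conversion does the work, repeats are table lookups, and
// entries die with the source object.
class WasmStringRef {
  constructor(public readonly utf16: string) {}
}

const utf8Cache = new WeakMap<WasmStringRef, Uint8Array>();

function toUtf8Cached(s: WasmStringRef): Uint8Array {
  let bytes = utf8Cache.get(s);
  if (bytes === undefined) {
    bytes = new TextEncoder().encode(s.utf16); // perform the conversion once
    utf8Cache.set(s, bytes);
  }
  return bytes;
}
```

Caching in an object slot instead would trade the hash lookup for a field load, at the cost of reserving the slot in every object, which is the memory trade-off raised earlier in the thread.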
Would JS embeddings require wrapping a JS string with an extra Wasm string object in order to provide the extra WTF-8 slot? Knowing the WTF-16 encoding statically would help avoid that.
This could also scale better with adding more encodings in the future. More than two encoding slots per object would mandate indirect access, right?
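One way to picture why an open-ended set of encodings forces indirection, sketched in TypeScript with hypothetical names; the per-object `Map` stands in for slots that can no longer be fixed inline fields:

```typescript
// Sketch: with more than two (or a growing set of) encodings, the
// representations live behind a per-object table rather than in fixed
// inline slots, so every access pays one extra hop.
type Encoding = "utf8" | "utf16" | "latin1"; // imagine this set growing

class MultiEncodingString {
  private slots = new Map<Encoding, Uint8Array | string>();

  constructor(encoding: Encoding, data: Uint8Array | string) {
    this.slots.set(encoding, data);
  }

  has(encoding: Encoding): boolean {
    return this.slots.has(encoding);
  }

  get(encoding: Encoding): Uint8Array | string | undefined {
    return this.slots.get(encoding); // indirect access through the table
  }
}
```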