I think the author is viewing WebAssembly strictly in the context of using it inside of a web browser on a web page. The problem with that view is that WebAssembly's goals go far beyond that: it's meant to be a (much more) portable runtime/binary format, akin to a Flatpak or AppImage.
In the context of a web browser, having low-level access to the client's filesystem (e.g. fd_write) seems like a bad idea/security problem. However, if you're executing WASM from the perspective of "I'm running this on my desktop" or "this is a microservice I am running on my server", low-level filesystem access suddenly makes a lot more sense.
It's up to the implementation to add controls around, or completely restrict, things like filesystem access. There's no reason why such features can't be included in the larger standard of WebAssembly. It's similar to .webm being a subset of the Matroska (.mkv) container format... For the web specifically, the folks that defined .webm (Google) decided that they didn't want or need all the features of Matroska.
It has been painstakingly designed not to be a security problem, and I would really recommend becoming familiar with the APIs in question. Every last I/O function is capability-based; for example, you must have a directory handle in order to open a file or subdirectory in it. Initial handles are provided by the runtime.
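For what it's worth, this is easy to see from the host side. A minimal sketch using Node's node:wasi bindings (the module path and the mapped directory are placeholders): the guest only ever sees the directories the host chooses to preopen, and every further file or subdirectory handle has to be derived from one of those.

    // host side (Node.js ESM); assumes a WASI module compiled to ./demo.wasm
    import { readFile } from "node:fs/promises";
    import { WASI } from "node:wasi";

    const wasi = new WASI({
      version: "preview1",
      // the guest sees a virtual "/data" backed by this host directory, and nothing else
      preopens: { "/data": "/home/me/sandbox" },
    });

    const wasm = await WebAssembly.compile(await readFile("./demo.wasm"));
    const instance = await WebAssembly.instantiate(wasm, wasi.getImportObject());
    wasi.start(instance); // fd_write and friends now resolve against those handles only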
At the end of the day, this person's argument is "WASI providing POSIX-like APIs to WebAssembly is harmful for the web", and they are very angry that the larger community disagrees with them. Despite all the diagrams, words, and tables repeating "In Violation", the author can't make an argument more substantiated than their own opinion of what's appropriate for the web.
It's also clear this person would be extremely unpleasant to collaborate with. So none of this is surprising.
I have to fully agree that it seems like new standards in the Wasm ecosystem are created to gain dominion over it. WASI is one clear example of this (but not the only one; the Component Model is another, where only one runtime is pushing for it).
To be honest, that's not the way standards should be done: standards are named this way because they require wide adoption. Until that happens, they are not standards... just technical specifications.
In general I believe there's a great opportunity to create something based on true Web Standards rather than proprietary APIs driven by the big corps, something that can also run fully server-side.
The author seems to be under the impression that UTF-8 and Unicode are distinct, incompatible string types, and the rest of the objections follow from that. Bizarre.
That's not the issue / impression. The issue is that well-formed UTF-16 is rarely used in practice. JavaScript, Java, C#, Dart, Kotlin etc. all effectively use WTF-16 for compatibility and performance reasons, and that's what's semantically distinct from UTF-8. The two have asymmetric value spaces, so strings from these "legacy" languages would sometimes throw or be implicitly mutated when crossing a boundary. Mixed systems typically use WTF-8 as the common denominator for this reason, i.e. not UTF-8, but Wasm decided against that.
I'm sympathetic to the argument but I'm not convinced that the UTF-16 string format in those languages is much of a hurdle.
Even Python, which adopted the pretty ridiculous internal UCS-4 encoding, is now carrying around a pre-encoded UTF-8 version of strings for crossing boundaries. UTF-8 is just too widespread for many languages to avoid having to support it natively in some form.
Likewise, WASM is not the first standard that has opinions about string encodings that are not native to a language. For instance, Go and Rust, which prefer UTF-8 internally, have to re-encode on the way to Windows APIs (usually!). Likewise, Cocoa/Objective-C traditionally use UCS-2 strings, which are quite leaky, yet Swift nowadays uses UTF-8 internally and transcodes.
To me it's not so much a question of what's the best / recommended (for new languages) / most used encoding. It's rather the observation that there are so many popular languages operating on Unicode Code Points, not Unicode Scalar Values, and Wasm said it wants to support these as well, including integration with JS/Web APIs, which already share their semantics. And for all of these, the restriction is unnecessary and trivially avoidable: let them pass strings to each other, or from/to JavaScript, without errors or mutation, while changing nothing for those preferring UTF-8. Looking at this the other way around might be valuable as well: what if one were to design a Component Model for any of the affected languages? There, introducing an unnecessary discontinuity would be a design mistake, I'd argue.
I'm not sure I understand that point. JavaScript has one way to encode strings (WTF-16), and many things that compile to WASM have another (UTF-8). The only thing I see happening is that people are generally starting to recommend UTF-8 as the internal string encoding, particularly across bridges.
The question is IMO not really about whether someone operates on Unicode Code Points or not, but about what encoding strings shared across a WASM bridge should have, and settling on one makes sense.
In more technical jargon, it is about value spaces. All UTF-8 strings map to WTF-16 strings semantically (lists of Unicode Scalar Values are a subset of lists of Unicode Code Points), but some WTF-16 strings do not map to UTF-8 (Unicode Scalar Values exclude some Code Points). That's something UTF-8-based languages have to deal with anyway (on the Web, which is WTF-16, or in any other mixed system), but it's odd that the expectation has become that even languages that map to each other, including to JS/Web APIs, now have to share a problem that does not exist in their native VMs, for reasons. To emphasize: these languages just do what's perfectly fine in WebIDL, JSON, ECMAScript and their own language specifications. For example:
    let myString = "...";
    let returnedMyString = someFunction(myString); // might or might not cross a boundary
    if (myString == returnedMyString) {
      // sometimes silently false
    }
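To make the "sometimes" concrete: a lone surrogate is a perfectly valid value in a JS/WTF-16 string, but it is silently rewritten the moment the string is forced through UTF-8. The standard encoder APIs show the same mechanism (just an illustration of the effect, not the Component Model's actual lifting/lowering code):

    const original = "abc\uD800def";                    // valid JS string containing a lone surrogate
    const bytes = new TextEncoder().encode(original);   // the lone surrogate is replaced with U+FFFD
    const roundTripped = new TextDecoder().decode(bytes);
    console.log(original === roundTripped);             // false: "abc\uFFFDdef" came back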
I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag saying "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
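A rough sketch of what I mean by that flag (the names and signature are made up for illustration; this is not proposed Component Model API): only when both ends declare WTF-16 does the string pass through unchanged; in every other case today's well-formed/UTF-8 semantics apply, so nothing changes for UTF-8 languages.

    // hypothetical helper, for illustration only
    function passString(value: string, callerIsWtf16: boolean, calleeIsWtf16: boolean): string {
      if (callerIsWtf16 && calleeIsWtf16) {
        return value; // identical value spaces: no errors, no mutation
      }
      // mixed case: sanitize to well-formed/UTF-8 semantics (lone surrogates become U+FFFD)
      return new TextDecoder().decode(new TextEncoder().encode(value));
    }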
> I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag saying "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
It's because the idea that languages "cannot change" does not appear to be true. UTF-8 is so widespread now that switching the native string representation towards it has become an interesting proposition for languages. Many modern languages (e.g. Go and Rust) already picked UTF-8, others such as Swift changed over to it. Then there are implementations of languages like Python (PyPy) that changed their internal encoding even though there was a widespread assumption that it couldn't work.
The web is also not WTF-16; JavaScript is, and the web consists of more than just that. WTF-16 to WTF-16 is most likely becoming less and less of a thing going forward, except for legacy interfaces such as the wide-character (W) APIs on Windows, and even there it appears that UTF-8 at the code-page level is now strongly recommended.
To give you another example: I'm very interested in using AssemblyScript today to do data processing, but that actually is not all that easy because the data I need to process is in UTF-8. To use the string class in AssemblyScript, I actually have to do a pointless conversion to WTF-16 and back.
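Concretely, even something as trivial as trimming a UTF-8 input means two transcodings just to touch the string class (a sketch; the exported function and its signature are made up for illustration):

    // AssemblyScript: the input arrives as UTF-8 bytes in an ArrayBuffer
    export function normalize(input: ArrayBuffer): ArrayBuffer {
      // decode UTF-8 -> WTF-16, only because String is WTF-16 internally...
      let text = String.UTF8.decode(input);
      let trimmed = text.trim();
      // ...then encode WTF-16 -> UTF-8 again on the way out
      return String.UTF8.encode(trimmed);
    }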
I would be majorly surprised if JavaScript doesn’t adopt UTF-8 at one point as well.
I do understand the desire to switch all languages and systems to one encoding, of course. However, switching a WTF-16 language to UTF-8 removes previously valid values from strings, merely trading what is errors/mutation on Component boundaries right now for errors/mutation when using string APIs. It can't be done in a backwards-compatible way, and all these languages have a lot of existing code. If backwards compatibility is a goal (say, when using a breadcrumbs mechanism as in Swift), one still ends up with WTF-8 underneath, which maps to WTF-16 but is not UTF-8. Hence why I think it's impossible: the only way to pull this off is by replacing the affected string APIs (and/or accepting that old APIs then throw or mutate). Likewise, I see a possible future where JS adopts breadcrumbs, but then with WTF-8 (and perhaps a well-formedness flag), not guaranteed UTF-8. In your use case, that would yield a fast path if a string is well-formed, but still with the same old fallback. Plus, of course, having a systems fast path implies that there is a corresponding JS-interop slow path (when using AS).
PyPy uses UTF-8 internally and it's completely hidden from the user. That's only possible, however, because Python always leaked the UCS-2/UCS-4 distinction to user code, so you could never really rely on anything.
I expect other languages to make the switch sooner or later.
I do think, though, that this is not all that interesting for the issue here. WASI needs to pick some format, and picking UTF-8 is fine. Roundtripping half-broken UTF-16 is not something that needs preserving.
I think enforcing UTF-8 there won’t be much of an issue in practice.
I guess we are about to find out whether there is substance to the precedents. My bet is on "what can go wrong, will go wrong", even more so on Web scale. Let's hope I'm wrong.
Since the title doesn't say much, it's worth mentioning that AssemblyScript removed WASI support. There are still ways to get it to work, but they are clearly dissatisfied.