Three cheers for (draft stage) progress on a Sanitizer API! It’s gospel that you can’t trust user input. And indeed, any app I’ve ever worked on has dealt with bad actors trying to slip in and execute nefarious code somewhere it shouldn’t.
It’s the web developer’s job to clean user input before it is used again on the page (or stored, or used server-side). This is typically done with our own code, or with libraries pulled in to help. We might write a RegEx to strip anything that looks like HTML (or the like), which carries the risk of bugs, and of bad actors finding a way around whatever our code is doing.
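To see why that’s risky, here’s a deliberately naive, hypothetical version of that hand-rolled approach; a tag-stripping RegEx like this is exactly the kind of code attackers learn to slip past with malformed markup or encoding tricks:

// A naive, hand-rolled "sanitizer": strip anything that looks like a tag.
// Easy to write, and easy for malformed markup or encoding tricks to defeat.
function naiveStrip(input) {
  return input.replace(/<[^>]*>/g, "");
}

naiveStrip("<em onclick='alert(1);'>Hello!</em>"); // "Hello!"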
Instead of user-land libraries, or dancing with it ourselves, we could let the browser do it:
// some function that turns a string into real nodes
const untrusted_input = to_node("<em onclick='alert(1);'>Hello!</em>");
const sanitizer = new Sanitizer();
sanitizer.sanitize(untrusted_input); // <em>Hello!</em>
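The to_node helper there is hand-waved; one plausible sketch of it (a hypothetical helper, not part of the proposed API) uses a template element, which parses markup into inert nodes without rendering or executing it:

// Hypothetical helper: parse a string into inert DOM nodes.
// Content inside a <template> is parsed but not rendered or executed.
function to_node(html) {
  const template = document.createElement("template");
  template.innerHTML = html;
  return template.content; // a DocumentFragment
}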
Then let it continue to be a browser responsibility over time. As the draft report says:
The browser has a fairly good idea of when it is going to execute code. We can improve upon the user-space libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner, and do so in a way that is much more likely to be maintained and updated along with the browser’s own changing parser implementation.
This kind of thing is web standards at its best. Spot something annoying (and/or dangerous) that tons of people have to do, and step in to make it safer, faster, and better.
Help me out here: I can’t find a reason for sanitizing user input at the browser level. To prevent malicious input from being saved to a database (or output again later), we need to sanitize at the server level. That’s because a malicious user won’t hesitate to look up the API and post the malicious input there directly, circumventing any browser-level checks.
This API is more for information coming out of the database and onto the page, as a final line of defence kind of thing after validating/sanitising the data going into the database.
Systems should not trust each other. Your client app should assume that the response from your API is malicious and has been compromised by, for instance, a man-in-the-middle attack. Always sanitize the inputs/outputs of each system.
If my client app receives some HTML from an API response (a headless CMS, for example), it should ALWAYS sanitize it before rendering. The go-to for this has been https://github.com/cure53/DOMPurify. I wonder how this new spec will compare.
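For comparison, the DOMPurify flow many apps use today looks roughly like this (a sketch; it assumes the library is already loaded, and the .comment selector is just a placeholder):

// Today's common approach: DOMPurify cleans the string before it touches the DOM.
// Its default config strips script tags and event handler attributes.
const dirty = "<em onclick='alert(1);'>Hello!</em>";
const clean = DOMPurify.sanitize(dirty); // "<em>Hello!</em>"
document.querySelector(".comment").innerHTML = clean;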
This can prevent XSS. Yes, you should sanitize server-side, but some applications don’t require storing the data at all. Take a non-persistent chat, for example: you may want to let users share HTML without risking XSS and without any middleware sanitizing on the server. WebRTC, for instance, is client to client, so you would want to protect the receiving end from executing malicious JS.
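As a rough sketch of that receiving end (assuming a dataChannel and a .chat container already exist, and reusing the Sanitizer instance and the to_node helper from the post above):

// Messages arrive straight from another client, so treat them as hostile.
dataChannel.onmessage = (event) => {
  const fragment = to_node(event.data);        // parse into inert nodes
  const clean = sanitizer.sanitize(fragment);  // strip scripts and handlers
  document.querySelector(".chat").append(clean);
};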
This is not about most data, this is specifically for when you want to display user supplied HTML without risking an XSS attack.
While it might seem reasonable to remove any JavaScript from the HTML on the server, parsing HTML is a nontrivial task that browsers handle slightly differently. It is therefore feasible for an attacker to craft an HTML payload that runs JS in Firefox while being completely safe in Chrome or JSDOM.
Since you can’t possibly predict how every browser someone might use to view your website will parse any given HTML, it’s better to store it as-is and later leverage the capabilities of the viewer’s browser to generate safe HTML for that particular browser.
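That render-time step might look something like this (a sketch; the endpoint and container names are made up, it reuses the to_node helper from earlier, and it assumes the Sanitizer API is available in the viewer’s browser):

// Store the user's HTML verbatim on the server; clean it at render time,
// in the viewer's own browser, with that browser's own parser.
async function renderComment(id) {
  const response = await fetch(`/api/comments/${id}`); // hypothetical endpoint
  const storedHtml = await response.text();

  const sanitizer = new Sanitizer();
  const clean = sanitizer.sanitize(to_node(storedHtml));
  document.querySelector("#comments").append(clean);
}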
This sounds great! So many sites need to scrub user input. The code to do it has probably been rewritten a million times, each rewrite with the potential to contain bugs. It makes so much sense for this to be a part of the platform!