Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function to get rid of invalid html #19

Open
tysonzero opened this issue Sep 11, 2021 · 4 comments
Open

Function to get rid of invalid html #19

tysonzero opened this issue Sep 11, 2021 · 4 comments

Comments

@tysonzero
Copy link

I am looking into rendering some user-submitted html, so unsurprisingly I'm planning on using this library to sanitize it.

However:

There is no guarantee for how a browser will display invalid HTML, so there is no guarantee that this function will protect your HTML from being broken by a user's html.

This makes me pretty nervous. Preventing invalid HTML from breaking the rest of the page is pretty essential.

How difficult would it be to add a functionality that would get rid of or fix invalid html to avoid this problem? Or at least invalid html that causes problems in practice.

Are there any known examples of html that is problematic even after sanitization? I was hoping that a parent div with some overflow: hidden would be sufficient.

@snoyberg
Copy link
Member

I don't remember offhand. But my recommendation for sanitizing HTML is to parse and render with html-conduit.

In practice, that comment is probably out of date, since HTML5 does standardize HTML parsing rules. But rendering something more idiomatic is probably a good move.

@tysonzero
Copy link
Author

It does not appear that html-conduit is guaranteed to output valid html5.

I actually found some valid html5 that becomes invalid when round-tripped through html-conduit:

Specifically anything that contains something like <tr><td>foo<td>bar</tr>, which is legal html5 where the second opening <td> automatically closes the previous <td>.

Yeah given all of HTML5's standardized error correction we may just chance it with sanitizeBalance. As long as it's bulletproof from a security perspective, the occasional rendering bug which we can hunt down isn't the end of the world.

A fully html5-compliant parser would definitely be ideal though.

@tysonzero
Copy link
Author

Seems like just about all Haskell html5 parsers have the same <td> bug, including tagsoup and xss-sanitize.

@tysonzero
Copy link
Author

Made a separate issue #20 to address html5 parsing issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants