How to Sanitize HTML (Javascript Security)
September 3, 2007 9:34 AM

Is there a safe way to sanitize user submitted HTML to prevent security problems?

I'm working on a website where users can post their own HTML. I want to be as flexible as possible in what I allow. I'm ok with removing Javascript from their HTML though.

So I guess A.) How do I accurately remove javascript? and B.) Are there any other security risks not related to javascript?

For part A, I'm thinking of disallowing the script tag, and perhaps onclick as well. I'm sure I'm missing stuff though.

By the way, the only security risk I've heard of for allowing untrusted Javascript on your site is that user names and passwords can be stolen, and other actions can be done on your site in the name of that user. Is there more to it?
posted by GregX3 to Computers & Internet (31 answers total)

Untrusted Javascript can be used to redirect users to another server which pretends to present the same page as yours does.

Untrusted Javascript can deliver browser infections.

If you really want to be safe, you permit bold, italics, underlines and line breaks and nothing else whatever.
posted by Steven C. Den Beste at 9:49 AM on September 3, 2007


A) This is a very hard problem and new XSS attacks are being discovered every day, even on websites that have moderately "good" XSS protection.

If you really need to do this, I'd suggest copying code from someone who has already dealt with this problem. Here's an xss_clean function from the open source project CodeIgniter.

B) Allowing IMG tags could cause problems, specifically if your site does not properly protect against CSRF attacks.

If someone can inject JavaScript into your site, it runs inside your site's origin, so the same-origin policy no longer stands in its way. It can then perform a number of possibly unwanted actions on your site (basically, reading the data off any page of your site).
posted by null terminated at 9:50 AM on September 3, 2007


I assume PHP.

HTMLPurifier. The library looks complex on the surface (lots of options!) but it isn't - and it is very very configurable. I love it.

You could also take a look at kses and SafeHTML.

I've used all three. Purifier is the most flexible and powerful IME.

There are always problems with hand-sanitizing JavaScript; it's not a simple matter of removing onclick handlers. Don't write your own from scratch if you can help it. For example, there are other handlers which can be abused (one example with WordPress).

In short, allow unsanitized JavaScript through on your page and it is pretty much game over... A simple remote .js embed allows an attacker to steal your clipboard contents, track any clicks/keystrokes you make on a page, and do lots of other nasties. Remember, stats trackers like Urchin/Analytics, Statcounter etc. do all their monitoring with a single .js embed.
posted by geminus at 9:53 AM on September 3, 2007


The main thing to consider is that you need some code that'll parse the tags, retain only whitelisted tags/attributes, then rebuild the markup. Scripts that filter stuff out are generally easy to circumvent. You have to think in terms of what's allowed, not what's forbidden, and err on the side of caution. HTML Purifier is probably a good option.

As well as JavaScript, you'll need to ban CSS, images and object/embed to be completely safe, and restrict links to http(s). If you need to allow images, you could provide an image upload facility (with rigorous limitations and validation) and only allow uploaded images to be referenced in the HTML.
posted by malevolent at 10:16 AM on September 3, 2007
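
The parse / whitelist / rebuild approach described above can be sketched roughly as follows. This is only a minimal illustration in Python using the standard library's html.parser, not a replacement for a maintained library such as HTML Purifier. The tag list, the `ALLOWED_TAGS` table and the `sanitize` function name are assumptions made up for the example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED_TAGS = {
    "a": {"href"},
    "b": set(), "i": set(), "u": set(), "p": set(), "br": set(),
}
VOID_TAGS = {"br"}                   # tags that never get a closing tag
ALLOWED_SCHEMES = {"http", "https"}  # links are restricted to http(s)


def is_safe_link(url):
    """Accept only absolute http(s) URLs; everything else is rejected."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)


class WhitelistSanitizer(HTMLParser):
    """Rebuilds markup from scratch, emitting only whitelisted pieces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag; any text inside is still escaped below
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue  # attribute not whitelisted for this tag
            if name == "href" and not is_safe_link(value):
                continue  # e.g. javascript: or relative URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close everything opened since this tag, innermost first.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        # Close anything the input left open so the output stays balanced.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return parser.result()
```

Note that the output is never a filtered copy of the input. Every tag and attribute that reaches the page is written out by the sanitizer itself, so anything it does not recognise (comments, event handlers, unknown tags) simply never gets emitted.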


It's safest to create your own language, and interpret that. For everything that's not in your language, encode it.
E.g.,

"**Foo<SkriptHaxor!>**"

->

"<strong>Foo&lt;SkriptHaxor!&gt;</strong>"

Make it impossible to pass special characters through to an HTML interpreter. For anything else, you will regret it.
posted by cmiller at 10:21 AM on September 3, 2007
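
A minimal sketch of the idea above, assuming Python and a made-up one-rule markup language in which `**text**` means bold. The `render` function name is hypothetical. The key point is the order: everything is HTML-encoded first, and only afterwards is the private markup translated into tags.

```python
import html
import re

# The only construct in this toy language: **text** becomes <strong>text</strong>.
BOLD = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)


def render(text):
    # Step 1: encode every HTML special character in the user's input.
    escaped = html.escape(text, quote=True)
    # Step 2: translate our own markup. The only tags in the output are
    # the ones this function writes itself.
    return BOLD.sub(r"<strong>\1</strong>", escaped)
```

With this, `render("**Foo<SkriptHaxor!>**")` gives the same result as the example in the post: the angle brackets arrive in the page as `&lt;` and `&gt;`, wrapped in a `<strong>` element.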


This is a great page for seeing how large a problem this really is: http://ha.ckers.org/xss.htm

You need to be able to defend against most of these, some of which are really really hard to detect.
posted by cschneid at 10:26 AM on September 3, 2007


This sounds trickier than I thought.

Maybe it would help to show you exactly what I'm trying to do. This is a website that lets users create their own web generators/utilities. And in the case where the utility generates HTML I want to allow the user to preview the HTML in a browser. Here's an example of how it works now:

http://www.utilitymill.com/utility/Text_Diff

There's some JavaScript that detects whether there is HTML in the text area and provides a link to open that HTML in a new window. I'm thinking perhaps the same JavaScript could also detect anything dangerous and simply not render that link.
posted by GregX3 at 10:36 AM on September 3, 2007


"If you really want to be safe, you permit bold, italics, underlines and line breaks and nothing else whatever."

Really, nothing else?
posted by yohko at 11:02 AM on September 3, 2007


Everything related to XSS only applies to data being stored on your server that will be displayed to other users. If users are only going to be seeing the HTML/javascript themselves and the user generated HTML is not stored on your server, you have nothing to worry about.
posted by null terminated at 11:16 AM on September 3, 2007


"If you really want to be safe, you permit bold, italics, underlines and line breaks and nothing else whatever."

I'd also like to see a justification for this statement.
posted by null terminated at 11:23 AM on September 3, 2007


null terminated, I think he means that you must adopt a closed-with-exceptions approach, rather than open-with-exceptions. We can't predict what the next exploit will be. Heck, we can barely cleanse out what we know of already; once we put Turing machines into our browsers, we were screwed.

So, if you're saying '"dl" and "ul" and "ol" ... also', then yeah, fine. As long as they're simple and not likely to be embraced and extended in Internet Explorer v14.
posted by cmiller at 11:51 AM on September 3, 2007


You want to watch those attributes as well - people can hide some funky stuff in there. That ranges from event handlers like onload or onerror (combined with a faulty src) to JavaScript running off of src or style attributes.
posted by Artw at 11:53 AM on September 3, 2007
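
As a rough illustration of why attribute values need care, here is a small Python sketch that checks a src or href value against a whitelist of schemes. It assumes the value arrives as it appeared in the raw markup, so it decodes HTML entities and strips whitespace and control characters before looking at the scheme, since obfuscated values often hide `javascript:` behind those. The `is_http_url` name is made up for the example.

```python
import html
import re

# ASCII control characters and spaces, which obfuscated values often
# use to disguise the scheme.
_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]+")
_HTTP_URL = re.compile(r"^https?://", re.IGNORECASE)


def is_http_url(raw_value):
    """Return True only if the attribute value is an absolute http(s) URL."""
    decoded = html.unescape(raw_value)            # undo entity encoding
    compact = _CONTROL_OR_SPACE.sub("", decoded)  # drop tabs, newlines, spaces
    return bool(_HTTP_URL.match(compact))
```

This only covers URL-valued attributes. Event handler attributes such as onload and style attributes are better handled by not whitelisting them at all.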


In Perl there is HTML::Scrubber. I use it on this site to allow users to modify pages - but limit them to the following tags:

a p b i u br h1 h2

You absolutely shouldn't have a "banned" list. You must have a whitelist and remove everything else. Also, make sure whatever you use filters attributes. Here is the function I pass everything through before presentation:
use strict;
use warnings;
use HTML::Scrubber;

# cleanse - use an HTML::Scrubber object to remove nefarious markup
sub cleanse {
    my $input = shift;
    # Whitelist: only these tags survive; everything else is stripped.
    my $scrubber = HTML::Scrubber->new( allow => [ qw[ a p b i u br h1 h2 ] ] );
    $scrubber->rules(
        a => {
            href => qr{^https?://}i,  # only absolute http(s) links allowed
            alt  => 1,                # alt attribute allowed
            '*'  => 0,                # deny all other attributes
        },
    );
    return $scrubber->scrub($input);
}

posted by phrontist at 12:01 PM on September 3, 2007


GregX3: Are users going to see each other's pages? If not then you don't really have anything to worry about.
posted by delmoi at 12:02 PM on September 3, 2007


so my website feature is safe then? I'm not sure I understand why.
posted by GregX3 at 12:03 PM on September 3, 2007


delmoi, well, any user can write a utility. A utility could output arbitrary HTML. Any other user could run the utility. So yes, I suppose users do see each other's pages.
posted by GregX3 at 12:04 PM on September 3, 2007


cmiller: "Nothing else whatsoever" makes me wonder if I'm unaware of some attribute of these tags that doesn't exist in other tags. Is bold somehow safer than strikeout? I don't understand why.
posted by null terminated at 12:07 PM on September 3, 2007


GregX3: There are two types of HTML/Javascript on your site. There's the HTML you generate for all your users, and there's the HTML that's generated in javascript.

1) The HTML/Javascript you generate
This needs to be protected from XSS attacks. If someone were able to inject code, it could be displayed to other users and possibly do nasty things.

2) The HTML generated on the client side
This code is only to one user. If that user is malicious and were to modify the code, he is the only person who'd see it. This is equivalent to someone writing a virus and releasing it on his own machine. A virus is not a threat unless the virus writer unleashes it on the world. In the same way, Javascript is not a threat unless it's somehow sent to other users.

If you're familiar with Greasemonkey, this might be more clear. Greasemonkey allows users to inject code into any page they visit. This is equivalent to what you're doing. In both cases, the user (malicious or not) is the only one executing the javascript.
posted by null terminated at 12:14 PM on September 3, 2007


*This code is only displayed to one user.
posted by null terminated at 12:17 PM on September 3, 2007


null terminated, I fall in category 1 then. So I probably need some protection.

It seems to me that all the XSS exploits involve loading another page/resource. So perhaps I could just detect all the top level domains e.g., .org, .com, .ca, etc and not offer to preview HTML in a browser if any of those are detected?
posted by GregX3 at 12:22 PM on September 3, 2007


I think it was hyperbole, but the intention is right. Although you can add style attributes to even simple tags like b and i which, while they won't necessarily run javascript, can make a mess of your page, or pull in images that leave you open to XSS.
posted by bonaldi at 12:22 PM on September 3, 2007


GregX3: In order for you to be vulnerable, you need to be accepting HTML/Javascript from a user, storing this on your server and displaying it to other users. You don't seem to be doing this.
posted by null terminated at 12:33 PM on September 3, 2007


bonaldi: b and i can include javascript (through their attributes), which makes the statement wrong in two directions: permitting the tags is unsafe, and disallowing other tags does not add any protection.
posted by null terminated at 12:36 PM on September 3, 2007


null terminated, any user can make or edit the code for a utility. For example, for the text diff example I linked to, a user could come in and make the output of the utility be arbitrary HTML. (click the edit link on that page to see what I mean)

Think of my website like a wiki for utilities. Does that make sense?
posted by GregX3 at 12:37 PM on September 3, 2007


Ah, yes. I incorrectly assumed you were talking about the HTML in the "output" textarea. You do need to properly sanitize output.

This looks like a very cool website. Make sure you're properly sandboxing the Python environment.
posted by null terminated at 12:44 PM on September 3, 2007


It sounds risky to me. It might be best to display the output of the generator as raw HTML (i.e., encode all the greater-than and less-than signs), and let them copy and paste it into their favourite editor. Would that kill the usefulness of your utils?

null terminated: yes, that's right of course. No more benefit of the doubt for scdb.
posted by bonaldi at 12:54 PM on September 3, 2007
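
Showing the generated output as source rather than rendering it could look something like this in Python. It is a sketch only. The `as_source_listing` name and the choice of a `<pre>` wrapper are assumptions for the example.

```python
import html


def as_source_listing(generated_html):
    """Encode the markup so the browser displays it instead of running it."""
    return "<pre>" + html.escape(generated_html, quote=True) + "</pre>"
```

For example, `as_source_listing('<script>alert("x")</script>')` produces `<pre>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</pre>`, which the browser shows as text that can be copied and pasted.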


bonaldi, I think it kills a lot of the usefulness to not show the HTML generated in a browser though.

I liked the first PHP cleaner approach mentioned near the top of this thread but I'd prefer the function to be in javascript. Anyone know of anything like that?
posted by GregX3 at 1:05 PM on September 3, 2007


GregX3, you said,

"It seems to me that all the XSS exploits involve loading another page/resource. So perhaps I could just detect all the top level domains e.g., .org, .com, .ca, etc and not offer to preview HTML in a browser if any of those are detected?"

I shan't harp on this any more. An "open-with-exceptions" scheme like the one you're planning is doomed. You can never anticipate and plan for all the weird things that might exist in the universe. The only (nearly) safe scheme is to permit nothing, with a list of exceptions. That exceptions list had better be planned out very well.

Good luck.
posted by cmiller at 2:39 PM on September 3, 2007
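
To make the point about "open-with-exceptions" filters concrete, here is a small Python demonstration. Both filters are deliberately naive stand-ins written for this example (a script-tag stripper and a top-level-domain detector like the one proposed in the thread), and the two sample payloads are well-known illustrative patterns, not an exhaustive list.

```python
import re


def naive_strip_scripts(markup):
    """Blacklist approach: delete anything that looks like a script tag."""
    return re.sub(r"</?script[^>]*>", "", markup, flags=re.IGNORECASE)


def mentions_tld(markup):
    """Blacklist approach: look for a few top-level domains in the markup."""
    return re.search(r"\.(com|org|net|ca)\b", markup, re.IGNORECASE) is not None


# Removing the inner tags reassembles the outer ones.
nested = "<scr<script>ipt>alert(1)</scr</script>ipt>"

# No script tag and no domain name, yet it carries an event handler.
handler = '<img src="x" onerror="alert(1)">'
```

Here `naive_strip_scripts(nested)` comes back as a complete `<script>alert(1)</script>` element, and `handler` passes through both checks untouched. A whitelist that rebuilds the markup does not have to anticipate either trick.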


I _believe_ you could host the untrusted data on a subdomain (like unsafe.utilitymill.com) without filtering anything.
posted by null terminated at 3:53 PM on September 3, 2007


There were some subdomain cookie exploits in IE years back, however, and nothing to say there might not be again ...
posted by bonaldi at 6:09 PM on September 3, 2007


This is why vBulletin uses its own non-html language.
posted by smackfu at 8:29 PM on September 3, 2007

