Cleanup HTML using C#
July 19, 2009 11:57 AM

Using C#, how can I programmatically clean up HTML code?

I need to get rid of multiple spaces (except those within quotes), all new line and tab characters in the HTML.

Is there a ready-made library or API that I can call from my program to clean up HTML on the fly?
posted by inquisitive to Computers & Internet (16 answers total)
 
It looks like htmltidy, the standard tool for what you're trying to accomplish, can be used on your platform and language of choice.
posted by majick at 12:03 PM on July 19, 2009


I am not quite happy with HTMLTidy. It's a COM component in C#, and I was looking for a pure .NET solution. Any example would be appreciated.

Probably a regex search and replace can help.

Remove all tabs and new line characters, and also multiple white spaces other than those which are inside quotes.
posted by inquisitive at 12:06 PM on July 19, 2009


I would suggest regex, it'll give you the fine control you want.

Is this something you want to do for all pages? It might be worth looking into getting IIS to do it for you.
posted by Artw at 12:29 PM on July 19, 2009


Probably a regex search and replace can help.

I don't know C#, but this is how I would tackle the problem in most languages.
posted by grumblebee at 12:30 PM on July 19, 2009


Beware, Regular Expressions cannot describe SGMLs.

<b class="<hi"><!-- <Yeah, not kidding>--><![[Though <people> try to]]> do it all the </b> time.
posted by cmiller at 12:38 PM on July 19, 2009 [4 favorites]


Beware, Regular Expressions cannot describe SGMLs.

Why not?
posted by grumblebee at 1:11 PM on July 19, 2009


Why can't regex describe SGMLs? Because SGMLs are not a regular language. A few seconds of Google yields some concrete examples of problems with regex for HTML.

You *can* use regex as a tokenizer for part of making a parser, but that's not much different than not using regex as a tokenizer when you make the parser.

Want some hands-on experiment? I gave an example up there of some HTML. Here's another. Write a single regex to change every other nested "b" tag to "em", starting with the first.

<b><b><b x="> </b>" y="<b></b>" z="test">"""lkjh<!-- </b><b> asdf --></b>poiu</b></b>
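To see the trouble concretely, here is a small sketch in Python (used only for illustration; these patterns should behave the same way under .NET's Regex). It counts "b" tags in the sample above with two naive patterns. The variable names are made up for the example.

```python
import re

sample = '<b><b><b x="> </b>" y="<b></b>" z="test">"""lkjh<!-- </b><b> asdf --></b>poiu</b></b>'

# Naive patterns that treat every "<b ...>" and "</b>" as a real tag.
naive_open = re.findall(r"<b\b[^>]*>", sample)
naive_close = re.findall(r"</b\s*>", sample)

# The markup really has 3 opening and 3 closing b tags; the extra
# matches sit inside attribute values or inside the comment.
print(len(naive_open), len(naive_close))  # prints "5 6"
```

The regex has no idea that it is inside a quoted attribute or a comment, so it miscounts before you even get to the "every other nested tag" part.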
posted by cmiller at 2:38 PM on July 19, 2009 [3 favorites]


I am not quite happy with HTMLTidy. It's a COM component in C#, and I was looking for a pure .NET solution.

Why can't you just P/Invoke to HTML Tidy? Or let Visual Studio generate .NET wrappers for you? Not sure what the advantage of a pure .NET solution is.
posted by matthewr at 2:46 PM on July 19, 2009


>I need to get rid of multiple spaces (except those within quotes), all new line and tab characters in the HTML.

If you'd left out that "(except those within quotes)" part, this would be solvable by a very simple regular expression along the lines of s/\s+/ /sg -- but if you want to treat attribute text differently (is that what you mean?), yes, you need a parser.
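For reference, a minimal Python equivalent of that substitution (Python only for illustration; the function name is made up). Note that it collapses whitespace everywhere, including inside quoted attribute values.

```python
import re

def collapse_whitespace(html: str) -> str:
    # Replace every run of whitespace (spaces, tabs, newlines) with one space.
    # This does NOT spare whitespace inside quoted attribute values.
    return re.sub(r"\s+", " ", html)

sample = "<p>\n\tHello   world\n</p>"
print(collapse_whitespace(sample))  # prints "<p> Hello world </p>"
```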
posted by AmbroseChapel at 5:20 PM on July 19, 2009


I actually thought that a regex would solve the problem, even though SGML isn't regular, mainly because I had figured that the sub-problem was regular.

But I can think of progressively weirder edge cases where a regex would definitely not act the way you want it to.

You can blow away all the newline and tab characters in a file trivially. Suck in the file as a String, and then just myString.replaceAll('\t', ''); myString.replaceAll('\n', '');.

As for removing multiple spaces? Why not suck the markup into a DOM and then output the normalized source? Normalization already coalesces multiple spaces. And walking through the tree is far more easily proved correct than the regex solution.
posted by Netzapper at 5:37 PM on July 19, 2009


>Suck in the file as a String, and then just myString.replaceAll('\t', ''); myString.replaceAll('\n', '');.

That's probably a terrible idea, because linebreaks and tabs might be functioning as spaces between words in text, or between other inline elements.
posted by AmbroseChapel at 7:03 PM on July 19, 2009


>Suck in the file as a String, and then just myString.replaceAll('\t', ''); myString.replaceAll('\n', '');.

That's probably a terrible idea, because linebreaks and tabs might be functioning as spaces between words in text, or between other inline elements.


Ambiguous specification. He says "all new line and tab characters in the HTML". My snippet certainly does that.

Another option that's less destructive would be this (in pseudo-Java):

String normalize(String html){
    html = html.replaceAll("\n", " "); //replace with spaces
    html = html.replaceAll("\t", " "); //to preserve whitespace
    Document dom = new Document(html); //stand-in for whatever HTML parser you use
    dom.normalize(); //intended to coalesce whitespace; W3C DOM normalize() only merges adjacent text nodes
    return dom.toString(); //deprecated, use transform
}
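
A runnable version of the same idea, sketched in Python with the standard library's html.parser rather than a real DOM (an assumption made to keep it self-contained; the class and function names are invented for the example). It collapses whitespace runs in text only, re-emits start tags exactly as written so quoted attribute values are untouched, and leaves pre, script, style and textarea content alone.

```python
import re
from html.parser import HTMLParser

PRESERVE = {"pre", "script", "style", "textarea"}

class WhitespaceCollapser(HTMLParser):
    """Re-emits HTML, collapsing whitespace runs in ordinary text only."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve = 0  # depth inside whitespace-sensitive elements

    def handle_starttag(self, tag, attrs):
        if tag in PRESERVE:
            self.preserve += 1
        # Emit the tag verbatim so attribute values keep their spacing.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PRESERVE and self.preserve:
            self.preserve -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data if self.preserve else re.sub(r"\s+", " ", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def collapse_text_whitespace(html: str) -> str:
    parser = WhitespaceCollapser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(collapse_text_whitespace('<p title="a   b">\n\tHello   <b>big</b>\n world</p>'))
```

Newlines and tabs become single spaces instead of vanishing, so words that were separated only by a line break stay separated.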

posted by Netzapper at 7:15 PM on July 19, 2009


The pseudocode above is the right idea. If you don't need to be aware of the SGML content, you can use regexes. If your transformations depend upon the document structure, then regexes are going to be a pain. (You said "quotes", do you mean "double-quoted strings" or do you mean <blockquote> tags?)

I am from the UNIX world, but I'm sure you can adapt this technique to .NET. I use libxml2 (via the Perl binding; XML::LibXML) to parse the HTML document into an XML DOM. Assuming that the HTML is not totally broken, this is fairly reliable. I then have a DOM that I can easily transform. If I want to remove all extra spaces for text nodes, I can do an xpath search for all text nodes, and process their text, in-place, one at a time. In your case, you can select all non-blockquote nodes, and do what you need to do with those. (It sounds like you want to walk the tree and emit the HTML for each node, never adding any spaces or tabs -- except for quotes.)
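
A rough sketch of that tree walk, in Python with xml.etree.ElementTree instead of libxml2 (so it assumes the input is already well-formed XML, which real HTML often is not). The helper names and the choice to skip only blockquote are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET

SKIP = {"blockquote"}  # elements whose text is left untouched

def squeeze(text):
    # Collapse whitespace runs; pass None through unchanged.
    return re.sub(r"\s+", " ", text) if text else text

def normalize_tree(elem, skipping=False):
    skipping = skipping or elem.tag in SKIP
    if not skipping:
        elem.text = squeeze(elem.text)
    for child in elem:
        normalize_tree(child, skipping)
        # A child's tail is text that belongs to this element's content.
        if not skipping:
            child.tail = squeeze(child.tail)

doc = ET.fromstring(
    "<div><p>some   text\n<b>bold</b>\t more</p>"
    "<blockquote>keep   this</blockquote></div>"
)
normalize_tree(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

Attribute values are never visited, so their whitespace survives without any special handling.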

Anyway; regexes if you don't need to be aware of the tag structure; "XML" parser otherwise.
posted by jrockway at 9:37 PM on July 19, 2009


By the way, be careful using XML parsers on HTML. An XML parser, by default (and some invariably), is going to throw away malformed XML. Valid HTML is not necessarily valid XML.

For example, browsers will happily render this HTML, but it is not XML: <font color=blue><p>Some text with an html &Entity; and <b><i>some bold italic text</b></i>.

XML requires quoted element attributes, paired or self-closed tags, that all entities be defined in the document or DTD, and that nested tags be closed in the proper order. Furthermore, much of the HTML you find in the wild is syntactically broken even for the HTML standard.
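
You can check this quickly with any strict XML parser. Here is a small Python sketch using the standard library's xml.etree.ElementTree (the helper name is made up); a strict .NET XML parser would be expected to reject the same input.

```python
import xml.etree.ElementTree as ET

html = ("<font color=blue><p>Some text with an html &Entity; and "
        "<b><i>some bold italic text</b></i>.")

def parses_as_xml(markup: str) -> bool:
    # A strict XML parser raises on the first well-formedness error.
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(html))          # prints "False"
print(parses_as_xml("<p>ok</p>"))   # prints "True"
```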

Some XML parsers can be convinced to accept the HTML, despite it being malformed. But the results have, in my experience and in contrast to jrockway above, been decidedly unstable and fragile.

Really, I'd be looking for an HTML-specific parser. For instance, how about the HTML Agility Pack? I'm not a .NET guy, so I can't recommend it. But it looks right.
posted by Netzapper at 9:57 PM on July 19, 2009


I am a .NET guy and a big fan of the aforementioned HTML Agility Pack. It follows a similar object model to .NET's System.Xml classes and feels like the HTML library that should have been included in ASP.NET out of the box. Quite how MS didn't think to include an HTML parser in a web dev framework is beyond me.

I wouldn't load it into an XML DOM. System.Xml is way too strict and if you're going to have to use an external library you might as well use a dedicated HTML one.

I've found string manipulation of real world HTML, regex or otherwise, to be a massive headache. As Netzapper says, you can't make any assumptions about its validity. Also, being realistic, a perfect regular expression taking into account all edge cases would render your code unmaintainable magic for your average programmer. A nice straightforward foreach over an object model can be understood and extended by anyone.
posted by guid at 8:23 AM on July 20, 2009


You can't use a single regex, but you can do this by using a regex to split out the stuff inside of quotes, then working on the remaining tokens with a regex to replace multispace sequences (I think it'd be something like \s+ ) with single spaces, then putting it all back together.
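
One way to read that, sketched in Python for illustration (the names are made up): split on double-quoted strings, collapse whitespace only in the pieces outside the quotes, and join everything back up. It assumes double quotes only ever delimit attribute values, so a stray " in body text or inside a comment would throw the pairing off.

```python
import re

# One capturing group, so re.split keeps the quoted strings in the result.
QUOTED = re.compile(r'("[^"]*")')

def collapse_outside_quotes(html: str) -> str:
    parts = QUOTED.split(html)
    # Even indexes are text outside quotes; odd indexes are quoted strings.
    return "".join(
        part if i % 2 else re.sub(r"\s+", " ", part)
        for i, part in enumerate(parts)
    )

print(collapse_outside_quotes('<a  href="x   y">\n  link   text</a>'))
# prints: <a href="x   y"> link text</a>
```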
posted by ignignokt at 8:48 AM on July 20, 2009

