Regex: Text from HTML, no attributes
March 26, 2006 7:15 PM

Regex Madness...filter. How do I pull the text out of an html document without looking at the tag attributes?

I'm using javascript... and I am just stuck. I think my brain is about to explode.

I'm trying to pull certain things out of an html document. Let's say, for simplicity's sake, it looks like this... 'cept with html tags. (Had to change 'em to display here.)
[!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"]
  [meta http-equiv="content-type" content="text/html; charset=windows-1250"]
  [meta name="generator" content="PSPad editor,"]
  [title]Sample Document[/title]
      [img src=""]
      Some text is [a href="fjkj.html"]here[/a]
All I want out of that thing is:
Sample Document
Some text is

Is that possible? I thought I had something working... but I was so wrong.

I tried to spider down through the dom, but I never could get that right either.

As a bonus... is there a particular book/tutorial folks recommend for understanding the mighty regex?
posted by ph00dz to Computers & Internet (26 answers total)
The DOM is the best way to do this... try playing with some of the examples at this site:

Fact of the matter is, it's the best tool for doing this, far better and more reliable than regexps!

getElementsByTagName, for example, would be one way to find the title. There may even be a convenience API these days where you can ask just for the page title.

Once you have an element, you can read its innerText or (more compatible) innerHTML property.

Another good site when shit inevitably breaks in some browsers. :)
posted by symphonik at 7:26 PM on March 26, 2006

Best answer: For the bonus question I would suggest Jeffrey Friedl's Mastering Regular Expressions.
posted by shoesfullofdust at 7:27 PM on March 26, 2006

Best answer: This is also a useful resource for at least pretending to tackle the inhumane beast that is regex.
posted by disillusioned at 7:36 PM on March 26, 2006

Response by poster: The problem with innerHTML is that innerHTML returns all the stuff inside a tag.

So... the body innerHTML contains all the other nodes 'n' stuff.

I tried spidering around in the dom 'till I found the tags with no child nodes, but yeah... that ain't right either.

While I was out walking my dog, it occurred to me that I wasn't entirely clear in what I'm trying to do. Basically, I'm looking to search 'n' replace text on a webpage (any page) without screwing up the images and other attributes.
posted by ph00dz at 7:50 PM on March 26, 2006
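
One way to do the search 'n' replace described above without touching images or attributes is to change only the text nodes. This is a minimal sketch, not anything posted in the thread: replaceInTextNodes is a made-up name, and the plain objects at the bottom are hypothetical stand-ins for DOM nodes so the code can run outside a browser.

```javascript
// Replace `search` with `replacement` in text nodes only, leaving element
// attributes (src, href, ...) untouched. Works on any DOM-like node that
// exposes nodeType, nodeValue and childNodes.
function replaceInTextNodes(node, search, replacement) {
  var ELEMENT_NODE = 1;
  var TEXT_NODE = 3;
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      // split/join replaces every occurrence with no regex escaping needed
      child.nodeValue = child.nodeValue.split(search).join(replacement);
    } else if (child.nodeType === ELEMENT_NODE) {
      replaceInTextNodes(child, search, replacement);
    }
  }
}

// Hypothetical stand-in for <p>Some text is <a href="here.html">here</a></p>
var link = {
  nodeType: 1,
  href: "here.html",
  childNodes: [{ nodeType: 3, nodeValue: "here", childNodes: [] }]
};
var paragraph = {
  nodeType: 1,
  childNodes: [{ nodeType: 3, nodeValue: "Some text is ", childNodes: [] }, link]
};

replaceInTextNodes(paragraph, "here", "there");
```

In a browser the same function should accept document.body directly. A real page would probably also want to skip script and style elements, which this sketch does not do.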

sed \
-e "s/<\\([^>'\"]*\\|'[^']*'\\|\"[^\"]*\"\\)*>//g" \
-e 's/&lt;/</g' \
-e 's/&gt;/>/g' \
-e 's/&quot;/"/g' \
-e 's/&amp;/&/g'

  1. I used GNU sed 4.1.4 (not particularly important) and bash 2.05b (some descendant of bourne shell is necessary for the quoting in the first regex). These should translate pretty well to javascript, but YMMV.
  2. Will behave badly if your HTML is not well-formed
  3. It only translates the "big four" character entity references as defined in section 5.3.2 of HTML 4.01 and does nothing at all for any other references.
  4. On preview: This isn't going to be useful at all for what you're doing because it doesn't retain the tags. But, it may help you out a bit. In particular, caveat 2 is going to bite you pretty hard.
And, oh yeah, what the other guys said.... this is probably a better job for a real (DOM or SAX) parser. And "Mastering Regular Expressions" is the way to go.
posted by Mr Stickfigure at 7:59 PM on March 26, 2006
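
For anyone who wants the sed script above in javascript, here is a rough adaptation rather than an exact port. The function name stripTagsAndDecode is made up for illustration, and the tag regex consumes one character at a time instead of using the nested star from the sed version, which should avoid heavy backtracking on an unclosed tag. The same caveats about badly formed HTML apply.

```javascript
// Strip tags, then decode the "big four" entity references.
function stripTagsAndDecode(html) {
  return html
    // Remove each tag; quoted attribute values may contain ">".
    .replace(/<(?:[^>'"]|'[^']*'|"[^"]*")*>/g, "")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    // Decode &amp; last so "&amp;lt;" becomes "&lt;" and not "<".
    .replace(/&amp;/g, "&");
}

var plain = stripTagsAndDecode('<a href="x>y">here</a> &amp; there');
```

Like the sed version, this throws the tags away, so it only helps when you want the text alone.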

Eliminate all groups contained within angle brackets. Except for non-conforming HTML, that should leave you with body text. Maize no?
posted by five fresh fish at 8:00 PM on March 26, 2006

For the body text, seems like building a crawler to grab all the child nodes of a document, checking if they're text nodes, and concatenating them would do it.

I was going to try and write one as a proof of concept, but then I stumbled across element.textContent. I haven't yet tested it, but it looks like it does exactly what you want (try bodyText = document.getElementsByTagName('body')[0].textContent).

On preview: or perhaps not. Your original question asks for something very different from the question you just posted. It sounds like you want an output of the text in an HTML document that is formatted in such a way that you can recreate the document once you've made changes to it. Is this why you have the text in the link ("here") on a separate line from the text previous to it, even though in the document it would run inline? Are you hoping to change particular "lines" from the output (corresponding to the various text nodes) and then have an automated way of cramming it all back into the proper elements?

Also, why did your spider fail? It's the most intuitive solution, so you might be better off trying to fix it.
posted by chrominance at 8:04 PM on March 26, 2006

And, on second thought: If you can find a SAX parser for Javascript, you'll probably be much happier with it than the DOM parser. Of course, that makes certain assumptions about where your document is coming from and what you need to do with it...
posted by Mr Stickfigure at 8:04 PM on March 26, 2006

Response by poster: Ugh. I think i got it... just adapted something I found here:

Thanks folks... Gotta remember to step away from my problems to walk the dog more often. (Although I would be curious to hear other Regex recommendations... because I don't know if I'll ever feel like a real programmer 'till I learn 'em...)
posted by ph00dz at 8:08 PM on March 26, 2006

You can do anything with Javascript regular expressions that you can do with other languages' regular expressions.

As others have said, it's probably better to use the DOM and regular expressions. Use the DOM to identify which bit of the page you're working on and use the regex on just that piece.

I still don't know what you want exactly, but, back with an example in a second.
posted by AmbroseChapel at 8:22 PM on March 26, 2006

Here's something:
<html>
<head>
  <title>replace js</title>
  <script>
  function replace() {
    // use the DOM to get the contents of that div
    extractedHTML = document.getElementById('favouritecolour').innerHTML;
    /*
      any colour will match because the regex is for
      "any string of characters up to the next period".
    */
    reg = new RegExp("My favourite colour is [^.]+.");
    // whatever colour they put, change it to orange
    replacementHTML = extractedHTML.replace(reg, "My favourite colour is orange");
    // put the tweaked HTML back in to the div
    document.getElementById('favouritecolour').innerHTML = replacementHTML;
  }
  </script>
</head>
<body>
  <p>blah blah blah</p>
  <div id="favouritecolour">My favourite colour is blue. </div>
  <p><a href="javascript:replace()">replace</a></p>
  <p>blah blah blah</p>
</body>
</html>

posted by AmbroseChapel at 8:26 PM on March 26, 2006

Best answer: liorean's article Regular Expressions in JavaScript might be a good place to start.
posted by shoesfullofdust at 8:27 PM on March 26, 2006

So when you click on the "replace" link in that document, it should replace "my favourite colour is blue" with "my favourite colour is orange", and because it's a regex, it'll work with any colour, no only blue.
posted by AmbroseChapel at 8:28 PM on March 26, 2006
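
The regex part of that example can be tried on a plain string, with no page or DOM involved. This is only a sketch: setFavouriteColour is a made-up name, and unlike the posted version it escapes the final period (a bare . matches any character) and puts the period back in the replacement.

```javascript
// Swap whatever colour is named for the one passed in.
function setFavouriteColour(html, colour) {
  // [^.]+ matches everything up to the next period; \. matches the period itself.
  var pattern = /My favourite colour is [^.]+\./;
  // A replacer function keeps "$" in the colour from being treated specially.
  return html.replace(pattern, function () {
    return "My favourite colour is " + colour + ".";
  });
}

var result = setFavouriteColour("<div>My favourite colour is blue. </div>", "orange");
```

Here result should hold the same div with "blue" changed to "orange" and the rest of the markup left alone.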

Best answer: Here's the actual code, in case someone stumbles on this later and needs it:
replaceSearchTerms('ph00dz', 'dumkopf'); 

// this code came from
// it was originally a highlighter... but repurposed for this thing!
function doReplace(bodyText, searchTerm, replaceText)
{
  // find all occurrences of the search term in the given text,
  // and swap in the replacement text (we're not using a
  // regular expression search, because we want to filter out
  // matches that occur within HTML tags and script blocks, so
  // we have to do a little extra validation)
  var newText = "";
  var i = -1;
  var lcSearchTerm = searchTerm.toLowerCase();
  var lcBodyText = bodyText.toLowerCase();
  while (bodyText.length > 0) {
    i = lcBodyText.indexOf(lcSearchTerm, i+1);
    if (i < 0) {
      newText += bodyText;
      bodyText = "";
    } else {
      // skip anything inside an HTML tag
      if (bodyText.lastIndexOf(">", i) >= bodyText.lastIndexOf("<", i)) {
        // skip anything inside a <script> block
        if (lcBodyText.lastIndexOf("/script>", i) >= lcBodyText.lastIndexOf("<script", i)) {
          newText += bodyText.substring(0, i) + replaceText;
          bodyText = bodyText.substr(i + searchTerm.length); // bodyText.substr(i, searchTerm.length)
          lcBodyText = bodyText.toLowerCase();
          i = -1;
          } // end if
        } // end if
      } // end else
    } // end while
  return newText;
  } // end function

function replaceSearchTerms(searchText, replaceText)
{
  var searchArray = [searchText];
  var bodyText = content.document.body.innerHTML;
  for (var i = 0; i < searchArray.length; i++) {
    bodyText = doReplace(bodyText, searchArray[i], replaceText);
  }
  content.document.body.innerHTML = bodyText;
  return true;
}

posted by ph00dz at 8:35 PM on March 26, 2006
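
The "skip anything inside an HTML tag" test in that code comes down to comparing the last ">" and the last "<" found before the match. A small sketch of just that check, pulled out into a hypothetical helper called isInsideTag (not part of the posted code):

```javascript
// True when `index` falls between a "<" and its closing ">", i.e. the most
// recent "<" before the index has not been closed yet.
function isInsideTag(html, index) {
  return html.lastIndexOf(">", index) < html.lastIndexOf("<", index);
}

var sample = '<a href="here.html">here</a>';
var first = sample.indexOf("here");             // inside the href attribute
var second = sample.indexOf("here", first + 1); // the visible link text

var firstInside = isInsideTag(sample, first);
var secondInside = isInsideTag(sample, second);
```

Only the second match is safe to replace. One caveat: a ">" inside a quoted attribute value will fool this check, so it is probably best kept for reasonably tidy markup.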

s/no only blue/not only blue/
posted by AmbroseChapel at 8:46 PM on March 26, 2006

Regular expressions are overkill for something like this. I have two generic functions that I like to use.

The first generic function is cleanWhitespace. This function goes through an entire XML document and finds all text nodes that contain nothing but whitespace and removes them.
function cleanWhitespace( element ) {
    // If no element is provided, do the whole HTML document
    element = element || document;
    // Use the first child as a starting point
    var cur = element.firstChild;

    // Go until there are no more child nodes
    while ( cur != null ) {
        // Remember the next sibling now: once cur is removed,
        // cur.nextSibling is null and the loop would stop early
        var next = cur.nextSibling;

        // If the node is a text node, and it contains nothing but whitespace
        if ( cur.nodeType == 3 && ! /\S/.test(cur.nodeValue) ) {
            // Remove the text node
            element.removeChild( cur );

        // Otherwise, if it’s an element
        } else if ( cur.nodeType == 1 ) {
            // Recurse down through the document
            cleanWhitespace( cur );
        }

        cur = next; // Move through the child nodes
    }
}

The second generic function is text. This function retrieves the text contents of an element. Calling text(Element) will return a string containing the combined text contents of the element and all child elements that it contains.
function text(e) {
    var t = "";

    // If an element was passed, get its children,
    // otherwise assume it’s an array
    e = e.childNodes || e;

    // Look through all child nodes
    for ( var j = 0; j < e.length; j++ ) {
        // If it’s not an element, append its text value
        // Otherwise, recurse through all the element’s children
        t += e[j].nodeType != 1 ?
            e[j].nodeValue : text(e[j].childNodes);
    }

    // Return the matched text
    return t;
}

So, using both of those functions together, it would look something like this:
// Remove the extraneous whitespace from the document
cleanWhitespace();
// Get all the 'good' text
var myText = text(document);

and that's it! myText now contains all the text that you need! I hope this helps.

[plug] All this and more can be found in my upcoming book Professional Javascript Techniques. [/plug]
posted by jeresig at 8:54 PM on March 26, 2006

ungh, it demolished my whitespace and it seems as if you've already solved your problem. Just not my day today.
posted by jeresig at 8:57 PM on March 26, 2006

Have a look at the String.stripTags() function found in the prototype javascript library. I just did a quick test and it does exactly what you need with a single regular expression.

A good tutorial and reference on the library can be found on
posted by chrisch at 10:06 PM on March 26, 2006

this is trivial with xsl; google have released an xsl implementation in javascript.
posted by andrew cooke at 5:48 AM on March 27, 2006

Didn't want to sound too not-having-a-clue, but viewing the HTML in a browser and cutting/pasting what you see is perhaps far too simple?
posted by vanoakenfold at 8:48 AM on March 27, 2006

if you have the contents in a string in ruby:

mystring.gsub!(/<[\/]?.*?>/, ' ')

in PHP:

$mystring = preg_replace('/<[\/]?.*?>/', ' ', $mystring);

Then mystring will hold just the content you want. I replace with a space instead of nothing so you don't end up with words running into each other if stripping out things like a <br> that don't have a space around it.
posted by chrisroberts at 9:38 AM on March 27, 2006

Screw the dom, real javascript ninjas use regex (you can paste this in the urlbar to see if it does basically what you want):

javascript:for(var i=0; !document.childNodes[i].innerHTML; ++i); document.body.innerHTML=document.childNodes[i].innerHTML.replace(/< .*?>/gm, ');

The loop is to find the first non-empty node, which should contain the whole document if it's well formed. You'll need to beef up the regex a bit if you want it to remove CSS/Script blocks as well, but if you haven't moved on already I'd be happy to help with that.
posted by moift at 10:10 AM on March 27, 2006

There's not supposed to be a space between the < and the . in the regex, but I couldn't get it to show up right.
posted by moift at 10:11 AM on March 27, 2006

Also, the lone quote mark at the end should be a double (empty string)
posted by moift at 10:22 AM on March 27, 2006
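
On beefing up the regex to drop CSS/Script blocks as well: one possible approach, offered as a sketch and not as moift's version, is to remove whole script and style elements first and strip the remaining tags afterwards. The name stripMarkup is made up, and this will still trip over malformed HTML or a ">" inside an attribute value.

```javascript
function stripMarkup(html) {
  return html
    // Drop <script>...</script> and <style>...</style> along with their
    // contents; \1 makes the closing tag match the opening one.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Then drop every remaining tag.
    .replace(/<[^>]*>/g, "");
}

var stripped = stripMarkup('a<script>var x = "<b>";</script>b<style>p{}</style><i>c</i>');
```

Doing the two passes in this order matters, because stripping tags first would leave the script and stylesheet source behind as if it were body text.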

There might be a much easier way to do this - the links or lynx text-mode web browsers and their "dump" option:
mrbill@ohno:~> links -dump test.html
   Some text is here

mrbill@ohno:~> lynx -dump test.html


   Some text is [1]here


   1. file://localhost/disk/home/mrbill/fjkj.html

posted by mrbill at 11:56 PM on March 27, 2006

Perhaps I'm missing something? Are you after the text for some further action in your js code? If not and you just want a plain text version of it, use textutil if you are on OS X.
posted by jxpx777 at 2:40 PM on March 28, 2006
