<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Regex: Text from HTML, no attributes</title>
	<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes/</link>
	<description>Comments on Ask MetaFilter post Regex: Text from HTML, no attributes</description>
	<pubDate>Sun, 26 Mar 2006 19:26:44 -0800</pubDate>
	<lastBuildDate>Sun, 26 Mar 2006 19:26:44 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Regex: Text from HTML, no attributes</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes</link>	
		<description>Regex Madness...filter. How do I pull the text out of an html document without looking at the tag attributes? &lt;br /&gt;&lt;br /&gt; I&apos;m using javascript... and I am just stuck. I think my brain is about to explode.&lt;br&gt;
&lt;br&gt;
I&apos;m trying to pull certain things out of an html document. Let&apos;s say, for simplicity&apos;s sake, it looks like this... &apos;cept with html tags. (Had to change &apos;em to display here.)&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;&lt;br&gt;
[!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot;]&lt;br&gt;
[html]&lt;br&gt;
  [head]&lt;br&gt;
  [meta http-equiv=&quot;content-type&quot; content=&quot;text/html; charset=windows-1250&quot;]&lt;br&gt;
  [meta name=&quot;generator&quot; content=&quot;PSPad editor, www.pspad.com&quot;]&lt;br&gt;
  [title]Sample Document[/title]&lt;br&gt;
  [/head]&lt;br&gt;
  [body]&lt;br&gt;
    [p]&lt;br&gt;
      [img src=&quot;http://blah.com/sample.jpg&quot;]&lt;br&gt;
    [/p]&lt;br&gt;
    [p]&lt;br&gt;
      Some text is [a href=&quot;fjkj.html&quot;]here[/a]&lt;br&gt;
    [/p]&lt;br&gt;
  [/body]&lt;br&gt;
[/html]&lt;br&gt;
&lt;/pre&gt;&lt;br&gt;
&lt;br&gt;
All I want out of that thing is:&lt;br&gt;
Sample Document&lt;br&gt;
Some text is&lt;br&gt;
here&lt;br&gt;
&lt;br&gt;
Is that possible? I thought I had something working... but I was so wrong.&lt;br&gt;
&lt;br&gt;
I tried to spider down through the dom, but I never could get that right either.&lt;br&gt;
&lt;br&gt;
As a bonus... is there a particular book/tutorial folks recommend for understanding the mighty regex?</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2006:site.35120</guid>
		<pubDate>Sun, 26 Mar 2006 19:15:47 -0800</pubDate>
		<dc:creator>ph00dz</dc:creator>
		
			<category>regular</category>
		
			<category>expressions</category>
		
			<category>regex</category>
		
			<category>javascript</category>
		
			<category>dhtml</category>
		
	</item> <item>
		<title>By: symphonik</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547011</link>	
		<description>The DOM is the best way to do this... try playing with some of the examples at this site:&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.w3schools.com/js/js_examples_3.asp&quot;&gt;http://www.w3schools.com/js/js_examples_3.asp&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Fact of the matter is, it&apos;s the best tool for doing this, far better and more reliable than regexps!&lt;br&gt;
&lt;br&gt;
getElementsByTagName, for example, would be one way to find the title. There may even be a convenience API these days where you can ask just for the page title.&lt;br&gt;
&lt;br&gt;
Once you have an element, you can read its innerText or (more compatible) innerHTML property.&lt;br&gt;
&lt;br&gt;
Another &lt;a href=&quot;http://www.quirksmode.org/dom/&quot;&gt;good site&lt;/a&gt; when shit inevitably breaks in some browsers. :)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547011</guid>
		<pubDate>Sun, 26 Mar 2006 19:26:44 -0800</pubDate>
		<dc:creator>symphonik</dc:creator>
	</item><item>
		<title>By: shoesfullofdust</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547014</link>	
		<description>For the bonus question I would suggest Jeffrey Friedl&apos;s &lt;a href=&quot;http://regex.info/&quot; title=&quot;Jeffrey Friedl&apos;s Mastering Regular Expressions&quot;&gt;Mastering Regular Expressions&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547014</guid>
		<pubDate>Sun, 26 Mar 2006 19:27:57 -0800</pubDate>
		<dc:creator>shoesfullofdust</dc:creator>
	</item><item>
		<title>By: disillusioned</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547018</link>	
		<description>&lt;a href=&quot;http://www.tote-taste.de/X-Project/regex/index.php&quot;&gt;This&lt;/a&gt; is also a useful resource for at least pretending to tackle the inhumane beast that is regex.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547018</guid>
		<pubDate>Sun, 26 Mar 2006 19:36:38 -0800</pubDate>
		<dc:creator>disillusioned</dc:creator>
	</item><item>
		<title>By: ph00dz</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547027</link>	
		<description>The problem with innerHTML is that innerHTML returns all the stuff inside a tag.&lt;br&gt;
&lt;br&gt;
So... the body innerHTML contains all the other nodes &apos;n&apos; stuff.&lt;br&gt;
&lt;br&gt;
I tried spidering around in the dom &apos;till I found the tags with no child nodes, but yeah... that ain&apos;t right either. &lt;br&gt;
&lt;br&gt;
While I was out walking my dog, it occurred to me that I wasn&apos;t entirely clear in what I&apos;m trying to do. Basically, I&apos;m looking to search &apos;n&apos; replace text on a webpage (any page) without screwing up the images and other attributes.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547027</guid>
		<pubDate>Sun, 26 Mar 2006 19:50:58 -0800</pubDate>
		<dc:creator>ph00dz</dc:creator>
	</item><item>
		<title>By: Mr Stickfigure</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547034</link>	
		<description>sed \&lt;br&gt;
	-e &quot;s/&amp;lt;\\([^&amp;gt;&apos;\&quot;]*\\|&apos;[^&apos;]*&apos;\\|\&quot;[^\&quot;]*\&quot;\\)*&amp;gt;//g&quot; \&lt;br&gt;
	-e &apos;s/&amp;amp;lt;/&amp;lt;/g&apos; \&lt;br&gt;
	-e &apos;s/&amp;amp;gt;/&amp;gt;/g&apos; \&lt;br&gt;
	-e &apos;s/&amp;amp;quot;/&quot;/g&apos; \&lt;br&gt;
	-e &apos;s/&amp;amp;amp;/&amp;amp;/g&apos;&lt;br&gt;
&lt;br&gt;
Caveats:&lt;br&gt;
&lt;ol&gt;&lt;li&gt;I used GNU sed 4.1.4 (not particularly important) and bash 2.05b (some descendant of bourne shell is necessary for the quoting in the first regex).  These &lt;em&gt;should&lt;/em&gt; translate pretty well to javascript, but YMMV.&lt;/li&gt;&lt;li&gt;Will behave badly if your HTML is not well-formed.&lt;/li&gt;&lt;li&gt;It only translates the &quot;big four&quot; character entity references as defined in section 5.3.2 of HTML 4.01 and does nothing at all for any other references.&lt;/li&gt;&lt;li&gt;On preview: This isn&apos;t going to be useful at all for what you&apos;re doing because it doesn&apos;t retain the tags.  But, it may help you out a bit.  In particular, caveat 2 is going to bite you pretty hard.&lt;/li&gt;&lt;/ol&gt;&lt;br&gt;
&lt;small&gt;And, oh yeah, what the other guys said.... this is probably a better job for a real (DOM or SAX) parser.  And &quot;Mastering Regular Expressions&quot; is the way to go.&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547034</guid>
		<pubDate>Sun, 26 Mar 2006 19:59:59 -0800</pubDate>
		<dc:creator>Mr Stickfigure</dc:creator>
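		<!--
The sed pipeline in the comment above might translate to JavaScript roughly as follows. This is a minimal sketch under assumptions: the function name stripTagsAndDecode is hypothetical, the tag pattern is adapted from the sed expression rather than copied from it, and, like the original, it only decodes the four entities named in the caveats and will misbehave on malformed HTML.

```javascript
// Hypothetical helper: strips tags, then decodes the "big four" entities.
function stripTagsAndDecode(html) {
  // A tag is "<", then any run of plain characters or quoted attribute
  // values (so a ">" inside quotes does not end the tag), then ">".
  var tagPattern = /<(?:[^>'"]|'[^']*'|"[^"]*")*>/g;

  return html
    .replace(tagPattern, '')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    // Decode &amp; last so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    .replace(/&amp;/g, '&');
}
```

Calling stripTagsAndDecode('<p>Some text is <a href="fjkj.html">here</a></p>') returns "Some text is here". As the fourth caveat says, the tags are discarded, so this suits extraction rather than in-place editing.
		-->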
	</item><item>
		<title>By: five fresh fish</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547035</link>	
		<description>Eliminate all groups contained within angle brackets.  Except for non-conforming HTML, that should leave you with body text.  Maize no?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547035</guid>
		<pubDate>Sun, 26 Mar 2006 20:00:04 -0800</pubDate>
		<dc:creator>five fresh fish</dc:creator>
	</item><item>
		<title>By: chrominance</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547038</link>	
		<description>For the body text, it seems like building a crawler to grab all the child nodes of a document, checking if they&apos;re text nodes, and concatenating them would do it.&lt;br&gt;
&lt;br&gt;
I was going to try and write one as a proof of concept, but then I stumbled across &lt;a href=&quot;http://developer.mozilla.org/en/docs/DOM:element.textContent&quot;&gt;element.textContent.&lt;/a&gt; I haven&apos;t yet tested it, but it looks like it does exactly what you want (try &lt;code&gt;bodyText = document.getElementsByTagName(&apos;body&apos;)[0].textContent&lt;/code&gt;).&lt;br&gt;
&lt;br&gt;
On preview: or perhaps not. Your original question asks for something very different from the question you just posted. It sounds like you want an output of the text in an HTML document that is formatted in such a way that you can recreate the document once you&apos;ve made changes to it. Is this why you have the text in the link (&quot;here&quot;) on a separate line from the text previous to it, even though in the document it would run inline? Are you hoping to change particular &quot;lines&quot; from the output (corresponding to the various text nodes) and then have an automated way of cramming it all back into the proper elements?&lt;br&gt;
&lt;br&gt;
Also, why did your spider fail? It&apos;s the most intuitive solution, so you might be better off trying to fix it.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547038</guid>
		<pubDate>Sun, 26 Mar 2006 20:04:01 -0800</pubDate>
		<dc:creator>chrominance</dc:creator>
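		<!--
The text-node crawl described in the comment above can also do the search and replace directly, which leaves tags and attributes untouched. This is a minimal sketch, not anyone's posted solution: the name replaceInTextNodes and its parameters are hypothetical, and it assumes nodes expose the standard nodeType, nodeName, nodeValue and childNodes properties.

```javascript
// DOM node type constants (1 = element, 3 = text).
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Hypothetical helper: applies String.replace to every text node under
// `node`, skipping the contents of script and style elements.
function replaceInTextNodes(node, pattern, replacement) {
  if (node.nodeType === TEXT_NODE) {
    node.nodeValue = node.nodeValue.replace(pattern, replacement);
    return;
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return; // comments, doctype and other node types are left alone
  }
  var tag = (node.nodeName || '').toLowerCase();
  if (tag === 'script' || tag === 'style') {
    return;
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i += 1) {
    replaceInTextNodes(children[i], pattern, replacement);
  }
}
```

In a browser this would be called as replaceInTextNodes(document.body, /ph00dz/gi, 'dumkopf'). Because only nodeValue is rewritten, an img src or a href containing the search term is never modified.
		-->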
	</item><item>
		<title>By: Mr Stickfigure</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547039</link>	
		<description>And, on second thought: If you can find a SAX parser for Javascript, you&apos;ll probably be much happier with it than the DOM parser.  Of course, that makes certain assumptions about where your document is coming from and what you need to do with it...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547039</guid>
		<pubDate>Sun, 26 Mar 2006 20:04:47 -0800</pubDate>
		<dc:creator>Mr Stickfigure</dc:creator>
	</item><item>
		<title>By: ph00dz</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547042</link>	
		<description>Ugh. I think i got it... just adapted something I found here:&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.nsftools.com/misc/SearchAndHighlight.htm&quot;&gt;http://www.nsftools.com/misc/SearchAndHighlight.htm&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Thanks folks... Gotta remember to step away from my problems to walk the dog more often. (Although I would be curious to hear other Regex recommendations... because I don&apos;t know if I&apos;ll ever feel like a real programmer &apos;till I learn &apos;em...)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547042</guid>
		<pubDate>Sun, 26 Mar 2006 20:08:39 -0800</pubDate>
		<dc:creator>ph00dz</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547048</link>	
		<description>You can do anything with Javascript regular expressions that you can do with other languages&apos; regular expressions.&lt;br&gt;
&lt;br&gt;
As others have said, it&apos;s probably better to use the DOM &lt;em&gt;and&lt;/em&gt; regular expressions. Use the DOM to identify which bit of the page you&apos;re working on and use the regex on just that piece.&lt;br&gt;
&lt;br&gt;
I still don&apos;t know what you want exactly, but, back with an example in a second.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547048</guid>
		<pubDate>Sun, 26 Mar 2006 20:22:26 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547049</link>	
		<description>Here&apos;s something:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;&amp;lt;html&amp;gt;&lt;br&gt;&amp;lt;head&amp;gt;&lt;br&gt;	&amp;lt;title&amp;gt;replace js&amp;lt;/title&amp;gt; &amp;lt;script&amp;gt;&lt;br&gt;function replace(){&lt;br&gt;extractedHTML = document.getElementById(&apos;favouritecolour&apos;).innerHTML;&lt;br&gt;// use the DOM to get the contents of that div&lt;br&gt;reg = new RegExp(&quot;My favourite colour is [^.]+.&quot;);&lt;br&gt;/* &lt;br&gt;  any colour will match because the regex is for &lt;br&gt;  &quot;any string of characters up to the next period&quot;. &lt;br&gt;*/&lt;br&gt;replacementHTML = extractedHTML.replace(reg, &quot;My favourite colour is orange&quot;);&lt;br&gt;// whatever colour they put, change it to orange&lt;br&gt;document.getElementById(&apos;favouritecolour&apos;).innerHTML = replacementHTML;&lt;br&gt;// put the tweaked HTML back in to the div&lt;br&gt;}&lt;br&gt;&amp;lt;/script&amp;gt; &lt;br&gt;&amp;lt;/head&amp;gt;&lt;br&gt;&amp;lt;body&amp;gt;&lt;br&gt;&amp;lt;p&amp;gt;&lt;br&gt;	blah blah blah&lt;br&gt;&amp;lt;/p&amp;gt;&lt;br&gt;&amp;lt;div id=&quot;favouritecolour&quot;&amp;gt;&lt;br&gt;	My favourite colour is blue. &lt;br&gt;&amp;lt;/div&amp;gt;&lt;br&gt;&amp;lt;p&amp;gt;&lt;br&gt;	&amp;lt;a href=&quot;javascript:replace()&quot;&amp;gt;replace&amp;lt;/a&amp;gt;&lt;br&gt;&amp;lt;/p&amp;gt;&lt;br&gt;&amp;lt;p&amp;gt;&lt;br&gt;	blah blah blah&lt;br&gt;&amp;lt;/p&amp;gt;&lt;br&gt;&amp;lt;/body&amp;gt;&lt;br&gt;&amp;lt;/html&amp;gt;&lt;br&gt;&lt;/pre&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547049</guid>
		<pubDate>Sun, 26 Mar 2006 20:26:52 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: shoesfullofdust</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547050</link>	
		<description>liorean&apos;s article &lt;a href=&quot;http://www.evolt.org/article/headline/17/36435/index.html&quot; title=&quot;Regular Expressions in JavaScript | evolt.org&quot;&gt;Regular Expressions in JavaScript&lt;/a&gt; at evolt.org might be a good place to start.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547050</guid>
		<pubDate>Sun, 26 Mar 2006 20:27:35 -0800</pubDate>
		<dc:creator>shoesfullofdust</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547051</link>	
		<description>So when you click on the &quot;replace&quot; link in that document, it should replace &quot;my favourite colour is blue&quot; with &quot;my favourite colour is orange&quot;, and because it&apos;s a regex, it&apos;ll work with any colour, no only blue.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547051</guid>
		<pubDate>Sun, 26 Mar 2006 20:28:31 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: ph00dz</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547053</link>	
		<description>Here&apos;s the actual code, in case someone stumbles on this later and needs it:&lt;br&gt;
&lt;br&gt;
&lt;pre&gt;&lt;br&gt;
replaceSearchTerms(&apos;ph00dz&apos;, &apos;dumkopf&apos;); &lt;br&gt;
&lt;br&gt;
&lt;br&gt;
// this code came from http://www.nsftools.com/misc/SearchAndHighlight.htm&lt;br&gt;
// it was originally a highlighter... but repurposed for this thing!&lt;br&gt;
function doReplace(bodyText, searchTerm, replaceText) &lt;br&gt;
  {&lt;br&gt;
  // find all occurrences of the search term in the given text,&lt;br&gt;
  // and add some &quot;highlight&quot; tags to them (we&apos;re not using a&lt;br&gt;
  // regular expression search, because we want to filter out&lt;br&gt;
  // matches that occur within HTML tags and script blocks, so&lt;br&gt;
  // we have to do a little extra validation)&lt;br&gt;
  var newText = &quot;&quot;;&lt;br&gt;
  var i = -1;&lt;br&gt;
  var lcSearchTerm = searchTerm.toLowerCase();&lt;br&gt;
  var lcBodyText = bodyText.toLowerCase();&lt;br&gt;
    &lt;br&gt;
  while (bodyText.length &amp;gt; 0)&lt;br&gt;
    {&lt;br&gt;
    i = lcBodyText.indexOf(lcSearchTerm, i+1);&lt;br&gt;
    if (i &amp;lt; 0)&lt;br&gt;
      {&lt;br&gt;
      newText += bodyText;&lt;br&gt;
      bodyText = &quot;&quot;;&lt;br&gt;
      } &lt;br&gt;
    else&lt;br&gt;
      {&lt;br&gt;
      // skip anything inside an HTML tag&lt;br&gt;
      if (bodyText.lastIndexOf(&quot;&amp;gt;&quot;, i) &amp;gt;= bodyText.lastIndexOf(&quot;&amp;lt;&quot;, i))&lt;br&gt;
        {&lt;br&gt;
        // skip anything inside a &amp;lt;script&amp;gt; block&lt;br&gt;
        if (lcBodyText.lastIndexOf(&quot;/script&amp;gt;&quot;, i) &amp;gt;= lcBodyText.lastIndexOf(&quot;&amp;lt;script&quot;, i))&lt;br&gt;
          {&lt;br&gt;
          newText += bodyText.substring(0, i) + replaceText;&lt;br&gt;
          bodyText = bodyText.substr(i + searchTerm.length); // bodyText.substr(i, searchTerm.length)&lt;br&gt;
          lcBodyText = bodyText.toLowerCase();&lt;br&gt;
          i = -1;&lt;br&gt;
          } // end if&lt;br&gt;
        } // end if&lt;br&gt;
      } // end else&lt;br&gt;
    } // end while&lt;br&gt;
  &lt;br&gt;
  return newText;&lt;br&gt;
  } // end function&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
function replaceSearchTerms(searchText, replaceText)&lt;br&gt;
  {&lt;br&gt;
  searchArray = [searchText];&lt;br&gt;
  var bodyText = content.document.body.innerHTML;&lt;br&gt;
  for (var i = 0; i &amp;lt; searchArray.length; i++)&lt;br&gt;
    {&lt;br&gt;
    bodyText = doReplace(bodyText, searchArray[i], replaceText);&lt;br&gt;
    }&lt;br&gt;
  &lt;br&gt;
  content.document.body.innerHTML = bodyText;&lt;br&gt;
  return true;&lt;br&gt;
  }&lt;br&gt;
&lt;br&gt;
&lt;/pre&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547053</guid>
		<pubDate>Sun, 26 Mar 2006 20:35:58 -0800</pubDate>
		<dc:creator>ph00dz</dc:creator>
	</item><item>
		<title>By: AmbroseChapel</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547058</link>	
		<description>s/no only blue/not only blue/</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547058</guid>
		<pubDate>Sun, 26 Mar 2006 20:46:58 -0800</pubDate>
		<dc:creator>AmbroseChapel</dc:creator>
	</item><item>
		<title>By: jeresig</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547068</link>	
		<description>Regular expressions are overkill for something like this. I have two generic functions that I like to use.&lt;br&gt;
&lt;hr /&gt;The first generic function is &lt;b&gt;cleanWhitespace&lt;/b&gt;. This function goes through an entire XML document and finds all text nodes that contain nothing but whitespace and removes them.&lt;br&gt;
&lt;tt&gt;function cleanWhitespace( element ) {&lt;br&gt;
  // If no element is provided, do the whole HTML document&lt;br&gt;
  element = element || document;&lt;br&gt;
  // Use the first child as a starting point&lt;br&gt;
  var cur = element.firstChild;&lt;br&gt;
&lt;br&gt;
  // Go until there are no more child nodes&lt;br&gt;
  while ( cur != null ) {&lt;br&gt;
    // Remember the next sibling first: once cur is removed, cur.nextSibling is null&lt;br&gt;
    var next = cur.nextSibling;&lt;br&gt;
&lt;br&gt;
    // If the node is a text node, and it contains nothing but whitespace&lt;br&gt;
    if ( cur.nodeType == 3 &amp;amp;&amp;amp; ! /\S/.test(cur.nodeValue) ) {&lt;br&gt;
      // Remove the text node&lt;br&gt;
      element.removeChild( cur );&lt;br&gt;
&lt;br&gt;
    // Otherwise, if it&apos;s an element&lt;br&gt;
    } else if ( cur.nodeType == 1 ) {&lt;br&gt;
      // Recurse down through the document&lt;br&gt;
      cleanWhitespace( cur );&lt;br&gt;
    }&lt;br&gt;
&lt;br&gt;
    cur = next; // Move through the child nodes&lt;br&gt;
  }&lt;br&gt;
}&lt;/tt&gt;&lt;hr /&gt;The second generic function is &lt;b&gt;text&lt;/b&gt;. This function retrieves the text contents of an element. Calling text(Element) will return a string containing the combined text contents of the element and all child elements that it contains.&lt;br&gt;
&lt;tt&gt;function text(e) {&lt;br&gt;
  var t = &quot;&quot;;&lt;br&gt;
&lt;br&gt;
  // If an element was passed, get its children, &lt;br&gt;
  // otherwise assume it&apos;s an array&lt;br&gt;
  e = e.childNodes || e;&lt;br&gt;
&lt;br&gt;
  // Look through all child nodes&lt;br&gt;
  for ( var j = 0; j &amp;lt; e.length; j++ ) {&lt;br&gt;
    // If it&apos;s not an element, append its text value&lt;br&gt;
    // Otherwise, recurse through all the element&apos;s children &lt;br&gt;
    t += e[j].nodeType != 1 ?&lt;br&gt;
      e[j].nodeValue : text(e[j].childNodes);&lt;br&gt;
  }&lt;br&gt;
&lt;br&gt;
  // Return the matched text&lt;br&gt;
  return t;&lt;br&gt;
}&lt;/tt&gt;&lt;hr /&gt;&lt;br&gt;
So, using both of those functions together, it would look something like this:&lt;br&gt;
&lt;tt&gt;// Remove the extraneous whitespace from the document&lt;br&gt;
cleanWhitespace();&lt;br&gt;
// Get all the &apos;good&apos; text&lt;br&gt;
var myText = text(document);&lt;/tt&gt;&lt;br&gt;
and that&apos;s it! &lt;b&gt;myText&lt;/b&gt; now contains all the text that you need! I hope this helps.&lt;br&gt;
&lt;br&gt;
&lt;small&gt;[plug] All this and more can be found in my upcoming book &lt;b&gt;Professional Javascript Techniques&lt;/b&gt;. [/plug]&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547068</guid>
		<pubDate>Sun, 26 Mar 2006 20:54:50 -0800</pubDate>
		<dc:creator>jeresig</dc:creator>
	</item><item>
		<title>By: jeresig</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547070</link>	
		<description>&lt;small&gt;ungh, it demolished my whitespace &lt;i&gt;and&lt;/i&gt; it seems as if you&apos;ve already solved your problem. Just not my day today.&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547070</guid>
		<pubDate>Sun, 26 Mar 2006 20:57:19 -0800</pubDate>
		<dc:creator>jeresig</dc:creator>
	</item><item>
		<title>By: chrisch</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547117</link>	
		<description>Have a look at the String.stripTags() function found in the &lt;a href=&apos;http://prototype.conio.net/&apos;&gt;prototype&lt;/a&gt; javascript library.  I just did a quick test and it does exactly what you need with a single regular expression.&lt;br&gt;
&lt;br&gt;
A good tutorial and reference on the library can be found on &lt;a href=&apos;http://www.sergiopereira.com/articles/prototype.js.html&apos;&gt;sergiopereira.com&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547117</guid>
		<pubDate>Sun, 26 Mar 2006 22:06:37 -0800</pubDate>
		<dc:creator>chrisch</dc:creator>
	</item><item>
		<title>By: andrew cooke</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547277</link>	
		<description>this is trivial with xsl; google have released an xsl implementation in javascript.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547277</guid>
		<pubDate>Mon, 27 Mar 2006 05:48:49 -0800</pubDate>
		<dc:creator>andrew cooke</dc:creator>
	</item><item>
		<title>By: vanoakenfold</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547451</link>	
		<description>Didn&apos;t want to sound too not-having-a-clue, but viewing the HTML in a browser and cutting/pasting what you see is perhaps far too simple?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547451</guid>
		<pubDate>Mon, 27 Mar 2006 08:48:56 -0800</pubDate>
		<dc:creator>vanoakenfold</dc:creator>
	</item><item>
		<title>By: chrisroberts</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547507</link>	
		<description>if you have the contents in a string in ruby:&lt;br&gt;
&lt;br&gt;
mystring.gsub!(/&amp;lt;[\/]?.*?&amp;gt;/, &apos; &apos;)&lt;br&gt;
&lt;br&gt;
in PHP:&lt;br&gt;
&lt;br&gt;
$mystring = preg_replace(&apos;/&amp;lt;[\/]?.*?&amp;gt;/&apos;, &apos; &apos;, $mystring);&lt;br&gt;
&lt;br&gt;
Then mystring will hold just the content you want. I replace with a space instead of nothing so you don&apos;t end up with words running into each other when stripping out things like a &amp;lt;br&amp;gt; that doesn&apos;t have spaces around it.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547507</guid>
		<pubDate>Mon, 27 Mar 2006 09:38:21 -0800</pubDate>
		<dc:creator>chrisroberts</dc:creator>
	</item><item>
		<title>By: moift</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547558</link>	
		<description>Screw the dom, real javascript ninjas use regex (you can paste this in the urlbar to see if it does basically what you want):&lt;br&gt;
&lt;br&gt;
&lt;code&gt;&lt;br&gt;
javascript:for(var i=0; !document.childNodes[i].innerHTML; ++i); document.body.innerHTML=document.childNodes[i].innerHTML.replace(/&lt; .*?&gt;/gm, &apos;);&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
The loop is to find the first non-empty node, which should contain the whole document if it&apos;s well formed.  You&apos;ll need to beef up the regex a bit if you want it to remove CSS/Script blocks as well, but if you haven&apos;t moved on already I&apos;d be happy to help with that.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547558</guid>
		<pubDate>Mon, 27 Mar 2006 10:10:11 -0800</pubDate>
		<dc:creator>moift</dc:creator>
	</item><item>
		<title>By: moift</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547561</link>	
		<description>There&apos;s not supposed to be a space between the &amp;lt; and the . in the regex, but I couldn&apos;t get it to show up right.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547561</guid>
		<pubDate>Mon, 27 Mar 2006 10:11:22 -0800</pubDate>
		<dc:creator>moift</dc:creator>
	</item><item>
		<title>By: moift</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#547576</link>	
		<description>Also, the lone quote mark at the end should be a double (empty string)&lt;br&gt;
&lt;small&gt;drat&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-547576</guid>
		<pubDate>Mon, 27 Mar 2006 10:22:30 -0800</pubDate>
		<dc:creator>moift</dc:creator>
	</item><item>
		<title>By: mrbill</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#548135</link>	
		<description>There might be a much easier way to do this - the &lt;a href=&quot;http://links.sourceforge.net/&quot;&gt;links&lt;/a&gt; or &lt;a href=&quot;http://lynx.browser.org&quot;&gt;lynx&lt;/a&gt; text-mode web browsers and their &quot;dump&quot; option:&lt;br&gt;
&lt;tt&gt;&lt;pre&gt;&lt;br&gt;
mrbill@ohno:~&amp;gt; links -dump test.html&lt;br&gt;
   Some text is here&lt;br&gt;
&lt;br&gt;
mrbill@ohno:~&amp;gt; lynx -dump test.html&lt;br&gt;
&lt;br&gt;
   [sample.jpg]&lt;br&gt;
&lt;br&gt;
   Some text is [1]here&lt;br&gt;
&lt;br&gt;
References&lt;br&gt;
&lt;br&gt;
   1. file://localhost/disk/home/mrbill/fjkj.html&lt;br&gt;
&lt;/pre&gt;&lt;/tt&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-548135</guid>
		<pubDate>Mon, 27 Mar 2006 23:56:29 -0800</pubDate>
		<dc:creator>mrbill</dc:creator>
	</item><item>
		<title>By: jxpx777</title>
		<link>http://ask.metafilter.com/35120/Regex-Text-from-HTML-no-attributes#548875</link>	
		<description>Perhaps I&apos;m missing something? Are you after the text for some further action in your js code? If not and you just want a plain text version of it, use textutil if you are on OS X.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.35120-548875</guid>
		<pubDate>Tue, 28 Mar 2006 14:40:50 -0800</pubDate>
		<dc:creator>jxpx777</dc:creator>
	</item>
	</channel>
</rss>
