I Need to Train a Snake to Fetch...
March 11, 2010 8:04 PM

I need to get the title of a webpage via a URL I submit to a python script on a server.

I am writing an app for Google App Engine that will receive a URL as its input and post the URL, with its title, to a blog. (E.g., if I submitted "http://ask.metafilter.com/", it would set the blog post title to "Ask MetaFilter | Community Weblog" with a link to AskMeFi.) Everything is working fine except I don't know how to get the title of the URL I pass to the script.

I am going to be using this on Google App Engine, so, I think, I am limited on the addons I can use. Also, I am a beginner python scripter, so please write clearly. :)

Thanks Green!
posted by 47triple2 to Computers & Internet (16 answers total) 1 user marked this as a favorite
 
If you have access to BeautifulSoup, that will make this pretty easy.

import urllib2, BeautifulSoup
webpage = urllib2.urlopen(input_url)
soup = BeautifulSoup.BeautifulSoup()
soup.feed(webpage.read(-1))
webpage.close()
page_title = ''.join(soup.title.contents)


Obviously this has no error handling and it's a little blunt but hopefully it's enough to get you well on your way.
posted by brett at 8:13 PM on March 11, 2010


Best answer: Sorry, I meant to include the link for BeautifulSoup.
posted by brett at 8:14 PM on March 11, 2010


If you don't have access to BeautifulSoup you can use a regular expression to parse the title:


>>> import urllib2
>>> import re
>>> page = urllib2.urlopen('http://www.metafilter.com')
>>> page_contents = page.read()
>>> regex_result = re.search(r'(.+)', page_contents)
>>> page_title = regex_result.group(1)
>>> page_title
'MetaFilter | Community Weblog'


Refer to the urllib2 documentation and the regular expression documentation on the Python website.
posted by albatross84 at 8:29 PM on March 11, 2010


Err, meta filter stripped the html from re.search(..). This should do it:


posted by albatross84 at 8:33 PM on March 11, 2010


Meta filter also stripped the html from the last message, not surprisingly, perhaps. You'll have to use this link then:
http://pastebin.com/pj1j5uqj
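
For anyone reading without the pastebin handy: the pattern the site ate was presumably a pair of title tags around the capture group. That is a guess, not a copy of the paste. A minimal sketch under that assumption, written for Python 3 and working on a string you have already downloaded (title_from_html and TITLE_RE are made-up names):

```python
import re

# Assumed pattern: the original was stripped from the thread, so this is a
# reconstruction. Non-greedy, case-insensitive, and allowed to span newlines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def title_from_html(html_text):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    # Collapse newlines and runs of whitespace inside the title.
    return ' '.join(match.group(1).split())

page = '<html><head><title>MetaFilter | Community Weblog</title></head></html>'
print(title_from_html(page))  # MetaFilter | Community Weblog
```

Checking for None matters here: calling .group(1) on a failed search raises an AttributeError rather than returning an empty title.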
posted by albatross84 at 8:34 PM on March 11, 2010


http://pastebin.com/pj1j5uqj :)
posted by albatross84 at 8:35 PM on March 11, 2010 [1 favorite]


Trying to parse SGML-family markup languages with regexes is nuts. Order the soup.
posted by flabdablet at 9:11 PM on March 11, 2010


Response by poster: Thanks for the suggestions, but none of them work. :( BeautifulSoup seemed promising, but I keep getting an error:

<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 24965: ordinal not in range(128)

when it tries to "soup.feed(webpage.read(-1))".

And the Reg-Ex method doesn't show anything.

Any other ideas? (Oh, and, BTW, Google App Engine is running Python 2.5.)
posted by 47triple2 at 10:10 PM on March 11, 2010


Even if you don't have BeautifulSoup available, you probably have HTMLParser (which has been part of the standard library since Python 2.2) or htmllib (which is part of the standard library in Python 2.x). You can certainly use a regexp, but using a purpose-designed parser like these will be easier and more reliable.
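
A minimal sketch of the HTMLParser route, assuming Python 3, where the module is spelled html.parser (in Python 2 it is the HTMLParser module). TitleParser and extract_title are made-up names, and the input is a string you have already fetched and decoded:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(html_text):
    """Return the page title with whitespace collapsed, or None if missing."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = ' '.join(''.join(parser.parts).split())
    return title or None
```

Usage would be something like extract_title(page_text), which returns None when the page has no title instead of raising.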

FWIW, if the URLs you're getting are from the outside world, be sure to think about the security implications (can I tell your script to "retrieve" a mailto: or file: URL? Can I give it an http URL with a query-string that does some operation on a website somewhere, like send spam? Can I point your script at a page I've written whose <title> tag contains javascript code which, when it's embedded in your blog post, does something nasty? Etc.)
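
A rough sketch of two of those checks, assuming Python 3 and two made-up helper names (is_fetchable and safe_title). It only covers the URL scheme and the script-in-title cases; it does nothing about query strings that trigger actions on other sites.

```python
from html import escape
from urllib.parse import urlparse

ALLOWED_SCHEMES = ('http', 'https')  # assumption: only ordinary web pages are wanted

def is_fetchable(url):
    """Reject mailto:, file:, and anything else that is not plain http(s)."""
    parts = urlparse(url)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)

def safe_title(title):
    """Escape the title so markup inside it is displayed as text, not run."""
    return escape(title, quote=True)

print(is_fetchable('http://ask.metafilter.com/'))  # True
print(is_fetchable('file:///etc/passwd'))          # False
print(safe_title('<script>alert(1)</script>'))     # &lt;script&gt;alert(1)&lt;/script&gt;
```

An allowlist of schemes is used rather than a blocklist, so a scheme nobody thought of gets rejected by default.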
posted by hattifattener at 10:10 PM on March 11, 2010


Response by poster: I found this script which seems to do what I want (and more), but I keep getting errors of the "self.error("malformed start tag")" kind. It happens on the "parser.feed(data)" line.
posted by 47triple2 at 10:58 PM on March 11, 2010


Best answer: Something like this should work:

import urllib2, BeautifulSoup
p = urllib2.urlopen(url).read()
s = BeautifulSoup.BeautifulSoup(p)
t = s.find("title")
if t: title = "".join(t.contents)
posted by rainy at 11:47 PM on March 11, 2010


Actually, wait you'll get that damned ascii error.. ok, this is a bit ugly but:
import re
p = urllib2.urlopen(url).read()
s = BeautifulSoup.BeautifulSoup(p)
t = s.find("title")
if t:
    t = unicode(t)
    t = re.sub(r'<[^>]*?>', '', t)  # strip the <title> tags, keep the text

posted by rainy at 11:51 PM on March 11, 2010


Looks to me like you're missing a UTF-8 to Unicode decoding step.
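
A minimal sketch of what that decoding step looks like, assuming Python 3 syntax and a made-up sample (in Python 2.5 the same idea is some_str.decode('utf-8'), which gives you a unicode object). It also assumes the page really is UTF-8, which the thread never confirms:

```python
def to_text(raw, encoding='utf-8'):
    """Turn bytes from the network into text by decoding them explicitly."""
    return raw.decode(encoding)

# Made-up sample: 0xe2 is the first byte of a UTF-8 curly apostrophe.
sample = b'It\xe2\x80\x99s a title'

try:
    sample.decode('ascii')            # roughly what the implicit conversion attempts
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)

print(to_text(sample))
```

The byte 0xe2 is outside the ASCII range, so the ascii attempt fails the same way the error message above does, while the explicit UTF-8 decode succeeds.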
posted by flabdablet at 2:15 AM on March 12, 2010


Ugh, encoding and decoding issues...

I am especially surprised that you're getting this when you feed the page into the soup. It might be worth double-checking that. If your error comes later -- like when you try to output the title to your file or whatever -- there's a reasonably easy solution. But if you're really getting it when you feed the page to BeautifulSoup, you're going to need more finesse.

Unfortunately, I can't detail exactly what you should do, because I don't completely understand the problem you're facing, and to be honest giving this issue the in-depth explanation it deserves would be a gross abuse of AskMe.

The basic, fundamental problem is that the data you're reading (the web page) is encoded one way -- meaning there's one particular way that the computer translates letters and other characters to the numbers that computers deal so well with. But when you feed it into other parts of your program, your program wants the data to be encoded a different way, using a different translation of letters to computer-numbers. And it's trying to convert things nicely for you automatically, but sometimes that's just not possible, and when that happens you get that error. It's like you're asking the computer to translate schadenfreude into a single English word, and it's giving up.

The way to really fix this problem is to make sure that all the data going into and out of your program first gets converted into a single encoding -- probably UTF-8. But doing this properly requires great discipline: it's something you have to do throughout your whole program. There's no easy one-line fix for it. And in order to do it well, you'll need to have a pretty good understanding of how encodings work, and how you deal with them.
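
To make the "convert at the edges" idea a bit more concrete, here is a rough sketch, assuming Python 3 and three made-up helper names. It decodes once on the way in (using the charset from the Content-Type header if there is one), works with text in the middle, and encodes once on the way out. Falling back to UTF-8 and replacing undecodable bytes are my assumptions, not the only reasonable choices:

```python
import re

def charset_from_content_type(header, default='utf-8'):
    """Pull the charset out of a Content-Type header value, if one is given."""
    match = re.search(r'charset=["\']?([\w.:-]+)', header or '', re.IGNORECASE)
    return match.group(1) if match else default

def decode_page(raw, content_type=None):
    """Input boundary: bytes in, text out. Undecodable bytes become U+FFFD."""
    encoding = charset_from_content_type(content_type)
    try:
        return raw.decode(encoding, 'replace')
    except LookupError:
        # The server named an encoding Python does not know; fall back.
        return raw.decode('utf-8', 'replace')

def encode_for_output(text):
    """Output boundary: text in, UTF-8 bytes out."""
    return text.encode('utf-8')

page = decode_page(b'caf\xe9', 'text/html; charset=ISO-8859-1')
print(encode_for_output(page))  # b'caf\xc3\xa9'
```

Everything between decode_page and encode_for_output then only ever sees text, which is the discipline described above.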

Unfortunately, this is a mammoth topic. The single resource that helped me with this task the most is Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" If you understand it completely, then you'll have an excellent starting point to figuring out what you need your script to do specifically.

I know this is a lot to throw at someone who's just starting out. I wish I could figure out a better way to approach the issue, but I honestly don't know where else to start. Figuring this out will take time and effort, but if you do it well, I promise it will pay massive dividends over the course of all your programming life. This is the kind of thing that can be impressive at job interviews.

If you get to the point of understanding the fundamentals, some bits of Python that will help you out are the documentation for string and Unicode objects (they're different), their respective encode and decode methods, and the built-in codecs library, especially its EncodedFile function.

Good luck!
posted by brett at 6:42 AM on March 12, 2010 [1 favorite]


Best answer: I got it! :)

My method is really messy, but because I am the only user, I'll live with it. You can read it on PasteBin.
posted by 47triple2 at 10:49 AM on March 12, 2010


Response by poster: Oh, and thanks to rainy for the "s = BeautifulSoup.BeautifulSoup(p)" line and brett for hooking me up with BeautifulSoup.
posted by 47triple2 at 10:52 AM on March 12, 2010

