Automate web site usage?
September 1, 2010 4:12 PM

How do I automate interaction with a web site via a computer program?

I'd like to learn how to automate user interaction with a web site. An easy example might be logging into multiple free email accounts (Yahoo, Gmail) and retrieving all new mail. Another app I thought I'd like to investigate is writing my own personal auction sniper for eBay auctions.

I'm most experienced in Win32 using MS Visual Studio and I've programmed almost exclusively in C/C++, but I welcome all programming language suggestions. I've also done only standalone app dev; web and client/server programming is all new to me.

For some reason I have a feeling that Python under Linux might be the recommendation here, which is also fine, because I'm planning on looking heavily into both of those topics this winter. If the answer is C#, that's good too, because I want to learn that as well. Really, though, an explanation of whichever language/approach you feel is the best way to tackle this, and why, would help.

Thanks!!

(Sorry if this is a repeat. I searched prior submissions, but I don't know the lingo very well.)
posted by InsertNiftyNameHere to Computers & Internet (17 answers total) 6 users marked this as a favorite
 
Selenium is pretty much the industry standard for this sort of thing, since it lets you drive an actual browser. If you don't need that, you could look at twill or WWW::Mechanize.
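For a rough idea of what driving a browser looks like, here's a minimal sketch using Selenium's Python bindings; the URL, field names, and credentials are placeholders, and it assumes you've installed the bindings plus a browser driver such as geckodriver:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Placeholder login page and form field names; a real site will differ.
    driver = webdriver.Firefox()
    try:
        driver.get("https://www.example.com/login")
        driver.find_element(By.NAME, "username").send_keys("me@example.com")
        driver.find_element(By.NAME, "password").send_keys("not-my-real-password")
        driver.find_element(By.NAME, "login").click()
        print(driver.title)  # quick sanity check that the post-login page loaded
    finally:
        driver.quit()        # always close the browser, even if something fails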
posted by asterix at 4:15 PM on September 1, 2010 [5 favorites]


You can do it in any language; it really doesn't matter much.

There are typically a few approaches:

a) the service supports some kind of API; REST and SOAP are two common examples. There's usually a library for your programming language that wraps these APIs, and using one is much like using any other library: it gives you functions you can call to interact with the site.

b) the service doesn't support an API. You fetch the contents of the web page and "scrape" the data you want out of it. This is usually kind of a PITA but doable; I've done it a million times. Your language *probably* supports HTTP fairly directly via some kind of library. If it doesn't, you can open a socket to the web server (usually on port 80) and send it a request, which is just a bit of text like
GET /index.html HTTP/1.1
Host: example.com
and it sends you back some headers followed by the HTML source of the page you requested (there's a sketch of this below).

If you don't want to bother with opening a socket and messing with headers, you can shell out to a command-line program like wget or curl, which are designed to download web pages for you. You can have them save the page to a file, then open that file in your programming language and parse it from there.
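To make that request/response exchange concrete, here's a bare-bones sketch in Python that speaks HTTP over a raw socket; example.com is a placeholder, and in practice you'd let an HTTP library (or wget/curl, as above) handle this for you:

    import socket

    HOST = "example.com"

    # The request is just text: a request line, a Host header (required by
    # HTTP/1.1), and a blank line to end the headers.
    request = (
        "GET /index.html HTTP/1.1\r\n"
        f"Host: {HOST}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

    with socket.create_connection((HOST, 80)) as sock:
        sock.sendall(request.encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):   # read until the server closes the connection
            response += chunk

    # The response is headers, a blank line, then the HTML body.
    headers, _, body = response.partition(b"\r\n\r\n")
    print(headers.decode("ascii", errors="replace"))
    print(f"Body is {len(body)} bytes")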
posted by RustyBrooks at 4:18 PM on September 1, 2010


"the service supports some kind of API"

I was just going to say... this is the way to go. You can cobble something together to do screen-scraping, but you will come to hate life quickly, particularly when the site you're scraping changes something small and all your code breaks.
posted by asterix at 4:25 PM on September 1, 2010


Use the APIs. Don't scrape. Trust me on this one.
posted by jeffamaphone at 4:28 PM on September 1, 2010


FWIW, the standard way to interact with mail services programmatically is actually the POP3 or IMAP protocol; most services support these directly. You'll have a much easier time solving that specific task using libraries that speak these protocols rather than trying to hork together a screen scraper. POP3 and IMAP libraries exist for all major programming languages these days.
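As a sketch of that route, here's roughly what fetching unread mail over IMAP looks like with Python's standard-library imaplib; the host, account, and password are placeholders, and many providers (Gmail included) make you enable IMAP and use an app-specific password first:

    import imaplib

    IMAP_HOST = "imap.example.com"     # placeholder: your provider's IMAP server
    USER = "you@example.com"
    PASSWORD = "app-specific-password"

    # Port 993 (IMAP over SSL) is the default for IMAP4_SSL.
    with imaplib.IMAP4_SSL(IMAP_HOST) as conn:
        conn.login(USER, PASSWORD)
        conn.select("INBOX")
        status, data = conn.search(None, "UNSEEN")          # unread messages
        for num in data[0].split():
            status, msg_data = conn.fetch(num, "(RFC822)")   # full raw message
            raw = msg_data[0][1]
            print(f"Fetched message {num.decode()}: {len(raw)} bytes")

poplib works in much the same way if a provider only offers POP3.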
posted by jenkinsEar at 4:43 PM on September 1, 2010


"ChickenFoot is a Firefox extension that puts a programming environment in the browser's sidebar so you can write scripts to manipulate web pages and automate web browsing."

"CoScripter is a system for recording, automating, and sharing processes performed in a web browser such as printing photos online, requesting a vacation hold for postal mail, or checking flight arrival times."
posted by jasonhong at 4:49 PM on September 1, 2010


I personally use curl, but it's for an extremely lightweight task that involves scraping a site with no API. I don't use Firefox so I can't comment on the extensions that have been mentioned; you could have some luck with AutoHotkey, depending on your tasks. The choices are endless, but curl will probably be the fastest.
posted by 3mendo at 5:06 PM on September 1, 2010


Basically what rustybrooks and asterix said. You have your choice of:

A) Deal with an API (clean)
B) Screen scrape (bad, prone to breakage)
C) Drive a whole browser (or rendering library) directly (e.g. IE with C#, see here), or use the Selenium API with your favorite language.

(A) is best to get data into a database or do a mashup, (C) is best for website testing or getting around tedious security measures, and (B) is a pain in the ass which you should only do if there is no other alternative. If you're just looking to get your feet wet with some general purpose web scripts I'd suggest (A) in your favorite language.
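As a taste of option (A), here's a small sketch that calls a purely hypothetical JSON REST endpoint using only the Python standard library; real services document their own URLs, authentication, and rate limits, so treat every name below as made up:

    import json
    import urllib.request

    # Hypothetical endpoint and fields; substitute whatever the real API documents.
    URL = "https://api.example.com/v1/auctions/12345"

    req = urllib.request.Request(URL, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)      # most REST APIs return JSON

    print(data.get("current_bid"), data.get("ends_at"))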
posted by benzenedream at 5:33 PM on September 1, 2010


Nth'ing Selenium. The IDE lets you record your scripts and play them back, and there are Python bindings that run on Linux if you want them.

The real challenge, though, is keeping your scripts up to date with the HTML and web app structure. Websites that want you to actually use their data offer APIs; websites that want to hide it offer HTML that fails to parse in new ways weekly. With email specifically, the IMAP and POP3 protocols are sometimes available; check with your webmail hosts for details.

What Selenium offers is access to a complete browser environment. You get a very lenient HTML parser that most websites test against, JavaScript automation, and a cookie store. That makes it pretty good for sites that don't make an effort to be HTML-compliant, for testing your own Ajax website, and for credentialed sites.
posted by pwnguin at 6:02 PM on September 1, 2010


Let me just add that my last job was at a company that, among other things, had a product that would take an email you crafted, and send it to addresses at hundreds of web-mail companies. It would use these kinds of techniques to log into these webmail clients and take a snapshot of the email. This way you could see how your email would look under all these webmail clients.

It took 2-3 people working full time to keep these all up to date and working.

Before amazon had an API, I used to scrape it. It was a PITA.
posted by RustyBrooks at 8:49 PM on September 1, 2010


Sikuli.

With Sikuli you can script anything. Really, anything.
posted by Brent Parker at 9:55 PM on September 1, 2010


Depending on what you want to do, look into iMacros for Firefox. You can record macros, then edit them using a kind of pseudo-BASIC procedural scripting language.

I actually found that a particularly powerful combination for screen scraping (ebay.com, for the record) was using iMacros to batch-download individual pages, then $yourFavouriteScriptingLanguage (I used Groovy with TagSoup) to parse the pages themselves.
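In the same spirit, but in Python rather than Groovy, here's a rough sketch of pulling links out of a page you've already downloaded, using only the standard library's html.parser; the file name and the decision to extract links are placeholders for whatever your scrape actually needs, and for really messy HTML a lenient parser like BeautifulSoup is usually the better choice:

    from html.parser import HTMLParser

    # Collects (href, link text) pairs from an HTML document.
    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.links.append((self._href, "".join(self._text).strip()))
                self._href = None

    parser = LinkExtractor()
    with open("listing.html", encoding="utf-8", errors="replace") as f:  # placeholder file
        parser.feed(f.read())

    for href, text in parser.links:
        print(href, text)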
posted by primer_dimer at 2:18 AM on September 2, 2010


The above is all good advice; plenty of options.

So here are a few more to make the decision a bit harder.

AutoIt for script automation.

Another coding option is the PHP CLI, which runs on both Windows and *nix and has IMAP and POP3 libraries for handling mail, as well as cURL for HTTP.

As a C/C++ coder, you can stick with that and use MFC or .NET, embedding a WebBrowser component. This can be made to do just about anything, but it's not particularly fast. For speed, look at HttpWebRequest or WebClient for lower-level work. The only thing missing is a really good free IMAP library equal to what you have with, say, PHP or Perl.

@Brent Parker - thanks for the sikuli link, that looks interesting.
posted by w.fugawe at 12:55 PM on September 2, 2010


Response by poster: Wow! All these answers are GREAT! There's obviously a lot to chew on here, but I just wanted to say a huge thanks to all the people who replied.

Anyone else who may have something to contribute, please do so.

Thanks again!
posted by InsertNiftyNameHere at 1:44 PM on September 2, 2010


Response by poster: Oh! One quick question I have right off the bat. What's the easiest way to determine if a web site has an API? I know I can do a web search on "sitename API" but I thought I might be missing something more obvious. Thanks again for all the help.
posted by InsertNiftyNameHere at 1:46 PM on September 2, 2010


It differs from site to site. Usually there will be an explicit link for developers or APIs.

Google has a boatload for its various services, most of which provide much less data than you could get by being naughty, which is why they have IP quotas.

Many sites do not want you to scrape data; check their T&Cs. You can be a good citizen and abide by robots.txt. However, most useful sites are being hammered 24x7 by bots regardless of any T&Cs.
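For the good-citizen route, checking robots.txt takes only a few lines with Python's standard-library urllib.robotparser; the site URL and user-agent string below are placeholders:

    from urllib import robotparser

    # Placeholder site and bot name; use your bot's real name so admins can identify it.
    rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    page = "https://www.example.com/some/listing"
    if rp.can_fetch("MyLittleBot/0.1", page):
        print("robots.txt allows fetching", page)
    else:
        print("robots.txt asks bots to keep away from", page)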
posted by w.fugawe at 1:56 PM on September 2, 2010


In order to interact with the browser, what you want to do is access objects in the browser DOM. Selenium is the go-to free tool for doing this. At work I use a commercial tool, SilkTest, that does essentially the same thing but comes with tons of bells and whistles.

If this is more than a one-off project you'll likely come to hate your dependence on the various sites, since you will be completely at their whim as far as their changes breaking your code (as noted above). If there's any way you can avoid that I'd highly encourage you to do so.

If all you're interested in is the email, then you'd be better off using a command-line email client to pull down the mail from the various sites. You could also use Outlook, Outlook Web Access, or another email client to pull the mail in, then automate that client (which is what we do at my job when dealing with email).
posted by Four Flavors at 8:03 PM on September 2, 2010


This thread is closed to new comments.