How do you go about automating/simulating (I think) an HTTP request?
March 3, 2013 7:39 PM

I want to write a script to automate doing a search, retrieving, and parsing the search results from a website (a booking site similar to the search on www.hilton.com ).

My (extremely) rough understanding is that I should write a script to mimic the request the form is sending, and that I can use something like Firebug or Fiddler to capture what my browser is sending.
I am way out of my depth here but am pretty committed to doing this as an educational project, so I'd appreciate any direction on getting started and figuring out how all of this is done. I'm familiar with basic Python scripting and have used urllib and BeautifulSoup to do basic web scraping, but I don't really understand how all these pieces fit together. Pointers to good resources would help tremendously, as I've found information on StackOverflow but am having trouble deciphering it. I'm a little more comfortable with what I need to do to parse the HTML once I get it back.

Also, am I even right in assuming that what I want to do is send an HTTP GET request? Or is this search done with JavaScript? (It seems like if that's the case, this becomes more difficult.) How do I tell what's really going on? I've messed around with Fiddler but am having trouble. Please bear in mind that I barely understand the words that I'm using, but I'd really like to learn. Thanks!
posted by hot soup to Computers & Internet (5 answers total) 5 users marked this as a favorite
 
Scrapy should hide some of these issues from you and should also get you through the next steps really well.

To answer some of your other questions though, what I'd suggest is playing around with some Web Inspector tutorials (or Firebug tutorials or Tamper Data tutorials) until you understand the mechanics of the HTTP request.
posted by Monsieur Caution at 7:47 PM on March 3, 2013


Er, here's the Web Inspector tutorial I had in mind--you may find this easier to work with than Fiddler.
posted by Monsieur Caution at 7:51 PM on March 3, 2013


HTTP Requests are computers sending plain text instructions back and forth. When you go to a website, your computer sends the other computer text like this:


GET /index.html HTTP/1.1
Host: www.example.com


The other computer then knows to reply with index.html.
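To make the "plain text" point concrete, here is a minimal sketch that assembles the same request as a string. The helper name build_get_request is made up for illustration; real libraries do this for you, and a real browser sends many more header lines.

```python
def build_get_request(host, path="/"):
    """Return the raw text of a bare-bones HTTP/1.1 GET request.

    build_get_request is a hypothetical helper, not part of any library.
    Each line ends with CRLF, and a blank line marks the end of the headers.
    """
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the one header HTTP/1.1 requires
        "",                      # blank line that terminates the header section
        "",
    ]
    return "\r\n".join(lines)


raw = build_get_request("www.example.com", "/index.html")
print(raw)
```

Printing the result shows the same two lines as the example above, which is all a server needs to work out which page you are asking for.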

GET requests carry everything in the URL, and add parameters by putting stuff like ?user=me&color=blue at the end of a URL. POST requests send a separate request body, which can be a multi-line document, so they can carry longer data. Additionally, by convention, POST requests can change things (by creating or deleting a blog post, for example) and GET requests shouldn't.
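As a rough sketch of that difference, the snippet below builds one GET and one POST carrying the same two parameters, using only the urllib modules that ship with Python. The /search path on www.example.com is an assumption for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The same parameters as in the ?user=me&color=blue example.
params = {"user": "me", "color": "blue"}
query = urlencode(params)

# GET: the parameters ride along in the URL itself.
# (The /search path is a made-up placeholder.)
get_req = Request("http://www.example.com/search?" + query)

# POST: the parameters go in the request body instead.
# Supplying data is what makes urllib treat the request as a POST.
post_req = Request("http://www.example.com/search", data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Passing either object to urllib.request.urlopen() would send it. Which of the two a given search form expects is something you read off the browser tools rather than guess.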

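The de facto standard HTTP library for Python is called requests. It is a third-party package rather than part of the standard library, so it has to be installed first (for example with pip install requests). The section on parameters in URLs may help you.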

Basically, it lets you make requests like this:


import requests

r = requests.get("http://google.com")
print(r.text)  # the HTML content of the page, as a string


Searches can be done with JavaScript in a variety of ways, but if the script talks to the server it's still using HTTP GET or POST, and those requests show up in the same browser tools. If it doesn't talk to the server, it's just filtering or hiding (with CSS, for example) data that's already on the page.
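When a search is driven by JavaScript, the response is often JSON rather than HTML, though that varies by site. Here is a hedged sketch of replaying such a request and reading the reply. The endpoint URL, the Accept header, and the shape of the sample JSON are all invented for illustration; in practice you would copy the real ones from the browser's network panel.

```python
import json
import urllib.request

# Hypothetical endpoint; a real one would be copied from the browser tools.
SEARCH_URL = "http://www.example.com/api/search?city=Boston"


def build_search_request(url):
    """Build (but do not send) a GET request that asks for a JSON reply."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def parse_results(body):
    """Pull hotel names out of a JSON body shaped like the made-up sample below."""
    data = json.loads(body)
    return [item["name"] for item in data["results"]]


# Made-up response body, standing in for what a server might return.
sample_body = '{"results": [{"name": "Hotel A", "price": 120}, {"name": "Hotel B", "price": 95}]}'

req = build_search_request(SEARCH_URL)
print(req.get_method(), req.full_url)
print(parse_results(sample_body))
```

If the reply turns out to be HTML instead, the same request-building step applies and you would hand the body to BeautifulSoup rather than json.loads.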

Scrapy looks nice, but I would personally recommend the ScraperWiki tutorial. They will run your code for you and can store your results. It is more focused on the data extraction part, though.
posted by 23 at 8:25 PM on March 3, 2013


Hotel booking sites are not going to take kindly to your data scraping. Of all the sites on the web, they are some of the most likely to use technical measures to make this quite difficult.
posted by ryanrs at 8:50 PM on March 3, 2013


If you want to see exactly how your browser and the hotel site are talking to each other, there is no substitute for Wireshark. It has a Follow TCP Stream function that's just perfect for looking at web conversations, assuming those are not taking place over HTTPS. If they are, the traffic is encrypted and reverse engineering gets rather harder.
posted by flabdablet at 10:32 PM on March 3, 2013

