Using Java to do a repetitive flight search (with querying and scraping)
April 2, 2011 4:21 PM

I'm trying to use Google App Engine to build a Java program that, every six hours, visits a flight-search page, enters two airport codes (e.g. JFK and LAX) and two dates (e.g. 7/22/2011 and 7/29/2011) into the text fields on that page, and parses the HTML of the results page to find and store the lowest price. I need some help getting started.

What's giving me the most trouble isn't the parsing or the scraping, but the text entry and results-page retrieval. Is there a simple tutorial on how to enter something (say, the word "hello") into a text field (say, the search bar on a site's homepage)?

Is there a simple tutorial on how, after you have sent text inputs to search website X, to download the HTML from results website Y?

posted by JamesJD to Computers & Internet (7 answers total) 10 users marked this as a favorite
It's simpler than you think. You don't need to send text inputs to search website X in order to get HTML from results website Y. All you need to do is send the same HTTP GET or POST request to server Y as would result from a user filling in the text inputs on website X with a browser.

If you install Wireshark and have a look at the HTTP requests that get sent when you do various things by hand with your browser (hint: Follow TCP Stream), you will soon work out the pattern of what the sites you're dealing with expect to see.
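
To make that concrete in Java: the body of a form POST is just the field names and values joined as name=value pairs. Here is a minimal sketch of building one. The class name FormEncoder and the field names (origin, destination, departDate, returnDate) are made up for illustration; the real names are whatever you see in the captured request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the request body a browser would send for a submitted HTML form. */
final class FormEncoder {

    /** Encodes name/value pairs as application/x-www-form-urlencoded text. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encodePart(field.getKey()))
                .append('=')
                .append(encodePart(field.getValue()));
        }
        return body.toString();
    }

    private static String encodePart(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical field names; copy the real ones from the captured request.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("origin", "JFK");
        fields.put("destination", "LAX");
        fields.put("departDate", "7/22/2011");
        fields.put("returnDate", "7/29/2011");
        System.out.println(encode(fields));
    }
}
```

A LinkedHashMap keeps the fields in insertion order, which makes the output easy to compare against what Wireshark shows.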
posted by flabdablet at 4:59 PM on April 2, 2011

Response by poster: I really don't think it's that simple, because the resulting URL is not permanent: it stops working after 10 minutes. I want the search to run once every six hours, so it can't reuse the same URL. Otherwise, this problem would be extremely easy.
posted by JamesJD at 5:32 PM on April 2, 2011

Here's an example using curl. As you can see it's just a matter of setting each form field to the right value. In Firefox, select "View page info" and then click on the Forms tab, or use "View source" and find them manually, or use Wireshark to record the traffic.

curl -s -L -o out.html -d fareType=DOLLARS -d twoWayTrip=true -d originAirport=LGA -d destinationAirport=LAX -d returnAirport=RoundTrip -d outboundDateString=7%2F22%2F2011 -d outboundTimeOfDay=ANYTIME -d returnDateString=07%2F29%2F2011 -d returnTimeOfDay=ANYTIME -d adultPassengerCount=1 -d seniorPassengerCount=0 -d submitButton=Search

That will result in an out.html file that you can then parse however you like, e.g. (this is just illustrative, I don't recommend using a regexp this sloppy):

$ perl -l -0777 -ne 'print $1 while /div class="product_info".*?title="([^"]+)"/sg' out.html
Departing flight 605/649 $539 6:00AM depart 11:45AM arrive 2 stops Change Planes in MDW
Departing flight 605/649 $519 6:00AM depart 11:45AM arrive 2 stops Change Planes in MDW
Departing flight 605/649 $300 6:00AM depart 11:45AM arrive 2 stops Change Planes in MDW
[... etc ...]
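
Since the question was about Java, here is a rough sketch of the "find the lowest price" step applied to text like the lines above. It assumes prices appear as a dollar sign followed by digits (commas allowed), which matches this sample output but may not hold for every page. The class name LowestFare is made up for illustration.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the smallest whole-dollar amount mentioned in a block of text. */
final class LowestFare {

    // Assumption: a price looks like "$539" or "$1,039"; any cents are ignored.
    private static final Pattern PRICE = Pattern.compile("\\$(\\d[\\d,]*)");

    /** Returns the lowest price found, or an empty value if there is none. */
    static OptionalInt lowest(String text) {
        Matcher matcher = PRICE.matcher(text);
        OptionalInt lowest = OptionalInt.empty();
        while (matcher.find()) {
            int price = Integer.parseInt(matcher.group(1).replace(",", ""));
            if (!lowest.isPresent() || price < lowest.getAsInt()) {
                lowest = OptionalInt.of(price);
            }
        }
        return lowest;
    }

    public static void main(String[] args) {
        String sample =
              "Departing flight 605/649 $539 6:00AM depart 11:45AM arrive\n"
            + "Departing flight 605/649 $519 6:00AM depart 11:45AM arrive\n"
            + "Departing flight 605/649 $300 6:00AM depart 11:45AM arrive\n";
        System.out.println(lowest(sample));
    }
}
```

The same caveat applies as with the Perl one-liner: this is only as reliable as the page's markup is stable, so an empty result is worth logging rather than ignoring.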

posted by Rhomboid at 5:34 PM on April 2, 2011 [1 favorite]

If the form action URL changes, just parse the form for that URL each time. You definitely should be POSTing to that URL, in which case, the resulting URL doesn't matter.
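
A rough Java sketch of that step, assuming the search page contains a single form whose action attribute is quoted. The class name FormAction and the example URLs are made up for illustration, and a real HTML parser would be sturdier than this regex.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pulls the submit URL out of a freshly downloaded search page. */
final class FormAction {

    // Captures the quoted action attribute of the first <form> tag on the page.
    private static final Pattern ACTION = Pattern.compile(
        "<form\\b[^>]*?\\baction\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw action value, with &amp; entities decoded, if a form is found. */
    static Optional<String> find(String html) {
        Matcher matcher = ACTION.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(2).replace("&amp;", "&"));
    }

    /** Resolves a possibly relative action against the URL of the page it came from. */
    static String resolve(String pageUrl, String action) {
        return URI.create(pageUrl).resolve(action).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page content and URL, for illustration only.
        String html = "<form method=\"post\" action=\"/results?session=abc123&amp;step=2\">"
                    + "<input name=\"origin\"></form>";
        String action = find(html).orElseThrow(IllegalStateException::new);
        System.out.println(resolve("https://example.com/flights/index.html", action));
    }
}
```

Fetching the search page and re-reading the action on every run is what sidesteps the expiring URL: each six-hourly run gets its own fresh one.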
posted by wongcorgi at 6:19 PM on April 2, 2011

Seconding Wireshark + POST/GET analysis. I can't specifically help you with Google App Engine or Java, but if you are at all familiar with programming, I would look into Python. It's easily done with the urllib/urllib2/cookielib libraries.
posted by oracle bone at 6:23 PM on April 2, 2011

Look into the urlfetch helper for Google App Engine.

It makes things a LOT easier.
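
For Java specifically, plain java.net code may be all you need: as I understand it, App Engine's Java runtime carries java.net.URL connections over URL Fetch behind the scenes, but treat that as an assumption and check the current docs. A minimal sketch of posting an already-encoded form body and reading back the HTML follows. The class name FormPoster is made up, and the response is assumed to be UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Sends a form POST and returns the body of the response. */
final class FormPoster {

    /** Posts a pre-encoded name=value&name=value body to the given URL. */
    static String post(String url, String formBody) throws IOException {
        byte[] payload = formBody.getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we are going to write a request body
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream html = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    html.write(buffer, 0, read);
                }
                // Assumption: the page is UTF-8; check the Content-Type header if not.
                return new String(html.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

HttpURLConnection follows ordinary same-protocol redirects by default, so the string returned is normally the final results page rather than an intermediate response.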
posted by empath at 7:38 PM on April 2, 2011

You might also consider using a library designed for web interaction automation. Great examples in other languages are Ruby's Mechanize and Perl's WWW::Mechanize. There are also more complex solutions that actually emulate a web browser, like Selenium.
posted by vasi at 9:53 PM on April 2, 2011
