What skills would a developer need to automate file downloads from websites?
May 14, 2008 6:10 AM
What skills would a developer need to make a program to download multiple files from several different websites, name the files, and place them in a specific location?
The program would be meant to replace a person who logs into these websites and completes this process manually. Some more detail:
1) The number of files could be as high as 400+
2) The number of websites the program would have to access could be as high as 10 and each would have a different format (i.e. navigating to the location where the files are downloaded would be different, logins would be different, etc.)
3) The file names will likely have to be built from text read off the website or from the files themselves
In addition, maybe someone out there knows:
1) How much would a project like this generally cost in the NY-Metro area?
2) How long would a project of this size take to complete? I'm only looking for a general range.
3) Are there any "off the shelf" products that do this? A long search turned up nothing.
I really appreciate any ideas! Cheers!
posted by alrightokay to computers & internet (20 comments total)
2) There are Perl and Ruby libraries (search for "www mechanize") which pretend to be full web browsers. They handle details like website cookies (used to remember logins), "clicking" links, downloading files, and so on.
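To make the cookie part concrete, here is a minimal sketch in Python using only its standard library (a stand-in for this illustration, not the mechanize libraries themselves). The form field names "username" and "password" and the example.com URL are assumptions; a real site uses whatever names its login form declares.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build a POST request for a login form.

    The field names below are assumed for illustration only.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


def download(opener, url, dest_path):
    """Fetch url through the (logged-in) session and write it to dest_path."""
    with opener.open(url) as response, open(dest_path, "wb") as out:
        out.write(response.read())
```

Typical use would be `opener.open(build_login_request(...))` once, then `download(opener, file_url, path)` for each file; the cookie set at login is sent back automatically on every later request made through the same opener.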
3) Perl and Ruby also happen to be very good at parsing text. If you go with Ruby, there's a library (Hpricot) designed to pull text out of web sites (even when the sites aren't perfect); Beautiful Soup does the same job for Python.
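As a rough illustration of pulling link text out of a page and turning it into a file name, here is a sketch using Python's built-in html.parser (not Hpricot or Beautiful Soup, which are more forgiving and more convenient). The helper names and the sample markup in the usage note are made up for the example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from possibly untidy HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_file_links(html, extension):
    """Return (href, text) for links whose href ends with extension."""
    parser = LinkCollector()
    parser.feed(html)
    return [(href, text) for href, text in parser.links
            if href.lower().endswith(extension)]


def safe_filename(text, extension):
    """Turn link text into a file name using only letters, digits and underscores."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")
    return stem + extension
```

For a page containing `<a href="/files/q1.pdf">Q1 Report</a>`, `find_file_links(page, ".pdf")` yields the pair `("/files/q1.pdf", "Q1 Report")`, and `safe_filename("Q1 Report", ".pdf")` gives a name that can be used on disk.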
If I were to go about this, I'd implement a "driver" program to start the process, then lots of specific programs, one for each website you want to scrape. Each of these programs would range from half a page of code to 5 or 6 pages, depending on the details of grabbing the files.
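The driver-plus-scrapers layout could look roughly like this in Python. The two scraper functions are placeholders invented for the sketch (one returns a fake file, one fails on purpose); real ones would log in and download. The point is only that the driver treats every site the same way and keeps going when one site breaks.

```python
from pathlib import Path


def scrape_example_a():
    """Placeholder scraper: returns (filename, content) pairs for one site."""
    return [("invoice_001.pdf", b"%PDF-placeholder")]


def scrape_example_b():
    """Placeholder scraper that fails, to show the driver carrying on."""
    raise RuntimeError("login failed")


# One entry per website; the names are hypothetical.
SCRAPERS = {"example_a": scrape_example_a, "example_b": scrape_example_b}


def run_all(scrapers, dest_root):
    """Run every scraper and save its files under dest_root/<site name>/."""
    saved, failures = [], {}
    for name, scrape in scrapers.items():
        try:
            files = scrape()
        except Exception as exc:
            # Record the problem and move on to the next site.
            failures[name] = str(exc)
            continue
        site_dir = Path(dest_root) / name
        site_dir.mkdir(parents=True, exist_ok=True)
        for filename, content in files:
            path = site_dir / filename
            path.write_bytes(content)
            saved.append(path)
    return saved, failures
```

Calling `run_all(SCRAPERS, "downloads")` returns the list of saved paths and a dict of per-site errors, which is a convenient thing to print or email at the end of a run.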
For cost estimation, figure a few hundred bucks (8 hours * $50/hour = $400). You can get that down by having all the details ready for the programmer. The less time they waste figuring out what to click, or where to go, the more time they'll have to just write the code and get done.
If it were specified correctly, I'd estimate an 8 hour project (let's see, 20 minutes on a driver program, 30 minutes * 10 for the site components, plus fudge time and testing time).
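Spelled out as arithmetic, reading the 20 and the 30 as minutes and taking 10 sites at the $50/hour rate from the cost figure (Python used here only as a calculator):

```python
DRIVER_MINUTES = 20      # driver program
MINUTES_PER_SITE = 30    # one site-specific component
SITE_COUNT = 10          # upper bound given in the question
BUDGET_HOURS = 8         # overall estimate
HOURLY_RATE = 50         # dollars per hour, from the cost estimate

coding_minutes = DRIVER_MINUTES + MINUTES_PER_SITE * SITE_COUNT  # 320
slack_minutes = BUDGET_HOURS * 60 - coding_minutes               # 160
total_cost = BUDGET_HOURS * HOURLY_RATE                          # 400

print(coding_minutes, slack_minutes, total_cost)
```

So roughly 5 hours 20 minutes goes to writing code, which leaves about 2 hours 40 minutes of the 8 for fudge and testing.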
To specify the problem, you'd need a list of web sites, the login for each, and the exact path through each site to get the elements ("click this link, then this link, then login, then this link, download all the .jpg files listed").
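One way to hand that over is as a small data record per site, so the programmer can see at a glance what is still missing. This is only a sketch in Python; the field names, the example values, and the two helper functions are assumptions made for illustration.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class SiteSpec:
    """Everything needed to script one site (all field names hypothetical)."""
    name: str
    login_url: str
    username: str
    password: str
    steps: list = field(default_factory=list)  # link texts to click, in order
    file_pattern: str = "*.jpg"                # which files to download


def missing_fields(spec):
    """Return the names of required fields that are still empty."""
    required = ("login_url", "username", "password", "steps")
    return [name for name in required if not getattr(spec, name)]


def wanted(spec, filenames):
    """Keep the file names matching the spec's pattern, ignoring case."""
    return [f for f in filenames
            if fnmatch.fnmatchcase(f.lower(), spec.file_pattern)]
```

For example, a spec with an empty password makes `missing_fields` return `["password"]`, which is exactly the kind of question that otherwise eats billable time.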
As for off the shelf products, nothing that I can think of, but you might want to look into "windows automation" products which can record arbitrary keystrokes and mouse input. It might be possible to download files that way, but it'll be more brittle and annoying than the Ruby/Perl version.
posted by cschneid at 6:33 AM on May 14