What skills would a developer need to automate file downloads from websites?
May 14, 2008 6:10 AM

What skills would a developer need to make a program to download multiple files from several different websites, name the files, and place them in a specific location?

The program would be meant to replace a person who logs into these websites and completes this process manually. Some more detail:

1) The number of files could be as high as 400+

2) The number of websites the program would have to access could be as high as 10, and each would have a different format (i.e., navigating to where the files are downloaded would differ, logins would differ, etc.)

3) The file names will likely need to be built from text on the website or in the files themselves

In addition, maybe someone out there knows:

1) How much would a project like this generally cost in the NY-Metro area?

2) How long would a project of this size take to complete? I'm only looking for a general range.

3) Are there any "off the shelf" products that do this? A long search turned up nothing.

I really appreciate any ideas! Cheers!
posted by alrightokay to Computers & Internet (20 answers total)
 
1) There's a nice rule of thumb in programming, 0...1...n: everything happens one of those numbers of times, and how big n gets just doesn't matter (disk space and technical details aside).

2) There are perl or ruby libraries (search for "www mechanize") which pretend to be full web browsers. They handle details like website cookies (used to remember logins), "clicking" links, downloading files, and so on.

3) perl and ruby also happen to be very good at parsing text. If you go with ruby, there are libraries (hpricot, beautiful soup) designed to pull text out of web sites (even when the sites aren't perfect).

If I were to go about this, I'd implement a "driver" program to start the process, then lots of specific programs, one for each website you want to scrape. Each of these programs would range from half a page of code to 5 or 6 pages, depending on the details of grabbing the files.
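
For illustration, a rough Python sketch of that driver-plus-modules structure (the scraper functions here are empty placeholders and every name is made up; the same shape works in perl or ruby):

# driver.py -- rough sketch of "one driver, one scraper per site".
# Each scraper logs into its site, finds the files, and returns a list
# of (filename, bytes) pairs; the two here are empty placeholders.
import os

def fetch_site_alpha():
    # log in, click through to the download page, collect files...
    return []

def fetch_site_beta():
    return []

SCRAPERS = [fetch_site_alpha, fetch_site_beta]   # ...up to ten of these
OUTPUT_DIR = "downloads"

def main():
    if not os.path.isdir(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)
    for scraper in SCRAPERS:
        try:
            for filename, data in scraper():
                with open(os.path.join(OUTPUT_DIR, filename), "wb") as f:
                    f.write(data)
        except Exception as exc:
            print("Problem in %s: %s" % (scraper.__name__, exc))

if __name__ == "__main__":
    main()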

For cost estimation, figure a few hundred bucks (8 hours * $50/hour). You can get that down by having all the details ready for the programmer. The less time they waste figuring out what to click or where to go, the more time they'll have to just write the code and get done.

If it were specified correctly, I'd estimate an 8-hour project (let's see: 20 minutes on a driver program, 30 minutes * 10 for the site components, plus fudge time and testing time).

To specify the problem, you'd need a list of logins, web sites, exact path through the web site to get the elements ("click this link, then this link, then login, then this link, download all the .jpg files listed").

As for off the shelf products, nothing that I can think of, but you might want to look into "windows automation" products which can record arbitrary keystrokes and mouse input. It might be possible to download files that way, but it'll be more brittle and annoying than the ruby/perl version.
posted by cschneid at 6:33 AM on May 14, 2008


The programmer would have to know how to call the programs wget or curl from a script or program that reads a list of URLs, logins, and output file names, e.g.:
www.foo.com/bar, foouser/foopass, foo-bar.txt.

Assuming that the file locations did not change (e.g., if the file's location is www.example.com/foo/bar/baz.txt today, it's the same tomorrow), and the save file name did not change, this would be trivially easy to code.

Making it slightly less simple would be coming up with a way to indicate that an error occurred and a file could not be fetched; this is only difficult insofar as you haven't specified what should be done in case of an error like this.

Cost would be probably two hours of the programmer's time, mostly to cover contingencies like error reporting. Time would be similar.

You could do this yourself by writing a script around wget or curl.

wget basically works like this: wget url. E.g., "wget www.foo.com/bar.txt" will produce a copy of bar.txt in the current directory. Handling logins, etc., is slightly more complicated.
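
A rough sketch of that list-driven approach in Python, assuming the "url, login/password, output name" format above and shelling out to wget (the -O option names the output file and a nonzero exit code means the download failed; check the wget manual for whatever other options your sites need):

# fetch.py -- minimal sketch of a list-driven wget wrapper.
# Each line of urls.txt looks like: www.foo.com/bar, foouser/foopass, foo-bar.txt
import subprocess

def fetch(line):
    url, login, outname = [part.strip() for part in line.split(",")]
    user, password = login.split("/")
    # Embed the credentials in the URL (works for basic auth; see the
    # wget manual for more secure alternatives).
    full_url = "http://%s:%s@%s" % (user, password, url)
    status = subprocess.call(["wget", "-q", "-O", outname, full_url])
    return status == 0

if __name__ == "__main__":
    with open("urls.txt") as listing:
        for line in listing:
            if line.strip() and not fetch(line):
                print("FAILED: %s" % line.strip())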
posted by orthogonality at 6:33 AM on May 14, 2008


It depends very much on how robust and polished you want the solution to be. A competent coder could probably bang something out in half a day, using wget and black arts like perl, but it wouldn't be suitable for use in a mission-critical setting or by nontechnical people.
posted by ghost of a past number at 6:34 AM on May 14, 2008


Sorry, after looking at it, beautiful soup is a python library. There are equivalent libraries for perl and ruby, I just can't think of what they are off the top of my head.
posted by cschneid at 6:36 AM on May 14, 2008


Also, it might be worth noting that although these kinds of applications are relatively easy to write, they often break due to a site slightly changing its page format or login process. You probably won't need changes made very often, but you won't be able to count on this application to work forever without any changes.
posted by burnmp3s at 6:37 AM on May 14, 2008 [1 favorite]


Here's a very simple way of getting what you want:
wget -i myurls.txt -o log.txt

where myurls.txt is a file with one url per line, with your login and password embedded in it:

http://user:password@host/path

Note that by embedding your login and password, you lose some security. Consult the wget manual for better ways to go about this.

Again, this assumes that a file's location on the web site stays the same each day (or changes in some predictable way, so that you can likewise change myurls.txt).
posted by orthogonality at 6:44 AM on May 14, 2008


Are you shopping around for developers? This would be a very small job for a professional. I'd say this is a 1-2 hour project at most. File manipulation and downloading via the web is beginner's stuff.

Of course, this assumes you don't ask for a nice GUI-based admin panel, email alerts, SQL backend, etc.
posted by damn dirty ape at 6:45 AM on May 14, 2008


Please note that orthogonality's solution with user:pass@host/path only works with certain types of authentication. My approach takes a bit more up-front work, but is much more flexible: since ruby would be pretending to be an actual browser, rather than just sending requests directly to the server, all sorts of authentication, cookies, and sessions work where the wget solution wouldn't.
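
For example, a form-based login with the Python version of mechanize looks roughly like this (the URLs and form field names are invented; you'd get the real ones by looking at the site's login page):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/login")   # hypothetical login page
br.select_form(nr=0)                      # first form on the page
br["username"] = "myuser"                 # field names depend on the site
br["password"] = "mypass"
br.submit()                               # cookies and session are kept for you

# Now fetch a file that sits behind the login and save it to disk.
response = br.open("http://www.example.com/reports/latest.pdf")
with open("latest.pdf", "wb") as f:
    f.write(response.read())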
posted by cschneid at 6:54 AM on May 14, 2008


Response by poster: burnmp3s: Agreed, the websites are certainly likely to change.

damn dirty ape: No nice GUI is required, but certainly some error checking is involved, e.g., file download successful, download failed, etc.

How much does this all change if the program is more robust? E.g., if a website changes, all the user has to do is change a few items in the code to make it work again. Keep in mind that the user editing the code would have minimal programming knowledge but would certainly know where to navigate on the web page.

The program doesn't have to be pretty - it just has to work.
posted by alrightokay at 6:59 AM on May 14, 2008


Best answer: I am a software consultant who has done many web-spidering projects that would have been described exactly this way at the start. A competent NYC developer will require an 80-hour retainer at $200-$250/hour to start this project. While it could be done in eight hours with commodity tools, there are likely to be complications:

1) Your list of files to retrieve and rules for naming the files on disk will almost certainly require refinement.

2) There are likely to be aspects to logging in to and dealing with the sites that will require custom programming.

3) Even though you say today that you just want the tool to work with your current list of sites in their current state, a wise developer will budget being called back to add or change sites into the retainer.

4) A competent developer will want to fully document the system so some other competent developer can work on it when he or she gets hit by a bus.

Of course, your management will probably choose someone who says they can do it in 8 hours at $20 per on Rent-A-Coder.
posted by backupjesus at 7:09 AM on May 14, 2008 [3 favorites]


"could be done" should be "might be done".
posted by backupjesus at 7:10 AM on May 14, 2008


Response by poster: backupjesus: This sounds more like it, but a bit high, no? The issues you point out are certainly valid but at the costs you estimate, management would likely just hire someone to do it manually. Is it possible someone could do it for less than 200/hr for 80 hrs and still do a decent job - say 100/hour for 20-40 hours? Keep in mind that we would be able to do much of the leg-work involved in getting the specs.
posted by alrightokay at 7:28 AM on May 14, 2008


beautiful soup is a python library

There is a Ruby port. And yes, 80 x 200 seems amazingly high for this. While I agree you don't want to go with the lowest bidder, beware the grizzled veteran who sees complications in everything, says I, who suffers from that disability.
posted by yerfatma at 7:46 AM on May 14, 2008


How much does this all change if the program is more robust? E.g. the website changes, and all the user has to do is change a few items in the code to work again? Keep in mind that the user editing the code would have minimal knowledge but would certainly know where to navigate to in the web page.

It's very unlikely that a non-technical person would be able to modify any of these scripts to work if the page has changed. I don't have experience with the specific library in question, but I have some experience with screen scraping - generally, these libraries are good in the sense that they work well and are easy to use for programmers, not that they would be easy to use for laypeople.

This is the beautiful soup documentation.

You need a good command of HTML and of the programming language being used. HTML pages are complex, so the data representations used by these libraries are generally complex data structures.
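
For a taste, pulling download links out of a page with Beautiful Soup (the modern beautifulsoup4 package) looks something like this; the URL, tag, and class name are invented, and a real script depends on exactly how the target page's HTML is structured, which is also why it breaks when the page changes:

from urllib.request import urlopen
from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = urlopen("http://www.example.com/reports").read()
soup = BeautifulSoup(html, "html.parser")

# Print the text and target of every link marked with a (hypothetical)
# "download-link" class; the tag and class come from reading the page's
# HTML by hand.
for link in soup.find_all("a", class_="download-link"):
    print(link.get_text(strip=True), link["href"])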
posted by meowzilla at 8:17 AM on May 14, 2008


Keep in mind that we would be able to do much of the leg-work involved in getting the specs.

No offense, alrightokay, but every client says that. Getting good specs is difficult, and the figure of 400 discrete documents is a red flag. Let's say 90% of the docs are specified fully but 10% require ten minutes each of discussion to resolve (which is conservative)...there goes roughly seven hours.

Is it possible someone could do it for less than 200/hr for 80 hrs and still do a decent job - say 100/hour for 20-40 hours?

Possible, yes, but unlikely if you're using a NY-area consultant; someone available for $100/hour short-term consulting is inexperienced, unskilled, or both. (...or just doesn't know his or her value -- but those people are usually fully booked.) I would strongly consider offshore options before looking at low-cost local resources.
posted by backupjesus at 8:23 AM on May 14, 2008


Response by poster: backupjesus: No offense taken. As a person in consulting, I too am familiar with the client thinking that everything is peachy when in fact it's a terrible mess. An excellent point, but I'm still not convinced the main problem is that difficult.

I would strongly consider offshore options before looking at low-cost local resources.

In order to consider this, the savings would have to be significant. Perhaps my main question should have been: "Where can I find an offshore, low-cost, and reliable company for program development?"
posted by alrightokay at 9:00 AM on May 14, 2008


There are a few gotchas with this kind of thing, as others have noted. It's also, however, a problem that's been solved a billion times, and you don't seem to have any complicated integration issues. I'd check out someplace like Rent A Coder, where there are probably quite a few folks who have done similar projects.

Be forewarned: Make sure your specs are thorough and specific.
posted by mkultra at 10:02 AM on May 14, 2008


You don't even have to go offshore to get out of local rates - NYC costs are just plain high; hit up something like Rent A Coder and see what you can get. Heck, I'd toss my own hat into the ring if I were still in college; my senior-year final project involved about three-quarters of your requirements.
posted by Tomorrowful at 10:05 AM on May 14, 2008


If you're going to rentacoder, why not try MeFi Jobs just around the corner?

WILL WORK FOR FAVORITES
posted by ghost of a past number at 1:41 PM on May 14, 2008


I just finished doing this, including learning Python, researching mechanize and Beautiful Soup, and then implementing the entire thing in about 20 hours (start to finish, including terse documentation and basic testing, i.e., none). It is designed to pull in class files that handle specific sites.

In my case I am indexing "posts" to sites that are made in a custom format instead of downloading files. Each class file has a "rule" for the object it returns, so they tend to be fairly dissimilar from instance to instance, but all return the same fields in the return object.
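
Roughly, the shape of each class file is something like this (names invented, all the real fetching and parsing omitted):

class SiteHandler(object):
    """Base class: each site-specific subclass knows how to fetch and
    parse its own site, but they all return the same fields."""
    def fetch_posts(self):
        raise NotImplementedError

class ExampleSite(SiteHandler):
    def fetch_posts(self):
        # ...download and parse this site's custom format here...
        return [{"title": "...", "url": "...", "date": "...", "body": "..."}]

if __name__ == "__main__":
    for handler in (ExampleSite(),):
        for post in handler.fetch_posts():
            print(post["title"], post["url"])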
posted by SirStan at 8:47 PM on May 16, 2008

