Help me hire a programmer
February 27, 2014 6:15 PM

I am working on my dissertation, and I need to hire someone to write some code that will scrape content off of a password-protected site that uses AJAX. I don't really know how to go about doing this.

The site that I am trying to harvest content from is a web forum that uses Simple Machines.

I have permission from the site owners to scrape content, and they have given me a password, so we're all good on that front - I just don't know how to actually go about finding someone who will be able to do this for me.

I simply need to scrape a bunch of the text from this site so that I can analyze it. The formatting, images, all of those things aren't important to me - just the textual content.

I don't know what the timeline should look like for a job like this. Because I am not a programmer, scoping the project is difficult for me. How long does a job like this take?

I also don't really know how to go about paying someone for this job - would I offer an hourly rate? A flat fee? And how much is reasonable to pay for something like this?

Finally, I am computer-literate and used to know how to program a little bit, so I would be fine with a programmer either giving me a script that I run on my own computer to harvest the content, or just handing me a big folder full of text files (as long as they are organized and clearly labeled). What is a more reasonable deliverable? Is one easier or more standard than the other?
posted by anonymous to Work & Money (17 answers total) 8 users marked this as a favorite
 
You might try doing this yourself. The python website scraping libraries (like beautiful soup) are really beginner friendly and there are a million tutorials out there to get you started.
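Something like the following is roughly all it takes with requests and Beautiful Soup - a bare-bones sketch, and the URL and login field names are just guesses, so check the forum's actual login form before relying on them:

import requests
from bs4 import BeautifulSoup

BASE = "http://forum.example.com"   # placeholder
session = requests.Session()

# Simple Machines logins are usually a POST to action=login2; field names may differ
session.post(BASE + "/index.php?action=login2",
             data={"user": "your_username", "passwrd": "your_password"})

page = session.get(BASE + "/index.php?board=1.0")   # any board or topic url
soup = BeautifulSoup(page.text)
print(soup.get_text())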
posted by rockindata at 6:31 PM on February 27, 2014 [1 favorite]


You can easily find specialists in web scraping on independent contractor sites like Elance and Odesk.
posted by Dansaman at 6:45 PM on February 27, 2014 [1 favorite]


This doesn't sound like a hard project, unless there's a lot of different kinds of text, or it has to be sorted based on its content somehow. This is especially true if you just care about the text, once, for your dissertation, and aren't that worried about code that's well-written and maintainable and robust and all that. I think you should find an eager undergraduate at your institution and pay them as little as you can, and then they can say they "participated in research". You could probably find someone willing to take a flat fee in exchange for code+the folder of text files. If you go with an hourly rate I think maybe a good starting place might be slightly more than the work-study wages for university jobs.
posted by vogon_poet at 6:46 PM on February 27, 2014 [3 favorites]


I suggest posting a flyer in your school's CS department to attract an undergrad. Without knowing any more details, I'd try for $30 and be willing to pay $50.
posted by Kwine at 6:47 PM on February 27, 2014 [2 favorites]


$30 to $50 is a little low-ball, but for once this project does seem small and simple. Depending on any unmentioned requirements, maybe just a couple of hours hacking.

So if it's a five hour job at say $100/hr for a pro programmer, I'd offer a bright CS student $200 or so.
posted by teatime at 6:55 PM on February 27, 2014 [1 favorite]


Depending on the complexity of the website, you might be able to do this with minimal scripting and repeated calls to curl. Poke around the curl site and see if you can figure out a way to get it to do what you want. The bulk of the work would probably be figuring out how to login and how to organize the data you get.

Building scrapers is part of my job description, but not all of it. Something like you describe would take me anywhere between 0 and 4 hours, depending on the complexity of the data layout and login protocols. I would be willing to do this for you for a flat fee. MeMail me if you're interested.

On preview, I like the idea of getting an undergrad to help. You could also consider posting this to Metafilter Jobs.
posted by rhythm and booze at 7:00 PM on February 27, 2014 [1 favorite]


There are scraper programs that do this in a single command. If you have Linux, wget is good.

wget -r --user=abcd --password=secret -O all_the_pages.txt http://www.yoursite.com


I think that gets all the html/htm files; getting rid of the HTML tags is a different process. If the order of the files is important, there are options to build a directory tree and concatenate the files in the order needed. Ah, there is a wget for Windows.
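If you have Python around, stripping the tags afterwards can be a few lines with Beautiful Soup (the file names here are placeholders - point them at whatever wget actually saved):

from bs4 import BeautifulSoup

with open("all_the_pages.txt") as f:
    soup = BeautifulSoup(f.read())

with open("just_the_text.txt", "w") as out:
    out.write(soup.get_text("\n"))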
posted by sammyo at 8:33 PM on February 27, 2014 [3 favorites]


I recommend using curl or wget. You can submit forms (i.e. put in your username and password) with either tool, and it is trivial to rip/scrape/save output to a file.
posted by Blazecock Pileon at 9:37 PM on February 27, 2014


There's also Scraperwiki, which has varying levels of online tools for scraping data. They also have 'scrapers for hire' (I was one of them, once), so you might be able to hire someone through them.
posted by drwelby at 9:53 PM on February 27, 2014 [1 favorite]


I would just try it using one of the python website scraping libraries mentioned above.

If you run into trouble, swing by a Python/Mac/Linux user group near you with a box of donuts and ask for help (only after you've had a proper go at it). Or your institution's comp sci undergrad student club room.
posted by sebastienbailard at 9:57 PM on February 27, 2014


You mention AJAX, which probably takes the easy solutions such as curl and wget out of the picture: tools like that do not interpret JS, and will not see any content loaded asynchronously.
posted by Dr Dracator at 10:21 PM on February 27, 2014 [3 favorites]


This seems like a good candidate for MetaFilter Jobs.
(this sounds like a fun diversion, you're welcome to me-mail me directly too)

Do you know which part of the site uses Ajax? If it's just the login form, you should be able to log in manually and pass the cookie details to your scripts.
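For example, with Python's requests you can drop a cookie copied out of your browser straight into a session - the cookie name and value below are placeholders, so pull the real ones from your browser's developer tools after logging in:

import requests

session = requests.Session()
# Placeholder cookie - copy the real name/value from your browser after logging in
session.cookies.set("SMFCookie123", "paste-cookie-value-here")

page = session.get("http://forum.example.com/index.php?board=1.0")
print(page.text[:500])   # quick check that you're seeing logged-in content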

If you know how to tinker with scripts, I would ask the programmer to give you the scripts and instructions on how to use them, rather than just the results. That way if you need to make adjustments, you can do it yourself without extra assistance.
posted by Gomez_in_the_South at 11:28 PM on February 27, 2014 [1 favorite]


If you want to do it yourself, PhantomJS is an excellent tool for scraping, and easy enough to get started with that you should be able to figure it out if you have a little bit of programming experience.
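If you'd rather drive it from Python than write PhantomJS scripts directly, the selenium bindings can run PhantomJS for you - that's my substitution, not the only way to use it, and the URL is a placeholder:

from selenium import webdriver

driver = webdriver.PhantomJS()   # needs the phantomjs binary on your PATH
driver.get("http://forum.example.com/index.php?topic=1.0")
print(driver.page_source)        # the DOM after any AJAX has finished running
driver.quit()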
posted by deathpanels at 4:45 AM on February 28, 2014


Could you ask the site operators for a MySQL dump of the relevant tables? No scraping needed after that.
posted by bleucube at 6:24 AM on February 28, 2014 [2 favorites]


User-friendly tools for doing this are having a kind of upswing. Kimono is very impressive, and I've had success with Import.io as well.
posted by mbrock at 6:45 AM on February 28, 2014


How long does a job like this take?

It really depends on the experience of the person doing it. You'd also need to be a bit more specific with your requirements: you say, "I simply need to scrape a bunch of the text from this site so that I can analyze it" but what text? And does it need context? Are you only interested in comments? The post titles? The user who posted the data?

The bare bones of it would be a variation on the approach suggested here:

Getting the posts
  1. Log into the site with requests (you could do this with a number of libraries, including Python's built-in urllib or urllib2, but life is way too short)
  2. Store the cookie/session info you need from the response - you probably don't need to worry about the Ajax-ness, it's most likely just a POST to a URL, but learning how to store the credentials is tricky at first.
  3. Request the root url of the forum
  4. Create a dictionary for storing info about the forums, sub-forums and posts
  5. You may also want to store user info (so you can key names to posts or comments)
  6. Assuming all forums are like Simple Machines' community forum . . .
  7. Grab all of the sub-forum urls
  8. For each sub-forum,
    • Stick the board number and id in your storage variable, add an empty list for topic ids
    • Unhappily, the URL parameter seems to be a count, rather than a page number, so
    • Open the first page and look for the page 2 link
    • If it exists, get the counter (defaults to 25?) from the page 2 link and find the last page number link to use as your end point
    • If it doesn't exist, you're all set in terms of finding all pages of posts for the sub-forum
    • Open each of those pages and stick the urls into your posts storage variable

  9. For each sub-forum page
    • Check your post urls list to make sure you haven't already indexed it
    • Open the link to each post
    • Go through much the same page logic as you did for the subforums
    • Then open each page of the post and grab the content via Beautiful Soup

Parsing posts with Beautiful Soup (a rough sketch follows the list)
  • Each comment on a post page is in div.post_wrapper (i.e., a div tag with a class of "post_wrapper"). Loop over each of these in Beautiful Soup.
  • Inside each is a div.poster. The user name is in an a tag inside an h4 (so the full path to the user info from each post is "div.poster h4 a:text()", if I remember my BS)
  • At the same level as the poster is a div.postarea
  • The comment title (not sure how useful) is in div.flow_hidden div.keyinfo h5 a:text()
  • Actual content is in the div.inner two layers down (div.post div.inner) which has raw text plus break tags for line separation. This div also has an id with the comment's database id which could be handy for keeping track of things
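
A rough sketch of that parsing logic, assuming the class names above match your forum's theme (Simple Machines themes vary, so adjust the selectors as needed):

from bs4 import BeautifulSoup

def parse_post_page(html):
    """Pull user, title, text and comment id out of one page of a topic."""
    soup = BeautifulSoup(html)
    comments = []
    for wrapper in soup.select("div.post_wrapper"):
        user = wrapper.select("div.poster h4 a")
        title = wrapper.select("div.keyinfo h5 a")
        inner = wrapper.select("div.post div.inner")
        if not inner:
            continue
        comments.append({
            "user": user[0].get_text() if user else None,
            "title": title[0].get_text() if title else None,
            "id": inner[0].get("id"),         # carries the comment's database id
            "text": inner[0].get_text("\n"),  # break tags become newlines
        })
    return comments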

posted by yerfatma at 12:15 PM on February 28, 2014


Update from the anon OP:
I am using Import.io, suggested by mbrock, and I'm getting everything I need in a format that works really well for my data analysis methods. It's a really nice little scraping tool; very user-friendly and efficient.

I was also thankful for the offers from several MeFites to actually write the script for me, but Import.io does what I want, it does it for free, and I've already collected pretty much all of the data I needed (after months of agonizing over how I was going to get it), so I'm very grateful that I posted this question!
posted by LobsterMitten at 5:07 PM on February 28, 2014


This thread is closed to new comments.