Running into Ruby Restriction
March 9, 2009 4:06 AM

In Ruby, how can you get around the ~65,500 character limit when grabbing a web page?

I'm new to Ruby, but have started to use it to scrape information from websites. I have been using either the Hpricot package or Net::HTTP. Unfortunately, when I've tried using these to scrape larger web pages, the streams cut off after about 65,500 characters. I haven't found any information online about this. Is there a way to get around this limit? Can you split the stream across two arrays or strings? Or will I have to manage the stream myself with new code?
posted by FuManchu to Computers & Internet (7 answers total) 2 users marked this as a favorite
 
Does Ruby have a 65k limit on string sizes? Then you need to use something else. If you wanted a hack, you could push it onto an array.

I would try to process the screen-scraped string as it comes in, though. In Perl: while (<>) { do_something; }
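
For what it's worth, a rough Ruby counterpart to processing the page as it arrives might look like the sketch below. It assumes Ruby's standard Net::HTTP library and relies on the block form of read_body, which hands over the body in chunks instead of building one big string. The method name each_chunk is made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: yields the response body piece by piece as it is
# read off the socket, so the whole page never has to sit in one string.
def each_chunk(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    # The block form of request_get hands over the response before the
    # body has been read, which is what makes read_body streamable.
    http.request_get(uri.request_uri) do |response|
      response.read_body { |chunk| yield chunk }
    end
  end
end
```

Passing a block that adds up chunk.bytesize would show how many bytes actually arrive, which is a quick way to check whether anything is being cut off at all.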
posted by devnull at 5:36 AM on March 9, 2009


No, there is no such limitation on strings in Ruby. I think perhaps there is a bug somewhere else?
posted by meta_eli at 6:40 AM on March 9, 2009


I don't think that's the limit on either Ruby strings or what Hpricot can handle. pastie

Can you give an example of a page that doesn't completely parse? Without more info, I'm favoring seriously screwed up markup as the culprit.
posted by samsm at 6:41 AM on March 9, 2009


A MySQL TEXT field has a limit of 65,535 bytes (roughly 65k characters) and will silently truncate anything longer. Are you storing the resulting pages in the db? If so, you'll need to use MEDIUMTEXT instead.
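
If the database does turn out to be involved, a quick check on the Ruby side could look something like this. It is only a sketch: the byte limits are MySQL's documented maximums for these column types, while the constant and method names are invented for the example. Note that the limits are in bytes, so multi-byte characters use them up faster than the character count suggests.

```ruby
# Maximum sizes in bytes for three MySQL text column types, smallest first.
MYSQL_TEXT_LIMITS = {
  'TEXT'       => 65_535,
  'MEDIUMTEXT' => 16_777_215,
  'LONGTEXT'   => 4_294_967_295
}.freeze

# Hypothetical helper: returns the smallest listed column type that can hold
# the page without truncation, or nil if none of them can.
def smallest_text_type(page)
  match = MYSQL_TEXT_LIMITS.find { |_type, limit| page.bytesize <= limit }
  match ? match.first : nil
end
```

Comparing page.bytesize before and after a round trip through the database would confirm whether truncation is happening there rather than in the scraper.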
posted by Caviar at 7:11 AM on March 9, 2009


Arg, you guys are right... I can still parse things like Wikipedia on New York, with a couple hundred thousand characters.

There's something in the pages (all on one website) which is throwing off both the Hpricot and Net::HTTP calls. It's not obvious from the HTML, though. Looks like I'll be spending time figuring out the more specific issue.
posted by FuManchu at 7:24 AM on March 9, 2009


Perhaps the server is sending a bogus Content-Length header, which you won't see in the HTML. It's possible that a web browser wouldn't honor that, but the Ruby libraries would. Can you give any details about the specific site or an example URL?
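
One way to test that theory without a packet sniffer is to skip the HTTP library and read the raw response off a socket, then compare the declared Content-Length with the number of body bytes the server really sends. The following is a minimal sketch under a few assumptions: plain HTTP only (no HTTPS), a server that closes the connection when it is done, and a helper name, content_length_report, that is invented here.

```ruby
require 'socket'

# Hypothetical helper: fetches a page over a plain socket and reports the
# Content-Length the server declared next to the bytes it actually sent.
def content_length_report(host, path = '/', port = 80)
  socket = TCPSocket.new(host, port)
  # An HTTP/1.0 request with "Connection: close" asks the server to send the
  # body and hang up, so we can simply read until the end of the stream.
  socket.write("GET #{path} HTTP/1.0\r\nHost: #{host}\r\nConnection: close\r\n\r\n")
  raw = socket.read
  socket.close

  # Headers and body are separated by the first blank line.
  headers, body = raw.split("\r\n\r\n", 2)
  declared = headers.to_s[/^Content-Length:\s*(\d+)/i, 1]
  { declared: declared ? declared.to_i : nil, actual: body.to_s.bytesize }
end
```

If the declared value comes back smaller than the actual one, a client that trusts the header would stop reading early, which would match the truncation described in the question.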
posted by pocams at 8:28 AM on March 9, 2009


Consider busting out a packet sniffer like Wireshark to see what the actual HTTP conversation looks like.
posted by Good Brain at 10:17 AM on March 9, 2009 [1 favorite]

