Running into Ruby Restriction
March 9, 2009 4:06 AM
In Ruby, how can you get around the ~65,500 character limit when grabbing a web page?
I'm new to Ruby, but have started using it to scrape information from websites, with either the Hpricot package or net/http. Unfortunately, when I've tried using these to scrape larger web pages, the streams cut off after about 65,500 characters. I haven't found any information online about this. Is there a way to get around this limit? Can you split the stream across two arrays or strings? Or will I have to manage the stream myself with new code?
Best answer: No, there is no such limitation on strings in Ruby. I think perhaps there is a bug somewhere else?
posted by meta_eli at 6:40 AM on March 9, 2009
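A quick sanity check of that answer, as a minimal sketch: build a string well past the suspected limit and confirm Ruby keeps every character.

```ruby
# Build a string far beyond ~65,500 characters and confirm nothing is lost.
s = "a" * 200_000
puts s.length  # => 200000
```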
I don't think that's the limit on either Ruby strings or what Hpricot can handle (pastie).
Can you give an example of a page that doesn't completely parse? Without more info, I'm favoring seriously screwed up markup as the culprit.
posted by samsm at 6:41 AM on March 9, 2009
A MySQL TEXT field has a limit of 65,535 bytes and will silently truncate. Are you storing the resulting pages in the DB? If so, you'll need to use MEDIUMTEXT instead.
posted by Caviar at 7:11 AM on March 9, 2009
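For reference, the column caps behind that advice, with a hypothetical migration (the `pages`/`body` table and column names are placeholders, not from the thread):

```ruby
# MySQL TEXT holds up to 2**16 - 1 bytes; MEDIUMTEXT up to 2**24 - 1.
text_cap       = 2**16 - 1  # 65_535 -- right at the observed cutoff
mediumtext_cap = 2**24 - 1  # 16_777_215

# Hypothetical migration to widen the column storing the scraped pages.
sql = "ALTER TABLE pages MODIFY body MEDIUMTEXT"
```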
Response by poster: Argh, you guys are right... I can still parse pages like the Wikipedia article on New York, which runs to a couple hundred thousand characters.
There's something in the pages (all on one website) that is throwing off both the Hpricot and net/http calls. It's not obvious from the HTML, though. Looks like I'll be spending time figuring out the more specific issue.
posted by FuManchu at 7:24 AM on March 9, 2009
Perhaps the server is sending a bogus Content-Length header, which you won't see in the HTML. It's possible that a web browser wouldn't honor that, but the Ruby libraries would. Can you give any details about the specific site or an example URL?
posted by pocams at 8:28 AM on March 9, 2009
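One way to test that theory from Ruby is to compare the declared Content-Length against the bytes actually received. A minimal sketch; the helper name `truncated?` is mine, not from the thread:

```ruby
# Returns true when the body is shorter than the server's declared
# Content-Length, i.e. the response was cut off in transit.
def truncated?(content_length, body)
  return false if content_length.nil?  # no header to compare against
  body.bytesize < content_length.to_i
end
```

With Net::HTTP you would call it as `truncated?(res['Content-Length'], res.body)` after fetching the page.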
Consider busting out a packet sniffer like Wireshark to see what the actual HTTP conversation looks like.
posted by Good Brain at 10:17 AM on March 9, 2009 [1 favorite]
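If installing Wireshark feels like overkill, Net::HTTP can also dump the raw conversation itself. A sketch, assuming `example.com` stands in for the real site:

```ruby
require 'net/http'

# Log the raw request/response exchange to stderr (debug use only).
http = Net::HTTP.new('example.com', 80)
http.set_debug_output($stderr)
# http.get('/')  # uncomment to perform the request and watch the wire traffic
```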
I would try to process the screen-scraped string as it comes in, though. In Perl: while (my $line = <$fh>) { ... }
posted by devnull at 5:36 AM on March 9, 2009

This thread is closed to new comments.