Message board archive app?
March 10, 2006 5:02 AM
Recommendations on an app which can archive message board content on my PC?
I'd like an app to be able to sequentially crawl a block of messages (say #s 1-5000) and archive them to my PC for local searching. I'm primarily thinking of Yahoo's finance boards - I find their online search terrible. Thanks
wget has problems with Yahoo because of the interstitial ads Yahoo inserts in the archive. You end up losing 1/3 of the messages and the threading is all messed up.
I asked this before, but the only solution involved installing Perl, which turned out to be too much effort, YMMV. I'm hoping you get an easier answer.
posted by Mitheral at 9:30 AM on March 10, 2006
If you don't get a "use this package, it solves all your problems" answer, there are probably dozens of other potential solutions, ranging from complicated to much less so. The best fit depends on your needs and preferences, and answering the following questions can help.
Most importantly, are you willing to wait more than an hour to download the 5000 messages? Yahoo's TOS doesn't list hard limits, although Yahoo reserves the right to limit access. Still, hitting a site sequentially more than once a second for the sheer number of pages you want to download would probably be considered abuse by most webmasters (Google's bots hit my site once every two seconds).
Do you have, or are you willing to install, Perl or another scripting language? I think a Perl install on a Windows PC is pretty easy, but there is a counter-opinion on the table already.
Subset of above question: Do you want something which works as-is or are you willing to do a bit of customization to the source? A moderate amount? A lot?
Do you just want the message text with identifying author/date/#, or will the entire web page suffice? Obviously pulling the whole web page takes up more room and introduces a lot of extraneous material, but it also allows a certain brute force approach that eliminates the need for more intelligent front-end processing.
Do you have upload rights to a site, your own or a public repository, where you can send the parsed message file(s) and pull them down later to local storage? That approach may be easier if you use a browser-based solution, where writing to your local disk is difficult.
Subset of the above question: How about using Gmail or another large capacity mail service? Is e-mailing the files sufficient for your purposes if you can then do searching within those mails off- or on-site?
While Yahoo's finance board search does massively suck, site-specific searching via Yahoo or Google advanced search might sufficiently cover what you want to do (unfortunately it looks like they both also rather suck in their overall coverage of the existing message base). Otherwise you are effectively duplicating a lot of spidering effort for your personal database. That's fine if you must, but it's something to avoid if there is another way to do it.
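To put rough numbers on the pacing question: at one request per second, 5000 messages need about 83 minutes of pauses alone, and at one every two seconds about 167. Below is a minimal Python sketch of a polite sequential crawl, not a tested Yahoo scraper. The URL pattern, the function names, and the two-second default are all assumptions made for illustration, since the real board URLs aren't given in this thread.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical URL pattern; substitute the real per-message URL of the board.
URL_TEMPLATE = "https://example.com/board/message/{n}"


def estimated_minutes(count: int, delay_seconds: float) -> float:
    """Rough crawl time from the pauses alone, ignoring download time."""
    return count * delay_seconds / 60.0


def fetch_with_urllib(url: str) -> str:
    """Fetch one page using only the standard library."""
    from urllib.request import urlopen

    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    numbers: Iterable[int],
    fetch: Callable[[str], str] = fetch_with_urllib,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, str]:
    """Fetch each message number in order, pausing between requests."""
    pages: Dict[int, str] = {}
    for i, n in enumerate(numbers):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        try:
            pages[n] = fetch(URL_TEMPLATE.format(n=n))
        except OSError as err:
            # urllib's errors derive from OSError; note the gap and move on
            print(f"message {n} failed: {err}")
    return pages


if __name__ == "__main__":
    for delay in (1.0, 2.0):
        minutes = estimated_minutes(5000, delay)
        print(f"5000 messages, {delay:.0f} s apart: about {minutes:.0f} minutes")
```

Passing `fetch` and `sleep` in as parameters is just a convenience so the loop can be tried against a fake fetcher before pointing it at a real site.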
posted by mdevore at 2:40 PM on March 10, 2006
Specifically: wget --mirror and some flags depending on the specifics
It supports tons of platforms and is flexible on the command line. The downside is that you're getting a local HTML copy, which might affect search speed and results depending on what you use for searching, but you can convert HTML to plain text with other tools if needed.
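As one possible take on the HTML-to-plain-text step, here is a small Python sketch using only the standard library's `html.parser`. The class and function names are made up for this example, and it assumes nothing about how Yahoo actually lays out its pages, so it will keep navigation text along with the message itself.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text chunks of an HTML page, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def search(texts: dict, term: str) -> list:
    """Return the keys of all documents containing term, ignoring case."""
    needle = term.lower()
    return [key for key, text in texts.items() if needle in text.lower()]
```

With the pages reduced to text and kept in a dict keyed by message number, `search(texts, "dividend")` would return the numbers of the messages that mention the term. For a few thousand messages a plain substring scan like this should be quick enough.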
posted by bhance at 6:47 AM on March 10, 2006