creating an uber-archive of twitter updates
January 25, 2009 9:26 AM

I am trying to scrape all 77 pages of my twitter updates and put them on a single page in chronological order, automagically.

I think I can figure out how to scrape all of the pages using curl, but I'm not so sure how to get them in one place, and was wondering if there is something obvious that I am overlooking that will let me bypass the tedium of doing this manually, or (heaven forbid) learning a scripting language.

Is there an RSS reader that will capture all 1,000 plus twitters to a single page? Or some other clever method?
posted by mecran01 to Computers & Internet (13 answers total) 10 users marked this as a favorite
Oh, I just saw someone who did this with a program. She put it all up and fed it to her website in lieu of a blog. It does exist.
posted by typewriter at 9:40 AM on January 25, 2009

This one is not the one I saw, but does something similar.
posted by typewriter at 9:56 AM on January 25, 2009

Is there an RSS reader that will capture all 1,000 plus twitters to a single page?

No, because the reader can only read what's in the file, and the RSS isn't going to hold every update ever... probably only the last 10-20.

That said, Twitter has a decent API, so there is probably some software out there that will let you do what you're looking for. You might see if tweetdeck has that feature -- it has a bunch.

Or if you're not opposed to running a Python script, check this out.
posted by toomuchpete at 10:29 AM on January 25, 2009

You can download your Twitter Message archive here
posted by smoothhickory at 10:33 AM on January 25, 2009 [2 favorites]

smoothhickory with the win! Awesome!
posted by fenriq at 11:45 AM on January 25, 2009

I have been trying smoothhickory's link to Tweetscan but it keeps timing out. I'll keep trying--thanks!
posted by mecran01 at 4:22 PM on January 25, 2009

I tried the python script, but it wanted "BeautifulSoup" which I downloaded, then a host of other errors sprung up and then I remembered why I hate scripts.
posted by mecran01 at 6:14 PM on January 25, 2009

Tweetscan is completely down. Here is my current plan:

1. Use curl to grab every single page.

2. Use TextWrangler to copy lines containing [unique tweet identifier code]. TextWrangler will conveniently search through every file in a folder.

3. Figure out how to sort them chronologically.

4. Profit!
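For what it's worth, steps 1 and 2 could be sketched as a few lines of Python instead of curl plus a text editor. The URL pattern (twitter.com/&lt;user&gt;?page=N) and the page count are assumptions about how the 2009-era public profile pages were paginated, so adjust as needed:

```python
# Rough sketch of the fetch step: download every public timeline page.
# The ?page=N URL pattern is an assumption about 2009-era profile pages.
import urllib.request

def page_url(user, page):
    # Public profiles were (assumed to be) paginated with ?page=N.
    return "http://twitter.com/%s?page=%d" % (user, page)

def fetch_all(user, last_page, out_dir="."):
    # Grab every page with a plain HTTP GET and save it to disk,
    # zero-padded so the saved files sort in page order.
    for n in range(1, last_page + 1):
        html = urllib.request.urlopen(page_url(user, n)).read()
        with open("%s/page%03d.html" % (out_dir, n), "wb") as f:
            f.write(html)
```

Calling `fetch_all("mecran01", 77)` would save page001.html through page077.html into the current directory.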

Actually, this is going to be my yearly Christmas card. I've given myself until the end of January.
posted by mecran01 at 6:20 PM on January 25, 2009

That's not a bad way to do it... take all of the pages from: to whatever the last page is, then pull out all of the text between

<span class="entry-content"> and the next </span>

That will just give you body text, not timestamps, though. If you need those, you'll want the stuff between <td class="status-body"> and </td>
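Those two delimiters are enough for a scripted version of the extraction; here is a rough Python sketch. The class names are taken from this comment, and the real 2009 markup may differ:

```python
# Sketch: pull each status-body cell (timestamp included) out of a
# saved page, then strip the remaining markup. Class names are taken
# from the comment above and may not match the real markup exactly.
import re

STATUS = re.compile(r'<td class="status-body">(.*?)</td>', re.DOTALL)
TAGS = re.compile(r"<[^>]+>")

def statuses(html):
    # Find every status-body cell, drop the tags, collapse whitespace.
    out = []
    for cell in STATUS.findall(html):
        text = TAGS.sub("", cell)
        out.append(" ".join(text.split()))
    return out
```

Run it over each saved file, e.g. `statuses(open("page001.html").read())`, and concatenate the results.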
posted by toomuchpete at 8:56 PM on January 25, 2009

Ah thanks. Then I can convert it to tabular data and sort it in a spreadsheet, then export it again. Excellent.
posted by mecran01 at 10:20 PM on January 25, 2009

I just discovered this archiving service, and unlike the others mentioned it actually works!

This saves your tweets and/or replies, or those of friends and favorites, as a CSV file.
posted by mecran01 at 7:25 AM on January 31, 2009

Tweetake only archives 1,000 entries, but this app by Johann Burkard will archive everything as an XML file. At that point, if you are a sad humanities major with no coding abilities you can try the following:

1. Open the XML file in a text editor. Tex-Edit for OS X makes adding carriage returns and tabs painless.

2. Each tweet, or entry, is preceded by a tag, so search for that tag and replace it with a carriage return plus the tag, so each entry starts on its own line. This allows each entry to appear as a separate row when imported into Excel.

3. Next, you want to delimit each item within an entry (date, id#, etc.) using a tab. Don't use a comma as the delimiter, because the commas within your text entry will screw this up (I found this out the non-easy way). So search for every ">" character and replace it with ">" + tab.

4. Save the file and then import it into Excel.

5. The entries are in reverse chronological order, so within Excel I numbered each entry (859!) using autofill, then did a reverse sort.

6. I then deleted all the columns for fields I wasn't interested in, leaving behind the time stamp and text of each entry.

At this point I was sort of wishing that I had cracked one of those perl/php/ruby/awk/sed books on my shelf and learned how to write a script.
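For the record, the six manual steps above collapse into a short script. This sketch assumes the archive is XML with &lt;status&gt; entries containing &lt;created_at&gt; and &lt;text&gt; children; those element names are guesses, so adjust them to match the actual file:

```python
# Sketch of the XML-to-spreadsheet steps. Element names (<status>,
# <created_at>, <text>) are assumptions about the archive's format.
import xml.etree.ElementTree as ET

def to_rows(xml_text):
    # One (timestamp, text) pair per entry. The archive lists entries
    # newest-first, so reversing gives chronological order (step 5).
    root = ET.fromstring(xml_text)
    rows = [(s.findtext("created_at", ""), s.findtext("text", ""))
            for s in root.iter("status")]
    rows.reverse()
    return rows

def to_tsv(xml_text):
    # Tab-delimited, so commas inside the tweet text can't split
    # fields -- the same pitfall noted in step 3 above.
    return "\n".join("\t".join(row) for row in to_rows(xml_text))
```

The TSV output pastes straight into Excel, already sorted, with only the timestamp and text columns kept.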

posted by mecran01 at 4:01 PM on February 3, 2009
