Advanced Word search/replace to clean up a big list.
July 1, 2019 3:18 AM   Subscribe

Hello proficient Word users! I've got a biiiiig list of instagram comments from a giveway we're running. My end goal is to create one spreadsheet containing all the instagram handles of all the users who commented and all the users that were tagged in the comments. I have thousands of entries, I'm hoping there's an easy way to automate that?

This is what a sample data copy/past from the instagram comments looks like:

Insta_user_1 @tagged_user2
Jun 27

Insta_user_1 @tagged_user3 @tagged_user4
Jun 27

Insta_user_3 @tagged_user4 🥰🎀🍀 succes ❣️🌺🌼🌿
Jun 28

And this is the ideal formatting i'd like to get out of it: (the order in which the instagram users gets sorted it doesn't matter)


@Insta_user_1
@tagged_user2
@Insta_user_1
@tagged_user3
@tagged_user4
@Insta_user_3
@tagged_user4

So mostly what i can't figure out is :

-How to add the @ sign in front of the username that are missing it? that would be the first word on each new line that's not the date.

-How to delete all the rest of the text that is not a username.

Extra bonus points for removing all the duplicates of 1 username.

Thanks!
posted by PardonMyFrench to Computers & Internet (9 answers total) 3 users marked this as a favorite
 
Are you open to non-Word solutions? This is quick work for a Python script or the command line.
posted by hoyland at 3:34 AM on July 1, 2019 [3 favorites]


I wrote a quick Python script that should do it. You need to change the filename to your input file (which needs to be text not Word).

I didn't test this, I just wrote it in Pastebin, nor is it maximally efficient, but it's unlikely that you have so many usernames that that's a problem.
posted by hoyland at 3:42 AM on July 1, 2019 [3 favorites]


I can answer the actual question you asked - how to use Word and Excel to make this cleanup easier. This is the kind of thing I do all the time, although it's usually a one-off, so no scripting necessary. I could write a VBA script for you if you like. But that might be overkill.

So, is this cleanup a one-time thing, or something you'll have to repeat? And what is your expertise with Word/Excel?
posted by SuperSquirrel at 4:36 AM on July 1, 2019


Yeah, this is a one liner for any scripting language. Export the file as text & feed it into a python / awk / sed / pick your poison scripting language. Try hoyland’s script, or roll your own if you’re feeling keen.
posted by pharm at 4:52 AM on July 1, 2019 [2 favorites]


If you use LibreOffice or Open Office instead of Word, they are otherwise very similar, so not much of a learning curve, but the search/replace dialogue allows regular expressions, which will make this a very easy task.
posted by lollusc at 6:22 AM on July 1, 2019


Nthing that Word is the wrong tool for this. It does actually have regular expression capability but to do the job easily requires using another tool to do the deduplication and sorting.

If you have access to an OS X machine, this can be easily done at the command line with any number of built in tools.
posted by Candleman at 6:32 AM on July 1, 2019


Best answer: I would try pasting this into Excel. That way, you can filter out the Dates and delete those lines. Next, use Text to Columns to divide the rows using Space as a delimiter. Then you can sort by the usernames and the ones that don't have the initial @ symbol will sort to the bottom. Use a CONCATENATE formula to add the @ symbol. Copy and Paste Special Values on the formula cells and paste them over the original cells.
Then use the Remove Duplicates function to get only unique entries.
posted by soelo at 7:54 AM on July 1, 2019 [5 favorites]


Best answer: soelo's almost got it except that long comments with usernames in the middle will get unwieldy. For example, "insta_user1 omg so excited! @tagged_user4 @tagged_user5 check this out!" would turn into nine columns and you'd have to go through each column individually to filter out usernames.

Here's what I would do:
1. Paste the whole list into Excel. Filter out and delete the dates and blanks.
2. Use Text to Columns to divide the comment using @ as the delimiter. This will split the comment into as many columns as there are usernames, with the username as the first word in each cell. The example above would get split into three columns: [insta_user1 omg so excited!] [tagged_user4] [tagged_user5 check this out!]
3. Combine all of the separate columns into a single long column:
 [insta_user1 omg so excited!]
 [tagged_user4]
 [tagged_user5 check this out!]
This is easy now because there are fewer columns and you don't have to do any filtering or sorting to them.
4. Use Text to Columns again, this time using Space as the delimiter. Long comments will still turn into lots of columns, but now all of your usernames are in the first column so you can just ignore everything else.
5. Use CONCATENATE to put the @ back in. Copy and Paste Values to get a straight text (no formulas) list of usernames, then use Remove Duplicates to get unique entries.
posted by yeahlikethat at 8:53 AM on July 1, 2019


Response by poster: Thanks!
I ended up going the excel version because i was a bit intimidated by the python process, and it worked a treat.
Super grateful for this and i hope it also helps out the next poor soul who organises a big instagram giveaway :-)
posted by PardonMyFrench at 9:08 AM on July 1, 2019


« Older Source of quote about agnosticism   |   Simple prepaid phone that works on Verizon? Newer »
This thread is closed to new comments.