Reorganizing Excel data
January 6, 2017 2:37 PM   Subscribe

I have two related but slightly different questions about how to rearrange Excel data so that I can analyze it properly. I've been struggling with these problems for a while, and finally just gave in and am throwing myself on the mercy of MeFites.

First, I have a spreadsheet of manually gathered data about Instagram posts. Each row in this document represents an individual image on Instagram, and includes columns for the image ID, date, account username, caption, etc. It also has a series of columns for hashtags, one per field (hashtag 1, hashtag 2, and so on). There are over 30 of these columns, since a lot of the images had numerous hashtags. Of course, the order they are used in varies, and some images only have one hashtag while others have many.

I want to carry out some network and other analyses based on the hashtags, but this layout is problematic since I can't do any useful comparisons with the data spread out like this. It would work better if each image were represented by multiple lines having the same Image ID#, username, and date, but with just one hashtag per row. Some images would therefore have a single line in the spreadsheet, and others would have over 30, depending on the hashtags.

I also have a set of spreadsheets downloaded from Picodash, which recently started offering this feature (a godsend!). These also reserve one line for each image. They don't have the problem of the multiple hashtag columns, but instead have all hashtags in a single text column, separated by commas (e.g., "Hashtags: one, two, three, etc."). I still need to pull out and separate these tags so that I can conduct analyses as above.

At this point I've amassed over 2,000 images (and counting), and reformatting the data by hand is a daunting and very time-consuming endeavor. Is there any way to automate it? I don't have any programming skills or access to people who do, so I'd be grateful for any very basic explanations or pointers to tutorials. I've tried searching, but can't seem to come up with the magic string that will give me any relevant answers. Help!
posted by Superplin to Computers & Internet (28 answers total) 3 users marked this as a favorite
 
For the hashtags in a single text column, you need Text to Columns. I'm mulling over what I might do for the rest of it. I bet some other MeFite will know!
posted by purple_bird at 2:45 PM on January 6, 2017 [1 favorite]


You could use the Transpose Columns > Rows menu function for this I think? If I've understood what you're trying to do. This would let you take a bunch of stuff currently in columns and move it over into consistent rows.

And for the cells where you have multiple bits of data separated by comments, you can use the Convert Text to Columns wizard, then use the commas as your delimiter.

That help?
posted by Happy Dave at 2:46 PM on January 6, 2017


Re. the first thing, you're trying to reshape from wide to long. Here's an add-in tool that purports to do that, and a quick how-to video.
posted by theodolite at 2:48 PM on January 6, 2017 [1 favorite]


Just to clarify, without threadsitting: Transposing text to columns is great, except it leads me back to the same problem as with the first spreadsheet. It leaves me with a bunch of columns of hashtags in no particular order, that I can't really compare with one another in any meaningful way.

Transposing columns to rows would also not solve the problem. To illustrate, what I have now is like this:

Image ID . . . Username . . . Date . . . Hashtag 1 . . . Hashtag 2 . . . Hashtag 3 (etc.)
1 . . . . . . . . . . Barglefast . . . 9/28 . . . #stuff . . . . . . . #things
2 . . . . . . . . . . Gobbler . . . . . 9/28 . . . #things . . . . . . #purple . . . . . #sunnyday
3 . . . . . . . . . . Swiftkick . . . . 9/28 . . . #orange

What I need would be more like this:

Image ID . . . Username . . . Date . . . Hashtag 1
1 . . . . . . . . . . Barglefast . . . 9/28 . . . #stuff
1 . . . . . . . . . . Barglefast . . . 9/28 . . . #things
2 . . . . . . . . . . Gobbler . . . . . 9/28 . . . #things
2 . . . . . . . . . . Gobbler . . . . . 9/28 . . . #purple
2 . . . . . . . . . . Gobbler . . . . . 9/28 . . . #sunnyday
3 . . . . . . . . . . Swiftkick . . . . 9/28 . . . #orange

I hope that makes more sense.
posted by Superplin at 3:15 PM on January 6, 2017


Oh, possible complicating factor: I'm working on a Mac.
posted by Superplin at 3:16 PM on January 6, 2017


I have a feeling the answer might be a relational database, but I haven't done enough with them to help you make that transition. Sorry.
posted by Apoch at 3:24 PM on January 6, 2017


This is known as reshaping data from 'wide' format to 'long' format. I know how to do it in R, but I don't know if it's possible to do easily in Excel.

I would encourage you to learn R or Python if you are interested in doing these sorts of data manipulations and analyses.
posted by demiurge at 3:27 PM on January 6, 2017 [5 favorites]


A really inelegant way of doing it would be old fashioned copy and paste.

Copy and paste Image ID: (through) Hashtag 1 into a new file
Copy and paste Image ID:Hashtag 2 (not pasting Hashtag 1) after your first copy and paste. There may be some entries without a hashtag at this point. Don't worry about those for now
Copy and paste Image ID:Hashtag 3 (not pasting Hashtag 1 or 2) after your previous pasted section. There will be a lot of entries without a hashtag now.

Sort the hashtags. Get rid of any entries that are blank.
posted by troytroy at 3:42 PM on January 6, 2017


Ah! Do you need to know which hashtags were listed first, second, third, etc.? If so, I think I agree that you need a relational database, not a spreadsheet.

Otherwise, I think what people are saying is you'll need both tools to deal with your spreadsheets. For the spreadsheet where each hashtag is already in a separate row, it looks like the reshaping tool posted by theodolite would work.

For the spreadsheet from Picodash, use Text to Columns, then repeat the steps for reshaping.
posted by purple_bird at 3:44 PM on January 6, 2017


I cobbled together something using approximate Vlookups and some dummy variables. This is assuming that you're already in the situation you illustrated in your followup.

Step 1: Add two fields to your existing data. One field counts the number of hashtags the image has. You can do this with the COUNTA() function, which counts the number of non-empty cells in a given range. The other field will be a running total of the total number of hashtags. Set this to 0 for the first image, and for subsequent images, calculate it as the previous row's running total + the previous row's hashtag count.

Hashtag Ct . . . Running Total . . . Image ID . . . Username . . . Date . . . Hashtag 1 . . . Hashtag 2 . . . Hashtag 3 (etc.)
2 . . . . . . . . . . 0 . . . . . . . . . . . 1 . . . . . . . . . . Barglefast . . . 9/28 . . . #stuff . . . . . . . #things
3 . . . . . . . . . . 2 . . . . . . . . . . . 2 . . . . . . . . . . Gobbler . . . . . 9/28 . . . #things . . . . . . #purple . . . . . #sunnyday
1 . . . . . . . . . . 5 . . . . . . . . . . . 3 . . . . . . . . . . Swiftkick . . . . 9/28 . . . #orange
posted by bassooner at 3:48 PM on January 6, 2017 [1 favorite]


OK, here's what I would do to get the hashtag data to a tall format. May not be the easiest way, but it would work. (demiurge is right that this is something there are a lot of tools for in R)

1. Make a new sheet with a column for image IDs, a column for "hashtag count" and a column for hashtags. You want this sheet to have as many lines per image ID as is the max number of hashtags an image has. (I'll go with 30, the number you gave)
2. So you want 2000 sets of 30, or 60,000 rows. There's probably a better way to do this but you could autofill a series from 1 to 60,000 and then make a formula that adjusts values above 30 down via something like
[Num]-30*ROUNDDOWN([Num]/30,0)
3. To get the hashtag value, do a vlookup where the lookup value is the image ID and the column index number is the hashtag count (plus an adjustment for wherever your hashtags start in terms of columns away from the image ID in your data). You'll get a bunch of errors in cases where the image doesn't have a hashtag in that column, but those are easy to filter out. (It would be cleaner to use bassoner's counts to only make the needed number of rows for each image but I think you'd need VBA to do that.)
posted by yarrow at 4:05 PM on January 6, 2017


Oh, and you can then vlookup any of the other image info you want into that new sheet using the image ID.
posted by yarrow at 4:08 PM on January 6, 2017


Sorry, accidentally hit post too early.

Step 2 is to go to a new sheet and create a column filled with an increasing dummy variable (0, 1, 2, 3, ...). You can use this variable to do a VLOOKUP on your original data sheet to grab ID, Username, and Date, using the running total as the lookup column. The trick is make sure that the [range_lookup] option in the VLOOKUP is set to TRUE. This ensures that when there's not an exact match on the VLOOKUP, the VLOOKUP will return values from the image with Running Total closest to the dummy variable without going over. For dummy variable values 0 and 1, this is the first image; for dummy variable values 2,3, and 4, this is the second image; etc.

You can get the hashtags with similar approximate VLOOKUPs. The tricky part there is knowing how many columns to go over (aka, what hashtag to grab). You can calculate this as Hashtag Number = Dummy Variable - Approximate VLOOKUP spitting back the first column (which is just the running total) + 1. Then to get the hashtag, do one final VLOOKUP, with the column index being the hashtag number you just calculated + whatever constant you need to skip past the extra fields/ID/username/date and get to the hashtag columns.

(On preview: I'd go with yarrow's suggestion if you're ok with getting error rows and filtering/deleting them, as it's easier to implement.)
posted by bassooner at 4:09 PM on January 6, 2017 [3 favorites]


This all sounds like a nightmare to do in Excel. Do you have a example data file? Maybe 10 rows in csv format? I think a short R script would be able to do the trick.
posted by demiurge at 4:15 PM on January 6, 2017


Oh, another way to get your 60,000 rows would be to start with a 1 in row 1 and then autofill the formula
=IF(A1=30,1,A1+1)

Or really you'd start at A2 since A1 would be the header, but you see what I mean.
posted by yarrow at 4:25 PM on January 6, 2017


You guys are fantastic. Learning R is on my to do list, but I'm not sure I can do it quickly enough to apply it in the short term.

I will play around with some of the suggestions here and see what I can get to work. In the meantime, here's a sample data file of the Picodash downloads. I have a bunch of these, and most are larger.
posted by Superplin at 4:41 PM on January 6, 2017


Forgot to answer purple_bird's question: The order of the hashtags is irrelevant.
posted by Superplin at 5:05 PM on January 6, 2017


So, stupid question. You do know you can sort all rows based on values in any specific column, yes? So you can sort on column 1 & get all the 1's together followed by the 2's, etc. You can even sort on multiple columns at once.
posted by scalefree at 5:34 PM on January 6, 2017


This would be a sub half hour task in Python as well, using the absolutely fantastic tablib library.
posted by rockindata at 5:45 PM on January 6, 2017


No programming: copy your sheet 30 times, reserve one unique hastag column for each sheet and delete the other 29 columns on that shet, mash the sheets back together, and delete any rows your hashtag cells are empty.
posted by glibhamdreck at 6:00 PM on January 6, 2017


I wrote a R script that does what you want, output is here, script is here, with bonus word cloud.
posted by demiurge at 6:29 PM on January 6, 2017 [4 favorites]


Because we can't have an R solution without a python solution. You will need to install pandas (hopefully as easy as `pip install pandas` at the command line if you are on mac), then `python nameofthisastextfile.py`:

import pandas as pd
z = pd.read_csv('picodash_instagram_nra_blog_2016-12-14.csv') #read in data

hashtags = z.Hashtags.str.split(",").apply(pd.Series, 1).stack() #split, and stack

hashtags.index = hashtags.index.droplevel(-1) #relevel to remove index of place of tag
hashtags.name = 'Hashtags' #name
del z['Hashtags'] #remove original
z = z.join(hashtags) #join
z.to_csv('longout.csv', index = False) #write csv
posted by scodger at 7:46 PM on January 6, 2017 [3 favorites]


You can restructure your data using PivotTable in Excel. Here are the steps:

1. Prep your data. Combine all the non-hashtag data into a single ID column, like this:

ID; Hashtag 1; Hashtag 2; Hashtag 3
1|Barglefest|9/28; stuff; things;
2|Gobbler|9/28; things; purple; sunnyday
3|Swiftkick|9/28; orange; ;

2. Open the PivotTable and PivotChart Wizard, select "Multiple consolidation ranges," and click "Next." You will see the question "How many page fields do you want?"; select "Create a single page field for me" and click "Next." You will then see "Where are the worksheet ranges that you want to consolidate?" Select the ID and hashtag columns from Step 1 and click "Finish."

3. You should now have a PivotTable with IDs from Step 1 as the Row Labels and hashtag1, hashtag2, etc. as the Column Labels. Right-click on the very last cell of the PivotTable (where the Grand Totals intersect) and select "Show Details."

4. You should now have a new worksheet with a table containing the restructured data. You can delete the columns you don't need, and use Text to Columns to convert the IDs from Step 1 back into three columns.

PM me if you need to.
posted by jcatus at 8:22 PM on January 6, 2017 [4 favorites]


You can't have an R solution without TWO python solutions. The code is here in this gist, (as well as below) and the data is here.

To use this, download the hashtag_stack.py file from the github gist site linked abot and save it in a folder along with the csv you linked to earlier. Open a finder window with that folder visible. Then open a terminal window, and type
pip install tablib
and hit enter . This will install the tablib library on your system python. Not perfect, but it won't hurt anything.
then type "cd" plus a space and then drag the folder where your files are from the finder into the terminal window. This will automagically add the path to the terminal, and hit enter. Then you can type
python hashtag_stack.py
and a new file, called hashtag_data.xlsx, should appear in the folder along with the source data and the python file.
import tablib
from copy import deepcopy

# pull the data in, using tablib, a really great library for messing around with tabular data
# more details about tablib here: http://docs.python-tablib.org/en/latest/).
# For serious data analysis, you are going to need to mess around with pandas http://pandas.pydata.org/
data = tablib.Dataset().load(open('picodash_instagram_nra_blog_2016-12-14.csv').read())

# make the headers for the new dataset, there has to be a more elegant way to do this, but this works
headers = data.headers
# There is going to be a new header, "hashtag", so lets add it to the headers list
headers.append(u'hashtag')

# The new dataset we are building, with the headers set
stacked_data = tablib.Dataset(headers=headers)

# chug through the rows of the dataset
for row in data.dict:
    # split the data
    for hashtag in row['Hashtags'].split(','):
        newrow = deepcopy(row)
        # add the specific hashtag
        newrow.update({'hashtag': hashtag})
        values = tuple(newrow.values())
        # add the new row to the dataset
        stacked_data.append(values)

# export the dataset as a new excel document
with open('hashtag_data.xlsx', 'wb') as f:
    f.write(stacked_data.xlsx)

The main difference between what I did and what scodger did is that he used the go-to data analysis package in Python, Pandas, while I used a narrower library tablib, that is just about manipulating tabular datasets. Pandas is super powerful, and you can do just about anything with it- if you are diving into python-based data analysis, it is an amazing tool. I haven't used Pandas much lately, so I didn't want to mess around with it.
posted by rockindata at 8:24 PM on January 6, 2017 [2 favorites]


Oh, in terms of vocab, "reshape data wide to long" would get you started on googling. In the R world there are a bunch of really useful packages organized around the concept of "tidy" data which is getting at the same thing - see this quasi-manifesto from the author of most of those packages: http://vita.had.co.nz/papers/tidy-data.pdf . If you do start learning R it's worth familiarizing yourself with those "tidyverse" packages early; they simplify a lot of things that are more of a pain in base R.
posted by yarrow at 5:41 AM on January 7, 2017 [1 favorite]


Or download KNIME and use an excel reader node to read in the data, followed by an unpivoting node to formed long vs wide and then there are a huge number of stastics nodes to do whatever stastics you want. As a bonus it also is integrated into the twitter api somehow so you could design a query and be able to run it at different timepoints to generate report without having to redo all of the manual data gathering.
posted by koolkat at 6:11 AM on January 7, 2017


Wow, what an amazing treasure trove to wake up to! I can't wait to play around with these various solutions.

I can't tell you how much I appreciate your generosity, especially those of you who posted scripts and detailed instructions. Thank you all so much.
posted by Superplin at 6:29 AM on January 7, 2017 [1 favorite]


speaking of tidyverse, here's a solution using tidyr and dplyr in r:


library(dplyr)
library(tidyr)
z = read.csv('/path/to/file/picodash_instagram_nra_blog_2016-12-14.csv')
z = z %>% mutate(Hashtags = strsplit(as.character(Hashtags), ",")) %>%
unnest(Hashtags)
write.csv(z, 'out.csv')

posted by scodger at 10:07 AM on January 7, 2017 [1 favorite]


« Older Baseboard heater melted a piece of plastic and now...   |   Obtaining a restraining order in Massachusetts Newer »
This thread is closed to new comments.