Tags:

Recovering lost data due to text-formatting?
August 10, 2011 6:30 AM   Subscribe

Asking for a friend: I was entering some data into excel and I realized it was changing gene names into dates. E.g. there is a gene in my dataset called SEPT5 and Excel automatically makes this into Sep-05. If I manually set the column as text, it inexplicably changes the cell to a 5-digit number that represents that date, in this case 38596.

I would simply discontinue using Excel, but my collaborators will not.

Any suggestions for recovering the gene names that were changed? Googling the problem presents a lot of options, but most of them are "format the cell beforehand" and that's not an answer to my problem.

(And it actually changes the DATA not just what is displayed)
posted by empath to computers & internet (21 answers total) 3 users marked this as a favorite
 
(also is there a way to just disable the autoformatting entirely?)
posted by empath at 6:30 AM on August 10, 2011


It is formatting it as a date: go into Format (which menu it is depends on which version of Excel) and change it from Date to Number, with 0 decimal places.
posted by ricochet biscuit at 6:34 AM on August 10, 2011


Enter a single quote as the first character in the cell. That's the "literal" operator in Excel.
posted by thelonius at 6:39 AM on August 10, 2011 [2 favorites]


I have not used Excel in a long time - and this was one "feature" that really bugged me. If I recall correctly, write an apostrophe ' as the first character that you want Excel to leave as-is. Thus:
'SEPT5
posted by aroberge at 6:39 AM on August 10, 2011


What you're finding is that once Excel interprets the input as a date, it actually changes the data such that changing the formatting does not bring back the original input. What you need to is change the format before you input the data. Which is what Google is telling you.

If you're trying to get that data back, Excel supports Find/Replace functionality. Just do ctrl+f, find 38596, and replace it with SEPT5.
posted by valkyryn at 6:39 AM on August 10, 2011 [3 favorites]


You can also type an apostrophe ( ' ) in front of the string to force it to display as you typed it. The ' will not show in the cell once you hit Enter.
posted by The Deej at 6:40 AM on August 10, 2011


ricochet Biscuit tells you how to format the cell beforehand. Once you've done that for the entire errant column (you can do that for all the cells in a column by hilighting the column and formatting it as he describes), then you can just do a search-and-replace for the data you want to recover ("replace '38596' with 'SEPT5'"); to find the search-and-replace, highlight the cell in question, go to the "Edit" menu, and scroll down to "replace". Enter the search string and the replace string in the respective boxes. Then hit "replace" and it will bring you to each instance of "38596" and ask if that's something you want to replace. (If that's too time-consuming, go to "replace all" in the dialogue box instead and it'll just do it all.)

and the way to turn off the autoformatting is to highlight the target cell(s) in question, go to the "format" menu, and scroll down to "text". But you knew that. the trick is to do all of the ones you're going to use instead (but you knew that too).
posted by EmpressCallipygos at 6:41 AM on August 10, 2011


I've run across the same problem before, and yeah, change the format BEFORE you input the data.

But if it's too late for that, there's a couple things you could do. Set up a new column using an "IF...THEN" function. I can't remember the exact format, but basically "If Column A = Sept-05, then Column B = SEPT5".

Or export it to a text file and do a search and replace.
posted by mikeand1 at 6:42 AM on August 10, 2011


If you're trying to get that data back, Excel supports Find/Replace functionality. Just do ctrl+f, find 38596, and replace it with SEPT5

This doesn't really help if it changed more than one name. Is there a formula that will reconvert the numbers back to the original text?
posted by empath at 6:44 AM on August 10, 2011


This may be a bit convoluted, but here is what I'd do:

1. Format all the numbers back into dates in that Month-Day format
2. Copy the ENTIRE column of all genes and paste into something like notepad
3. Find and replace all, for instance, "Sep-" with "SEPT5" and so on for each 'month'
4. Reformat the original column as text
5. Copy and paste all the data in notepad back into the column you original copied the data from

I tried it with various SEPTx values and it seems to work.
posted by Green With You at 6:50 AM on August 10, 2011


This doesn't really help if it changed more than one name. Is there a formula that will reconvert the numbers back to the original text?

Not that I know of; I'm afraid this may be a time-consuming matter of doing a search-and-replace for each separate name.
posted by EmpressCallipygos at 6:52 AM on August 10, 2011


Oops, step 3 should say:
3. Find and replace all, for instance "Sep-" with "SEPT" and so on for each month.
posted by Green With You at 6:53 AM on August 10, 2011


This is a known problem with using excel for storing gene names; there was at least one news story about how something on the order of a year's work was lost due to this kind of auto formatting.

A publication in biomed central (http://www.biomedcentral.com/1471-2105/5/80) looks like it has some tips and tricks for avoiding this kind of data loss (there are more gotchas that may be lurking in Excel to nom your data).
posted by VeritableSaintOfBrevity at 6:57 AM on August 10, 2011


There's not really an inverse; there are many potential names that excel changes to 38596. The best that I can come up with is to take all the names you have, put them in excel next to the literal (pre-formated) version, then copy as value, paste to another column, and filter on equality of the two. Then you have a list of all the stupid excel transformations; if there are no collisions you can use that as a substitution rule to get your original data back. That substitution is easy in many tools.
posted by a robot made out of meat at 7:21 AM on August 10, 2011 [1 favorite]


question: If the ' means make this input literal, how does one do that when importing mass quantities of data, CSV, Tabbed, or otherwise?
posted by Gungho at 7:44 AM on August 10, 2011


iirc, Excel internally represents Dec 31 1899 by the number 1, Jan 1 1900 by the number 2, and so on. 38596 is Sept 1, 2005 - are you sure that is what 'SEPT5' was converted to? Maybe the auto-formating evil thinks 'SEPT5' means the first day of September, in 2005?

The VBA function CDate() will give the date value for a long integer. CLong() will give the long integer value for a date. I don't know if there are Excel formula functions that do that kind of type conversion.

If the errors in your data are all like this - gene name turned to date and then to the number representation of that date - then, it should be possible to recover the original values, I think.
posted by thelonius at 8:00 AM on August 10, 2011


Maybe the auto-formating evil thinks 'SEPT5' means the first day of September, in 2005?

Sept. 5 of the current year, I think.
posted by aught at 8:08 AM on August 10, 2011


question: If the ' means make this input literal, how does one do that when importing mass quantities of data, CSV, Tabbed, or otherwise?

Use a text editor with the ability to run macros to prepend a ' on each line in the file before importing it.


The question I have is, why Excel instead of an Access database?
posted by MikeWarot at 8:41 AM on August 10, 2011 [1 favorite]


The question I have is, why Excel instead of an Access database?

Forget it Jake, it's bio-town.*

*I kid; I work with very computationally sophisticated genetics researchers.
posted by a robot made out of meat at 8:47 AM on August 10, 2011 [1 favorite]


question: If the ' means make this input literal, how does one do that when importing mass quantities of data, CSV, Tabbed, or otherwise?

You can set individual columns' formatting as you import, I think.
posted by BungaDunga at 12:40 PM on August 10, 2011


Gungho: "question: If the ' means make this input literal, how does one do that when importing mass quantities of data, CSV, Tabbed, or otherwise"

When you import data into Excel you can specify the formatting for each column.
posted by turkeyphant at 6:20 AM on August 11, 2011


« Older The Mormons have put images of...   |  We just got a 4 month old Shet... Newer »
This thread is closed to new comments.