Excel query
June 13, 2019 4:14 AM   Subscribe

Trying to attach a date and ID to multiple rows of data - do I need INDEX MATCH or some kind of multiple VLOOKUP or what?

I have an Excel file which looks like this:

PersonID, Date, ,,, [many other fields]
111, Date1, ,,,
111,,,,
111,,,,
222,,,,
222, Date2, ,,,
222,,,,
222,,,,
333, Date3, ,,,
333,,,,
333,,,,
333,,,,
333,,,,
333, Date4, ,,,
333,,,,

It was an XML export originally. I want to put the date against every related row and also add a unique reference number for that personID and date combination, ending up like this:

PersonID, Date, ,,, [many other fields]
111, Date1, ,,,UniqueID1
111, Date1, ,,,UniqueID1
111, Date1, ,,,UniqueID1
222, Date2, ,,,UniqueID2
222, Date2, ,,,UniqueID2
222, Date2, ,,,UniqueID2
222, Date2, ,,,UniqueID2
333, Date3, ,,,UniqueID3
333, Date3, ,,,UniqueID3
333, Date3, ,,,UniqueID3
333, Date3, ,,,UniqueID3
333, Date3, ,,,UniqueID3
333, Date4, ,,,UniqueID4
333, Date4, ,,,UniqueID4

The date that should go against each row is the date on the first or second row of the group of rows with that PersonID, unless there is a second date (as in record 333 above). So I have at least one and potentially several activity dates for each PersonID. If the first relevant date was always in the first row after change of PersonID, I could do this, but as it is sometimes in the second row and possibly subsequent rows, I cannot work it out. I cannot sort the data so that the first date is always in the first row, as the data in subsequent rows will get muddled - it is only clear which activity they relate to by their place in the dataset. Help!

PS I do also have MS Access but can't see a way to do it that way either.
posted by sock of ages to Computers & Internet (18 answers total)
 
Here's a link that I think will help -->
https://www.ablebits.com/office-addins-blog/2014/05/02/fill-blanks-excel/

Basically select all cells in the Date column, including the blank ones. Then go to Home->Editing->Find & Select-> Go To Special -> select "blanks". Then hit equals sign, arrow to the cell directly above, then hit Ctrl + Enter. All the blank data should then be filled in with the text directly above it. You can copy paste values so that the formula gets removed.
posted by watrlily at 4:29 AM on June 13, 2019


Response by poster: Hi, thanks for responding. The problem is that in some cases (my PersonID 222 in the first dataset above) the date is in the second or subsequent row. So filling in the value above for blanks does not work.
posted by sock of ages at 4:33 AM on June 13, 2019


I would use the auto filter and then filter by date column to suppress all lines where the date is blank - by your data then there would be a list of person IDs and dates that should be unique (ie no person ID in there twice). Then i'd make a new tab with this list, this is now your lookup list. Remove filtering.

After that, use vlookup on the person ID to fill the date for all lines. I'd probably make a new column then you can quality control by a quick visual inspection to make sure the dates look right. Then another new column using concatenate to mash the ID and date together.
posted by london explorer girl at 4:44 AM on June 13, 2019 [2 favorites]


The problem is that in some cases (my PersonID 222 in the first dataset above) the date is in the second or subsequent row. So filling in the value above for blanks does not work.

How sure are you, in that case, that the third last line needs to be filled in with Date3 as opposed to Date4? By what criterion is that decision made?

Also, what are your chances of gaining access to the original XML file? It may well be easier to generate a clean and complete export from than than to fix up the wonky one you're working with at present.
posted by flabdablet at 4:54 AM on June 13, 2019


I came in here to say what london explorer girl did.
posted by moiraine at 5:10 AM on June 13, 2019


Response by poster: Sorry, london explorer girl, there are data on every line, just in different fields - the export from XML has somehow stepped the data, so it looks

PersonID, Date, ,,, [many other fields]
111, Date1, ,,,
111,[date field blank],[some other value in this field],[remaining fields blank],
111,[date field blank],[this field blank],[some value here],[remaining fields blank]

So apart from PersonID, there is only one field with data in it in each row, but that field moves to the right each time, if that makes any sense at all. I'd lose data with what you suggest.

I do have the original XML file, flabdablet - how would I go about getting a better export from it? I exported into Excel which I'm sure is the cause of these issues.
posted by sock of ages at 5:13 AM on June 13, 2019


Response by poster: And yes, I am reasonably sure (have tested extensively against the actual data) that in cases where there is more than one dated activity, the rows following the second date relate to the second date rather than the first.
posted by sock of ages at 5:15 AM on June 13, 2019


Maybe I'm missing something here, but is your date field always in the second column, blank or not blank? I don't see how you would lose data using the method london explorer girl suggests.

You are importing into Excel as comma-delimited columns, right?
posted by moiraine at 5:42 AM on June 13, 2019


Response by poster: Hi moraine, yes, date field is always in the same column, but there are other columns of data further to the right. Date is not in all cells of the column so if I cut blanks I would lose rows where date is blank but there is other data in other columns. It's an output from a relational database with a whole raft of activity and associated data.

I am using Data from Other Sources - XML to import into Excel.
posted by sock of ages at 5:51 AM on June 13, 2019


If you'd be willing to email me a suitably redacted version of that XML file, I'd be happy to have a play and see whether something both useful and easily comprehensible can be achieved with PowerShell. Email address in profile.

Otherwise: when I'm doing this kind of data cleanup work, I like to keep a sheet in the workbook that has only the original imported data, then build secondary sheets for the manipulated versions of it.

So assuming your original import is on Sheet1, the first thing I'd do is create Sheet2, put =Sheet1!A1 in cell A1, and then fill across and down to whatever extent is needed to cover the biggest anticipated import. The result should be that Sheet2 looks the same as Sheet1. If it doesn't, all it should need is a copy and paste-special of only Formats to fix it.

Now we can fill in the missing dates. Paste the following formula into cell B2
=IF(ISBLANK(Sheet1!B2),IF(ISBLANK(Sheet1!A2),"","yow"),Sheet1!B2)
and fill it all the rest of the way down column B.

What this says: If the date for this row on Sheet1 is blank, and the PersonID is also blank, make this cell blank as well (this case handles rows where nothing has been pasted in on the import sheet). If the date for this row on Sheet1 is blank, but the PersonID is not also blank, put in the text "yow". Otherwise the date for this row on Sheet1 is not blank, so use it.

So what you should see is a column that has "yow" wherever a missing date needs to be filled in.

If that works, the next thing is to replace the "yow" with something more useful:
IF(A2=A1,B1,IF(A2=A3,B3,"wah"))
What this says: if the PersonID in this row is the same as that in the previous row, use the previous row's date; otherwise if the PersonID in this row is the same as that in the next row, use the next row's date; otherwise use "wah" (in theory this last should never happen, so therefore you should eyeball it to make sure it hasn't).

So after doing that replacement, the whole formula in Sheet2!B2 should be as follows:
=IF(ISBLANK(Sheet1!B2),IF(ISBLANK(Sheet1!A2),"",IF(A2=A1,B1,IF(A2=A3,B3,"wah"))),Sheet1!B2)
and filling that all the way down column B should get you a column of correctly filled in dates.

Now let's add some IDs. Insert a new column C to the right of column B on Sheet2. Put "ID" in C1, 0 in C2, and the following in C3:
=IF(AND(A3=A2,B3=B2),C2,C2+1)
and fill that all the way down the rest of column C. This will generate an integer ID that increments whenever either column A or column B changes.
posted by flabdablet at 6:26 AM on June 13, 2019


PersonID, Date, ,,, [many other fields]
111, Date1, ,,,
111,[date field blank],[some other value in this field],[remaining fields blank],
111,[date field blank],[this field blank],[some value here],[remaining fields blank]

So apart from PersonID, there is only one field with data in it in each row, but that field moves to the right each time, if that makes any sense at all.


Oh, ouch. Yeah, I can imagine a mechanism by which this would happen, and that imagining has destroyed my confidence in the assertion that the date will always turn up in the first or second row of the block for any given PersonID. If that can't be guaranteed, the logic I just pasted above won't work.

What process was responsible for generating the original XML file? Is there any possibility of going one step further back toward clean data that isn't in XML format?
posted by flabdablet at 7:02 AM on June 13, 2019


but there are other columns of data further to the right. Date is not in all cells of the column so if I cut blanks I would lose rows where date is blank

You don't cut out any rows. You are just copying the rows that do have a date into another tab (keeping them on the first tab as well) and using that tab to look up the date for any rows with missing dates.
posted by soelo at 7:26 AM on June 13, 2019 [2 favorites]


So, will you have uniqueIDs where the date ends up blank?

I think you can copy and paste the person and date fields into a new tab, and then remove duplicates. You'll end up with each unique person/date combination. Assign Unique IDs.

Then you should be able to set up an index/match function (I've been using this website) that will look up the person and date and spit back the Unique ID.
posted by DoubleLune at 7:27 AM on June 13, 2019


I think you are reading this wrong, sock of all ages.

I would use the auto filter and then filter by date column to suppress all lines where the date is blank - by your data then there would be a list of person IDs and dates that should be unique (ie no person ID in there twice). Then i'd make a new tab with this list, this is now your lookup list. Remove filtering.

So what I'm reading from this, is that the "new" list is a lookup list, i.e, you are going to filter, paste this filtered sheet into Sheet 2. Then, you are going go back to Sheet 1 (original) and unfilter. You are going to vlookup in Sheet 1 using the table in Sheet 2, using person ID and date into the empty Date columns. You may need to do some manual edits because vlookup person ID may have several dates attached to it, but hopefully not too many.

On preview, yes to soelo.
posted by moiraine at 7:31 AM on June 13, 2019 [1 favorite]


If you use =IF(A2=A1;B1) in the blanks, it copies the right date into the B column. For anything that doesn't match, it puts a FALSE which you can then manually fix to be the right date.

111 Date1
111 Date1
111 Date1
222 FALSE
222 Date2
222 Date2
222 Date2
333 Date3
333 Date3
333 Date3
333 Date3
333 Date3
333 Date4
333 Date4
posted by Ftsqg at 7:59 AM on June 13, 2019


It would be useful to know whether this is a once-off job or something you're likely to have to do regularly, and also roughly how many rows of imported data we're talking about. If the job's both small and not likely to recur, the error rate from a pointy-clicky solution might actually end up tolerable.
posted by flabdablet at 8:08 AM on June 13, 2019


I do data parsing in VBScript for xml files, so that I can scan for exact tags using regular expressions. xml is a bear for imports because nested tags exist and wreak merry hell on flat files. If this is something that will recur frequently, it may be worth it to write a script to ensure there are no errors.
posted by domo at 9:26 AM on June 13, 2019


It sounds odd when there are two or more dates for a Person ID the second+ date always appears in the first row of data it applies to, and it applies to all subsequent rows until you hit a new PersonID or a new date, when that's not also the case for the first date for that PersonID. It would mean there's a sort order imposed on parts of the data, but not all of it.

I have a solution to populating the date that will work if that is what's happening, but I'd use the results of this to help with testing, I wouldn't trust it to be right unless I could see how that sort order had occurred.

Similar to Ftsqg's solution, but it uses two passes to create a full date field. Assuming PersonID is in column A, Date in column B and row 1 is headers with data starting in row 2:
Insert column C: =if(B2<>"",B2,if(A2=A1,C1,""))
Insert column D: =if(C2<>"",C2,if(A2=A3,C3,""))

Copy those down all the way and column D should now be populated with your dates.

The first pass in column C populates date from either a date in the same row, or if there isn't one, looks for the last date given for that person. When the initial rows for a person don't have a date this remains blank. The second pass in column D deals with these cases by looking ahead to find the next date given for that person.
posted by SometimeNextMonth at 10:40 AM on June 13, 2019


« Older I Once Had a Pair of Flowing, Loose, Silky...   |   What is this sleep disturbance thing? Newer »
This thread is closed to new comments.