Filtering for text uniqueness in Excel
April 27, 2005 12:24 PM   Subscribe

I dumped a week's worth (about 1400 rows) of successful connection data from my company's VPN concentrator logs into a .csv file containing the following columns (A through D): Date, Time, Username, Source IP. I am trying to filter the list in Excel so it only shows the most recent connection data per username, but I can't figure out how to accomplish this programmatically.

I've sorted it by username, then by date, then by time. Visually, I know I only want the last row of data per username, but again, how to get Excel give me just those rows is beyond me. Since I need this done quickly, I have already begun manually deleting duplicate logon rows, keeping only the most recent entry. I know there's gotta be a better way!
posted by pmbuko to Computers & Internet (17 answers total)
 
How many different users do you have? Don't underestimate how much you can do in 15 minutes. If you sort everything by username, then date, then time and delete the duplicates manually, it might be a bit boring to do but I reckon you could do it in 15 minutes tops.

In a dataset of the size you are working with, it's probably going to take longer to find a programmatic solution than it is to do it manually. On the other hand, if you had 10 times as much data, and you need to do this operation loads of times then it would be worth the time investment.
posted by gaby at 12:42 PM on April 27, 2005


Make a new column with cells containing only the date values you want, then use copy/paste special(values) or some sorting to get what you want.

If the date column is A and the user column is C, set the cells in this new column X to something like this
= if($A1=$A2, "",$C1)
Copy it down all records, you might have to monkey with the last row. Then if you sort the entire dataset by this new column X, you bring only the relevant records up to the top, where you can sort just those by username again.
posted by fleacircus at 12:46 PM on April 27, 2005


You could use an Array Formula (sometimes called a CSE formula, because after you finish typing it in, you have to hit CTRL+SHIFT+ENTER instead of just Enter) like this:

=MAX(IF(A1:A1400="Sam",B1:B1400))

Where column A has your usernames, and column B is the login time. I'm assuming the date and time are stored in the same cell.

When you enter it with the right keystroke, it appears in your formula bar like this:

{=MAX(IF(A1:A1400="Sam",B1:B1400))}
posted by odinsdream at 12:47 PM on April 27, 2005


Stupidly I wrote the code for the user column being A and the date column C.
posted by fleacircus at 12:47 PM on April 27, 2005


Assuming the logs are in date order, and the username is in the first column, in PHP, you could do it this way:

// Our output array
$output = array();

// Open up the log file
$file = "logs.csv";
$fh = fopen( $file, "r" );

// Concentrate data down to one per username
while ( $row = fgetcsv( $fh, 102400 ) ) {
$username = $row[0];
$output[ $username ] = $row;
}

// Print out the last row found for each user
foreach ( $output as $username => $data ) {
print join( ",", $data ) . "\n";
}
posted by gaby at 12:48 PM on April 27, 2005


Further, you can extend that by making the formula reference an adjacent cell for the username, instead of having it hard-coded. Then, auto-fill the formula down with your list of unique usernames, and you should have the appropriate times.
posted by odinsdream at 12:48 PM on April 27, 2005


Keep the sort you've got: username, date, time.

Copy and paste so that the user name is the last column. Format the date, time and IP so that none had internal whitespace (you can leave whitespace in the username), so you end up with something like this:
2005-01-01 12:34:56 127.0.0.1 JoeBlow
2000-01-01 12:34:56 127.0.0.1 JoeBlow
2005-01-01 12:34:56 127.0.0.1 Mary Q Public

Export that as text.

Now use the unix uniq command like this:
uniq --skip-fields=3 name.of.text.file.txt > output.file.txt

Import the output file back into excel.

If you don't have uniq, install cygwin and get it.
posted by orthogonality at 12:53 PM on April 27, 2005


- Sort the table in reverse chronological order
- Use a pivot table to get a de-duped list of user names
- For each user name in the pivot table, use that in a VLOOKUP in the original table to get the most recent connection data. Here's the clever bit: run the film backward set the last value in the VLOOKUP function ("range_lookup") to FALSE--this will force it to look for the first exact match in the table to the username. Since you've sorted the connections in reverse chronological order, the first reference in the table for a given username is the most recent.
posted by DevilsAdvocate at 12:57 PM on April 27, 2005


And in Perl (multi lingua day for me today:) ), making the same assumptions as PHP:

my %output;

my $file = "logs.csv";
open( CSV, "<$file" );

while( <CSV> ) {
   chomp( $_ );
   my @row = split( ",", $_ );
   my $username = $row[0];
   $output{ $username } = join( ",", @row );
   }

for my $username ( sort( keys( %output ) ) ) {
   print $output{ $username } . "\n";
   }
posted by gaby at 1:03 PM on April 27, 2005


Once upon a time, I'd do that in awk.
posted by gimonca at 1:05 PM on April 27, 2005


I'm sure there's the command line option for sort that would do the same job. sort -u for starters but I don't know where else to take it.
posted by gaby at 1:08 PM on April 27, 2005


Depending on what you're going to do with it after, sub-totals might be your easiest choice. Sort the data by the username, then add a sub-total to it. That'll bring up the summary bar on the side, which you can collapse to show just the sub-total lines.

This information is of limited usefulness from this point on, but if all you want to do is get a list of when everyone last connected it's a dead simple solution.
posted by jacquilynne at 1:10 PM on April 27, 2005


It seems to me that Access, if you have it available to you, is ideal for this.
posted by kc0dxh at 2:02 PM on April 27, 2005


The PHP and Perl solutions are wrong because they forgot that the username is in column C, not A. Here's a Python 2.4 solution:

import csv, fileinput

usernames = {}
for row in csv.reader(fileinput.input()):
usernames[row[2]] = ",".join(row)

print "\n".join(usernames.itervalues())

posted by grouse at 2:40 PM on April 27, 2005


Of course, the body of the for loop has to be indented. Now the Perl and PHP weenies will taunt me mercifully for my meaningful whitespace biting me in the ass. But at least my program will get the right solution, if you indent it. ;)
posted by grouse at 2:42 PM on April 27, 2005


fleacircus wins, with extra points for simplicity and keeping it entirely within Excel! I'm all for using external tools when they can do the job quicker/better, but since the final product needs to be in Excel anyway (to send on to the higher-ups), this is clearly the best answer.

Thank you much!

(incidentally, I ended up finishing the task with a manual sort and sent it on, but I went back to the raw data on my lunch break to try the various solutions presented here.)
posted by pmbuko at 3:20 PM on April 27, 2005


fleacircus,

can you post a slightly more comprehensive answer? I'd like to learn what you suggested.
posted by seinfeld at 8:13 PM on April 27, 2005


« Older Remote connection to Win95 pc   |   Thar' she blows Newer »
This thread is closed to new comments.