Filtering for text uniqueness in Excel
April 27, 2005 12:24 PM Subscribe
I dumped a week's worth (about 1400 rows) of successful connection data from my company's VPN concentrator logs into a .csv file containing the following columns (A through D): Date, Time, Username, Source IP. I am trying to filter the list in Excel so it only shows the most recent connection data per username, but I can't figure out how to accomplish this programmatically.
I've sorted it by username, then by date, then by time. Visually, I know I only want the last row of data per username, but again, how to get Excel to give me just those rows is beyond me. Since I need this done quickly, I have already begun manually deleting duplicate logon rows, keeping only the most recent entry. I know there's gotta be a better way!
Best answer: Make a new column with cells containing only the date values you want, then use copy / Paste Special (Values) or some sorting to pull out those rows.
If the date column is A and the user column is C, set the cells in this new column X to something like this:
=IF($A1=$A2, "", $C1)
Copy it down through all the records; you might have to monkey with the last row. Then if you sort the entire dataset by this new column X, you bring only the relevant records up to the top, where you can sort just those by username again.
posted by fleacircus at 12:46 PM on April 27, 2005
Best answer: Stupidly, I wrote the formula as if the user column were A and the date column C.
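Translated to your actual layout (dates in A, usernames in C), it would be something like:
=IF($C1=$C2, "", $A1)
i.e., when the username in the row below differs, the cell gets that row's date; otherwise it stays blank.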
posted by fleacircus at 12:47 PM on April 27, 2005
Assuming the logs are in date order and the username is in the first column, in PHP you could do it this way:
// Our output array
$output = array();

// Open up the log file
$file = "logs.csv";
$fh = fopen( $file, "r" );

// Concentrate data down to one row per username
while ( $row = fgetcsv( $fh, 102400 ) ) {
    $username = $row[0];
    $output[ $username ] = $row;
}

// Print out the last row found for each user
foreach ( $output as $username => $data ) {
    print join( ",", $data ) . "\n";
}
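Save that as, say, lastlogins.php (any name works) and run php lastlogins.php > latest.csv to get one row per user.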
posted by gaby at 12:48 PM on April 27, 2005
Keep the sort you've got: username, date, time.
Copy and paste so that the username is the last column. Format the date, time, and IP so that none has internal whitespace (you can leave whitespace in the username), so you end up with something like this:
2005-01-01 12:34:56 127.0.0.1 JoeBlow
2000-01-01 12:34:56 127.0.0.1 JoeBlow
2005-01-01 12:34:56 127.0.0.1 Mary Q Public
Export that as text.
Now use the Unix uniq command like this:
uniq --skip-fields=3 name.of.text.file.txt > output.file.txt
Import the output file back into Excel.
If you don't have uniq, install cygwin and get it.
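One catch: uniq keeps the first line of each run of duplicates, so the date and time need to be sorted newest-first within each username, as in the example above. If your export is oldest-first, reverse it first with something like tac name.of.text.file.txt | uniq --skip-fields=3 > output.file.txt (tac also comes with GNU coreutils / cygwin).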
posted by orthogonality at 12:53 PM on April 27, 2005
- Sort the table in reverse chronological order
- Use a pivot table to get a de-duped list of user names
- For each user name in the pivot table, use that in a VLOOKUP in the original table to get the most recent connection data. Here's the clever bit: set the last value in the VLOOKUP function ("range_lookup") to FALSE; this will force it to look for the first exact match in the table for the username. Since you've sorted the connections in reverse chronological order, the first reference in the table for a given username is the most recent.
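To sketch that last step (ranges are illustrative: assume you've rearranged the data so Username is the leftmost column of A2:D1400 and put the pivot table's usernames in column F): =VLOOKUP($F2, $A$2:$D$1400, 2, FALSE) returns that user's most recent date; change the 2 to 3 or 4 for the time and source IP. The rearranging matters because VLOOKUP can only return columns to the right of the lookup column.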
posted by DevilsAdvocate at 12:57 PM on April 27, 2005
And in Perl (multilingual day for me today :) ), making the same assumptions as the PHP version:
my %output;
# Open up the log file
my $file = "logs.csv";
open( CSV, "<$file" );
# Keep only the last row seen for each username
while ( <CSV> ) {
    chomp( $_ );
    my @row = split( ",", $_ );
    my $username = $row[0];
    $output{ $username } = join( ",", @row );
}
close( CSV );
# Print the last row found for each user, sorted by username
for my $username ( sort( keys( %output ) ) ) {
    print $output{ $username } . "\n";
}
posted by gaby at 1:03 PM on April 27, 2005
I'm sure there's a command-line option for sort that would do the same job: sort -u for starters, but I don't know where else to take it.
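Something along these lines might do it, assuming the dates and times sort correctly as plain text (e.g. yyyy-mm-dd):
sort -t, -k3,3 -k1,1r -k2,2r logs.csv | awk -F, '!seen[$3]++'
sort groups the rows by username with the newest first, and the awk filter keeps only the first row it sees for each username.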
posted by gaby at 1:08 PM on April 27, 2005
Depending on what you're going to do with it afterward, subtotals might be your easiest choice. Sort the data by username, then add a subtotal to it. That'll bring up the summary bar on the side, which you can collapse to show just the subtotal lines.
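(In Excel that's under Data > Subtotals..., with "At each change in" set to Username and "Use function" set to Max on the Date column; field names from memory, so your version may differ slightly.)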
This information is of limited usefulness from this point on, but if all you want to do is get a list of when everyone last connected, it's a dead simple solution.
posted by jacquilynne at 1:10 PM on April 27, 2005
It seems to me that Access, if you have it available to you, is ideal for this.
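A totals query would get you most of the way; with made-up table and field names, something like:
SELECT Username, Max([Date] + [Time]) AS LastConnection FROM VPNLog GROUP BY Username;
You'd then join that result back against the log table to pull in the matching source IP.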
posted by kc0dxh at 2:02 PM on April 27, 2005
The PHP and Perl solutions are wrong because they forgot that the username is in column C, not A. Here's a Python 2.4 solution:
import csv, fileinput

# Keep the last row seen for each username (column C = index 2)
usernames = {}
for row in csv.reader(fileinput.input()):
    usernames[row[2]] = ",".join(row)
print "\n".join(usernames.itervalues())
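Run it with the log as an argument, e.g. python lastlogins.py logs.csv > latest.csv (script name is arbitrary); fileinput reads the files named on the command line, or stdin if there are none.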
posted by grouse at 2:40 PM on April 27, 2005
Of course, the body of the for loop has to be indented. Now the Perl and PHP weenies will taunt me mercilessly for my meaningful whitespace biting me in the ass. But at least my program will get the right answer, if you indent it. ;)
posted by grouse at 2:42 PM on April 27, 2005
Response by poster: fleacircus wins, with extra points for simplicity and keeping it entirely within Excel! I'm all for using external tools when they can do the job quicker/better, but since the final product needs to be in Excel anyway (to send on to the higher-ups), this is clearly the best answer.
Thank you much!
(incidentally, I ended up finishing the task with a manual sort and sent it on, but I went back to the raw data on my lunch break to try the various solutions presented here.)
posted by pmbuko at 3:20 PM on April 27, 2005
fleacircus,
can you post a slightly more comprehensive answer? I'd like to learn what you suggested.
posted by seinfeld at 8:13 PM on April 27, 2005
In a dataset of the size you're working with, it's probably going to take longer to find a programmatic solution than to do it manually. On the other hand, if you had ten times as much data, or needed to do this operation loads of times, the time investment would be worth it.
posted by gaby at 12:42 PM on April 27, 2005
This thread is closed to new comments.