Uniq New York, Uniq New York
July 17, 2008 12:00 PM

Why isn't GNU uniq 6.12 working as I expect?

I am trying to use the --skip-fields=n argument to skip over some columns that I don't want to use for the criteria of uniqueness.

But it seems like uniq is ignoring those criteria.

Here is my sample input ("chr3test"). Imagine that the spaces are really tabs:

+ aatgtaatt 12124016 12124007 aattNaatt chr3
+ aactgaatt 37509704 37509695 aattNaatt chr3
+ aatttaatc 43787257 43787248 aattNaatt chr3
+ aatttaatt 81433256 81433247 aattNaatt chr3
+ aattaattt 81433277 81433268 aattNaatt chr3
+ aatttcatt 121944135 121944126 aattNaatt chr3
+ aattcaatt 128374695 128374686 aattNaatt chr3
+ aattcaatc 128374700 128374691 aattNaatt chr3
+ aagtgaatt 151747168 151747159 aattNaatt chr3
+ tattaaatt 175957080 175957071 aattNaatt chr3
+ aatttaatt 178762409 178762400 aattNaatt chr3
- aattacatt 12124016 12124007 aattNaatt chr3
- aattcagtt 37509704 37509695 aattNaatt chr3
- gattaaatt 43787257 43787248 aattNaatt chr3
- aattaaatt 81433256 81433247 aattNaatt chr3
- aaattaatt 81433277 81433268 aattNaatt chr3
- aatgaaatt 121944135 121944126 aattNaatt chr3
- aattgaatt 128374695 128374686 aattNaatt chr3
- gattgaatt 128374700 128374691 aattNaatt chr3
- aattcactt 151747168 151747159 aattNaatt chr3
- aatttaata 175957080 175957071 aattNaatt chr3
- aattaaatt 178762409 178762400 aattNaatt chr3


When I run:

uniq -f 2 chr3test

I get the entire file back, which is wrong.

If this worked, I would expect:

+ aatgtaatt 12124016 12124007 aattNaatt chr3
+ aactgaatt 37509704 37509695 aattNaatt chr3
+ aatttaatc 43787257 43787248 aattNaatt chr3
+ aatttaatt 81433256 81433247 aattNaatt chr3
+ aattaattt 81433277 81433268 aattNaatt chr3
+ aatttcatt 121944135 121944126 aattNaatt chr3
+ aattcaatt 128374695 128374686 aattNaatt chr3
+ aattcaatc 128374700 128374691 aattNaatt chr3
+ aagtgaatt 151747168 151747159 aattNaatt chr3
+ tattaaatt 175957080 175957071 aattNaatt chr3
+ aatttaatt 178762409 178762400 aattNaatt chr3


When I run:

uniq -f 3 chr3test

I also get the same results.

But when I run:

uniq -f 4 chr3test

Then I get back:

+ aatgtaatt 12124016 12124007 aattNaatt chr3

It seems like those numerical entries are treated as unique, and skipping over them makes the comparison criteria "aattNaatt\tchr3", which eliminates all but the first row.

I checked the metacharacters and there are \t ("tabs") correctly placed in matching rows (the matching condition being everything but the first two columns).

Can someone point to what I'm doing incorrectly, or suggest another method (preferably command-line) to strip out duplicates without losing column data? Thanks!
posted by Blazecock Pileon to Computers & Internet (3 answers total)
 
uniq only compares adjacent lines, so it requires sorted input. It looks like your input is not sorted on the fields being compared.
posted by mkb at 12:06 PM on July 17, 2008
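
To see the sorted-input requirement in isolation, here is a minimal sketch with a made-up three-line input (the letters and the variable names are illustrative, not from the question). uniq leaves the two "a" lines alone until sort places them next to each other:

```shell
# Toy input (hypothetical): the two "a" lines are separated by a "b" line.
toy_input=$(printf 'a\nb\na\n')

# uniq alone: no two adjacent lines match, so nothing is removed.
without_sort=$(printf '%s\n' "$toy_input" | uniq)

# sort first: the "a" lines become adjacent and collapse into one.
with_sort=$(printf '%s\n' "$toy_input" | sort | uniq)

printf 'uniq alone:\n%s\n' "$without_sort"
printf 'sort | uniq:\n%s\n' "$with_sort"
```

The same thing happens with chr3test: each "+" row and its matching "-" row are eleven lines apart, so uniq never compares them.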


Best answer: So basically what you need is:
sort -k3 -n chr3test | uniq -f2

posted by mkb at 12:07 PM on July 17, 2008
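
For anyone who wants to try the pipeline above without the full file, here is a small sketch using four rows copied from the question, written out with real tabs. The temporary file, the variable names, and the LC_ALL=C setting are assumptions added for the example; LC_ALL=C is there so that ties in the sort fall back to byte order, which puts "+" rows ahead of "-" rows. The awk line is an alternative that was not suggested in the thread: it keeps the first row seen for each combination of columns 3 to 6 and needs no sorting.

```shell
# Four rows from the question, tab-separated, written to a temporary file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  '+' aatgtaatt 12124016 12124007 aattNaatt chr3 \
  '+' aactgaatt 37509704 37509695 aattNaatt chr3 \
  '-' aattacatt 12124016 12124007 aattNaatt chr3 \
  '-' aattcagtt 37509704 37509695 aattNaatt chr3 > "$sample"

# Pipeline from the answer: sort numerically starting at field 3,
# then let uniq skip the first two fields when comparing lines.
deduped=$(LC_ALL=C sort -k3 -n "$sample" | uniq -f2)
printf '%s\n' "$deduped"

# Alternative without sorting (assumes tab-separated input with six columns):
# print a row only the first time its columns 3-6 are seen.
awk_deduped=$(awk -F'\t' '!seen[$3 FS $4 FS $5 FS $6]++' "$sample")
printf '%s\n' "$awk_deduped"

rm -f "$sample"
```

Both commands should leave only the two "+" rows for this sample. The awk version also preserves the original row order, which may matter if the file is not meant to be re-sorted.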


Response by poster: I forgot about that requirement. Thanks!
posted by Blazecock Pileon at 12:11 PM on July 17, 2008

