qsort comparison function with large structures. Best Methods?
September 28, 2006 1:00 PM Subscribe
What is the best way to create a function which compares two structures consisting of LOTS of elements for a C-based qsort routine?
There's gotta be a nicer way to do this... And wow does that look bad here on the green.
I'm looking for formatting tips and I'm looking for ideas which will make this more maintainable. It would be nice to avoid having to type each compared element four times during future modifications and to avoid searching for its neighbour halfway down the screen.
In the actual routine, I have a structure with about 10 elements and the function looks like the roof of a house tilted on its side...
Currently I have:
There's gotta be a nicer way to do this... And wow does that look bad here on the green.
I'm looking for formatting tips and I'm looking for ideas which will make this more maintainable. It would be nice to avoid having to type each compared element four times during future modifications and to avoid searching for its neighbour halfway down the screen.
In the actual routine, I have a structure with about 10 elements and the function looks like the roof of a house tilted on its side...
Currently I have:
int record_sort(const void *a, const void *b) { _record *c1 = (_record *)a; _record *c2 = (_record *)b; if (c1->a==c2->a) if (c1->b==c2->b) if (c1->c==c2->c) if (c1->d==c2->d) if (c1->e==c2->e) if (strncmp(c1->f, c2->f, SIZE_RULE)==0) if (c1->g==c2->g) if (c1->h==c2->h) if (c1->i==c2->i) return(1); else return(c1->i>c2->i); else return(c1->h>c2->h); else return(c1->g>c2->g); else return(strncmp(c1->f, c2->f, SIZE_RULE)); else return(c1->e>c2->e); else return(c1->d>c2->d); else return(c1->c>c2->c); else return(c1->b>c2->b); else return(c1->a>c2->a); }
... and by calculate i mean: store the hash in an array, and compare the hash values
posted by Psychnic at 1:07 PM on September 28, 2006
posted by Psychnic at 1:07 PM on September 28, 2006
Are the elements in the structure all simple datatypes (int, double, etc) or are there pointers there as well? Because you can do a memcmp() if you aren't storing stuff in pointers.
Also, I don't see how storing a hash would help.
posted by chunking express at 1:36 PM on September 28, 2006
Also, I don't see how storing a hash would help.
posted by chunking express at 1:36 PM on September 28, 2006
Actually, memcmp probably makes no sense either, for the same reason a hash makes no sense.
posted by chunking express at 1:36 PM on September 28, 2006
posted by chunking express at 1:36 PM on September 28, 2006
Memcmp makes sense only in that it makes the case of identical structures really fast. If that is a very common case, it is a clear win. Basically, if the memory is identical, return 0, *then* do piecewise comparison.
If this is not the case, then yes, your way is pretty much it. If you want to insulate your code more, you might want to write specific record element delta functions.
You can also structure you code like this:
int delta = 0;
if (!(delta = Rec_DeltaA(c1, c2))
return delta;
if (!(delta = RecDeltaB(c1, c2))
return delta;
if (!(delta = RecDeltaC(c1, c2))
return delta;
posted by plinth at 1:57 PM on September 28, 2006
If this is not the case, then yes, your way is pretty much it. If you want to insulate your code more, you might want to write specific record element delta functions.
You can also structure you code like this:
int delta = 0;
if (!(delta = Rec_DeltaA(c1, c2))
return delta;
if (!(delta = RecDeltaB(c1, c2))
return delta;
if (!(delta = RecDeltaC(c1, c2))
return delta;
posted by plinth at 1:57 PM on September 28, 2006
Would something like this work: http://pastie.caboo.se/15363
posted by chunking express at 2:00 PM on September 28, 2006
posted by chunking express at 2:00 PM on September 28, 2006
You can get rid of the nesting and make it look a bit prettier:
You can even write a set of macros to make this pattern systematic: (verify your compiler does recursive
If you want to do something a bit more extensible/maintainable, you'd have to do some template magic, do a recursive function, or have a table with function pointers, etc.
posted by clord at 2:09 PM on September 28, 2006
int c; return (c1->a != c2->a ? c1->a > c2->a : (c1->b != c2->b ? c1->b > c2->b : (c1->c != c2->c ? c1->c > c2->c : (c1->d != c2->d ? c1->d > c2->d : (c1->e != c2->e ? c1->e > c2->e : (0==c=strncmp(c1->f, c2->f, SIZE_RULE) ? c : (c1->g != c2->g ? c1->g > c2->g : (c1->h != c2->h ? c1->h > c2->h : (c1->i != c2->i ? c1->i > c2->i : 1 )))))))));Which is just a more dense version of your cod, but with appropriate comments would be acceptable style in my opinion.
You can even write a set of macros to make this pattern systematic: (verify your compiler does recursive
define
expansion.)
#define cmpit(a,b, nxt) \ a!=b ? a > b : (nxt) #define cmpit_str(a,b, nxt) \ 0==c=strncmp((a), (b), SIZE_RULE) ? c : (nxt) ... int c; return cmpit(c1->a, c2->a, cmpit(c1->b, c2->b, cmpit(c1->c, c2->c, ... cmpit_str(c1->f, c2->f, cmpit(c1->g, c2->g, cmpit(c1->h, c2->h, cmpit(c1->i, c2->i, 1)))))))));You can do the same thing with real functions, too. (Defines are sometimes difficult to work with, especially when you are working with others.)
If you want to do something a bit more extensible/maintainable, you'd have to do some template magic, do a recursive function, or have a table with function pointers, etc.
posted by clord at 2:09 PM on September 28, 2006
The first define should be
posted by clord at 2:12 PM on September 28, 2006
#define cmpit(a,b, nxt) \ (a) != (b) ? (a) > (b) : (nxt)
posted by clord at 2:12 PM on September 28, 2006
Aside: comparison functions are supposed to return -1, 0, or 1, for a < b, a = b, and a > b, respectively. Your function only distinguishes between "equal" and "not equal" which is not enough information to get you a sorted list.
posted by Khalad at 3:45 PM on September 28, 2006
posted by Khalad at 3:45 PM on September 28, 2006
I don't think C has a 'sequential OR' but that is the construct you want. Let '|' mean a sequential 'or' - 'a|b' evaluates 'a' and returns it if not false, (==0) otherwise it evaluates b and returns that. 'b' is not evaluated unless 'a' doesn't pan out.
Your function would then just be 'return test1 | test2 | .... | testn'
I'm not sure how to do this as a macro, or if it is possible
posted by MonkeySaltedNuts at 4:01 PM on September 28, 2006
Your function would then just be 'return test1 | test2 | .... | testn'
I'm not sure how to do this as a macro, or if it is possible
posted by MonkeySaltedNuts at 4:01 PM on September 28, 2006
Abuse macros.
#define CMP(a,b) ((b) < (a) ? 1 : -1)br> #define TEST(f) if(c1->f != c2->f) return CMP(c1->f, c2->f)
TEST(a);
TEST(b);
TEST(c);
return 0;>
posted by jepler at 4:55 PM on September 28, 2006
#define CMP(a,b) ((b) < (a) ? 1 : -1)br> #define TEST(f) if(c1->f != c2->f) return CMP(c1->f, c2->f)
TEST(a);
TEST(b);
TEST(c);
return 0;>
posted by jepler at 4:55 PM on September 28, 2006
well that'd didn't quite work. perhaps you can puzzle it out.
posted by jepler at 4:56 PM on September 28, 2006
posted by jepler at 4:56 PM on September 28, 2006
jepler has it. You can use return statements to get rid of the nesting, and macros to remove the redundancy.
Reformatting jepler's suggested code:
posted by russilwvong at 5:17 PM on September 28, 2006
Reformatting jepler's suggested code:
#define CMP(a, b) ((a) < (b) ? -1 : 1)#define TEST(x) if (c1->x != c2->x) return CMP(c1->x, c2->x)TEST(a);TEST(b);TEST(c);TEST(d);TEST(e);if (strncmp(c1->f, c2->f, SIZE_RULE) != 0) return strncmp(c1->f, c2->f, SIZE_RULE);TEST(g);TEST(h);return 0;
posted by russilwvong at 5:17 PM on September 28, 2006
For members that are integral (including char) , use operator minus.
For non-integral numerics, use operator less than twice, and the ternary operator (see below).
For members that are bool, use operator minus, replying on the automatic conversion to int.
For the members that are (const) char*, use strcmp.
For members that are structures or pointers to struct, use the appropriate qsort comparator function for that structure type; of it's a pointer, check for the null pointer; if it's not a pointer, remember to take the address when passing to its comparator.
For arrays, walk the array and compare elements pair-wise.
By using only one operator minus for integral types, we can save cycles.
(In most assembky languages, == and < and> are all translated to cmp, which does the same thing as and takes the same cuycles as a sub instruction. Doing it this way takes one sub instead of two cmps instead of two.)
Recall that C (like C++ and Java, but not quite like Javascript) uses short-circuit evaluation. Only necessary expressions in a logical expression are evaluated. Since "false && anything" is always false, and "true || anything" must always be true, in those expressions, "anything" is not evaluated.
Recall also that the operands of a logical expression are bool or converted to bool, and that zero converts to false and all other values convert to true. Because of this, we need to keep the unconverted int value in the variable "rrr" while letting thwe assignment expression's value convert to a bool.
struct ss { const char* c; double d; int i ; bool b ;} ;
// ss is ordered by I then d then c then b
int sscompare (const void* lhs, const void* rhs ) {
//assumes neiterh lhs nor rhs is null
const struct ss* a = (struct ss*) lhs ;
const struct ss* b = (struct ss*) rhs ;
int rrr ;
/* rrr is the result we return,
it's zero if the structs have the equal members, negative if struct less is ordered first, positive if rhs is ordered first */
( rrr = a->i - b->i)
|| ( rrr = a->d < a->b ? -1 : ( b->b < a->b ) )
|| ( rrr = strcmp( a->c, b->c ) )
|| ( rrr = a->b - b->b ) ;
return rrr ;
}
What's happening here?
First operator minus compute the difference between a->i and b->i, and that value is stored into variable rrr. The value of an assignment operator expression is the value assigned, so the value of the sub-expression:
( rrr = a->i- b->i)
is rrr; that sub-expression is the operand of operator ||, so the expression's value is converted to bool.
If rrr is not zero, then a->i and b->i are not the same, and the value of the sub-expression is bool true. Since the values of the first operand of operator || is true, the rest of the expression is short-circuited, and control passes to the return statement.
The return statement returns rrr, which is negative if a->i is less than b->i, and positive if a->i is greater than b->i. Since the i member is (arbitrarily) the major comparison for struct ss, we're done.
If a->i does equal b->I, then rrr is zero. The first operator ||'s first operand converts to false, and the logical expression is not short-circuited. This requires the evaluation of the second sub-expression:
( rrr = a->d < a->b ? -1 : ( b->b < a->b ) )
Since member d is a double, we us operator less than to prevent truncation that would occur if we cast operator minus's result to int. We make the slightly faster branch occur when the things compared are already in the correct order.
In this expression, rrr becomes zero if !( a->d < b->d && b->d < a->d ), -1 if a->d < d->b, and 1 if b->d > a->b. Again, equality == 0 == false, which means the logical expression does not short-circuit. Inequality == -1 or ,1 which converts to true, which short-circuits the logical expression. We use only operator less than, not operator equal, because that's sufficient for a strict wealk orderiing and does not require operator equality to be defined. (Of course it always is for primitve types; but this allows easier transition to User defined Types in C++.)>>>>>>>>
posted by orthogonality at 5:53 PM on September 28, 2006
For non-integral numerics, use operator less than twice, and the ternary operator (see below).
For members that are bool, use operator minus, replying on the automatic conversion to int.
For the members that are (const) char*, use strcmp.
For members that are structures or pointers to struct, use the appropriate qsort comparator function for that structure type; of it's a pointer, check for the null pointer; if it's not a pointer, remember to take the address when passing to its comparator.
For arrays, walk the array and compare elements pair-wise.
By using only one operator minus for integral types, we can save cycles.
(In most assembky languages, == and < and> are all translated to cmp, which does the same thing as and takes the same cuycles as a sub instruction. Doing it this way takes one sub instead of two cmps instead of two.)
Recall that C (like C++ and Java, but not quite like Javascript) uses short-circuit evaluation. Only necessary expressions in a logical expression are evaluated. Since "false && anything" is always false, and "true || anything" must always be true, in those expressions, "anything" is not evaluated.
Recall also that the operands of a logical expression are bool or converted to bool, and that zero converts to false and all other values convert to true. Because of this, we need to keep the unconverted int value in the variable "rrr" while letting thwe assignment expression's value convert to a bool.
struct ss { const char* c; double d; int i ; bool b ;} ;
// ss is ordered by I then d then c then b
int sscompare (const void* lhs, const void* rhs ) {
//assumes neiterh lhs nor rhs is null
const struct ss* a = (struct ss*) lhs ;
const struct ss* b = (struct ss*) rhs ;
int rrr ;
/* rrr is the result we return,
it's zero if the structs have the equal members, negative if struct less is ordered first, positive if rhs is ordered first */
( rrr = a->i - b->i)
|| ( rrr = a->d < a->b ? -1 : ( b->b < a->b ) )
|| ( rrr = strcmp( a->c, b->c ) )
|| ( rrr = a->b - b->b ) ;
return rrr ;
}
What's happening here?
First operator minus compute the difference between a->i and b->i, and that value is stored into variable rrr. The value of an assignment operator expression is the value assigned, so the value of the sub-expression:
( rrr = a->i- b->i)
is rrr; that sub-expression is the operand of operator ||, so the expression's value is converted to bool.
If rrr is not zero, then a->i and b->i are not the same, and the value of the sub-expression is bool true. Since the values of the first operand of operator || is true, the rest of the expression is short-circuited, and control passes to the return statement.
The return statement returns rrr, which is negative if a->i is less than b->i, and positive if a->i is greater than b->i. Since the i member is (arbitrarily) the major comparison for struct ss, we're done.
If a->i does equal b->I, then rrr is zero. The first operator ||'s first operand converts to false, and the logical expression is not short-circuited. This requires the evaluation of the second sub-expression:
( rrr = a->d < a->b ? -1 : ( b->b < a->b ) )
Since member d is a double, we us operator less than to prevent truncation that would occur if we cast operator minus's result to int. We make the slightly faster branch occur when the things compared are already in the correct order.
In this expression, rrr becomes zero if !( a->d < b->d && b->d < a->d ), -1 if a->d < d->b, and 1 if b->d > a->b. Again, equality == 0 == false, which means the logical expression does not short-circuit. Inequality == -1 or ,1 which converts to true, which short-circuits the logical expression. We use only operator less than, not operator equal, because that's sufficient for a strict wealk orderiing and does not require operator equality to be defined. (Of course it always is for primitve types; but this allows easier transition to User defined Types in C++.)>>>>>>>>
posted by orthogonality at 5:53 PM on September 28, 2006
Macros, schmacros. If you want to write clear, maintainable code, don't ever do anything clever with macros.
The first thing wrong with the code as you've presented it is that it's fundamentally broken. You return 1 if all fields are equal, and either 0 or 1 (the result of something like
What your comparison function needs to return, according to the qsort manpage, is
The second thing wrong with it is that it has a sucky name. The function doesn't record sorts, it compares records; so it should be called compare_records(), not record_sort().
I'd code your comparison function like this:
If you don't have any float or double fields, leave out s.
posted by flabdablet at 5:56 PM on September 28, 2006
The first thing wrong with the code as you've presented it is that it's fundamentally broken. You return 1 if all fields are equal, and either 0 or 1 (the result of something like
return(c1->e>c2->e)
) if they're not.What your comparison function needs to return, according to the qsort manpage, is
an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second.Note that this does not require that the returned value be exactly -1, 0 or 1, though these are naturally acceptable. For integer comparisons, just return the result of a subtraction.
The second thing wrong with it is that it has a sucky name. The function doesn't record sorts, it compares records; so it should be called compare_records(), not record_sort().
I'd code your comparison function like this:
int compare_records (const void *a, const void *b)
{
_record *ra = (_record *) a;
_record *rb = (_record *) b;
int d;
double s;
d = ra->a - rb->a; if (d) return d;
d = ra->b - rb->b; if (d) return d;
d = ra->c - rb->c; if (d) return d;
s = ra->d - rb->d; d = (s>0) - (s<0); if (d) return d;
d = ra->e - rb->e; if (d) return d;
d = strncmp(ra->f, rb->f, SIZE_RULE); if (d) return d;
d = ra->g - rb->g; if (d) return d;
d = ra->h - rb->h; if (d) return d;
d = ra->i - rb->i; if (d) return d;
return 0;
}
If you don't have any float or double fields, leave out s.
posted by flabdablet at 5:56 PM on September 28, 2006
flabdablet writes "The first thing wrong with the code as you've presented it is that it's fundamentally broken. You return 1 if all fields are equal, and either 0 or 1 (the result of something like
Er, no!
NO NO NO!
qsort, according to the ANSI Standard C99 (and in C98, which C++98 incorporates by reference), returns 0 for equality, <0 if teh first parameter us orderd before than the second, and>0 if the second parameter comes first. It's in fact modeled on the result of first minus second, as are the results of the strcmp family of functions0>
posted by orthogonality at 6:09 PM on September 28, 2006
return(c1->e>c2->e)
) if they're not."Er, no!
NO NO NO!
qsort, according to the ANSI Standard C99 (and in C98, which C++98 incorporates by reference), returns 0 for equality, <0 if teh first parameter us orderd before than the second, and>0 if the second parameter comes first. It's in fact modeled on the result of first minus second, as are the results of the strcmp family of functions0>
posted by orthogonality at 6:09 PM on September 28, 2006
Oh, you were being descriptive of teh OP, not proscriptive. Sorry!
posted by orthogonality at 6:09 PM on September 28, 2006
posted by orthogonality at 6:09 PM on September 28, 2006
Best answer: It took me many, many years to realize that clear code is better than clever code. I think it's the same kind of realization that drivers have after a few crashes or near-misses: the road is not a racetrack, and source code is the wrong place to do peephole optimization.
Not that I'm opposed to macro abuse in general. I remember being very proud of one particularly egregious bunch that let me generate a table for Morse Code error display on a front panel LED using something like
But macros that hide control flow are just Wrong.
I like orthogonality's useful rules for doing correct comparisons on appropriate types, but I dislike the || structure for ordering the field comparisons.
It would work nicely in Perl, where the || operator yields the actual value of the first nonzero term instead of a Boolean result based on it. In C, the use of || strikes me as a little bit forced.
Using assignment = operators in a context where you'd normally find comparison == operators will make most readers doubletake before they realize the code is actually correct. The trailing parentheses necessitated by the low precedence of = are also a nuisance because they need counting.
My own code runs roughshod over every style guideline I've ever seen, but it reads like robot poetry. That makes it easy for anybody even slightly familiar with C to understand quickly that it returns the first non-equal comparison result.
It's also clear that the comparison lines are just cut-and-paste with variations, so it's easy to add more and be confident they'll work without thinking about it too much; which is also why I used the explicit "return 0" at the end instead of just making the last return non-conditional. The compiler will optimize that out anyway, if it's any good.
Hmm. I've just realized that both orthogonality and I have missed a failure mode. If the two values being compared are far enough apart, subtracting them causes a signed overflow that will bollix up the sort results. For example, 2000000000 - -2000000000 should yield 4000000000, a positive number showing that two billion is greater than minus two billion. But if int is a 32-bit type on your machine, the greatest positive int is 2147483647 and the subtraction will overflow to -294967296 which clearly has the wrong sign.
If you have fields with value limits that might trigger this fault, it probably is time to reach for the macro preprocessor:
posted by flabdablet at 9:53 PM on September 28, 2006
Not that I'm opposed to macro abuse in general. I remember being very proud of one particularly egregious bunch that let me generate a table for Morse Code error display on a front panel LED using something like
short morse_lookup[] = {
MORSE DI DAH _ _ _ _ _ _ 'a',
MORSE DA DI DI DIT _ _ _ _ 'b',
MORSE DA DI DA DIT _ _ _ _ 'c',
MORSE DA DI DIT _ _ _ _ _ 'd',
MORSE DIT _ _ _ _ _ _ _ 'e',
MORSE DI DI DA DIT _ _ _ _ 'f',
...
MORSE DA DA DA DA DAH _ _ _ '0',
MORSE DI DA DI DA DI DAH _ _ '.',
MORSE DA DA DI DI DA DAH _ _ ',',
MORSE DI DI DA DA DI DIT _ _ '?',
}
But macros that hide control flow are just Wrong.
I like orthogonality's useful rules for doing correct comparisons on appropriate types, but I dislike the || structure for ordering the field comparisons.
It would work nicely in Perl, where the || operator yields the actual value of the first nonzero term instead of a Boolean result based on it. In C, the use of || strikes me as a little bit forced.
Using assignment = operators in a context where you'd normally find comparison == operators will make most readers doubletake before they realize the code is actually correct. The trailing parentheses necessitated by the low precedence of = are also a nuisance because they need counting.
My own code runs roughshod over every style guideline I've ever seen, but it reads like robot poetry. That makes it easy for anybody even slightly familiar with C to understand quickly that it returns the first non-equal comparison result.
It's also clear that the comparison lines are just cut-and-paste with variations, so it's easy to add more and be confident they'll work without thinking about it too much; which is also why I used the explicit "return 0" at the end instead of just making the last return non-conditional. The compiler will optimize that out anyway, if it's any good.
Hmm. I've just realized that both orthogonality and I have missed a failure mode. If the two values being compared are far enough apart, subtracting them causes a signed overflow that will bollix up the sort results. For example, 2000000000 - -2000000000 should yield 4000000000, a positive number showing that two billion is greater than minus two billion. But if int is a 32-bit type on your machine, the greatest positive int is 2147483647 and the subtraction will overflow to -294967296 which clearly has the wrong sign.
If you have fields with value limits that might trigger this fault, it probably is time to reach for the macro preprocessor:
#define numcmp(x, y) ( (x) < (y) ? -1 : (x) > (y) )
int foocmp(const void *pa, const void *pb)
{
foo *a = (foo *) pa;
foo *b = (foo *) pb;
int d;
d = numcmp(a->a, b->a); if (d) return d;
d = numcmp(a->b, b->b); if (d) return d;
d = numcmp(a->c, b->c); if (d) return d;
d = numcmp(a->d, b->d); if (d) return d;
d = numcmp(a->e, b->e); if (d) return d;
d = strncmp(a->f, b->f, SIZE_RULE); if (d) return d;
d = numcmp(a->g, b->g); if (d) return d;
d = numcmp(a->h, b->h); if (d) return d;
d = numcmp(a->i, b->i); if (d) return d;
return 0;
}
posted by flabdablet at 9:53 PM on September 28, 2006
This thread is closed to new comments.
posted by Psychnic at 1:06 PM on September 28, 2006