Comments on: The missing de meets the Big O
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O/
Comments on Ask MetaFilter post The missing de meets the Big O
Tue, 27 Mar 2007 14:23:17 -0800
Question: The missing de meets the Big O
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O
Given two unsorted lists, size N and M (N < M), how long [ O (???) ] and how much storage [ O (???) ] would be required to find elements in common between N and M (the number of common elements could range from 0 to N)? How would you go about doing this? <br /><br /> Is there anything that could be done to improve this time by restructuring the data (i.e. sorted list or something else entirely) and what benefit would this have on the time and storage needed?<br>
<br>
Note: This question was inspired by the MeTa meetups sidebar, which says there is an April 12 meetup in "Berlin, DE" which to me could be interpreted as either Delaware or Germany. Fortunately, there is no Berlin in Delaware, but I was curious about how to go about finding such collisions if we had full lists of all towns/cities in 2 places and determining where colliding meetups might be possible.post:ask.metafilter.com,2007:site.59468Tue, 27 Mar 2007 13:59:17 -0800langeNUbigOlistcomparisoncompsciBy: mhum
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894424
The usual way to do this is to sort both lists and run pointers down each list, looking for matches along the way.<br>
<br>
In more detail:<br>
Let A and B be the two sorted lists. For convenience, let them be represented as arrays A[1..N] and B[1..M]. <br>
<pre><br>
1. Let p = 1 and q = 1<br>
2. If p > N or q > M then<br>
     Exit<br>
3. If A[p] < B[q] then<br>
     p = p+1<br>
   Else if A[p] > B[q] then<br>
     q = q+1<br>
   Else (A[p] == B[q])<br>
     Output A[p]<br>
     p = p+1<br>
     q = q+1<br>
4. Goto 2.<br>
</pre>The time and space of this method are dominated by the sorting step; the scan itself is O(N + M).comment:ask.metafilter.com,2007:site.59468-894424Tue, 27 Mar 2007 14:23:17 -0800mhumBy: blenderfish
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894426
Three approaches immediately pop into mind:<br>
1. Sort both lists, then traverse both lists at the same time to find common elements. O(m+n) temporary storage, O( m log m + n log n ) computation. (The O(m+n) final traversal is hidden by the sort)<br>
<br>
2. Slightly better, at least asymptotically, would be to sort _one_ list (the longer one), then do a binary search on each element in the smaller list to find out if it is in the bigger one. O(m) storage, O( (m + n) log m ) complexity ( m log m to sort the m list, plus n * log m for doing n binary searches of the sorted m list)<br>
<br>
3. No restructuring: just do an O( m * n ) search. This sucks, but requires no temporary storage beyond the output (O(1)).comment:ask.metafilter.com,2007:site.59468-894426Tue, 27 Mar 2007 14:24:07 -0800blenderfishBy: mhum
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894428
PS: The UNIX utility "comm" pretty much does this, though it requires sorted lists as inputs.comment:ask.metafilter.com,2007:site.59468-894428Tue, 27 Mar 2007 14:24:19 -0800mhumBy: smackfu
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894430
The number of state vs. country code overlaps is small, so if you had those shorter lists, you could find the overlap and use that to only check some of the big list.comment:ask.metafilter.com,2007:site.59468-894430Tue, 27 Mar 2007 14:25:43 -0800smackfuBy: blenderfish
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894433
Hmm. thinking about it a little more, if n is smaller than m, O( m log m + n log n) would be smaller than O( (m+n) log m ). So, #2 is slower (but still requires less storage.) <br>
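For reference, approach #1 (sort both lists, then walk them in step) can be sketched in Python roughly as follows. This is only an illustration: the function name `common_sorted_merge` and the sample integers are made up for the example, and it assumes the items can be ordered with `<`.

```python
def common_sorted_merge(a, b):
    """Intersect two unsorted lists by sorting both, then scanning in step."""
    # Sorting dominates: O(n log n + m log m) time, O(n + m) extra space.
    a, b = sorted(a), sorted(b)
    p = q = 0
    common = []
    # The scan is O(n + m): every step advances at least one pointer.
    while p < len(a) and q < len(b):
        if a[p] < b[q]:
            p += 1
        elif a[p] > b[q]:
            q += 1
        else:
            common.append(a[p])
            p += 1
            q += 1
    return common


print(common_sorted_merge([7, 3, 9, 1], [4, 9, 8, 3, 2, 6]))  # [3, 9]
```

The loop checks both bounds before reading either list, so an empty input simply yields an empty result.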
<br>
So, yeah, #1 (Which is what mhum details) is probably the fastest obvious algorithm.comment:ask.metafilter.com,2007:site.59468-894433Tue, 27 Mar 2007 14:27:18 -0800blenderfishBy: unSane
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894439
The storage and the time taken depend hugely on various factors.<br>
<br>
-- how big are the lists?<br>
-- how big are the list items?<br>
-- how long does it take to compare two list items?<br>
-- are there simple ways of computing hashes of the list items?<br>
-- are list items likely to repeat or be unique?<br>
<br>
For small lists of simple items such as integers, the naive algorithm is trivially simple:<br>
<br>
For each item in N<br>
-- For each item in M<br>
---- compare the item from N with the item from M<br>
---- if they match, add them to the list of common items<br>
<br>
In this case the storage required is approximately 2N + M times the number of bytes required for each item (lists N and M, plus up to N common items).<br>
<br>
The time taken is proportional to N times M, since you make N times M comparisons in the nested loop.<br>
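A minimal Python sketch of that nested loop, assuming the items support `==`. The name `common_naive` is hypothetical, and the `break` is an addition not in the outline above: it stops scanning M once a match is found, so each item of N is reported at most once per occurrence.

```python
def common_naive(n_items, m_items):
    """Compare every item of N against every item of M: O(N * M) comparisons."""
    common = []
    for x in n_items:
        for y in m_items:
            if x == y:
                common.append(x)
                break  # assumption: one report per item of N is enough
    return common


print(common_naive([7, 3, 9, 1], [4, 9, 8, 3, 2, 6]))  # [3, 9]
```

No sorting or hashing is needed, which is why this remains reasonable for small lists.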
<br>
For bigger lists or more expensive comparisons, the algorithm would need to be optimised.comment:ask.metafilter.com,2007:site.59468-894439Tue, 27 Mar 2007 14:29:57 -0800unSaneBy: true
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894441
For something that would take what I believe to be minimal O(m+n) time but slightly more storage you could traverse one list and turn it into a hashmap/set, then traverse the other list and detect if elements were already in that map. Since hash insertion and lookup are O(1) on average, it's linear time (need a good hash function and equality test etc). For size it depends on your load factor but it's not too bad for most reasonable values.comment:ask.metafilter.com,2007:site.59468-894441Tue, 27 Mar 2007 14:31:29 -0800trueBy: jepler
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894444
This is the "set intersection" algorithm. It requires N tests for membership in a set of size M. So if your membership test takes O(f(M)) time, the intersection test is O(N*f(M)). Likely choices are f(M) = log(M) or f(M) = 1.<br>
<br>
Since you test for membership in the set M, you need to keep around O(M) information. You can take the items from N and produce the outputs one at a time if you like, so those both contribute O(1) giving a total of O(M+1+1) = O(M). If you prefer to keep M, N and result, then you get O(M+2N).<br>
<br>
Here's the guts of a Python set intersection algorithm:<br>
if len(self) < len(other):<br>
    little, big = self, other<br>
else:<br>
    little, big = other, self<br>
common = ifilter(big._data.has_key, little)<br>
which works pretty much as I describe. <b>ifilter</b> means 'take the items from 'little' in turn, and if an item is a member of 'big' (<b>has_key</b>), put it on the result'. In Python the 'has_key' test is O(1) on average.<br>
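The fragment above is Python 2 (`ifilter` and `has_key` no longer exist in Python 3). A rough Python 3 equivalent might look like the sketch below, where the function name `intersection` is made up for illustration; in practice the built-in `set(a) & set(b)` does the same job.

```python
def intersection(a, b):
    """Probe the bigger set once for each item of the smaller one."""
    a, b = set(a), set(b)
    little, big = (a, b) if len(a) < len(b) else (b, a)
    # filter() is lazy in Python 3, like ifilter; `in` replaces has_key.
    return set(filter(big.__contains__, little))


print(intersection([7, 3, 9, 1], [4, 9, 8, 3, 2, 6]) == {3, 9})  # True
```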
<br>
I hope I didn't just answer a homework question.comment:ask.metafilter.com,2007:site.59468-894444Tue, 27 Mar 2007 14:32:12 -0800jeplerBy: blenderfish
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894449
Okay. Dammit; last post, I promise.<br>
If you sort only the _smaller_ of the two lists (call it m), then iterate through the larger list, doing a binary search for each of its elements against the sorted smaller list, your computation is O( m log m + n log m ), which is smaller than the O( m log m + n log n ) of the 'sort both lists' approach (again, assuming m < n). Also, it requires O( m ) storage, rather than O( n + m ).<br>
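A sketch of that sort-the-smaller-list idea in Python, using the standard `bisect` module for the binary search. The function and parameter names (`common_binary_search`, `shorter`, `longer`) are hypothetical, and the items are assumed to be orderable.

```python
from bisect import bisect_left


def common_binary_search(shorter, longer):
    """Sort only the shorter list, then binary-search it for each longer item."""
    index = sorted(shorter)  # O(m log m) time, O(m) extra space
    common = []
    for item in longer:  # n probes at O(log m) each
        i = bisect_left(index, item)
        if i < len(index) and index[i] == item:
            common.append(item)
    return common


print(common_binary_search([7, 3, 9, 1], [4, 9, 8, 3, 2, 6]))  # [9, 3]
```

Results come out in the order of the longer list, since that is the list being scanned.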
<br>
If you use a hash, (which makes things a little more challenging, analysis-wise, since worst case is much different than average case,) then, as 'true' points out, you could add each element in the shorter list to a hash table, and iterate through the other list, for _average case_ O( n + m ).<br>
<br>
Okay. Back to work!comment:ask.metafilter.com,2007:site.59468-894449Tue, 27 Mar 2007 14:36:31 -0800blenderfishBy: langeNU
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894450
Jepler - I added the note on the bottom detailing why I was asking so hopefully everyone would recognize that it's not a homework problem - it's pure curiosity on my part.comment:ask.metafilter.com,2007:site.59468-894450Tue, 27 Mar 2007 14:37:33 -0800langeNUBy: nomisxid
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894497
Insert into DB, creating tables N and M. <br>
<br>
Select keyfield<br>
from N<br>
intersect<br>
select keyfield<br>
from Mcomment:ask.metafilter.com,2007:site.59468-894497Tue, 27 Mar 2007 15:34:33 -0800nomisxidBy: plinth
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894526
Allocate two bit vectors the size of the shorter list. For each element in each list, set the corresponding bit in that list's vector, skipping anything that's not in the shorter list. Perform a logical AND of every byte in the two vectors.<br>
<br>
Storage is 2n bits (two n-bit vectors)<br>
Time is O(m + 2n)<br>
<br>
This assumes that all elements are unique.comment:ask.metafilter.com,2007:site.59468-894526Tue, 27 Mar 2007 16:00:18 -0800plinthBy: aberrant
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894576
There's a possibly cool way to do this if your universe is finite: that is, assign each element in the universe a power of two, and calculate n = {N} (i.e., the sum of the values of all the elements in list with size N) and m = {M}. M - N will uniquely specify the difference. This should take O(1) time and O(1) storage.comment:ask.metafilter.com,2007:site.59468-894576Tue, 27 Mar 2007 17:02:15 -0800aberrantBy: aberrant
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894586
<small>to be more precise, it will take O(n) storage in order to store the values for all the elements in the universe.</small>comment:ask.metafilter.com,2007:site.59468-894586Tue, 27 Mar 2007 17:09:33 -0800aberrantBy: aberrant
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894591
<small>and apologies for the multiple posts, but if you're assuming that the smaller list is a subset of the larger, then it becomes trivial to show that this can be done in O(1) time with O(n) storage using the above method, since the size of the universe is by definition finite, being bounded by and contained within the larger list.</small>comment:ask.metafilter.com,2007:site.59468-894591Tue, 27 Mar 2007 17:12:16 -0800aberrantBy: blenderfish
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894605
<i>This should take O(1) time and O(1) storage.</i><br>
<br>
Assuming that an infinite-precision math is constant time is cheating. The time that addition takes is proportional to the number of possible values an element of either m or n may have. I guess you can _call_ that value constant, but that's a bit misleading. Hypothesizing that the two sets contained unsigned 4-byte integers, these tables would each be billions of bits long, and take billions of bit operations to merge!<br>
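To make that cost concrete, here is a small Python sketch of the power-of-two encoding, using Python's arbitrary-precision integers as the bit tables. The five-letter `universe`, the `bit_of` mapping, and the helper `to_mask` are all invented for illustration. Building each mask takes one pass over its list, and the final AND touches a number of bits proportional to the size of the universe, so nothing here is really O(1).

```python
def to_mask(items, bit_of):
    """Sum the powers of two assigned to the items (one bit per element)."""
    mask = 0
    for item in items:
        mask |= 1 << bit_of[item]
    return mask


universe = ["a", "b", "c", "d", "e"]  # hypothetical finite universe
bit_of = {name: i for i, name in enumerate(universe)}

n_mask = to_mask(["b", "d"], bit_of)
m_mask = to_mask(["a", "b", "c", "d"], bit_of)

both = n_mask & m_mask  # bitwise AND keeps only the shared elements
common = [name for name in universe if (both >> bit_of[name]) & 1]
print(common)  # ['b', 'd']
```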
<br>
Your strategy essentially amounts to building two perfect hashtables (or sparse sorted lists, if you prefer) and performing a set operation on them. (Not a bad approach, depending on the context of the problem.)comment:ask.metafilter.com,2007:site.59468-894605Tue, 27 Mar 2007 17:25:07 -0800blenderfishBy: aberrant
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894624
Agreed that the feasibility of the approach decreases in relation to the expansion of U, but for moderately-sized finite universes, it's a viable option. It certainly would get unwieldy as n(U) approaches 2^32, as you suggest - but for smaller set sizes it can be used to elegant effect, especially if multiple compares are necessary, as you've pregenerated your hash tables.comment:ask.metafilter.com,2007:site.59468-894624Tue, 27 Mar 2007 17:41:48 -0800aberrantBy: smackfu
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894657
From a practical perspective, I usually do matching using hashtables now. Something like:<br>
<br>
1) loop thru list A, adding each item to the hashtable<br>
2) loop thru list B, checking the hashtable for each item<br>
<br>
It gives easy to read code without fussing about with multiple indexes. And you only need one list (or the hashed equivalent) in memory; the other can come from a file or a database cursor.<br>
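In Python the two steps might look like the sketch below. The name `common_hashed` is hypothetical, and the items are assumed to be hashable. Only the first list is held in memory; the second can be any iterable, such as lines from a file or rows from a database cursor.

```python
def common_hashed(in_memory, stream):
    """Hash one list, then check each streamed item against it."""
    seen = set(in_memory)  # step 1: O(len(in_memory)) average time and space
    for item in stream:  # step 2: one average O(1) lookup per item
        if item in seen:
            yield item


matches = list(common_hashed([7, 3, 9, 1], iter([4, 9, 8, 3, 2, 6])))
print(matches)  # [9, 3]
```

Because it is a generator, matches are produced as the second list is read, with no sorting and no index juggling.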
<br>
The dual sort sliding thingie is clever, but I've seen that kind of thing go wrong too often. Namely when people assume that lists from different sources use the same sort order (stupid ORDER BY).comment:ask.metafilter.com,2007:site.59468-894657Tue, 27 Mar 2007 18:10:21 -0800smackfuBy: Pronoiac
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894665
The hash test seems like a winner to me: it only requires one pass over the data, & no sorting.<br>
<br>
If you were doing this by hand with index cards, then even the method you'd use to sort one list is up for debate.comment:ask.metafilter.com,2007:site.59468-894665Tue, 27 Mar 2007 18:21:07 -0800PronoiacBy: equalpants
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894693
The <a href="http://en.wikipedia.org/wiki/Aho-Corasick_algorithm">Aho-Corasick</a> algorithm is extremely cool and would work nicely for the specific meetup problem (since we're comparing strings). Time is O(N' + M' + Z), where N' is the total length of the N strings in the first set, M' is the same for M, and Z is the number of matches found. Not sure about space, but it'll be some function of N'.<br>
<br>
Although N' and M' are larger than N and M, this'll still take about the same amount of time as the hashing approach, because Aho-Corasick just builds and then walks along a state machine; no hash computations needed.<br>
<br>
Of course, it's way more complicated than just using a hash table, so it'd be completely silly to actually use it for this problem. It's very very cool, though...comment:ask.metafilter.com,2007:site.59468-894693Tue, 27 Mar 2007 18:39:24 -0800equalpantsBy: blenderfish
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894729
<i>The Aho-Corasick algorithm is extremely cool and would work nicely for the specific meetup problem</i><br>
<br>
That's a cool algorithm, and provides a slick way to look for any of N sequences of characters in another, non-delimited sequence of characters, but since his lists are delimited, it would be of little use. (It would degenerate into just a search tree, since the first character of all the members of the dictionary would be the delimiter.)comment:ask.metafilter.com,2007:site.59468-894729Tue, 27 Mar 2007 19:16:08 -0800blenderfishBy: of strange foe
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894792
Your problem is pretty much an <a href="http://en.wikipedia.org/wiki/Join_%28SQL%29">equal join</a> in relational database terms, and this is a much visited area. For databases, the fun thing is that both lists (tables) are usually too large to be held in memory. <br>
<br>
The classic solution is a <a href="http://en.wikipedia.org/wiki/Sort-Merge_Join">sort-merge join</a>, run after an <a href="http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch13_ExtSort.pdf">external sort</a> has been performed on the two lists. An alternative is hash-based join, which <a href="http://www.cs.wisc.edu/~dewitt/includes/paralleldb/vldb85.pdf">scales well if done in parallel</a>.comment:ask.metafilter.com,2007:site.59468-894792Tue, 27 Mar 2007 20:42:36 -0800of strange foeBy: MonkeySaltedNuts
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894897
Say your two lists are represented as lists of atomic objects (i.e. not lists of strings - if "Berlin" appears in both lists it will be represented as a pointer to the "Berlin" object).<br>
<br>
Then finding the intersection is O(N + M)<br>
Assume <code>FlagNo</code> has an integer value.<br>
<br>
<code>FlagNo = FlagNo+1;<br> for (x in list1) x.flag = FlagNo;<br> for (y in list2) if (y.flag == FlagNo) collect y;</code>comment:ask.metafilter.com,2007:site.59468-894897Tue, 27 Mar 2007 23:07:56 -0800MonkeySaltedNutsBy: Pronoiac
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#894925
... that's just nuts.comment:ask.metafilter.com,2007:site.59468-894925Wed, 28 Mar 2007 00:17:06 -0800PronoiacBy: unSane
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#895271
MonkeySaltedNuts' solution only works if your test for equality is 'it is the same atomic object', which is a very special case.comment:ask.metafilter.com,2007:site.59468-895271Wed, 28 Mar 2007 09:20:59 -0800unSaneBy: MonkeySaltedNuts
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#895452
<b>unSane</b>: <i>"...only works if your test for equality is 'it is the same atomic object',..."</i>.<br>
<br>
WRONG. The equality test in this example is comparing integer values and a similar example could be made that compares boolean values.<br>
<br>
The example works because the strings have been mapped to canonical objects (N strings can be "canonicalized" in O(NlogN)).<br>
<br>
With canonical objects all set-theoretical operations become linear. If you are doing lots of such operations the cost of canonicalization is a set-up cost not an operational cost.<br>
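A rough Python sketch of the canonical-object idea, with the flag-stamping intersection on top. The class names `Atom` and `Canonicalizer` are invented for the example, and the canonicalizer here is a plain dict (hash-based, average O(1) per string) rather than an O(N log N) structure, which is a substitution made for brevity.

```python
class Atom:
    """Canonical object standing in for one distinct string."""
    __slots__ = ("name", "flag")

    def __init__(self, name):
        self.name = name
        self.flag = 0


class Canonicalizer:
    def __init__(self):
        self._atoms = {}  # string -> its single Atom
        self.flag_no = 0

    def atom(self, name):
        """Return the one shared Atom for this string, creating it if needed."""
        found = self._atoms.get(name)
        if found is None:
            found = self._atoms[name] = Atom(name)
        return found

    def intersect(self, list1, list2):
        """Stamp list1 with a fresh flag number, then collect stamped items of list2."""
        self.flag_no += 1
        for x in list1:
            x.flag = self.flag_no
        return [y for y in list2 if y.flag == self.flag_no]


c = Canonicalizer()
list1 = [c.atom(s) for s in ["x", "y", "z"]]
list2 = [c.atom(s) for s in ["y", "w", "z", "q"]]
print([a.name for a in c.intersect(list1, list2)])  # ['y', 'z']
```

Using a fresh flag number for each call means stale stamps from earlier intersections never need to be cleared.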
<br>
Even if you want to evaluate 1 intersection, the canonical approach is theoretically optimum (in terms of O), and involves no explicit sorting (of course the "canonicalizer" might involve implicit sorting for really big lists).comment:ask.metafilter.com,2007:site.59468-895452Wed, 28 Mar 2007 11:51:21 -0800MonkeySaltedNutsBy: unSane
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#895499
Well, you didn't actually say the list had been canonicalized.<br>
<br>
If the list has been 'canonicalized' then the equality test is actually 'canonicalizes to the same atomic object'.comment:ask.metafilter.com,2007:site.59468-895499Wed, 28 Mar 2007 12:38:27 -0800unSaneBy: MonkeySaltedNuts
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#895675
<b>unSane</b>,<br>
Your words still indicate you have little grasp on what you are saying.comment:ask.metafilter.com,2007:site.59468-895675Wed, 28 Mar 2007 15:41:22 -0800MonkeySaltedNutsBy: unSane
http://ask.metafilter.com/59468/The-missing-de-meets-the-Big-O#895688
The fact that you think I'm wrong shows that you haven't thought about it hard enough.comment:ask.metafilter.com,2007:site.59468-895688Wed, 28 Mar 2007 15:49:33 -0800unSane