The missing de meets the Big O
March 27, 2007 1:59 PM   Subscribe

Given two unsorted lists, size N and M (N < M), how long [ O (???) ] and how much storage [ O (???) ] would be required to find elements in common between N and M (the number of common elements could range from 0 to N)? How would you go about doing this?

Is there anything that could be done to improve this time by restructuring the data (i.e. sorted list or something else entirely) and what benefit would this have on the time and storage needed?

Note: This question was inspired by the MeTa meetups sidebar, which says there is an April 12 meetup in "Berlin, DE" which to me could be interpreted as either Delaware or Germany. Fortunately, there is no Berlin in Delaware, but I was curious about how to go about finding such collisions if we had full lists of all towns/cities in 2 places and determining where colliding meetups might be possible.
posted by langeNU to Computers & Internet (29 answers total) 1 user marked this as a favorite
The usual way to do this is to sort both lists and run pointers down each list, looking for matches along the way.

In more detail:
Let A and B be the two sorted lists. For convenience, let them be represented as arrays A[1..N] and B[1..M].
1. Let p = 1 and q = 1
2. If A[p] < B[q] then 
     p = p+1
   Else if A[p] > B[q] then 
     q = q+1
   Else if A[p] == B[q] then 
     Output A[p]
     p = p+1
     q = q+1
3. If p>N or q>M then
     Goto 2.
The time and space of this method is dominated by the sorting step.
posted by mhum at 2:23 PM on March 27, 2007

Three approaches immediately pop into mind:
1. Sort both lists, then traverse both lists at the same time to find common elements. O(m+n) temporary storage, O( m log m + n log n ) computation. (The O(m+n) final traversal is hidden by the sort)

2. Slightly better, at least asymptotically, would be to sort _one_ list (the longer one), then do a binary search on each element in the smaller list to find out if it is in the bigger one. O(m) storage, O( (m + n) log m ) complexity ( m log m to sort the m list, plus n * log m for doing n binary searches of the sorted m list)

3. No restructuring, and do a O( m * n ) search. This sucks, but requires no (O(0), I guess?) temporary storage.
posted by blenderfish at 2:24 PM on March 27, 2007

PS: The UNIX utility "comm" pretty much does this, though it requires sorted lists as inputs.
posted by mhum at 2:24 PM on March 27, 2007

The number of state vs. cty code overlaps is small, so if you had those shorter lists, you could find the overlap and use that to only check some of the big list.
posted by smackfu at 2:25 PM on March 27, 2007

Hmm. thinking about it a little more, if n is smaller than m, O( m log m + n log n) would be smaller than O( (m+n) log m ). So, #2 is slower (but still requires less storage.)

So, yeah, #1 (Which is what mhum details) is probably the fastest obvious algorithm.
posted by blenderfish at 2:27 PM on March 27, 2007

The storage and the time taken depend hugely on various factors.

-- how big are the lists?
-- how big are the list items?
-- how long does it take to compare two list items?
-- are there simply ways of computing hashes of the list items?
-- are list items likely to repeat or be unique?

For small lists of simple items such as integers, the naive algorithm is trivially simple:

For each item in N
-- For each item in M
---- compare the item from N with the item from M
---- if they match, add them to the list of common items

In this case the storage required is approximately 2N + M times the number of bytes required for each item (lists N and M, plus up to N common items).

The time taken is proportional to N times M, since you make N times M comparisons in the nested loop.

For bigger lists or more expensive comparisons, the algorithm would need to be optimised.
posted by unSane at 2:29 PM on March 27, 2007

For something that would take what I believe to be minimal O(m+n) time but slightly more storage you could traverse one list and turn it into a hashmap/set, then traverse the other list and detect if elements were already in that map. Since hash insertion is O(1) it's linear time (need a good hash function and equality test etc). For size it depends on your load factor but it's not too bad for most reasonable values.
posted by true at 2:31 PM on March 27, 2007

This is the "set intersection" algorithm. It requires N tests for membership in a set of size M. So if your membership test is O(f(x)) time, the intersection test is O(N*f(x)). Likely f(x) are f(x) = log(x) or f(x)=1.

Since you test for membership in the set M, you need to keep around O(M) information. You can take the items from N and produce the outputs one at a time if you like, so those both contribute O(1) giving a total of O(M+1+1) = O(M). If you prefer to keep M, N and result, then you get O(M+2N).

Here's the guts of a Python set intersection algorithm:
       if len(self) < len(other):br>            little, big = self, other
           little, big = other, self
       common = ifilter(big._data.has_key, little)
which works pretty much as I describe. ifilter means 'take the items from 'little' in turn, and if 'little' is a member of 'big' (has_key), put it on the result. In Python the 'has_key' test is O(1).

I hope I didn't just answer a homework question.
posted by jepler at 2:32 PM on March 27, 2007

Okay. Dammit; last post, I promise.
If you sort only the _smaller_ of the two lists (Call it m), then
iterate through the larger list, using a binary search on each element
in that list against the smaller list, your computation O
is O ( m log m + n log m ), which is smaller than the O( m log m + n log n ) of the 'sort both lists' approach (again, assuming m < n.) also, it requires o( m ) storage, rather than o( n + m ).br>
If you use a hash, (which makes things a little more challenging, analysis-wise, since worst case is much different than average case,) then, as 'true' points out, you could add each element in the shorter list to a hash table, and iterate through the other list, for _average case_ O( n + m ).

Okay. Back to work!
posted by blenderfish at 2:36 PM on March 27, 2007

Response by poster: Jepler - I added the note on the bottom detailing why I was asking so hopefully everyone would recognize that it's not a homework problem - it's pure curiosity on my part.
posted by langeNU at 2:37 PM on March 27, 2007

Insert into DB, creating tables N and M.

Select keyfield
from N
select keyfield
from M
posted by nomisxid at 3:34 PM on March 27, 2007

Allocate two bit vectors the size of the shorter each element in each list, set the according bit in each list, skipping anything that's not in the shorter list. Perform a logical and of every byte in each list.

Storage is 2 log2(n)
Time is O(m + 2n)

This assumes that all elements are unique.
posted by plinth at 4:00 PM on March 27, 2007

There's a possibly cool way to do this if your universe is finite: that is, assign each element in the universe a power of two, and calculate n = {N} (e.g., the sum of the values of all the elements in list with size N) and m = {M}. M - N will uniquely specify the difference. This should take O(1) time and O(1) storage.
posted by aberrant at 5:02 PM on March 27, 2007

to be more precise, it will take O(n) storage in order to store the values for all the elements in the universe.
posted by aberrant at 5:09 PM on March 27, 2007

and apologies for the multiple posts, but if you're assuming that the smaller list is a subset of the larger, then it becomes trivial to show that this can be done in O(1) time with O(n) storage using the above method, since the size of the universe is by definition finite, being bounded by and contained within the larger list.
posted by aberrant at 5:12 PM on March 27, 2007

This should take O(1) time and O(1) storage.

Assuming that an infinite-precision math is constant time is cheating. The time that addition takes is proportional to the number of possible values an element of either m or n may have. I guess you can _call_ that value constant, but that's a bit misleading. Hypothesizing that the two sets contained unsigned 4-byte integers, these tables would each be billions of bits long, and take billions of bit operations to merge!

Your strategy essentially amounts to building two perfect hashtables (or sparse sorted lists, if you prefer) and performing a set operation on them. (Not a bad approach, depending on the context of the problem.)
posted by blenderfish at 5:25 PM on March 27, 2007

Agreed that the feasibility of the approach decreases in relation to the expansion of U, but for moderately-sized finite universes, it's a viable option. It certainly would get unwieldy as n(U) approaches 2^32, as you suggest - but for smaller set sizes it can be used to elegant effect, especially if multiple compares are necessary, as you've pregenerated your hash tables.
posted by aberrant at 5:41 PM on March 27, 2007

From a practical perspective, I usually do matching using hashtables now. Something like:

1) loop thru list A, adding each item to the hastable
2) loop thru list B, checking the hashtable for each item

It gives easy to read code without fussing about with multiple indexes. And you only need one list (or the hashed equivalent) in memory; the other can come from a file or a database cursor.

The dual sort sliding thingie is clever, but I've seen that kind of thing go wrong too often. Namely when people assume that lists from different sources use the same sort order (stupid ORDER BY).
posted by smackfu at 6:10 PM on March 27, 2007

The hash test seems like a winner to me: it only requires one pass over the data, & no sorting.

If you were doing this by hand with index cards, then even the method you'd use to sort one list is up for debate.
posted by Pronoiac at 6:21 PM on March 27, 2007

The Aho-Corasick algorithm is extremely cool and would work nicely for the specific meetup problem (since we're comparing strings). Time is O(N' + M' + Z), where N' is the total length of the N strings in the first set, M' is the same for M, and Z is the number of matches found. Not sure about space, but it'll be some function of N'.

Although N' and M' are larger than N and M, this'll still take about the same amount of time as the hashing approach, because Aho-Corasick just builds and then walks along a state machine; no hash computations needed.

Of course, it's way more complicated than just using a hash table, so it'd be completely silly to actually use it for this problem. It's very very cool, though...
posted by equalpants at 6:39 PM on March 27, 2007

The Aho-Corasick algorithm is extremely cool and would work nicely for the specific meetup problem

That's a cool algorithm, and provides a slick way to look for any of N sequences of characters in another, non-delimited sequence of characters, but since his lists are delimited, it would be of little use. (It would degenerate into just a search tree, since the first character of all the members of the dictionary would be the delimiter.)
posted by blenderfish at 7:16 PM on March 27, 2007

Your problem is pretty much an equal join in relational database terms, and this is a much visited area. For databases, the fun thing is that both lists (tables) are usually too large to be held in memory.

The classic solution is Merge-sort, after external sort has been performed on the two lists. An alternative is hash-based join, which scales well if done in parallel.
posted by of strange foe at 8:42 PM on March 27, 2007

Say your two lists are represented as lists of atomic objects (i.e. not lists of strings - if "Berlin" appears in both lists it will be represented as a pointer to the "Berlin" object).

Then finding the intersection is O(N + M)
Assume FlagNo has an integer value.

FlagNo = FlagNo+1;
for (x in list1) x.flag = FlagNo;
for (y in list2) if (y.flag == FlagNo) collect y;

posted by MonkeySaltedNuts at 11:07 PM on March 27, 2007

... that's just nuts.
posted by Pronoiac at 12:17 AM on March 28, 2007

MonkeySaltedNuts' solution only works if your test for equality is 'it is the same atomic object', which is a very special case.
posted by unSane at 9:20 AM on March 28, 2007

unSane: "...only works if your test for equality is 'it is the same atomic object',...".

WRONG. The equality test in this example is comparing integer values and a similar example could be made that compares boolean values.

The example works because the strings have been mapped to canonical objects (N strings can be "canonicalized" in O(NlogN)).

With canonical objects all set-theoretical operations become linear. If you are doing lots of such operations the cost of canonicalization is a set-up cost not an operational cost.

Even if you want to evaluate 1 intersection, the canonical approach is theoretically optimum (in terms of O), and involves no explicit sorting (of course the "canonicalizer" might involve implicit sorting for really big lists).
posted by MonkeySaltedNuts at 11:51 AM on March 28, 2007

Well, you didn't actually say the list had been canonicalized.

If the list has been 'canonicalized' then the equality test is actually 'canonicalizes to the same atomic object'.
posted by unSane at 12:38 PM on March 28, 2007

Your words still indicate you have little grasp on what you are saying.
posted by MonkeySaltedNuts at 3:41 PM on March 28, 2007

The fact that you think I'm wrong shows that you haven't thought about it hard enough.
posted by unSane at 3:49 PM on March 28, 2007

« Older I need suggestions for getting over my CSS block   |   Will energy saving light bulbs burn my house down? Newer »
This thread is closed to new comments.