Statistically comparing different search engine results
December 6, 2006 8:12 AM

Stats101Filter. I have (I think) a stats question, but little stats knowledge. Problem: The same library, and two different information retrieval systems - A and B - incorporating different metadata and search engines. I search the library with each IR system, using the same query. Is there a way to compare the similarity or difference between the result sets I get from each system, and also to assign this difference some statistical significance?

For example, if I search for 'frogs,' I could get 'All about Frogs,' 'Lifecycle of the Frog,' 'Florida Frog Cam,' 'Cool Frog Pics,' etc., as results. I can see a range of scenarios for comparing A and B.

H0 - There is no difference
Scenario 1 - A and B return the same results, in the same order

H1 - There is a difference
Scenario 2 - A and B return the same results, but in different order
Scenario 3 - A and B return at least some different results
Etc.

For various reasons - basically we are replacing one engine with another that we think is more efficient and scalable - we are hoping for Scenario 1. However, we are worried that we may encounter Scenario 3. So the question is, how can we calculate any 'difference' we might encounter in Scenario 3, and how can we decide whether or not this difference is significant (and in real-world terms, likely to confuse users if we do switch our IR systems). Phew! (And thanks!)
posted by carter to technology (10 comments total)
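
The scenarios in the question can be told apart mechanically for any single query. What follows is only a minimal sketch, assuming each engine's output is available as a Python list of titles or IDs in ranked order; the function name `classify_scenario` and the sample lists are made up for illustration.

```python
def classify_scenario(results_a, results_b):
    """Map two ranked result lists onto the scenarios listed in the question."""
    if results_a == results_b:
        return 1  # same results, same order
    if sorted(results_a) == sorted(results_b):
        return 2  # same results, different order
    return 3      # at least some different results


# Hypothetical result lists for the query 'frogs'
a = ["All about Frogs", "Lifecycle of the Frog", "Florida Frog Cam"]
b = ["Lifecycle of the Frog", "All about Frogs", "Florida Frog Cam"]
c = ["All about Frogs", "Cool Frog Pics"]

print(classify_scenario(a, a))  # 1
print(classify_scenario(a, b))  # 2
print(classify_scenario(a, c))  # 3
```

Run over a batch of queries, the count of queries falling into each scenario gives a first rough picture before any finer measurement.
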
Do you have a way of comparing the rankings? Do you have a way to tell if one book returned is a better result than another? If you do, how? You'll need some sort of relevance score to be able to evaluate scenario #2.

Otherwise, the classic way to test an IR system is through precision and recall. That is, does a system return all of the possible relevant results and of the relevant results returned how many are correct? That will take care of scenario #3. If you have a relevance score then you can make sure that the most important matches are being returned.
posted by Alison at 9:59 AM on December 6, 2006


Sorry, that should be "of the results returned how many are correct?"
posted by Alison at 10:00 AM on December 6, 2006
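
For reference, precision and recall as described above can be computed as in the sketch below. It assumes a hand-labelled set of relevant items exists for the query, which may not hold for every collection; the function name and the sample data are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned items that are relevant.
    Recall: share of relevant items that were returned."""
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = returned_set & relevant_set
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: what one engine returned vs. what a librarian judged relevant
returned = ["All about Frogs", "Florida Frog Cam", "Cool Frog Pics", "Toad Hall"]
relevant = ["All about Frogs", "Lifecycle of the Frog", "Cool Frog Pics"]

p, r = precision_recall(returned, relevant)
print(p)  # 0.5 (2 of the 4 returned items are relevant)
print(r)  # 2 of the 3 relevant items were returned
```
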


This isn't a question of statistical significance. Any difference between the search results for identical terms is entirely due to the new system, not to any sort of stochastic element.

Even if it were, it is entirely possible, and quite common, to observe differences that are statistically significant but that also don't make a damn bit of difference. Statistical significance has absolutely nothing whatsoever to do with real or substantive significance, ever.

One way you could do this is to run queries on both and look at any differences. Then use your own expert judgment to decide whether these differences are likely to confuse users.

You could use statistical measures to assess the sameness of both lists. But really I think you're better off having librarians look at the results for some random searches and making a decision based on their experience and expertise.
posted by ROU_Xenophobe at 10:05 AM on December 6, 2006
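
One descriptive measure of the "sameness" of two ranked lists is a rank correlation such as Kendall's tau. The sketch below is an assumption about what such a measure could look like, not something proposed in the thread: it computes tau only over items that both engines returned, assumes neither list contains duplicates, and says nothing about whether a difference would matter to users.

```python
from itertools import combinations


def kendall_tau(results_a, results_b):
    """Kendall's tau-a over items present in both lists.

    Returns a value from -1.0 (reversed order) to 1.0 (identical order),
    or None if fewer than two items are shared.
    """
    rank_b = {doc: i for i, doc in enumerate(results_b)}
    shared = [doc for doc in results_a if doc in rank_b]  # kept in A's order
    if len(shared) < 2:
        return None
    concordant = discordant = 0
    for x, y in combinations(shared, 2):  # x is ranked above y by engine A
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


print(kendall_tau(["x", "y", "z"], ["x", "y", "z"]))  # 1.0
print(kendall_tau(["x", "y", "z"], ["z", "y", "x"]))  # -1.0
```

Because items found by only one engine are ignored here, a number like this is best read alongside a count of how many results the two lists share.
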


Or you could observe user behavior before and after the switch. Are users behaving afterwards in a way consistent with being confused by their search results, or dissatisfied with searches?
posted by ROU_Xenophobe at 10:07 AM on December 6, 2006


Alison:

Do you have a way of comparing the rankings? Do you have a way to tell if one book returned is a better result than another? If you do, how?

Unfortunately not; our users can vary quite significantly in their assessment of the usefulness and relevance of a particular resource, with variance depending on the task that user is engaged in.

Re. precision/recall, this may sound a bit weird, but due to a range of factors - basically to do with distributed/federated collection development, and lack of agreement on metadata standards - we don't really have a complete picture of what's in the database. We are, however, going to compare the number of search results returned by each system.
posted by carter at 10:30 AM on December 6, 2006


One way you could do this is to run queries on both and look at any differences. Then use your own expert judgment to decide whether these differences are likely to confuse users. Or you could observe user behavior before and after the switch. Are users behaving afterwards in a way consistent with being confused by their search results, or dissatisfied with searches?

We're probably going to do this at some stage, although see the point about user subjectivity above. However, asking someone else to write a script to do some kind of automatic search comparison seemed to be the lower-cost alternative ;)
posted by carter at 10:36 AM on December 6, 2006


You could run pseudo-recall on each of the systems. For example, you could check to see if there are any results returned by system A that system B failed to pick up, and vice-versa. This will tell you if one system is finding things the other isn't and which one is more thorough.
posted by Alison at 10:53 AM on December 6, 2006
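
A rough sketch of the pseudo-recall check described above, for one query. The pooling step is an assumption about how to turn the idea into a number: the union of both result sets stands in for the unknown set of relevant items, so the scores only say how much of the combined pool each engine found, not whether those items are any good. The function and key names are hypothetical.

```python
def pseudo_recall(results_a, results_b):
    """Compare two result sets against their pooled union."""
    set_a, set_b = set(results_a), set(results_b)
    pool = set_a | set_b  # stand-in for the unknown "relevant" set
    return {
        "only_in_a": set_a - set_b,  # found by A, missed by B
        "only_in_b": set_b - set_a,  # found by B, missed by A
        "recall_a": len(set_a) / len(pool) if pool else 1.0,
        "recall_b": len(set_b) / len(pool) if pool else 1.0,
    }


# Hypothetical record IDs returned by each engine for the same query
report = pseudo_recall(["f1", "f2", "f3"], ["f2", "f3", "f4", "f5"])
print(report["only_in_a"])  # {'f1'}
print(report["recall_a"])   # 0.6
print(report["recall_b"])   # 0.8
```
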


It actually looks, from our first runs, as if there are sometimes considerable differences in the numbers of results returned for a query by each engine.

So even though we don't know the number of items in the set of relevant items in the database, this will allow us to have some kind of 'relative recall' measurement - e.g. we could do a scatter plot of engine vs. engine and figure out how far we are away from agreement (which should be a 45 degree line, I think).

And we will also be doing some difference tests - based on the top 10 and 20 results, that is the first 1 and 2 pages of results - to see how many results are in common (and, therefore, how many have not been picked up by individual engines).

Thanks both! Very useful!
posted by carter at 11:25 AM on December 6, 2006
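
The top-10 and top-20 difference tests mentioned above could be sketched as follows. This is a minimal illustration, not the script actually used; it assumes each engine's results arrive as a ranked list of IDs, and the function name, key names, and sample data are invented.

```python
def overlap_at_k(results_a, results_b, k):
    """Compare the first k results from each engine."""
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    return {
        "shared": len(shared),                # in both top-k lists
        "missing_from_a": len(top_b - top_a), # B's top-k items absent from A's
        "missing_from_b": len(top_a - top_b), # A's top-k items absent from B's
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical ranked IDs for one query
a = list("abcdefghij")
b = list("abcxyzdefg")

print(overlap_at_k(a, b, 5)["shared"])   # 3
print(overlap_at_k(a, b, 10)["shared"])  # 7
```

Averaging the shared count or the Jaccard value over a set of test queries gives one number per cutoff to track, with the caveat that overlap alone does not say which engine's extra items are the useful ones.
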


One thing you should consider is that there is not a strict "more results = better" or "missing results = bad" relationship. The extra items returned by the system that returns more items could either be the most relevant/useful items possible or they could be completely irrelevant chaff. If you don't have a way of evaluating this, I don't really see how the kind of analyses you propose are helpful.
posted by juv3nal at 3:38 PM on December 6, 2006


True! As it turns out, the way things are looking after the first runs today, there seems to be quite a bit of difference between the two engines, differences which are apparent when you just eyeball the results. And so we will be moving on to doing some usability work, to see if these quantitative differences translate into qualitative differences for the users. I think that by attempting some basic quantitative analyses, we were trying to see whether user testing was necessary; if we'd had pretty much the same results, we probably wouldn't have considered user testing.
posted by carter at 5:07 PM on December 6, 2006

