Amazon is now automatically generating pull quotes from similar customer reviews: How are they doing that?
May 2, 2012 11:45 AM

Amazon recently started featuring pull quotes on product reviews that are automatically compiled from 'similar' customer statements. For example, take a look at the top of the reviews section for an iPad. At first I thought these might be manually maintained on popular products, but the pull quotes exist for some pretty obscure items with only a handful of reviews, so they are evidently being compiled procedurally. I'm working on a project where this exact type of natural language processing could be very useful. Is there a preexisting library that offers functionality similar to what Amazon is doing here? Or could anybody point me in the right direction for understanding the technology behind this better?
posted by csimpkins to Computers & Internet (3 answers total) 3 users marked this as a favorite
I don't see any pull quotes on that page (A/B testing? I've heard they're always running them), but what you describe sounds similar to the Review Highlights section on some Yelp pages (e.g., this one). I very much doubt that either company is using anything off the shelf, however.
posted by jacalata at 3:23 PM on May 2, 2012

I used to work at Amazon, but I have no knowledge of their language processing efforts beyond what an outsider interested in this kind of thing would have.

I imagine if they are doing something like this on a large scale, they are either:

1) using text summarization tools to find passages in comments that have overlap or that say the same thing (as reported by a summarizer). For overlap I naively assume you could look at the search problem in reverse: which passages would act as successful queries to retrieve multiple reviews?

2) using a whole lot of manual labor via Mechanical Turk, either to supplement the above, or instead of the above.
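
For the overlap idea in option 1, here is a rough sketch of the "search in reverse" approach: treat every short word sequence as a candidate query and keep the ones that would match more than one review. This is only an illustration of the idea, not anything known about Amazon's system. The function names, the three-word window, and the two-review cutoff are all assumptions made up for the example.

```python
import re
from collections import defaultdict


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(reviews, n=3, min_reviews=2):
    """Find n-grams that would 'retrieve' at least min_reviews reviews.

    Returns (ngram, [review indexes]) pairs, most widely shared first.
    The defaults for n and min_reviews are arbitrary illustrative choices.
    """
    hits = defaultdict(set)
    for idx, review in enumerate(reviews):
        for gram in word_ngrams(review, n):
            hits[gram].add(idx)
    ranked = [(gram, sorted(ids)) for gram, ids in hits.items()
              if len(ids) >= min_reviews]
    # Most reviews matched first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


if __name__ == "__main__":
    sample = [
        "The battery life is great.",
        "Battery life is great, screen is dim.",
        "Terrible screen.",
    ]
    for gram, ids in shared_passages(sample):
        print(gram, ids)
```

Exact n-gram matching only catches reviews that use the same words, so passages that say the same thing differently would still need a summarizer or a looser similarity measure.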
posted by zippy at 5:32 PM on May 2, 2012

I imagine that this works something like:

1) Find significant features. That is, find words or phrases that are statistically significant: unusually frequent in this product's reviews and not common in reviews in general.
2) Cut up each review into sentences and longer sentences into fragments. NLTK can do the first part, and the second part would just be a sliding window.
3) Find fragments containing the features.
4) Cluster the fragments. Maybe something as simple as Jaccard for a similarity metric?
5) Pick a representative fragment for the cluster. Shortest or commonest fragment with the highest scoring terms.
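
The steps above can be sketched roughly as follows, using only the standard library. This is a toy illustration under several assumptions, not a tested production pipeline and not a description of what Amazon does: the frequency-ratio test for "significant" terms, the regex sentence splitter (NLTK's sentence tokenizer would be the more robust choice for step 2), the greedy single-pass clustering, and every function name and threshold here are made up for the example.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def significant_terms(reviews, background, min_ratio=2.0, min_count=2):
    """Step 1: words relatively more frequent here than in background reviews."""
    fg = Counter(w for r in reviews for w in tokens(r))
    bg = Counter(w for r in background for w in tokens(r))
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    terms = set()
    for word, count in fg.items():
        fg_rate = count / fg_total
        # Add one so words unseen in the background do not divide by zero.
        bg_rate = (bg[word] + 1) / (bg_total + 1)
        if count >= min_count and fg_rate / bg_rate >= min_ratio:
            terms.add(word)
    return terms


def fragments(review, window=6):
    """Step 2: split into sentences; slide a window over long sentences."""
    for sentence in re.split(r"(?<=[.!?])\s+", review.strip()):
        words = sentence.split()
        if not words:
            continue
        if len(words) <= window:
            yield sentence
        else:
            for i in range(len(words) - window + 1):
                yield " ".join(words[i:i + window])


def jaccard(a, b):
    """Jaccard similarity between the token sets of two fragments."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cluster(frags, threshold=0.5):
    """Step 4: greedy clustering against the first member of each cluster."""
    clusters = []
    for frag in frags:
        for group in clusters:
            if jaccard(frag, group[0]) >= threshold:
                group.append(frag)
                break
        else:
            clusters.append([frag])
    return clusters


def pull_quotes(reviews, background, min_cluster=2):
    """Steps 1-5: one representative fragment per sufficiently large cluster."""
    terms = significant_terms(reviews, background)
    # Step 3: keep only fragments that mention a significant term.
    frags = [f for r in reviews for f in fragments(r) if terms & set(tokens(f))]
    groups = [g for g in cluster(frags) if len(g) >= min_cluster]
    groups.sort(key=len, reverse=True)
    # Step 5: here the representative is simply the shortest fragment.
    return [min(g, key=len) for g in groups]
```

Picking the shortest fragment is the simplest reading of step 5; scoring fragments by how many significant terms they contain would be the obvious next refinement.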
posted by joshu at 1:40 AM on May 3, 2012
