Help me craft a Google ngrams query
October 7, 2013 10:10 AM
I want to make a Google ngram showing the history of the term "spoiler" as it is used in popular culture to mean the reveal ahead of time of an important piece of plot information. Problem is, "spoiler" has at least two other meanings and I don't know how, or even if, I can craft a query to separate out those other meanings.
As you can see from this ngram, "spoiler" seems to have had a theological or religious implication in the 1800s and then got a boost in the 1900s as the term for an important part of an airplane wing. I'd like to separate these meanings out and JUST look at how spoiler is used to mean "piece of the plot of a narrative, revealed ahead of time."
In a legal research database I think I'd do this by excluding instances of the term "Spoiler" that occur in the same sentence or perhaps the same paragraph as a big list of terms about airplanes & theology, but 1) I'm not finding anything on whether it's possible to do proximity searches and 2) I feel that is an imprecise way to go about it. Can anybody point me in the direction of resources for how to do this better?
Maybe it was a regional thing or something, but I'd never heard the word spoiler in this context prior to reading Usenet, so 1987 or later for me. The phrase I had heard offline that I'd expect to find in the ngram viewer is 'spoil the ending.' If 'spoiler warning' shows up, that would also be clearly related, but again that's a usage I didn't hear until getting online.
posted by Monsieur Caution at 10:20 AM on October 7, 2013
The other phrase commonly used in pop culture is "spoiler alert" but as Monsieur Caution says above, that seems to rise with the advent of communities or discussion groups.
posted by a halcyon day at 10:30 AM on October 7, 2013
"Spoiler alert" is a phrase that must have originated pretty simultaneously with the sense you're interested in. The N-gram for "spoiler alert" shows a slow buildup in the 1990s (Usenet period mentioned by M.Caution), then a broader buildup in the 2000s.
You can exclude proximate terms with a minus, so for example exclude airplanes like this, which gives a very different curve.
Trouble is, this is just Google Books and neither this new sense of "spoiler" nor "spoiler alert" is all that likely to occur in books. It's more of a blogs and chat groups thing. In fact if you set the smoothing to zero you'll see it looks like we have just a handful of instances over the period.
Another option is to look at Google Trends, which tracks search term usage. Trends data goes back only to 2004, but for "spoiler alert" it appears to show a buildup from zero around 2005. That seems weird since the terminology has clearly been around much longer. Check out this history of the term.
posted by beagle at 10:41 AM on October 7, 2013
Well, I think you'd have to download the ngram data yourself to do Boolean searching (I poked around their interface a bit and couldn't make it work), but then you could do something like ((spoiler OR "spoiler alert") AND (tv OR television OR film OR movie OR book OR fiction OR ending))
posted by unknowncommand at 10:42 AM on October 7, 2013
Just to further complicate things, it is also a term for a part of a car.
posted by Rock Steady at 10:42 AM on October 7, 2013
You may also wish to screen out the use of "spoiler" as a device to enhance cars' aerodynamics.
posted by carmicha at 10:43 AM on October 7, 2013
Hot damn, or you could use +/-! Neato.
posted by unknowncommand at 10:43 AM on October 7, 2013
I'm not finding anything on whether it's possible to do proximity searches
My understanding is, no, it's not. If you downloaded the raw 5-gram data you might be able to jury-rig some sort of search for "X within five words of Y." But they've only got n-grams for n up to five, so "X within the same paragraph as Y" is going to be impossible even working from the raw data. Similarly, the sort of boolean searching that unknowncommand suggests is going to be impossible in this data set, because you can't use it to search for "X in the same document as Y."
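For what it's worth, the jury-rigged "X within five words of Y" idea might look something like the sketch below. It is only a rough illustration, not a tested pipeline: it assumes each line of the raw 5-gram files is tab-separated as ngram, year, match count, volume count, and the sample rows, word lists, and the `cooccurrence_counts` helper are all made up for the example rather than taken from the real data.

```python
from collections import defaultdict

# Toy rows, invented for illustration only (NOT real Google Books counts).
# Assumed layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
SAMPLE_ROWS = [
    "the spoiler on the wing\t1975\t12\t9",
    "a spoiler for the movie\t1995\t3\t3",
    "spoiler of the film ending\t1995\t2\t2",
    "the rear spoiler and wing\t1995\t7\t5",
    "no spoiler in this review\t2005\t4\t4",
]

# Hypothetical word lists; tune these to the senses you want to keep or drop.
CONTEXT = {"tv", "television", "film", "movie", "book", "fiction", "ending"}
EXCLUDE = {"wing", "aircraft", "airplane", "car"}


def cooccurrence_counts(rows, target="spoiler", context=CONTEXT, exclude=EXCLUDE):
    """Sum match counts per year for 5-grams that contain `target`,
    at least one context word, and no excluded word."""
    counts = defaultdict(int)
    for row in rows:
        ngram, year, match_count, _volumes = row.rstrip("\n").split("\t")
        tokens = {t.lower() for t in ngram.split()}
        if target not in tokens:
            continue
        if tokens & exclude:
            continue  # drop the airplane/car senses
        if tokens & context:
            counts[int(year)] += int(match_count)
    return dict(counts)


print(cooccurrence_counts(SAMPLE_ROWS))
```

Note that overlapping 5-grams from the same sentence would each be counted, so totals from something like this are at best a relative signal, and anything farther apart than five words is invisible to it.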
The whole point of a big n-gram dataset like that is that a lot of information gets thrown away: not just the info you'd need in order to do proximity searches or document-level boolean searches, but also for instance punctuation, position within the document, etc. The good news is that this makes it possible to work with the data efficiently. If they hadn't thrown all that info away, running any sort of query at all would be WAY too much work to be feasible. (That's why they do it in the first place!) But the bad news is that it limits what sorts of work you can do.
Often the best you'll be able to do with data like this is get some kind of approximation for what you really want to know. Perfection is rarely possible. Searching on "spoiler alert" or "spoiler warning" may be the best approximation you can manage here, unsatisfying though that is.
(Interestingly, spoil the ending has a big spike in the 1920s, but looking at some examples the sense seems to be "make a big dramatic moment fall flat by doing something stupid or tacky" or "end a happy event in a way that leaves a bad taste in everyone's mouth," not "ruin the suspense by revealing plot details.")
posted by Now there are two. There are two _______. at 10:49 AM on October 7, 2013
I'm not sure this really helps, but... you can add a part-of-speech tag for a term, like "spoiler_ADJ". This might filter for uses where spoiler is used as part of a noun phrase, like "spoiler warning" etc. At least the trend seems to roughly fit what might be called the consensus intuition of what it should look like (climbs in the 80s & 90s).
Unfortunately you can't search Google Books using POS-tagged terms, so there's no way to verify the results.
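If you want to compare several of these queries side by side, a small helper for building viewer links could look like the sketch below. Treat it as a guess at a convenient workflow: the `ngram_url` function is hypothetical, and the query parameter names (`content`, `year_start`, `year_end`, `smoothing`) are assumed from the URLs the viewer itself shows, not from any documented API.

```python
from urllib.parse import urlencode

# Base address of the Ngram Viewer graph page (parameter names below are assumptions).
BASE_URL = "https://books.google.com/ngrams/graph"


def ngram_url(terms, year_start=1950, year_end=2008, smoothing=0):
    """Build a viewer link comparing several comma-separated terms.

    smoothing=0 shows raw yearly values, which makes sparse terms
    like "spoiler alert" easier to judge.
    """
    params = {
        "content": ",".join(terms),
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": smoothing,
    }
    return BASE_URL + "?" + urlencode(params)


# Phrase queries plus part-of-speech-tagged variants of the bare word.
url = ngram_url(["spoiler alert", "spoiler warning", "spoiler_ADJ", "spoiler_NOUN"])
print(url)
```

Putting the tagged and untagged terms on one graph at least lets you see whether the _ADJ curve tracks "spoiler alert" more closely than the bare word does.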
posted by jjwiseman at 11:31 AM on October 7, 2013
Also, that's not the ngram that has the spike for the term being part of an airplane. I know I saw that earlier today but can't re-construct it now. This is embarrassing.
The point is, I'd like to know how to just look at one sense of the term, and I'd like to know how both in terms of method and in terms of talking to Google's ngram viewer.
Thank you.
posted by gauche at 10:17 AM on October 7, 2013