Playlist maker for words
March 22, 2021 10:03 AM

Is there a tool that can take a set of words and output a list of words that are associated with those words (NOT a thesaurus)?

I'm looking for something that can help me brainstorm names/titles/nouns for a game project, where I can say "here are some words, give me some more words like those words".

For example: if I say "obelisk, artifact, codex" those are words that evoke a sense of ancient mystery, despite not really having overlapping meanings. A tool that suggested "palimpsest, crypt, ..." would be the goal.

Another example: "singularity, void, nexus" are words associated with sci fi/astrophysics and related words might be "warp, wormhole".

I've found one site, but that's limited to a single word or phrase input.

It's possible the tool I'm imagining doesn't exist, but maybe someone has harnessed the power of GPT-3 to aid lazy writers?
posted by justkevin to Writing & Language (9 answers total) 4 users marked this as a favorite
Best answer: Do you have any experience with python, or even just the command line? gensim's most_similar can do this if you can download and load a large pretrained model like GloVe. most_similar can take a list of words like this:

In [104]: import gensim.downloader

In [105]: print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']

In [106]: glove_vectors = gensim.downloader.load('glove-wiki-gigaword-100')
[==================================================] 100.0% 128.1/128.1MB downloaded

In [107]: glove_vectors.most_similar(positive=['obelisk', 'artifact', 'codex'])
[('inscription', 0.6273232698440552),
('manuscript', 0.6243857145309448),
('figurine', 0.622911274433136),
('artefact', 0.6198338866233826),
('etruscan', 0.6018654108047485),
('funerary', 0.5979099273681641),
('stele', 0.5959630608558655),
('relic', 0.5920830965042114),
('tomb', 0.591293215751648),
('artifacts', 0.5911427736282349)]

posted by supercres at 10:28 AM on March 22, 2021 [1 favorite]

Also, most_similar takes more arguments like this:

glove_vectors.most_similar(positive=None, negative=None, topn=10)

You can specify a list for "negative" in the same way, like words it should steer away from, and you can return more results with a higher number for "topn".
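To give a sense of what's happening under the hood: most_similar averages the positive word vectors, subtracts the average of the negative ones, and ranks every other word by cosine similarity to that query vector. Here's a minimal pure-Python sketch of that idea using made-up 3-dimensional toy vectors (real models like GloVe use 100-300 dimensions); the words and numbers are just for illustration, not from any actual model:

```python
from math import sqrt

# Toy 3-d embeddings, invented for illustration only.
vectors = {
    "obelisk":  [0.9, 0.1, 0.0],
    "artifact": [0.8, 0.2, 0.1],
    "crypt":    [0.85, 0.15, 0.05],
    "warp":     [0.1, 0.9, 0.2],
    "banana":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(positive, negative=(), topn=10):
    dims = len(next(iter(vectors.values())))
    # Average the positive vectors...
    query = [sum(vectors[w][d] for w in positive) / len(positive)
             for d in range(dims)]
    # ...and steer away from the negative ones by subtracting their average.
    if negative:
        neg = [sum(vectors[w][d] for w in negative) / len(negative)
               for d in range(dims)]
        query = [q - n for q, n in zip(query, neg)]
    # Rank all remaining words by cosine similarity to the query vector.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items()
              if w not in exclude]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]
```

With these toy vectors, most_similar(["obelisk", "artifact"]) puts "crypt" on top, because its vector points the same direction as the average of the two inputs. The real gensim version works on unit-normalized vectors over a vocabulary of hundreds of thousands of words, but the geometry is the same.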
posted by supercres at 10:33 AM on March 22, 2021

At some point in the past I saw a tool like this on the internet, with a nice interactive UI. It was sort of like word webs, or something. You gave it a word and it would show you other words that were associated with the first word in some way, in groups and with proximity. I'll try to dig it up, but I mention it in case someone else here remembers it. It might even have been an FPP on the blue.
posted by Winnie the Proust at 10:52 AM on March 22, 2021

I don't think this is the specific site I remembered, but Word Webs does something along the lines of what you're looking for.

Obelisk yields Inscription, Monument, Invocation, Shape, Form, and Transport. Then you can click on any of those words to extend the web further.
posted by Winnie the Proust at 10:55 AM on March 22, 2021

Oh, I'm sorry. I see that you are looking for multi-word inputs. I don't think Word Webs does that.
posted by Winnie the Proust at 10:57 AM on March 22, 2021

You might try Visuwords; it may only work with single words, but it certainly does a nice job of coming up with alternatives.
posted by ptm at 11:06 AM on March 22, 2021

I tried out a different GloVe model, trained on twitter data, that's a little more chaotic but may be more up your alley 😆:

In [118]: glove_vectors = gensim.downloader.load('glove-twitter-100')
[==================================================] 100.0% 387.1/387.1MB downloaded

In [119]: glove_vectors.most_similar(['obelisk', 'artifact', 'codex'], topn=30)
[('changeling', 0.6311368942260742),
('camelot', 0.6306096315383911),
('pantheon', 0.629768967628479),
('outpost', 0.6262741088867188),
('ampersand', 0.6254072189331055),
('behemoth', 0.6238597631454468),
('montezuma', 0.6165039539337158),
('bastion', 0.6158690452575684),
('persepolis', 0.6139004230499268),
('crucible', 0.6085838079452515),
('tesseract', 0.6073752641677856),
('arabesque', 0.6062545776367188),
('duelling', 0.6061664819717407),
('marauders', 0.6053487062454224),
('panopticon', 0.6039743423461914),
('defiance', 0.6037322282791138),
('crusader', 0.6036583185195923),
('mothership', 0.603611946105957),
...]

Anyway, feel free to memail me if I can steer you in the right direction on setting this up.
posted by supercres at 11:28 AM on March 22, 2021

Response by poster: I don't know Python, but I was able to install it and replicate the results of your examples and see some immediate usefulness.
posted by justkevin at 12:25 PM on March 22, 2021

Cool! Yeah, you'll get different results from all sorts of models for obscure but interesting reasons, mostly relating to what "similarity" means for different models, i.e., are words similar because they show up in the same documents/sentences a lot, because they show up in the same sentence context (i.e. are often substituted for each other), because they're spelled similarly or sound similar... It's a rabbit hole!
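To illustrate one of those other notions of "similarity": models like fastText fold in subword information, so words that share character n-grams look similar even if they never co-occur. A tiny self-contained sketch of that idea, using character trigram overlap (Jaccard similarity) on made-up example words rather than any real model:

```python
def trigrams(word):
    # Pad with boundary markers so short words still produce trigrams.
    w = f"<{word}>"
    return {w[i:i + 3] for i in range(len(w) - 2)}

def spelling_similarity(a, b):
    """Jaccard overlap of character trigrams: shared / total."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

Under this measure "artifact" and "artefact" come out highly similar (they share most of their trigrams), while "artifact" and "obelisk" score zero, even though a co-occurrence model like GloVe would rate the latter pair as related. Different rabbit holes, different neighbors.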
posted by supercres at 2:14 PM on March 22, 2021
