What's the easiest way to sort a huge list of words by part of speech?
August 14, 2022 5:59 PM
I have a Mac and a Raspberry Pi. I have a text file with a list of a few thousand words. I want the list split into separate lists of all nouns, verbs, adjectives, etc. If a word can be used as multiple parts of speech, it should end up in multiple lists. This is just for a silly project and it only needs to be done once, so the way it gets done can be sloppy or hacky. I'd rather not do it manually since it would take way too long.
This python library returns a given word's definition and word type(s).
posted by kzin602 at 6:12 PM on August 14, 2022 [2 favorites]
This should do the trick:

#!/usr/bin/env python3
import nltk
from collections import defaultdict

wordtags = nltk.ConditionalFreqDist(
    (w.lower(), t)
    for w, t in nltk.corpus.brown.tagged_words(tagset="universal")
)
word_pos = defaultdict(list)
for word in open("dict.txt"):
    for pos in wordtags["report"]:
        word_pos[pos.lower()].append(word)
for pos in word_pos:
    with open(f"{pos}.txt", "w") as f:
        f.write("".join(word_pos[pos]))

Requires the NLTK library, and you'll need to download a couple of files it needs, but it's not too hard to figure out, hopefully. Reads a file called dict.txt and writes noun.txt, verb.txt, etc.
posted by wesleyac at 6:20 PM on August 14, 2022 [2 favorites]
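For reference, the universal tagset collapses Brown's detailed tags into about a dozen coarse categories, so the output files will be along the lines of noun.txt, verb.txt, adj.txt, adv.txt, pron.txt, det.txt, adp.txt, num.txt, conj.txt, prt.txt, and x.txt (punctuation gets the tag "."). A quick way to list the exact set, assuming the corpus downloads described below are done:

import nltk

# Every distinct universal POS tag in the Brown corpus; lowercased,
# these become the output file names (noun.txt, verb.txt, ...).
tags = {t for _, t in nltk.corpus.brown.tagged_words(tagset="universal")}
print(sorted(tags))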
Response by poster: re: wesleyac
Ooh, that could be perfect! Are the files it needs part of the install, as described here?
posted by 2oh1 at 6:28 PM on August 14, 2022
Best answer: Once you do that, you'll need to run nltk.download('brown') and nltk.download('universal_tagset') in order to get the datasets (it'll tell you to do that if you don't).

Also it comes with the caveat that the dataset is ~1 million words from 1961, so particularly niche or new words may not appear in the dataset, and will be silently ignored.
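Spelled out, that one-time setup is just the following (a minimal sketch; by default NLTK caches the data under ~/nltk_data):

import nltk

# One-time downloads: the tagged Brown corpus and the mapping from
# Brown's tags to the universal tagset.
nltk.download('brown')
nltk.download('universal_tagset')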
Also, the script as I wrote it is totally busted; this is one that actually works:

#!/usr/bin/env python
import nltk
from collections import defaultdict

# Map each lowercased word in the Brown corpus to the universal POS
# tags it has been seen with.
wordtags = nltk.ConditionalFreqDist(
    (w.lower(), t)
    for w, t in nltk.corpus.brown.tagged_words(tagset="universal")
)

# Bucket each input word under every part of speech it can be.
word_pos = defaultdict(list)
for word in open("dict.txt"):
    for pos in wordtags[word.strip()]:
        word_pos[pos.lower()].append(word)

# Write one file per part of speech: noun.txt, verb.txt, etc.
for pos in word_pos:
    with open(f"{pos}.txt", "w") as f:
        f.write("".join(word_pos[pos]))
posted by wesleyac at 6:35 PM on August 14, 2022 [1 favorite]
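As a quick sanity check that a word with several parts of speech lands in several lists, you can look up a single word's tag distribution. This is illustrative only; it rebuilds the same table as the script above, and "report" occurs in Brown as both a noun and a verb:

import nltk

# Same word -> tag frequency table as in the script above.
wordtags = nltk.ConditionalFreqDist(
    (w.lower(), t)
    for w, t in nltk.corpus.brown.tagged_words(tagset="universal")
)
print(sorted(wordtags["report"]))  # expected to include 'NOUN' and 'VERB'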
Response by poster: > Also it comes with the caveat that the dataset is ~1 million words from 1961, so particularly niche or new words may not appear in the dataset, and will be silently ignored.
No worries. My word list is very generic and very common English words.
posted by 2oh1 at 6:41 PM on August 14, 2022