What's the easiest way to sort a huge list of words by part of speech?
August 14, 2022 5:59 PM

I have a Mac and a Raspberry Pi. I have a text file with a list of a few thousand words. I want the list split into separate lists of all nouns, verbs, adjectives, etc. If a word can be used as multiple parts of speech, it should end up in multiple lists. This is just for a silly project and it only needs to be done once, so the way it gets done can be sloppy or hacky. I'd rather not do it manually since it would take way too long.
posted by 2oh1 to Computers & Internet (5 answers total)
 
https://pypi.org/project/PyDictionary/

This Python library returns a given word's definitions and word type(s).
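For example, something along these lines might work (a rough sketch, assuming your word list is in dict.txt with one word per line; meaning() returns a dict keyed by part of speech like 'Noun' or 'Verb', but it does a web lookup per word, so a few thousand words will be slow):
from PyDictionary import PyDictionary
from collections import defaultdict

dictionary = PyDictionary()

# Collect words into one list per part of speech.
words_by_pos = defaultdict(list)
with open("dict.txt") as f:
    for line in f:
        word = line.strip()
        meanings = dictionary.meaning(word) or {}  # None if the word isn't found
        for pos in meanings:
            words_by_pos[pos.lower()].append(word)

for pos, words in words_by_pos.items():
    with open(f"{pos}.txt", "w") as out:
        out.write("\n".join(words) + "\n")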
posted by kzin602 at 6:12 PM on August 14, 2022 [2 favorites]


This should do the trick:
#!/usr/bin/env python3

import nltk
from collections import defaultdict

wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))

word_pos = defaultdict(list)
for word in open("dict.txt"):
    for pos in wordtags["report"]:
        word_pos[pos.lower()].append(word)


for pos in word_pos:
    with open(f"{pos}.txt", "w") as f:
        f.write("".join(word_pos[pos]))
Requires the NLTK library, and you'll need to download a couple of data files it uses, but it's not too hard to figure out, hopefully.

Reads a file called dict.txt and writes noun.txt, verb.txt, etc.
posted by wesleyac at 6:20 PM on August 14, 2022 [2 favorites]


Response by poster: re: wesleyac

Ooh, that could be perfect! Are the files it needs part of the install, as described here?
posted by 2oh1 at 6:28 PM on August 14, 2022


Best answer: Once you do that, you'll need to run nltk.download('brown') and nltk.download('universal_tagset') in order to get the datasets (it'll tell you to do that if you don't).
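
i.e., something like this in a Python shell (or at the top of the script):
import nltk
nltk.download('brown')             # the tagged Brown corpus
nltk.download('universal_tagset')  # the mapping to the universal POS tags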

Also it comes with the caveat that the dataset is ~1 million words from 1961, so particularly niche or new words may not appear in the dataset, and will be silently ignored.

Also, the script as I wrote it is totally busted; this is one that actually works:
#!/usr/bin/env python3

import nltk
from collections import defaultdict

# Map each (lowercased) word in the Brown corpus to the universal POS tags it appears with.
wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))

# Bucket each word from dict.txt under every part of speech the corpus has seen for it.
word_pos = defaultdict(list)
with open("dict.txt") as dictfile:
    for word in dictfile:
        # lowercase the lookup to match the lowercased corpus keys
        for pos in wordtags[word.strip().lower()]:
            word_pos[pos.lower()].append(word)

# Write one file per part of speech: noun.txt, verb.txt, etc.
for pos in word_pos:
    with open(f"{pos}.txt", "w") as f:
        f.write("".join(word_pos[pos]))

posted by wesleyac at 6:35 PM on August 14, 2022 [1 favorite]


Response by poster: > Also it comes with the caveat that the dataset is ~1 million words from 1961, so particularly niche or new words may not appear in the dataset, and will be silently ignored.

No worries. My word list is very generic and very common English words.
posted by 2oh1 at 6:41 PM on August 14, 2022

