Help me do my job with minimal effort
October 17, 2010 10:36 AM

I have two txt files with single-word lines. How can I create an automated script that will find lines that match between the two, and append a number to the end of the line for each word in one of the files?

File A has 270000 lines with basically all the non-rare words in my language arranged alphabetically, and file B is a list of common words with 1000 lines. I need to find all the words from the latter file in the first file, and add the number "2" to the end. I can use OS X or Windows applications, and probably Unix applications through MacPorts. I apologize in advance if this question isn't very interesting, but I'm sure there's a better way than doing it manually and thought someone might enjoy the challenge.
posted by Aiwen to Computers & Internet (19 answers total)
 
I have no doubt that OS X has command line tools to do this on its own, and I'm pretty sure it comes with Python - but here's how I'd do it in R.

Download R here.

Read in your data using the read.delim function - I think the default arguments should be fine.

#Read in the two lists
longlist <- read.delim(file = "C:/whereveritis.txt")
shortlist <- read.delim(file = "C:/thisonetoo.txt")

#Append 2 if the word is in the short list
lastlist <- ifelse(longlist %in% shortlist, yes = paste(longlist, "2", sep = ""), no = shortlist)

#Write back out to a file
write.table(lastlist, file = "whereyouwantittogo.txt"

Here's a toy example:
x <- c("a", "b", "c", "d")
y <- c("b", "c")

x2 <- ifelse(x %in% y, yes = paste(x, "2", sep = ""), no = x)
x2
[1] "a" "b2" "c2" "d"


Probably not the most mainstream way to do it, but there you go.
posted by McBearclaw at 10:55 AM on October 17, 2010


Of course, I can't help but wonder if this is an intermediate step in an overall process that could benefit from automation elsewhere. What are you working on, if I may ask?
posted by McBearclaw at 10:56 AM on October 17, 2010


Open up Terminal and run (substituting filenames as appropriate):

fgrep -x -f common_words.txt all_words.txt | sed 's/$/ 2/' > result.txt

Does that do the right thing?
posted by Serf at 10:59 AM on October 17, 2010 [1 favorite]


#!/bin/bash

wordlist = ~/small_list.txt
biglist = ~/biglist.txt
outputlist = ~/output.txt

while read -r word; do
  $(grep -F $word $biglist) && echo ${word}2 >> $outputlist
done < "$wordlist"

posted by rhizome at 11:05 AM on October 17, 2010


Response by poster: Wow, that was a fast response, McBearclaw! Thanks! I'm currently downloading R (slow internet connection), and will be trying it out. Judging from my interpretation of the script you've presented, though, I'm not sure I was specific enough in my question: I still need the file with 270000 words to be intact, just with the number 2 added to the end of the words that also occur in file B.

I'm living far away from home in the orient, and using my (fairly rare) native language as an asset is the easiest way to make money, since my university degree won't get me any real jobs. I've been tasked to do the very tedious job of going through an extensive list of most every word in my language (the 270000 words is actually just the first of two parts), adding a "2" to the end of very common words, "0" to nonexistent or misspelled words, and "1" for basically everything else. 0s are very rare, so I've just been going through the text using a simple keyboard macro (1, down arrow, ctrl-right arrow [for end-of-line]) while scanning through the words looking for incorrect ones as they go by. Determining which words need to be classified as "2" takes more effort, but Google presented a file with the 10000 most common words (of which I want to use the upper 1000) based on a corpus of 150 million words, which should be much more accurate than my brain.

This isn't exactly how I was asked to perform the task, but the result should be what matters.

One more difficulty: file B is all lower-case, while file A is case sensitive.

On preview:
Serf, that yielded a blank text file.
rhizome, I'll try to figure out how to run bash scripts and get back to you
posted by Aiwen at 11:38 AM on October 17, 2010


rhizome, I'll try to figure out how to run bash scripts and get back to you

If you're on a mac, just copy and paste that code (changing the filenames and such to the real ones) to a terminal prompt. You can also paste it into e.g. fooscript.sh and run "bash fooscript.sh"

Here's a one-line version (backslash for readability, you can remove it and put everything on one line if you want):
while read -r word; do $(grep -F $word /path/to/biglist.txt) && \
echo ${word}2 >> /path/to/output.txt; done < /path/to/small_wordlist.txt

posted by rhizome at 12:08 PM on October 17, 2010


Response by poster: Thanks for your help, rhizome, but I'm not getting it to work

bash2.sh is the one-line version

me:test me$ bash bash2.sh
bash2.sh: line 1: Binary: command not found
bash2.sh: line 1: Binary: command not found
bash2.sh: line 1: Binary: command not found
bash2.sh: line 1: Binary: command not found
bash2.sh: line 1: Binary: command not found
bash2.sh: line 1: Binary: command not found
me:test me$

bash.sh = the first one you gave me

me:test me$ bash bash.sh
bash.sh: line 3: wordlist: command not found
bash.sh: line 4: biglist: command not found
bash.sh: line 5: outputlist: command not found
bash.sh: line 9: : No such file or directory
me:test me$

McBearclaw, when I ran your script in R I first got an "unexpected symbol" error; after I closed the parenthesis in the last line I got this error:

> write.table(lastlist, file = "/[deleted]/whereyouwantittogo.txt")
Error in inherits(x, "data.frame") : object 'lastlist' not found

Doing it by hand doesn't seem like such a hard job now. God, how tedious though.

Thanks for all your help; I do appreciate it and feel I am learning quite a bit!
posted by Aiwen at 12:28 PM on October 17, 2010


Thanks for the clarification. I think my script is doing what you need - appending "2" to the end of the words - but I'm not certain. The easiest way is to run it and find out. In order to deal with the case issue, I'd just modify it thusly:

longlist <> shortlist <>
... which is probably better, anyway. But rhizome's bash script looks better still.
posted by McBearclaw at 12:28 PM on October 17, 2010


Ugh. That was supposed to be:

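# Keep longlist in its original case; lower-case only the short list
longlist <- readLines(con = file("C:/whereveritis.txt"))
shortlist <- tolower(readLines(con = file("C:/thisonetoo.txt")))

# Compare in lower case so "One" matches "one", but keep the original spelling in the output
lastlist <- ifelse(tolower(longlist) %in% shortlist, yes = paste(longlist, "2", sep = ""), no = longlist)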

write.table(lastlist, file = "whereyouwantittogo.txt")

I didn't realize just how dependent I've become on syntax highlighting. Anyway, give that a shot and let me know how it goes.
posted by McBearclaw at 12:36 PM on October 17, 2010



#!/usr/bin/perl -CSDAL
#
# Usage: PROGNAME big_list.txt small_list.txt > output.txt
# -CSDAL for unicode.
use strict;
use warnings;

# Read the big/small list file names
my $bigfile = shift @ARGV;
my $smallfile = shift @ARGV;

# Read the small list file, map to UPPER CASE, remove line ending
my @small = do { local @ARGV = $smallfile; <> };
@small = map { uc $_ } @small;
chomp @small;

# Make a lookup hash
my %small;
$small{$_} = undef for @small;

# Read through the big list file
@ARGV = $bigfile;
while (<>) {
    chomp;            # Get rid of line ending
    print;            # print without any number following and no line ending
    my $big = uc $_;  # make it UPPER CASE
    if (exists $small{$big}) {                 # if it is in small list
      print " 2";                         # add a " 2" to the end.
    }
    print "$/";       # print the line ending
}
Pretty much the same thing. If you put it in a file program.pl you can then do perl program.pl bigfile.txt small_file.txt > output.txt, or you can chmod +x program.pl and then just do ./program.pl bigfile.txt small_file.txt > output.txt.

This handles files with different case, and should handle Unicode depending on your locale settings.
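
For comparison, the same lookup idea can be sketched in Python. This is only an illustration, not a port of the Perl script: `tag_common` is a made-up name, the sample words are hypothetical, and the words are assumed to be already decoded and stripped of line endings.

```python
def tag_common(big_words, small_words, suffix=" 2"):
    """Return big_words with suffix appended to entries found in small_words.

    Matching ignores case; the original spelling is kept in the output.
    """
    # A set gives fast membership tests, much like the Perl lookup hash.
    lookup = {word.upper() for word in small_words}
    return [
        word + suffix if word.upper() in lookup else word
        for word in big_words
    ]


if __name__ == "__main__":
    # Hypothetical sample data: a long mixed-case list and a short lower-case list.
    for line in tag_common(["One", "Two", "Three", "Four"], ["three", "one"]):
        print(line)
```

Upper-casing only for the comparison is what keeps the long list's original capitalization intact in the output.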
posted by zengargoyle at 12:45 PM on October 17, 2010


Oh, you can just download the Perl version here on github. Mac OS X should have Perl installed by default I believe.
posted by zengargoyle at 12:53 PM on October 17, 2010


Response by poster: McBearclaw: unfortunately, that is still giving me

> write.table(lastlist, file = "/Users/test/whereyouwantittogo.txt")
Error in inherits(x, "data.frame") : object 'lastlist' not found
>

zengargoyle: The output.txt appears to be an exact copy of the big file. Could this be a problem with the perl interpreter in OS X?

I've come to realize this must be hard when you can't test the result for yourselves. I've uploaded the files to RapidShare (less than one MB compressed), but I'd rather not post the link here. Memail me if you'd like to see it.
posted by Aiwen at 1:14 PM on October 17, 2010


Best answer: Oh! I see. The files have different encodings -- the big one is UTF-16, the little one is Latin-1. I also misunderstood the task. Try this (save to a .py file, then "python whatever_you_called_it.py"):

#!/usr/bin/env python

OUTPUT_ENCODING='utf-8'

with open('smallfile.txt', 'r') as small:
  small_words = set(small.read().decode('iso-8859-1').split())

with open('bigfile.txt', 'r') as big:
  big_words = big.read().decode('utf-16').split()

with open('result.txt', 'w') as result:
  for word in big_words:
    if word in small_words:
      result.write('%s2\n' % word.encode(OUTPUT_ENCODING))
    else:
      result.write('%s\n' % word.encode(OUTPUT_ENCODING))
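
For anyone on Python 3, where strings read from a text file have no .decode() method, roughly the same thing can be written by passing the encodings to open(). This is a sketch rather than the script above: the function name `mark_common_words` is made up, the default encodings are the ones mentioned in this thread, and lower-casing for the comparison is an added assumption based on the small file being all lower-case.

```python
def mark_common_words(big_path, small_path, result_path,
                      big_encoding="utf-16", small_encoding="iso-8859-1",
                      output_encoding="utf-8"):
    """Copy big_path to result_path, appending "2" to words listed in small_path."""
    # In Python 3, open() decodes the bytes while reading when given an encoding.
    with open(small_path, encoding=small_encoding) as small:
        small_words = {word.lower() for word in small.read().split()}

    with open(big_path, encoding=big_encoding) as big:
        big_words = big.read().split()

    with open(result_path, "w", encoding=output_encoding) as result:
        for word in big_words:
            # Compare in lower case (assumption), but write the word as it appeared.
            marker = "2" if word.lower() in small_words else ""
            result.write(word + marker + "\n")
```

It could then be called as mark_common_words("bigfile.txt", "smallfile.txt", "result.txt").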

posted by Serf at 2:55 PM on October 17, 2010


Well, it's likely then that either your short list doesn't have any words in common with the long list, or the data is not clean (maybe there are spaces or tabs on the lines with the words), or you are not being clear about what you want. Mine (at least for me) takes a long list of words:
One
Two
Three
Four
and a short list of words:

three
one
and will give the long list with a '2' appended to words that exist in the short list:
./program.pl long.txt short.txt
One 2
Two
Three 2
Four
Since 'one' and 'three' are in the short list, their matching lines in the long list get a '2'. About the only way I can see this not working is if your data has extra whitespace that you're not noticing: "Bob" will match "bob", but "bob " or "Bob " (with a trailing space) will not. Or your files may have different line endings that your editor is hiding from you. For example, Linux uses "\n", Windows uses "\r\n", and classic Mac OS used "\r" (Mac OS X uses "\n"). If you memail me the links I'll take a look.
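
One way to rule out those invisible differences is to normalize every line before comparing. A minimal Python sketch, assuming the text has already been decoded; `clean_lines` is a made-up helper name and the sample string is hypothetical:

```python
def clean_lines(text):
    """Split text into words, one per line, tolerating messy input."""
    cleaned = []
    # str.splitlines() copes with LF, CRLF and bare CR line endings alike.
    for line in text.splitlines():
        # str.strip() drops leading/trailing spaces and tabs.
        word = line.strip()
        if word:  # skip blank lines
            cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample with a trailing space, mixed line endings and a tab.
    messy = "Bob \r\nbob\rAlice\t\n\n"
    print(clean_lines(messy))
```

Running both word lists through the same cleanup before matching takes whitespace and line endings out of the picture.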

Pretty much any of the proposed solutions would work if it were not for the difference in case and keeping the case correct in the output. I know I tested mine with 1000 random words from my system's dictionary and everything was peachy.
posted by zengargoyle at 2:56 PM on October 17, 2010


Heh, or you could just convert your data. :)

iconv -f utf-16 -t utf-8 bigfile.txt > fixed_bigfile.txt
iconv -f iso-8859-1 -t utf-8 smallfile.txt > fixed_smallfile.txt
iconv -f utf-8 -t $PICKONE output.txt > unfixed_output.txt

Character encodings bring back bad dreams from my last run-in with SHIFT-JIS...
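
If iconv isn't handy, the same kind of conversion can be sketched in Python 3. `convert_encoding` is a made-up name, and the encodings passed to it are whatever your files actually use (utf-16 and iso-8859-1 in this thread):

```python
def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Rewrite src_path as dst_path, changing only the character encoding."""
    # newline="" turns off newline translation, so line endings are preserved.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

For example, convert_encoding("bigfile.txt", "fixed_bigfile.txt", "utf-16") would produce a UTF-8 copy of the big file.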
posted by zengargoyle at 3:15 PM on October 17, 2010


The lists are sorted you say?
sort rare-words.txt common-words.txt \
| uniq -c \
| awk '{print $2, $1}'
If the cases are not the same between the files, you can add a tr to adjust everything to lowercase and sort -f to ignore case:
sort -f file-a file-b \
| tr 'A-Z' 'a-z' \
| uniq -c \
| awk '{print $2, $1}'
If you're ok with the fields being reversed (the count, then the word), you can remove the awk call.
posted by autopilot at 3:23 PM on October 17, 2010


cat file1.txt file2.txt | sort | uniq -c | grep " 2 "

I.e., merge both files together, sort them so identical words appear next to each other, run uniq with -c to count the occurrences, then keep only the words that appear twice (i.e., in both files).

Works for 3 and 4 etc files too.
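
The counting idea can also be sketched in Python with collections.Counter. `words_in_all` is a made-up name; unlike the pipeline, this version de-duplicates each list first, so a word repeated within one file is not mistaken for a match.

```python
from collections import Counter


def words_in_all(*word_lists):
    """Return the sorted words that occur in every one of the given lists."""
    counts = Counter()
    for words in word_lists:
        # set() ensures each list contributes at most 1 to a word's count.
        counts.update(set(words))
    # A word present in every list has a count equal to the number of lists.
    return sorted(word for word, n in counts.items() if n == len(word_lists))


if __name__ == "__main__":
    # Hypothetical sample lists standing in for the two files.
    print(words_in_all(["a", "b", "c"], ["b", "c", "d"]))
```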
posted by lundman at 7:12 PM on October 17, 2010


You might also be interested in the standard comm utility.
posted by harmfulray at 8:35 PM on October 17, 2010


Response by poster: Yay Serf, that worked! Don't know how I neglected to notice that the big file was UTF-16. Duh. Probably the cause of all the problems.

Thanks to everyone who contributed!
posted by Aiwen at 10:33 PM on October 17, 2010

