Join 3,495 readers in helping fund MetaFilter (Hide)


Help Writing a Simple Python Script
July 7, 2014 11:13 AM   Subscribe

I have thousands of records of variable length from which I need to remove the last four digits. This seems like it should be an easy job for Python, but I keep hitting a wall.

The data is in CSV format, and it can either be rewritten where it lives or become a new file. I have no use for the data with the last four digits of the numbers attached. So, I need to take data lists that look like this "12345671234, 9090901234, 123901234, 6751234" and turn it into "1234567, 909090, 12390, 675". Any thoughts?
posted by stoneweaver to Computers & Internet (11 answers total) 3 users marked this as a favorite
 
Is the data formatted so that each field is ALWAYS an integer number? And do you always want the space after the comma?
posted by foxfirefey at 11:17 AM on July 7


data_list = data_list.split(', ')

data_list = [i[:-4] for i in data_list]

data_list = ', '.join(data_list)
posted by dilaudid at 11:17 AM on July 7 [2 favorites]


I don't know Python, so if that's a requirement this won't be helpful, but I'm not sure if you're just looking for any solution to get this done, which I can help with: since you have the data in CSV already you can do it easily in Google Docs/Excel with (assuming the first piece of data is in A1): =LEFT(A1,LEN(A1)-4) -- and then obviously fill down and paste values over the original data.
posted by brainmouse at 11:18 AM on July 7 [2 favorites]


I think the easiest thing to do is write a new file and for each line in the original file:
", ".join([item.strip()[0:-4] for item in line.split(",")])
which is pretty much what dilaudid did up there looks like though I think dilaudid isn't taking into account the space on the original number, just a bit more compact
posted by foxfirefey at 11:19 AM on July 7


Yes, each field is always an integer number. I don't really care one way or another about the space after the comma, and it's easy enough for me to export it with no space in the data at all.
posted by stoneweaver at 11:22 AM on July 7


I like python but if you're looking to do any other simple transformations of this data it might be easier to transform it into a list with one item per line. In this case all of your bog standard cli tools provide great utility. An example:

echo "12345,23456,34567,45678" | sed 's/,/\n/g' | sed 's/^.*\([0-9]\{4\}$\)/\1/'
2345
3456
4567
5678

Then you could, say, tack on a | sort -g | uniq -c to count each occurrence and the like.
posted by mce at 11:49 AM on July 7


It's hacked together, but it can work as long as each value has more than 4 characters and you have the same number of values per line:

import numpy as np

def outwrite(file,list):
    for jj in list:
        file.write('%i ' %jj)
    file.write('\n')
    return

vals = np.loadtxt('inlist.cat',delimiter=',')
outvals = np.floor(vals/10000.)
f = open('outfile.cat','w')
for ii in outvals:
    outwrite(f,ii)
f.close()
posted by miguelcervantes at 12:33 PM on July 7


I like using the csv package, which automatically converts csv files into lists and handles the parsing for you. The items in the csv (i.e. each thing between commas) come in as strings, which can be a pain when you need numbers but in this case you just want to chop off the last four digits, which is the same as the last four characters.
import csv

# Open input and output files
fin = open("inputfile.csv", "r")
inFile = csv.reader(fin)
fout = open("outputfile.csv", "w")
outFile = csv.writer(fout)

for row in inFile: # main loop
    outRow = []
    for item in row:
        if len(item) > 3: # check item length; not needed if all items are longer than 4 char
            outRow.append(item[:-4])
        else:
            outRow.append("")
    outFile.writerow(outRow)

fin.close()
fout.close()

posted by Homeboy Trouble at 1:17 PM on July 7


Pretty easy with Perl:
perl -pe 's{\d+}{int $&/10000}eg' infile.csv > outfile.csv

posted by nicwolff at 2:29 PM on July 7


If you're using Python 2 on Windows, you want binary mode (rb and wb). You can also tighten up Homeboy Trouble's code and handle exceptions with context managers:
import csv

with open("inputfile.csv", "rb") as in_file, open("outputfile.csv", "wb") as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    for row in reader:
        output_row = []
        for item in row:
            if len(item) > 3:
                output_row.append(item[:-4])
            else:
                output_row.append("")
        writer.writerow(output_row)


posted by djb at 6:34 PM on July 7


The excel trick worked like a charm, and is a good stopgap. When the month slows down for me, I'll be working out the Python script to make next month even better. Thank you all so much!
posted by stoneweaver at 11:54 AM on July 8


« Older A group of 10 of us would like...   |  In the Orange Juice song, Blue... Newer »

You are not logged in, either login or create an account to post comments