I would like to convert 4000 records
August 27, 2008 8:12 AM   Subscribe

I scanned with OmniPage 4,000 records and now have a single column of records saved as a text file each record is separated by a blank line. Each record is composed of 4, 5, 6 or 7 lines with the first 3 or 4 or 5 lines being the address the rest is fax, phone, email..

I would like to convert each record so each line becomes a field (comma delimited) and continue to keep each record separated. Then I can Import this file into MS Access and do what I need to do. I was trying to figure this out in MS Word ‘s search and replace (3 hrs) …no luck for me. Is there a simple program out there that will do this for me or can you tell me how to accomplish this, explained simply. I do not know how to write script. Your time and help is much appreciated. This will really help me out

Much Thanks,
Mauimarty
posted by Mauimarty to Computers & Internet (9 answers total)
 
This would be a really easy perl script, except for the variable line thing. If I understand you, your data is like this:

address
address
address
fax
phone
email

address
address
address
address
phone
email
fax

right?

so if it was converted to csv it would look like this :

address,address,address,fax,phone,email
address,address,address,address,phone,email,fax

so things don't line up. Are there any line ids? Like

Address1: 123 mockingbird lane

or is it just

123 mockingbird lane?
posted by d4nj450n at 8:23 AM on August 27, 2008


Convert the blank line to some dummy value, like "xxx---xxxx---xxx".

Convert every line ending* character to double-quote comma double-quote

Convert the dummy value to a line ending character*

* on MS-Windows the line ending character is really two characters. I don't recall how word represents it, but you'll figure it out.
posted by orthogonality at 8:54 AM on August 27, 2008


Oh. If you have any real double quotes in your data, escape them first.
posted by orthogonality at 8:56 AM on August 27, 2008


Please post a sample (3 or 4 records) of the data.
posted by signal at 10:23 AM on August 27, 2008


Response by poster: Thanks d4nj450n
Your interpretation of the format "right?" is correct.
?... It is 123 mockingbird lane
I need each line to be a field like this "xxxxx", and each record separated by a blank line

Thanks orthogonality
The line ending character is a paragraph mark ^p in MS Word
I will give your suggestion a try and let you know.
No real double quotes to delete
Regards
mauimarty
posted by Mauimarty at 10:39 AM on August 27, 2008


So if you want just the fields next to each other in csv format, with out regard to what they are, I can whip up a perl script that does that and post it here. Of course you have to be able to run perl somehow.

If you need a blank field where the fourth address is if there is not a fourth, and the rest of the fields in the right order, you are pretty much sol. Do it your self or hire a temp to do it is about the only way.
posted by d4nj450n at 11:01 AM on August 27, 2008


Response by poster:

d4nj450n
I did what you suggested and almost problem solved... but no cigar I end up with 1 huge record with multiple fields instead of multiple records with a few fields..

How do I keep the records separated for importing into MS Access. I appreciate everyone's time.

This is the existing text formatted 1 column

address line ending characters are all the same.... paragraph symbol
address line ending character
address line ending character
fax line ending character
phone line ending character
email line ending character
line ending character Blank Line
address
address
address
address
phone
email
fax
posted by Mauimarty at 12:01 PM on August 27, 2008


This works in python:
data = """
address
address
address
fax
phone
email

address
address
address
address
phone
email
fax

address
address
address
address
phone
email
fax
foo
bar

address
address
address
address
phone
email
fax
spam
eggs
"""
out = ""
line = []
for i in data.splitlines():
    if i != "":
        line.append(i)
    else:
        out += ",".join(line)
        out += "\n"
        line = []
out += ",".join(line)
print out
gives:
address,address,address,fax,phone,email
address,address,address,address,phone,email,fax
address,address,address,address,phone,email,fax,foo,bar
If you want, memail me the data and I'll memail you the result.
posted by signal at 1:58 PM on August 27, 2008


Best answer: Woops, it gives:
address,address,address,fax,phone,email
address,address,address,address,phone,email,fax
address,address,address,address,phone,email,fax,foo,bar
address,address,address,address,phone,email,fax,spam,eggs

posted by signal at 1:59 PM on August 27, 2008


« Older Web Marketing/Usability Think Tank Exercises   |   Dumb, drunken credit card problem. Newer »
This thread is closed to new comments.