MySQL import question
November 2, 2006 3:15 PM

What is the easiest way to get this data into a MySQL database?

I've got data formatted in a text document as follows...

Name: Joe Smith
Email: joe@smith.com
Address: 1000 Smith Drive
Completed_the_Test: Yes

Name: Jane Jones
Email: jane@jones.com
Address: 1000 Jones Drive
Completed_the_Test: No

etc, etc for about 60 items.

What is the easiest way to get this into a MySQL database without the use of PHPMyAdmin and without typing it in manually? I'm looking to avoid a crazy parsing script but I can write one if need be.

Is there a way to get it into MS Access, export it as a .csv, and then import that into MySQL?

I've got shell access to the MySQL db.

Thanks!
posted by rlef98 to Technology (19 answers total) 1 user marked this as a favorite
 
Well, if this data is formatted very evenly, one quick way is to set up a loop to simply take what's after the first colon, e.g. in Pascal, trim(copy(line,pos(':',line)+1,255)); and with that create a comma-delimited string for each group. I'm afraid that will require a scripting or programming language, but it's nothing complicated at all. But when that's done you can use mysqlimport or the LOAD DATA LOCAL INFILE command to bring it into the database.
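A minimal sketch of that same loop in Python rather than Pascal, assuming the four-field layout from the sample data (the function name is made up):

```python
# Sketch of the take-what's-after-the-first-colon loop: strip each
# label, collect groups of four values, emit one comma-delimited row
# per record. Assumes the input is evenly formatted as in the sample.

def records_to_csv_rows(text):
    rows, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # take everything after the first colon, trimmed
        value = line.split(":", 1)[1].strip()
        current.append(value)
        if len(current) == 4:  # Name, Email, Address, Completed_the_Test
            rows.append(",".join(current))
            current = []
    return rows
```

The resulting rows, written to a file, can then be pulled in with mysqlimport or LOAD DATA LOCAL INFILE as described above.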
posted by chef_boyardee at 3:36 PM on November 2, 2006


If you can manage simple programming, I would do exactly as chef_boyardee says.

If you can't do that, might you be able to accomplish something similar with some well-thought-out "search and replace"?

i.e.
Replace "Name: " with an empty string.
Replace the hard return and "Email: " with a comma.
etc...

Then you'll have a file in comma-delimited format that should import quite easily. That is assuming, of course, there are no commas in the data itself!
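That search-and-replace sequence can also be sketched as a few chained replaces in Python (assuming the four labels from the sample, and again assuming no commas in the data):

```python
# The search-and-replace approach as one function: strip the "Name: "
# label, turn the line break before each remaining label into a comma,
# then drop the blank line between records so each record is one row.

def replace_to_csv(text):
    text = text.replace("Name: ", "")
    text = text.replace("\nEmail: ", ",")
    text = text.replace("\nAddress: ", ",")
    text = text.replace("\nCompleted_the_Test: ", ",")
    # collapse the blank separator lines so each record is one CSV row
    return "\n".join(line for line in text.splitlines() if line)
```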
posted by sanitycheck at 3:52 PM on November 2, 2006


Wait, so you're ruling out phpMyAdmin *and* parsing scripts? What's left, voodoo? Asking it to nicely put itself into the database? I don't get it.

perl -e 'local $/; print "INSERT INTO tablename (name, email, address, completed) VALUES "; $_ = <>; s,\r,,sig; print "(\"$1\", \"$2\", \"$3\", \"$4\"), " while(m,\GName: (.*?)\nEmail: (.*?)\nAddress: (.*?)\nCompleted_the_Test: (.*?)\n+,sig);' input_file.txt

On your sample input this returns:

INSERT INTO tablename (name, email, address, completed) VALUES ("Joe Smith", "joe@smith.com", "1000 Smith Drive", "Yes"), ("Jane Jones", "jane@jones.com", "1000 Jones Drive", "No"),

So just lop off the last "," and change the column/db names to taste.
posted by Rhomboid at 4:16 PM on November 2, 2006 [1 favorite]


The easiest way to do this is to import it into Excel using the very tweakable text import tool. That'll break it up into fields for you (which you can tweak and purify if they're imperfect), and which you can then re-export as csv or whatever. From that point it's trivial to gank it with Access and make database tables, then dump into whatever xSQL db you like...

(I ran the data conversion for Ernst&Young Australia way back in '94 or so when they went to a new practice management system countrywide (thousands of flat file tables with millions of line items in many cases), and I automated Office to do most of the grunt work like this. Worked beautifully.)
posted by stavrosthewonderchicken at 4:27 PM on November 2, 2006


All of these are great answers for different reasons. Thanks guys.

Rhomboid, I guess what I was envisioning was that I could import the file into Access (or something that could parse horizontally aligned data like that) and then export it as a .csv. The MySQL instance would then be able to parse that .csv automatically (something more along the lines of what stavros is talking about).

Understandably that is asking a lot of automation, but I imagine the hurdle of importing non-columnar data like this has been solved and that there are tools for non-hardcore programmers to use. However, your answer is very specific and I will be able to work with it to accomplish my task.

Thanks!
posted by rlef98 at 5:09 PM on November 2, 2006


Er, I mean vertically aligned data... data that is not tabular but still organized.
posted by rlef98 at 5:15 PM on November 2, 2006


rlef98, I see you've already got some good answers, however, I went ahead and spent a minute or two banging out a quick Perl script that parses files in your custom format and outputs a CSV file. It correctly quotes fields with commas in them and escapes embedded quotes. It also prints some crude warnings if it runs across improperly formatted records in the input file.
posted by RichardP at 5:26 PM on November 2, 2006


I was about to throw together a python script, but it appears that others have beaten me to the punch. But in the future, just be aware that going from text -> access -> csv -> mysql is redundant to say the least. Scripts were invented for problems like this.
posted by gsteff at 5:35 PM on November 2, 2006


Not that this problem needs any more solving, but what popped into my head immediately when I saw the layout of the data is that Ruby on Rails can use YAML to initialize a database, and the default database for RoR is MySQL. In addition, the format of the data the OP is using is pretty much YAML. So it would be trivial to take that text file and import it if you had a RoR setup.
posted by farmersckn at 6:05 PM on November 2, 2006


I do this a lot. I use Python to "rationalize" the text file into a columnar format. I import the result into a spreadsheet, then review the results. This review step is important because there always seem to be a few outliers that don't quite follow the same format as everyone else.

Once the data is cleaned up, then and only then do I import into a database.
posted by SPrintF at 6:50 PM on November 2, 2006


Well, here's a dumbass way to do it:

Use the load command to load each line as a row:

create table foo ( id int unsigned not null auto_increment primary key, line varchar(255) );

load data infile 'INPUT_FILE_NAME' into table `foo`
lines terminated by '\n'
( line ) ;

Now, create another table for the actual data:

create table for_reals ( id int unsigned not null auto_increment primary key, name varchar(255), email varchar(255), address varchar(255), completed_test bool null ) ;

Now for the dumbassedly clever part:

select *
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: '

gives us all the lines starting with 'Name: ', yes?

So,
select substring( a.line, char_length( 'Name: ' ) + 1 )
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: '

gives us all the names, with the prefix 'Name: ' removed, yes?


Now, we can assume that the auto-increment increases by 1 for each row, so we add a correlated subquery to get the email address; we add a check to make sure the line actually is prefixed with 'Email: ':

select substring( a.line, char_length( 'Name: ' ) + 1 ),
( select substring( b.line, char_length( 'Email: ' ) + 1 ) from foo b where b.id = a.id + 1
and substring( b.line, 1, char_length( 'Email: ' ) ) = 'Email: ' )
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: '


We add additional subselects for the remaining lines, and a case statement to convert the Completed_the_Test lines; note that we can re-use the correlation name 'b', as each 'b' is independent of each other 'b':

select substring( a.line, char_length( 'Name: ' ) + 1 ),
( select substring( b.line, char_length( 'Email: ' ) + 1 ) from foo b where b.id = a.id + 1
and substring( b.line, 1, char_length( 'Email: ' ) ) = 'Email: ' ),
( select substring( b.line, char_length( 'Address: ' ) + 1 ) from foo b where b.id = a.id + 2
and substring( b.line, 1, char_length( 'Address: ' ) ) = 'Address: ' ),
( select case when substring( b.line, char_length( 'Completed_the_Test: ' ) + 1 ) = 'Yes' then true
when substring( b.line, char_length( 'Completed_the_Test: ' ) + 1 ) = 'No' then false else null end
from foo b where b.id = a.id + 3
and substring( b.line, 1, char_length( 'Completed_the_Test: ' ) ) = 'Completed_the_Test: ' )
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: '


Finally, we insert that select into our real table:

insert into for_reals ( name, email, address, completed_test )
select substring( a.line, char_length( 'Name: ' ) + 1 ),
( select substring( b.line, char_length( 'Email: ' ) + 1 ) from foo b where b.id = a.id + 1
and substring( b.line, 1, char_length( 'Email: ' ) ) = 'Email: ' ),
( select substring( b.line, char_length( 'Address: ' ) + 1 ) from foo b where b.id = a.id + 2
and substring( b.line, 1, char_length( 'Address: ' ) ) = 'Address: ' ),
( select case when substring( b.line, char_length( 'Completed_the_Test: ' ) + 1 ) = 'Yes' then true
when substring( b.line, char_length( 'Completed_the_Test: ' ) + 1 ) = 'No' then false else null end
from foo b where b.id = a.id + 3
and substring( b.line, 1, char_length( 'Completed_the_Test: ' ) ) = 'Completed_the_Test: ' )
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: '
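The whole load-then-pivot idea above can be tried out without a MySQL instance; here is a minimal sketch using Python's sqlite3 module (sqlite's substr/length stand in for MySQL's substring/char_length; the function name is made up, the table name is the thread's own):

```python
# The all-in-SQL approach: load each raw line as a row keyed by an
# auto-increment id, then pull each group's fields out with correlated
# subqueries anchored on the 'Name: ' line, matching siblings by id offset.
import sqlite3

def load_and_pivot(lines):
    con = sqlite3.connect(":memory:")
    con.execute("create table foo "
                "(id integer primary key autoincrement, line text)")
    # skip blank separator lines so ids within a group are consecutive
    con.executemany("insert into foo (line) values (?)",
                    [(l,) for l in lines if l.strip()])
    return con.execute("""
        select substr(a.line, length('Name: ') + 1),
               (select substr(b.line, length('Email: ') + 1) from foo b
                 where b.id = a.id + 1
                   and substr(b.line, 1, length('Email: ')) = 'Email: '),
               (select substr(b.line, length('Address: ') + 1) from foo b
                 where b.id = a.id + 2
                   and substr(b.line, 1, length('Address: ')) = 'Address: ')
          from foo a
         where substr(a.line, 1, length('Name: ')) = 'Name: '
    """).fetchall()
```

Note this sketch inherits the same assumption flagged below: that ids increase by exactly 1 within a group.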
posted by orthogonality at 9:25 PM on November 2, 2006


Note: there are a few syntax errors, and a couple of intent errors in the above, but you'll have to sort them out yourself. ;)
posted by orthogonality at 10:00 PM on November 2, 2006


orthogonality writes "Now, we can assume that the auto-increment increases by 1 for each row, so we add a correlated subquery to get the email address; we add a check to make sure the line actually is prefixed with 'Email: ':"


What if your database doesn't increase the auto-increment by exactly one? So long as the auto-increment is monotonically increasing, no problem. Add a view that finds the starting id of each group, and the starting id of the next group:

create view foo_start_id as
select a.id as group_start_id
from foo a where substring( a.line, 1, char_length( 'Name: ' ) ) = 'Name: ' ;

create view foo_start_id_and_next_start_id as
select a.*,
coalesce( ( select min( b.group_start_id ) from foo_start_id b
where b.group_start_id > a.group_start_id ), ( select max( id ) + 1 from foo ) ) as next_start_id
from foo_start_id a ;

Now modify the where clauses that specify "id + N" to be "id < next_start_id".
posted by orthogonality at 10:36 PM on November 2, 2006


And the modification above will work even when your groups have their lines in varying order, so long as the first line in each group is the Name line. Left as an exercise: how to deal with varying order where the first line also varies.
posted by orthogonality at 10:38 PM on November 2, 2006


Wow, some more great answers. It's interesting to see the different ways to solve a problem.

I gave some more effort to converting to .csv using a non-scripty method. Using MS Excel you can import deftly enough to separate the label into a different column than the value. Then you can select the values of a particular set (the Joe Smith values, for example), copy them, move to an empty row, Paste Special, and if you check "Transpose" it will lay the values out across a series of columns. I toyed with a macro to automate this process across all the sets of data (to grab Jane Jones and so on), but I ran into a bit of a wall: when writing the For...Next loop, the VB script editor is not very keen on letting me say "ColumnA:rownumber", "ColumnA:rownumber+1", etc. By this point I was writing an entire script, which is what I was trying to avoid in the first place.

In the end I went with RichardP's perl script and was able to easily hack it to work with some extra columns I added... this produced a very clean .csv file. Working on getting that into the mysql db right now, but that should be easy enough.

Orthogonality, that is a pretty crazy and creative way to attack the problem. I like that it is done entirely at the SQL level, using the char_length function as the scissors. I will toy with this method a bit in the future, but I imagine debugging could get pretty, ahem, interesting.
posted by rlef98 at 10:44 PM on November 2, 2006


Working on getting that into the mysql db right now, but that should be easy enough.

I'm unfamiliar with MySQL's import tools, but if MySQL lacks a CSV import tool it should be trivial to modify the script I wrote earlier to output SQL insert statements that can be used to directly load the data into the appropriate DB table.
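What that modification might look like, sketched in Python (hypothetical table and column names; naive quoting by doubling single quotes, which MySQL accepts — a real loader should use the client library's own escaping):

```python
# Turn CSV rows into SQL INSERT statements, one per row, instead of
# relying on a CSV import tool. Table/column names here are made up.
import csv
import io

def csv_to_inserts(csv_text, table="results"):
    stmts = []
    for row in csv.reader(io.StringIO(csv_text)):
        # quote each field, escaping embedded single quotes by doubling
        values = ", ".join("'" + field.replace("'", "''") + "'"
                           for field in row)
        stmts.append("INSERT INTO %s (name, email, address, completed) "
                     "VALUES (%s);" % (table, values))
    return stmts
```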
posted by RichardP at 11:11 PM on November 2, 2006


"...etc, etc for about 60 items. ..."

I could have keyed 60 records of the type the OP described in the time the various people contributing to this thread have taken to describe how to avoid keying the data. Seriously, sometimes doing a little typing is the straightest way to the finish line.
posted by paulsc at 2:12 AM on November 3, 2006


"As somebody said, I'll happily spend six hours figuring
out how to do a three-hour job in ten minutes."

As far as I'm concerned that's the programmer's motto :)
posted by Skorgu at 9:02 AM on November 3, 2006 [1 favorite]


paulsc, eh, in the future I could be faced with data of the same type with a thousand or more entries. I was more interested in learning how and why the various solutions worked versus getting the 60-item example to work in itself.
posted by rlef98 at 11:44 AM on November 3, 2006


This thread is closed to new comments.