Join 3,512 readers in helping fund MetaFilter (Hide)


How to write an SQL query in unix?
September 13, 2011 9:19 AM   Subscribe

I'd like to perform an SQL query with outer joins, from text files on my PC, using only a unix emulator. Is this possible on my PC? How would I write and execute it?

I need to left-join A to each of B and C, and because of the size of A, Microsoft Access can't handle the joins. The results will contain about 1.5 million rows. There are actually dozens of A tables, and I need to run the same query for each A, then further study the results.

HP Compaq nondescript laptop, Windows XP, cygwin as my emulator, Oracle 9.2 installed. My experience with unix doesn't go much beyond manipulating text files, but I have a good programming background. Each discussion I've found via Google, though, starts well beyond my level, so I'm lost before I begin. Any pointers on how I can learn what I need to know?
posted by RobinFiveWords to Computers & Internet (16 answers total)
 
So, what you are asking is someone to help you load these files into a DB like MySQL and then issue your SQL queries against it. You could totally do it this way. Seems like a lot of work.

However, a more Unix-y way to solve this problem would be to do a small amount of processing on the files, and then use the Unix Join utility. This is an utterly simple and feasible way to solve your problem with Cygwin.
posted by doteatop at 9:25 AM on September 13, 2011


If you could provide a few bits of sample data, say, the first 5 lines of each of A, B, and C, I could probably give you a pointer on how to get going on this.
posted by doteatop at 9:28 AM on September 13, 2011


are the text files well-formatted ? Like CSV or somesuch ? Do you have a decent schema for them ?

If so, it may be possible to script it all using sqlite. (.import the file, then do your query/joins .. Not entirely sure sqlite can do the joins you want, however).
posted by k5.user at 9:32 AM on September 13, 2011 [1 favorite]


Why is it not an option to load these into Oracle?
posted by odinsdream at 9:34 AM on September 13, 2011


If they are delimited,you can do this using OleDb in windows. The second section Here called "using the JET engine" covers it.
posted by Ad hominem at 9:37 AM on September 13, 2011


Set operations using core UNIX utilities
posted by Blazecock Pileon at 9:40 AM on September 13, 2011


It's all CSV. Table A has several alphanumeric "group by" fields and two numeric quantity fields; B has two alphanumeric fields, 300 rows, and serves to map one of the fields in A to a U.S. state code; C has one alphanumeric field, 1600 rows, and serves as a yes/no indicator for another of the fields in A.

@odinsdream, I'm not sure what it means to load into Oracle. As I understand it, Oracle allows me to make ODBC connections to databases on remote servers.
posted by RobinFiveWords at 9:41 AM on September 13, 2011


Oracle is a database engine. It sounds like you're unfamiliar with it, so disregard my suggestion.
posted by odinsdream at 9:45 AM on September 13, 2011


Not sure why you need UNIX and Oracle, but it would be pretty trivial to do with SQLite.
posted by wongcorgi at 9:56 AM on September 13, 2011


if it's all CSV and you have enough memory the open source R language ( http://www.r-project.org/ ) will probably help . you can read both CSVs with two instructions and then call merge() to do the actual join.

be sure to read the output of help(merge) first, as i dont remember whether it will give you an inner or outer join by default.
posted by 3mendo at 10:11 AM on September 13, 2011


If they're CSV files, why not just install MySQL for Windows on the machine and load the data in there and do your queries from within MySQL directly and forget about Access and its limitations? MySQL is a much more robust database software and widely used.
posted by barc0001 at 10:25 AM on September 13, 2011 [3 favorites]


I need to left-join A to each of B and C, and because of the size of A, Microsoft Access can't handle the joins.

I guess that might be too big for Access, but if you have them in tables in Access, you can connect to them as ODBC data sources either in MySQL or Oracle.

But I think you might want to step back and describe the data first. It feels like you've jumped two steps further down the road then you should have. It also sounds a lot like you're making a cross join. If you just "need to run the same query for each A", there should only be as many rows as there are in A to begin with. And what do you mean by "dozens of A tables"?
posted by yerfatma at 10:31 AM on September 13, 2011


if it's all CSV and you have enough memory the open source R language

That could be a big if, with 1.5M records. R has a habit of crapping out early when the memory demands get too high. Try to use UNIX utilities with streamed input to minimize memory usage, or import everything into a SQL database.
posted by Blazecock Pileon at 10:36 AM on September 13, 2011


Never tried this exact scenario, but I'd investigate loading it into a sqlite db, and use your scripting language of choice to manipulate it. Python, although using php would work well. Or visually, you could probably use the Firefox plugin. Not sure how sqlite would handel that much data though.
posted by cgg at 10:39 AM on September 13, 2011


BP: if RAM is indeed an issue one could be splitting the big file into smaller subsets using normal shell tools and then rejoining them with the same tools; it should work if complex aggregates are not needed.
posted by 3mendo at 11:18 AM on September 13, 2011


As I understand it, Oracle allows me to make ODBC connections to databases on remote servers.
At a guess, the laptop has Oracle Data Access Components installed, rather than the full-blown Oracle RDBMS. If you had the RDBMS you could use External Tables.
Assuming you're wedded to doing this under cygwin; then I'd install SQLite as cgg suggests and use the .import functionality to import the csv file that correspond to tables A, B and C. You'd have to pre-define the tables in SQLite based on the column headers of A, B, C before the import, I think.

An alternative if you want a more programmatic method would be to use Perl's DBD::CSV module, although Blazecock Pileon's suggestion sounds fun.

If cygwin isn't necessary I'd suggest you download SQL Server Express and do the same using T-SQL's BULK INSERT .
posted by SyntacticSugar at 1:31 PM on September 13, 2011


« Older I want an espresso maker. I li...   |  What are some cool kids movies... Newer »
This thread is closed to new comments.