Join 3,554 readers in helping fund MetaFilter (Hide)


There's no way out of this, I will need the help of a soft machine to do my rote dirty work.
April 16, 2012 6:41 PM   Subscribe

Matlab textparsing filter: So for a project, I am working with over one hundred years of hourly data collected for water levels. I have all of this collected in a giant, ~35 Mb text file. Toward the last decade or so, there are double entries for water level thanks to semidiurnal tide measurement upgrades, and so there is an irregular number of columns that isn't easy to work with the usual data import strategies. I need your help, citizens of Metafilter, so that I can successfully rig the Matlab textscan command (or something else) to read in this formatted data, and make accommodations for empty value strings! More inside.

So, I have a bunch of lines that look like this

8656483 19000101 00:00 0.329
8656483 19000101 01:00 0.217


But then, around line 841400 or so, I have something like this:

8656483 19970904 11:00 0.622 -.603
8656483 19970904 12:00 0.241 705


I'm looking to form a matrix that looks like this below, but I can't seem to get there with the command that I'm using.

[8656483, 19000101, 00, 0.329, 0;
8656483, 19000101, 01, 0.217, 0;

...

8656483, 19970101, 11, 0.622, -.603;
8656483, 19970101, 12, 0.241, 705;]


Here is the command workflow that I'm trying to use, but to no avail:

>> fid = fopen('~/Desktop/values.txt');
>> C = textscan(fid,'%f %f %2f %f %f', 'delimiter', '', 'EmptyValue', 0);
>> fclose(fid);

When I open the variable C though, I end up with these as contents:


[8656483] [19000101] [0] [0x1 double] [0x1 double]


Is anybody a guru with the textscan command, and can tell me where I've gone astray? Also, if you have a better suggestion, that would also be great. I've tried the data import tool with excel, and I've also tried the import data option in the matlab gui, which scoffed at my file and didn't even load.

For what it's worth, I'm running Matlab R2012a on OS X Lion.
posted by oceanjesse to Computers & Internet (13 answers total) 1 user marked this as a favorite
 
Sorry, that matrix should look like this (minor difference):


[8656483, 19000101, 00, 0.329, 0;
8656483, 19000101, 01, 0.217, 0;

...

8656483, 19970904, 11, 0.622, -.603;
8656483, 19970904, 12, 0.241, 705;]

posted by oceanjesse at 6:44 PM on April 16, 2012


You could do this on the command-line (via Terminal.app in the Utilities folder) with a gawk statement, given input.txt containing your data:

$ gawk --source 'BEGIN {idx = 0} {if (NF == 4) { doThis($0, idx) } else if (NF == 5) { doThat($0, idx) } idx++ }' -f functions.awk input.txt > output.txt

Your functions.awk file might look something like:

function doThis(foo, index) {
  var = sprintf ($1", "$2", %02d, "$4", 0", index);
  print var;
}

function doThat(foo, index) {
  var = sprintf ($1", "$2", %02d, "$4", "$5, index);
  print var;
}


This isn't tested, but I think it is mostly, if not wholly correct code. You could reduce that to one function with a conditional in the gawk statement, but hopefully this gets you started.

I can't remember if OS X comes with gawk by default, but you can easily install it with MacPorts.
posted by Blazecock Pileon at 6:57 PM on April 16, 2012


Why not use a text editor like Notepad++ to split it into two(or more) files, so each file only has a single number of columns?
posted by Confess, Fletch at 6:57 PM on April 16, 2012


It's pretty easy to use a text editor to tack " 0" onto the end of the first 841399 lines. That's what I'd do. E.g., with vim, the command is:

:1,841399s/$/ 0/
posted by qxntpqbbbqxl at 6:59 PM on April 16, 2012


My matlab is really rusty, but I think you can do this with fscanf() . Just open the file and fscanf with the correct arguments including the extra columns. I believe it will fill in zeros with an end of line...you have a lot more control here than with textread

If not that, the trick will be to parse line by line. I think matlab uses fgets() and then you can use sscanf to grab the values from the string.

Personally for this kind of processing I usually do the heavy lifting outside of matlab - use something like a python script to pre-process the data into a matlab friendly format. This will save you lots of time if you need to periodically reload the file unless the matlab text file processing speed has increased considerably since I used it last... [On preview - looks like this is the dominant answer...]

Alternatively, a really simple fix is to split the file into two and process each separately - they are in fact two data sets merged into a single file after all. In the end you will need to deal with the "extra" column of data anyway why not do this up front.
posted by NoDef at 7:05 PM on April 16, 2012


What's wrong with parsing it line by line (aside from the slowness)?

You could set up a loop with

line = fgetl(fid);
while ~feof(fid)
   entries = regexp(line, ' ', 'split');
   
an if or switch statement to handle the entries in each line
   line = fgetl(fid);
end


The basic idea is to read each line, tokenize it with regexp, and use the number of tokens to decide how you want to assign the elements to your giant cell array of doom.
posted by Nomyte at 7:12 PM on April 16, 2012


The problem might just be with "%2f", and Matlab freakin out about the colon. Try replacing that with "%f %f" and adding ":" as a second delimiter?
posted by hammurderer at 7:47 PM on April 16, 2012


I much prefer the xlsread function for inputs. It'll read tab delimited text files and put every entry into its own cell. It'll be easy to tell using probably isnan whether there's an additional column of data for that row or not. The biggest concern I think would be the memory because it would want to load the file all at once. But if you have the original files by year or can break into into chunks, you'll be able to loop over all the files easily enough.
posted by cali59 at 8:58 PM on April 16, 2012


You could also try dlmread with single space as the delimiter or leave it unspecified. The only downside is it will fill unfilled matrix entries with zero.
posted by ssri at 3:57 AM on April 17, 2012


You might post on the mathworks matlab forums.
posted by canine epigram at 4:52 AM on April 17, 2012


I tried the vim command method since it seemed elegant, but it didn't work. My second choice was to take hammurder's suggestion of Giving The Colon Special Treatment.™

This worked splendidly:

fid = fopen('LOCATION OF FILE');
C = textscan(fid,'%f %f %f%s%f %f %f', 'delimiter', ' ', 'EmptyValue', 0);
fclose(fid);


Thanks all!
posted by oceanjesse at 10:35 AM on April 17, 2012


Sorry, actually this was the winner:

C = textscan(fid,'%f %f %2f%1s%2f %f %f', 'delimiter', '\t', 'EmptyValue', -10);
posted by oceanjesse at 10:53 AM on April 17, 2012


I used -10 for a reason irrelevant to the question posted.
posted by oceanjesse at 10:53 AM on April 17, 2012


« Older Does anyone know where to get ...   |  Setting up a parent-proof HTPC... Newer »
This thread is closed to new comments.