Most efficient way to stacking rasters into a NetCDF file.
March 21, 2012 3:39 PM
Quickest, most efficient way of stacking 40,000 GeoTIFF rasters (about 6gb in total) into a NetCDF file?
I'm trying to optimize a series of computations in R, which involves processing a series of rasters, extracting data at a point, and turning it into a time-series for a given location. I discovered loading each raster in individually and extracting the value for a coordinate was stupidly slow, but processing multiple rasters simultaneously as a rasterStack object in memory was orders of magnitude faster.
So now I need to turn my 40,000 rasters into a rasterStack - R has been trying to do this for...23 hours now, and isn't giving me any idea of the progress it has been making. It seems that turning my data into a NetCDF file outside of R so R can use it as a rasterStack may be the way to go, but I'm having trouble working out how to do that.
gdal doesn't quite seem to grasp that I want to make a NetCDF file with 40,000 layers, for example. In any case, it fails when I give it the command:
gdal_translate -of netCDF *.tif output.cdf
Complaining of too many options provided. And the gdal documentation really doesn't give me any clues. So what's the best utility for making NetCDF files?
I'm trying to optimize a series of computations in R, which involves processing a series of rasters, extracting data at a point, and turning it into a time-series for a given location. I discovered loading each raster in individually and extracting the value for a coordinate was stupidly slow, but processing multiple rasters simultaneously as a rasterStack object in memory was orders of magnitude faster.
So now I need to turn my 40,000 rasters into a rasterStack - R has been trying to do this for...23 hours now, and isn't giving me any idea of the progress it has been making. It seems that turning my data into a NetCDF file outside of R so R can use it as a rasterStack may be the way to go, but I'm having trouble working out how to do that.
gdal doesn't quite seem to grasp that I want to make a NetCDF file with 40,000 layers, for example. In any case, it fails when I give it the command:
gdal_translate -of netCDF *.tif output.cdf
Complaining of too many options provided. And the gdal documentation really doesn't give me any clues. So what's the best utility for making NetCDF files?
The "too many options" error probably comes from the operating system, not from the gdal_translate program. If you're using a recent version of Linux, the command "ulimit -s 100000" should raise the limit enough to make this work.
(gdal_translate might still be unhappy with 40000 files though.)
posted by reynaert at 5:20 PM on March 21, 2012
(gdal_translate might still be unhappy with 40000 files though.)
posted by reynaert at 5:20 PM on March 21, 2012
Couple of things you could try: gdal_merge is a nice wrapper especially made for merging large datasets, or investigate GDAL virtual formats - they allow you to specify a whole bunch of external sources as the one composite data set.
posted by scruss at 5:59 PM on March 21, 2012
posted by scruss at 5:59 PM on March 21, 2012
Yeah in this case unfortunately I'm using Windows, and I gathered it was unhappy with me specifying 40,000 command line options. I'll have a look at scruss's suggestions though. I've now set up R to add the layers to the stack one by one so I can monitor it (using raster::addLayer), and there is a noticeable decline in speed as it proceeds. At the start it was adding about 100 layers a minute, now we're up to layer 5,000 and it's managing about 20 layers a minute...
posted by Jimbob at 6:08 PM on March 21, 2012
posted by Jimbob at 6:08 PM on March 21, 2012
Well, using gdal_merge I've managed to stack a year's worth (365) of rasters into a NetCDF (any more than that and I ran into Window's command line limit). This should mean I can, in theory, make 111 annual stacks of 365 days, then stack all the years together. Which is what I'm about to attempt right now...
posted by Jimbob at 7:27 PM on March 21, 2012
posted by Jimbob at 7:27 PM on March 21, 2012
I can't help with gdal questions, but have you looked into other options for feeding R commands? Either Rstudio or a tool like NppTor can help with passing command lines.
posted by stratastar at 12:17 AM on March 22, 2012
posted by stratastar at 12:17 AM on March 22, 2012
Ack, sorry you were talking about just gdal and not R-gdal. Out of my depth here.
posted by stratastar at 12:19 AM on March 22, 2012
posted by stratastar at 12:19 AM on March 22, 2012
Which shell are you using in Windows? If it's bash under Cygwin or something similar, try switching to a plain old cmd shell, which will pass the wildcard-containing filename straight through to the app rather than expanding it to multiple filenames in the shell command line. You might be lucky and find that the app does actually know what to do with a wildcard filename, because Windows has system calls that support the use of these.
posted by flabdablet at 2:40 AM on March 22, 2012
posted by flabdablet at 2:40 AM on March 22, 2012
Also, have you tried using separate tools for stacking and format conversion?
posted by flabdablet at 2:51 AM on March 22, 2012
posted by flabdablet at 2:51 AM on March 22, 2012
If it's bash under Cygwin or something similar, try switching to a plain old cmd shell,
It was the standard command shell - and the gdal_merge.py script didn't like the wildcard, complained that file *.tif didn't exist.
However, thanks for pointing out gdalbuiltvrt - I checked out the VRT format as scruss mentioned above, but hacking together the necessary XML looked like a pain. However, that utility looks like it might be useful.
At the moment I've got both cores on my machine running hot, over night, to see if R can accomplish it. I've got one R process trying to put a stack together all in one go (this process has been running since yesterday), the other one is adding layers to the stack one by one so I can watch the progress, and I'll see if one of them finishes or runs out of memory by the time I get back into to work tomorrow. Failing that, then I'm going to start using all the great tips here to delve into some bare-metal gdal.
posted by Jimbob at 3:07 AM on March 22, 2012
It was the standard command shell - and the gdal_merge.py script didn't like the wildcard, complained that file *.tif didn't exist.
However, thanks for pointing out gdalbuiltvrt - I checked out the VRT format as scruss mentioned above, but hacking together the necessary XML looked like a pain. However, that utility looks like it might be useful.
At the moment I've got both cores on my machine running hot, over night, to see if R can accomplish it. I've got one R process trying to put a stack together all in one go (this process has been running since yesterday), the other one is adding layers to the stack one by one so I can watch the progress, and I'll see if one of them finishes or runs out of memory by the time I get back into to work tomorrow. Failing that, then I'm going to start using all the great tips here to delve into some bare-metal gdal.
posted by Jimbob at 3:07 AM on March 22, 2012
Okay, I have solved the problem, so I'll put it here! First I'm stacking my rasters into years, and I'm stacking them as GeoTIFFs rather than NetCDF.
However, R still doesn't like to read these, reading them in as a RasterStack just results in R sitting there doing nothing for hours on end, even with only 365 layers in a file. But, gdal provides a brilliantly fast command for drilling down a multi-layer raster. If I want to find the values at coordinates 140,-35 in all layers of one of those multi-layer GeoTIFFs, this command will give me results almost instantly.
So then it was just a matter of wrapping this in a loop in R, and grepping the values from the data gdallocationinfo returns:
posted by Jimbob at 6:34 PM on March 22, 2012
gdal_merge -o stack1934.tif -of GTiff -ot Byte -separate daily_1934-01-01.tif daily_1934-01-02.tif....
However, R still doesn't like to read these, reading them in as a RasterStack just results in R sitting there doing nothing for hours on end, even with only 365 layers in a file. But, gdal provides a brilliantly fast command for drilling down a multi-layer raster. If I want to find the values at coordinates 140,-35 in all layers of one of those multi-layer GeoTIFFs, this command will give me results almost instantly.
gdallocationinfo -geoloc stack1915.tif 140 -35
So then it was just a matter of wrapping this in a loop in R, and grepping the values from the data gdallocationinfo returns:
this_lat=-35 this_lon=140 full_list=c() for(year in 1900:2011){ d=system(paste("C:\\OSGeo4W\\bin\\gdallocationinfo -geoloc stack",year,".tif ",this_lon," ",this_lat,sep=""),intern=T) d=d[grep("Value:",d,fixed=T)] d=unlist(strsplit(d,":")) nolist=grep("Value",d,fixed=T) nolist=nolist+1 d=as.numeric(d[nolist]) full_list=c(full_list,d) }Drilling down through 40,000 rasters at a given latitude/longitude now takes less than a minute! Thanks everyone for their help.
posted by Jimbob at 6:34 PM on March 22, 2012
This thread is closed to new comments.
posted by axiom at 4:09 PM on March 21, 2012