Organizing images and graphs
July 16, 2009 10:14 AM

Any advice on the best way of organising images and graphs for a PhD student?

I'm interested to learn what strategies people use for organizing images, graphs and diagrams over the course of a long research project. I find that I spend a fair amount of time reformatting images for various purposes, such as documents, presentations and posters.

At the moment I am writing a paper using LaTeX, so I want all my graphs to be .eps files of a certain size. But I can't figure out the best font size, image size, or line weight to use. Do people use a personal 'style guide' for each case? Are there any standards out there?

I am creating my images from Matlab, R, ArcGIS and Excel.

I'd like to develop a good strategy now so that I can still use the diagrams and graphs I have now in a few years' time.

posted by a womble is an active kind of sloth to Education (14 answers total) 4 users marked this as a favorite
Are there any standards out there?

Look through a few of the papers in your field. That should give you a good idea of what most people are doing.

There are big advantages to using something like R, because if you need to change an image later, it's just a matter of editing a line or two in your code and re-generating the image.
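As a toy illustration of that idea (sketched here in Python writing raw SVG rather than R, with made-up style values), keep shared style choices in one place so a single edit plus a re-run updates every figure:

```python
# Sketch only: shared style constants live in one place, and every figure
# is regenerated from code. Python + raw SVG stand in for the poster's R.
STYLE = {
    "line_width": 2,   # edit once, re-run, and every figure updates
    "font_size": 12,
    "color": "#1b6ca8",
}

def save_line_figure(points, path):
    """Write a minimal SVG line plot using the shared STYLE constants."""
    pts = " ".join(f"{x},{100 - y}" for x, y in points)
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">'
        f'<polyline points="{pts}" fill="none" '
        f'stroke="{STYLE["color"]}" stroke-width="{STYLE["line_width"]}"/>'
        f'<text x="5" y="15" font-size="{STYLE["font_size"]}">demo</text>'
        "</svg>"
    )
    with open(path, "w") as f:
        f.write(svg)

save_line_figure([(0, 10), (50, 60), (100, 30)], "demo.svg")
```

The point is the structure, not the SVG: any scripted plotting tool (R, gnuplot, MATLAB) gives you the same one-edit-and-regenerate property.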
posted by chrisamiller at 10:29 AM on July 16, 2009 [2 favorites]

Some journals address these issues in the "Instructions for authors" section of their website.
posted by pizzazz at 10:37 AM on July 16, 2009

First rule: stop using Excel to produce graphs and tables, if at all possible. It's not a completely horrible data organization tool, but it sucks hard for producing graphs for publication. Personally, instead of Excel, I use SigmaPlot, but there are lots of other options too. R can do everything Excel can graphically. Stick with a toolchain that works for you and that you know.

Secondly, read the instructions for authors of the journals that you will publish in.

Thirdly, go to the university library and get the thesis instructions.

Two & three together will tell you the basic requirements. The rest is personal choice, and this is where it gets interesting.

Attend as many presentations and read as many papers in the field as you can. As you do so, ask yourself, does this view of the data work well? Am I confused by it? How much do I need to think about it before I get it? Is it too detailed and busy? Is it too simple (note that this is almost never the case)?

Read a book or two on data presentation. The classic reference is E. Tufte's The Visual Display of Quantitative Information, but all of his books are worth a read. Your university library should have copies.

Lastly, you need to experiment! Try different presentations out, show them to your fellow students, post-docs and supervisor. Get comments, evolve to what works best.

Keep in mind that this is a process, not an end goal. Your presentation style will evolve through your career, but striving for clarity, comprehensibility and aesthetic sensibility has always served me well.
posted by bonehead at 10:43 AM on July 16, 2009

(btw, the other big timesaver for thesis and paper writing is proper reference management. You are using BibTeX, right?)
posted by bonehead at 10:48 AM on July 16, 2009

5th-year PhD student here, writing the last bits of my thesis (yay!). I know my "workflow" is not so much a "flow" as a "Rube Goldberg machine", but I will try to summarize.

-I use Excel, Origin, and MATLAB (although MATLAB the least of the three) for most of my plotting. My research also involves a lot of pictures and spectra, so it is not all charts and graphs.

-I mainly export plots as .ps from Origin or MATLAB and prettify them in Adobe Illustrator. I find the output from these programs sub-par for publication, and rather than spending a ton of time fiddling with the formats in the program itself, I just output the plot and make the lines bolder, the data points larger, or edit the fonts. In Illustrator I have a sort of standardized set of edits: Arial font for all axis and plot labels (my font-size guideline is "big enough to read easily"), 1 pt lines for axes, 0.5 pt lines for error bars, 0.75 pt lines for the outline of data points (if an open circle, etc.), and here I can also group the data points so I can edit them together. It is kind of a pain if I need to go back and change something, but no more of a pain than re-doing the plot in MATLAB and re-exporting it as a .eps file. Illustrator will also output a .eps file. Figure requirements vary from journal to journal, as pizzazz suggested, so it may be helpful to look there. Also talk to your advisor; he/she may have very specific suggestions on figure formatting. I know mine does.

-Another advantage of Illustrator is if you have several plots that are just a variation on X (several exponential decays of similar datasets, for example) you can just make a big array of all of the plots and have one giant figure instead of like 8 small ones.

-The advantage to having the figures in Illustrator is that I also make posters in Illustrator, and can export figures as .tif for PowerPoint, etc. Illustrator is super super handy for posters.

-As far as organization goes, uh, what is that? But when I'm writing a paper or making a poster I will have a separate folder for that poster/paper, and each figure will be its own named Illustrator file, and so on. Of course, you can always go back and change the file name if the figure number changes, but it helps me remember which figure goes where.

-My guess is that the journal will resize all of your figures depending on space considerations, so the only reason you would need to worry about having them a uniform size would be for your thesis. However, we don't write papers in LaTeX, so I don't know much about how journals that publish LaTeX-formatted papers will reformat what you've written.

memail me if you have any more questions or would like to see a paper from our group.
posted by sararah at 11:03 AM on July 16, 2009


I keep a lab notebook in an HTML page. Every figure I ever generate goes in this lab notebook. Almost all of these I generate in R, where I use a function that takes the current plot, saves it to my lab notebook directory as a PNG, regenerates and saves it as a PDF, and regenerates it again as a high-resolution PNG with large fonts for use in future presentations. It also generates HTML that links to all of the above images and loads it into an open emacs window for easy pasting into the lab notebook.

So when I want to make a presentation, I skim through the lab notebook and save copies of the files that were already generated for presentation use. This eliminates issues in trying to remember what a file is called, because I just look at the images and click on the link to get the file (they are all placed in a single directory per day, such as 2009/0716/myfigure.png).

I have some code that automatically lists the time, date, directory, and last command in the footer of every figure, which makes it easy to figure out what I did to create the plot at the time. I can go back through the .Rhistory file in this directory and get the necessary stuff to recreate the figure years later, when I am trying to create a publication. Each journal is going to have different ideas of how a figure should be made, unfortunately, so you can't necessarily do this stuff in advance. Therefore being able to reproduce or change the figures later is key. It works best if you create functions that create your plots and save them in separate files (backtracking through .Rhistory is really a stopgap measure when you forget to do that).
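A rough sketch of that filing scheme, in stdlib Python rather than R (the directory layout follows the 2009/0716/ example above; the function name and sidecar format are my own invention, not the poster's actual code):

```python
# Stdlib-only sketch: save each figure into a YYYY/MMDD/ directory, record
# provenance (date + command) beside it, and append a link to an HTML index.
import datetime
import pathlib

def file_figure(name, figure_bytes, command, root="notebook"):
    """Save a figure into a dated directory with provenance and an index link."""
    today = datetime.date.today()
    day_dir = pathlib.Path(root) / today.strftime("%Y") / today.strftime("%m%d")
    day_dir.mkdir(parents=True, exist_ok=True)

    fig_path = day_dir / f"{name}.png"
    fig_path.write_bytes(figure_bytes)

    # Provenance footer equivalent: when and how the figure was made.
    fig_path.with_suffix(".txt").write_text(
        f"date: {today.isoformat()}\ncommand: {command}\n"
    )

    # Append a link to the day's HTML index for easy visual skimming.
    with (day_dir / "index.html").open("a") as f:
        f.write(f'<p><a href="{fig_path.name}"><img src="{fig_path.name}"></a></p>\n')
    return fig_path

saved = file_figure("myfigure", b"...png bytes...", "plot(x, y)")
```

Recording the generating command next to each image is the part that pays off years later, when a journal asks for the figure in a different format.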

I use Sweave to integrate R plots into my LaTeX documents. I would say getting started with it is not for the faint of heart, but once you're comfortable with it, it makes it as easy to change the figures or even the data that go into your LaTeX document as it is to change the typography. Resubmit to another journal? No problem. Import into your thesis? No problem.

My only regret is that I don't have time to post the code for all this right now.

Big R tip: do not use the legacy graphics functions such as plot(). Use lattice functions such as xyplot() instead (lattice is included in R). There is a slight learning curve but it is well worth it. It is much easier to get complex publication-quality graphics out of lattice than out of legacy graphics. It is pretty easy to customize things like font size through lattice as well. I almost never find the need to do manual tweaking in Illustrator. Automation is key, as you don't want to repeat the tweaking when you have a new dataset. And you will. It's amazing how many things you think you will only need to do once that you end up having to do repeatedly.

The system I used as a PhD student was to print them out and paste them on the wall. When I ran out of space, I started stapling similar-looking graphs together. While in some ways it was very efficient, allowing an instant visual inspection of all of my results, I do not recommend this system.


As I said, each journal will have a different idea of what to do here. But for presentations and posters the key is to make your text big. There are conflicting guidelines on how big to make them that you can find all over the internet. One sans serif font will do well for everything. Honestly, I use the lowest common denominator here, Helvetica, since it's far less likely that it will clash with any of the unknown places where my images will end up.

For colors, you should really check out the ColorBrewer palettes, which have been extensively tested in any number of ways. They were initially designed for cartography (and it looks like you'll be doing a little of that), but I also use them for scatter plots, or anything else really (check out the RColorBrewer package). Default color schemes in scientific plotting software are usually pretty awful and make me cringe. The default scheme in Excel is even worse. Get into the habit of never using Excel to make plots. You can produce good-looking, efficient, elegant plots with Excel but it is way, way too hard, and requires too much hand-tuning of each one.

I'd recommend the following books on visualization techniques:

The Visual Display of Quantitative Information, Edward Tufte (and his other books are great too)
Visualizing Data, William S. Cleveland
Lattice: Multivariate Data Visualization with R, Deepayan Sarkar
posted by grouse at 11:08 AM on July 16, 2009 [1 favorite]

There are big advantages to using something like R, because if you need to change an image later, it's just a matter of editing a line or two in your code and re-generating the image.

This is an excellent suggestion.

In fact, for many of the plots in my thesis I had full-blown Makefiles that went from raw data files to cleaned-up ones via Perl, then generated gnuplot command files with a different Perl script, and finally produced plots from the data and command files. Oh, and of course everything was kept under version control using svn.
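The dependency-tracking half of that Makefile setup can be sketched in plain Python (the rule table and script names below are hypothetical, not the poster's actual pipeline):

```python
# Make-like rebuild logic sketched with the stdlib: re-run a figure script
# only when its output is missing or older than the script or the data.
import os
import subprocess

def needs_rebuild(target, *sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(s) > t for s in sources)

# Each rule maps an output figure to the script and data it depends on,
# mirroring a Makefile rule such as:  fig1.eps: plot_fig1.py data1.csv
RULES = [
    ("fig1.eps", "plot_fig1.py", "data1.csv"),
]

def rebuild_all(rules):
    rebuilt = []
    for target, script, data in rules:
        if needs_rebuild(target, script, data):
            subprocess.run(["python", script], check=True)  # script writes target
            rebuilt.append(target)
    return rebuilt
```

A real Makefile gives you this for free (plus parallel builds); the sketch just shows why the timestamp comparison saves you from regenerating fifty figures after every small data change.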

Since you are already using LaTeX, I highly recommend gnuplot with the epslatex terminal: it generates a minimal .eps file with the lines of your plot and a .tex file with all the labels. This keeps the labels automatically consistent with your LaTeX document, even if you make (minor) changes to things like fonts.
posted by Dr Dracator at 11:17 AM on July 16, 2009 [1 favorite]

grouse is pretty on the ball here.

I would only add that, although this depends on the field, you should be wary of fussing with figures too much. Good enough is good enough, and a journal will likely ask you to change them anyway after a piece is accepted. In most cases, "the best" font size or line width doesn't matter. What matters is only "Is it legible in the intended environment?"

LaTeX: save copies of the figures as pdf as well; the workflow through pdflatex can be easier than latex->ps->pdf, and some packages seem to work more cleanly through pdflatex than latex.

Image size shouldn't really be important for LaTeX. You can always resize with something like \includegraphics[width=0.6\textwidth]{figure}.

the only reason you would need to worry about having them a uniform size would be for your thesis

If your committee is sane, they're not going to give a crap about changes in figure size so long as everything fits within the margins for the Dreaded Ruler Lady.
posted by ROU_Xenophobe at 11:28 AM on July 16, 2009

Oh, and of course everything was kept under version control using svn.

Yes, version control! Separation of data and code. An essential thing to do if you don't want to be totally lost.
posted by grouse at 11:36 AM on July 16, 2009

In a recent issue of Science, I noticed graphs ranging from pixelly Excel output to professional work that had been formatted by the journal's designers. Take from this what you will; there may be standards that I don't know about, but it seems that you're safe picking what looks good to you.

I have written a couple of short papers in LaTeX, and I used Adobe Illustrator to edit my .eps files for consistency. It's then easy to pick a font and create your own system of text sizes and line weights to use in everything. You can copy and paste directly from Excel into Illustrator, which is really nice.

Computer Modern is the default typeface used by LaTeX. I find it quite elegant, but others think it's ugly. The Wikipedia page has links to TrueType and OpenType versions of Computer Modern. That might be a good choice to use in your images.
posted by scose at 11:55 AM on July 16, 2009

"Get vector graphics output from hand-editable script files", "use a Makefile to make it easy to regenerate them all at once", and "use version control" were the first three things I wanted to say, and it looks like I've been beaten to all of them. I hope I'm at least the first one to mention "those many script files can be refactored to put common elements (like style choices) into a few header files" and "if you forget to refactor something, you can still change it in a hundred figures at once with perl or your other favorite regexp parser".

My thesis included about 50 graphs, and when I realized (for just one of many examples) that the font size came out too small, it was a godsend to be able to quickly change three lines in one Octave file.

Depending on what you're plotting, you may be able to go even further. Many of my graphs were parametric studies, and the number of samples I took of each parameter kept changing, and it certainly saved a lot of work to automate that sort of plot regeneration too. I picked a saturation and value, then the graphing scripts would check how many parameters were in the latest data set and automatically calculate equispaced hues for the different graph lines. Cool idea, but in hindsight I'm not 100% happy that one of my thesis chapters appears to be full of rainbows.
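The equispaced-hue trick can be sketched with Python's stdlib colorsys (the saturation and lightness values here are arbitrary placeholders, not the ones the poster used):

```python
# Pick n hues evenly spaced around the color wheel at a fixed saturation
# and lightness, so a plot with n lines gets maximally separated colors.
import colorsys

def equispaced_colors(n, saturation=0.65, lightness=0.45):
    """Return n hex colors whose hues are evenly spaced around the wheel."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hls_to_rgb(i / n, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

print(equispaced_colors(4))  # four line colors for a four-parameter dataset
```

As the comment above warns, this scales automatically with the number of parameters, but the result can look like a box of rainbows; the ColorBrewer palettes mentioned earlier are a more restrained alternative.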
posted by roystgnr at 6:06 PM on July 16, 2009

Thank you all for your suggestions. I am trying not to use excel (or similar) for graph making anymore, and this will definitely encourage me to resist.

At the moment, I am trying to invest a lot of time in understanding how to use scripts for this type of work, as much of what I do requires graphical output. I'm definitely at the limit of my knowledge, but the suggestions people are making are useful and will guide me on what to try.

One suggestion I don't understand is version control. Do you mean just structuring your folders by date, or is there a better way of doing this? I have programmed, but not a huge amount, and only on small projects, so I have never used version control.
posted by a womble is an active kind of sloth at 8:20 AM on July 17, 2009

Version control is very much a personal taste issue.

I've been using the system you describe for more than 25 years without any loss or confusion of data, and I deal with lots of data. My sort criteria are by project, then by publication, then by date. Working data sets for me include Word files, .Rhistory files, worksheets (in .xls and .csv forms), PDF scans of field and lab notebooks, pictures, GIS and ArcView data, etc. I find that careful directory structure and file-name management is all I need.

You can use automated systems built for programming, like svn and its cousins. I've always found that to be too much of a hassle, but YMMV. This can work fine at the personal level, but it only really works if you get buy-in from your coauthors and research team as well.

Speaking of which, if you work in anything touching a Chem, Tox or even Bio lab these days, you're going to learn much more than you want to about LIMS. These database systems are increasingly common in the government and private sectors. LIMS usually have some sort of version management, but only for the data and sample documents. Quality systems in these labs will also require some document control for certain types of documents: SOPs, Forms and Methods, usually.
posted by bonehead at 12:15 PM on July 17, 2009

One suggestion I don't understand, is version control. Do you mean just structuring your folders by date, or is there a better way of doing this.

Perhaps the easiest way to think about version control for a beginner is to look at your favorite Wikipedia page, and click on the "history" tab. Every change that was made to the page is stored in the history, and you can compare (or 'diff') those changes, or roll back to a previous version.

This is awesome for code, and works very well for text too, assuming that you keep it in some non-binary format (plain text, an HTML or wiki page, NOT MS Word). This is better than just using dated folders for many reasons:

a) you don't have the clutter of many different folders all over. It also saves space, since the VCS will only save the changes, not a complete copy of each iteration of the file

b) you get history and a log of changes for each document, so if you screw something up, it's easy to roll back to a previous version without fumbling around trying to figure out which folder it's in

c) If you end up working on a project with someone else, you'll be able to both edit documents concurrently and intelligently merge the changes later.
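As a toy illustration of point (a), Python's stdlib difflib shows the kind of line-by-line delta a VCS stores instead of a full copy of each revision:

```python
# A version control system records changes between revisions rather than
# complete copies; a unified diff is what such a stored change looks like.
import difflib

old = ["We measured the samples.\n", "Results were inconclusive.\n"]
new = ["We measured the samples twice.\n", "Results were inconclusive.\n"]

delta = list(difflib.unified_diff(old, new, "r1/notes.txt", "r2/notes.txt"))
print("".join(delta))
```

Only the changed line is stored twice (once removed, once added); the unchanged line appears merely as context, which is why a repository with hundreds of revisions stays small.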

One easy way to get a very basic form of version control for your notebook or documents is to use something like Google Docs. For more control and local hosting, it's really quite easy to set up your own wiki, either on some webspace, your computer, or even on a USB flash drive. When you really get into it, you'll want to look into a true version control system like git or subversion.
posted by chrisamiller at 12:19 PM on July 17, 2009
