How do I database?
December 8, 2008 7:22 PM

I have a complicated question about databases. Please read on.

Without going into specifics, I work at a company that reads a lot of documents. We have for years kicked around the idea of setting up a database that would allow us to index certain portions (say a page or two) of each of these documents. However, most of what we read comes in PDF. We are interested in setting up a database that would store pasted parts of these documents (mainly because setting up some sort of an OCR solution would be way too difficult). So ideally I want a database that has support for PNG/GIF. I checked out Access and it seems to support images if you want them stored in BMP. That isn't going to work. Are there any solutions out there for this, oh Hive Mind?
posted by prunes to Computers & Internet (20 answers total) 4 users marked this as a favorite
One way to do it is to store the disk paths to images in the database instead of the images themselves. This is very common and, IMO, preferable, to reduce the size and complexity of your database on disk, to make backups easier, and to make your images more easily available for other use. You can give the images meaningful filenames if you like, or just generic ones, and store them however makes the most sense.

Alternatively, most database servers will store blobs (binary large objects), which could be a PNG, GIF, or any other image. If you look for blob support instead of image support you'll find many more options.
posted by pocams at 7:35 PM on December 8, 2008
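
A minimal sketch of the two options pocams describes, using Python's built-in sqlite3 module purely for illustration (the thread does not settle on a database). The table names `snippets_by_path` and `snippets_by_blob`, the column names, and the helper functions are all hypothetical.

```python
import sqlite3
from pathlib import Path


def open_db(db_path=":memory:"):
    """Create both illustrative tables (names are assumptions, not from the thread)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snippets_by_path ("
        "id INTEGER PRIMARY KEY, label TEXT NOT NULL, image_path TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snippets_by_blob ("
        "id INTEGER PRIMARY KEY, label TEXT NOT NULL, image BLOB NOT NULL)"
    )
    return conn


def add_by_path(conn, label, image_path):
    """Option 1: the image stays on disk; only its path goes in the database."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO snippets_by_path (label, image_path) VALUES (?, ?)",
            (label, str(image_path)),
        )
    return cur.lastrowid


def add_by_blob(conn, label, image_bytes):
    """Option 2: the raw PNG/GIF bytes are stored in a BLOB column."""
    with conn:
        cur = conn.execute(
            "INSERT INTO snippets_by_blob (label, image) VALUES (?, ?)",
            (label, image_bytes),
        )
    return cur.lastrowid


def load_image(conn, row_id, by_path):
    """Return the image bytes for a row, whichever way it was stored."""
    if by_path:
        (path,) = conn.execute(
            "SELECT image_path FROM snippets_by_path WHERE id = ?", (row_id,)
        ).fetchone()
        return Path(path).read_bytes()
    (data,) = conn.execute(
        "SELECT image FROM snippets_by_blob WHERE id = ?", (row_id,)
    ).fetchone()
    return bytes(data)
```

With the path approach the database file stays small, but the row is only as good as the file it points to; with the blob approach the image travels with the row.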

Well, not for nothing and it might be out of your price range, but Google's Enterprise Search Appliances do what Google does with PDFs: they index them based on the text within them.

If it's images embedded in PDFs, or scanned into PDFs, that makes it a bit harder, but they do some pretty impressive things. Click Features and see what I'm talking about.

Plus, they stay off the web; they're only on the box itself.
posted by disillusioned at 7:36 PM on December 8, 2008

Response by poster: I've considered linking to the files but due to various constraints this isn't feasible. And I'm really looking for the ability to cut and paste images in. Having to use an image editor to save the screengrab as a PNG, I fear, will result in the project not leaving the ground. The db needs to be optimized for people who are very busy and not highly technically proficient.
posted by prunes at 7:41 PM on December 8, 2008

I'm not sure I completely understand your question but ... most modern databases will allow storage of binary objects (for instance a PDF document, an Excel file, or a bitmap). I'm surprised Access restricts this to bitmaps, but that's probably MS's marketing department at work.

Regarding your specific needs, it seems to me that if you were to store some subset of your PDFs within a database it wouldn't be much help, as you wouldn't then be able to search for, say, all the PDFs which contained the word 'banana'. I'm also not clear why a part of the document is useful, but maybe that's something to do with your documents.

There are tools which extract the text from a PDF and this might be useful if you could set things up to store the words from the document.

There are also tools which allow you to extract pages from a PDF, so if, for instance, you did want to store the table of contents of each document within a database, then you could use one of those and store the result in the DB.
posted by southof40 at 7:41 PM on December 8, 2008
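
A rough sketch of the "store the words from the document" idea, so that a query such as "all the PDFs containing 'banana'" becomes possible. It assumes the text has already been pulled out of each PDF by some external tool (that step is not shown), and the `words` table and both function names are made up for illustration.

```python
import re
import sqlite3


def build_word_index(conn, documents):
    """Store each distinct word of each document so it can be looked up later.

    `documents` maps a document name to its already-extracted text; the
    PDF-to-text step itself is assumed to happen elsewhere.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT NOT NULL, doc TEXT NOT NULL, PRIMARY KEY (word, doc))"
    )
    with conn:  # one transaction for the whole batch
        for doc, text in documents.items():
            # Lowercase, then keep runs of letters and digits as words.
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                conn.execute(
                    "INSERT OR IGNORE INTO words (word, doc) VALUES (?, ?)",
                    (word, doc),
                )


def documents_containing(conn, word):
    """Return the sorted names of all documents that contain `word`."""
    rows = conn.execute(
        "SELECT doc FROM words WHERE word = ? ORDER BY doc", (word.lower(),)
    )
    return [doc for (doc,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_word_index(
        conn,
        {"a.pdf": "Banana imports rose", "b.pdf": "Apple exports fell"},
    )
    print(documents_containing(conn, "banana"))  # ['a.pdf']
```

This only finds whole words; it would not help with scanned pages, where there is no text to extract in the first place.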

Evernote can index PDFs.
posted by tayknight at 7:41 PM on December 8, 2008

Artstor's Offline Image Viewer supports both PNG and GIF. However, would the purpose of this database be to search the images? I'm not sure how you could do that without running OCR or separating the text from the image in some way. With ArtStor you could add some amount of text in the description of the images, and that would be searchable.
posted by mpronovost at 7:43 PM on December 8, 2008

Regarding what disillusioned says, there's certainly something to be said for keeping things out of the DB and just having references within the DB to the underlying files. Having said that, I've worked on some projects where keeping those two in sync was a real struggle (during restores of backups, or when doing transactional processing with ROLLBACK/COMMIT), so it's definitely got downsides too.
posted by southof40 at 7:43 PM on December 8, 2008
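
To make the sync problem concrete, here is a small sketch of what the application has to do by hand when files live outside the database. It assumes a pre-existing table `snippets (label, image_path)`; that table, the function names, and the cleanup policy are illustrative choices, not anything from the thread.

```python
import os
import sqlite3


def save_snippet(conn, image_dir, filename, image_bytes, label):
    """Write the image to disk, then record its path; undo the file if the insert fails."""
    path = os.path.join(image_dir, filename)
    with open(path, "xb") as handle:  # "x" refuses to overwrite an existing file
        handle.write(image_bytes)
    try:
        with conn:  # transaction: commit on success, rollback on exception
            conn.execute(
                "INSERT INTO snippets (label, image_path) VALUES (?, ?)",
                (label, path),
            )
    except sqlite3.Error:
        # The database rolled back, but the filesystem did not: clean up by hand.
        os.remove(path)
        raise
    return path


def find_orphans(conn, image_dir):
    """List files in image_dir that no database row points to (e.g. after a restore)."""
    known = {p for (p,) in conn.execute("SELECT image_path FROM snippets")}
    on_disk = {os.path.join(image_dir, name) for name in os.listdir(image_dir)}
    return sorted(on_disk - known)
```

A ROLLBACK undoes the row but not the file, so the cleanup in the `except` branch and a periodic check like `find_orphans` are the kind of extra work being described.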

Response by poster: I'm not trying to search the images. The documents in reference have certain data tables that we want to catalog. I think that pasting these tables in as images might be the only feasible way to accomplish what I'm looking for.
posted by prunes at 7:47 PM on December 8, 2008

There are more than a few choices that will work.


1) How many users will be using the system at once?

2) How many documents? Hundreds? Thousands? Tens of thousands? Hundreds of thousands? Etc.

3) Does this system need to be integrated with other IS/IT systems?

4) Who will be maintaining this system?
posted by mosk at 7:51 PM on December 8, 2008

It sounds to me like you're really combining two tiers -- you're not just looking for a database but also a front-end application to allow you to input the data (pasting it in). Access serves both of these functions, but don't let that fool you.

You should probably be searching for document indexing solutions rather than just raw "databases."
posted by toomuchpete at 7:56 PM on December 8, 2008

Response by poster: Less than ten users at once.

Let's say thousands of documents.

Does not need to be integrated with other IS/IT systems.

Would be maintained, I guess, by me. I'm not really looking for an "enterprise quality" solution, however.
posted by prunes at 7:58 PM on December 8, 2008

This is an extremely weird question. I think that you need to explain to us what you mean by saying that you want to "index" these documents; it seems possible to me that what you need might not be a database at all.
posted by XMLicious at 8:06 PM on December 8, 2008

The bug tracking system FogBUGZ has a real easy capture tool for automatically uploading a screenshot into its database. It might be a bit weird to use a bug tracking tool for this purpose, but I bet you could use the assorted metadata columns that the product ships with to meet your needs. They have an online demo for you to try and you can pay them to host the software.
posted by mmascolino at 8:49 PM on December 8, 2008

You could do this in FileMaker Pro. It would be a very easy solution, as FileMaker can store the images either by reference or as stored objects. FileMaker can also store PDFs, can be scripted, users can access it through a browser or through a desktop client, etc. It will also scale well with the base of users you describe. In fact, it's probably ideal for what you are describing. It is, however, a commercial product, and with 10 simultaneous users you will either want to look into a site license or use its internal Instant Web Publishing engine (which is also quite easy to use).

Disclosure: I am an in-house FileMaker developer for a medium-sized company, and used to work for FileMaker, Inc., so factor that into this recommendation. Nonetheless, you could totally do this in FileMaker and be pretty happy with the results.
posted by mosk at 9:07 PM on December 8, 2008

If you just want to make your PDFs searchable - which is what it sounds like to me - I'll second disillusioned's suggestion, with one caveat: you can probably just get a Google Mini instead. I work with GSAs and Minis quite a bit, and they work very well with PDFs. They make them easily searchable - you just point the appliance via HTTP or SMB/CIFS to the directory in question - and they provide a search interface out of the box. You can easily customize the interface, if you like. They recognize PDF metadata, and you can search against that as well. As a bonus of sorts, your search audience can view PDF content without having Adobe Reader/Acrobat installed, as the appliance converts the PDF to HTML for indexing, and users can view the converted content directly. And the Google Mini is a very affordable solution.

There is one potentially big limitation, however. Neither appliance will index PDFs larger than 30 MB, and both will only index the first 2 MB of a PDF in any case.

If you have any questions about either appliance, feel free to memail me. I'm a Google certified instructor, as well as a Google Enterprise partner.
posted by me & my monkey at 9:32 PM on December 8, 2008
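
If those limits matter, a quick way to see how a folder of PDFs would fare is to sort the files by size. This is only a sketch: it takes the 30 MB and 2 MB figures from the comment above at face value, assumes they apply to the file size on disk, and treats 1 MB as 1024 * 1024 bytes. The function name is made up.

```python
import os

# Limits as stated in the comment above; treating 1 MB as 1024 * 1024 bytes
# and applying both limits to on-disk file size are assumptions of this sketch.
MB = 1024 * 1024
SKIP_ABOVE = 30 * MB
INDEXED_PREFIX = 2 * MB


def classify_pdfs(directory):
    """Sort the PDFs in `directory` by how the stated limits would affect them."""
    report = {"skipped": [], "partial": [], "full": []}
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        size = os.path.getsize(os.path.join(directory, name))
        if size > SKIP_ABOVE:
            report["skipped"].append(name)  # larger than 30 MB: not indexed
        elif size > INDEXED_PREFIX:
            report["partial"].append(name)  # only the first 2 MB indexed
        else:
            report["full"].append(name)
    return report
```

A long "partial" list would suggest that much of the collection's content would not be searchable.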

What about a document management system like KnowledgeTree?

That link is to their free open source edition. I've set it up for a small editing company in the past and they found it pretty easy to use. I believe there is full text search for pdfs.
posted by meta87 at 10:42 PM on December 8, 2008

So ideally I want a database that has support for PNG/GIF.


You DO NOT store binary data in a database. That is NOT the purpose of a database. If anything, you'd store the locations of the documents in a database, then do a separate lookup.

Anyway, not sure if this is the right tree to bark up, but Image Silo might be the answer you're looking for. If you had to have access to the part of a document that... let's say... holds a signature.
posted by Civil_Disobedient at 12:38 AM on December 9, 2008

I'm getting a different read on his comments and I think he's looking for a digital asset management system or digital library that will let them upload and store portions of a PDF, like say a few paragraphs of text or a table/graph from a larger PDF document. Kind of like digital scrapbooking on an enterprise scale.

In terms of usability, I think tayknight's suggestion of Evernote would probably be the least cumbersome since it allows uploading of PDFs, images, web pages, and bits of text and it automatically runs recognition algorithms across the uploads. It has a pretty decent desktop program for accessing all the content and lets you take notes on content you upload.

The primary downside is that there isn't any type of server or shared account with Evernote. So the whole company would have to use a single account to share content. It's probably also not the most robust solution if you're dealing with thousands of pages of content.
posted by junesix at 12:55 AM on December 9, 2008

For capturing images and snippets from the PDF, SnagIt is a great little tool for capturing images from your desktop and documents. With a bit of education, it shouldn't be much trouble to get your office up to speed with a two-step workflow of 1) snapping images with SnagIt and then 2) dropping them into the Evernote application.
posted by junesix at 1:02 AM on December 9, 2008

I'll disagree with C_D on storing binary content in databases. There are certainly downsides to storing binary content in databases, but there are certainly downsides to storing binary content outside of the database too, namely: 1) more confusing and difficult backup/restore processes, 2) more difficulty doing true transactions, and 3) more difficulty scaling out a solution to support a larger user base.

Some of these things perhaps aren't big issues for this specific user's case, but they are true of the general problem.
posted by mmascolino at 6:35 AM on December 9, 2008 [1 favorite]
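
As an illustration of the "true transactions" point, here is a minimal sketch in which a document row and its image blobs are saved as one unit. It uses Python's built-in sqlite3 module as a stand-in for whatever database is chosen; the `documents` and `snippets` tables and the function names are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create the two illustrative tables (names are assumptions)."""
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE snippets (id INTEGER PRIMARY KEY, "
        "doc_id INTEGER NOT NULL REFERENCES documents(id), "
        "label TEXT NOT NULL, image BLOB NOT NULL)"
    )


def add_document_with_snippets(conn, title, snippets):
    """Insert a document row and all of its image blobs as one transaction.

    `snippets` is a list of (label, image_bytes) pairs. If any insert fails,
    the whole batch is rolled back, so no half-saved document is left behind.
    """
    with conn:  # commit on success, rollback if an exception escapes
        cur = conn.execute("INSERT INTO documents (title) VALUES (?)", (title,))
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO snippets (doc_id, label, image) VALUES (?, ?, ?)",
            [(doc_id, label, data) for label, data in snippets],
        )
    return doc_id
```

Because the images are rows like any other, a rollback removes them along with the metadata and a single database backup captures both; there is no separate file cleanup to get wrong.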
