Open source XML web searching?
September 7, 2007 12:48 PM

I'm looking for suggestions for open source software that will allow me to take a large amount of XML, do searches of it through a web interface (especially allowing selecting trees or tags for narrowing searches), and return HTML-formatted results.

Have you used any open source software like that? It needs to run as out-of-the-box as possible. This question was helpful, but it wasn't close enough to my situation, and the answers require more programming than I have skill or time for. The XML would have to be invisible to the user: the users would not be downloading the XML as files, they would merely be looking at it rendered in styled HTML on the web server. The software would have to pay attention to the DTD, respect parent and child trees, and allow inter-record linking.
posted by Mo Nickels to Computers & Internet (6 answers total)
 
Have you looked at Swish-e?
posted by breaks the guidelines? at 1:05 PM on September 7, 2007


Response by poster: Yes, I did look at it, and it's a candidate, but it seems too complex at this point.
posted by Mo Nickels at 1:25 PM on September 7, 2007


(oh, hi metafilter -- haven't been here for a while)

It's not an out-of-the-box solution, but Xindice allows XPath searches on large XML documents. Search results can include branches of the document, which you could then transform into HTML with XSLT.
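To make the idea concrete, here is a minimal sketch of "search by path, then render the matching branches as HTML". It does not use Xindice or XSLT: it swaps in Python's standard-library ElementTree, which supports only a limited subset of XPath, and plain string formatting for the HTML step. The element names (entry, headword, def) and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data; the element and attribute names are assumptions.
SAMPLE = """<dictionary>
  <entry key="cat"><headword>cat</headword><def>A small domesticated feline.</def></entry>
  <entry key="dog"><headword>dog</headword><def>A domesticated canine.</def></entry>
</dictionary>"""

def search_to_html(xml_text, path):
    """Find entries matching an ElementTree path and render them as an HTML list."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.findall(path):
        # Escape text so that markup characters in the data cannot break the page.
        head = escape(entry.findtext("headword", default=""))
        definition = escape(entry.findtext("def", default=""))
        items.append(f"<dt>{head}</dt><dd>{definition}</dd>")
    return "<dl>" + "".join(items) + "</dl>"

# Select only the entries whose <headword> child has the text "dog".
html_out = search_to_html(SAMPLE, ".//entry[headword='dog']")
print(html_out)
```

A dedicated XML database would add indexing and full XPath on top of this; the sketch only shows the shape of the search-then-render step.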
I've got some questions though,
  • Why does it need to pay attention to the DTD? Do you have strange entities or something? If it's just characters, why not replace them with NCRs and ignore the DTD while parsing?
  • What do you mean by "large"? A lot of small files, a few files of 10MB each, or what?
  • Unless you're using a really popular XML feature (like, say, XLink), it's probably unrealistic to expect out-of-the-box inter-record linking.
  • These XML-aware search engines aren't as fast as plain-text ones... do you really need branch selection?

posted by holloway at 8:26 PM on September 7, 2007


Response by poster: Holloway, the DTD describes a sophisticated dictionary (an actual English-language lexicon, not a programming library) with many subtleties. I think the DTD needs to be paid attention to in order to automatically determine parent and child trees, all tags, and all attributes, and then give me the option to search as broadly or as finely as necessary. The data structure is defined, so why not use that instead of having to re-elaborate it in another language or syntax?

One of the main problems with finding what we need is that most of the open source XML software being used in programming and computing circles is surprisingly unsophisticated when compared against the complex data that can be generated when preparing a high-level language dictionary (think, for example, of the Oxford English Dictionary, which is not the dictionary I am working with).

The searches will be done by a non-specialist group who will not be versed in query languages other than normal Boolean. So, we'd start with a simple empty field and a search button, which will satisfy many users, but we also need to provide complex search forms for advanced users, so that they can run searches like, "Show me all citations dated before 1930, marked 'Missouri,' only from entries that include etymologies and a Latin gloss." A raw text search wouldn't permit that kind of granularity.
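For anyone wondering what that example search means in structural terms, here is a rough sketch in Python using only the standard library. The tag and attribute names (entry, etym, gloss with a lang attribute, cite with year and place attributes) are guesses made for illustration and are not taken from the actual DTD. A search form would fill in the year and place values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real DTD will use different names.
SAMPLE = """<dictionary>
  <entry key="a">
    <etym>From Latin.</etym>
    <gloss lang="la">aqua</gloss>
    <cite year="1925" place="Missouri">early quote</cite>
    <cite year="1950" place="Missouri">late quote</cite>
    <cite year="1910" place="Ohio">elsewhere</cite>
  </entry>
  <entry key="b">
    <cite year="1900" place="Missouri">no etymology here</cite>
  </entry>
</dictionary>"""

def find_citations(root, before_year, place):
    """Return (entry key, citation text) pairs that satisfy every condition."""
    hits = []
    for entry in root.iter("entry"):
        # Entry-level conditions: must have an etymology and a Latin gloss.
        if entry.find("etym") is None:
            continue
        if entry.find("gloss[@lang='la']") is None:
            continue
        # Citation-level conditions: dated before the cutoff and marked with the place.
        for cite in entry.iter("cite"):
            year = cite.get("year", "")
            if year.isdigit() and int(year) < before_year and cite.get("place") == place:
                hits.append((entry.get("key"), cite.text))
    return hits

results = find_citations(ET.fromstring(SAMPLE), 1930, "Missouri")
print(results)
```

The point of the sketch is that each condition attaches to a specific level of the tree (entry or citation), which is exactly what a flat text search loses.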

The data is probably 100 to 200MB in size in continuous XML. I can certainly break out each entry into separate files, though.

All of the entries will already have unique sortkeys and URL attributes; I'm hoping to use those to make inter-entry linking easier. We have cross-reference (xref) tags already in place, where needed, and each has a URL attribute that can point to another record in the data.
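As a rough illustration of how those attributes could drive linking, the sketch below turns each xref into an HTML link when its target matches a known entry. It is written in Python with the standard library only, and the attribute names (sortkey, url) and the URL scheme are assumptions, not the real markup.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data; attribute names and URL scheme are assumptions.
SAMPLE = """<dictionary>
  <entry sortkey="cat" url="/entry/cat"><def>See also <xref url="/entry/dog">dog</xref>.</def></entry>
  <entry sortkey="dog" url="/entry/dog"><def>A canine.</def></entry>
</dictionary>"""

def xref_links(root):
    """Render each xref as an HTML link if its target entry exists, else as plain text."""
    known = {entry.get("url") for entry in root.iter("entry")}
    links = []
    for xref in root.iter("xref"):
        target = xref.get("url")
        label = escape(xref.text or "")
        if target in known:
            links.append(f'<a href="{escape(target, quote=True)}">{label}</a>')
        else:
            # Dangling cross-reference: show the label without a link.
            links.append(label)
    return links

links = xref_links(ET.fromstring(SAMPLE))
print(links)
```

Checking each target against the set of known entry URLs also gives a cheap way to spot broken cross-references before publishing.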

As you might gather, I have lots of experience in editing XML and working within the confines of DTDs, but I am not a builder of the software that makes these tasks possible. So, I may have misunderstood some of your questions.
posted by Mo Nickels at 6:40 AM on September 8, 2007


micropledge.com might work. I got a small project made for me there.

You can suggest projects and bid on them to get them made.
posted by GregX3 at 10:38 PM on September 8, 2007


Emailing you off-board, Mo Nickels.
posted by holloway at 3:38 PM on September 9, 2007

