I've made a huge mistake?
September 19, 2012 7:15 PM Subscribe
How can I create a web-searchable database of indexed newspaper articles?
I've checked the other ask.mes on this subject and none of them seem entirely on point.
I'm in the process of indexing a now-defunct local newspaper using an Excel 2007 spreadsheet. My columns are: issue number, date, title, author, section, and page number. What I'd like to create is a simplified version of, say, a ProQuest database that allows you to search all the contents of the index file. For example, search for the word "cat" and the results would have "cat" in either the title or author name.
I have no database or programming experience and there's no money to hire someone who does. Have I bitten off more than I can chew?
I've checked the other ask.mes on this subject and none of them seem entirely on point.
I'm in the process of indexing a now-defunct local newspaper using an Excel 2007 spreadsheet. My columns are: issue number, date, title, author, section, and page number. What I'd like to create is a simplified version of, say, a ProQuest database that allows you to search all the contents of the index file. For example, search for the word "cat" and the results would have "cat" in either the title or author name.
I have no database or programming experience and there's no money to hire someone who does. Have I bitten off more than I can chew?
I'm a programmer, and I have database experience. My first choice would be Apache Solr. You should be able to feed your spreadsheet right into it and get a searchable dataset. However, I don't have a good handle on how easy it is for a novice to setup.
posted by sbutler at 8:01 PM on September 19, 2012 [2 favorites]
posted by sbutler at 8:01 PM on September 19, 2012 [2 favorites]
When I had a similar task I used it for an excuse to teach myself Drupal.
Once I had it installed it was a fairly simple (simple meaning I was able to figure it out, not simple meaning it was easy) to map the fields in CSV export of the Excel spreadsheet (I think I ran it through Access for some reason, first) to a MySQL database.
Then it was just a matter of displaying a subset of the data linking to the full content. Using Views exposed filters in Drupal you can get pretty close to what you want. There is also the core full text node search and, as mentioned, Apache Solr rocks and happens to play nicely with Drupal.
The details are tedious but I'd be happy to provide more explanation (and examples) if this is a route you want to consider.
posted by cedar at 9:36 PM on September 19, 2012
Once I had it installed it was a fairly simple (simple meaning I was able to figure it out, not simple meaning it was easy) to map the fields in CSV export of the Excel spreadsheet (I think I ran it through Access for some reason, first) to a MySQL database.
Then it was just a matter of displaying a subset of the data linking to the full content. Using Views exposed filters in Drupal you can get pretty close to what you want. There is also the core full text node search and, as mentioned, Apache Solr rocks and happens to play nicely with Drupal.
The details are tedious but I'd be happy to provide more explanation (and examples) if this is a route you want to consider.
posted by cedar at 9:36 PM on September 19, 2012
Is it going to be a public web-searchable database? In that case, you can just put it up on the web via any method and let Google index it. If you look at the Google Advanced Search you can fill in "site or domain" to search just one particular web site and then there's the "terms appearing" dropdown that would let you just search the title. (And you could put the author's name in the title of the page to also search that.)
If you were going to set up a web application to do this with, I'd be inclined to set up a wiki. I would think that the "category" functionality of a wiki would be handy for grouping by issue number, date, title, and author. I find JAMWiki to be nice, light, fast wiki software.
Since your spreadsheet is broken down by page number, if you want to have each original page's worth of text in its own separate web page you might check out how Wikisource glues individual pages together into the full document using the feature referred to as "inclusion". (Wikisource is running a heavily customized version of Mediawiki, the software that runs Wikipedia, but JAMWiki has the inclusion feature as does some other wiki software.)
posted by XMLicious at 10:12 PM on September 19, 2012
If you were going to set up a web application to do this with, I'd be inclined to set up a wiki. I would think that the "category" functionality of a wiki would be handy for grouping by issue number, date, title, and author. I find JAMWiki to be nice, light, fast wiki software.
Since your spreadsheet is broken down by page number, if you want to have each original page's worth of text in its own separate web page you might check out how Wikisource glues individual pages together into the full document using the feature referred to as "inclusion". (Wikisource is running a heavily customized version of Mediawiki, the software that runs Wikipedia, but JAMWiki has the inclusion feature as does some other wiki software.)
posted by XMLicious at 10:12 PM on September 19, 2012
Solr's inappropriate because it's a document indexing engine, and if I'm reading this correctly you have metadata not documents. Drupal is... a bit heavyweight unless you want to do a lot of work.
I suggest you start by just getting the data online in a text format. Do a CSV dump of your Excel spreadsheet, stick it on the web, and you've provided a good 75% of the value. CTRL+F in a browser window can go a long way. Bonus points for putting the original spreadsheet up too, so people can download it and do more complex queries.
Once I had that up and running I'd go back and look at some of the other solutions people have offered (XMLicious seems to have some good ideas).
posted by Leon at 3:39 AM on September 20, 2012
I suggest you start by just getting the data online in a text format. Do a CSV dump of your Excel spreadsheet, stick it on the web, and you've provided a good 75% of the value. CTRL+F in a browser window can go a long way. Bonus points for putting the original spreadsheet up too, so people can download it and do more complex queries.
Once I had that up and running I'd go back and look at some of the other solutions people have offered (XMLicious seems to have some good ideas).
posted by Leon at 3:39 AM on September 20, 2012
I use Apache with MySQL which is very well documented and relatively easy to work with. All I needed was internet server space which supports and provides these tools. Any decent web host will support them. There are also plenty of good books which should be available at your local library to step you through the process of working with MySQL and PHP which is for coding your web front end. My programming expertise isn't too advanced but trial and error works well to get everything working.
posted by JJ86 at 9:44 AM on September 20, 2012
posted by JJ86 at 9:44 AM on September 20, 2012
This thread is closed to new comments.
posted by rhizome at 7:51 PM on September 19, 2012