Tool me but don't fool me....
June 22, 2011 9:01 AM

I want to develop a browser toolbar button which leads people to content related to what they are reading in their browser. I need some help understanding how complex this project is....

What I am looking for is a tool which:

- (with the user's permission) 'reads' / 'analyses' / 'parses' the content of the page the user is currently on (this can be limited to certain types of sites, such as news sites).

- Matches this content against a panel of reputable sources and signals the availability of related content, say by showing a number or changing colour.

- The user then clicks it and is presented with a drop-down menu listing related sites, and can click through to any of them.

Can someone please help me by:

- converting my layman's request into semi-technical language which a programmer will find easy to understand.

- telling me how complex this is on a scale of 1-10 of browser plugin programming

- telling me whether there is a programming framework which would make this easier

- telling me what skills I should look for in a programmer (C++, Java, or whatever)

Thanks very much
posted by london302 to Computers & Internet (8 answers total)
The big issue is the giant question mark at the center of your plan: where is this "panel of reputable sources" going to come from? Are you asking the programmer to also build this giant data warehouse of source material, a la Your Own Private Google? That's a "10" on the programming scale; the "make a button to show the list" part is utterly trivial by comparison.
posted by bcwinters at 9:10 AM on June 22, 2011

bcwinters: The reputable sources are already identified and, in most cases, are publicly available websites.
posted by london302 at 9:20 AM on June 22, 2011

But you're talking about the actual source websites full of articles, not the system by which all of their data will be downloaded, parsed, indexed & compared, right?

Without some kind of server farm (either that you are running, or a public-ish one that provides an API) doing all the indexing and comparing, you are basically just doing the online equivalent of holding up a clipping and waving it at a pile of newspapers. Without that index, there is no link between the reader's text and the sources.
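To make "index" concrete, here is a toy sketch in Python of the kind of lookup table I mean. Everything in it is an assumption for illustration: the function names, the example.com-style URLs, and the crude word-overlap scoring. A real system would need crawling, stemming, stop-word handling and far better ranking, running on a server rather than in the toolbar.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(articles):
    """Map each word to the set of article URLs that contain it.

    `articles` is a hypothetical dict of {url: article_text}.
    """
    index = defaultdict(set)
    for url, text in articles.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def related(index, page_text):
    """Rank indexed URLs by how many distinct words they share with page_text."""
    scores = defaultdict(int)
    for word in tokenize(page_text):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda u: (-scores[u], u))
```

The toolbar button would only ever call something like `related()`; building and refreshing the index is the expensive part that has to live somewhere else.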
posted by bcwinters at 9:41 AM on June 22, 2011 [1 favorite]

There are a lot of moving parts to this, some of which are pretty tricky. Off the top of my head:

- You're going to have to decide how you figure out what a page is "about". Well-coded pages will have keywords spelled out for you in the header. There are also various linguistic software packages that can parse natural language and pull out keywords; getting these to work well requires a lot of finesse, and probably someone with specialized programming skills.

- How are you accessing "matched" content? You can just link to search results on various sites using keywords, but that's a total crapshoot. Do you plan on indexing content yourself? That's a giant tech undertaking and a legal morass. What you call "publicly available" generally means "to individual users in a web browser", not "to for-profit data mining ventures".
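For the easy case in the first point (keywords in the header), a minimal sketch using only Python's standard library might look like the following. The class and function names are made up for illustration, and it assumes the page actually supplies a `<meta name="keywords">` tag, which many pages don't.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split on commas, trim whitespace, drop empty entries.
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]


def page_keywords(html):
    """Return the declared keywords of an HTML page, or [] if none are given."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords
```

When this returns an empty list you are back to the hard case: natural-language analysis of the page body.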
posted by mkultra at 9:46 AM on June 22, 2011 [1 favorite]

(And I should note that I in no way mean to be critical of your idea—just that this is something pretty major that you're going to have to figure out before you can get started.)
posted by bcwinters at 9:47 AM on June 22, 2011

bcwinters: understood

mkultra: these sources exist to disseminate information (some make money from display adverts), so as long as their content is displayed without alteration they should be happy for the eyeballs, no? They will be more likely to be discovered (but I agree that the legal side of indexing may require permission).
posted by london302 at 9:53 AM on June 22, 2011

Google got big by perfecting their pagerank algorithm. Your tool will get big if it can correctly identify what a page is about, and which pages in your reputable websites are of interest to your users. This means you will need to have your own pagerank algorithm.

So, focus on that. Questions about the user interface, programming framework and language are almost irrelevant at this stage.

I would use a page linguistic analyser, and put the results in a google search, and filter those results with your list of reputable resources.
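The filtering step is the simple part. A rough sketch in Python, assuming you already have a list of result URLs from whatever search you run (the search call itself is not shown, and the function names and domains here are hypothetical):

```python
from urllib.parse import urlparse


def is_reputable(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_results(result_urls, allowed_domains):
    """Keep only the search results that come from the reputable list."""
    return [u for u in result_urls if is_reputable(u, allowed_domains)]
```

Matching on the hostname rather than on the raw URL string means a lookalike such as fakenews.example is not mistaken for news.example.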
posted by Psychnic at 10:08 AM on June 22, 2011 [1 favorite]

As a reference, there's a Chrome extension similar to this idea called Google Similar Pages.
posted by switchsonic at 7:42 PM on June 22, 2011
