Help Me Create an Automatic Search to Return Results for our Database
November 21, 2016 7:04 AM   Subscribe

I'm working to check a list of duplicates in our database and I'd like to be able to automate parts of this process. Please help me figure out what tools I should use and how to get started on this so I can learn something and be a more effective database manager.

Here is what I would like:

I access our database (NGP) through a web browser. I have a list of contacts that are more likely to have duplicate records in our system. They have a "Duplicate Contacts" code attached.

I would like to run some sort of process to search the database for the last name and first initial of each record. If there is only one record returned (the record I'm checking), I'd like that included in a list so I can bulk remove the code.

If more than one record is returned, I will check those items one at a time as I am doing now. (Please note: I'm not looking for advice on whether this is the best way to check for duplicates; I'm looking for a way to do this same process faster using the magic of technology.)
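Here's roughly the rule I mean, sketched in Python -- a sketch only, assuming the contacts could somehow be exported as plain rows; the field names (contact_id, first_name, last_name) are invented for illustration:

```python
# Group records by (last name, first initial); records whose key matches
# only themselves go on the "bulk remove the code" list, the rest need
# manual checking. Field names here are hypothetical, not NGP's.
from collections import defaultdict

def split_by_match_count(records):
    """Return (IDs whose last-name + first-initial key is unique,
    IDs that share a key with another record and need manual review)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["last_name"].lower(), rec["first_name"][:1].lower())
        groups[key].append(rec["contact_id"])
    unique_ids, needs_review = [], []
    for ids in groups.values():
        (unique_ids if len(ids) == 1 else needs_review).extend(ids)
    return unique_ids, needs_review
```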

My priorities are:
1) Do this more efficiently
2) Learn something
3) Find a way to do this that is less boring so I don't gouge out my eyes

I'm willing to put some time into figuring out the best way to do this since I will doubtless use this information going forward and, frankly, checking these one at a time is INSANELY boring whereas creating a system to check them automatically sounds like a fun puzzle!

Once I have a starting place, I'm pretty good at poking around to figure out how to make something work but I don't even know where to begin (should I be running a script for this? What does that even mean? Someone with whom I work suggested using Excel but I have no idea how to get it to interact with NGP for this; I've tried the "Get External Data" function but it only seems to download tables, not actually do the searches. Do I need to download something or do I likely already have the tools on my computer?).

Please don't tell me "this will be too hard" or "if you don't know how to do this already don't bother". I'm decently smart and good at figuring out stuff like this and it's really worth it to me to try, I just need some help figuring out where to begin.

I'm really grateful for any help you can provide; I am excited to try to up my database manager game and develop new skills that will help my organization. Thank you!
posted by Mrs. Pterodactyl to Computers & Internet (14 answers total) 3 users marked this as a favorite
What is the database system you're using? I would go directly to that -- open up SQL Management Studio or MySql Workbench or whatever is storing the data and go from there.

This, of course, means understanding database queries, but what you're attempting should be fairly simple if you do it in the step-by-step way you're describing. That, however, depends on how the contact info is stored -- if it is spread out across several tables, that increases the difficulty.

Once you figure out your queries and processes, then you can work on scripting it out, but first familiarize yourself with the methodology of SQL -- or whatever database management system contains the data -- so you're not trying to speak through layers of in-between interpreters to get to the data.

Edit: on review, is NGP the database system? When I google it I get some politically-related contact management system.
posted by AzraelBrown at 7:13 AM on November 21, 2016

Are you talking about NGP VAN?
posted by XMLicious at 7:20 AM on November 21, 2016

Yup, NGP is the system we use (we're a nonpartisan not-for-profit). As far as I know I'm only able to access our information through the web interface but if there's a way to use a different program I'd love to know about it (I've done some VERY BASIC SQL tutorials e.g. I know you need to end stuff with a semicolon and I'd be happy to learn more, that sounds fantastic, I just don't know how to apply SQL when I'm accessing the database through the web).
posted by Mrs. Pterodactyl at 7:21 AM on November 21, 2016

Yes, NGP VAN, sorry.
posted by Mrs. Pterodactyl at 7:22 AM on November 21, 2016

Hmmm....if you don't have access to the database itself, you're stuck with only what the web interface will allow you to do. Excel's "Get External Data" can parse data out of a webpage, but it can't run the query itself; you'll still have to manually put the data into the search fields in order to get the results to display.

There are ways to automate this, but you'll essentially be building a web bot -- which could be an exciting project, but it's not going to teach you much about databases, just about impersonating a user to scrape data from NGP's database -- plus it might violate the NGP VAN terms of service (they want control of the data; they don't want you scraping everything out to make your own database for free).

So, if you're stuck with the web interface you might be doing it manually, since that's what their only 'interface' is designed for.

You might try contacting the NGP people to see if they have an easier method for getting at the data for analysis purposes.
posted by AzraelBrown at 7:31 AM on November 21, 2016

If the best way is building a web scraper I'd be 100% happy to do that; I enjoy learning stuff and that will help me work with NGP even if it's not database specific. I contact NGP all the time and their support people are very nice and there's not much they can do for a lot of this stuff (NGP drives me NUTS NUTS NUTS).

I'll look into the terms of service to see if there's a problem but in the meantime how would I get started building a web bot? What tools would I use? How would I make it go? There are tons of simple tasks I try to do routinely that should be easy but aren't because for whatever reason the options just aren't available so it would be hugely, HUGELY helpful to me if I could get a little robot to perform my repetitive tasks (and, again, that actually sounds like fun in a way that doing this stuff one record at a time is profoundly not).
posted by Mrs. Pterodactyl at 7:39 AM on November 21, 2016

I think the simplest way is to start by using the VAN's Get External Data. (I last used it seven years ago, but IIRC it gives you text files of comma-separated values that represent each table.)

Then, import the CSV files into Excel or Google Spreadsheets. Finally, sort by last name, then by first name. You should be able to scan the list and easily see duplicates. If you have too many records to do this by sight, you can write a script (a small program that takes the spreadsheet as input) to pick out the non-duplicates for you. But first, I'd see if I could do it by eye.
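A minimal version of such a script could look like this in Python; the column headers ("FirstName", "LastName") are guesses and would need to match the real export:

```python
# Read the exported CSV, count how often each (last, first) name pair
# appears, and keep only the rows that appear exactly once, sorted the
# same way the by-eye scan would be: last name, then first name.
import csv
from collections import Counter

def non_duplicates(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    counts = Counter((r["LastName"], r["FirstName"]) for r in rows)
    singles = [r for r in rows if counts[(r["LastName"], r["FirstName"])] == 1]
    return sorted(singles, key=lambda r: (r["LastName"], r["FirstName"]))
```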

If you end up needing to write a web scraper, you'd start by learning about making browser extensions, e.g. Chrome extensions or Firefox Add-ons. I wrote a fairly technical blog post about making Chrome extensions. If you don't already know JavaScript, you'd have to learn it. It sounds like you'd be fine with it, but it may take you months to make yourself a scraper that does what you want.
posted by ignignokt at 7:47 AM on November 21, 2016 [1 favorite]

Seriously (and after this comment I'll try to stop threadsitting but I want to make sure I'm being clear about my needs so thank you for your patience) I cannot convey to you how helpful a bot or something to perform repetitive tasks would be. I would love this so much.

If I have a list of people who have attended an event along with their contact IDs, I can bulk update them to mark them attended. If I have a list of people who have RSVPed with their contact IDs, I have to input those one at a time because there's no bulk update for RSVPs, even though we need them to put together our guest lists. It makes me insane, and I would be so, so grateful for help in figuring out how to do my job in a way that's less profoundly stupid.

On preview, oh how I wish I could just export the entire database into Excel! If I could, I would do that and it would solve many of my problems! I could just have Excel highlight the duplicates for me! It would be great! I've tried to export the data so many times! I've contacted support and asked them if there are ways I can do it! The answer is "not really"! Apparently they updated NGP a year or so ago and everything got effing terrible and it's so frustrating.

I really appreciate the link to the blog post; as you say, I'd be happy to learn how to build the scraper but in the meantime is there something that already exists that I could figure out to modify slightly to fit my needs instead of starting from scratch? This is basically how I learned HTML in high school so maybe it could work now?
posted by Mrs. Pterodactyl at 7:53 AM on November 21, 2016

That's too bad. What does Get External Data get you? Just data from a specific list? Would it be possible to create several lists that together cover basically the entire database? Figuring out how to get around the export limitations is likely the faster way to get this done.

However, I think it's awesome to see people learn JavaScript and build unconventional technology! I just have to make sure it's clear that it's not a thing you can do within a few weeks if you're coming from no JavaScript knowledge. I think you're likely going to have to invest a lot of time.

The problem with bots that scrape web sites is that they're very, very specific to the data source that they're scraping. So, there's likely not a very similar example you can tweak, especially considering that the VAN is not a public site that many people know about.

I have an example Chrome extension that scrapes MetaFilter. It consists of several modules that A) load HTML from parts of the site, B) comb through it for specific information, and then C) decide what to do with it. This is one of the simplest scraping modules in it: it takes a profile page, looks for an `a` tag with the class `profile`, then takes the last part of the URL linked by that `a` tag, which should be the MeFi userid.
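For a sense of what those A/B/C steps look like outside a browser extension, here is the same idea sketched in Python using only the standard library; the HTML shape (an `a` tag with class `profile` whose href ends in the userid) follows the description above, and is an assumption about the page, not the extension's actual code:

```python
# (A) feed in a page's HTML, (B) look for an <a> tag whose class list
# contains "profile", (C) pull the userid off the end of its href.
from html.parser import HTMLParser

class ProfileLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.userid = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "profile" in (attrs.get("class") or "").split():
            # e.g. href=".../user/12345" -> "12345"
            self.userid = (attrs.get("href") or "").rstrip("/").rsplit("/", 1)[-1]

def extract_userid(html):
    parser = ProfileLinkParser()
    parser.feed(html)
    return parser.userid
```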
posted by ignignokt at 8:16 AM on November 21, 2016

If you can get the data into an Excel spreadsheet, you can use Conditional Formatting, Highlight Cells Rules, and then Duplicate Values to find the duplicate values. Then sort by color and delete the dupes (or whatever you need to do.)
posted by lyssabee at 8:16 AM on November 21, 2016

Selenium might be something to consider. The WebDriver component provides a way to talk to the web browser and have it interact with your database's web interface, while the Selenium IDE can help you work out the necessary clicks and navigation for WebDriver use, and can even partially automate the interface navigation on its own.

For the contact ID example, even if only one contact ID can be input at a time, as a first step you could use the IDE to construct a script that (depending on how your database interface works) initiates the search, selects an attendee's record, opens it, marks them as having attended the event, and returns to the search screen so you can input the next contact ID. If the script works reasonably well, the steps can then be described in a separate file so that the WebDriver component controls the browser and handles the contact ID input for you as well.
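A sketch of what the WebDriver half might eventually look like in Python, with every element ID and URL invented (the real selectors would have to come from recording with the IDE); the driver is passed in so the flow can be tried out without a browser:

```python
# Hypothetical Selenium-style flow: for each contact ID, reload the
# search page, enter the ID, run the search, and tick an "attended"
# checkbox. "contact-search" and "attended-checkbox" are made-up IDs.
def mark_all_attended(driver, contact_ids, search_url):
    for cid in contact_ids:
        driver.get(search_url)                      # back to the search screen
        box = driver.find_element("id", "contact-search")
        box.send_keys(str(cid))                     # enter the contact ID
        box.submit()                                # run the search
        driver.find_element("id", "attended-checkbox").click()
```

With a real browser, `driver` would be something like `selenium.webdriver.Firefox()`; here it can be any object with the same `get`/`find_element` methods, which also makes the loop easy to dry-run.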
posted by UrbietOrbi at 8:26 AM on November 21, 2016

On this page, NGP has headers called "Easy Importing and Exporting" and "Built-In Duplicate Management" and both make it sound like you should be able to do them easily. If their support group can't walk you through that or point you to some tutorials, there seems to be a big issue.
posted by soelo at 9:59 AM on November 21, 2016

Use iMacros. It's a very handy way to quickly build a simple scraper -- I've used it myself (in the Firefox addon version) to quickly knock something together to scrape large amounts of data from work databases. Once you get the hang of it, I doubt it will take you more than an hour or two to set up what you've described in your question.

Michael Schrenk's Webbots, Spiders, and Screen Scrapers is a great guide to building webbots and the like. His website has a bunch of talks he's given at Defcon, any of which are well worth watching for webbot ideas. It was reading his book that I learnt of iMacros.
posted by coleboptera at 6:28 PM on November 21, 2016 [1 favorite]

Oh my gosh thank you! I got the Firefox addon and it is doing exactly what I wanted. This has made my life hugely much better, thank you thank you thank you!
posted by Mrs. Pterodactyl at 7:00 AM on November 23, 2016 [1 favorite]

This thread is closed to new comments.