Searching for near-adjacent strings in pdf
April 9, 2020 3:16 PM

I want to search a PDF for words/strings nearly adjacent to one another, as this is at least suggestive of a link. I'd heard I could do this in AstroGrep but regex seems to be beyond my tiny intellect. Agent Ransack also does regex, but again regex is not gelling for me. Is regex my answer, and if so how do I do this?

I'm searching for all instances where 'ph' and 'transpir*' occur within a sentence. A separate search would be where the strings occur within, say, three lines of each other*. I need a solution that preferably shows the words highlighted on the screen, or at least spat out as a small text report.

I have about 30 papers, plus some of their sources, plus some citing papers, so maybe 50-80 papers.

If this worked for multiple pdfs at once that'd be great too, but small steps.

* for anyone interested my aim is determining if there is a causal relationship between soil ph and evapo-transpiration rate, with the aim of designing better plant mixes for rain gardens/SUDS.
posted by unearthed to Computers & Internet (7 answers total) 2 users marked this as a favorite
Have you tried: (ph.*transpir|transpir.*ph)
I use Agent Ransack all the time on thousands of PDFs (ebooks, my documents scanned with a Fujitsu ix500, etc.)

One thing that confuses folks new to RegEx is that * is a modifier (a quantifier), not a wildcard: it repeats whatever comes just before it. Dot is "any character", so .* means any character any number of times (including zero). .+ means any character 1 or more times. .? means any character 0 or 1 times.
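
If it helps to see those quantifiers behave, here is a minimal sketch in Python. Python is my assumption here, not something Agent Ransack requires, and other regex engines can differ in small details such as whether dot matches a line break.

```python
import re

# "*" alone is not a wildcard; it repeats whatever comes just before it.
# "ab*" therefore means "a" followed by zero or more "b"s.
star = re.fullmatch(r"ab*", "abbb") is not None        # True
star_empty = re.fullmatch(r"ab*", "a") is not None     # True: zero "b"s is allowed

# "." is "any character" (in Python, any character except a newline by default).
dot_star = re.fullmatch(r"a.*", "a") is not None       # True: ".*" may match nothing
dot_plus = re.fullmatch(r"a.+", "a") is not None       # False: ".+" needs at least one
dot_opt = re.fullmatch(r"a.?", "ax") is not None       # True: one optional character
dot_opt_long = re.fullmatch(r"a.?", "axy") is not None # False: ".?" allows at most one

# By default the dot stops at a line break, so this does not match.
across_lines = re.search(r"pH.*rate", "pH\nrate") is not None  # False

print(star, star_empty, dot_star, dot_plus, dot_opt, dot_opt_long, across_lines)
```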

There are interactive Web sites where you can type in some text, type in a RegEx and just see if it works. That can help flatten the learning curve.

There is also this RegEx reference site and it has a Quick Start page.

Of course this all assumes that your PDFs are searchable, not just scanned images.

You can also feel free to MefiMail me if you wish. Disclaimer, I am not an expert, just someone who uses RegEx frequently at work as a programmer and at home with my electronic files.
posted by forthright at 5:46 PM on April 9, 2020 [1 favorite]

Corpus linguistics tools can help you with this. A concordance will list up all of the occurrences of a search term in context. A collocate tool will then tell you which words are commonly together with it and how strong the connection is. Collocate tools will also generate a view of the two words together in their context. The collocate tool might also reveal something else.

So, search for "ph" get the list of collocates and then click on transpire, transpiration, etc.
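
To give a rough feel for what a collocate list is, here is a small Python sketch. It is not how AntConc works internally; the function name `collocates`, the window size, and the sample text are all made up for illustration, and it only does raw counts rather than the strength-of-connection measures a real tool reports.

```python
import re
from collections import Counter

def collocates(text, node, window=5):
    """Count the words found within `window` words of `node` (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            start = max(0, i - window)
            # Words before the node plus words after it, excluding the node itself.
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical sample text, not taken from any real paper.
sample = "Soil pH influenced transpiration. Lower pH reduced transpiration in sedges."
counts = collocates(sample, "ph", window=2)
print(counts.most_common(3))
```

The most frequent neighbours are the candidate collocates; a dedicated tool then lets you click through to each one in context.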

One package is AntConc. Free/donation. Very robust software that is widely used.
Watch this tutorial to see if it does what you want. The relevant point is about 4 minutes in.
posted by Gotanda at 5:47 PM on April 9, 2020 [1 favorite]

> (ph.*transpir|transpir.*ph)

One step up from forthright's code would be something like this:
(ph.{0,300}transpir|transpir.{0,300}ph)
That will find ph followed by up to 300 characters of any sort followed by transpir OR the same thing but in the opposite order.

If you want to vary the maximum allowed number of characters between the ph and the transpir just change the 300 to 400, 500, 100, 1000 or whatever you want. Examples:
You can do much fancier/more complicated things to find sentences or look within a certain number of lines but for any reasonable purpose I can think of, just looking within a certain number of characters is going to be just as good and it is a whole lot simpler.
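
For anyone who would rather script the character-window idea than type it into a search tool, here is a hedged Python sketch. The helper name `near` and the sample sentences are invented for illustration, and the `re.DOTALL` flag (so the window can span line breaks in extracted text) is my assumption, not something a particular search tool is known to do.

```python
import re

def near(a, b, max_gap):
    """Pattern for a then b, or b then a, with at most max_gap characters between."""
    a, b = re.escape(a), re.escape(b)
    gap = ".{0,%d}" % max_gap
    # re.DOTALL lets "." cross line breaks; drop it to stay on one line.
    return re.compile(f"({a}{gap}{b}|{b}{gap}{a})", re.DOTALL)

# Hypothetical sample text; the two terms are about 60 characters apart.
sample = ("Soil ph was measured weekly. Plots with lower values "
          "showed reduced transpiration in summer.")
reverse = "transpiration rose as ph fell."

wide = near("ph", "transpir", 300).search(sample)    # window is big enough
narrow = near("ph", "transpir", 20).search(sample)   # window is too small
flipped = near("ph", "transpir", 20).search(reverse) # opposite order, small gap

print(bool(wide), bool(narrow), bool(flipped))
```

Changing the third argument is the same as changing the 300 in the pattern by hand.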

I made this regex fiddle with the regex above and some sample text so you can try it out. That is usually how people work out and refine regular expressions - they're complicated even for the best of us.
posted by flug at 7:20 PM on April 9, 2020 [1 favorite]

You should put word boundary markers "\b" around the word "ph" to avoid matching "ph" within a word.

Example: (\bph\b.*transpir|transpir.*\bph\b)
This avoids matching something like "the philosophers meeting transpired yesterday".
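
A quick way to convince yourself, sketched in Python (my choice of language; the second sample sentence is made up):

```python
import re

loose = re.compile(r"(ph.*transpir|transpir.*ph)")
strict = re.compile(r"(\bph\b.*transpir|transpir.*\bph\b)")

false_hit = "the philosophers meeting transpired yesterday"
real_hit = "soil ph strongly affected transpiration"  # hypothetical sentence

# Without \b, the "ph" inside "philosophers" counts as a hit.
loose_false = loose.search(false_hit) is not None    # True
# With \b, "ph" must stand alone as a word, so the false hit goes away.
strict_false = strict.search(false_hit) is not None  # False
strict_real = strict.search(real_hit) is not None    # True

print(loose_false, strict_false, strict_real)
```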
posted by monotreme at 9:05 PM on April 9, 2020

Pattern recognition is deceptive because it is something the human brain is spectacularly good at. In fact describing patterns is a very messy art.

Regex is very complex but also is probably the simplest way to describe text patterns to a computer. If you want to do text recognition you’re pretty much stuck with it.

There are a lot of sites that will allow you to test your expressions in real time.

The fact that there are a lot of sites reflects the fact that most people are in the same boat. Everybody has trouble getting the expressions correct, so they take a good guess and then fiddle with it until it works right.
posted by Tell Me No Lies at 10:12 PM on April 9, 2020

Response by poster: Amazing - forthright got me started - I couldn't find a way in to regex before so thanks heaps.

Gotanda, that answers a completely different problem that I didn't even have the words to ask a question for.

Thanks flug, I went up and down the numbers a bit with this, very powerful.

So using this:
(\bph\b.{0,100}transpir|transpir.{0,100}\bph\b)

Got me down to this. And two of these are only file name hits so that gets me from 23 files down to 3!

I'll continue with regex now and use things like that too, monotreme.

Yes, I know to be wary with patterns, Tell Me No Lies - hiding, revealing, making them, and avoiding seeing non-existent ones is much of my work. Most of these papers are still very useful and they can be refiled now.
posted by unearthed at 12:46 AM on April 10, 2020

You probably also want to add something to make the search case-insensitive. There might be 'pH' or 'Transpir' in the data.
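
In Python, for instance, that is a flag on the search. This is only a sketch with a made-up sentence; a GUI search tool may expose the same thing as a checkbox or option instead, so check what yours offers.

```python
import re

pattern = r"(\bph\b.{0,100}transpir|transpir.{0,100}\bph\b)"
text = "Soil pH had little effect on Transpiration rates."  # hypothetical sentence

# Case-sensitive: "pH" and "Transpiration" do not match the lowercase pattern.
sensitive = re.search(pattern, text) is not None                   # False
# Case-insensitive via a flag argument.
insensitive = re.search(pattern, text, re.IGNORECASE) is not None  # True
# The same thing written inside the pattern itself; Python accepts (?i) at the start.
inline = re.search("(?i)" + pattern, text) is not None             # True

print(sensitive, insensitive, inline)
```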
posted by zengargoyle at 2:07 PM on April 10, 2020 [1 favorite]
