Want to make a bot web scraper
August 30, 2023 3:28 PM

Can you recommend any resources to make a bot web scraper? (Is that redundant?)

In a nutshell, I want it to look for A for each state, from the state government department website. (I do have starting URLs.) If it doesn’t find A for a given state, I want it to look for B. And so on. There are about half a dozen possible things to look for. All this looking up needs to be done every couple of years, so I’d like to automate it.

I have basic coding experience. I took a Python class several years ago, but I haven’t used it since then. I also have some experience with SQL, HTML, and CSS (and some other stuff), in case any of that is helpful.
posted by NotLost to Computers & Internet (11 answers total) 6 users marked this as a favorite
 
With Python, you will want to look into the BeautifulSoup library.
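
A minimal sketch to get a feel for it (the URL is a placeholder; swap in one of your real starting URLs). Requests fetches the page and BeautifulSoup parses it:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.gov/licensing"  # placeholder starting URL

    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Print the text and destination of every link on the page.
    for link in soup.find_all("a"):
        print(link.get_text(strip=True), "->", link.get("href"))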
posted by kschang at 3:33 PM on August 30, 2023 [4 favorites]


In cases where it's hard to get the content of the page for BeautifulSoup without executing the JavaScript, look into Selenium.
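
A minimal headless sketch (the URL is a placeholder). page_source gives you the HTML after the JavaScript has run, which you can then hand to BeautifulSoup as usual:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Run Chrome without opening a visible window.
    options = Options()
    options.add_argument("--headless=new")  # plain "--headless" on older Chrome

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.gov/licensing")  # placeholder URL
        html = driver.page_source  # the page as the browser sees it
    finally:
        driver.quit()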
posted by foxfirefey at 3:38 PM on August 30, 2023 [3 favorites]


How much consistency is there across states? Are you envisioning a single piece of code that will work for all states, or state-specific logic, or something in between (i.e. a few different groups of similarly formatted states)?
posted by staggernation at 4:15 PM on August 30, 2023


Beautiful Soup is pretty low-level for something like this, especially if you are not an experienced programmer. Scrapy is Python-based and has higher-level abstractions and features tailored for web scraping, and it may be easier to use.
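
A rough sketch of a spider (the name, URL, and selectors are all placeholders). Save it as, say, state_spider.py and run it with: scrapy runspider state_spider.py -o results.json

    import scrapy

    class StateSpider(scrapy.Spider):
        name = "states"
        # One starting URL per state would go here; this one is a placeholder.
        start_urls = ["https://example.gov/licensing"]

        def parse(self, response):
            # CSS selectors are a step up from walking raw HTML yourself.
            for link in response.css("a"):
                yield {
                    "text": link.css("::text").get(),
                    "href": link.attrib.get("href"),
                }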
posted by StrawberryPie at 4:19 PM on August 30, 2023 [1 favorite]


Another option may be ParseHub. They offer a free plan, but whether you can take advantage of it depends on the specifics of your situation.
posted by StrawberryPie at 4:24 PM on August 30, 2023 [1 favorite]


Do you want to code it yourself, or is that just where your mind went first? I ask because I do a lot of web scraping and use Octoparse for it, and I've been very happy with the results.
posted by NotMyselfRightNow at 5:08 PM on August 30, 2023 [2 favorites]


On Linux and Windows there is also the wget command, with its many switches.
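
For instance (the URL is a placeholder; see the wget manual for the full list of options):

    # Save one page to a local file:
    wget -O state_page.html https://example.gov/licensing

    # Mirror part of a site: follow links two levels deep, keep only HTML
    # files, stay below the starting directory, and pause between requests:
    wget --recursive --level=2 --accept=html --no-parent --wait=1 https://example.gov/licensing/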

Not 100% aligned with your question, but you might also find this article of interest for a legal viewpoint. IANAL.
posted by forthright at 6:44 PM on August 30, 2023


Scraping sites is hard to automate without getting your hands dirty digging into the responses.

Using wget is fine, but you'll need some way to automate it and pass it the requests you need to make. You'll also need some way to process the responses.

So since you have some Python skills, I'll second BeautifulSoup (and headless Selenium, where needed). They are workhorses for this kind of task.

For completeness, I'll also mention Requests for lookups to sites that require authentication. You can pass the output from Requests to BeautifulSoup and other tools for processing.
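
To sketch how that fits your A-then-B lookup: the search terms, the URL, and the assumption that each item can be recognized by text on the page are all placeholders here.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder terms, in priority order (your "A", "B", and so on).
    SEARCH_TERMS = ["Item A", "Item B", "Item C"]

    def find_first_match(url):
        """Return the first search term found on the page, or None."""
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        page_text = soup.get_text(" ", strip=True)
        for term in SEARCH_TERMS:
            if term in page_text:
                return term
        return None

    # Placeholder mapping of state to starting URL.
    state_urls = {"Ohio": "https://example.gov/ohio/licensing"}

    for state, url in state_urls.items():
        print(state, "->", find_first_match(url))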
posted by They sucked his brains out! at 11:37 PM on August 30, 2023 [1 favorite]


Also be aware that if this only needs to be done every couple of years, there's a good chance the HTML of those pages will change in some crucial way between each occasion, which means you might have to tweak whatever you use each time.
posted by fabius at 5:19 AM on August 31, 2023 [2 favorites]


O'Reilly's Web Scraping with Python is a book-length discussion of the topic. The book is 5 years old at this point, but the principles will mostly still apply. Government websites tend to be built using older techniques and not frequently updated, so you may not be missing out on much.

fabius' point above is very much true. I'd change the "might" to "you will have to".
posted by mmascolino at 9:07 AM on August 31, 2023 [1 favorite]


The easiest current tools are in JS/Node/npm territory and drive a real browser via the Chrome DevTools Protocol. VS Code and the Playwright framework have a record-and-replay plugin with a helper to make sure you scrape the right text from the right fields of the web page.

curl/wget will pull the content of the web page and leave you to wrangle the HTML; Python and Beautiful Soup will give you programmatic access to the raw HTML; Selenium will run a web browser and let you get the page as the browser sees it; and Playwright will record your journey clicking through pages.
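
Playwright also has a Python API if you'd rather not switch languages. A minimal sketch, assuming you've run Playwright's own browser installer (the URL is a placeholder):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.gov/licensing")  # placeholder URL
        # inner_text returns what the rendered page actually shows,
        # after any JavaScript has run.
        print(page.inner_text("body"))
        browser.close()

The record-and-replay piece is the playwright codegen command, which writes out a script while you click through the site.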
posted by k3ninho at 1:38 AM on September 1, 2023

