Web scraping when JS loads data in-line (Kindle highlights)
January 3, 2018 8:52 AM

I'm working on a tool to scrape my Amazon Kindle highlights from the web. It looks like Amazon uses JavaScript to load the highlights, so only 50 are present at first, and I can't figure out how to get an HTML file that has all the highlights in it.

I'm using the read.amazon.com site to view my notebook. When I first open a page, there are 50 highlights loaded, and if I save the page source to HTML that's exactly what I get.

If I scroll down in the web page, more highlights load. I don't know whether the page replaces previous highlights (i.e., if it shows #2-51) or if it just keeps adding to the page. Regardless, even when I'm looking at my last highlights in my web browser and then save the page source, I still only see the original 50 in the saved HTML file.

My end goal is to get an HTML file that has all my highlights at once. I'd settle for a series of files that had the various chunks of highlights (e.g., 1-50, 51-100, 101-150), and I could even deal with having overlap in the chunks (e.g., 1-50, 45-95).

There's probably an obvious way to do this, but I'm missing it.
posted by philosophygeek to Computers & Internet (9 answers total)
 
Two things I'd try, in this order: first, can you use a developer console to observe the network requests for the follow-on items and see if there's an API hidden there for retrieving them? Second, if you really need the web page you could load it in a headless browser like PhantomJS, simulate some scroll events to trigger loading, and then query the page DOM for items.
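
Here's a rough sketch of that second approach in Python, using Selenium with headless Chrome (PhantomJS would work the same way). Untested; the waits, the output filename, and the login handling are all hand-waved:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    options = Options()
    options.add_argument("--headless")
    # Hand-waved: you need a logged-in Amazon session, e.g. by pointing Chrome
    # at an existing profile with options.add_argument("--user-data-dir=...").
    driver = webdriver.Chrome(options=options)
    driver.get("https://read.amazon.com/notebook")

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom to trigger loading of the next batch.
        # (If the highlights live in an inner scrolling pane, you'd scroll
        # that element instead of the body.)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the new batch time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; we've hit the end
        last_height = new_height

    # page_source now reflects the fully-loaded DOM, unlike "save page source"
    # in the browser, which only gives you the original HTML.
    with open("highlights.html", "w") as f:
        f.write(driver.page_source)
    driver.quit()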
posted by migurski at 9:04 AM on January 3, 2018


If you can open up the Network inspector (Ctrl + Shift + I, or Cmd + Option + I on a Mac, then click the "Network" tab), you should be able to see the underlying network call that Amazon is using to load in the next results. Scroll down to trigger the fetch of the next set of books and you should see a new network call pop up in the inspector. Hopefully it will contain some parameters you can fiddle with to load in more (or all) of the books as JSON without having to scroll. I only have one book on that page so I can't recreate it, but there's a call to a "getOwnedContent" endpoint that looks promising.

Alternatively, you could write some JS that will automatically scroll down the page for you until it can't anymore, and then save that file.

(On preview, what migurski said.)
posted by mustardayonnaise at 9:09 AM on January 3, 2018


Response by poster: I should add that, in addition to specific things to try, I need some support resources or tutorials for how to do those things. I'm only a hobbyist programmer, and for this project I've had to learn from ground zero how to use Python (to parse the web content) and AppleScript (to send the results to another Mac app). I don't know anything about JavaScript, so I'm understandably overwhelmed and frustrated at being so close and yet so far away.
posted by philosophygeek at 9:19 AM on January 3, 2018


Here's a good video intro to the first suggestion (using the Network inspector), especially about halfway through when they start talking about inspecting AJAX requests. If this strategy works, you would end up with a URL you could hit via Python with something like Requests. No JS required there.
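
For example, once you've pulled a URL out of the Network tab, the Python side is just a few lines. The URL and cookie values below are placeholders; you'd copy the real ones out of your logged-in browser's request headers:

    import requests

    # Placeholders: paste in the real URL from the Network tab, and copy the
    # cookies your logged-in browser sends along with it.
    url = "https://read.amazon.com/notebook?asin=XXXXXXXXXX&index=51"
    cookies = {"session-id": "PASTE-FROM-BROWSER"}

    resp = requests.get(url, cookies=cookies)
    resp.raise_for_status()
    print(resp.text)  # HTML or JSON, depending on the endpoint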

As far as JS goes, there are a few options depending on how deep you want to get. What you're looking to do falls under the general umbrella of "DOM Manipulation," or using JS specifically in the web browser context. There are many resources for that out there, but it's a massive topic. Alternatively, you could search for specifically what you want, e.g. "automatically scroll webpage javascript," which returns some useful-looking hits.
posted by mustardayonnaise at 9:43 AM on January 3, 2018


Since you're learning Python, take a look at Scrapy.org. It's an awesome Python library/toolset for web scraping. It might be throwing you a little bit into the deep end, but it has a decent community and lots of documentation. The scrapy shell is also a pretty good debug tool so you don't get stuck in the endless edit-save-run loop.
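
A minimal spider looks something like this (the URL and selector are placeholders, and note that stock Scrapy doesn't run JavaScript, so you'd point it at URLs captured from the Network tab rather than at the notebook page itself):

    import scrapy

    class HighlightSpider(scrapy.Spider):
        name = "highlights"
        # Placeholder: point this at the URL(s) you captured in the Network tab.
        start_urls = ["https://read.amazon.com/notebook?asin=XXXXXXXXXX"]

        def parse(self, response):
            # Placeholder selector: use `scrapy shell <url>` to find the real one.
            for text in response.css("span.kp-notebook-highlight::text").extract():
                yield {"highlight": text}

Save it as highlights_spider.py and run "scrapy runspider highlights_spider.py -o highlights.json" to dump whatever it finds.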
posted by cgg at 10:00 AM on January 3, 2018


Maybe something like this "Scrape dynamic HTML (YouTube comments)" question?

But yeah: use some module that drives Selenium or PhantomJS, load the page and let it run the JS, send scroll-down events until you reach the end, wait a bit, and then parse the final DOM with whatever tool you like.
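
That last step might look like this with BeautifulSoup, assuming you saved the fully-scrolled page to highlights.html with the Selenium sketch above (the class name is a guess; check the inspector for the real one):

    from bs4 import BeautifulSoup

    with open("highlights.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Hypothetical class name: inspect the saved page to find whatever
    # element Amazon actually wraps each highlight in.
    for span in soup.select("span.kp-notebook-highlight"):
        print(span.get_text(strip=True))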
posted by zengargoyle at 10:21 AM on January 3, 2018


I'm all for a developer scratching their own itch (guilty, guilty, guilty). If you get frustrated and just want it done (like I did), then check out Bookcision.
posted by tayknight at 10:25 AM on January 3, 2018


FWIW, using Chrome's developer tools...
Going to the notebook page and clicking on a book loads a URL like...

https://read.amazon.com/notebook?asin=B004OR18FG&contentLimitState=&

The HTML that's returned has two hidden elements, kp-notebook-content-limit-state and kp-notebook-annotations-next-page-start, whose values can then be used to construct a URL like...

https://read.amazon.com/notebook?asin=B004OR18FG&contentLimitState=BLAHBLAHBLAH&index=52

I don't have any books with more than 57 annotations. I think it counts highlights and notes as separate entities.
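
If that holds up, a paging loop in Python would look roughly like this. Untested; the cookie handling is hand-waved, and it assumes both hidden elements carry their data in a "value" attribute:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    # Hand-waved: copy the cookies from a logged-in browser into session.cookies.

    asin = "B004OR18FG"
    url = "https://read.amazon.com/notebook?asin={}&contentLimitState=&".format(asin)
    chunks = []

    while url:
        resp = session.get(url)
        resp.raise_for_status()
        chunks.append(resp.text)  # keep each page of highlights as-is

        soup = BeautifulSoup(resp.text, "html.parser")
        state = soup.find(id="kp-notebook-content-limit-state")
        nxt = soup.find(id="kp-notebook-annotations-next-page-start")
        # Assumption: both hidden elements expose their data via a "value" attribute.
        if state is not None and nxt is not None and nxt.get("value"):
            url = ("https://read.amazon.com/notebook?asin={}"
                   "&contentLimitState={}&index={}").format(
                       asin, state.get("value", ""), nxt["value"])
        else:
            url = None  # no next-page token, so that was the last chunk

    # Write out the series of chunk files the poster said they'd settle for.
    for i, html in enumerate(chunks, 1):
        with open("highlights-{}.html".format(i), "w") as f:
            f.write(html)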
posted by tayknight at 2:15 PM on January 3, 2018


I've used Prerender.io to handle cases like this. It basically works by rendering the page with PhantomJS or headless Chrome and then returning HTML to you.
posted by toxic at 2:38 PM on January 3, 2018

