let's scrape this thing clean!
November 1, 2017 12:22 AM

Please help me download ALL of the (dynamically loaded, JavaScript-served) images from this website.


For my master's thesis, I'd like to download all of the images from IKEA's online catalog as visual research material. Yes, all of the images. A more technical friend pointed me to a website copier, but I can't seem to get it to work properly, at either the sub-page level or the entire-site level. My suspicion is that this is because all the images are dynamically served/loaded using JavaScript (???), but really I have no idea.

I am aware that I can manually download each image individually by inspecting the source and going directly to the file, but for my sanity, please help me find a better way. I've also tried various Chrome bulk-image-downloader plugins, and while I can download a significant portion of the images that way, the set is still not complete, and I can't figure out what I'm missing without manually checking each page.

Can you help me scrape IKEA's catalog clean? Please assume minimal technical tinkering ability/time. Bonus points for also grabbing the .mp4's and the pseudo-gifs, which are actually just a series of .jpgs displayed sequentially and/or as you scroll down.
posted by wym to Computers & Internet (20 answers total) 5 users marked this as a favorite
Installing DownThemAll into Firefox will get you a fair way along. You will still need to visit pages you're interested in by hand, but you won't need to go scrobbling through source code to get the links to the media files.
posted by flabdablet at 1:11 AM on November 1, 2017

OK, so I hit the same incompleteness wall as you with DTA, probably due to the use of Javascript to load images.

Having more luck with CacheViewer. Here's the workflow I just used to grab 133 images from the main page of the catalog:

1. Add CacheViewer to Firefox.

2. Browse to http://onlinecatalogue.ikea.com/US/en/IKEA_Catalog/?index.

3. From the Firefox hamburger menu, choose History>Clear Recent History; set Time Range to Clear to Everything; make sure the only thing selected under Details is Cache; click Clear Now.

4. Refresh the Ikea catalog tab, then scroll slowly all the way to the bottom to force all the images to finish getting lazy-loaded.

5. From the Firefox hamburger menu, choose Developer>CacheViewer.

6. In the resulting CacheViewer window, click the Mime Type heading to sort the list of cached files by type.

7. Click the first image/jpeg item; scroll down to the last of the image/svg+xml items and shift-click that to select all cached images; right-click a selected item and choose Save As.

Result: a folder containing all the images that the browser fetched while viewing that catalog page.

I'm sure you could tinker with this workflow to grab everything you're interested in. Just keep an eye on the cache usage indicator at the bottom of the CacheViewer window, and be sure to save your cached images before the cache fills up and Firefox has to start discarding them to make room.
posted by flabdablet at 2:07 AM on November 1, 2017 [1 favorite]

There is a much less complicated way to do what flabdablet recommends. It does not involve installing extensions or inspecting browser caches.

Firefox has a Page Info pane available for any webpage. So do the following:
  1. Browse to http://onlinecatalogue.ikea.com/US/en/IKEA_Catalog/?index.
  2. From the browser menu, select Tools -> Page Info. Alternatively, type Ctrl-I (Windows/Linux) or Cmd-I (Mac).
  3. Click on the "Media" tab.
  4. You will see a table of page assets. You can click on the "Type" table header to sort by type. Shift-click-select all the image assets. Click "Save As..."
  5. In the file dialog, pick your target folder. Save.
Then load the next page and try again.
posted by ardgedee at 7:52 AM on November 1, 2017 [1 favorite]

Sorry, now that I'm inspecting my results from that, I see I'm falling short of the 133 that flabdablet got. So his method will be better; mine is good for less-complicated pages.
posted by ardgedee at 7:57 AM on November 1, 2017

ardgedee, I'm pretty sure the media listed in Page Info are essentially the same selection DownThemAll offers for bulk downloading; I'd written off Page Info as well as DTA because the first time I looked at it, the list of media appeared to be far too short.

Further experimentation shows that after scrolling the page slowly to the bottom to force it to finish lazy-loading all its images, Page Info then lists 135 media files while DownThemAll offers 130 under "Pictures and Media" (it appears to be missing four SVG backgrounds and the favicon).

So I'll second the Page Info method, which is indeed less fiddly than using CacheViewer.
posted by flabdablet at 8:45 AM on November 1, 2017

I haven't tried them yet, but would these methods grab all images from all subpages as well, such as from http://onlinecatalogue.ikea.com/US/en/IKEA_Catalog/?dcli06, even though it's no longer the main page?
posted by wym at 10:57 AM on November 1, 2017

Not automatically; you'd have to browse to the subpage concerned by hand first.

On the one you've linked to, I got 33 items with DownThemAll, 36 with View Page Info>Media, and 94 from the CacheViewer method.

The difference seems to be that the browser cache retains an image file for each frame of the animations like the one of the guy adjusting the reading lamp, while both the View Page Info>Media list and DownThemAll can only see the currently displayed frame at any given instant.
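For what it's worth, once you've spotted one of those frame files in the cache, the rest usually follow a numbered naming pattern, so you can also generate the sibling URLs directly. A minimal Python sketch, assuming a hypothetical frame_{n:03d}.jpg pattern (substitute whatever pattern you actually see in the cache):

```python
# Sketch: expand a numbered-frame URL template into the full list of
# candidate frame URLs. The template and frame count below are
# hypothetical -- use the pattern you actually observe in the cache.

def frame_urls(template, first, last):
    """Expand a template like '.../frame_{n:03d}.jpg' over a frame range."""
    return [template.format(n=n) for n in range(first, last + 1)]

# Hypothetical example path, not a real Ikea URL:
urls = frame_urls(
    "http://onlinecatalogue.ikea.com/assets/lamp/frame_{n:03d}.jpg",
    first=1,
    last=5,
)
for u in urls:
    print(u)
```

You could then feed that list to any downloader, skipping any URLs that come back 404 once you overshoot the real frame count.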
posted by flabdablet at 11:43 AM on November 1, 2017

The other good thing about using CacheViewer is that you don't have to go through a select-and-save cycle for every single page you visit. If you start by clearing the cache, then browse through a whole pile of pages (being sure to scroll to the bottom of each one to force the images to load), and only then open a CacheViewer window, you can save all the cached material from all the pages you've visited in one big gulp (as long as the cache isn't full; the handy bar graph at the bottom of the window shows how close to full it is).
posted by flabdablet at 11:52 AM on November 1, 2017

Whether that works depends on how the page's JavaScript handles images.

Without going too far into the weeds, there are essentially two ways of displaying image files sequentially: the script can insert all the images at once* and make only one visible at a time, or the script can insert and remove images one at a time. *("at once" won't necessarily mean "nearly simultaneously", but that's where we start hacking at the weeds :)

In the first case you'd be able to access all the images with the Page Info method, because those assets are all attached to the page at that point. In the second case, Page Info can only show you whichever image from the set is currently inserted.

Since it sounds like Ikea's site is doing the latter (or is using a mix of techniques), it's going to be safer to collect the images from the cache.
posted by ardgedee at 11:53 AM on November 1, 2017

And just to be perfectly clear: if the cache isn't full, then you can be sure it contains every static image file the browser has rendered from any site you've visited since the cache was last cleared; browsers don't evict things from the cache until they need the room or the user specifically tells them to.

Video and audio playback can use mechanisms that bypass the cache, but for something like this catalog I'd be pretty confident that the cache will catch everything.
posted by flabdablet at 12:04 PM on November 1, 2017

I'm on board with the cache vs page info plan, and this is definitely an improvement over what I was doing (especially with grabbing the lamp guy), but is there a way that I can do this without having to manually navigate to all the subpages?
posted by wym at 12:39 PM on November 1, 2017

I'm not aware of a shrink-wrapped page scraper that can do this, granted it's been a while since this was a thing I had to do. Even archive.org has trouble adequately scraping modern websites. On the other hand, your work is a repetitive task and should be at least partially automatable.

Since Ikea's site is extensive and is not a simple tree structure, I strongly recommend hooking up with somebody who can do some client-side scripting. After laying out your goals, you'll have to spend some time discussing scope, because Ikea has a lot of products, regional content (there are subsites for each store in the U.S., for example), and content that is subcategorized in multiple ways. There's going to be some fraction of the entire site that should be good enough and still, for example, get you all the images from all the product pages. This might be a two-step process: the site is scraped once to generate a list of page URLs, you pick the ones you need, and then a second script digs into those pages. This would also be the point where your assistant discovers there's a simple heuristic for accessing all the images directly, which could speed things up a lot, but might also require you to sort all the grabbed images manually afterwards.
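To give a flavor of what that second script would do: the core of it is just pulling image URLs out of each page's HTML. A minimal sketch using only Python's standard library, run here against a hard-coded sample rather than the live site (the data-src attribute handling is an assumption about how the lazy-loader names things; your helper would adjust it to whatever Ikea's markup actually uses):

```python
# Sketch: collect the src of every <img> tag from a page's HTML.
# In practice you'd feed in HTML fetched with urllib.request; a
# hard-coded sample is used here so the sketch is self-contained.
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Gather image URLs, preferring a data-src lazy-load attribute
    (an assumed convention) over the plain src."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("data-src") or attrs.get("src")
            if src:
                self.urls.append(src)

sample = """
<html><body>
  <img src="/img/sofa.jpg">
  <img data-src="/img/lamp.jpg" src="/img/placeholder.gif">
</body></html>
"""

collector = ImageCollector()
collector.feed(sample)
print(collector.urls)  # ['/img/sofa.jpg', '/img/lamp.jpg']
```

The catch, as discussed above, is that this only sees what's in the HTML the server sends; images inserted later by JavaScript won't be there, which is exactly why the cache-based methods catch more.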
posted by ardgedee at 12:51 PM on November 1, 2017

I now have 1,867 jpgs after manually navigating through the subpages and using the CacheViewer method. I think it took me about 40 minutes of actual hands-on clicking-scrolling-loading time.

Thank you thank you thank you! I didn't realize that a more automated/rigorous process would be so involved, but this is exactly what I was looking for for my current needs.
posted by wym at 2:35 PM on November 1, 2017

I've used OutWit Hub for projects like this. It takes some configuration, but if you can identify the right patterns, you could probably configure OutWit to download things automatically.
posted by reeddavid at 5:52 PM on November 1, 2017

not sure if this helps but:
Google Image Search
posted by pyro979 at 6:57 PM on November 1, 2017

As a librarian who's seen students get into trouble when they try to publish a thesis and need permission for the images they used in their research, I'd suggest thinking about whether there's an alternate set of images that aren't copyrighted.
posted by kbuxton at 11:01 PM on November 1, 2017

Yes, absolutely. If your thesis is going to include any of the images themselves, as opposed to just information derived from the images along with sourcing information, then you're absolutely going to need Ikea's permission to republish them. And if they're willing to give you that, they might also be willing to supply you with a machine-readable list of image URLs so you don't have to jump through these hoops again whenever they update their catalog.
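If such a list ever materialized, fetching everything on it would be a very short script. A sketch assuming a plain text file with one URL per line (the urls.txt filename and catalog_images folder are my own placeholders):

```python
# Sketch: download every image listed one-per-line in a text file.
# The list file name and destination folder are assumptions.
import os
import urllib.parse
import urllib.request

def local_name(url, dest="catalog_images"):
    """Map a URL to a local path under dest, using its last path component."""
    name = os.path.basename(urllib.parse.urlparse(url).path) or "index"
    return os.path.join(dest, name)

def download_all(list_file="urls.txt", dest="catalog_images"):
    os.makedirs(dest, exist_ok=True)
    with open(list_file) as f:
        for url in (line.strip() for line in f):
            if url:
                urllib.request.urlretrieve(url, local_name(url, dest))

print(local_name("http://example.com/images/sofa_01.jpg"))
```

Note that this naive naming scheme would silently overwrite files whose URLs end in the same name, so a real version would want to de-duplicate.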
posted by flabdablet at 3:08 AM on November 2, 2017

I think it took me about 40 minutes of actual hands-on clicking-scrolling-loading time.

For what it's worth, that's a lot less time than it would take to research, spec and develop a script to do the same thing.
posted by flabdablet at 3:13 AM on November 2, 2017

Side benefit: your manual clicking created a history that you can also save / reuse for the future, if you need to do it again.
posted by gregglind at 8:56 PM on November 2, 2017 [1 favorite]

In case you do end up needing to do it again, you might want to install ScrollAnywhere as well. This will let you start the scrolling that forces all the images to load with a single mighty middle-button shove, after which the page will continue to scroll at the speed you shoved it until it hits bottom.
posted by flabdablet at 6:21 AM on November 3, 2017
