How to automate downloading photos from daycare website
July 21, 2021 10:06 AM
I'm looking for help to automate downloading several hundred days' worth of photos and comments from my daycare provider's website.
Our daycare provider used an online tool called HiMama to share photos and comments of my son during the day. I'm looking for a solution to easily back up the photos and comments from the site. There's no option to batch download the information. When I spoke to customer support, they said the only option was to download each item individually. The problem is that there are a few hundred entries. The site is set up like this:
https://imgur.com/a/k9XYWQo
Is there any kind of tool that I could use to make this process a bit more automatic?
Response by poster: I realize I should probably have added that I'm using Windows 10 and I have Chrome, Firefox and IE installed. So, anything that works with any of them is fine.
Also, wanted to emphasize that I want to be able to download the comments as well (i.e. not just the images) since they add context to the images. I haven't had a chance to check out the tools suggested so far, but I'll have a look when I'm home from work tonight.
posted by NoneOfTheAbove at 10:22 AM on July 21, 2021
I do something like this as part of a project I'm working on.
I would use Octoparse to do a basic scrape of the comments and image URL for each photo, then use WFdownloader to load up the image URLs to download each of the photos.
posted by NotMyselfRightNow at 10:47 AM on July 21, 2021 [4 favorites]
I enjoy scripting this kind of thing, so if WFdownloader proves inadequate and you'd be willing to send me a link to the live site and a set of credentials that work on it, I'd be happy to see what I can come up with in a scripted mass downloader that will work under Windows, probably using PowerShell. I promise not to leak access to your account or anything fetchable from it to anybody else, and also to delete from my own machines anything I download from it during script development and testing. You can reach me via mefi mail, or even more privately via Keybase (on which I am also flabdablet).
posted by flabdablet at 12:31 PM on July 21, 2021 [3 favorites]
I use (Win)HTTrack for this. It describes itself as an offline web browser and lets you copy a website, or a portion of one, to local storage in an automated way. It also allows you to run an update to only download changes (and either keep or delete removed data). It can dynamically rewrite links and put all the images in the same folder regardless of their directory location on the source, or keep the source layout.
It is a bit fiddly but can usually be configured with a few trial runs if the site isn't actively trying to prevent downloads.
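For reference, the command-line flavour of a basic mirror run looks something like the line below; the URL filter and output folder are only illustrative, and a logged-in session would still need to be supplied (HTTrack can pick up a cookies.txt placed in the project folder, if I remember rightly).

httrack "https://www.himama.com/" -O "C:\himama-mirror" "+*.himama.com/*"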
posted by Mitheral at 1:19 PM on July 21, 2021
Best answer: OK, so NoneOfTheAbove did reach out via memail, and I was able to help with some little scripts. I'll document what I did here in case anybody else finds themselves in similar need.
As it turns out, TinyNoneOfTheAbove has not attended the daycare in question for a couple of years, so there was no need to build something that could update a local archive on an ongoing basis; all that was actually required was a one-time static offline copy of the collection of images and notes accumulated over the years of attendance. So I made no attempt to get this happening under Windows, instead using my preferred Linux scripting environment.
First step was to browse to HiMama, log in, and have a bit of a poke around with the browser's network tools to see what resources the site was actually pulling in. This helped me find an API for fetching some JSON describing the journal entries that NoneOfTheAbove was interested in, for which hooray because using jq to work with JSON is way way less fiddly than scraping HTML pages that build their content dynamically using scripts, and jq is installable as a Debian package.
Next thing was to get access to NoneOfTheAbove's HiMama account via curl rather than having to do everything inside the web browser. This was easily done by using the cookies.txt extension for Firefox to export the cookies for the logged in browser page to a file at /tmp/himama/cookies.txt.
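Before building anything on top of the exported cookies, it's worth a quick check that curl really is getting a logged-in session with them. This is just a sketch; the assumption that the journal API answers 200 for a valid session (and something like a 302 redirect to the login page otherwise) is mine, not anything documented.

cd /tmp/himama
curl -s -b cookies.txt -o /dev/null -w '%{http_code}\n' "https://www.himama.com/accounts/REDACTED/journal_api?page=1"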
As used on the site, the API delivers JSON for journal entries one page at a time, and I found no obvious way to learn ahead of time how many pages existed. However, all of the interesting stuff on each page is contained inside a JSON sub-object named "intervals", and requesting a page with a too-big page number makes the reply contain an empty "intervals" object, so I wrote a little script to start from page 1 and keep appending the resulting JSON to a file until an empty "intervals" object turns up. Here's /tmp/himama/fetch-journal:
#!/bin/bash -x
account=REDACTED #numeric, shows up in browser address bar after logging in
api=https://www.himama.com/accounts/$account/journal_api
page=1
until test '{}' = "$(
	curl -b cookies.txt -c cookies.txt "$api?page=$page" |
	tee -a journal.json |
	jq .intervals
)"
do
	let page+=1
done

Next I poked around inside journal.json with assorted ad hoc jq queries until I figured out how to collect all the "activity" sub-objects I was interested in, then extract a list of activity ID numbers and photo URLs, then fetch them. Here's /tmp/himama/fetch-photos:
#!/bin/bash -x
<journal.json jq '
	.intervals |
	to_entries[].value[].activity
' |
tee activities.json |
jq -r '
	select(.image.url != null) |
	"\(.id)\t\(.image.url)"
' |
while IFS=$'\t' read -r id url
do
	curl "$url" >$id.jpg
done

Note the use of the -r option on jq, which makes it emit plain text rather than JSON-formatted strings when asked to emit strings.
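A quick way to see what -r changes, using a throwaway string rather than the real data:

jq -n '"12345.jpg"'      # prints "12345.jpg" (a JSON string, quotes included)
jq -n -r '"12345.jpg"'   # prints 12345.jpg (plain text, ready for the read loop)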
Final step was to make a text file to accompany each activity's photo, containing the date the activity was updated, the activity title, and the activity description. Here's /tmp/himama/make-notes:
#!/bin/bash
r=$'\r'
<activities.json jq -j '
	select(.image.url != null) |
	"\(.id)\n\(.updated_at)\n\(.title)\n\(.description)\u0000"
' |
while IFS=$'\n' read -rd '' id date title description
do
	cat <<-EOF >$id.txt
		$(date -ud "${date%[+-]*}" +'%Y-%m-%d %H:%M')$r
		$title$r
		$r
		$description$r
	EOF
done

There are a few subtleties there. First is the use of jq's -j option; this makes jq emit plain text rather than JSON-formatted strings, and also suppresses the newline that it would normally insert between output items. That lets me stick the \u0000 code on the end of what I'm asking jq to emit, so that what it ends up spitting out is a stream of NUL-separated records. I do this because the activity description fields sometimes contain embedded return/newline pairs, making the usual newline delimiter a poor choice for separating stream items. Instead, I use it here to separate fields within the stream items.
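To illustrate what -j suppresses, again with throwaway strings:

jq -n '"a", "b"'      # two JSON strings, quoted, one per line
jq -n -j '"a", "b"'   # raw output with no quotes and no separators: ab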
The resulting stream gets piped into a bash while read loop for breaking into variables and reformatting and writing to the output files. The IFS=$'\n' temporary environment variable setting makes the read builtin use only newlines as field separators, meaning that spaces and tabs that might turn up in the title fields get properly handled; the -d '' option (that's two single quotes, not a double quote) specifies a null string as the line delimiter for read, which read interprets as equivalent to a NUL character, which means that it correctly consumes a whole stream item per read instead of stopping at the first newline as it usually would.
description is the last field consumed by read from each stream item, because that lets it contain newlines that would otherwise be interpreted as field separators.
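Here's a self-contained illustration of that record and field layout; the values are invented, but the shape matches what make-notes consumes, and the last field keeps its embedded newline:

printf '12345\n2019-03-07T09:41:27-05:00\nA title with spaces\nFirst line of description.\nSecond line.\0' |
while IFS=$'\n' read -rd '' id date title description
do
	printf 'id=[%s]\ndate=[%s]\ntitle=[%s]\ndescription=[%s]\n' "$id" "$date" "$title" "$description"
done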
The - prefix inside the <<-EOF here-document redirection makes it ignore leading tabs, so I can keep my script nicely indented without making the files I'm building end up with weird huge left margins.
The "updated_at" fields from the JSON are all in ISO 8601 format and include timezone offsets. Since I figure NoneOfTheAbove would presumably have been living in the same timezone as the day care, and just want the file timestamps to be simple local times, I'm stripping the timezone offsets off before feeding the fields into date for reformatting, and using date's -u option to make sure my timezone doesn't affect the conversion.
Finally, there's that $r stuck on the ends of all the output lines. That's because NoneOfTheAbove wanted Windows-compatible text files, and it was easier to add return characters explicitly only to line breaks I was creating, since the ones from inside description all seemed to have them already.
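A one-liner makes the effect visible:

r=$'\r'
printf 'a line%s\n' "$r" | od -c    # the line now ends in \r \n, which Notepad is happy with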
So that's pretty much it. 210 photos and 210 notes files auto-retrieved and delivered. Thanks, NoneOfTheAbove - it was a fun little project!
posted by flabdablet at 9:48 AM on July 26, 2021 [1 favorite]
Response by poster: flabdablet - Thank you so much for doing this. You are fantastic.
posted by NoneOfTheAbove at 4:19 AM on July 27, 2021
Best answer: Another scripting subtlety I forgot to mention: if your usual style is to attach option values directly to the options they belong to (head -n3 style), as mine is, rather than passing them as separate arguments (like head -n 3), you might fall into the same trap I did the first time I tried to specify NUL as a record delimiter for read, and try something like
read -d$'\0'

This looks like it ought to work but it totally doesn't, and it took me a while to figure out why not. It boils down to historical baggage.
All arguments that POSIX shells pass to programs they're invoking (builtin or not) are passed in the form of C-style NUL-terminated strings. These have no explicit length indication; a string's length is simply the number of characters you can read from it before hitting its terminating NUL. So when you pass an argument string into which a NUL has been explicitly inserted, the code that consumes the arguments will mistake your embedded NUL for the string's terminator and cut the argument short.
So what read will actually see, if invoked as above, is a -d option without an attached value, and it will then consume the next command line argument and treat the first character of that as the value for the -d option.
read -d''

is equally useless, being completely equivalent to either of

read '-d'
read -d

because of the way shells do quote removal.
The only way to make the shell actually pass NUL to code that's expecting a single-character argument is to pass a separate, empty argument. Both of these work:
read -d ''
read -d ""

Again, the shell will remove the quotes but it will be removing them from a thing that will keep existing even though it's now been emptied, and will be passed as a separate empty argument consisting only of a terminating NUL.
When the receiver grabs that argument's "first character", the byte that it will actually grab is the terminating NUL. Arguably that's erroneous behaviour because zero-length strings don't have a first character, and in fact some programs, particularly those written in higher-level languages than C that convert argument strings to a more robust language-native internal form before processing them, would interpret the argument concerned as empty and be unable to extract the intended NUL from it.
Fortunately, the read builtin is not that fussy and therefore can be instructed to grab stuff from NUL-delimited rather than newline-delimited input records.
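For a standalone example of the working form, find's -print0 is a handy source of NUL-delimited records (the directory here is just the one from upthread):

find /tmp/himama -name '*.txt' -print0 |
while IFS= read -rd '' path
do
	printf 'notes file: %s\n' "$path"
done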
posted by flabdablet at 7:52 AM on July 27, 2021
This thread is closed to new comments.