Experiments

Mass sequential download with wget in command line

January 24, 2010

I wanted to download the whole Andy & Casey web comic, in case the Internet ever gets shut down, because that comic is AMAZING.

Here is how you can do it using just wget.

for i in $(seq 1 666); do wget http://www.galactanet.com/comic/Strip$i.gif; done

for i in $(seq 1 666); do [ -e Strip$i.gif ] || [ -e Strip$i.jpg ] || [ -e Strip$i.png ] || echo $i; done

The first line downloads all 666 pages of the comic. The second line checks the downloaded files and prints the numbers of the missing ones. Download failures happen.
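If the second line does print some numbers, something like this should re-fetch just the missing GIFs (a quick variant of the same loop, assuming the unpadded Strip$i.gif names still hold):

# re-download only the strips that are still missing
for i in $(seq 1 666); do [ -e Strip$i.gif ] || wget http://www.galactanet.com/comic/Strip$i.gif; done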

This works because all pages of the comic have predictable names: the prefix Strip, then the strip number without any padding, then the .gif filename extension.
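Since the pattern is that regular, you don’t even strictly need the loop: in bash, brace expansion can hand wget the whole list of URLs in one go (this variant assumes bash; other shells may not expand {1..666}):

# same idea without an explicit loop: bash expands Strip1.gif through Strip666.gif
wget http://www.galactanet.com/comic/Strip{1..666}.gif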

Do not do this if you don’t want your content to be easily enumerable. It becomes an especially nasty security hole if you allow viewing financial documents via direct URLs by their IDs and forget to implement proper access control. What I mean is: if I can see my receipt at http://example.com/receipts/12345 and you don’t do proper access control, then I can potentially look at somebody else’s receipt at http://example.com/receipts/12346, or enumerate all of them with a braindead script much like the one above.
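To make that concrete, the hypothetical attack is literally the comic loop pointed at a different URL pattern (made-up domain and IDs, obviously):

# hypothetical: enumerate sequential receipt IDs on a site with no access control
for i in $(seq 12300 12350); do wget http://example.com/receipts/$i; done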

In the case of Andy & Casey I’m grateful to the author that I can download all of the strips with such ease. The comic was a pleasure to read, and if the domain ever goes down I would still like to be able to access it in some way.
