Scraping websites with wget and httrack

Scrapes can be useful for taking static backups of websites or for cataloguing a site before a rebuild. If you do online courses, it can also be useful to have as much of the course material as possible stored locally. Another use is downloading HTML-only ebooks for offline reading.

There are two ways that I generally do this - one on the command line with wget and another through the GUI with httrack. By far the easiest if you want an entire site is the wget method so I’ll introduce that first.

I like to use the following command so that a browseable local copy is created. Two of the options that help ensure this are --convert-links and --restrict-file-names=windows. The former rewrites the links in the downloaded pages so that the site can be browsed locally. The latter escapes characters that are not safe in file names, such as ? and :, which is particularly relevant when the URLs you’re trying to scrape contain parameters.

wget -H -r --level=5 --restrict-file-names=windows --convert-links -e robots=off http://example.org

The rest of the options can easily be looked up on the wget manual page so I’ll leave that as an exercise for the reader to save some time.
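As a starting point for that exercise, here is a sketch of the same command wrapped in a small shell function, with each option on its own line and the short flags spelled out in long form. The function name mirror_site is my own invention for illustration; the options are the ones from the command above.

```shell
# Hypothetical helper: mirror the site given as the first argument.
# --span-hosts                   (-H) follow links onto other hosts as well
# --recursive                    (-r) follow the links found in each page
# --level=5                      follow links at most five levels deep
# --restrict-file-names=windows  escape characters such as ? and : in file names
# --convert-links                rewrite links in the saved pages for local viewing
# -e robots=off                  do not honour robots.txt
mirror_site() {
  wget \
    --span-hosts \
    --recursive \
    --level=5 \
    --restrict-file-names=windows \
    --convert-links \
    -e robots=off \
    "$1"
}

# Usage: mirror_site http://example.org
```

Because --span-hosts is combined with recursion, the download can wander onto other sites, so it may be worth lowering --level for a first run.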

For more complicated scrapes, and those that require authentication in particular, httrack is very handy. On Linux I use WebHTTrack, which presents its interface in a web browser (Chrome in my case). From what I can tell the browser is only the front end, and the pages themselves are requested by httrack, which can also be run from the command line. There is a Windows version, WinHTTrack (I have never used it), which from the screenshots on their site seems to have a typical Windows application interface, so I can’t be sure what it is doing underneath.

The options for httrack are reasonably well documented on their site, but when you start it up it will walk you through a relatively straightforward wizard anyway. Once a download has started you’re able to pause and restart it, which is a nice feature, and it automatically fixes the URLs so the site is browseable locally.
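The same kind of mirror can also be started without the wizard, since the httrack package includes a command-line program of the same name. The following is a minimal sketch rather than a recommended setup: the function name mirror_with_httrack, the output directory and the filter pattern are placeholders of mine. -O sets the output directory, the "+" pattern is a filter limiting which URLs are followed, and -v turns on verbose output.

```shell
# Hypothetical helper: mirror one site into a local directory with the httrack CLI.
#   $1 = start URL
#   $2 = output directory for the mirror
#   $3 = filter limiting which URLs are followed, e.g. "+*.example.org/*"
mirror_with_httrack() {
  httrack "$1" -O "$2" "$3" -v
}

# Usage (example.org and ./example-mirror are placeholders):
# mirror_with_httrack "http://example.org/" "./example-mirror" "+*.example.org/*"
```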

As I read on the train and complete coursework, I find both methods very handy. They can also be used to download the videos!