Offline mirror with wget


If you wish to mirror directories, make sure that Apache2 has full directory indexing enabled, since wget discovers the files in a directory through the generated index page. Edit the file /etc/apache2/mods-enabled/autoindex.conf and set the following directive:

IndexIgnore *~ *# RCS CVS *,v *,t

Download document and all parts needed to render it

This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL (dir1 and dir2).

wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
--mirror : Mirror is equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion).
--limit-rate : Limit download bandwidth to the given speed in bytes per second. A k suffix means kilobytes, so 100k is 100 kilobytes per second.
--wait : Wait the given number of seconds between each request.
-erobots=off : Ignore (violate?) the Robots Exclusion Protocol by ignoring robots.txt. Note the single dash in -erobots: this is the -e option, which runs robots=off as if it were a line in .wgetrc.
--no-parent : Do not follow links that ascend to the parent directory. Only follow links that are under the given URL.
--page-requisites : Download all page requisites necessary to display the page (images, CSS, javascript, etc.).
--convert-links : Convert links in the pages so that they work locally relative to the OUTPUT_DIR.
--no-host-directories : Don't create host name directories.
--cut-dirs=n : Remove n directories from the path of the URL. This should equal the number of directories ABOVE the index that you wish to remove from URLs.
--directory-prefix=<OUTPUT_DIR> : Set path to the destination directory where files will be saved.
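
Counting parent directories by hand is easy to get wrong on long URLs. The following is a minimal sketch of a helper that derives the --cut-dirs value from a URL. The function name count_parent_dirs is hypothetical and not part of wget, and the sketch assumes a plain URL with a scheme and no query string.

```shell
# Hypothetical helper (not part of wget): print the number of directories
# above the last path component of a URL.
count_parent_dirs() {
    url=$1
    # Strip the scheme, leaving "host/dir1/dir2/index.html".
    path=${url#*://}
    # Strip the host name; a URL with no path leaves nothing.
    case $path in
        */*) path=${path#*/} ;;
        *)   path= ;;
    esac
    # Drop the last component (the file name, or nothing after a trailing slash).
    case $path in
        */*) dirs=${path%/*} ;;
        *)   dirs= ;;
    esac
    if [ -z "$dirs" ]; then
        echo 0
        return
    fi
    # N directories are separated by N-1 slashes.
    slashes=$(printf '%s' "$dirs" | tr -cd '/' | wc -c)
    echo $((slashes + 1))
}
```

For the URL in the command above, count_parent_dirs http://www.example.org/dir1/dir2/index.html prints 2, which could be passed as --cut-dirs="$(count_parent_dirs "$url")". For a URL that ends in a directory, subtract one from the result if you want that directory kept locally.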

Download all files by subdirectory (spider a directory)

This downloads the directory dir3 and everything under it.

wget --mirror --cut-dirs=2 --no-parent --no-host-directories --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/