Offline mirror with wget

From Noah.org


Download document and all parts needed to render it

This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL path (here 2: dir1 and dir2).

wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
--mirror : Mirror is equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion).
--limit-rate : Limit download bandwidth to the given speed in bytes per second. A k suffix means kilobytes, so 100k is 100 kilobytes per second.
--wait : Wait the given number of seconds between each request.
-erobots=off : Ignore (violate?) the Robots Exclusion Protocol by ignoring robots.txt. Note the single dash in -erobots: this is the -e (--execute) option followed by the wgetrc command robots=off.
--no-parent : Do not follow links that ascend to the parent directory. Only follow links that are under the given URL.
--page-requisites : Download all page requisites necessary to display the page (images, CSS, javascript, etc.).
--convert-links : Convert links in the pages so that they work locally relative to the OUTPUT_DIR.
--no-host-directories : Don't create host name directories.
--cut-dirs=n : Remove n directories from the path of the URL. This should equal the number of directories ABOVE the index that you wish to remove from URLs.
--directory-prefix=<OUTPUT_DIR> : Set path to the destination directory where files will be saved.
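
The --cut-dirs rule above can be checked with a little shell before running wget. The following is only a sketch: count_cut_dirs is a hypothetical helper name (it is not part of wget), and it assumes a plain URL of the form scheme://host/path with no user name or password in it. It prints the number of directories between the host name and the last path component.

```shell
# Hypothetical helper (not part of wget): count the directories that sit
# between the host name and the final component of a URL path.
count_cut_dirs() {
    path=${1#*://}            # drop the scheme, e.g. "http://"
    path=${path%%[?#]*}       # drop any query string or fragment
    case $path in
        */*) path=${path#*/} ;;   # drop the host name
        *)   path= ;;             # URL with no path at all
    esac
    case $path in
        */*) dirs=${path%/*} ;;   # keep only the directory part
        *)   dirs= ;;             # file directly under the host
    esac
    n=0
    old_ifs=$IFS
    IFS=/
    for part in $dirs; do
        # Skip empty components produced by doubled slashes.
        if [ -n "$part" ]; then
            n=$((n + 1))
        fi
    done
    IFS=$old_ifs
    echo "$n"
}

count_cut_dirs "http://www.example.org/dir1/dir2/index.html"
```

For the example URL this prints 2, which is the value passed to --cut-dirs in the command above. For a URL that ends in a slash, the last directory is counted too, so subtract one if you want to keep it in the local tree.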

Download all files by subdirectory (spider a directory)

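This recursively downloads everything under dir3. The --reject pattern discards the index.html directory listings, and --no-parent keeps wget from climbing above dir3.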

wget --recursive --cut-dirs=2 --no-parent --relative --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/
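
Because this command does not use --no-host-directories, the files should end up under a directory named after the host. The sketch below tries to predict the local directory for a given URL. predict_local_dir is a hypothetical helper, not a wget feature, and it only models the --cut-dirs and --no-host-directories behaviour described on this page; it ignores --directory-prefix, ports, and other options.

```shell
# Hypothetical helper (not part of wget): predict the directory a URL is saved in.
#   $1 = URL
#   $2 = value given to --cut-dirs
#   $3 = "nH" if --no-host-directories is used, otherwise omit
predict_local_dir() {
    rest=${1#*://}                # drop the scheme
    host=${rest%%/*}              # host name is everything up to the first slash
    case $rest in
        */*) path=${rest#*/} ;;   # path after the host
        *)   path= ;;
    esac
    case $path in
        */*) dirs=${path%/*} ;;   # directory part of the path
        *)   dirs= ;;
    esac
    cut=$2
    # Strip one leading directory per --cut-dirs count.
    while [ "$cut" -gt 0 ] && [ -n "$dirs" ]; do
        case $dirs in
            */*) dirs=${dirs#*/} ;;
            *)   dirs= ;;
        esac
        cut=$((cut - 1))
    done
    if [ "${3:-}" = "nH" ]; then
        out=$dirs
    else
        out=$host${dirs:+/$dirs}
    fi
    echo "${out:-.}"              # "." means the current (or prefix) directory
}

predict_local_dir "http://www.example.org/dir1/dir2/dir3/" 2
predict_local_dir "http://www.example.org/dir1/dir2/index.html" 2 nH
```

The first call prints www.example.org/dir3, matching the command in this section. The second prints a single dot, matching the first command on this page, where index.html lands directly in OUTPUT_DIR.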