Offline mirror with wget
From Noah.org
Revision as of 15:27, 3 March 2014
Download document and all parts needed to render it
This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL (dir1 and dir2).
wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
--mirror : Equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion).
--limit-rate : Limit download bandwidth to the given rate in bytes per second (suffixes such as 100k are accepted).
--wait : Wait the given number of seconds between requests.
-erobots=off : Ignore (violate?) the Robots Exclusion Protocol by ignoring robots.txt. Note the single dash in -erobots.
--no-parent : Do not follow links that ascend to the parent directory; only follow links under the given URL.
--page-requisites : Download all page requisites necessary to display the page (images, CSS, JavaScript, etc.).
--convert-links : Convert links in the pages so that they work locally, relative to OUTPUT_DIR.
--no-host-directories : Don't create a directory named after the host.
--cut-dirs=n : Remove n leading directories from the URL path. This should equal the number of directories above the document that you wish to remove from the saved paths.
--directory-prefix=OUTPUT_DIR : Set the destination directory where files will be saved.
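To see how --cut-dirs affects where files land, here is a local illustration (no download happens) of the path math for the article's example URL. It assumes --no-host-directories is also given, so the host name is not prepended:

```shell
# Remote path from http://www.example.org/dir1/dir2/index.html
url_path="dir1/dir2/index.html"

# --cut-dirs=2 strips the first two directory components (dir1, dir2),
# so wget saves the file directly under OUTPUT_DIR.
local_path=$(echo "$url_path" | cut -d/ -f3-)
echo "$local_path"   # → index.html
```

Without --cut-dirs=2 the same download would be saved as OUTPUT_DIR/dir1/dir2/index.html, and without --no-host-directories it would gain a www.example.org/ prefix as well.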
Download all files by subdirectory (spider a directory)
This downloads everything under dir3.
wget --recursive --cut-dirs=2 --no-parent --relative --level=1 --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/
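The --reject pattern is a shell-style glob matched against file names, used here to skip the auto-generated directory listings. A quick local check (no network needed) of what "index.html*" does and does not match:

```shell
# Which file names the glob index.html* would reject.
for f in index.html index.html.tmp page.html data.csv; do
  case "$f" in
    index.html*) echo "reject $f" ;;
    *)           echo "keep   $f" ;;
  esac
done
# → reject index.html
# → reject index.html.tmp
# → keep   page.html
# → keep   data.csv
```

The trailing * matters: web servers often serve listings as index.html?C=N;O=D and similar, and those query-string variants are rejected too.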