Offline mirror with wget

From Noah.org
Revision as of 15:28, 3 March 2014


Download document and all parts needed to render it

This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL (dir1 and dir2).

wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
--mirror : Mirror is equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion).
--limit-rate : Limit download bandwidth to the given rate in '''bytes''' per second; the suffixes '''k''' and '''m''' mean kilobytes and megabytes, so '''100k''' is 100 kilobytes per second.
--wait : Wait the given number of seconds between requests.
-erobots=off : Ignore (violate?) the Robots Exclusion Protocol by skipping robots.txt. Note the single dash in '''-erobots=off'''; it is equivalent to '''-e robots=off''', which executes the wgetrc command '''robots=off'''.
--no-parent : Do not follow links that ascend to the parent directory. Only follow links that are under the given URL.
--page-requisites : Download all page requisites necessary to display the page (images, CSS, javascript, etc.).
--convert-links : Convert links in the pages so that they work locally relative to the OUTPUT_DIR.
--no-host-directories : Don't create host name directories.
--cut-dirs=n : Remove the first n directories from the URL path when saving files locally. This should equal the number of directories above the target document that you want stripped from saved paths (here '''dir1''' and '''dir2''', so n=2).
--directory-prefix=<OUTPUT_DIR> : Set path to the destination directory where files will be saved.
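To make the --cut-dirs arithmetic concrete, here is a small bash sketch (nothing wget-specific; the variable names are made up) that counts the parent directories in the example URL:

```shell
#!/bin/bash
# Count the directories between the host and the document in a URL.
# That count is the value --cut-dirs should be set to. Illustrative only.
url='http://www.example.org/dir1/dir2/index.html'
path="${url#*://*/}"    # strip scheme and host -> dir1/dir2/index.html
dirs="${path%/*}"       # drop the document     -> dir1/dir2
cutdirs=$(awk -F/ '{print NF}' <<< "$dirs")
echo "--cut-dirs=$cutdirs"    # --cut-dirs=2
```

The same counting works for any URL of this shape: two path components above the document means --cut-dirs=2.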

Download all files by subdirectory (spider a directory)

This downloads everything under '''dir3'''.

wget --recursive --cut-dirs=2 --no-parent --relative --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/
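Note that this command omits --no-host-directories, so wget keeps a hostname directory and --cut-dirs=2 strips only '''dir1''' and '''dir2'''. A rough bash sketch of where a fetched file would land (the URL and file name are illustrative, following the wget manual's description of --cut-dirs):

```shell
#!/bin/bash
# Simulate the local save path for one file under --cut-dirs=2
# without --no-host-directories. Illustrative only; wget itself
# does this mapping internally.
url='http://www.example.org/dir1/dir2/dir3/file.txt'
host_and_path="${url#*://}"        # www.example.org/dir1/dir2/dir3/file.txt
host="${host_and_path%%/*}"        # www.example.org
path="${host_and_path#*/}"         # dir1/dir2/dir3/file.txt
cut="${path#*/}"; cut="${cut#*/}"  # strip 2 leading components -> dir3/file.txt
echo "$host/$cut"                  # www.example.org/dir3/file.txt
```

Adding --no-host-directories to the command would drop the '''www.example.org''' prefix as well.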