Offline mirror with wget

[[Category:Engineering]]
If you wish to mirror directories, make sure that Apache2 has full directory indexing enabled. Edit the file '''/etc/apache2/mods-enabled/autoindex.conf''' and set the following directive:
<pre>
IndexIgnore *~ *# RCS CVS *,v *,t
</pre>
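A minimal sketch of applying the change, assuming a Debian-style Apache2 layout (www.example.org and the directory path are placeholders):

<pre>
# Reload Apache2 after editing autoindex.conf so the directive takes effect.
sudo service apache2 reload

# A directory URL should now return a full index listing of its files.
wget -q -O - http://www.example.org/dir1/dir2/dir3/ | grep href
</pre>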
== Download document and all parts needed to render it ==
This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL (dir1 and dir2).
<pre>
wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
</pre>
; --mirror : Equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion).
; --limit-rate : Limit download bandwidth to the given speed in '''bytes''' per second (suffixes such as ''k'' are allowed, so 100k means 100 KB/s).
; --wait : Wait the given number of seconds between each request.
; -erobots=off : Ignore (violate?) the Robots Exclusion Protocol by ignoring robots.txt. Note the single dash in '''-erobots'''.
; --no-parent : Do not follow links that ascend to the parent directory. Only follow links that are under the given URL.
; --page-requisites : Download all page requisites necessary to display the page (images, CSS, JavaScript, etc.).
; --convert-links : Convert links in the pages so that they work locally relative to OUTPUT_DIR.
; --no-host-directories : Don't create host-name directories.
; --cut-dirs=n : Remove n directories from the path of the URL. This should equal the number of directories above the index that you wish to remove from the saved paths (see the path example after this list).
; --directory-prefix=<OUTPUT_DIR> : Set the destination directory where files will be saved.
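As an illustration of the path handling (a sketch; OUTPUT_DIR and the URL are the placeholders from the command above), '''--no-host-directories''' drops the host-name directory and '''--cut-dirs=2''' strips the two leading path components, so the page lands directly in OUTPUT_DIR:

<pre>
# URL being mirrored:
#   http://www.example.org/dir1/dir2/index.html
#
# Default layout (no --no-host-directories, no --cut-dirs):
#   OUTPUT_DIR/www.example.org/dir1/dir2/index.html
#
# With --no-host-directories --cut-dirs=2:
#   OUTPUT_DIR/index.html
</pre>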
== Download all files by subdirectory (spider a directory) ==
This downloads the directory '''dir3''' and everything under it.

<pre>
wget --mirror --cut-dirs=2 --no-parent --no-host-directories --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/
</pre>
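The '''--reject="index.html*"''' pattern matters here: wget still fetches the auto-generated Apache index pages in order to discover links, but deletes them after scanning, so only the real files remain. A sketch of the resulting layout, using the directory names from the example URL (file names are placeholders):

<pre>
# With --cut-dirs=2 and --no-host-directories, dir1/ and dir2/ are stripped,
# so the mirror lands under the current directory as:
#   dir3/some-file
#   dir3/subdir/another-file
</pre>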