Offline mirror with wget

[[Category:Engineering]]
 
If you wish to mirror directories, make sure that Apache2 has full directory indexing enabled. Edit the file '''/etc/apache2/mods-enabled/autoindex.conf''' and set the following directive:
<pre>
IndexIgnore *~ *# RCS CVS *,v *,t
</pre>
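After editing autoindex.conf the change only takes effect once Apache is reloaded. A minimal sketch, assuming a Debian-style layout with the stock apache2 helper tools (enable the module first if it is not already enabled):

<pre>
a2enmod autoindex
service apache2 reload
</pre>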
== Download document and all parts needed to render it ==
 
This downloads the given document and all parts needed to view it offline. The number set by --cut-dirs must match the number of parent directories in the URL (dir1 and dir2).
<pre>
wget --mirror --limit-rate=100k --wait=1 -erobots=off --no-parent --page-requisites --convert-links --no-host-directories --cut-dirs=2 --directory-prefix=OUTPUT_DIR http://www.example.org/dir1/dir2/index.html
</pre>
  
 
<pre>
--mirror : Equivalent to "-r -N -l inf --no-remove-listing" (basically, infinite recursion with timestamping).
--limit-rate : Limit download bandwidth to the given speed in bytes per second; suffixes such as k and m are accepted, so 100k means roughly 100 KB/s (not kilobits).
--wait : Wait the given number of seconds between each request.
-erobots=off : Ignore (violate?) the Robots Exclusion Protocol by ignoring robots.txt. Note the single dash in -erobots.
--no-parent : Do not follow links that ascend to the parent directory. Only follow links that are under the given URL.
--page-requisites : Download everything needed to display the page (images, CSS, JavaScript, etc.).
--convert-links : Convert links in the pages so that they work locally, relative to OUTPUT_DIR.
--no-host-directories : Don't create a directory named after the host (www.example.org).
--cut-dirs=n : Remove n leading directories from the URL path when saving files. This should equal the number of directories above the document that you want stripped from the local paths (dir1 and dir2 here, so n=2).
--directory-prefix=<OUTPUT_DIR> : Set the destination directory where downloaded files are saved.
</pre>
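As a rough sketch of the resulting layout (the stylesheet and image names below are invented, and the page requisites are assumed to live under dir1/dir2), --no-host-directories and --cut-dirs=2 strip both the host name and the dir1/dir2 prefix, so everything lands directly under OUTPUT_DIR:

<pre>
OUTPUT_DIR/index.html
OUTPUT_DIR/style.css
OUTPUT_DIR/images/logo.png
</pre>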
  
== Download all files by subdirectory (spider a directory) ==

This downloads the directory '''dir3''' and everything under it. The --reject="index.html*" option discards the Apache-generated directory listing pages (and their sort-order variants) so that only the actual files are kept.

<pre>
wget --mirror --cut-dirs=2 --no-parent --no-host-directories --reject="index.html*" -e robots=off http://www.example.org/dir1/dir2/dir3/
</pre>
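If you only want certain file types out of the tree, wget's --accept option filters by suffix or wildcard pattern; a minimal sketch (the pdf and zip suffixes are only example choices):

<pre>
wget --mirror --no-parent --no-host-directories --cut-dirs=2 -e robots=off --accept="*.pdf,*.zip" http://www.example.org/dir1/dir2/dir3/
</pre>

wget still has to fetch the directory listing pages in order to find the links, but it should delete them afterwards since they do not match the accept list.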
