wget - "World Wide Web" 与 "get",是一个从网络上自动下载文件的自由工具,支持通过HTTP、HTTPS、FTP 三个常见的TCP/IP协议下载,并可以使用HTTP代理.
wget 可以跟踪 HTML 页面上的链接依次下载来创建远程服务器的本地版本,完全重建原始站点的目录结构. 这常称作 "递归下载".
wget 可以在下载时,同时将链接转换成指向本地文件,以方便离线浏览.
wget 非常稳定,即使在带宽很窄和网络不稳定时,适应性也很好.
如果是由于网络原因下载失败,wget 会不断的尝试,直到整个文件下载完毕.
如果是服务器打断下载过程,wget 会再次联到服务器上从停止的地方继续下载(断点下载). 对于限定了链接时间的服务器上下载大文件非常有用.
wget 功能强大,且使用比较简单.
1. Common wget usage
Usage:
wget [options] [URL]
Common options (note that options are case-sensitive); a combined usage sketch follows the list:
-a <logfile>: append log messages to the specified log file;
--append-output=FILE: append log messages to FILE (long form of -a);
-A <extensions>: download only files with the given extensions; separate multiple extensions with commas;
-b: run wget in the background;
-B <URL>: set the base URL; links read from an input file (with -i -F) are resolved relative to it;
-c: continue a previously interrupted download (resume);
-C <on/off>: turn server-side data caching on (on) or off (off); the default is on;
-d: run in debug mode;
-D <domain-list>: set the list of domains to follow, separated by commas;
-e <command>: execute the given command as if it were part of the .wgetrc file;
-h: show help;
-i <file>: read the URLs to download from the specified file;
-k: convert links in the downloaded pages so that they point to local files;
-l <depth>: set the maximum recursion depth (inf or 0 for unlimited);
-L: follow relative links only;
-m: --mirror, equivalent to -r -N -l inf --no-remove-listing;
-nc: do not overwrite an existing file when downloading;
-nd: do not recreate the directory hierarchy when downloading recursively; save all files into the current directory;
-nH: do not create host-name directories;
-np: do not ascend to the parent directory;
-nv: show only updates and errors during the download, not the full verbose output;
-N: do not re-download a file unless the remote copy is newer than the local one;
-o <logfile>: write log messages to the specified file (creating a new file);
-p: download all the resources needed to display an HTML page properly;
-P <dir>: save files under the specified download directory;
-q: quiet mode, show no output;
-r: download recursively;
-v: verbose output (the default);
-V: show version information;
--passive-ftp: use passive (PASV) mode when connecting to FTP servers;
--follow-ftp: follow FTP links found in HTML documents.
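To show how several of these options combine in practice, here is a minimal sketch; the ./downloads directory and the fetch.log file name are only illustrative, and the URL is the sample file used in the examples below:
# background download, resumable, saved under ./downloads, log written to fetch.log
wget -b -c -P ./downloads -o fetch.log http://www.linuxde.net/testfile.zip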
2. Examples
2.1 Download a single file
wget http://www.linuxde.net/testfile.zip
Downloads a file from the network and saves it in the current directory. A progress indicator is shown during the download, including the completion percentage, bytes downloaded so far, current download speed, and estimated time remaining.
2.2 Download and save under a different file name
wget -O wordpress.zip http://www.linuxde.net/download.aspx?id=1080
By default, wget names the saved file after whatever follows the last '/' in the URL, so downloads from dynamic links usually end up with the wrong file name.
Wrong: the following command downloads a file but saves it under the name download.aspx?id=1080:
wget http://www.linuxde.net/download.aspx?id=1080
Even though the downloaded file is a zip archive, it is still named download.aspx?id=1080.
Correct: to solve this, use the -O option to specify a file name:
wget -O wordpress.zip http://www.linuxde.net/download.aspx?id=1080
2.3 Limit the download speed
wget --limit-rate=300k http://www.linuxde.net/testfile.zip
By default, wget uses all the bandwidth it can get.
When you are about to download a large file but still need to download other things, limiting the rate becomes necessary.
2.4 Resume an interrupted download
wget -c http://www.linuxde.net/testfile.zip
Use wget -c to restart an interrupted download, which is especially helpful for large files.
When the download is suddenly cut off, for example by a network failure, you can continue where it stopped instead of downloading the whole file again.
2.5 Download in the background
wget -b http://www.linuxde.net/testfile.zip
Continuing in background, pid 1840.
Output will be written to `wget-log'.
When downloading a very large file, use the -b option to run the download in the background. You can then watch the progress with:
tail -f wget-log
2.6 Download with a spoofed user agent
wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16" \
http://www.linuxde.net/testfile.zip
Some sites reject download requests when they detect that the user agent is not a browser.
You can work around this by spoofing the agent string with the --user-agent option.
2.7 Test a download link
If you plan to schedule a download, you should first test whether the download link is valid.
Add the --spider option to perform the check:
wget --spider URL
If the link is valid, the output will look like this:
Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.
This ensures the download can run at the scheduled time. If the link is broken, you will instead see an error like this:
wget --spider url
Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 404 Not Found
Remote file does not exist -- broken link!!!
The --spider option is useful in situations such as (a sketch of the first case follows this list):
- checking a link before a scheduled download
- periodically checking whether a site is available
- finding dead links on a website's pages
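A minimal pre-check sketch for the scheduled-download case might look like the following; the URL is the sample file from the earlier examples and the script itself is only an illustration:
# probe the link quietly; download only if the remote file exists
URL=http://www.linuxde.net/testfile.zip
if wget -q --spider "$URL"; then
    wget -c "$URL"
else
    echo "link is broken, skipping download"
fi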
2.8 Increase the number of retries
wget --tries=40 URL
A download can fail when the network is unreliable or the file is very large.
By default, wget retries the connection 20 times to download a file.
If needed, use --tries to increase the number of retries.
2.9 Download multiple files
wget -i filelist.txt
First, save the download URLs to a file:
cat > filelist.txt
url1
url2
url3
url4
Then download with this file and the -i option (end the cat input with Ctrl-D). A non-interactive way to build the list is sketched below.
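If the list is created inside a script, a here-document avoids the interactive input; url1 and url2 are placeholders just as in the example above:
# write the URL list without interactive input, then fetch every entry
cat > filelist.txt <<'EOF'
url1
url2
EOF
wget -i filelist.txt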
2.10 Mirror a website
wget --mirror -p --convert-links -P ./LOCAL URL
Downloads an entire website to the local machine.
--mirror turns on mirroring.
-p downloads every file needed for the HTML pages to display correctly.
--convert-links rewrites the links into local ones after downloading.
-P ./LOCAL saves all files and directories under the specified local directory.
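Since --mirror is a shortcut for -r -N -l inf --no-remove-listing (see the help output in section 3), the command above is roughly equivalent to:
wget -r -N -l inf --no-remove-listing -p --convert-links -P ./LOCAL URL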
2.11 Exclude a file type from the download
wget --reject=gif url
Use this command to download a website while skipping images (GIF files in this case).
2.12 Save download messages to a log file
wget -o download.log URL
Use this when you want the download messages written to a log file instead of shown in the terminal.
2.13 Limit the total download size
wget -Q5m -i filelist.txt
Stops downloading once the retrieved files exceed 5 MB in total.
Note: this quota has no effect when downloading a single file; it only applies to recursive downloads or downloads from a URL list.
2.14 Download only files of a given type
wget -r -A.pdf url
This is useful, for example, to (a sketch follows this list):
- download all images from a website;
- download all videos from a website;
- download all PDF files from a website.
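For the image case, the accepted-extension list can name several formats at once; the extension list and URL here are only illustrative:
# recursively fetch common image formats only, without ascending to the parent directory
wget -r -np -A jpg,jpeg,png,gif http://www.linuxde.net/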
2.15 FTP download
wget ftp-url
wget --ftp-user=USERNAME --ftp-password=PASSWORD url
wget can also download from FTP links.
Anonymous FTP download:
wget ftp-url
FTP download with username and password authentication:
wget --ftp-user=USERNAME --ftp-password=PASSWORD url
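Credentials can also be embedded directly in the FTP URL, though they will then be visible in the shell history and process list; the host and path below are placeholders:
wget ftp://USERNAME:PASSWORD@ftp.example.com/pub/file.tar.gz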
3. Full option list (wget -h)
wget -h # print the help text listing all available options
Startup:
-V, --version display the version of Wget and exit
-h, --help print this help
-b, --background go to background after startup
-e, --execute=COMMAND execute a `.wgetrc'-style command
Logging and input file:
-o, --output-file=FILE log messages to FILE
-a, --append-output=FILE append messages to FILE
-d, --debug print lots of debugging information
-q, --quiet quiet (no output)
-v, --verbose be verbose (this is the default)
-nv, --no-verbose turn off verboseness, without being quiet
--report-speed=TYPE output bandwidth as TYPE. TYPE can be bits
-i, --input-file=FILE download URLs found in local or external FILE
-F, --force-html treat input file as HTML
-B, --base=URL resolves HTML input-file links (-i -F)
relative to URL
--config=FILE specify config file to use
--no-config do not read any config file
--rejected-log=FILE log reasons for URL rejection to FILE
Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits)
--retry-connrefused retry even if connection is refused
-O, --output-document=FILE write documents to FILE
-nc, --no-clobber skip downloads that would download to
existing files (overwriting them)
-c, --continue resume getting a partially-downloaded file
--start-pos=OFFSET start downloading from zero-based position OFFSET
--progress=TYPE select progress gauge type
--show-progress display the progress bar in any verbosity mode
-N, --timestamping don't re-retrieve files unless newer than
local
--no-if-modified-since don't use conditional if-modified-since get
requests in timestamping mode
--no-use-server-timestamps don't set the local file's timestamp by
the one on the server
-S, --server-response print server response
--spider don't download anything
-T, --timeout=SECONDS set all timeout values to SECONDS
--dns-timeout=SECS set the DNS lookup timeout to SECS
--connect-timeout=SECS set the connect timeout to SECS
--read-timeout=SECS set the read timeout to SECS
-w, --wait=SECONDS wait SECONDS between retrievals
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval
--random-wait wait from 0.5*WAIT...1.5*WAIT secs between retrievals
--no-proxy explicitly turn off proxy
-Q, --quota=NUMBER set retrieval quota to NUMBER
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host
--limit-rate=RATE limit download rate to RATE
--no-dns-cache disable caching DNS lookups
--restrict-file-names=OS restrict chars in file names to ones OS allows
--ignore-case ignore case when matching files/directories
-4, --inet4-only connect only to IPv4 addresses
-6, --inet6-only connect only to IPv6 addresses
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none
--user=USER set both ftp and http user to USER
--password=PASS set both ftp and http password to PASS
--ask-password prompt for passwords
--no-iri turn off IRI support
--local-encoding=ENC use ENC as the local encoding for IRIs
--remote-encoding=ENC use ENC as the default remote encoding
--unlink remove file before clobber
Directories:
-nd, --no-directories don't create directories
-x, --force-directories force creation of directories
-nH, --no-host-directories don't create host directories
--protocol-directories use protocol name in directories
-P, --directory-prefix=PREFIX save files to PREFIX/..
--cut-dirs=NUMBER ignore NUMBER remote directory components
HTTP options:
--http-user=USER set http user to USER
--http-password=PASS set http password to PASS
--no-cache disallow server-cached data
--default-page=NAME change the default page name (normally
this is 'index.html'.)
-E, --adjust-extension save HTML/CSS documents with proper extensions
--ignore-length ignore 'Content-Length' header field
--header=STRING insert STRING among the headers
--max-redirect maximum redirections allowed per page
--proxy-user=USER set USER as proxy username
--proxy-password=PASS set PASS as proxy password
--referer=URL include 'Referer: URL' header in HTTP request
--save-headers save the HTTP headers to file
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION
--no-http-keep-alive disable HTTP keep-alive (persistent connections)
--no-cookies don't use cookies
--load-cookies=FILE load cookies from FILE before session
--save-cookies=FILE save cookies to FILE after session
--keep-session-cookies load and save session (non-permanent) cookies
--post-data=STRING use the POST method; send STRING as the data
--post-file=FILE use the POST method; send contents of FILE
--method=HTTPMethod use method "HTTPMethod" in the request
--body-data=STRING send STRING as data. --method MUST be set
--body-file=FILE send contents of FILE. --method MUST be set
--content-disposition honor the Content-Disposition header when
choosing local file names (EXPERIMENTAL)
--content-on-error output the received content on server errors
--auth-no-challenge send Basic HTTP authentication information
without first waiting for the server's
challenge
HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, TLSv1 and PFS
--https-only only follow secure HTTPS links
--no-check-certificate don't validate the server's certificate
--certificate=FILE client certificate file
--certificate-type=TYPE client certificate type, PEM or DER
--private-key=FILE private key file
--private-key-type=TYPE private key type, PEM or DER
--ca-certificate=FILE file with the bundle of CAs
--ca-directory=DIR directory where hash list of CAs is stored
--crl-file=FILE file with bundle of CRLs
--random-file=FILE file with random data for seeding the SSL PRNG
--egd-file=FILE file naming the EGD socket with random data
HSTS options:
--no-hsts disable HSTS
--hsts-file path of HSTS database (will override default)
FTP options:
--ftp-user=USER set ftp user to USER
--ftp-password=PASS set ftp password to PASS
--no-remove-listing don't remove '.listing' files
--no-glob turn off FTP file name globbing
--no-passive-ftp disable the "passive" transfer mode
--preserve-permissions preserve remote file permissions
--retr-symlinks when recursing, get linked-to files (not dir)
FTPS options:
--ftps-implicit use implicit FTPS (default port is 990)
--ftps-resume-ssl resume the SSL/TLS session started in the control connection when
opening a data connection
--ftps-clear-data-connection cipher the control channel only; all the data will be in plaintext
--ftps-fallback-to-ftp fall back to FTP if FTPS is not supported in the target server
WARC options:
--warc-file=FILENAME save request/response data to a .warc.gz file
--warc-header=STRING insert STRING into the warcinfo record
--warc-max-size=NUMBER set maximum size of WARC files to NUMBER
--warc-cdx write CDX index files
--warc-dedup=FILENAME do not store records listed in this CDX file
--no-warc-compression do not compress WARC files with GZIP
--no-warc-digests do not calculate SHA1 digests
--no-warc-keep-log do not store the log file in a WARC record
--warc-tempdir=DIRECTORY location for temporary files created by the
WARC writer
Recursive download:
-r, --recursive specify recursive download
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite)
--delete-after delete files locally after downloading them
-k, --convert-links make links in downloaded HTML or CSS point to
local files
--convert-file-only convert the file part of the URLs only (usually known as the basename)
--backups=N before writing file X, rotate up to N backup files
-K, --backup-converted before converting file X, back up as X.orig
-m, --mirror shortcut for -N -r -l inf --no-remove-listing
-p, --page-requisites get all images, etc. needed to display HTML page
--strict-comments turn on strict (SGML) handling of HTML comments
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions
-R, --reject=LIST comma-separated list of rejected extensions
--accept-regex=REGEX regex matching accepted URLs
--reject-regex=REGEX regex matching rejected URLs
--regex-type=TYPE regex type (posix|pcre)
-D, --domains=LIST comma-separated list of accepted domains
--exclude-domains=LIST comma-separated list of rejected domains
--follow-ftp follow FTP links from HTML documents
--follow-tags=LIST comma-separated list of followed HTML tags
--ignore-tags=LIST comma-separated list of ignored HTML tags
-H, --span-hosts go to foreign hosts when recursive
-L, --relative follow relative links only
-I, --include-directories=LIST list of allowed directories
--trust-server-names use the name specified by the redirection
URL's last component
-X, --exclude-directories=LIST list of excluded directories
-np, --no-parent don't ascend to the parent directory