======================================================================
Online: http://curl.haxx.se/docs/httpscripting.html
Date: Jan 19, 2011
The Art Of Scripting HTTP Requests Using Curl
=============================================
This document assumes that you are familiar with HTML and general
networking.

The ability to write scripts is essential to a good computer system. The
fact that Unix can be extended with shell scripts and various tools that
run automated commands is one of the reasons it has succeeded so well.

The increasing number of applications moving to the web has made "HTTP
scripting" more frequently requested and wanted. Being able to fetch data
from web sites automatically, to fake users, and to post or upload data to
web servers are all important tasks today.

Curl is a command line tool for doing all sorts of URL manipulations and
transfers, but this particular document will focus on HTTP requests. I
assume that you already know how to invoke 'curl --help' or 'curl --manual'
to get basic information about curl.

Curl is not written to do everything for you. It makes the requests, it
gets the data, it sends data and it retrieves the information. You will
probably need to glue it all together using some scripting language or
repeated manual invocations.
1. The HTTP Protocol
HTTP is the protocol used to fetch data from web servers. It is a very
simple protocol built upon TCP/IP. The protocol also allows the client to
send information to the server using a few different methods, as will be
shown here.

HTTP consists of plain text lines sent from the client to the server to
request a particular action; before the requested data is delivered to the
client, the server first replies with a few text lines of its own.

The client, curl, sends an HTTP request. The request contains a method
(like GET, POST, HEAD etc), a number of request headers and sometimes a
request body. The HTTP server responds with a status line (indicating if
things went well), response headers and most often also a response body.
The "body" part is the plain data you requested, like the actual HTML or
an image etc.
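For illustration, a minimal exchange might look like the following (the
exact headers vary between clients and servers; this is just a sketch):

GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 137

<html>...the requested document...</html>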
1.1 See the Protocol
Using curl's option --verbose (-v as a short option) will display what kind
of commands curl sends to the server, as well as a few other informational
texts.

--verbose is the single most useful option for debugging or even
understanding the interaction between curl and the server.

Sometimes even --verbose is not enough. Then --trace and --trace-ascii
offer even more details, as they show everything curl sends and receives.
Use it like this:
curl --trace-ascii debugdump.txt http://www.example.com/
2. URL
The Uniform Resource Locator format is how you specify the address of a
particular resource on the Internet, as in http://curl.haxx.se or
https://yourbank.com.
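curl also supports lists and ranges in URLs ("globbing"), which is handy
when scripting many similar requests; see 'curl --manual' for the details.
For example:

curl http://site.{one,two,three}.com
curl ftp://ftp.numericals.com/file[1-100].txt
curl ftp://ftp.numericals.com/file[001-100].txt
curl http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
curl http://www.numericals.com/file[1-100:10].txt
curl http://www.letters.com/file[a-z:2].txt

[001-100] preserves the leading zeros, and a trailing :N in a range steps
through it N at a time.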
3. GET a page
The simplest and most common request made using HTTP is to get a URL. The
URL could refer to a web page, an image or a file. The client issues a GET
request to the server and receives the document it asked for:
curl http://curl.haxx.se
Issuing that command line gets the entire web page shown in your terminal
window.

All HTTP replies contain a set of response headers that are normally
hidden. Use curl's --include (-i) option to display them as well as the
document. You can also ask the server for only the headers by using the
--head (-I) option, which makes curl issue a HEAD request.
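For example, to see the headers together with the body, or the headers
alone:

curl --include http://www.example.com
curl --head http://www.example.com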
4. Forms
Forms are the general way a web site presents an HTML page with fields for
the user to enter data in. The user fills in the fields, presses some kind
of 'OK' or 'submit' button, and the data is sent to the server. The server
then uses the posted data to decide how to act: search a database with the
entered words, add the information to a bug tracking system, display the
entered address on a map, or use it as a login prompt to verify that the
user is allowed to see what it is about to see.

Of course, there has to be some kind of program on the server end to
receive the data you send. You cannot just invent something out of thin
air.
4.1 GET
A GET form uses the GET method, as specified in HTML like:
<form method="GET" action="junk.cgi">
<input type=text name="birthyear">
<input type=submit name=press value="OK">
</form>
In your favorite browser, this form will appear with a text box to fill in
and a button labeled "OK". If you fill in '1905' and press the button, the
browser creates a new GET URL: "junk.cgi?birthyear=1905&press=OK" is
appended to the path part of the original URL.

If the original form page was "www.hotmail.com/when/birth.html", the
second page you get becomes
"www.hotmail.com/when/junk.cgi?birthyear=1905&press=OK".

Most search engines work this way.

To make curl do the GET form submission for you, just enter the expected
URL:
curl "http://www.hotmail.com/when/junk.cgi?birthyear=1905&press=OK"
4.2 POST
The GET method makes all the input field names and contents show up in the
URL field of your browser. That is convenient when you want to bookmark a
page with your data filled in, but it is an obvious disadvantage that the
field contents are visible in the URL.

The HTTP protocol also offers the POST method. With it, the client sends
the data separated from the URL, so you will not see any of it in the URL
address field.

A POST form looks very similar to the GET one:
<form method="POST" action="junk.cgi">
<input type=text name="birthyear">
<input type=submit name=press value=" OK ">
</form>
To make curl post this form with the data filled in, do it like this:
curl --data "birthyear=1905&press=%20OK%20" http://www.example.com/when.cgi
This kind of POST uses the Content-Type application/x-www-form-urlencoded
and is the most widely used kind.

The data you send to the server must already be properly encoded; curl
will not do it for you. For example, if you want the data to contain a
space, you need to replace that space with %20. Failing to comply with
this will most likely cause your data to be received wrongly and messed
up.

Recent curl versions can url-encode POST data for you, like this:
curl --data-urlencode "name=I am Daniel" http://www.example.com
4.3 File Upload POST
Back in 1995, RFC 1867 defined an additional way to post data over HTTP,
designed mainly to better support file uploads. A form that allows a user
to upload a file could look like this in HTML:
<form method="POST" enctype='multipart/form-data' action="upload.cgi">
<input type=file name=upload>
<input type=submit name=press value="OK">
</form>
This clearly shows that the Content-Type about to be sent is
multipart/form-data.

To post to a form like this with curl, you enter a command line like:
curl --form upload=@localfilename --form press=OK [URL]
4.4 Hidden Fields
A very common way for HTML-based applications to pass state information
between pages is to add hidden fields to the forms. Hidden fields are not
displayed to the user, and they get passed along just like the other
fields.

For example:
<form method="POST" action="foobar.cgi">
<input type=text name="birthyear">
<input type=hidden name="person" value="daniel">
<input type=submit name="press" value="OK">
</form>
To post this with curl, you do not have to think about whether a field is
hidden or not. To curl they are all the same:
curl --data "birthyear=1905&press=OK&person=daniel" [URL]
4.5 Figure Out What A POST Looks Like
When you fill in a form and send it to a server with curl instead of a
browser, you are of course very interested in sending the POST exactly the
way your browser does it.

An easy way to see this is to save the HTML page with the form on your
local disk, change the 'method' to GET, and press the submit button (you
can also change the action URL if you want). You will then clearly see the
data get appended to the URL.
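Alternatively, you can let curl itself show you what it sends: passing '-'
as the file name to --trace-ascii writes the trace to stdout. A sketch,
with made-up field names:

curl --trace-ascii - --data "birthyear=1905&press=OK" http://www.example.com/when.cgi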
5. PUT
The perhaps best way to upload data to an HTTP server is to use PUT. This
of course requires a program or script on the server end that knows how to
receive an HTTP PUT stream.

Upload a file to an HTTP server with curl:
curl --upload-file uploadfile http://www.example.com/receive.cgi
6. HTTP Authentication
HTTP authentication is how you tell the server your username and password
so that it can verify that you are allowed to make the request. Basic
authentication (curl's default) is plain-text based: the username and
password are sent only slightly obfuscated and can be sniffed by anyone on
the network between you and the server.

To tell curl to use a user and password for authentication:
curl --user name:password http://www.example.com
The site might require a different authentication method (check the
headers returned by the server); in that case --ntlm, --digest,
--negotiate or even --anyauth might be options that suit you.
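For example, to use Digest authentication, or to let curl pick whatever
method the server announces:

curl --digest --user name:password http://www.example.com
curl --anyauth --user name:password http://www.example.com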
Sometimes your HTTP access is only available through an HTTP proxy, which
is especially common at companies. The proxy may require its own user and
password to let the client through to the Internet. To specify those with
curl:
curl --proxy-user proxyuser:proxypassword curl.haxx.se
If your proxy requires the authentication to be done using the NTLM
method, use --proxy-ntlm; if it requires Digest, use --proxy-digest.
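The example above assumes the proxy itself is given some other way (such
as environment variables); you can also name it explicitly with --proxy.
Host and port here are placeholders:

curl --proxy proxy.example.com:8080 --proxy-user proxyuser:proxypassword curl.haxx.se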
If you use any of these user+password options but leave out the password
part, curl will prompt for the password interactively.

Do note that the parameters of a running program may be visible to others
in the system's process list. If you pass usernames and passwords as plain
command line options, other users may be able to see them.

Also note that many web sites do not use HTTP authentication for their
logins at all. See the Web Login chapter further below for details.
7. Referer
An HTTP request may include a 'referer' field (yes, it is misspelled),
which tells the server from which URL the client arrived at this
particular resource. Some programs/scripts check the referer field to
verify that the request did not come from an external site or an unknown
page. While this is a poor way to check something so easily forged, many
scripts still do it. With curl, you can put whatever you want in the
referer field:
curl --referer http://www.example.com http://www.example.com
8. User Agent
Very similar to the referer field, all HTTP requests may set the
User-Agent field, which names the client software being used. Many
applications use this information to decide how to render pages. Silly web
programmers build different page versions for different browsers, often
with different kinds of javascript, vbscript etc.

At times you will see that getting a page with curl does not return the
same thing you see in your browser. Then you know it is time to set the
User-Agent field to fool the server.

To make curl look like Internet Explorer 5 on a Windows 2000 box:
curl --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" [URL]
Or why not look like Netscape 4.73 on an old Linux box:
curl --user-agent "Mozilla/4.73 [en] (X11; U; Linux 2.2.15 i686)" [URL]
9. Redirects
When a resource is requested from a server, the reply may include a hint
that the browser should go to another page next, or to a page holding
newly generated output. The header that tells the browser to redirect is
Location:.

Curl does not follow Location: headers by default, but simply shows such
replies the same way it shows all HTTP replies. To tell curl to follow a
Location:
curl --location http://www.example.com
If you use curl to POST to a site that immediately redirects you to
another page, you can safely combine --location (-L) with --data/--form.
Curl only uses POST in the first request and then reverts to GET in the
following ones.
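For example (URL and fields as in the POST example above):

curl --location --data "birthyear=1905&press=%20OK%20" http://www.example.com/when.cgi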
10. Cookies
The way the web browsers do "client side state control" is by using
cookies. Cookies are just names with associated contents. The cookies are
sent to the client by the server. The server tells the client for what path
and host name it wants the cookie sent back, and it also sends an expiration
date and a few more properties.
When a client communicates with a server with a name and path as previously
specified in a received cookie, the client sends back the cookies and their
contents to the server, unless of course they are expired.
Many applications and servers use this method to connect a series of requests
into a single logical session. To be able to use curl in such occasions, we
must be able to record and send back cookies the way the web application
expects them. The same way browsers deal with them.
The simplest way to send a few cookies to the server when getting a page with
curl is to add them on the command line like:
curl --cookie "name=Daniel" http://www.example.com
Cookies are sent as common HTTP headers. This is practical as it allows curl
to record cookies simply by recording headers. Record cookies with curl by
using the --dump-header (-D) option like:
curl --dump-header headers_and_cookies http://www.example.com
(Take note that the --cookie-jar option described below is a better way to
store cookies.)
Curl has a full blown cookie parsing engine built-in that comes to use if you
want to reconnect to a server and use cookies that were stored from a
previous connection (or handcrafted manually to fool the server into
believing you had a previous connection). To use previously stored cookies,
you run curl like:
curl --cookie stored_cookies_in_file http://www.example.com
Curl's "cookie engine" gets enabled when you use the --cookie option. If you
only want curl to understand received cookies, use --cookie with a file that
doesn't exist. Example, if you want to let curl understand cookies from a
page and follow a location (and thus possibly send back cookies it received),
you can invoke it like:
curl --cookie nada --location http://www.example.com
Curl has the ability to read and write cookie files that use the same file
format that Netscape and Mozilla do. It is a convenient way to share cookies
between browsers and automatic scripts. The --cookie (-b) switch
automatically detects if a given file is such a cookie file and parses it,
and by using the --cookie-jar (-c) option you'll make curl write a new cookie
file at the end of an operation:
curl --cookie cookies.txt --cookie-jar newcookies.txt http://www.example.com
11. HTTPS
There are a few ways to do secure HTTP transfers. The by far most common
protocol for doing this is what is generally known as HTTPS, HTTP over
SSL. SSL encrypts all the data that is sent and received over the network and
thus makes it harder for attackers to spy on sensitive information.
SSL (or TLS as the latest version of the standard is called) offers a
truckload of advanced features to allow all those encryptions and key
infrastructure mechanisms encrypted HTTP requires.
Curl supports encrypted fetches thanks to the freely available OpenSSL
libraries. To get a page from a HTTPS server, simply run curl like:
curl https://secure.example.com
11.1 Certificates
In the HTTPS world, you use certificates to validate that you are the one
you claim to be, as an addition to normal passwords. Curl supports client-
side certificates. All certificates are locked with a pass phrase, which you
need to enter before the certificate can be used by curl. The pass phrase
can be specified on the command line or if not, entered interactively when
curl queries for it. Use a certificate with curl on a HTTPS server like:
curl --cert mycert.pem https://secure.example.com
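If you do not want to be prompted, the pass phrase can be given together
with the certificate; the syntax is --cert <certificate[:password]>, so
with a made-up pass phrase:

curl --cert mycert.pem:mypassphrase https://secure.example.com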
curl also tries to verify that the server is who it claims to be, by
verifying the server's certificate against a locally stored CA cert
bundle. Failing the verification will cause curl to deny the connection. You
must then use --insecure (-k) in case you want to tell curl to ignore that
the server can't be verified.
More about server certificate verification and ca cert bundles can be read
in the SSLCERTS document, available online here:
http://curl.haxx.se/docs/sslcerts.html
12. Custom Request Elements
Doing fancy stuff, you may need to add or change elements of a single curl
request.
For example, you can change the POST request to a PROPFIND and send the data
as "Content-Type: text/xml" (instead of the default Content-Type) like this:
curl --data "<xml>" --header "Content-Type: text/xml" --request PROPFIND url.com
You can delete a default header by providing one without content. Like you
can ruin the request by chopping off the Host: header:
curl --header "Host:" http://www.example.com
You can add headers the same way. Your server may want a "Destination:"
header, and you can add it:
curl --header "Destination: http://nowhere" http://example.com
13. Web Login
While not strictly just HTTP related, it still causes a lot of people
problems, so here's the executive run-down of how the vast majority of all
login forms work and how to login to them using curl.
It can also be noted that to do this properly in an automated fashion, you
will most certainly need to script things and do multiple curl invokes etc.
First, servers mostly use cookies to track the logged-in status of the
client, so you will need to capture the cookies you receive in the
responses. Then, many sites also set a special cookie on the login page (to
make sure you got there through their login page) so you should make a habit
of first getting the login-form page to capture the cookies set there.
Some web-based login systems feature various amounts of javascript, and
sometimes they use such code to set or modify cookie contents. Possibly
they do that to prevent programmed logins, like the ones this manual
describes. Anyway, if reading the code isn't enough to let you repeat the
behavior manually, capturing the HTTP requests done by your browser and
analyzing the sent cookies is usually a working way to figure out how to
shortcut the javascript.
In the actual <form> tag for the login, lots of sites fill-in random/session
or otherwise secretly generated hidden tags and you may need to first capture
the HTML code for the login form and extract all the hidden fields to be able
to do a proper login POST. Remember that the contents need to be URL encoded
when sent in a normal POST.
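Putting the pieces together, a login typically becomes a couple of curl
invocations sharing a cookie jar. The following sketch assumes a
hypothetical site with a login form at /login.html that POSTs to
/login.cgi with the fields 'user', 'password' and a hidden 'session'
field; all names and URLs are made up for illustration:

#!/bin/sh
# 1. Get the login page and store any cookies it sets in cookies.txt
curl --cookie-jar cookies.txt --output login.html \
     http://www.example.com/login.html

# 2. Extract the hidden session field from the form (crude one-line match;
#    adapt the pattern to the real HTML)
SESSION=$(sed -n 's/.*name="session" value="\([^"]*\)".*/\1/p' login.html)

# 3. POST the login form, sending the cookies back and saving new ones.
#    Remember that field values must be URL encoded.
curl --cookie cookies.txt --cookie-jar cookies.txt \
     --data "user=daniel&password=secret&session=$SESSION" \
     http://www.example.com/login.cgi

# 4. Later requests use the same cookie jar to stay logged in
curl --cookie cookies.txt http://www.example.com/members/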
14. Debug
Many times when you run curl on a site, you'll notice that the site doesn't
seem to respond the same way to your curl requests as it does to your
browser's.
Then you need to start making your curl requests more similar to your
browser's requests:
* Use the --trace-ascii option to store fully detailed logs of the requests
for easier analyzing and better understanding
* Make sure you check for and use cookies when needed (both reading with
--cookie and writing with --cookie-jar)
* Set user-agent to one like a recent popular browser does
* Set referer like it is set by the browser
* If you use POST, make sure you send all the fields and in the same order as
the browser does it. (See chapter 4.5 above)
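Combining these, a browser-mimicking request might look like this (header
values are examples only):

curl --trace-ascii trace.txt \
     --cookie cookies.txt --cookie-jar cookies.txt \
     --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" \
     --referer http://www.example.com/index.html \
     http://www.example.com/page.html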
A very good helper to make sure you do this right, is the LiveHTTPHeader tool
that lets you view all headers you send and receive with Mozilla/Firefox
(even when using HTTPS).
A more raw approach is to capture the HTTP traffic on the network with tools
such as ethereal or tcpdump and check what headers that were sent and
received by the browser. (HTTPS makes this technique inefficient.)
15. References
RFC 2616 is a must to read if you want in-depth understanding of the HTTP
protocol.
RFC 3986 explains the URL syntax.
RFC 2109 defines how cookies are supposed to work.
RFC 1867 defines the HTTP post upload format.
http://curl.haxx.se is the home of the cURL project