网易杭州研究院的人还是挺牛的,我的水平与他们相差若干个等级,所以有所不对的地方,敬请见谅。
测试了一下nginx ajp模块,发现可能有如下问题(有些问题并不是这个模块带来的):
1、潜在bug,ngx_http_ajp.c:359行%s输出格式,但r->method不是字符串类型,而是ngx_uint_t类型,所以有可能会导致进程崩溃的情况发生
357 if ((method = sc_for_req_method_by_id(r)) == UNKNOWN_METHOD) {
358 ngx_log_error(NGX_LOG_ERR, log, 0,
359 "ajp_marshal_into_msgb - No such method %s", r->method);
360 return NGX_ERROR;
361 }
2、我用jackrabbit开源项目测试了一下,发现访问populate.jsp的时候,存在着一个小问题
我们先用apache测试一下(走ajp协议)
apache反应如下:
00:00:02.293 150.543 1649 151 POST 200 text/html http://test.163.com:8022/jackrabbit-webapp-2.2.5/populate.jsp
从150s终于显示页面,开始显示进度条
从wireshark可以看出,不管是啥,ajp协议都是先传递信息给tomcat,之后tomcat传递给apache,中间没有交叉
00:20:25.676 150.803 1649 151 POST 200 text/html http://xxx:8022/jackrabbit-webapp-2.2.5/populate.jsp
...
7 0.001490 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=1 Ack=1650 Win=11680 Len=0
8 150.673242 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=1 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
16 150.677218 61.135.255.86 61.135.250.217 TCP hpvmmdata > oa-system [ACK] Seq=1650 Ack=8246 Win=65535 Len=0
17 387.459848 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=8246 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
....
26 454.233836 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=16270 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
35 510.252836 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=24292 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
44 525.567410 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=32329 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
53 527.899546 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=40399 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
62 556.146207 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=48461 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
71 572.829348 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=56498 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
79 640.164545 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=64551 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
87 708.108726 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=72627 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
...
96 796.132990 61.135.250.217 61.135.255.86 TCP oa-system > hpvmmdata [ACK] Seq=80643 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
nginx访问tomcat (ajp):
00:08:28.707 501.640 1649 180 POST 200 text/html http://test.163.com:9088/jackrabbit-webapp-2.2.5/populate.jsp
501s后直接显示最终结果
即没有apache在time上的特殊处理
01:16:09.821 * 1630/1714 * POST * * http://test.163.com:9088/jackrabbit-webapp-2.2.5/populate.jsp
直接访问tomcat:
正常显示,无长久等待问题
01:12:04.101 2.428 1714 110 POST 200 text/html http://test.163.com:30002/jackrabbit-webapp-2.2.5/populate.jsp
由此可见,apache ajp模块是有问题的,nginx中的ajp也是有问题的,并没有起到代理的作用。
这些问题都是没有及时flush内容到客户端导致的,这可能是ajp协议本身设计上的原因。
3、对于nginx访问tomcat(ajp),在多进程模式下,是无法保持keepalive的
16:52:17.124950 IP localhost.localdomain.57859 > localhost.localdomain.30004: S 3493885978:3493885978(0) win 32767 <mss 16396,sackOK,timestamp 2907053443 0,nop,wscale 7>
16:52:17.124971 IP localhost.localdomain.30004 > localhost.localdomain.57859: S 3496097446:3496097446(0) ack 3493885979 win 32767 <mss 16396,sackOK,timestamp 2907053443 2907053443,nop,wscale 7>
16:52:17.124982 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 1 win 256 <nop,nop,timestamp 2907053443 2907053443>
16:52:17.125014 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 1:1279(1278) ack 1 win 256 <nop,nop,timestamp 2907053443 2907053443>
16:52:17.125026 IP localhost.localdomain.30004 > localhost.localdomain.57859: . ack 1279 win 256 <nop,nop,timestamp 2907053443 2907053443>
16:52:17.126266 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 1:64(63) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053443>
16:52:17.126277 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 64 win 256 <nop,nop,timestamp 2907053444 2907053444>
16:52:17.126296 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 64:82(18) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053444>
16:52:17.126304 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 82 win 256 <nop,nop,timestamp 2907053444 2907053444>
16:52:17.126319 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 82:88(6) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053444>
16:52:17.126326 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 88 win 256 <nop,nop,timestamp 2907053444 2907053444>
16:52:19.725694 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 1279:2557(1278) ack 88 win 256 <nop,nop,timestamp 2907056044 2907053444>
16:52:19.726362 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 88:151(63) ack 2557 win 256 <nop,nop,timestamp 2907056044 2907056044>
16:52:19.726376 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 151 win 256 <nop,nop,timestamp 2907056044 2907056044>
16:52:19.726398 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 151:169(18) ack 2557 win 256 <nop,nop,timestamp 2907056044 2907056044>
16:52:19.726405 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 169 win 256 <nop,nop,timestamp 2907056045 2907056044>
16:52:19.726422 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 169:175(6) ack 2557 win 256 <nop,nop,timestamp 2907056045 2907056045>
16:52:19.726430 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 175 win 256 <nop,nop,timestamp 2907056045 2907056045>
16:52:23.428126 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 2557:3835(1278) ack 175 win 256 <nop,nop,timestamp 2907059747 2907056045>
16:52:23.428568 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 175:238(63) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428582 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 238 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428600 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 238:256(18) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428610 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 256 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428624 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 256:262(6) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428632 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 262 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428821 IP localhost.localdomain.57859 > localhost.localdomain.30004: F 3835:3835(0) ack 262 win 256 <nop,nop,timestamp 2907059747 2907059747>
16:52:23.428857 IP localhost.localdomain.30004 > localhost.localdomain.57859: F 262:262(0) ack 3836 win 256 <nop,nop,timestamp 2907059748 2907059747>
16:52:23.428870 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 263 win 256 <nop,nop,timestamp 2907059748 2907059748>
这里的特征就是与后端的连接支持不了几个请求就会被nginx主动关闭,加了accept_metex off;就能够避免此问题。经过我测试,这是ngx_http_upstream_keepalive_module的普遍问题,并不是ajp模块自身的问题。
当时在调试广告投放系统的时候,就发现nginx与后端的连接,虽然按照keepalive说明文档做了,仍然维持不了连接,出现大量time_wait。希望这仅仅是我的误解,而不是真正的问题
4、这个不知道是不是问题,从strace查看,发现多出一个EPOLLOUT事件
epoll_wait(3, {{EPOLLIN, {u32=3085643784, u64=13832229564705353736}}}, 4096, -1) = 1
gettimeofday({1299826536, 530432}, NULL) = 0
accept(5, {sa_family=AF_INET, sin_port=htons(3421), sin_addr=inet_addr("61.135.255.86")}, [16]) = 6
ioctl(6, FIONBIO, [1]) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLET, {u32=3085643872, u64=13832226094371778656}}) = 0
epoll_wait(3, {{EPOLLIN, {u32=3085643872, u64=13832226094371778656}}}, 4096, 5000) = 1
gettimeofday({1299826536, 534502}, NULL) = 0
recv(6, "GET /test.jsp HTTP/1.1/r/nHost: te"..., 1024, 0) = 1024
recv(6, "; Province=010; City=010; __utmc"..., 7564, 0) = 117
epoll_ctl(3, EPOLL_CTL_MOD, 6, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=3085643872, u64=13832224891780935776}}) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
ioctl(7, FIONBIO, [1]) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=3085643960, u64=13832223603290747064}}) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(20004), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}, {EPOLLOUT, {u32=3085643872, u64=13832224891780935776}}}, 4096, 60000) = 2
gettimeofday({1299826536, 535060}, NULL) = 0
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(7, [{"/0224/4/177/2/2/0/10HTTP/1.1/0/0/t/test.jsp/0/0/r6"..., 1155}], 1) = 1155 //向后端发送请求
recv(6, 0xbff5f143, 1, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLIN|EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1
gettimeofday({1299826536, 537102}, NULL) = 0
recv(7, "AB/0;/4/0/310/0/2OK/0/0/2/0/fContent-Type/0/0/tt"..., 4096, 0) = 87 //读取后端数据
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7de0000
write(1, "key=Content-Type,value=text/html"..., 33key=Content-Type,value=text/html
) = 33
write(1, "key=Content-Length,value=10/n", 28key=Content-Length,value=10
) = 28
writev(6, [{"HTTP/1.1 200 OK/r/nServer: nginx/r/n"..., 156}, {"2 + 2 = 4/n", 10}], 2) = 166 //返回数据给客户端
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
recv(6, 0x9881c40, 1024, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1 //为啥多出这个事件???
gettimeofday({1299826536, 537666}, NULL) = 0
epoll_wait(3,
{{EPOLLIN|EPOLLOUT, {u32=3085643873, u64=57891489311305825}}}, 4096, 60000) = 1
gettimeofday({1299826679, 729635}, NULL) = 0
recv(6, "GET /test.jsp HTTP/1.1/r/nHost: te"..., 1024, 0) = 1024
recv(6, "; Province=010; City=010; __utmc"..., 7564, 0) = 117
getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(7, [{"/0224/4/177/2/2/0/10HTTP/1.1/0/0/t/test.jsp/0/0/r6"..., 1155}], 1) = 1155 //发送请求给后端
recv(7, 0x9892420, 4096, 0) = -1 EAGAIN (Resource temporarily unavailable) //及时去读取了
recv(6, 0xbff5f143, 1, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLIN|EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1
gettimeofday({1299826679, 731080}, NULL) = 0
recv(7, "AB/0;/4/0/310/0/2OK/0/0/2/0/fContent-Type/0/0/tt"..., 4096, 0) = 87
write(1, "key=Content-Type,value=text/html"..., 33key=Content-Type,value=text/html
) = 33
write(1, "key=Content-Length,value=10/n", 28key=Content-Length,value=10
) = 28
writev(6, [{"HTTP/1.1 200 OK/r/nServer: nginx/r/n"..., 156}, {"2 + 2 = 4/n", 10}], 2) = 166
recv(6, 0x9881c40, 1024, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1 //为啥多出这个事件
gettimeofday({1299826679, 732019}, NULL) = 0
epoll_wait(3,