我对nginx_ajp_module模块的测试心得

甄佐
2023-12-01

网易杭州研究院的人还是挺牛的,我的水平与他们相差若干个等级,所以有所不对的地方,敬请见谅。

 

测试了一下nginx ajp模块,发现可能有如下问题(有些问题并不是这个模块带来的):

 

1、潜在bug,ngx_http_ajp.c:359行%s输出格式,但r->method不是字符串类型,而是ngx_uint_t类型,所以有可能会导致进程崩溃的情况发生

    357     if ((method = sc_for_req_method_by_id(r)) == UNKNOWN_METHOD) {
    358         ngx_log_error(NGX_LOG_ERR, log, 0,
    359                 "ajp_marshal_into_msgb - No such method %s", r->method);
    360         return NGX_ERROR;
    361     }

 

 

 

 

2、我用jackrabbit开源项目测试了一下,发现访问populate.jsp的时候,存在着一个小问题

 

    我们先用apache测试一下(走ajp协议)

    apache反应如下:
        00:00:02.293    150.543    1649    151    POST    200    text/html    http://test.163.com:8022/jackrabbit-webapp-2.2.5/populate.jsp
        从150s终于显示页面,开始显示进度条
        从wireshark可以看出,不管是啥,ajp协议都是先传递信息给tomcat,之后tomcat传递给apache,中间没有交叉
       
        00:20:25.676    150.803    1649    151    POST    200    text/html    http://xxx:8022/jackrabbit-webapp-2.2.5/populate.jsp
       
        ...
        7    0.001490    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=1 Ack=1650 Win=11680 Len=0
        8    150.673242    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=1 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        16    150.677218    61.135.255.86    61.135.250.217    TCP    hpvmmdata > oa-system [ACK] Seq=1650 Ack=8246 Win=65535 Len=0
        17    387.459848    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=8246 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ....
        26    454.233836    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=16270 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        35    510.252836    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=24292 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        44    525.567410    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=32329 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        53    527.899546    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=40399 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        62    556.146207    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=48461 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        71    572.829348    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=56498 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        79    640.164545    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=64551 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        87    708.108726    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=72627 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]
        ...
        96    796.132990    61.135.250.217    61.135.255.86    TCP    oa-system > hpvmmdata [ACK] Seq=80643 Ack=1650 Win=11680 Len=1460[Packet size limited during capture]

 

 

 

 

   nginx访问tomcat (ajp):

       00:08:28.707    501.640    1649    180    POST    200    text/html    http://test.163.com:9088/jackrabbit-webapp-2.2.5/populate.jsp
        501s后直接显示最终结果
        即没有apache在time上的特殊处理
       
        01:16:09.821    *    1630/1714    *    POST    *    *    http://test.163.com:9088/jackrabbit-webapp-2.2.5/populate.jsp

 

 

 

 

      直接访问tomcat:

 

        正常显示,无长久等待问题
        01:12:04.101    2.428    1714    110    POST    200    text/html    http://test.163.com:30002/jackrabbit-webapp-2.2.5/populate.jsp

 


       由此可见,apache ajp模块是有问题的,nginx中的ajp也是有问题的,并没有起到代理的作用。

       这些问题都是没有及时flush内容到客户端导致的,这可能是ajp协议本身设计上的原因。

 

 

3、对于nginx访问tomcat(ajp),在多进程模式下,是无法保持keepalive的

 

        16:52:17.124950 IP localhost.localdomain.57859 > localhost.localdomain.30004: S 3493885978:3493885978(0) win 32767 <mss 16396,sackOK,timestamp 2907053443 0,nop,wscale 7>
        16:52:17.124971 IP localhost.localdomain.30004 > localhost.localdomain.57859: S 3496097446:3496097446(0) ack 3493885979 win 32767 <mss 16396,sackOK,timestamp 2907053443 2907053443,nop,wscale 7>
        16:52:17.124982 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 1 win 256 <nop,nop,timestamp 2907053443 2907053443>
        16:52:17.125014 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 1:1279(1278) ack 1 win 256 <nop,nop,timestamp 2907053443 2907053443>
        16:52:17.125026 IP localhost.localdomain.30004 > localhost.localdomain.57859: . ack 1279 win 256 <nop,nop,timestamp 2907053443 2907053443>
        16:52:17.126266 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 1:64(63) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053443>
        16:52:17.126277 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 64 win 256 <nop,nop,timestamp 2907053444 2907053444>
        16:52:17.126296 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 64:82(18) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053444>
        16:52:17.126304 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 82 win 256 <nop,nop,timestamp 2907053444 2907053444>
        16:52:17.126319 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 82:88(6) ack 1279 win 256 <nop,nop,timestamp 2907053444 2907053444>
        16:52:17.126326 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 88 win 256 <nop,nop,timestamp 2907053444 2907053444>
       
       
        16:52:19.725694 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 1279:2557(1278) ack 88 win 256 <nop,nop,timestamp 2907056044 2907053444>
        16:52:19.726362 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 88:151(63) ack 2557 win 256 <nop,nop,timestamp 2907056044 2907056044>
        16:52:19.726376 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 151 win 256 <nop,nop,timestamp 2907056044 2907056044>
        16:52:19.726398 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 151:169(18) ack 2557 win 256 <nop,nop,timestamp 2907056044 2907056044>
        16:52:19.726405 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 169 win 256 <nop,nop,timestamp 2907056045 2907056044>
        16:52:19.726422 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 169:175(6) ack 2557 win 256 <nop,nop,timestamp 2907056045 2907056045>
        16:52:19.726430 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 175 win 256 <nop,nop,timestamp 2907056045 2907056045>
       
       
        16:52:23.428126 IP localhost.localdomain.57859 > localhost.localdomain.30004: P 2557:3835(1278) ack 175 win 256 <nop,nop,timestamp 2907059747 2907056045>
        16:52:23.428568 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 175:238(63) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428582 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 238 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428600 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 238:256(18) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428610 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 256 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428624 IP localhost.localdomain.30004 > localhost.localdomain.57859: P 256:262(6) ack 3835 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428632 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 262 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428821 IP localhost.localdomain.57859 > localhost.localdomain.30004: F 3835:3835(0) ack 262 win 256 <nop,nop,timestamp 2907059747 2907059747>
        16:52:23.428857 IP localhost.localdomain.30004 > localhost.localdomain.57859: F 262:262(0) ack 3836 win 256 <nop,nop,timestamp 2907059748 2907059747>
        16:52:23.428870 IP localhost.localdomain.57859 > localhost.localdomain.30004: . ack 263 win 256 <nop,nop,timestamp 2907059748 2907059748>

 

这里的特征就是与后端的连接支持不了几个请求就会被nginx主动关闭,加了accept_metex off;就能够避免此问题。经过我测试,这是ngx_http_upstream_keepalive_module的普遍问题,并不是ajp模块自身的问题。

当时在调试广告投放系统的时候,就发现nginx与后端的连接,虽然按照keepalive说明文档做了,仍然维持不了连接,出现大量time_wait。希望这仅仅是我的误解,而不是真正的问题

 

 

4、这个不知道是不是问题,从strace查看,发现多出一个EPOLLOUT事件

 

 

    epoll_wait(3, {{EPOLLIN, {u32=3085643784, u64=13832229564705353736}}}, 4096, -1) = 1
    gettimeofday({1299826536, 530432}, NULL) = 0
    accept(5, {sa_family=AF_INET, sin_port=htons(3421), sin_addr=inet_addr("61.135.255.86")}, [16]) = 6
    ioctl(6, FIONBIO, [1])                  = 0
    epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLET, {u32=3085643872, u64=13832226094371778656}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=3085643872, u64=13832226094371778656}}}, 4096, 5000) = 1
    gettimeofday({1299826536, 534502}, NULL) = 0
    recv(6, "GET /test.jsp HTTP/1.1/r/nHost: te"..., 1024, 0) = 1024
    recv(6, "; Province=010; City=010; __utmc"..., 7564, 0) = 117
    epoll_ctl(3, EPOLL_CTL_MOD, 6, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=3085643872, u64=13832224891780935776}}) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
    ioctl(7, FIONBIO, [1])                  = 0
    epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=3085643960, u64=13832223603290747064}}) = 0
    connect(7, {sa_family=AF_INET, sin_port=htons(20004), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}, {EPOLLOUT, {u32=3085643872, u64=13832224891780935776}}}, 4096, 60000) = 2
    gettimeofday({1299826536, 535060}, NULL) = 0
    getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(7, [{"/0224/4/177/2/2/0/10HTTP/1.1/0/0/t/test.jsp/0/0/r6"..., 1155}], 1) = 1155    //向后端发送请求
    recv(6, 0xbff5f143, 1, MSG_PEEK)        = -1 EAGAIN (Resource temporarily unavailable) 
    epoll_wait(3, {{EPOLLIN|EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1
    gettimeofday({1299826536, 537102}, NULL) = 0
    recv(7, "AB/0;/4/0/310/0/2OK/0/0/2/0/fContent-Type/0/0/tt"..., 4096, 0) = 87           //读取后端数据
    fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7de0000
    write(1, "key=Content-Type,value=text/html"..., 33key=Content-Type,value=text/html
    ) = 33
    write(1, "key=Content-Length,value=10/n", 28key=Content-Length,value=10
    ) = 28
    writev(6, [{"HTTP/1.1 200 OK/r/nServer: nginx/r/n"..., 156}, {"2 + 2 = 4/n", 10}], 2) = 166 //返回数据给客户端
    setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    recv(6, 0x9881c40, 1024, 0)             = -1 EAGAIN (Resource temporarily unavailable)
    epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1  //为啥多出这个事件???
    gettimeofday({1299826536, 537666}, NULL) = 0
    epoll_wait(3,

       
    {{EPOLLIN|EPOLLOUT, {u32=3085643873, u64=57891489311305825}}}, 4096, 60000) = 1
    gettimeofday({1299826679, 729635}, NULL) = 0
    recv(6, "GET /test.jsp HTTP/1.1/r/nHost: te"..., 1024, 0) = 1024
    recv(6, "; Province=010; City=010; __utmc"..., 7564, 0) = 117
    getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(7, [{"/0224/4/177/2/2/0/10HTTP/1.1/0/0/t/test.jsp/0/0/r6"..., 1155}], 1) = 1155  //发送请求给后端
    recv(7, 0x9892420, 4096, 0)             = -1 EAGAIN (Resource temporarily unavailable)  //及时去读取了
    recv(6, 0xbff5f143, 1, MSG_PEEK)        = -1 EAGAIN (Resource temporarily unavailable)
    epoll_wait(3, {{EPOLLIN|EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1
    gettimeofday({1299826679, 731080}, NULL) = 0
    recv(7, "AB/0;/4/0/310/0/2OK/0/0/2/0/fContent-Type/0/0/tt"..., 4096, 0) = 87
    write(1, "key=Content-Type,value=text/html"..., 33key=Content-Type,value=text/html
    ) = 33
    write(1, "key=Content-Length,value=10/n", 28key=Content-Length,value=10
    ) = 28
    writev(6, [{"HTTP/1.1 200 OK/r/nServer: nginx/r/n"..., 156}, {"2 + 2 = 4/n", 10}], 2) = 166
    recv(6, 0x9881c40, 1024, 0)             = -1 EAGAIN (Resource temporarily unavailable)
    epoll_wait(3, {{EPOLLOUT, {u32=3085643960, u64=13832223603290747064}}}, 4096, 60000) = 1  //为啥多出这个事件
    gettimeofday({1299826679, 732019}, NULL) = 0
    epoll_wait(3,

 

 类似资料: