How to analyze Linux TCP/IP packet loss

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:   29142     323    0    0    0     0          0         0    29142     323    0    0    0     0       0          0
wlp3s0: 35148233   39485    0 1226    0     0          0         0  4937381   36609    0    0    0     0       0          0
enp4s0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
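
The drop columns in the output above can also be read programmatically. The following is a minimal sketch, assuming the 16-counter layout shown in the header lines; the function name parse_net_dev and the dictionary keys are illustrative, not part of any existing tool.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {iface: {counter_name: int}}.

    Assumes 8 receive counters followed by 8 transmit counters per
    interface, as in the header shown above.
    """
    stats = {}
    for line in text.splitlines():
        if "|" in line or ":" not in line:
            continue  # skip the two header lines and blank lines
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # unexpected layout, ignore the line
        values = [int(f) for f in fields[:16]]
        stats[name.strip()] = {
            "rx_packets": values[1],
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_packets": values[9],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return stats
```

On a live system the input would be the contents of /proc/net/dev; comparing two snapshots taken a few seconds apart shows whether rx_drop is still growing.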

1.2 softnet_stat: per-CPU RX backlog statistics

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# cat /proc/net/softnet_stat
000001b0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000094 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00009a0f 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000089 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Meaning of each field:
seq_printf(seq,
       "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
       sd->processed, sd->dropped, sd->time_squeeze, 0,
       0, 0, 0, 0, /* was fastroute */
       sd->cpu_collision, sd->received_rps, flow_limit_count);

Each line of /proc/net/softnet_stat corresponds to a struct softnet_data structure, of which there is 1 per CPU.
 
The values are separated by a single space and are displayed in hexadecimal.

The first value, sd->processed, is the number of network frames processed. This can be more than the total
 number of network frames received if you are using ethernet bonding. There are cases where the ethernet
 bonding driver will trigger network data to be re-processed, which would increment the sd->processed count
 more than once for the same packet.
 
The second value, sd->dropped, is the number of network frames dropped because there was no room on the
 processing queue. More on this later.
 
The third value, sd->time_squeeze, is the number of times the net_rx_action loop terminated
 because the budget was consumed or the time limit was reached while more work remained. Increasing
 the budget (net.core.netdev_budget) can help reduce this.
 
The next 5 values are always 0.

The ninth value, sd->cpu_collision, is a count of the number of times a collision occurred when trying to
 obtain a device lock when transmitting packets. It concerns the transmit path rather than receive.

The tenth value, sd->received_rps, is a count of the number of times this CPU has been woken up to process
 packets via an Inter-processor Interrupt.
 
The last value, flow_limit_count, is a count of the number of times the flow limit has been reached.
 Flow limiting is an optional Receive Packet Steering feature.

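1. Each line holds the softnet_data statistics of one CPU.
2. Column 1 is the number of packets this CPU has processed.
3. Column 2 is the number of packets dropped because the softnet_data input queue (input_pkt_queue) was full; the maximum queue length can be adjusted via /proc/sys/net/core/netdev_max_backlog.
4. Column 3 is the number of times the softirq stopped early because it had taken netdev_budget packets in one run or had spent more than 2 ms fetching packets.
5. Columns 4 to 8 are always 0 and carry no meaning.
6. Column 9 is the number of times the corresponding queue was found locked when transmitting a packet.
7. Column 10 is, when RPS is enabled, the number of times this CPU was woken by an IPI from another CPU to process packets.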
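
As a minimal sketch of the column layout described above, the following Python decodes /proc/net/softnet_stat text into named counters. The function name parse_softnet_stat and the dictionary keys are illustrative choices. The sketch assumes the 11-column layout of the seq_printf call quoted earlier; any additional columns a kernel may print are ignored, and the row index is only assumed to equal the CPU number.

```python
# Column index -> counter name, following the seq_printf call quoted above.
SOFTNET_COLUMNS = {
    0: "processed",
    1: "dropped",
    2: "time_squeeze",
    8: "cpu_collision",
    9: "received_rps",
    10: "flow_limit_count",
}


def parse_softnet_stat(text):
    """Return one dict of counters per non-empty line (assumed: one per CPU)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # blank line or unexpected layout
        # Every value is printed as 8 hexadecimal digits.
        row = {name: int(fields[idx], 16) for idx, name in SOFTNET_COLUMNS.items()}
        row["row"] = len(rows)  # row index, assumed to match the CPU number
        rows.append(row)
    return rows


def rows_with_backlog_drops(rows):
    """Return the row indexes whose 'dropped' counter is non-zero."""
    return [r["row"] for r in rows if r["dropped"] > 0]
```

A non-empty result from rows_with_backlog_drops suggests that the per-CPU input queue overflowed at some point since boot; comparing two snapshots shows whether it is still happening.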

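Note

The default RX backlog per CPU is 1000 packets.

The backlog is used when the netdev driver does not use NAPI, or when RPS is enabled.

When running a high RX throughput test, check whether packets are being dropped at the backlog.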

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# sysctl net.core.netdev_max_backlog

net.core.netdev_max_backlog = 1000

1.3 snmp: send/receive statistics for each protocol layer

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 42966 0 2 0 0 0 40770 50253 20 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 40 0 0 40 0 0 0 0 0 0 0 0 0 0 40 0 40 0 0 0 0 0 0 0 0 0 0
IcmpMsg: InType3 OutType3
IcmpMsg: 40 40
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 38 11 2 0 3 39093 49237 386 0 22 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 1196 40 0 594 0 0 0 576
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
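
/proc/net/snmp pairs a header line with a value line for each protocol, so it can be turned into a lookup table. The following is a minimal sketch; the function names parse_proc_net_snmp and tcp_retrans_ratio are illustrative, and the ratio of RetransSegs to OutSegs is only one rough loss indicator, not a definitive measure.

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp text into {protocol: {counter_name: int}}."""
    lines = [l for l in text.splitlines() if l.strip()]
    tables = {}
    # Lines come in pairs: "Proto: name1 name2 ..." then "Proto: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, _, names = header.partition(":")
        value_proto, _, numbers = values.partition(":")
        if proto != value_proto:
            continue  # pairing is off, skip rather than guess
        tables[proto] = dict(zip(names.split(), (int(v) for v in numbers.split())))
    return tables


def tcp_retrans_ratio(tables):
    """Fraction of sent TCP segments that were retransmissions."""
    tcp = tables["Tcp"]
    if tcp["OutSegs"] == 0:
        return 0.0
    return tcp["RetransSegs"] / tcp["OutSegs"]
```

With the sample output above, Tcp RetransSegs is 386 out of 49237 OutSegs, a retransmission ratio of a little under 1%.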

2. ifconfig

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# ifconfig
enp4s0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 30:85:a9:2a:48:06  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

----
Note
    /net/core/dev.c

    struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
    					struct rtnl_link_stats64 *storage)
    {
    	const struct net_device_ops *ops = dev->netdev_ops;
    
    	/* The driver registered ndo_get_stats64. */
    	if (ops->ndo_get_stats64) {
    		memset(storage, 0, sizeof(*storage));
    		ops->ndo_get_stats64(dev, storage);
    
    	/* The driver registered ndo_get_stats. */
    	} else if (ops->ndo_get_stats) {
    		netdev_stats_to_stats64(storage, ops->ndo_get_stats(dev));
    
    	/* Fall back to dev->stats; this structure is being phased out. */
    	} else {
    		netdev_stats_to_stats64(storage, &dev->stats);
    	}
    
    	/*
    	 * Packets dropped by the kernel itself (net/core/dev.c) are added on top:
    	 * 1. enqueue_to_backlog(): the backlog is full.
    	 *    atomic_long_inc(&skb->dev->rx_dropped) (sd->dropped++ is updated too)
    	 * 2. __netif_receive_skb_core(): no handler recognizes the packet.
    	 *    atomic_long_inc(&skb->dev->rx_dropped);
    	 */
    	storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
    	storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
    	storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);
    	return storage;
    }

3. ip

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# ip -s -s link ls enp4s0
2: enp4s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 30:85:a9:2a:48:06 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    0          0        0       0       0       0
    RX errors: length   crc     frame   fifo    missed
               0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    0          0        0       0       0       0
    TX errors: aborted  fifo   window heartbeat transns
               0        0       0       0       1

4. netstat

4.1 netdev statistics for each NIC

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
enp4s0    1500        0      0      0 0             0      0      0      0 BMU
lo       65536      327      0      0 0           327      0      0      0 LRU
wlp3s0    1500    41520      0   1434 0         40349      0      0      0 BMRU

4.2 Statistics for a specific protocol (e.g. UDP)

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# netstat -su
IcmpMsg:
    InType3: 40
    OutType3: 40
Udp:
    1159 packets received
    40 packets to unknown port received
    0 packet receive errors
    576 packets sent
    0 receive buffer errors
    0 send buffer errors
    IgnoredMulti: 468
UdpLite:
IpExt:
    InMcastPkts: 370
    OutMcastPkts: 83
    InBcastPkts: 728
    OutBcastPkts: 77
    InOctets: 34596737
    OutOctets: 3791474
    InMcastOctets: 147789
    OutMcastOctets: 9349
    InBcastOctets: 311773
    OutBcastOctets: 13116
    InNoECTPkts: 39035

-----
Note:
    netstat -s: show statistics for all protocols.
    netstat -st: show TCP statistics.

5. ethtool

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# ethtool -S enp4s0
NIC statistics:
     tx_packets: 0
     rx_packets: 0
     tx_errors: 0
     rx_errors: 0
     rx_missed: 0
     align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     unicast: 0
     broadcast: 0
     multicast: 0
     tx_aborted: 0
     tx_underrun: 0

6. tc

vec@vec-virtual-machine:~$ tc -s qdisc show dev ens33
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 14607963 bytes 46168 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
  maxpacket 285 drop_overlimit 0 new_flow_count 9 ecn_mark 0
  new_flows_len 0 old_flows_len 0

tc qdisc show dev ens34
tc -s qdisc show dev ens34

tc qdisc add dev ens34 clsact

tc filter show dev ens34 ingress
tc filter del dev ens34 ingress

tc filter show dev ens34 egress
tc filter del dev ens34 egress

tc qdisc del dev ens34 clsact

II. How to analyze and fix packet loss

1. Enlarge protocol-stack buffers

### KERNEL TUNING ###

# Increase size of file handles and inode cache
fs.file-max = 2097152

# Do less swapping
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2

# Sets the time before the kernel considers migrating a process to another core
kernel.sched_migration_cost_ns = 5000000

# Group tasks by TTY
#kernel.sched_autogroup_enabled = 0

### GENERAL NETWORK SECURITY OPTIONS ###

# Number of times SYN-ACKs are retransmitted for a passive TCP connection
net.ipv4.tcp_synack_retries = 2

# Allowed local port range
net.ipv4.ip_local_port_range = 2000 65535

# Protect against TCP TIME-WAIT assassination (RFC 1337)
net.ipv4.tcp_rfc1337 = 1

# Control Syncookies
net.ipv4.tcp_syncookies = 1

# Decrease the time default value for tcp_fin_timeout connection
net.ipv4.tcp_fin_timeout = 15

# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

### TUNING NETWORK PERFORMANCE ###

# Default Socket Receive Buffer
net.core.rmem_default = 31457280

# Maximum Socket Receive Buffer
net.core.rmem_max = 33554432

# Default Socket Send Buffer
net.core.wmem_default = 31457280

# Maximum Socket Send Buffer
net.core.wmem_max = 33554432

# Increase number of incoming connections
net.core.somaxconn = 65535

# Increase the per-CPU backlog of incoming packets
net.core.netdev_max_backlog = 65536

# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824

# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.udp_mem = 65536 131072 262144

# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 8192 87380 33554432
net.ipv4.udp_rmem_min = 16384

# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 8192 65536 33554432
net.ipv4.udp_wmem_min = 16384

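# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
# Note: tcp_tw_recycle was removed in Linux 4.12 and can drop connections
# from clients behind NAT; omit it on such systems.
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1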

2. 其他可能性

2.1 防火墙拦截

root@spc:~# iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

2.2 Connection tracking table overflow

root@spc:/# conntrack -S
cpu=0           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=1           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=2           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=3           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=4           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=5           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=6           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=7           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=8           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=9           found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=10          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=11          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=12          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=13          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=14          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=15          found=0 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0

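Note

Besides DROP rules configured in the firewall itself, the connection tracking table nf_conntrack is also firewall-related. Linux creates a conntrack entry for each new connection that passes through the kernel network stack. When the server handles too many connections, the table fills up and the server drops packets of new connections.

How to confirm:

Check dmesg: if the output contains "nf_conntrack: table full, dropping packet", the nf_conntrack table has been filled. The live state of the table can be inspected with the conntrack tool or through the /proc filesystem. In this case the current connection count is far below the table maximum, so this factor is ruled out.

How to fix:

If you confirm that the server drops packets because of conntrack table overflow, first inspect the actual connections to determine whether the server is under a DoS attack. If the load is legitimate business traffic, consider tuning the nf_conntrack parameters:

nf_conntrack_max sets the size of the connection tracking table. The default cited here is 65535 (the actual default depends on kernel version and memory size). A reasonable value can be computed from system memory: CONNTRACK_MAX = RAMSIZE(in bytes)/16384/(ARCH/32); for example, 1048576 for a 64-bit system with 32G of memory.

nf_conntrack_buckets sets the size of the hash table that stores conntrack entries; the default is 1/4 of nf_conntrack_max. Following the same rule, BUCKETS = CONNTRACK_MAX/4; for example, 262144 for 32G of memory.

nf_conntrack_tcp_timeout_established sets the timeout of connections in the ESTABLISHED state; the default is 5 days, and it can be shortened to 1 hour, i.e. 3600.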
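
The sizing rule above is plain integer arithmetic, sketched below. The function names are illustrative; the formula is the one quoted in the text, and treating 32G as 32 * 2**30 bytes on a 64-bit system is an assumption of the example.

```python
def conntrack_max(ram_bytes, arch_bits=64):
    """CONNTRACK_MAX = RAMSIZE(in bytes) / 16384 / (ARCH / 32)."""
    return ram_bytes // 16384 // (arch_bits // 32)


def conntrack_buckets(max_entries):
    """BUCKETS = CONNTRACK_MAX / 4."""
    return max_entries // 4


ram = 32 * 2**30  # 32 GiB, assumed for the example
max_entries = conntrack_max(ram)
print(max_entries, conntrack_buckets(max_entries))
```

For 32 GiB on a 64-bit machine this gives 1048576 entries and 262144 buckets, matching the values in the text.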

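2.3 NIC ring buffer overflow (NIC driver statistics)

How to confirm:

The number of dropped packets can be seen with ifconfig, ip, netstat -i, ethtool -S, or /proc/net/dev.

How to fix:

Adjust the relevant buffers in the NIC driver (for example, the RX ring size with ethtool -G, if the driver supports it).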

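2.4 netdev_max_backlog overflow

Overflow information is available in /proc/net/softnet_stat.

netdev_max_backlog bounds the queue that buffers packets after the kernel receives them from the NIC and before the protocol stack (e.g. IP, TCP) processes them. Each CPU core has its own backlog queue. As with the ring buffer, when packets arrive faster than the kernel protocol stack can process them, the CPU's backlog queue keeps growing, and once it reaches the configured netdev_max_backlog value, packets are dropped.

How to confirm:

Check /proc/net/softnet_stat to determine whether the netdev backlog queue has overflowed:

Each line holds the statistics of one CPU core, starting from CPU0 and going down, and each column is one statistic for that core. The first column is the total number of packets received by the interrupt handler; the second column is the total number of packets dropped because the netdev_max_backlog queue overflowed.

How to fix:

The default netdev_max_backlog is 1000. On high-speed links the second column above may become non-zero; raising the kernel parameter net.core.netdev_max_backlog, for example to 2000, can resolve this.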

2.5 Reverse path filtering

root@spc:/# sysctl -a | grep rp_filter
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.enp4s0.arp_filter = 0
net.ipv4.conf.enp4s0.rp_filter = 2
net.ipv4.conf.lo.arp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.wlp3s0.arp_filter = 0
net.ipv4.conf.wlp3s0.rp_filter = 2


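Note:
Reverse path filtering is a mechanism in which Linux performs a reverse route lookup to check whether the source IP of a received packet is routable (loose mode) or reachable via the best route (strict mode). Packets that fail the check are dropped. It is designed to defend against IP address spoofing. rp_filter offers three modes:

0 - No validation.
1 - Strict mode as defined in RFC 3704: for each received packet, look up the reverse route; if the interface the packet arrived on differs from the reverse route's outgoing interface, the check fails.
2 - Loose mode as defined in RFC 3704: for each received packet, look up the reverse route; if the source is not reachable through any interface, the check fails.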
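
The three modes can be illustrated with a small model. This is a simplified sketch, not the kernel's implementation: it uses a flat list of (prefix, interface) routes with longest-prefix matching, and the function name rp_filter_accepts, the interface names eth0/eth1, and the addresses are all hypothetical.

```python
import ipaddress


def rp_filter_accepts(mode, src_ip, in_iface, routes):
    """Model the rp_filter decision for one received packet.

    mode:     0 (off), 1 (strict) or 2 (loose)
    routes:   list of (cidr, iface) pairs standing in for the routing table
    """
    if mode == 0:
        return True  # no validation at all
    src = ipaddress.ip_address(src_ip)
    matches = []
    for cidr, iface in routes:
        net = ipaddress.ip_network(cidr)
        if src in net:
            matches.append((net.prefixlen, iface))
    if not matches:
        return False  # source not reachable through any interface
    if mode == 2:
        return True  # loose: any route back to the source is enough
    # Strict: the best (longest-prefix) route must point out the ingress interface.
    _, best_iface = max(matches)
    return best_iface == in_iface


routes = [("10.0.0.0/24", "eth0"), ("0.0.0.0/0", "eth1")]
print(rp_filter_accepts(1, "10.0.0.5", "eth1", routes))  # strict mode
print(rp_filter_accepts(2, "10.0.0.5", "eth1", routes))  # loose mode
```

In this model a packet from 10.0.0.5 arriving on eth1 fails strict mode, because the best route back to it leaves through eth0, but passes loose mode. This is the asymmetric-routing case the section describes.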

View and configure: sysctl -a | grep rp_filter
More on rp_filter: https://www.cnblogs.com/lipengxiang2009/p/7446388.html

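How to confirm:

Check the current rp_filter setting. If it is set to 1, examine the host's network environment and routing policy to see whether inbound client packets could fail reverse path validation. In principle this mechanism works at the network layer, so if the client can ping the server, this factor can be ruled out.

How to fix:

Modify the routing table so that response packets leave through eth1, i.e. make sure the request enters and the response leaves through the same NIC.

Or disable the rp_filter parameter (note that both the all and default entries must be changed), in one of these ways:

Edit /etc/sysctl.conf, then run sysctl -p to load it.

Write it directly with sysctl -w: sysctl -w net.ipv4.conf.all.rp_filter=0

Write to the /proc filesystem: echo "0" > /proc/sys/net/ipv4/conf/all/rp_filter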

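2.6 Half-open connection (SYN) queue overflow

The half-open queue holds TCP connections for which the server has received a SYN but the three-way handshake has not yet completed. Its size is defined by the kernel parameter tcp_max_syn_backlog.
When the number of half-open connections held by the server reaches tcp_max_syn_backlog, the kernel drops newly arriving SYN packets.

How to confirm:

Check dmesg: if the output contains "TCP: drop open request from", the half-open queue is full. The number of half-open connections can be obtained by counting connections in the SYN_RECV state with netstat. In most cases this value should be 0 or very small, because a connection enters the half-open state when the first handshake step completes and leaves it when the third step completes, which happens quickly on a healthy network. If the value is large, the server is very likely under a SYN flood attack.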
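
Counting SYN_RECV connections from netstat output can be sketched as follows. The function name count_tcp_states is illustrative, and the sketch assumes the usual netstat -ant layout in which the state is the sixth whitespace-separated field of each tcp line.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from `netstat -ant` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data lines start with "tcp" or "tcp6"; title and header lines do not.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


def half_open_count(netstat_output):
    """Number of connections currently in the SYN_RECV state."""
    return count_tcp_states(netstat_output)["SYN_RECV"]
```

Feeding it the text captured from netstat -ant gives the current half-open count, which can then be compared with tcp_max_syn_backlog.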

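How to fix:

The default tcp_max_syn_backlog is 256. The usual recommendation is to raise it to 1024 on servers with more than 128 MB of memory and lower it to 128 on servers with less than 32 MB. Like the other parameters, it is changed through sysctl.

This behavior is also affected by the kernel parameter tcp_syncookies. With syncookies enabled, when the half-open queue overflows the kernel does not drop the SYN outright but replies with a SYN+ACK carrying a syncookie. The mechanism is designed to keep legitimate requests served during a SYN flood.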

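2.7 PAWS: controlled by the kernel parameter /proc/sys/net/ipv4/tcp_tw_recycle

PAWS stands for Protection Against Wrapped Sequences. Its purpose is to solve the problem that, at high bandwidth, TCP sequence numbers may be reused within a single connection.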
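
To see why wrapping matters at high bandwidth, a rough calculation helps: the TCP sequence number is 32 bits wide and counts bytes, so it wraps after 2^32 bytes. The sketch below computes the wrap time at a given line rate. The function name is illustrative, and the model assumes the link is fully used by a single connection.

```python
SEQ_SPACE_BYTES = 2 ** 32  # a 32-bit sequence number counts bytes


def seq_wrap_seconds(bits_per_second):
    """Seconds until the sequence space wraps at a sustained line rate."""
    return SEQ_SPACE_BYTES * 8 / bits_per_second


for label, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
    print(f"{label}: {seq_wrap_seconds(rate):.1f} s")
```

At 1 Gbit/s the sequence space wraps in roughly 34 seconds and at 10 Gbit/s in under 4 seconds, so an old delayed segment could carry a sequence number that looks valid again. PAWS uses TCP timestamps to reject such segments.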

3. dropwatch

root@spc:/home/vec/dev_document/dropwatch/drop_watch/src# sudo ./dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
9 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
1 drops at __netif_receive_skb_core+14f (0xffffffffac73612f)
4 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
9 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
4 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
2 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
1 drops at __netif_receive_skb_core+14f (0xffffffffac73612f)
4 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
4 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
5 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
4 drops at __init_scratch_end+1243f1ee (0xffffffffc083f1ee)
1 drops at __netif_receive_skb_core+14f (0xffffffffac73612f)
1 drops at ip_rcv_finish_core.isra.0+1b2 (0xffffffffac7affa2)
^CGot a stop message
dropwatch>
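
The per-event lines printed by dropwatch are easier to read once summed per drop location. The following is a minimal sketch that assumes the "N drops at symbol+offset (address)" line format shown in the transcript above; the function name aggregate_drops is illustrative.

```python
import re
from collections import Counter

# Matches lines such as: "9 drops at some_symbol+1f (0xffffffffc083f1ee)"
DROP_LINE = re.compile(r"^(\d+) drops at (\S+) \((0x[0-9a-fA-F]+)\)$")


def aggregate_drops(output):
    """Sum the drop counts per 'symbol+offset' location."""
    totals = Counter()
    for line in output.splitlines():
        match = DROP_LINE.match(line.strip())
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


def top_drop_locations(output, n=3):
    """Return the n locations with the most drops, highest first."""
    return aggregate_drops(output).most_common(n)
```

Sorting the totals shows which kernel location accounts for most drops, which is usually the first place to look.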

4. perf

sudo perf record -g -a -e skb:kfree_skb
sudo perf script