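DPVS-5: Backend Service Monitoring, Principles and Testing
How backend monitoring works
Passive detection
DPVS has built-in passive monitoring. It decides whether a backend server is available by watching how that backend responds to client requests.
Passive detection cannot obtain detailed information about the backend server. It infers availability only from dropped or refused connections. A backend is reported as failed when one of the following events occurs: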
TCP session state transfers from SYN_RECV to CLOSE
TCP session state transfers from SYN_SENT to CLOSE
TCP session expired from SYN_RECV state
TCP session expired from SYN_SENT state
TCP synproxy step 3 received RST when awaiting SYN/ACK from backend
UDP session expired from ONEWAY state
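As a rough illustration, the six trigger events above can be written as a lookup table in bash. The event labels (`transition`, `expire`, `synproxy_rst`, `AWAIT_SYNACK`) and the function name `is_down_trigger` are hypothetical names chosen for this sketch. They are not DPVS identifiers. Going by the list above, only failures during connection setup (or a one-way UDP session) count. A normal teardown such as ESTABLISHED to CLOSE does not.

```shell
#!/usr/bin/env bash
# Sketch: the passive-detection triggers listed above as a lookup table.
# The labels below (transition, expire, synproxy_rst, AWAIT_SYNACK) are
# hypothetical names for this example, not identifiers from DPVS.
#
# is_down_trigger PROTO KIND STATE [NEXT_STATE]
# Prints "yes" if the event is one of the listed triggers, "no" otherwise.
is_down_trigger() {
    local key="$1:$2:$3:${4:-}"
    case "$key" in
        tcp:transition:SYN_RECV:CLOSE)  echo yes ;;  # SYN_RECV -> CLOSE
        tcp:transition:SYN_SENT:CLOSE)  echo yes ;;  # SYN_SENT -> CLOSE
        tcp:expire:SYN_RECV:)           echo yes ;;  # expired in SYN_RECV
        tcp:expire:SYN_SENT:)           echo yes ;;  # expired in SYN_SENT
        tcp:synproxy_rst:AWAIT_SYNACK:) echo yes ;;  # synproxy step 3 got RST
        udp:expire:ONEWAY:)             echo yes ;;  # UDP expired in ONEWAY
        *)                              echo no ;;   # anything else, e.g. normal teardown
    esac
}
```

For example, `is_down_trigger tcp transition SYN_RECV CLOSE` prints `yes`, and `is_down_trigger tcp transition ESTABLISHED CLOSE` prints `no`.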
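Backend state transitions
An RS (real server) has four states: UP, DOWN, DOWN-WAIT and UP-WARM.
UP: accepts traffic
DOWN: does not accept traffic
DOWN-WAIT: accepts traffic
UP-WARM: accepts traffic
The overall state transitions are shown in the figure below.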

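Detailed transition logic
At a finer level, the transitions are split between the master lcore (the control-plane worker) and the slave lcores (the data-plane workers). The figure below is the official diagram.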

 
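Backend failure flow:
A slave detects a failure, enters DOWN-WAIT and sends a Down notice to the master. The master also enters DOWN-WAIT.
When the number of Down notices the master has received reaches the threshold (default 1), the master enters DOWN and broadcasts a Close notice to all slaves. Every slave then enters DOWN, so the backend is DOWN on every lcore.
Recovery attempt flow:
On entering DOWN, the master lcore starts an inhibit timer. When the timer expires, the master broadcasts an Open notice to all slaves and they enter UP-WARM. At this point client requests can again be scheduled to this backend.
If a slave detects that the backend is still unavailable, the failure flow above starts over, and the next inhibit time is doubled.
Successful recovery flow:
The backend is in UP-WARM. Once a slave detects that it is available and the success count reaches the threshold (default 1), the slave enters UP and sends an Up notice to the master. The backend is then UP on all lcores.
Other cases:
A slave in UP that receives a Close notice goes straight to DOWN. The Close notice comes from the master. The master can also take a backend DOWN directly, for example in response to an external control command.
The messages involved are:
Down notice: slave to master, unicast
Up notice: slave to master, unicast
Close notice: master to slaves, multicast
Open notice: master to slaves, multicast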
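The per-slave transitions described above can be summarized as a small state machine. This is a bash sketch that makes several simplifying assumptions. The event names (`detect_down`, `detect_up`, `close_notice`, `open_notice`) and the function names are hypothetical. The master's Down-notice counting and inhibit timer are not modeled. The UP threshold defaults to 1, as stated above.

```shell
#!/usr/bin/env bash
# Sketch of the per-slave RS state machine described above.
# Event names are hypothetical labels:
#   detect_down  - this slave's passive check saw a failure
#   detect_up    - this slave saw a successful connection while in UP-WARM
#   close_notice - Close notice received from the master
#   open_notice  - Open notice received from the master (inhibit timer expired)
# Not modeled: the master's Down-notice counting and inhibit timer.

# rs_next_state STATE EVENT [UP_COUNT] [UP_THRESHOLD]
# Prints the slave's next state. UP_COUNT/UP_THRESHOLD only matter for
# detect_up; both default to 1 (the default threshold stated above).
rs_next_state() {
    local state="$1" event="$2" up_count="${3:-1}" up_threshold="${4:-1}"
    case "$state:$event" in
        UP:detect_down | UP-WARM:detect_down)
            echo DOWN-WAIT ;;            # a Down notice goes to the master
        *:close_notice)
            echo DOWN ;;                 # master decided: stop scheduling to it
        DOWN:open_notice)
            echo UP-WARM ;;              # inhibit time is over, try it again
        UP-WARM:detect_up)
            if (( up_count >= up_threshold )); then
                echo UP                  # an Up notice goes to the master
            else
                echo UP-WARM
            fi ;;
        *)
            echo "$state" ;;             # any other event leaves the state unchanged
    esac
}

# rs_allows_traffic STATE: prints "yes" if new requests may be scheduled to the RS.
rs_allows_traffic() {
    case "$1" in
        DOWN) echo no ;;
        UP | DOWN-WAIT | UP-WARM) echo yes ;;
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
}
```

Feeding it the events detect_down, close_notice, open_notice and detect_down, starting from UP, gives DOWN-WAIT, DOWN, UP-WARM, DOWN-WAIT. This is the loop that a backend which stays down keeps going through.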
Monitoring tests
Without backend detection
This reuses the two-arm configuration from the previous article:
https://blog.csdn.net/jacicson1987/article/details/145803532
# Add the VIP
./dpip addr add 10.0.0.100/32 dev dpdk0
# Add the load-balancing service with round-robin scheduling
./ipvsadm -A -t 10.0.0.100:80 -s rr
# Add 3 RSs in FULLNAT mode
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.3:80 -b
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.4:80 -b
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.5:80 -b
# Add a LOCAL IP on dpdk1 for the service 10.0.0.100:80
./ipvsadm --add-laddr -z 192.168.100.200 -t 10.0.0.100:80 -F dpdk1
# Add routes
./dpip route add 10.0.0.0/16 dev dpdk0
./dpip route add 192.168.100.0/24 dev dpdk1
Requests are served normally:
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 1 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 0 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 1 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 1 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 0 !
[root@dkdp192 ~]# 
Stop nginx on Server 0:
root@ubuntu22:~# systemctl stop nginx 
root@ubuntu22:~# 
Testing again shows that round-robin still schedules requests to the stopped Server 0, and those requests get "Connection refused".
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 1 !
[root@dkdp192 ~]# curl 10.0.0.100:80
curl: (7) Failed to connect to 10.0.0.100 port 80: Connection refused
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 1 !
[root@dkdp192 ~]# curl 10.0.0.100:80
curl: (7) Failed to connect to 10.0.0.100 port 80: Connection refused
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
[root@dkdp192 ~]# curl 10.0.0.100:80
This is Server 2 !
With passive backend detection
Configuration:
./dpip addr add 10.0.0.100/32 dev dpdk0
# Enable passive detection when adding the load-balancing service
./ipvsadm -A -t 10.0.0.100:80 -s rr --dest-check default
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.3:80 -b
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.4:80 -b
./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.5:80 -b
./ipvsadm --add-laddr -z 192.168.100.200 -t 10.0.0.100:80 -F dpdk1
 
./dpip route add 10.0.0.0/16 dev dpdk0
./dpip route add 192.168.100.0/24 dev dpdk1
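The two configurations differ only in `--dest-check default`. A small bash function can generate either one so you do not have to retype the commands. It only prints the exact dpip/ipvsadm commands used in this article, with the same addresses, interfaces and flags, so you can review them before running. The function name `print_dpvs_setup` is hypothetical. Passing `default` as the first argument adds `--dest-check default` to the service.

```shell
#!/usr/bin/env bash
# print_dpvs_setup [DEST_CHECK]
# Prints (does not run) the dpip/ipvsadm commands from this article.
# With an argument, e.g. "default", the service is created with --dest-check.
print_dpvs_setup() {
    local dest_check="${1:-}"
    local vip="10.0.0.100" port=80 laddr="192.168.100.200"
    local rs_list=(192.168.100.3 192.168.100.4 192.168.100.5)
    local rs

    echo "./dpip addr add ${vip}/32 dev dpdk0"
    if [[ -n "$dest_check" ]]; then
        echo "./ipvsadm -A -t ${vip}:${port} -s rr --dest-check ${dest_check}"
    else
        echo "./ipvsadm -A -t ${vip}:${port} -s rr"
    fi
    for rs in "${rs_list[@]}"; do          # one FULLNAT real server per entry
        echo "./ipvsadm -a -t ${vip}:${port} -r ${rs}:${port} -b"
    done
    echo "./ipvsadm --add-laddr -z ${laddr} -t ${vip}:${port} -F dpdk1"
    echo "./dpip route add 10.0.0.0/16 dev dpdk0"
    echo "./dpip route add 192.168.100.0/24 dev dpdk1"
}
```

For example, in the dpvs/bin directory, write `print_dpvs_setup default` to a file, review it, and then run it with bash.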
The RSs are configured:
root@r750-132:~/dpvs/bin# ./ipvsadm  -ln
IP Virtual Server version 1.9.8 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.100:80 rr dest-check internal:default
  -> 192.168.100.3:80             FullNat 1      0          0         
  -> 192.168.100.4:80             FullNat 1      0          0         
  -> 192.168.100.5:80             FullNat 1      0          0   
Test script
It sends one request per second and prints the response. A line with only the request number means the request failed.
#!/bin/bash
URL="http://10.0.0.100:80"
for i in {1..200}
do
  echo -n "$i: "
  response=$(curl -s "$URL")  # store curl's output in the variable response
  if [ -z "$response" ]; then  # an empty response means the request failed
    echo  # print only a newline
  else
    echo "$response"  # otherwise print the response
  fi
  sleep 1
done
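A failed request shows up as a line containing only its number, so you can extract the failures from the script's output automatically. This is a minimal sketch. It assumes you saved the output to a file, for example with `./long_curl.sh | tee curl.log` (the file name is hypothetical). `failure_gaps` is a hypothetical helper that prints each failed request number and the gap since the previous failure. With one request per second, the gap approximates seconds, although curl latency adds a little.

```shell
#!/usr/bin/env bash
# failure_gaps: reads long_curl.sh output on stdin and prints every request
# number that got no response, plus the gap since the previous failure.
# A failed request appears as a line holding only "N: " (see the script above).
failure_gaps() {
    awk '
        /^[0-9]+:[ \t]*$/ {
            n = substr($0, 1, index($0, ":") - 1) + 0   # request number
            if (prev > 0) printf "%d (+%d)\n", n, n - prev
            else          printf "%d\n", n
            prev = n
        }
    '
}
```

For the failure run below, `failure_gaps < curl.log` prints 10, 16 (+6), 26 (+10), 47 (+21) and 89 (+42).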
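Failure test
Start the test script first, then stop nginx on Server 0.
At second 10, the request to Server 0 fails.
At second 16 (+6s), the request to Server 0 fails.
At second 26 (+10s), the request to Server 0 fails.
At second 47 (+21s), the request to Server 0 fails.
At second 89 (+42s), the request to Server 0 fails.
This shows that the inhibit time for a DOWN server grows exponentially. In practice it starts at 5s and is capped at 3600s.
Throughout this test, the backend keeps cycling through DOWN, UP-WARM and DOWN-WAIT.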
[root@dkdp192 ~]# ./long_curl.sh 
1: This is Server 0 !
2: This is Server 0 !
3: This is Server 1 !
4: This is Server 0 !
5: This is Server 1 !
6: This is Server 2 !
7: This is Server 2 !
8: This is Server 1 !
9: This is Server 1 !
10: 
11: This is Server 2 !
12: This is Server 2 !
13: This is Server 2 !
14: This is Server 2 !
15: This is Server 2 !
16: 
17: This is Server 2 !
18: This is Server 2 !
19: This is Server 1 !
20: This is Server 1 !
21: This is Server 1 !
22: This is Server 1 !
23: This is Server 1 !
24: This is Server 2 !
25: This is Server 2 !
26: 
27: This is Server 2 !
28: This is Server 1 !
29: This is Server 2 !
30: This is Server 2 !
31: This is Server 1 !
32: This is Server 1 !
33: This is Server 1 !
34: This is Server 2 !
35: This is Server 1 !
36: This is Server 2 !
37: This is Server 1 !
38: This is Server 1 !
39: This is Server 2 !
40: This is Server 1 !
41: This is Server 2 !
42: This is Server 2 !
43: This is Server 1 !
44: This is Server 2 !
45: This is Server 1 !
46: This is Server 2 !
47: 
48: This is Server 2 !
49: This is Server 1 !
50: This is Server 1 !
51: This is Server 2 !
52: This is Server 1 !
53: This is Server 2 !
54: This is Server 1 !
55: This is Server 2 !
56: This is Server 1 !
57: This is Server 1 !
58: This is Server 2 !
59: This is Server 1 !
60: This is Server 2 !
61: This is Server 2 !
62: This is Server 1 !
63: This is Server 2 !
64: This is Server 2 !
65: This is Server 2 !
66: This is Server 1 !
67: This is Server 2 !
68: This is Server 1 !
69: This is Server 1 !
70: This is Server 2 !
71: This is Server 1 !
72: This is Server 2 !
73: This is Server 2 !
74: This is Server 1 !
75: This is Server 2 !
76: This is Server 1 !
77: This is Server 1 !
78: This is Server 1 !
79: This is Server 2 !
80: This is Server 1 !
81: This is Server 1 !
82: This is Server 2 !
83: This is Server 1 !
84: This is Server 2 !
85: This is Server 2 !
86: This is Server 2 !
87: This is Server 1 !
88: This is Server 1 !
89: 
90: This is Server 1 !
91: This is Server 2 !
92: This is Server 1 !
93: This is Server 1 !
94: This is Server 2 !
95: This is Server 2 !
96: This is Server 2 !
97: This is Server 2 !
98: This is Server 2 !
99: This is Server 1 !
100: This is Server 2 !
101: This is Server 2 !
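A short calculation confirms the doubling. This sketch assumes the schedule stated above: start at 5s, double after each failed recovery attempt, and cap at 3600s. `inhibit_schedule` is a hypothetical helper. Each observed gap (+6s, +10s, +21s, +42s) is equal to or slightly longer than the matching inhibit time of 5, 10, 20 or 40 seconds. The extra second plausibly comes from waiting for round-robin to pick Server 0 again after it re-enters UP-WARM.

```shell
#!/usr/bin/env bash
# inhibit_schedule N [MIN] [MAX]
# Prints the first N inhibit durations (in seconds): start at MIN, double after
# each failed recovery attempt, never exceed MAX. The defaults 5 and 3600 are
# the values stated in this article, not read from DPVS.
inhibit_schedule() {
    local n="$1" d="${2:-5}" max="${3:-3600}" i
    for (( i = 0; i < n; i++ )); do
        echo "$d"
        (( d = d * 2 > max ? max : d * 2 ))   # double, but clamp to MAX
    done
}
```

`inhibit_schedule 4` prints 5, 10, 20 and 40, one per line. With the cap, the 11th and later values stay at 3600.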
Server status
Server 0 is in the inhibited state (weight 0):
root@r750-132:~/dpvs/bin# ./ipvsadm  -ln
IP Virtual Server version 1.9.8 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.100:80 rr dest-check internal:default
  -> 192.168.100.3:80             FullNat 0      0          0          inhibited
  -> 192.168.100.4:80             FullNat 1      0          4         
  -> 192.168.100.5:80             FullNat 1      0          4   
Logs
The logs match the test results exactly: the inhibit time of the failed backend is likewise 5s, 10s, 20s, 40s, …
SERVICE: [cid 07, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 0] detect dest DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited yes, down_notice_recvd 1, inhibit_duration 5s, origin_weight 0] notify slaves DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 0, inhibited yes, down_notice_recvd 1, inhibit_duration 10s, origin_weight 1] notify slaves UP
SERVICE: [cid 06, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited yes, down_notice_recvd 1, inhibit_duration 10s, origin_weight 0] notify slaves DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 0, inhibited yes, down_notice_recvd 1, inhibit_duration 20s, origin_weight 1] notify slaves UP
SERVICE: [cid 04, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited yes, down_notice_recvd 1, inhibit_duration 20s, origin_weight 0] notify slaves DOWN
SERVICE: [cid 00, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 0, inhibited yes, down_notice_recvd 1, inhibit_duration 40s, origin_weight 1] notify slaves UP
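You can also confirm the doubling from the logs instead of the client side by extracting the inhibit_duration on each "notify slaves DOWN" line. This is a minimal sketch. It assumes the SERVICE log lines have the format shown above, and the helper name and log file path are hypothetical. In the excerpt above, each "notify slaves UP" line already carries the doubled value for the next round, so the sketch reads only the DOWN lines.

```shell
#!/usr/bin/env bash
# down_inhibit_durations: reads DPVS "SERVICE:" log lines on stdin and prints
# the inhibit_duration field of every "notify slaves DOWN" line, in order.
# Assumes the log line format shown above.
down_inhibit_durations() {
    grep 'notify slaves DOWN' |
        sed -n 's/.*inhibit_duration \([0-9]*s\).*/\1/p'
}
```

For the log excerpt above, `down_inhibit_durations < dpvs.log` would print 5s, 10s and 20s.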
Recovery test
Start nginx on Server 0:
[root@dkdp192 ~]# ./long_curl.sh 
1: This is Server 2 !
2: This is Server 2 !
3: 
4: This is Server 2 !
5: This is Server 2 !
6: This is Server 1 !
7: This is Server 2 !
8: This is Server 2 !
9: This is Server 1 !
10: This is Server 1 !
11: This is Server 2 !
12: This is Server 2 !
13: This is Server 1 !
14: This is Server 1 !
15: This is Server 1 !
16: This is Server 2 !
17: This is Server 2 !
18: This is Server 1 !
19: This is Server 1 !
20: This is Server 1 !
21: This is Server 2 !
22: This is Server 2 !
23: This is Server 2 !
24: This is Server 1 !
25: This is Server 1 !
26: This is Server 0 !
27: This is Server 2 !
28: This is Server 0 !
29: This is Server 0 !
30: This is Server 1 !
31: This is Server 1 !
32: This is Server 1 !
33: This is Server 0 !
Server status
Once Server 0 is reachable again, it is restored (weight back to 1):
root@r750-132:~/dpvs/bin# ./ipvsadm  -ln
IP Virtual Server version 1.9.8 (size=0)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.100:80 rr dest-check internal:default
  -> 192.168.100.3:80             FullNat 1      0          3         
  -> 192.168.100.4:80             FullNat 1      0          3         
  -> 192.168.100.5:80             FullNat 1      0          1 
Logs
Several slave workers detect that the backend has recovered:
SERVICE: [cid 06, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest UP
SERVICE: [cid 04, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest UP
SERVICE: [cid 07, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest UP
SERVICE: [cid 08, tcp, svc 10.0.0.100:80, rs 192.168.100.3:80, weight 1, inhibited no, warm_up_count 1] detect dest UP
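Each slave lcore in UP-WARM makes its own detection, which is why several cids report "detect dest UP" above, each with warm_up_count 1 (the default threshold). This small sketch lists which workers did so. It assumes the log format shown above, and `up_detectors` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# up_detectors: reads DPVS "SERVICE:" log lines on stdin and prints, once each
# and sorted, the cid (worker lcore id) of every "detect dest UP" line.
up_detectors() {
    grep 'detect dest UP' |
        sed -n 's/.*\[cid \([0-9]*\),.*/\1/p' |
        sort -u
}
```

For the four log lines above, it prints 04, 06, 07 and 08.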
