
Network virtualization: veth, bridge, network namespaces, and Docker networking

Table of Contents

  • Network virtualization vs. virtual networks
  • veth
    • veth communication experiment
    • Source code implementation
      • Kernel module initialization
      • Device setup
      • Device creation
      • Data transmission and reception
  • network namespace and VRF
    • Namespace communication experiment
    • Source code implementation
      • Data structures
      • Creating the default namespace
      • Namespace creation for new processes
      • Per-namespace subsystem initialization
        • A register_pernet_subsys example
        • A register_pernet_device example
      • Initializing a newly created namespace
      • Moving a device to another namespace
      • The namespace a socket belongs to
      • Socket programming across namespaces
      • Namespaces on the send/receive path
    • VRF and network namespace
    • How VM and container namespace interconnection differs
  • bridge
    • bridge communication experiment
    • Source code implementation
      • Initialization and bridge creation
      • Adding a device to the bridge
        • Attaching the device to the bridge
        • Registering the frame receive handler
      • Packet processing
        • Bridge packet handling
        • Forwarding flow
    • Bridge VLAN support
  • Docker networking
    • Container-to-external communication
  • Extension 1: Open vSwitch and OpenFlow
    • OpenFlow
  • Extension 2: the k8s network model, CNI, and Cilium
  • Extension 3: OpenStack
  • References
    • "跟唐老师学习云网络" series
    • "开发内功修炼" series
    • Geek Time (极客时间) series
    • Others

Network virtualization vs. virtual networks

Q: What is the difference between network virtualization and a virtual network?
GPT A: Network virtualization and virtual networks are related but clearly distinct concepts:

  1. Network virtualization

Definition: a technology that abstracts, consolidates, or partitions physical network resources (switches, routers, links, bandwidth, etc.) into one or more logical networks (virtual networks).
Goal: allow network resources to be allocated, isolated, and managed as flexibly as compute resources, improving the network's flexibility, scalability, and resource utilization.
Examples: SDN (software-defined networking), VXLAN, NVGRE, OpenFlow.
Characteristics:
  • Multiple mutually isolated virtual networks can be created on the same physical network.
  • Supports advanced capabilities such as dynamic adjustment, migration, and automated deployment.
  • Management and control are usually centralized in a controller.

  2. Virtual network

Definition: a logical network entity built on top of a physical network, with its own topology, address space, and routing rules.
Goal: provide users or applications with a logically independent, isolated network environment.
Examples: VLAN (virtual LAN), VPN (virtual private network), VPC (Virtual Private Cloud) in public clouds.
Characteristics:
  • Takes the "network consumer" perspective, giving tenants or applications a dedicated network space.
  • A virtual network can be realized using network virtualization technology.
  • Implementations range from simple (VLAN) to complex (overlay networks).

Summary of the differences:

Network virtualization is the means and underlying technology, focused on abstracting, isolating, and managing resources.
A virtual network is the product or manifestation of network virtualization, focused on providing a virtual network environment to users or applications.
Network virtualization is the "how"; a virtual network is the "result".

veth

veth communication experiment

$ sudo ip link add veth0 type veth peer name veth1
$ sudo ip addr add 192.168.10.1/24 dev veth0
$ sudo ip addr add 192.168.10.2/24 dev veth1
$ sudo ip link set veth0 up
$ sudo ip link set veth1 up
$ ip addr
#...
4: veth1@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.2/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::2068:39ff:fe16:5ba3/64 scope link
       valid_lft forever preferred_lft forever
5: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 9e:72:41:a4:b6:fc brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.1/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9c72:41ff:fea4:b6fc/64 scope link
       valid_lft forever preferred_lft forever

Then apply the necessary settings (disable reverse-path filtering and accept locally originated packets):

#  echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
#  echo 0 > /proc/sys/net/ipv4/conf/veth0/rp_filter
#  echo 0 > /proc/sys/net/ipv4/conf/veth1/rp_filter
#  echo 1 > /proc/sys/net/ipv4/conf/veth1/accept_local
#  echo 1 > /proc/sys/net/ipv4/conf/veth0/accept_local

Try pinging:

# ping 192.168.10.2 -I veth0
PING 192.168.10.2 (192.168.10.2) from 192.168.10.1 veth0: 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=0.041 ms
64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=0.049 ms
^C
--- 192.168.10.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3073ms
rtt min/avg/max/mdev = 0.033/0.047/0.067/0.012 ms

As you can see, a veth device looks much like any other NIC; what makes it special is that it always comes as a pair.

Source code implementation

Kernel module initialization

veth is implemented as a kernel module:

# modinfo veth
filename:       /lib/modules/6.11.0-28-generic/kernel/drivers/net/veth.ko.zst
alias:          rtnl-link-veth
license:        GPL v2
description:    Virtual Ethernet Tunnel

The source lives in drivers/net/veth.c:


module_init(veth_init);

static __init int veth_init(void)
{
	return rtnl_link_register(&veth_link_ops);
}

static struct rtnl_link_ops veth_link_ops = {
	.kind		= DRV_NAME,
	.priv_size	= sizeof(struct veth_priv),
	.setup		= veth_setup,
	.validate	= veth_validate,
	.newlink	= veth_newlink,
	.dellink	= veth_dellink,
	.policy		= veth_policy,
	.maxtype	= VETH_INFO_MAX,
	.get_link_net	= veth_get_link_net,
};

It calls rtnl_link_register to add veth_link_ops to the global link_ops list. When a new link is requested (for example, userspace sends an RTM_NEWLINK netlink message), rtnl_newlink invokes the registered ops to create and configure the device. Other virtual NICs, such as Open vSwitch internal ports and tun/tap devices, register themselves through rtnl_link_register in the same way, so any virtual device type that can be added on the fly should respond to RTM_NEWLINK like this.

Device setup

Inside rtnl_newlink, the kernel first (indirectly) calls rtnl_create_link, which in turn calls alloc_netdev_mqs:

static int __rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
			  struct nlattr **attr, struct netlink_ext_ack *extack)
{
//...
	dev = rtnl_create_link(link_net ? : dest_net, ifname,
			       name_assign_type, ops, tb, extack);
	if (IS_ERR(dev)) {
		err = PTR_ERR(dev);
		goto out;
	}

	dev->ifindex = ifm->ifi_index;

	if (ops->newlink) {
		err = ops->newlink(link_net ? : net, dev, tb, data, extack);
//...
}

struct net_device *rtnl_create_link(struct net *net, const char *ifname,
				    unsigned char name_assign_type,
				    const struct rtnl_link_ops *ops,
				    struct nlattr *tb[],
				    struct netlink_ext_ack *extack)
{
//...
	dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type,
			       ops->setup, num_tx_queues, num_rx_queues);
	if (!dev)
		return ERR_PTR(-ENOMEM);

	dev_net_set(dev, net);
	dev->rtnl_link_ops = ops;
	dev->rtnl_link_state = RTNL_LINK_INITIALIZING;
//...
}

/**
 * alloc_netdev_mqs - allocate network device
 * @sizeof_priv: size of private data to allocate space for
 * @name: device name format string
 * @name_assign_type: origin of device name
 * @setup: callback to initialize device
 * @txqs: the number of TX subqueues to allocate
 * @rxqs: the number of RX subqueues to allocate
 *
 * Allocates a struct net_device with private data area for driver use
 * and performs basic initialization.  Also allocates subqueue structs
 * for each queue on the device.
 */
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
				    unsigned char name_assign_type,
				    void (*setup)(struct net_device *),
				    unsigned int txqs, unsigned int rxqs)
{
//...
	setup(dev);
//...
}

So ops->setup is invoked; for veth the setup callback is:

static void veth_setup(struct net_device *dev)
{
	ether_setup(dev);
//...
	dev->netdev_ops = &veth_netdev_ops;
	dev->ethtool_ops = &veth_ethtool_ops;
//...
}

Device creation

After rtnl_create_link returns, the newlink callback is invoked:


static int veth_newlink(struct net *src_net, struct net_device *dev,
			struct nlattr *tb[], struct nlattr *data[],
			struct netlink_ext_ack *extack)
{
	int err;
	struct net_device *peer;
	struct veth_priv *priv;
	char ifname[IFNAMSIZ];
	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp;
	unsigned char name_assign_type;
	struct ifinfomsg *ifmp;
	struct net *net;
//...
	net = rtnl_link_get_net(src_net, tbp);
	if (IS_ERR(net))
		return PTR_ERR(net);

	/* create the peer device first */
	peer = rtnl_create_link(net, ifname, name_assign_type,
				&veth_link_ops, tbp, extack);
//...
	/* register the peer */
	err = register_netdevice(peer);
//...
	/* register this device */
	err = register_netdevice(dev);
//...
	/*
	 * tie the deviced together
	 */
	/* link the two devices to each other */
	priv = netdev_priv(dev);
	rcu_assign_pointer(priv->peer, peer);

	priv = netdev_priv(peer);
	rcu_assign_pointer(priv->peer, dev);
//...
}

struct veth_priv {
	struct net_device __rcu	*peer;
	atomic64_t		dropped;
	struct bpf_prog		*_xdp_prog;
	struct veth_rq		*rq;
	unsigned int		requested_headroom;
};

Data transmission and reception

The transmit function is veth_xmit:

static const struct net_device_ops veth_netdev_ops = {
	.ndo_init            = veth_dev_init,
	.ndo_open            = veth_open,
	.ndo_stop            = veth_close,
	.ndo_start_xmit      = veth_xmit,
	.ndo_get_stats64     = veth_get_stats64,
	.ndo_set_rx_mode     = veth_set_multicast_list,
	.ndo_set_mac_address = eth_mac_addr,
	.ndo_get_iflink		= veth_get_iflink,
	.ndo_fix_features	= veth_fix_features,
	.ndo_features_check	= passthru_features_check,
	.ndo_set_rx_headroom	= veth_set_rx_headroom,
	.ndo_bpf		= veth_xdp,
	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
	.ndo_get_peer_dev	= veth_peer_dev,
};

static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
	struct veth_rq *rq = NULL;
	struct net_device *rcv;
	int length = skb->len;
	bool rcv_xdp = false;
	int rxq;

	rcu_read_lock();
	/* fetch the peer */
	rcv = rcu_dereference(priv->peer);
	if (unlikely(!rcv)) {
		kfree_skb(skb);
		goto drop;
	}

	rcv_priv = netdev_priv(rcv);
	rxq = skb_get_queue_mapping(skb);
	if (rxq < rcv->real_num_rx_queues) {
		rq = &rcv_priv->rq[rxq];
		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
	}

	skb_tx_timestamp(skb);
	/* hand the skb to the peer */
	if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
		if (!rcv_xdp)
			dev_lstats_add(dev, length);
	} else {
drop:
		atomic64_inc(&priv->dropped);
	}

	if (rcv_xdp)
		__veth_xdp_flush(rq);

	rcu_read_unlock();

	return NETDEV_TX_OK;
}

static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
			    struct veth_rq *rq, bool xdp)
{
	/* either enter XDP or push the skb up the stack */
	return __dev_forward_skb(dev, skb) ?: xdp ?
		veth_xdp_rx(rq, skb) :
		netif_rx(skb);
}

int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
	int ret = ____dev_forward_skb(dev, skb);

	if (likely(!ret)) {
		skb->protocol = eth_type_trans(skb, dev);
		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
	}

	return ret;
}

/**
 * eth_type_trans - determine the packet's protocol ID.
 * @skb: received socket data
 * @dev: receiving network device
 *
 * The rule here is that we
 * assume 802.3 if the type field is short enough to be a length.
 * This is normal practice and works for any 'now in use' protocol.
 */
__be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
//...
	/* replace skb->dev with the peer's dev */
	skb->dev = dev;
//...
}

So for veth, transmission simply fetches the peer's net_device and calls netif_rx, which places the skb on softnet_data's input_pkt_queue; from there the packet goes through the ordinary receive path. A figure from the referenced articles (omitted here) illustrates this veth implementation.
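The pairing logic can be sketched as a userspace toy model (all names here are illustrative, not kernel API): each device keeps a pointer to its peer, and "transmitting" just delivers the frame straight into the peer's receive buffer, which is essentially what veth_xmit plus netif_rx achieve.

```c
#include <assert.h>
#include <string.h>

/* Toy model of a veth pair: each "device" holds a pointer to its peer,
 * and transmitting on one side delivers the frame to the peer's receive
 * buffer -- mirroring how veth_xmit hands the skb to the peer
 * net_device. Illustrative only, not the kernel's data structures. */
struct toy_dev {
    const char *name;
    struct toy_dev *peer;
    char rx_buf[64];
    int rx_len;
};

/* Tie the two devices together, like veth_newlink does with priv->peer. */
static void toy_pair(struct toy_dev *a, struct toy_dev *b)
{
    a->peer = b;
    b->peer = a;
}

/* "Transmit": copy the frame into the peer's receive queue. */
static int toy_xmit(struct toy_dev *dev, const char *frame, int len)
{
    struct toy_dev *rcv = dev->peer;
    if (!rcv || len > (int)sizeof(rcv->rx_buf))
        return -1;                 /* no peer: drop, as veth does */
    memcpy(rcv->rx_buf, frame, len);
    rcv->rx_len = len;
    return 0;
}
```

The key property the model captures is that there is no wire at all: a send on one end is, in the same call, a receive on the other.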

network namespace and VRF

Network namespaces give each namespace a logically independent protocol stack, including devices, routing tables, ARP tables, iptables rules, and sockets, thereby achieving isolation.

Namespace communication experiment

First create a network namespace:

$ sudo ip netns add net1
$ lsns -t net
        NS TYPE NPROCS   PID USER       NETNSID NSFS            COMMAND
4026531840 net      80  1322 bentutu unassigned                 /usr/bin/pipewire
4026532383 net       0       root               /run/netns/net1

After creation, an extra namespace is visible.
You can enter it with ip netns exec or nsenter:

$ sudo nsenter --net=/run/netns/net1 bash
# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

At this point there is only a loopback interface. Back in the default namespace, create a veth pair and move one end into the namespace:

$ sudo ip link add veth1 type veth peer name veth1_p
$ sudo ip link set veth1 netns net1
$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 54:ee:75:4c:62:fa brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 34:e6:ad:38:6e:27 brd ff:ff:ff:ff:ff:ff
6: veth1_p@if7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a6:00:93:fb:53:66 brd ff:ff:ff:ff:ff:ff link-netns net1
$ sudo ip netns exec net1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
7: veth1@if6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0

veth1 has now been moved into the net1 namespace. Assign addresses, bring both ends up, and ping:

$ sudo ip addr add 192.168.10.1/24 dev veth1_p
$ sudo ip netns exec net1 ip addr add 192.168.10.2/24 dev veth1
$ sudo ip link set dev veth1_p up
$ sudo ip netns exec net1 ip link set dev veth1 up
$ sudo ip netns exec net1 ping 192.168.10.1 -I veth1
PING 192.168.10.1 (192.168.10.1) from 192.168.10.2 veth1: 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.066 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.047 ms
^C
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2034ms
rtt min/avg/max/mdev = 0.040/0.051/0.066/0.011 ms

The ping succeeds, and the interfaces, iptables rules, routes, and so on are independent in each namespace:

$ ip addr
#...
6: veth1_p@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a6:00:93:fb:53:66 brd ff:ff:ff:ff:ff:ff link-netns net1
    inet 192.168.10.1/24 scope global veth1_p
       valid_lft forever preferred_lft forever
    inet6 fe80::a400:93ff:fefb:5366/64 scope link
       valid_lft forever preferred_lft forever
$ ip route
192.168.1.0/24 dev enp3s0 proto kernel scope link src 192.168.1.2 metric 100
192.168.10.0/24 dev veth1_p proto kernel scope link src 192.168.10.1
$ sudo ip netns exec net1 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
7: veth1@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.10.2/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::2068:39ff:fe16:5ba3/64 scope link
       valid_lft forever preferred_lft forever
$ sudo ip netns exec net1 ip route
192.168.10.0/24 dev veth1 proto kernel scope link src 192.168.10.2

Source code implementation

Data structures

(figure: struct net and related data structures, from the referenced articles)
The net shown above is the network namespace object. Every net_device and socket records which namespace it belongs to, reachable through a net member, and each namespace has its own routing tables, iptables rules, and kernel parameters. loopback_dev is a loopback device owned by every net, which is why a freshly created namespace in the experiment above already contains lo.
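From userspace, the namespace a process currently belongs to can be observed through /proc: each process's network namespace is exposed as a symlink whose target encodes the inode of the underlying struct net. A minimal sketch (the helper name netns_id is mine):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the target of /proc/self/ns/net, e.g. "net:[4026531840]".
 * The bracketed number identifies the struct net that
 * current->nsproxy->net_ns points at; two processes in the same
 * network namespace see the same value (compare the lsns output above). */
static int netns_id(char *buf, size_t len)
{
    ssize_t n = readlink("/proc/self/ns/net", buf, len - 1);
    if (n < 0)
        return -1;          /* not on Linux, or /proc not mounted */
    buf[n] = '\0';
    return 0;
}
```

This is also the mechanism setns(2) consumes: opening /run/netns/net1 (a bind mount of such a namespace file) yields a file descriptor that can be passed to setns.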

Creating the default namespace

The kernel boot flow was covered in the earlier article on building u-boot and the kernel for the Raspberry Pi (树莓派编译uboot及内核&&整体启动流程): from the kernel entry point, execution reaches start_kernel via primary_entry->__primary_switch->__primary_switched->start_kernel.
__primary_switched sets up the task descriptor of process 0, init_task, whose definition contains:


/*
 * Set up the first task table, touch at your own risk!. Base=0,
 * limit=0x1fffff (=2MB)
 */
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
	__init_task_data
#endif
	__aligned(L1_CACHE_BYTES)
= {
//...
	.nsproxy	= &init_nsproxy,
//...
}

nsproxy holds the various namespaces:


struct nsproxy init_nsproxy = {
	.count			= ATOMIC_INIT(1),
	.uts_ns			= &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
	.ipc_ns			= &init_ipc_ns,
#endif
	.mnt_ns			= NULL,
	.pid_ns_for_children	= &init_pid_ns,
#ifdef CONFIG_NET
	.net_ns			= &init_net,
#endif
#ifdef CONFIG_CGROUPS
	.cgroup_ns		= &init_cgroup_ns,
#endif
#ifdef CONFIG_TIME_NS
	.time_ns		= &init_time_ns,
	.time_ns_for_children	= &init_time_ns,
#endif
};

It includes net_ns, the network namespace, which is already created during subsystem initialization:


struct net init_net = {
	.count		= REFCOUNT_INIT(1),
	.dev_base_head	= LIST_HEAD_INIT(init_net.dev_base_head),
#ifdef CONFIG_KEYS
	.key_domain	= &init_net_key_domain,
#endif
};

static int __init net_ns_init(void)
{
	struct net_generic *ng;
//...
	down_write(&pernet_ops_rwsem);
	/* initialize the default network namespace */
	if (setup_net(&init_net, &init_user_ns))
		panic("Could not setup the initial network namespace");

	init_net_initialized = true;
	up_write(&pernet_ops_rwsem);

	if (register_pernet_subsys(&net_ns_ops))
		panic("Could not register network namespace subsystems");

	/* register netlink handlers for creating and querying namespace ids */
	rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL,
		      RTNL_FLAG_DOIT_UNLOCKED);
	rtnl_register(PF_UNSPEC, RTM_GETNSID, rtnl_net_getid, rtnl_net_dumpid,
		      RTNL_FLAG_DOIT_UNLOCKED);

	return 0;
}

pure_initcall(net_ns_init);

Namespace creation for new processes

When a new process is forked, the kernel decides whether to create new namespaces:
SYSCALL_DEFINE0(fork)->kernel_clone->copy_process->copy_namespaces->create_new_namespaces->copy_net_ns
copy_process calls dup_task_struct to copy the task descriptor and then performs a series of copy operations.
If CLONE_NEWNET is specified, create_new_namespaces builds a new namespace set and assigns it to the new task's nsproxy.


struct net *copy_net_ns(unsigned long flags,
			struct user_namespace *user_ns, struct net *old_net)
{
	struct ucounts *ucounts;
	struct net *net;
	int rv;

	/* if CLONE_NEWNET is not set, just take a reference on the old net */
	if (!(flags & CLONE_NEWNET))
		return get_net(old_net);

	ucounts = inc_net_namespaces(user_ns);
	if (!ucounts)
		return ERR_PTR(-ENOSPC);

	net = net_alloc();
	if (!net) {
		rv = -ENOMEM;
		goto dec_ucounts;
	}
	refcount_set(&net->passive, 1);
	net->ucounts = ucounts;
	get_user_ns(user_ns);

	rv = down_read_killable(&pernet_ops_rwsem);
	if (rv < 0)
		goto put_userns;

	/* initialize the newly created namespace */
	rv = setup_net(net, user_ns);

	up_read(&pernet_ops_rwsem);

	if (rv < 0) {
put_userns:
#ifdef CONFIG_KEYS
		key_remove_domain(net->key_domain);
#endif
		put_user_ns(user_ns);
		net_drop_ns(net);
dec_ucounts:
		dec_net_namespaces(ucounts);
		return ERR_PTR(rv);
	}
	return net;
}

Per-namespace subsystem initialization

The subsystems inside a namespace are all initialized when setup_net is called: routing tables, TCP's proc pseudo-files, iptables rule loading, and so on.
Because the kernel networking stack is so large, it is divided into subsystems. Each subsystem defines an init function and an exit function in a pernet_operations structure:


struct pernet_operations {
	struct list_head list;
	/*
	 * Below methods are called without any exclusive locks.
	 * More than one net may be constructed and destructed
	 * in parallel on several cpus. Every pernet_operations
	 * have to keep in mind all other pernet_operations and
	 * to introduce a locking, if they share common resources.
	 *
	 * The only time they are called with exclusive lock is
	 * from register_pernet_subsys(), unregister_pernet_subsys()
	 * register_pernet_device() and unregister_pernet_device().
	 *
	 * Exit methods using blocking RCU primitives, such as
	 * synchronize_rcu(), should be implemented via exit_batch.
	 * Then, destruction of a group of net requires single
	 * synchronize_rcu() related to these pernet_operations,
	 * instead of separate synchronize_rcu() for every net.
	 * Please, avoid synchronize_rcu() at all, where it's possible.
	 *
	 * Note that a combination of pre_exit() and exit() can
	 * be used, since a synchronize_rcu() is guaranteed between
	 * the calls.
	 */
	int (*init)(struct net *net);
	void (*pre_exit)(struct net *net);
	void (*exit)(struct net *net);
	void (*exit_batch)(struct list_head *net_exit_list);
	unsigned int *id;
	size_t size;
};

register_pernet_device and register_pernet_subsys both call register_pernet_operations to register the pernet_operations:


/**
 *      register_pernet_subsys - register a network namespace subsystem
 *	@ops:  pernet operations structure for the subsystem
 *
 *	Register a subsystem which has init and exit functions
 *	that are called when network namespaces are created and
 *	destroyed respectively.
 *
 *	When registered all network namespace init functions are
 *	called for every existing network namespace.  Allowing kernel
 *	modules to have a race free view of the set of network namespaces.
 *
 *	When a new network namespace is created all of the init
 *	methods are called in the order in which they were registered.
 *
 *	When a network namespace is destroyed all of the exit methods
 *	are called in the reverse of the order with which they were
 *	registered.
 */
int register_pernet_subsys(struct pernet_operations *ops)
{
	int error;

	down_write(&pernet_ops_rwsem);
	error =  register_pernet_operations(first_device, ops);
	up_write(&pernet_ops_rwsem);

	return error;
}

/**
 *      register_pernet_device - register a network namespace device
 *	@ops:  pernet operations structure for the subsystem
 *
 *	Register a device which has init and exit functions
 *	that are called when network namespaces are created and
 *	destroyed respectively.
 *
 *	When registered all network namespace init functions are
 *	called for every existing network namespace.  Allowing kernel
 *	modules to have a race free view of the set of network namespaces.
 *
 *	When a new network namespace is created all of the init
 *	methods are called in the order in which they were registered.
 *
 *	When a network namespace is destroyed all of the exit methods
 *	are called in the reverse of the order with which they were
 *	registered.
 */
int register_pernet_device(struct pernet_operations *ops)
{
	int error;

	down_write(&pernet_ops_rwsem);
	error = register_pernet_operations(&pernet_list, ops);
	/* point first_device at the first registered device entry */
	if (!error && (first_device == &pernet_list))
		first_device = &ops->list;
	up_write(&pernet_ops_rwsem);

	return error;
}

Here first_device marks the head of the device portion of the list, so both functions append to the same list, pernet_list:

static struct list_head *first_device = &pernet_list;

register_pernet_operations then hangs the pernet_operations onto that list:


static int register_pernet_operations(struct list_head *list,
				      struct pernet_operations *ops)
{
//...
	error = __register_pernet_operations(list, ops);
//...
}

static int __register_pernet_operations(struct list_head *list,
					struct pernet_operations *ops)
{
//...
	list_add_tail(&ops->list, list);
	if (ops->init || (ops->id && ops->size)) {
		/* We held write locked pernet_ops_rwsem, and parallel
		 * setup_net() and cleanup_net() are not possible.
		 */
		for_each_net(net) {
			error = ops_init(ops, net);
			if (error)
				goto out_undo;
			list_add_tail(&net->exit_list, &net_exit_list);
		}
	}
//...
}

During registration, every existing namespace is walked and initialized; net_namespace_list links all nets together (see the data structures section above):

#define for_each_net(VAR)				\
	list_for_each_entry(VAR, &net_namespace_list, list)

static int ops_init(const struct pernet_operations *ops, struct net *net)
{
	int err = -ENOMEM;
	void *data = NULL;

	if (ops->id && ops->size) {
		data = kzalloc(ops->size, GFP_KERNEL);
		if (!data)
			goto out;

		err = net_assign_generic(net, *ops->id, data);
		if (err)
			goto cleanup;
	}
	err = 0;
	if (ops->init)
		err = ops->init(net);
	if (!err)
		return 0;

cleanup:
	kfree(data);

out:
	return err;
}
A register_pernet_subsys example

Many of the files under /proc/net are in fact implemented via register_pernet_subsys (see the figure in the referenced articles); /proc/net/protocols is one example.

A register_pernet_device example

The per-net loopback device mentioned earlier is registered through register_pernet_device:


/* Registered in net/core/dev.c */
struct pernet_operations __net_initdata loopback_net_ops = {
	.init = loopback_net_init,
};

/* Setup and register the loopback device. */
static __net_init int loopback_net_init(struct net *net)
{
	struct net_device *dev;
	int err;

	err = -ENOMEM;
	dev = alloc_netdev(0, "lo", NET_NAME_UNKNOWN, loopback_setup);
	if (!dev)
		goto out;

	dev_net_set(dev, net);
	err = register_netdev(dev);
	if (err)
		goto out_free_netdev;

	BUG_ON(dev->ifindex != LOOPBACK_IFINDEX);
	/* record it as the namespace's loopback device */
	net->loopback_dev = dev;
	return 0;

out_free_netdev:
	free_netdev(dev);
out:
	if (net_eq(net, &init_net))
		panic("loopback: Failed to register netdevice: %d\n", err);
	return err;
}

This is why every newly created namespace comes with an lo:

$ ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Initializing a newly created namespace

The initialization above happens at boot, but a newly added namespace must be initialized on its own. Recall setup_net from earlier:


/*
 * setup_net runs the initializers for the network namespace object.
 */
static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
{
//...
	/* run every registered subsystem's init for this namespace */
	list_for_each_entry(ops, &pernet_list, list) {
		error = ops_init(ops, net);
		if (error < 0)
			goto out_undo;
	}
//...
}

Moving a device to another namespace

A device is created in the default namespace, but its namespace can be changed afterwards, as in the experiment above:

sudo ip link set veth1 netns net1

This is implemented with netlink's RTM_SETLINK:


void __init rtnetlink_init(void)
{
//...
	rtnl_register(PF_UNSPEC, RTM_SETLINK, rtnl_setlink, NULL, 0);
//...
}

static int rtnl_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
			struct netlink_ext_ack *extack)
{
	struct net *net = sock_net(skb->sk);
	struct ifinfomsg *ifm;
	struct net_device *dev;
	int err;
	struct nlattr *tb[IFLA_MAX+1];
	char ifname[IFNAMSIZ];
//...
	err = -EINVAL;
	ifm = nlmsg_data(nlh);
	/* look the device up by ifindex */
	if (ifm->ifi_index > 0)
		dev = __dev_get_by_index(net, ifm->ifi_index);
	/* or look it up by interface name */
	else if (tb[IFLA_IFNAME] || tb[IFLA_ALT_IFNAME])
		dev = rtnl_dev_get(net, NULL, tb[IFLA_ALT_IFNAME], ifname);
	else
		goto errout;

	if (dev == NULL) {
		err = -ENODEV;
		goto errout;
	}

	/* perform the change */
	err = do_setlink(skb, dev, ifm, extack, tb, ifname, 0);
errout:
	return err;
}

static int do_setlink(const struct sk_buff *skb,
		      struct net_device *dev, struct ifinfomsg *ifm,
		      struct netlink_ext_ack *extack,
		      struct nlattr **tb, char *ifname, int status)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	int err;

	err = validate_linkmsg(dev, tb);
	if (err < 0)
		return err;

	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD] || tb[IFLA_TARGET_NETNSID]) {
		const char *pat = ifname && ifname[0] ? ifname : NULL;
		struct net *net = rtnl_link_get_net_capable(skb, dev_net(dev),
							    tb, CAP_NET_ADMIN);
		if (IS_ERR(net)) {
			err = PTR_ERR(net);
			goto errout;
		}

		err = dev_change_net_namespace(dev, net, pat);
		put_net(net);
		if (err)
			goto errout;

		status |= DO_SETLINK_MODIFIED;
	}
//...
}

/**
 *	dev_change_net_namespace - move device to different nethost namespace
 *	@dev: device
 *	@net: network namespace
 *	@pat: If not NULL name pattern to try if the current device name
 *	      is already taken in the destination network namespace.
 *
 *	This function shuts down a device interface and moves it
 *	to a new network namespace. On success 0 is returned, on
 *	a failure a netagive errno code is returned.
 *
 *	Callers must hold the rtnl semaphore.
 */
int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
{
	struct net *net_old = dev_net(dev);
	int err, new_nsid, new_ifindex;
//...
	/*
	 * And now a mini version of register_netdevice unregister_netdevice.
	 */

	/* If device is running close it first. */
	/* bring the interface down */
	dev_close(dev);

	/* And unlink it from device chain */
	unlist_netdevice(dev);

	synchronize_net();

	/* Shutdown queueing discipline. */
	dev_shutdown(dev);

	/* Notify protocols, that we are about to destroy
	 * this device. They should clean all the things.
	 *
	 * Note that dev->reg_state stays at NETREG_REGISTERED.
	 * This is wanted because this way 8021q and macvlan know
	 * the device is just moving and can keep their slaves up.
	 */
	/* notify that the device is being unregistered */
	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
	rcu_barrier();

	new_nsid = peernet2id_alloc(dev_net(dev), net, GFP_KERNEL);

	/* If there is an ifindex conflict assign a new one */
	if (__dev_get_by_index(net, dev->ifindex))
		new_ifindex = dev_new_index(net);
	else
		new_ifindex = dev->ifindex;

	/* announce the deletion of the interface */
	rtmsg_ifinfo_newnet(RTM_DELLINK, dev, ~0U, GFP_KERNEL, &new_nsid,
			    new_ifindex);

	/*
	 *	Flush the unicast and multicast chains
	 */
	dev_uc_flush(dev);
	dev_mc_flush(dev);

	/* Send a netdev-removed uevent to the old namespace */
	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
	netdev_adjacent_del_links(dev);

	/* Move per-net netdevice notifiers that are following the netdevice */
	move_netdevice_notifiers_dev_net(dev, net);

	/* Actually switch the network namespace */
	/* point the device at the new net */
	dev_net_set(dev, net);
	dev->ifindex = new_ifindex;

	/* Send a netdev-add uevent to the new namespace */
	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
	netdev_adjacent_add_links(dev);

	/* Fixup kobjects */
	err = device_rename(&dev->dev, dev->name);
	WARN_ON(err);

	/* Adapt owner in case owning user namespace of target network
	 * namespace is different from the original one.
	 */
	err = netdev_change_owner(dev, net_old, net);
	WARN_ON(err);

	/* Add the device back in the hashes */
	list_netdevice(dev);

	/* Notify protocols, that a new device appeared. */
	call_netdevice_notifiers(NETDEV_REGISTER, dev);

	/*
	 *	Prevent userspace races by waiting until the network
	 *	device is fully setup before sending notifications.
	 */
	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);

	synchronize_net();
	err = 0;
out:
	return err;
}

static inline
void dev_net_set(struct net_device *dev, struct net *net)
{
	write_pnet(&dev->nd_net, net);
}

The namespace a socket belongs to

The socket creation path was already covered in the earlier TCP/IP implementation analysis (TCP/IP实现浅析):

int sock_create(int family, int type, int protocol, struct socket **res)
{
	return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}

So a socket is created in the network namespace of the current process. The namespace is finally recorded on the sock in sk_alloc:


/**
 *	sk_alloc - All socket objects are allocated here
 *	@net: the applicable net namespace
 *	@family: protocol family
 *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
 *	@prot: struct proto associated with this new sock instance
 *	@kern: is this to be a kernel socket?
 */
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
		      struct proto *prot, int kern)
{
	struct sock *sk;

	sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
	if (sk) {
//...
		sock_net_set(sk, net);
//...
	}

	return sk;
}

static inline
void sock_net_set(struct sock *sk, struct net *net)
{
	write_pnet(&sk->sk_net, net);
}

Socket programming across namespaces

Going further, if a single process wants sockets in different namespaces, it can call setns to switch to the target namespace first; a socket created at that point inherits the new namespace, and the process can then switch back.

int pal_sock_set_ns(fib_id_t fib)
{
	s_int32_t fd;
	char      netns_file_name[PATH_MAX];
	s_int32_t ret;

	/*   /var/run/netns/nosfib256    */
	PAL_VRF_NS_PATH(netns_file_name, fib);

	fd = open(netns_file_name, O_RDONLY, 0);
	if (fd < 0) {
		return -1;
	}

	ret = setns(fd, CLONE_NEWNET);
	close(fd);
	return ret;
}

Namespaces on the send/receive path

Taking transmission as an example: the earlier TCP/IP implementation analysis walked through ip_queue_xmit, whose route lookup goes ip_route_output_ports->ip_route_output_flow->__ip_route_output_key->ip_route_output_key_hash->ip_route_output_key_hash_rcu->fib_lookup:

int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
{
	return __ip_queue_xmit(sk, skb, fl, inet_sk(sk)->tos);
}

/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
		    __u8 tos)
{
	struct inet_sock *inet = inet_sk(sk);
	struct net *net = sock_net(sk);
//...
	rt = ip_route_output_ports(net, fl4, sk,
				   daddr, inet->inet_saddr,
				   inet->inet_dport,
				   inet->inet_sport,
				   sk->sk_protocol,
				   RT_CONN_FLAGS_TOS(sk, tos),
				   sk->sk_bound_dev_if);
//...
}

static inline
struct net *sock_net(const struct sock *sk)
{
	return read_pnet(&sk->sk_net);
}

Because of the assignment made in sk_alloc above, ip_route_output_ports uses the net recorded when the socket was created.
Inside fib_lookup:

static inline int fib_lookup(struct net *net, struct flowi4 *flp,
			     struct fib_result *res, unsigned int flags)
{
//...
	tb = rcu_dereference_rtnl(net->ipv4.fib_main);
//...
	tb = rcu_dereference_rtnl(net->ipv4.fib_default);
//...
}

This matches the data structures described earlier: sockets in different network namespaces use different routing tables.
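That per-namespace lookup can be sketched as a userspace toy model (all names are illustrative, not kernel API): each "net" owns its own table, and a lookup keyed by the socket's net returns that namespace's result.

```c
#include <assert.h>
#include <string.h>

/* Toy model: each "net" carries its own one-entry routing table, and a
 * lookup always goes through the net the socket was created in --
 * mirroring how fib_lookup dereferences net->ipv4.fib_main. */
struct toy_net {
    const char *default_dev;   /* stand-in for this namespace's table */
};

struct toy_sock {
    struct toy_net *net;       /* set once at creation, like sk_alloc */
};

/* Route lookup consults only the socket's own namespace. */
static const char *toy_route_lookup(const struct toy_sock *sk)
{
    return sk->net->default_dev;
}
```

Two sockets created in different namespaces therefore resolve the same destination through entirely different tables, which is exactly the isolation observed in the ip route outputs above.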

VRF and network namespace

A switch's VRF (virtual routing and forwarding instance) can be implemented with network namespaces, since every newly created namespace comes with layer-3 isolation.

How VM and container namespace interconnection differs

For interconnecting namespaces, virtual machines generally use tun/tap devices, while containers use veth pairs.
This is likely because a VM must ultimately reach a physical NIC, so a tun/tap device is used, with one end attached to the kernel and the other leading toward the physical NIC.
Containers, by contrast, mostly need container-to-container connectivity, for which a veth pair is sufficient.

bridge

bridge communication experiment

Relevant commands:
brctl addbr: create a bridge
brctl addif: attach an interface to a bridge
brctl show: list which interfaces are attached to a bridge

Create two namespaces and two veth pairs, then configure them:

$ sudo ip netns add net1
$ sudo ip link add veth1 type veth peer name veth1_p
$ sudo ip link set veth1 netns net1
$ sudo ip netns add net2
$ sudo ip link add veth2 type veth peer name veth2_p
$ sudo ip link set veth2 netns net2
$ sudo ip netns exec net1 ip addr add 192.168.10.1/24 dev veth1
$ sudo ip netns exec net2 ip addr add 192.168.10.2/24 dev veth2
$ sudo ip netns exec net1 ip link set dev veth1 up
$ sudo ip netns exec net2 ip link set dev veth2 up

Check the result:

$ sudo ip netns exec net1 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: veth1@if10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether 22:68:39:16:5b:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.10.1/24 scope global veth1
       valid_lft forever preferred_lft forever
$ sudo ip netns exec net2 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: veth2@if8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether da:3f:28:00:b0:69 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.10.2/24 scope global veth2
       valid_lft forever preferred_lft forever

Now create the bridge:

$ sudo brctl addbr br0
$ sudo ip link set dev veth1_p master br0
$ sudo ip link set dev veth2_p master br0
$ sudo ip addr add 192.168.10.3/24 dev br0
$ sudo ip link set veth1_p up
$ sudo ip link set veth2_p up
$ sudo ip link set br0 up

Check the configuration:

$ brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.fac06bb1b35b	no		veth1_p
							veth2_p

Now ping:

$ sudo ip netns exec net1 ping 192.168.10.2 -I veth1
PING 192.168.10.2 (192.168.10.2) from 192.168.10.1 veth1: 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.072 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.041 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.041/0.056/0.072/0.015 ms
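The forwarding behavior just demonstrated can be sketched as a toy learning switch (all names are illustrative, not kernel code): the bridge learns the source MAC of every incoming frame into its FDB, then either forwards a known unicast destination out one port or floods.

```c
#include <assert.h>
#include <string.h>

/* Toy model of bridge forwarding: a tiny FDB maps MAC -> port. On each
 * incoming frame the bridge learns the source MAC on the ingress port,
 * then looks up the destination: a hit gives the egress port, a miss
 * returns -1, meaning "flood all ports except the ingress one". */
#define FDB_MAX 16

struct fdb_entry {
    unsigned char mac[6];
    int port;
};

struct toy_bridge {
    struct fdb_entry fdb[FDB_MAX];
    int n;
};

static void fdb_learn(struct toy_bridge *br, const unsigned char *mac, int port)
{
    for (int i = 0; i < br->n; i++) {
        if (memcmp(br->fdb[i].mac, mac, 6) == 0) {
            br->fdb[i].port = port;   /* station moved: update the entry */
            return;
        }
    }
    if (br->n < FDB_MAX) {
        memcpy(br->fdb[br->n].mac, mac, 6);
        br->fdb[br->n].port = port;
        br->n++;
    }
}

/* Returns the egress port for dst, or -1 if unknown (flood). */
static int toy_bridge_input(struct toy_bridge *br, int ingress_port,
                            const unsigned char *src, const unsigned char *dst)
{
    fdb_learn(br, src, ingress_port);
    for (int i = 0; i < br->n; i++)
        if (memcmp(br->fdb[i].mac, dst, 6) == 0)
            return br->fdb[i].port;
    return -1;   /* unknown destination: flood */
}
```

In the experiment above, the first ICMP request from veth1 is flooded (br0 has never seen veth2's MAC), while the reply and all subsequent frames are forwarded out exactly one port.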

Source code implementation

Initialization and bridge creation

The bridge code lives in net/bridge/br.c:

module_init(br_init)

static int __init br_init(void)
{
	int err;
//...
	err = stp_proto_register(&br_stp_proto);
	if (err < 0) {
		pr_err("bridge: can't register sap for STP\n");
		return err;
	}

	err = br_fdb_init();
	if (err)
		goto err_out;

	/* tear down bridge devices when their namespace is destroyed */
	err = register_pernet_subsys(&br_net_ops);
	if (err)
		goto err_out1;

	err = br_nf_core_init();
	if (err)
		goto err_out2;

	err = register_netdevice_notifier(&br_device_notifier);
	if (err)
		goto err_out3;

	err = register_switchdev_notifier(&br_switchdev_notifier);
	if (err)
		goto err_out4;

	err = br_netlink_init();
	if (err)
		goto err_out5;

	brioctl_set(br_ioctl_deviceless_stub);
//...
	return 0;
//...
}

int __init br_netlink_init(void)
{
	int err;

	br_mdb_init();
	br_vlan_rtnl_init();
	rtnl_af_register(&br_af_ops);

	err = rtnl_link_register(&br_link_ops);
	if (err)
		goto out_af;

	return 0;
//...
}

rtnl_link_register was already covered in the veth analysis: the registered ops respond to RTM_NEWLINK, with setup and newlink called in sequence, and by the time newlink runs, the net_device has already been allocated (via rtnl_create_link).

struct rtnl_link_ops br_link_ops __read_mostly = {
	.kind			= "bridge",
	.priv_size		= sizeof(struct net_bridge),
	.setup			= br_dev_setup,
	.maxtype		= IFLA_BR_MAX,
	.policy			= br_policy,
	.validate		= br_validate,
	.newlink		= br_dev_newlink,
	.changelink		= br_changelink,
	.dellink		= br_dev_delete,
	.get_size		= br_get_size,
	.fill_info		= br_fill_info,
	.fill_linkxstats	= br_fill_linkxstats,
	.get_linkxstats_size	= br_get_linkxstats_size,
	.slave_maxtype		= IFLA_BRPORT_MAX,
	.slave_policy		= br_port_policy,
	.slave_changelink	= br_port_slave_changelink,
	.get_slave_size		= br_port_get_slave_size,
	.fill_slave_info	= br_port_fill_slave_info,
};

void br_dev_setup(struct net_device *dev)
{
	struct net_bridge *br = netdev_priv(dev);

	eth_hw_addr_random(dev);
	ether_setup(dev);
	// contains the transmit/receive callbacks
	dev->netdev_ops = &br_netdev_ops;
	dev->needs_free_netdev = true;
	// ethtool
	dev->ethtool_ops = &br_ethtool_ops;
	SET_NETDEV_DEVTYPE(dev, &br_type);
	dev->priv_flags = IFF_EBRIDGE | IFF_NO_QUEUE;

	dev->features = COMMON_FEATURES | NETIF_F_LLTX | NETIF_F_NETNS_LOCAL |
			NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX;
	dev->hw_features = COMMON_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
			   NETIF_F_HW_VLAN_STAG_TX;
	dev->vlan_features = COMMON_FEATURES;

	br->dev = dev;
	//...
}
static int br_dev_newlink(struct net *src_net, struct net_device *dev,
			  struct nlattr *tb[], struct nlattr *data[],
			  struct netlink_ext_ack *extack)
{
	struct net_bridge *br = netdev_priv(dev);
	int err;

	err = register_netdevice(dev);
	if (err)
		return err;

	if (tb[IFLA_ADDRESS]) {
		spin_lock_bh(&br->lock);
		br_stp_change_bridge_id(br, nla_data(tb[IFLA_ADDRESS]));
		spin_unlock_bh(&br->lock);
	}

	err = br_changelink(dev, tb, data, extack);
	if (err)
		br_dev_delete(dev, NULL);

	return err;
}

Adding a device to the bridge

The br_netdev_ops implementation:


static const struct net_device_ops br_netdev_ops = {
	.ndo_open		 = br_dev_open,
	.ndo_stop		 = br_dev_stop,
	.ndo_init		 = br_dev_init,
	.ndo_uninit		 = br_dev_uninit,
	.ndo_start_xmit		 = br_dev_xmit,
	.ndo_get_stats64	 = br_get_stats64,
	.ndo_set_mac_address	 = br_set_mac_address,
	.ndo_set_rx_mode	 = br_dev_set_multicast_list,
	.ndo_change_rx_flags	 = br_dev_change_rx_flags,
	.ndo_change_mtu		 = br_change_mtu,
	.ndo_do_ioctl		 = br_dev_ioctl,
	.ndo_add_slave		 = br_add_slave,
	.ndo_del_slave		 = br_del_slave,
	.ndo_fix_features        = br_fix_features,
	.ndo_fdb_add		 = br_fdb_add,
	.ndo_fdb_del		 = br_fdb_delete,
	.ndo_fdb_dump		 = br_fdb_dump,
	.ndo_fdb_get		 = br_fdb_get,
	.ndo_bridge_getlink	 = br_getlink,
	.ndo_bridge_setlink	 = br_setlink,
	.ndo_bridge_dellink	 = br_dellink,
	.ndo_features_check	 = passthru_features_check,
};

Here ndo_add_slave is the hook that enslaves a device to the bridge:

static int br_add_slave(struct net_device *dev, struct net_device *slave_dev,
			struct netlink_ext_ack *extack)
{
	struct net_bridge *br = netdev_priv(dev);

	return br_add_if(br, slave_dev, extack);
}

/* called with RTNL */
int br_add_if(struct net_bridge *br, struct net_device *dev,
	      struct netlink_ext_ack *extack)
{
	struct net_bridge_port *p;
	int err = 0;
	unsigned br_hr, dev_hr;
	bool changed_addr, fdb_synced = false;

	/* Don't allow bridging non-ethernet like devices. */
	if ((dev->flags & IFF_LOOPBACK) ||
	    dev->type != ARPHRD_ETHER || dev->addr_len != ETH_ALEN ||
	    !is_valid_ether_addr(dev->dev_addr))
		return -EINVAL;

	/* Also don't allow bridging of net devices that are DSA masters, since
	 * the bridge layer rx_handler prevents the DSA fake ethertype handler
	 * to be invoked, so we don't get the chance to strip off and parse the
	 * DSA switch tag protocol header (the bridge layer just returns
	 * RX_HANDLER_CONSUMED, stopping RX processing for these frames).
	 * The only case where that would not be an issue is when bridging can
	 * already be offloaded, such as when the DSA master is itself a DSA
	 * or plain switchdev port, and is bridged only with other ports from
	 * the same hardware device.
	 */
	if (netdev_uses_dsa(dev)) {
		list_for_each_entry(p, &br->port_list, list) {
			if (!netdev_port_same_parent_id(dev, p->dev)) {
				NL_SET_ERR_MSG(extack,
					       "Cannot do software bridging with a DSA master");
				return -EINVAL;
			}
		}
	}

	/* No bridging of bridges */
	if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) {
		NL_SET_ERR_MSG(extack,
			       "Can not enslave a bridge to a bridge");
		return -ELOOP;
	}

	/* Device has master upper dev */
	if (netdev_master_upper_dev_get(dev))
		return -EBUSY;

	/* No bridging devices that dislike that (e.g. wireless) */
	if (dev->priv_flags & IFF_DONT_BRIDGE) {
		NL_SET_ERR_MSG(extack,
			       "Device does not allow enslaving to a bridge");
		return -EOPNOTSUPP;
	}

	// allocate a net_bridge_port
	p = new_nbp(br, dev);
	if (IS_ERR(p))
		return PTR_ERR(p);

	// notify the slave device that it is joining
	call_netdevice_notifiers(NETDEV_JOIN, dev);

	err = dev_set_allmulti(dev, 1);
	if (err) {
		br_multicast_del_port(p);
		kfree(p);	/* kobject not yet init'd, manually free */
		goto err1;
	}

	err = kobject_init_and_add(&p->kobj, &brport_ktype, &(dev->dev.kobj),
				   SYSFS_BRIDGE_PORT_ATTR);
	if (err)
		goto err2;

	err = br_sysfs_addif(p);
	if (err)
		goto err2;

	err = br_netpoll_enable(p);
	if (err)
		goto err3;

	// register the frame receive handler, which is br_handle_frame
	err = netdev_rx_handler_register(dev, br_get_rx_handler(dev), p);
	if (err)
		goto err4;

	dev->priv_flags |= IFF_BRIDGE_PORT;

	err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL, extack);
	if (err)
		goto err5;

	err = nbp_switchdev_mark_set(p);
	if (err)
		goto err6;

	dev_disable_lro(dev);

	// add to the bridge's list of ports in use
	list_add_rcu(&p->list, &br->port_list);

	nbp_update_port_count(br);
	if (!br_promisc_port(p) && (p->dev->priv_flags & IFF_UNICAST_FLT)) {
		/* When updating the port count we also update all ports'
		 * promiscuous mode.
		 * A port leaving promiscuous mode normally gets the bridge's
		 * fdb synced to the unicast filter (if supported), however,
		 * `br_port_clear_promisc` does not distinguish between
		 * non-promiscuous ports and *new* ports, so we need to
		 * sync explicitly here.
		 */
		fdb_synced = br_fdb_sync_static(br, p) == 0;
		if (!fdb_synced)
			netdev_err(dev, "failed to sync bridge static fdb addresses to this port\n");
	}

	netdev_update_features(br->dev);

	br_hr = br->dev->needed_headroom;
	dev_hr = netdev_get_fwd_headroom(dev);
	if (br_hr < dev_hr)
		update_headroom(br, dev_hr);
	else
		netdev_set_rx_headroom(dev, br_hr);

	if (br_fdb_insert(br, p, dev->dev_addr, 0))
		netdev_err(dev, "failed insert local address bridge forwarding table\n");

	if (br->dev->addr_assign_type != NET_ADDR_SET) {
		/* Ask for permission to use this MAC address now, even if we
		 * don't end up choosing it below.
		 */
		err = dev_pre_changeaddr_notify(br->dev, dev->dev_addr, extack);
		if (err)
			goto err7;
	}

	// initialize vlan
	err = nbp_vlan_init(p, extack);
	if (err) {
		netdev_err(dev, "failed to initialize vlan filtering on this port\n");
		goto err7;
	}

	spin_lock_bh(&br->lock);
	changed_addr = br_stp_recalculate_bridge_id(br);

	if (netif_running(dev) && netif_oper_up(dev) &&
	    (br->dev->flags & IFF_UP))
		br_stp_enable_port(p);
	spin_unlock_bh(&br->lock);

	// notify
	br_ifinfo_notify(RTM_NEWLINK, NULL, p);

	if (changed_addr)
		call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);

	br_mtu_auto_adjust(br);
	br_set_gso_limits(br);

	kobject_uevent(&p->kobj, KOBJ_ADD);

	return 0;
	//...
}
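Before allocating anything, br_add_if rejects unsuitable candidates up front. Those guard checks can be condensed into a small Python sketch (the device representation and return values here are hypothetical, for illustration only):

```python
EINVAL, ELOOP, EBUSY, EOPNOTSUPP = "EINVAL", "ELOOP", "EBUSY", "EOPNOTSUPP"

def can_enslave(dev):
    """Mirror the early sanity checks in br_add_if()."""
    if dev.get("loopback") or dev.get("type") != "ether":
        return EINVAL        # only ethernet-like devices may be bridged
    if dev.get("is_bridge"):
        return ELOOP         # no bridging of bridges
    if dev.get("master"):
        return EBUSY         # already enslaved to another master device
    if dev.get("dont_bridge"):
        return EOPNOTSUPP    # device opted out (e.g. some wireless NICs)
    return "OK"

print(can_enslave({"type": "ether"}))                     # OK
print(can_enslave({"type": "ether", "is_bridge": True}))  # ELOOP
print(can_enslave({"type": "ether", "master": "bond0"}))  # EBUSY
```

Only after all of these pass does the kernel allocate the net_bridge_port and wire up the rx_handler.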
Attaching the device to the bridge

The key function that attaches the new device to the bridge is new_nbp:


/* called with RTNL but without bridge lock */
static struct net_bridge_port *new_nbp(struct net_bridge *br,
				       struct net_device *dev)
{
	// allocate the port object
	struct net_bridge_port *p;
	int index, err;

	index = find_portno(br);
	if (index < 0)
		return ERR_PTR(index);

	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (p == NULL)
		return ERR_PTR(-ENOMEM);

	// link the port object to the bridge
	p->br = br;
	dev_hold(dev);
	// link the port object to the enslaved device
	p->dev = dev;
	p->path_cost = port_cost(dev);
	p->priority = 0x8000 >> BR_PORT_BITS;
	// save the port number
	p->port_no = index;
	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
	br_init_port(p);
	br_set_state(p, BR_STATE_DISABLED);
	br_stp_port_timer_init(p);
	err = br_multicast_add_port(p);
	if (err) {
		dev_put(dev);
		kfree(p);
		p = ERR_PTR(err);
	}

	return p;
}

/* find an available port number */
static int find_portno(struct net_bridge *br)
{
	int index;
	struct net_bridge_port *p;
	unsigned long *inuse;

	// allocate a bitmap
	inuse = bitmap_zalloc(BR_MAX_PORTS, GFP_KERNEL);
	if (!inuse)
		return -ENOMEM;

	// bit 0 is reserved
	set_bit(0, inuse);	/* zero is reserved */
	list_for_each_entry(p, &br->port_list, list) {
		// mark used port numbers in the bitmap
		set_bit(p->port_no, inuse);
	}
	// find the first unused index
	index = find_first_zero_bit(inuse, BR_MAX_PORTS);
	bitmap_free(inuse);

	return (index >= BR_MAX_PORTS) ? -EXFULL : index;
}
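The port-number allocation scheme can be illustrated outside the kernel with a few lines of Python (a hypothetical simulation of find_portno, not kernel code): port number 0 is reserved, and the first unused index is handed out.

```python
BR_MAX_PORTS = 1 << 10  # assumed illustrative limit, mirroring 1 << BR_PORT_BITS

def find_portno(ports_in_use):
    """Mimic find_portno(): bit 0 reserved, return the first free index."""
    inuse = set(ports_in_use) | {0}       # zero is reserved
    for index in range(1, BR_MAX_PORTS):  # find_first_zero_bit analogue
        if index not in inuse:
            return index
    return -1                             # -EXFULL analogue: bridge is full

print(find_portno([]))         # 1
print(find_portno([1, 2, 4]))  # 3: first hole in the bitmap
```

The kernel uses a real bitmap plus find_first_zero_bit for the same effect, which keeps the scan O(n/64) words instead of a per-index set lookup.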
Registering the frame receive handler

The receive handler is registered via netdev_rx_handler_register:


/**
 *	netdev_rx_handler_register - register receive handler
 *	@dev: device to register a handler for
 *	@rx_handler: receive handler to register
 *	@rx_handler_data: data pointer that is used by rx handler
 *
 *	Register a receive handler for a device. This handler will then be
 *	called from __netif_receive_skb. A negative errno code is returned
 *	on a failure.
 *
 *	The caller must hold the rtnl_mutex.
 *
 *	For a general description of rx_handler, see enum rx_handler_result.
 */
int netdev_rx_handler_register(struct net_device *dev,
			       rx_handler_func_t *rx_handler,
			       void *rx_handler_data)
{
	if (netdev_is_rx_handler_busy(dev))
		return -EBUSY;

	if (dev->priv_flags & IFF_NO_RX_HANDLER)
		return -EINVAL;

	// store the handler and its private data
	/* Note: rx_handler_data must be set before rx_handler */
	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
	rcu_assign_pointer(dev->rx_handler, rx_handler);

	return 0;
}

Packet processing path

As we saw when analyzing the veth transmit/receive path, one end of a veth pair, upon receiving a packet, looks up its peer's net_device and calls netif_rx to push the packet up, eventually reaching __netif_receive_skb_core. Unlike the normal path, however, a veth that has been added to a bridge has an rx_handler installed via the netdev_rx_handler_register call shown above, so its packets are not delivered to the protocol stack.

static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
	// tcpdump packet taps are serviced here
	//...
	rx_handler = rcu_dereference(skb->dev->rx_handler);
	if (rx_handler) {
		if (pt_prev) {
			ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = NULL;
		}
		switch (rx_handler(&skb)) {
		case RX_HANDLER_CONSUMED:
			ret = NET_RX_SUCCESS;
			goto out;
		case RX_HANDLER_ANOTHER:
			goto another_round;
		case RX_HANDLER_EXACT:
			deliver_exact = true;
		case RX_HANDLER_PASS:
			break;
		default:
			BUG();
		}
	}
	//...
	// deliver to the protocol stack
	//...
out:
	/* The invariant here is that if *ppt_prev is not NULL
	 * then skb should also be non-NULL.
	 *
	 * Apparently *ppt_prev assignment above holds this invariant due to
	 * skb dereferencing near it.
	 */
	*pskb = skb;
	return ret;
}

After the packet taps (tcpdump) have been serviced, the rx_handler is invoked; if it returns RX_HANDLER_CONSUMED, processing stops immediately and the packet is never handed to the protocol stack.
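That dispatch can be condensed into a Python sketch (hypothetical names, illustrating only the control flow): a device with a registered handler decides the packet's fate before the protocol stack ever sees it, while taps observe everything.

```python
# Possible rx_handler verdicts, mirroring enum rx_handler_result
CONSUMED, ANOTHER, EXACT, PASS = range(4)

def netif_receive(dev, skb, taps):
    """Simplified __netif_receive_skb_core: taps first, then rx_handler."""
    while True:
        for tap in taps:            # tcpdump-style taps see every frame
            tap(skb)
        handler = dev.get("rx_handler")
        if handler:
            verdict, skb, dev = handler(skb, dev)
            if verdict == CONSUMED:  # e.g. the bridge forwarded it: stop here
                return "consumed"
            if verdict == ANOTHER:   # skb re-targeted at another device
                continue
        return "protocol_stack"      # normal delivery

seen = []
bridge_port = {"rx_handler": lambda skb, dev: (CONSUMED, skb, dev)}
plain_nic = {}
print(netif_receive(bridge_port, "frame1", [seen.append]))  # consumed
print(netif_receive(plain_nic, "frame2", [seen.append]))    # protocol_stack
print(len(seen))  # 2: the tap saw both frames
```

This is also why tcpdump on a bridged veth still shows traffic even though the protocol stack never receives it: the taps run before the rx_handler verdict is taken.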

Bridge frame handling

Now look at the implementation of br_handle_frame:


/*
 * Return NULL if skb is handled
 * note: already called with rcu_read_lock
 */
static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
{
	struct net_bridge_port *p;
	struct sk_buff *skb = *pskb;
	const unsigned char *dest = eth_hdr(skb)->h_dest;

	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
		return RX_HANDLER_PASS;

	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
		goto drop;

	skb = skb_share_check(skb, GFP_ATOMIC);
	if (!skb)
		return RX_HANDLER_CONSUMED;

	memset(skb->cb, 0, sizeof(struct br_input_skb_cb));

	// fetch the rx_handler_data installed earlier via
	// netdev_rx_handler_register, i.e. the net_bridge_port
	p = br_port_get_rcu(skb->dev);
	if (p->flags & BR_VLAN_TUNNEL) {
		if (br_handle_ingress_vlan_tunnel(skb, p,
						  nbp_vlan_group_rcu(p)))
			goto drop;
	}
	//...
forward:
	switch (p->state) {
	case BR_STATE_FORWARDING:
	case BR_STATE_LEARNING:
		// destination MAC matches the bridge's own address:
		// this packet is addressed to us
		if (ether_addr_equal(p->br->dev->dev_addr, dest))
			skb->pkt_type = PACKET_HOST;

		return nf_hook_bridge_pre(skb, pskb);
	default:
drop:
		kfree_skb(skb);
	}
	return RX_HANDLER_CONSUMED;
}

static int nf_hook_bridge_pre(struct sk_buff *skb, struct sk_buff **pskb)
{
#ifdef CONFIG_NETFILTER_FAMILY_BRIDGE
	//...
frame_finish:
	net = dev_net(skb->dev);
	br_handle_frame_finish(net, NULL, skb);
#else
	// look at the simple branch
	br_handle_frame_finish(dev_net(skb->dev), NULL, skb);
#endif
	return RX_HANDLER_CONSUMED;
}

/* note: already called with rcu_read_lock */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	// get the bridge port the veth is attached to
	struct net_bridge_port *p = br_port_get_rcu(skb->dev);
	enum br_pkt_type pkt_type = BR_PKT_UNICAST;
	struct net_bridge_fdb_entry *dst = NULL;
	struct net_bridge_mdb_entry *mdst;
	bool local_rcv, mcast_hit = false;
	struct net_bridge *br;
	u16 vid = 0;
	u8 state;

	if (!p || p->state == BR_STATE_DISABLED)
		goto drop;

	state = p->state;
	if (!br_allowed_ingress(p->br, nbp_vlan_group_rcu(p), skb, &vid,
				&state))
		goto out;

	nbp_switchdev_frame_mark(p, skb);

	/* insert into forwarding database after filtering to avoid spoofing */
	// L2 forwarding-table (FDB) learning
	br = p->br;
	if (p->flags & BR_LEARNING)
		br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, 0);

	local_rcv = !!(br->dev->flags & IFF_PROMISC);
	if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
		/* by definition the broadcast is also a multicast address */
		if (is_broadcast_ether_addr(eth_hdr(skb)->h_dest)) {
			pkt_type = BR_PKT_BROADCAST;
			local_rcv = true;
		} else {
			pkt_type = BR_PKT_MULTICAST;
			if (br_multicast_rcv(br, p, skb, vid))
				goto drop;
		}
	}

	if (state == BR_STATE_LEARNING)
		goto drop;

	BR_INPUT_SKB_CB(skb)->brdev = br->dev;
	BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);

	if (IS_ENABLED(CONFIG_INET) &&
	    (skb->protocol == htons(ETH_P_ARP) ||
	     skb->protocol == htons(ETH_P_RARP))) {
		br_do_proxy_suppress_arp(skb, br, vid, p);
	} else if (IS_ENABLED(CONFIG_IPV6) &&
		   skb->protocol == htons(ETH_P_IPV6) &&
		   br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) &&
		   pskb_may_pull(skb, sizeof(struct ipv6hdr) +
				 sizeof(struct nd_msg)) &&
		   ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
		struct nd_msg *msg, _msg;

		msg = br_is_nd_neigh_msg(skb, &_msg);
		if (msg)
			br_do_suppress_nd(skb, br, vid, p, msg);
	}

	switch (pkt_type) {
	case BR_PKT_MULTICAST:
		mdst = br_mdb_get(br, skb, vid);
		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
		    br_multicast_querier_exists(br, eth_hdr(skb))) {
			if ((mdst && mdst->host_joined) ||
			    br_multicast_is_router(br)) {
				local_rcv = true;
				br->dev->stats.multicast++;
			}
			mcast_hit = true;
		} else {
			local_rcv = true;
			br->dev->stats.multicast++;
		}
		break;
	case BR_PKT_UNICAST:
		// unicast lookup in the FDB
		dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, vid);
	default:
		break;
	}

	if (dst) {
		unsigned long now = jiffies;

		if (test_bit(BR_FDB_LOCAL, &dst->flags))
			return br_pass_frame_up(skb);

		if (now != dst->used)
			dst->used = now;
		// forward to the destination port
		br_forward(dst->dst, skb, local_rcv, false);
	} else {
		if (!mcast_hit)
			br_flood(br, skb, pkt_type, local_rcv, false);
		else
			br_multicast_flood(mdst, skb, local_rcv, false);
	}

	if (local_rcv)
		return br_pass_frame_up(skb);

out:
	return 0;
drop:
	kfree_skb(skb);
	goto out;
}
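Stripped of multicast, STP, and netfilter details, the core of br_handle_frame_finish is a textbook learning switch: learn the source MAC on the ingress port, forward known unicast destinations, and flood unknown ones. A minimal Python simulation (hypothetical, for illustration):

```python
class LearningBridge:
    """Learn source MAC -> ingress port; forward known unicast, else flood."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}                          # mac -> port

    def handle_frame(self, in_port, src_mac, dst_mac):
        self.fdb[src_mac] = in_port            # br_fdb_update()
        dst = self.fdb.get(dst_mac)            # br_fdb_find_rcu()
        if dst is not None and dst != in_port:
            return [dst]                       # br_forward() to one port
        return sorted(self.ports - {in_port})  # br_flood() everywhere else

br = LearningBridge(["veth1_p", "veth2_p", "veth3_p"])
# First frame: dst unknown, flooded out of every other port
print(br.handle_frame("veth1_p", "aa:aa", "bb:bb"))  # ['veth2_p', 'veth3_p']
# Reply: aa:aa was learned on veth1_p, so this is unicast-forwarded
print(br.handle_frame("veth2_p", "bb:bb", "aa:aa"))  # ['veth1_p']
```

This is exactly why the first ping packet in the experiment reaches all ports while subsequent traffic flows point-to-point: the ARP exchange populates the FDB.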
Forwarding path
/**
 * br_forward - forward a packet to a specific port
 * @to: destination port
 * @skb: packet being forwarded
 * @local_rcv: packet will be received locally after forwarding
 * @local_orig: packet is locally originated
 *
 * Should be called with rcu_read_lock.
 */
void br_forward(const struct net_bridge_port *to,
		struct sk_buff *skb, bool local_rcv, bool local_orig)
{
	//...
	if (should_deliver(to, skb)) {
		if (local_rcv)
			deliver_clone(to, skb, local_orig);
		else
			// forward
			__br_forward(to, skb, local_orig);
		return;
	}

out:
	if (!local_rcv)
		kfree_skb(skb);
}

static void __br_forward(const struct net_bridge_port *to,
			 struct sk_buff *skb, bool local_orig)
{
	struct net_bridge_vlan_group *vg;
	struct net_device *indev;
	struct net *net;
	int br_hook;

	vg = nbp_vlan_group_rcu(to);
	skb = br_handle_vlan(to->br, to, vg, skb);
	if (!skb)
		return;

	indev = skb->dev;
	// switch skb->dev to the destination device
	skb->dev = to->dev;
	if (!local_orig) {
		//...
		net = dev_net(indev);
	} else {
		//...
		net = dev_net(skb->dev);
		indev = NULL;
	}

	NF_HOOK(NFPROTO_BRIDGE, br_hook,
		net, NULL, skb, indev, skb->dev,
		br_forward_finish);
}

int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	skb->tstamp = 0;
	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
		       net, sk, skb, NULL, skb->dev,
		       br_dev_queue_push_xmit);
}

int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	skb_push(skb, ETH_HLEN);
	if (!is_skb_forwardable(skb->dev, skb))
		goto drop;

	br_drop_fake_rtable(skb);

	if (skb->ip_summed == CHECKSUM_PARTIAL &&
	    (skb->protocol == htons(ETH_P_8021Q) ||
	     skb->protocol == htons(ETH_P_8021AD))) {
		int depth;

		if (!__vlan_get_protocol(skb, skb->protocol, &depth))
			goto drop;

		skb_set_network_header(skb, depth);
	}

	dev_queue_xmit(skb);

	return 0;
	//...
}

Finally dev_queue_xmit is called; for a veth this lands in its ndo_start_xmit handler, veth_xmit, which as traced earlier looks up the peer at the other end and hands the packet over.
The entire bridge implementation lives at the device layer, which confirms that it is a layer-2 device.

Bridge VLAN support

The bridge itself supports VLANs. As shown in the figure below, with suitable configuration a physical NIC eth0 can be split into two sub-devices, eth0.10 and eth0.20, attached to two VLANs carved out of the bridge, with VLAN IDs 10 and 20. Likewise the two VMs' virtual NICs, vnet0 and vnet1, are attached to the corresponding VLANs. The net effect of this configuration is that VM1 can no longer talk to VM2: the two are isolated.
Figure borrowed from the reference:
(figure: bridge VLAN topology)
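The isolation can be sketched by giving each bridge port a VLAN membership and keying the FDB on (vlan, mac), so that learning, forwarding, and flooding all stay inside one VLAN (a hypothetical Python illustration, not the kernel's VLAN filtering code):

```python
class VlanBridge:
    """Ports belong to VLANs; frames never cross a VLAN boundary."""

    def __init__(self, port_vlan):
        self.port_vlan = dict(port_vlan)  # port -> vlan id
        self.fdb = {}                     # (vlan, mac) -> port

    def handle_frame(self, in_port, src_mac, dst_mac):
        vid = self.port_vlan[in_port]
        self.fdb[(vid, src_mac)] = in_port        # learn within the VLAN
        dst = self.fdb.get((vid, dst_mac))
        if dst is not None and dst != in_port:
            return [dst]
        # flood only to ports in the same VLAN
        return sorted(p for p, v in self.port_vlan.items()
                      if v == vid and p != in_port)

br = VlanBridge({"vnet0": 10, "eth0.10": 10, "vnet1": 20, "eth0.20": 20})
# A broadcast from vnet0 (VLAN 10) never reaches vnet1 (VLAN 20)
print(br.handle_frame("vnet0", "aa:aa", "ff:ff"))  # ['eth0.10']
```

Since vnet1 is never in the flood set and no (10, mac) entry can ever point at it, VM1 and VM2 stay isolated, matching the figure.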

Docker networking

Docker builds its network out of veth pairs, a bridge, and network namespaces.
A bridge is typically used to attach tap/tun devices or veth pairs, tying a group of containers or VMs together so they can reach one another; Docker uses a bridge to connect all the containers on a host. The host's physical NIC can also be added to the bridge, letting the VMs on the host talk to the outside world.
Container setups usually do not add the host's original eth0 to the bridge, whereas VM setups usually do. (Containers stay off the bridge for two reasons: first, the large number of container IPs could collide with the IPs of sibling VMs on the host's network; second, keeping eth0 separate preserves the flexibility to choose among the several available container network modes.)
For details, see 《跟唐老师学习云网络》 - Docker网络实现.

Container communication with external networks

Extension 1: Open vSwitch and OpenFlow

The openvswitch kernel module lives under net/openvswitch.

OpenFlow

Definition

OpenFlow is the communication protocol between the control plane and the forwarding plane in the Software Defined Networking (SDN) architecture; it separates the two planes through a standardized open interface. OpenFlow lets a controller directly access and manipulate the forwarding plane of network devices, whether those devices are physical or virtual.

Purpose

With the rapid growth of server virtualization on top of physical network infrastructure, the number of virtual machines keeps rising, network management grows ever more complex, and rolling out networks for new services becomes slower. This demands devices that are simple and flexible to operate, scale well, and allow centralized control and management of forwarding behavior. Traditional network devices, however, integrate the control plane and the forwarding plane: they scale poorly, have long technology refresh cycles, and are hard to control and manage centrally or to use for fast service deployment. SDN separates the control plane from the forwarding plane, and OpenFlow provides the communication between them, enabling centralized control and management of forwarding across the whole network and rapid deployment of new service networks.

Extension 2: the k8s network model, CNI, and Cilium

Below is a comparison of Docker networking and the k8s Pod network (as summarized by GPT):

- Network model: Docker offers multiple network drivers, limited to single-host or overlay; k8s mandates a flat network model with direct cluster-wide connectivity
- IP allocation: Docker containers usually reach the outside via NAT or a bridge; every k8s Pod gets its own IP, directly reachable within the cluster
- Communication isolation: inter-container traffic in Docker needs port mapping or custom networks; k8s Pods can reach each other by default (restrictable via NetworkPolicy)
- Multi-host communication: Docker needs an overlay network, which is complex to configure; k8s handles it automatically through the CNI plugin
- Implementation: Docker uses network namespace + veth + bridge/overlay; k8s uses CNI plugins at a higher level of abstraction
- Network policy: limited support in Docker; k8s supports fine-grained control via NetworkPolicy

Extension 3: OpenStack

Reference articles

The 跟唐老师学习云网络 series

《跟唐老师学云网络》—— 目录
《跟唐老师学习云网络》 - 网络命名空间 Network Namespace
《跟唐老师学习云网络》 - Veth网线
《跟唐老师学习云网络》 - TUN/TAP网线
《跟唐老师学习云网络》 - Bridge网桥
《跟唐老师学习云网络》 - OVS交换机
《跟唐老师学习云网络》 - Docker网络实现
《跟唐老师学习云网络》 - Kubernetes网络实现
《跟唐老师学习云网络》 - OpenStack网络实现

The 开发内功修炼 series

轻松理解 Docker 网络虚拟化基础之 veth 设备!
聊聊 Linux 上软件实现的“交换机” - Bridge!
动手实验+源码分析,彻底弄懂 Linux 网络命名空间
手工模拟实现 Docker 容器网络!

The 极客时间 (Geek Time) series

趣谈 Linux 操作系统
容器实战高手课

Others

从 Bridge 到 OVS,探索虚拟交换机
实战演练:Linux Bridge 与 Open vSwitch 对比配置全过程
Open vSwitch 文档(中文)
Openvswitch原理与代码分析
