
cpuset v1

What are cpusets ?

Cpusets provide a mechanism for assigning a set of CPUs and Memory

Nodes to a set of tasks.   In this document "Memory Node" refers to

an on-line node that contains memory.

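With cpusets, a task that sets its CPU affinity through the sched_setaffinity(2) system call, or its memory node policy through the mbind(2) and set_mempolicy(2) system calls, is constrained by the cpuset it belongs to: it cannot go beyond the CPUs and Memory Nodes that its cpuset allows. The kernel scheduler and page allocator enforce these limits using the cpus_allowed and mems_allowed fields of the task's cpuset.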

Why are cpusets needed ?

The management of large computer systems, with many processors (CPUs),

complex memory cache hierarchies and multiple Memory Nodes having

non-uniform access times (NUMA) presents additional challenges for

the efficient scheduling and memory placement of processes.

How are cpusets implemented ?

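Cpusets provide a kernel mechanism for restricting which CPUs and Memory Nodes a task or a group of tasks may use. Before cpusets were developed, the kernel already had related mechanisms: sched_setaffinity to set a task's CPU affinity, and mbind and set_mempolicy to set a task's memory placement policy. Cpusets extend these mechanisms as follows: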

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the

   kernel.

 - Each task in the system is attached to a cpuset, via a pointer

   in the task structure to a reference counted cgroup structure.

 - Calls to sched_setaffinity are filtered to just those CPUs

   allowed in that task's cpuset.

 - Calls to mbind and set_mempolicy are filtered to just

   those Memory Nodes allowed in that task's cpuset.

 - The root cpuset contains all the system's CPUs and Memory

   Nodes.

 - For any cpuset, one can define child cpusets containing a subset

   of the parent's CPU and Memory Node resources.

 - The hierarchy of cpusets can be mounted at /dev/cpuset, for

   browsing and manipulation from user space.

 - A cpuset may be marked exclusive, which ensures that no other

   cpuset (except direct ancestors and descendants) may contain

   any overlapping CPUs or Memory Nodes.

 - You can list all the tasks (by pid) attached to any cpuset.
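
As a rough illustration of the filtering described in the list above, the following Python sketch models how a requested CPU set is masked by the CPUs a cpuset allows. It is a user-space model, not kernel code, and the function name `filter_affinity` is hypothetical. It assumes the request is rejected when none of the requested CPUs fall inside the cpuset, which mirrors the EINVAL error documented for sched_setaffinity(2).

```python
def filter_affinity(requested, cpuset_cpus):
    """Model of the sched_setaffinity filter: keep only CPUs the cpuset allows."""
    effective = set(requested) & set(cpuset_cpus)
    if not effective:
        # Nothing of the request survives the mask, so the request is rejected.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return effective


cpuset_cpus = {0, 1, 2, 3}
# CPUs 4 and 5 lie outside the cpuset and are dropped from the request.
print(sorted(filter_affinity({2, 3, 4, 5}, cpuset_cpus)))  # prints [2, 3]
```

The same intersection applies to the Memory Nodes requested through mbind and set_mempolicy.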

The cpuset implementation only adds a few simple hooks to the kernel, none of them in performance-critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.

 - in fork and exit, to attach and detach a task from its cpuset.

 - in sched_setaffinity, to mask the requested CPUs by what's

   allowed in that task's cpuset.

 - in sched.c migrate_live_tasks(), to keep migrating tasks within

   the CPUs allowed by their cpuset, if possible.

 - in the mbind and set_mempolicy system calls, to mask the requested

   Memory Nodes by what's allowed in that task's cpuset.

 - in page_alloc.c, to restrict memory to allowed nodes.

 - in vmscan.c, to restrict page recovery to the current cpuset.

The /proc/<pid>/status file gains entries showing the task's cpus_allowed and mems_allowed, for example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff

  Cpus_allowed_list:      0-127

  Mems_allowed:   ffffffff,ffffffff

  Mems_allowed_list:      0-63
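
The two notations in the sample above carry the same information. The sketch below shows one way to decode them in Python. The helper names `parse_mask` and `parse_list` are hypothetical, and the mask parser assumes every comma-separated group after the first is zero-padded to eight hex digits, as in the sample.

```python
def parse_mask(mask):
    """Decode a hex mask such as 'ffffffff,ffffffff' into a set of ids.

    Groups are 32 bits each, most significant group first.
    """
    value = int(mask.replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Decode a list such as '0-3,8' into a set of ids."""
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        # A single id such as '8' has no upper bound, so reuse the lower one.
        ids.update(range(int(low), int(high or low) + 1))
    return ids


# Both fields of the sample describe the same 128 CPUs.
print(parse_mask("ffffffff,ffffffff,ffffffff,ffffffff") == parse_list("0-127"))  # prints True
```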

Each cpuset directory contains the following files:

 - cpuset.cpus: list of CPUs in that cpuset

 - cpuset.mems: list of Memory Nodes in that cpuset

 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes

 - cpuset.cpu_exclusive flag: is cpu placement exclusive?

 - cpuset.mem_exclusive flag: is memory placement exclusive?

 - cpuset.mem_hardwall flag:  is memory allocation hardwalled?

 - cpuset.memory_pressure: measure of how much paging pressure in cpuset

 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes

 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes

 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset

 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

A cpuset is in effect a "soft partition" of a complex system's CPU and memory resources.

A task created by fork automatically inherits its parent's cpuset, but a task can also be re-attached to a different cpuset.

Each cpuset also obeys the following rules:

 - Its CPUs and Memory Nodes must be a subset of its parent's.

 - It can't be marked exclusive unless its parent is.

 - If its cpu or memory is exclusive, they may not overlap any sibling.
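
The three rules above can be modelled in a few lines of Python. This is only a sketch of the CPU side of the rules (Memory Nodes follow the same pattern). The `Cpuset` class and the `check_child` function are hypothetical names used for illustration, not kernel structures.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    cpus: set
    cpu_exclusive: bool = False
    children: list = field(default_factory=list)


def check_child(parent, child):
    """Return the rule violations for placing child under parent."""
    problems = []
    if not child.cpus <= parent.cpus:
        problems.append("cpus not a subset of parent")
    if child.cpu_exclusive and not parent.cpu_exclusive:
        problems.append("exclusive child under non-exclusive parent")
    for sibling in parent.children:
        # Exclusivity on either side forbids any overlap between siblings.
        exclusive = child.cpu_exclusive or sibling.cpu_exclusive
        if exclusive and child.cpus & sibling.cpus:
            problems.append("overlaps an exclusive sibling")
            break
    return problems


root = Cpuset({0, 1, 2, 3}, cpu_exclusive=True)
root.children.append(Cpuset({0, 1}, cpu_exclusive=True))
print(check_child(root, Cpuset({1, 2})))  # prints ['overlaps an exclusive sibling']
```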

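In the root cpuset (top_cpuset), the cpus and mems files are read-only. The cpus file automatically tracks the value of cpu_online_mask through a CPU hotplug notifier, and the mems file automatically tracks the value of node_states[N_MEMORY].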

What are exclusive cpusets ?

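Exclusive means what the name suggests: no other cpuset may share the resources (CPUs, Memory Nodes) of an exclusive cpuset. Its parent and child cpusets are the exception and may overlap with it.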

What is sched_load_balance ?

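The kernel scheduler balances load automatically, but when it moves a task to a target CPU it is constrained by the task's cpuset and by any affinity set with sched_setaffinity.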

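Load balancing has a cost: not only the load balancing algorithm itself, but also the cached data that a migrated task loses. The more complex the system (more tasks, more CPUs, NUMA, and so on), the higher that cost. The kernel scheduler therefore divides the CPUs into separate partitions called sched domains, and balances load only within each sched domain. Some CPUs belong to no sched domain at all (the isolcpus case), so no load balancing is done on them.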

By default there is a single root sched domain covering all CPUs, including those listed in isolcpus, although the isolcpus CPUs do not take part in load balancing.

Load balancing can be turned off for the root sched domain, mainly for two reasons:

 1) On large systems, load balancing across many CPUs is expensive.

    If the system is managed using cpusets to place independent jobs

    on separate sets of CPUs, full load balancing is unnecessary.

 2) Systems supporting realtime on some CPUs need to minimize

    system overhead on those CPUs, including avoiding task load

    balancing if that is not needed.

If "cpuset.sched_load_balance" is set to 1 in the top cpuset, the scheduler balances load across all CPUs. In that case the sched_load_balance setting of any sub-cpuset has no effect, because the system is already fully load balanced.

Given the considerations above, the top cpuset flag "cpuset.sched_load_balance" should be disabled, and only some of the smaller, child cpusets should have this flag enabled.

There is an impedance mismatch here, between cpusets and sched domains.

Cpusets are hierarchical and nest.  Sched domains are flat; they don't

overlap and each CPU is in at most one sched domain.

The passage above needs some explanation:

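It does not mean that sched domains are literally flat (in fact, sched domains are hierarchical too). It means that the lowest-level sched domains are flat and do not overlap. Cpusets are different: the cpus of several cpusets at the same level may overlap.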

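In general, cpusets directly influence how sched domains are built. If a cpuset has "cpuset.sched_load_balance" set to 1, the kernel will most likely build a sched domain from that cpuset. The following case is special, however: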

So if each of two partially

overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we

form a single sched domain that is a superset of both.

This mismatch is why there is not a simple one-to-one relation

between which cpusets have the flag "cpuset.sched_load_balance" enabled,

and the sched domain configuration.
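
The merging of overlapping cpusets described above can be sketched as follows. This is a simplified model in Python, not the kernel's algorithm: `build_partitions` is a hypothetical name, and the input is assumed to be the CPU sets of all cpusets that have 'cpuset.sched_load_balance' enabled.

```python
def build_partitions(cpusets):
    """Merge overlapping CPU sets until all partitions are pairwise disjoint."""
    partitions = []
    for cpus in cpusets:
        merged = set(cpus)
        if not merged:
            continue  # a cpuset with no CPUs contributes nothing
        disjoint = []
        for part in partitions:
            if part & merged:
                merged |= part  # overlap: fold the existing partition in
            else:
                disjoint.append(part)
        partitions = disjoint + [merged]
    return partitions


# {0,1,2} and {2,3} share CPU 2, so they end up in one partition.
print(build_partitions([{0, 1, 2}, {2, 3}, {6, 7}]))  # prints [{0, 1, 2, 3}, {6, 7}]
```

Each resulting partition is a superset of the cpusets folded into it, which is why the flag and the sched domain configuration do not map one-to-one.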

The cpuset code and the scheduler interact to build sched domains as follows:

The cpuset code builds a new such partition and passes it to the

scheduler sched domain setup code, to have the sched domains rebuilt

as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,

 - or CPUs come or go from a cpuset with this flag enabled,

 - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs

   and with this flag enabled changes,

 - or a cpuset with non-empty CPUs and with this flag enabled is removed,

 - or a cpu is offlined/onlined.

The scheduler remembers the currently active sched domain partitions.

When the scheduler routine partition_sched_domains() is invoked from

the cpuset code to update these sched domains, it compares the new

partition requested with the current, and updates its sched domains,

removing the old and adding the new, for each change.
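
The comparison step described above amounts to a set difference between the current and the requested partitions. The Python sketch below illustrates the idea only; `diff_partitions` is a hypothetical helper and does not reflect how partition_sched_domains() is written.

```python
def diff_partitions(current, new):
    """Return (to_remove, to_add) when moving from current to new partitions."""
    current_set = {frozenset(part) for part in current}
    new_set = {frozenset(part) for part in new}
    # Partitions present in both are left untouched.
    return current_set - new_set, new_set - current_set


to_remove, to_add = diff_partitions(
    current=[{0, 1}, {2, 3}],
    new=[{0, 1}, {2, 3, 4}],
)
print(to_remove == {frozenset({2, 3})})  # prints True
print(to_add == {frozenset({2, 3, 4})})  # prints True
```

Only the partitions that differ are torn down and rebuilt, so an unchanged sched domain such as {0, 1} is not disturbed.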

How do I use cpusets ?

If a cpuset has its Memory Nodes modified, then for each task attached

to that cpuset, the next time that the kernel attempts to allocate

a page of memory for that task, the kernel will notice the change

in the task's cpuset, and update its per-task memory placement to

remain within the new cpuset's memory placement.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset

will have its allowed CPU placement changed immediately.  Similarly,

if a task's pid is written to another cpuset's 'tasks' file, then its

allowed CPU placement is changed immediately.  If such a task had been

bound to some subset of its cpuset using the sched_setaffinity() call,

the task will be allowed to run on any CPU allowed in its new cpuset,

negating the effect of the prior sched_setaffinity() call.

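In short: if the Memory Nodes of a task's cpuset change, the kernel applies the change the next time the task allocates memory; if the 'cpuset.cpus' of a task's cpuset changes, the scheduler applies the change immediately.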

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset

 2) mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

 3) Create the new cpuset by doing mkdir's and write's (or echo's) in

    the /sys/fs/cgroup/cpuset virtual file system.

 4) Start a task that will be the "founding father" of the new job.

 5) Attach that task to the new cpuset by writing its pid to the

    /sys/fs/cgroup/cpuset tasks file for that cpuset.

 6) fork, exec or clone the job tasks from this founding father task.
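
Steps 3 and 5 above can also be scripted. The following Python sketch assumes the cpuset hierarchy is already mounted (steps 1 and 2) and that the caller has permission to write to it. The function names, the cpuset name "job1", and the CPU and node values are hypothetical; only the file names cpuset.cpus, cpuset.mems and tasks come from the text above.

```python
import os
from pathlib import Path


def create_cpuset(root, name, cpus, mems):
    """Step 3: create a child cpuset and assign its CPUs and Memory Nodes."""
    cpuset_dir = Path(root) / name
    cpuset_dir.mkdir(exist_ok=True)
    (cpuset_dir / "cpuset.cpus").write_text(cpus)
    (cpuset_dir / "cpuset.mems").write_text(mems)
    return cpuset_dir


def attach(cpuset_dir, pid):
    """Step 5: attach a task by writing its pid to the cpuset's tasks file."""
    (Path(cpuset_dir) / "tasks").write_text(str(pid))


# Hypothetical usage on a system with the hierarchy mounted:
#   job = create_cpuset("/sys/fs/cgroup/cpuset", "job1", cpus="2-3", mems="0")
#   attach(job, os.getpid())
# Tasks forked from the attached task afterwards start in the same cpuset.
```

Passing the mount point as a parameter keeps the sketch independent of where the hierarchy is mounted.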
