Released to the public domain, by Paul Jackson, SGI, 2 July 2001.
Revised and released again to the public domain, except for portions marked subject to the terms and conditions of the GNU General Public License, by Paul Jackson, SGI, 8 October 2001, 31 October 2001, 14 November 2001, and 21 December 2001.
The following Design Notes document CpuMemSets.
CpuMemSets provide an API and implementation for Linux that will support the Processor and Memory Placement semantics required for optimum performance on Linux NUMA systems.
These Notes describe the kernel system mechanisms and interfaces appropriate to support process memory placement and CPU scheduling affinity. This is the capability to place task scheduling and memory allocation onto specific processors and memory blocks and to select and specify the operating parameters of certain related kernel task scheduling and memory allocation policies.
Vendor neutral base
The recommended kernel APIs should be vendor neutral. It is more important that the Linux community, and Linus, view the proposed APIs as sensible to add to mainstream Linux, for a variety of small and large systems, than that any given proprietary Unix-based interface, such as might be available in Irix, DYNIX/ptx, Tru64, HP-UX, AIX, or others, be emulated. It is important that some common basis for NUMA system resource management be agreed to by the main NUMA system developers.
Respect classic kernel-user split
Since these APIs will serve as the basis on which a variety of system and application vendors develop substantial capabilities, these interfaces should reflect the best available classic Unix design philosophy.
Kernel patch suitable for the mainstream Linux kernel, on systems of all sizes.
Sufficient, essential and minimal primitives for large systems
The primitive notions of interest are hardware processors (CPUs) and memory, and their corresponding software constructs of tasks and vm areas.
Observe that we do not include nodes here. One architecture currently popular for NUMA systems constructs systems from a number of nodes, where each node has some memory and a few CPUs. Within such a node, all CPUs and all memory are equivalent. In particular, any cache is either for the entire node, or for a single CPU. However, other architectures currently in development may have multiple CPU cores on one die, sharing a die-level cache, and multiple such dies per node. In such architectures, simply knowing which CPUs and memory are on which nodes is insufficient to determine the NUMA behavior of a system. One must also know which CPU cores are sharing die-level (sub-node) caches. That is an example of the reason this CpuMemSet design focuses on CPUs and memory blocks, not on nodes. Nodes are not as stable a concept across architectures.
We want to enable administrators to control the allocation of a system's resources, in this case of processors and memory blocks, to tasks and vm areas. We want to enable architects to control the use of the processors on which the application's tasks execute and the preferred memory block from which a task's vm areas obtain system memory. We want to provide a common, vendor-neutral basis for such facilities as dplace, RunOn, cpusets and nodesets.
Previous attempts to provide NUMA friendly processor and memory allocation have failed because either they didn't provide enough, or they provided too much.
Attempts to put cpusets, nodesets, topologies, various scheduling policies, quads, and such into the kernel are typically trying to place too much complexity, or too much policy, or perhaps excessively vendor specific details, into the kernel.
The cpus_allowed bit vector (added by Ingo for the Tux kernel-based web server, and used in the task scheduling loop to control which CPUs a task can run on) is a step in the right direction, but exposes an application to the specific details of which system CPU numbers it is running on, which is not quite abstract or virtual enough. In particular, the application-focused logic to use certain application CPUs or memories assigned to it in various ways (run this task here, put that shared memory region there) is confused with the system administrator's decision to assign a set of resources (CPU and memory) to a job. Also, cpus_allowed is limited to 32 or 64 CPUs, and will need to change for larger systems. Such a change should not impact the primary kernel API used to manage CPU and memory placement for NUMA and other large systems.
The key contribution of this proposal to advancing our ability to manage the complexities and conflicting demands on the CPU and memory placement mechanism in Linux is a suggested layering of the implementation.
There are several other proposals and implementations addressing these same needs. Some of these proposals have a tendency, in this author's view, to attempt to address:
all with a single body of changes to kernel code, providing a single API. This results in solutions that are overly constrained, that differ in essential ways, and that do not co-exist easily with other solutions. It has also increased the risk of kernel code duplication, with variations of code for CONFIG_NUMA or CONFIG_DISCONTIGMEM parallel to the mainstream code.
This proposal adds some additional structure, with some generic and flexible interfaces designed to separate and isolate the diverse and conflicting demands on the design, so that, for example, the requirements for hot swapping CPUs do not impact the application API, and the requirements to support existing legacy APIs do not impact the details of critical allocation and scheduling code in the kernel.
This implementation proposes to add two layers, cpumemsets and cpumemmaps, in the following structure:
1. Existing scheduling and allocation code: the current kernel task scheduler and page allocator, which continue to work in terms of system CPU and memory block numbers and existing mechanisms such as the cpus_allowed mask and zone lists.
2. cpumemmap: the lower layer, a simple pair of maps translating the system CPU and memory block numbers used by the kernel into the application CPU and memory block numbers seen by a process, vm area, or the kernel's own allocations.
3. cpumemset: the upper layer, which specifies on which application CPUs a task may be scheduled, and in which application memory blocks, searched in which order, memory may be allocated.
4. Existing placement APIs: user-level emulations of legacy placement interfaces, such as dplace, RunOn, cpusets and nodesets, implemented on top of the cpumemmap and cpumemset calls.
Sure, you can use CpuMemSets directly in your application. But it is not the primary purpose of the CpuMemSets API or kernel mechanism to directly support applications.
If you find that the current CpuMemSets API is better suited for expressing your application's processor and memory placement needs than anything else available, good. But if you find the API to be too cumbersome and primitive or otherwise ill suited for convenient use by your application, then find or develop a decent library and API that is easier to use in your circumstances.
Hopefully, that library and API will depend on the CpuMemSets API and kernel mechanism.
As of this writing (21 December 2001), there are three major capabilities that CpuMemSets doesn't provide, but that are also needed to solve related needs, and that will likely impact the use, design or implementation of CpuMemSets.
1. Distances and Topology:
2. Grouping:
3. Dynamic Scheduling and Process Migration:
Early in the boot sequence, before the normal kernel memory allocation routines are usable, the kernel sets up a single default cpumemmap and cpumemset. If no action is ever taken by user level code to change them, this one Map and one Set will apply to the kernel and all processes and vm areas for the life of that system boot.
By default, this Map includes all CPUs and memory blocks, and this Set allows scheduling on all CPUs and allocation on all blocks. A hook is provided to allow for an architecture specific routine to initialize this Map and Set. This hook could be used to properly sort the kernel cpumemset memory lists so that initial kernel data structures are allocated on the desired nodes.
An optional kernel boot parameter causes this initial Map and Set to include only one CPU and one memory block, in case the administrator or some system service will be managing the remaining CPUs and blocks in some specific way. This boot parameter is provided to the above hook for the use of the architecture specific initialization routine.
As soon as the system has booted far enough to run the first user process, init(1M), an early init script may be invoked that examines the topology and metrics of the system, and establishes optimized cpumemmap and cpumemset settings for the kernel and for init. Prior to that, various kernel daemons are started and kernel data structures allocated, which may allocate memory without the benefit of these optimized settings. This reduces the amount of knowledge that the kernel need have of special topology and distance attributes of a system, in that the kernel need only know enough to get early allocations placed correctly. More esoteric topology awareness can be kept in userland.
System administrators and services with root privileges manage the initial allocation of system CPUs and memory blocks to cpumemmaps, deciding which applications will be allowed the use of which CPUs and memory blocks. They also manage the cpumemset for the kernel, which specifies what order to search for kernel memory, depending on which CPU is executing the request. For an optimal system, the cpumemset for the kernel should probably sort the memory lists for each CPU by distance from that CPU.
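For example, a root-privileged service could install a kernel cpumemset whose per-CPU memory lists are sorted by distance. The following is a minimal sketch, assuming a hypothetical two-node machine whose kernel cpumemmap is the default identity map, with application CPUs 0-1 near memory block 0 and application CPUs 2-3 near memory block 1:

#include <stdio.h>
#include "cpumemsets.h"

int main(void)
{
        /* CPUs 0-1 (plus the required CMS_DEFAULT_CPU entry) prefer block 0, then 1 */
        cms_acpu_t cpus01[] = { 0, 1, CMS_DEFAULT_CPU };
        cms_amem_t mems01[] = { 0, 1 };

        /* CPUs 2-3 prefer block 1, then 0 */
        cms_acpu_t cpus23[] = { 2, 3 };
        cms_amem_t mems23[] = { 1, 0 };

        cms_memory_list_t lists[] = {
                { 3, cpus01, 2, mems01 },
                { 2, cpus23, 2, mems23 },
        };

        /* The CPU portion of a kernel cpumemset may be left empty; the kernel
         * uses the memory list of whichever CPU executes the allocation. */
        cpumemset_t cms = { CMS_DEFAULT, 0, NULL, 2, lists };

        /* CMS_KERNEL is root-only; pid, start and len are not used. */
        if (cmsSetCMS(CMS_KERNEL, 0, NULL, 0, &cms) < 0) {
                perror("cmsSetCMS(CMS_KERNEL)");
                return 1;
        }
        return 0;
}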
Almost all ordinary applications will be unaware of CpuMemSets, running in whatever CPUs and memory blocks their inherited cpumemmap and cpumemset dictate.
But major multi-processor applications can take advantage of CpuMemSets, probably via existing legacy APIs, to control the placement of the various processes and memory regions that the application manages. Emulators for whatever API the application is using will convert these requests into cpumemset changes, giving the application detailed control of the CPUs and memory blocks made available to it by its cpumemmap.
On systems supporting hot-swap of CPUs (or even memory, if someone can figure that out) the system administrator would be able to change CPUs and remap by changing the application's cpumemmap, without the application being aware of the change.
The role of a System Manager with regard to the system's processors and memory is to allocate portions of the system to various applications, usually with simple default policies, such as "spread things out evenly", and occasionally with more precision, controlling exactly which CPUs and memories a particular application uses.
The role of an Application Architect in this regard is to specify, for a given application, just which of the CPUs and memory available to it are used to schedule which tasks and to satisfy which memory allocations.
The System Manager is managing a particular physical computer system, preferring to remain relatively oblivious to the innards of applications, and the Application Architect is managing the details of task and memory usage within a single application, preferring to ignore the details of the particular system being used to execute the application.
The System Manager does not usually care whether the application puts two particular threads on the same CPU or different, and the Application Architect does not care whether that CPU is number 9 or number 99 in the system.
Paul Dorwin of IBM is working on the topology subsystem for Linux. As of November 14, 2001, he has published a design document on the lse-tech (Linux Scalability Effort on SourceForge) mailing list. We intend for this CpuMemSet design to closely track Dorwin's work.
We cover some basic notions of distance here, and anticipate that Dorwin's work, done in concert with CpuMemSets, will support these notions. We cover this here, even though it is not part of CpuMemSets, because it is usually involved in attempts to solve these needs, and we want to be clear that we recognize its importance.
The kernel provides information, via /proc, of the number of CPUs and memory blocks, and of the distance between them, so that sufficiently intelligent system administrators and services can assign "closely" placed CPUs and memory blocks (perhaps all on the same node or quad) to the same cpumemset, for optimal performance. But the kernel has no notion (for the purpose of CpuMemSets) of topology, nodes or quads, with the possible exception of architecture specific code that sets up the initial kernel cpumemmap and cpumemset. Nor does the kernel task scheduler or memory allocation code pay any attention to this distance, with the possible exception of more dynamic scheduler or allocator mechanisms, distinct from CpuMemSets. The kernel just reports topology and distances to the user code.
Processors are separate, parallel scheduled, general purpose execution units. Memory blocks are partition classes of physical general purpose system RAM, such that any two distinct locations within the same block are the same distance from all processors, while for any two separate blocks there is typically at least one processor such that the two blocks are at a different distance from that processor. The distance from a given processor to a given memory block is a scalar approximation of that memory's latency and bandwidth, when accessed from that processor. The longer the latency and the lower the bandwidth, the higher the distance. For Intel IA64 systems, we expect to make use of the ACPI support for distances, and to use a distance metric that is scaled to make the closest <processor, memory> pair be at a distance of 10 from each other.
Not all the processing or memory elements visible on the system bus are general purpose. There may be I/O buffer memory, DMA engines, vector processors and frame buffers. We might care about the distance from any processing element, whether a general purpose CPU or not, to any memory element, whether system RAM or not.
In addition to <CPU, mem> distances, we also require <CPU, CPU> distances. The <CPU, CPU> distance is a measure of how costly it would be, due to caching effects, to move a task that is executing on one CPU (and has considerable cache presence there) to the other CPU. These distances reflect the impact of the system caches: two processors sharing a major cache are closer. The scheduler should be more reluctant to reschedule to a CPU further away, and two tasks communicating via shared memory will want to stay on CPUs that are close to each other, in addition to being close to the shared memory.
On most systems, it is probably not worth attempting to estimate how much presence a task might have in the caches of the CPU it most recently ran on. Rather, the scheduler should simply be reluctant to change the CPU on which a task is scheduled, perhaps with reluctance proportional to the <CPU, CPU> distance. For larger systems having relatively (to bus speed) faster CPUs relying more heavily on the caches, it will become worthwhile to include an estimate of cache occupancy when deciding whether to change CPU.
The ACPI standard describes an NxN table of distances between N nodes, under the assumption that a system consists of several nodes, each node having some memory and one or a few CPUs, with all CPUs on a node equidistant from all else. Kanoj, as part of the LinuxScalabilityEffort, has proposed a PxP distance vector between any two of P processors. The above provides P distinct M-length distance vectors, one for each processor, giving the distance from that processor to each of M Memory blocks, and P distinct P-length distance vectors for each processor, giving the distance from that processor to each of the P processors. The implementation should be based on ACPI where that is available, and derive what else is needed from other, potentially architecture specific detail.
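Purely as an illustration of the shape of this data (the /proc format itself is not specified here, and this structure is not part of the CpuMemSets API), the combined distance information could be held in user space roughly as follows:

/* Hypothetical user-space representation of the distance data described
 * above; the field names are illustrative only. */
typedef struct cms_distances {
        int nr_cpus;        /* P: number of system CPUs */
        int nr_mems;        /* M: number of system memory blocks */
        int *cpu_to_mem;    /* P vectors of M entries: distance from CPU i to memory block j */
        int *cpu_to_cpu;    /* P vectors of P entries: cost of moving a cache-warm task from CPU i to CPU j */
} cms_distances_t;

/* Distance from system CPU c to memory block m (closest pair scaled to 10 on IA64/ACPI): */
/*     d->cpu_to_mem[c * d->nr_mems + m]                                                   */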
On a large NUMA system, administrators may want to control which subset of processors and memory is devoted to which major application. This can be done using "hard" partitions, where subsets of the system are booted using separate system images, and the partitions act as a cluster of distinct computers, rather than a single system image computer. Doing so partially defeats the advantages of a large SMP or NUMA machine. At times it would be nice to be able to carve out more flexible, possibly overlapping, partitions of the system's CPUs and memory, allowing all processes to see a single system image, without rebooting, but while still guaranteeing certain CPU and memory resources to selected applications at various times.
CpuMemSets provide the System Administrator substantial control over system processor and memory resources without the attendant inflexibility of hard partitions.
On a system supporting CpuMemSets, all processes have their scheduling constrained by their cpumemmap and cpumemset. The kernel simply will not schedule a process on a CPU that is not allowed by its cpumemmap and cpumemset. The Linux task scheduler must support a mechanism, such as the cpus_allowed bit vector, to control on which CPUs a task may be scheduled.
Similarly, all memory allocation is constrained by the cpumemmap and cpumemset associated with the kernel or the vm area requesting the memory, except for specific requests within the kernel. The Linux page allocation code has been changed to search only in the memory blocks allowed by the vm area requesting memory. If memory is not available in the specified memory blocks, then the allocation must fail or sleep, awaiting memory. The search for memory will not consider other memory blocks in the system.
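The following user-space sketch (an illustration of the described constraint, not kernel code) shows how a memory list is chosen from a cpumemset: the list for the executing CPU if that CPU appears in the set, otherwise the required CMS_DEFAULT_CPU list; only the blocks on the chosen list are then searched, in order.

#include <stddef.h>
#include "cpumemsets.h"

/* Return the memory list that would be searched when the given application
 * CPU executes an allocation against this cpumemset. */
cms_memory_list_t *choose_memory_list(cpumemset_t *cms, cms_acpu_t cpu)
{
        cms_memory_list_t *fallback = NULL;
        int i, j;

        for (i = 0; i < cms->nr_mems; i++) {
                cms_memory_list_t *ml = &cms->mems[i];
                for (j = 0; j < ml->nr_cpus; j++) {
                        if (ml->cpus[j] == cpu)
                                return ml;      /* list for the executing CPU */
                        if (ml->cpus[j] == CMS_DEFAULT_CPU)
                                fallback = ml;  /* required default list */
                }
        }
        return fallback;        /* only blocks on this list are ever searched */
}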
It is this "mandatory" nature of cpumemmaps and cpumemsets that makes it practical to provide many of the benefits of hard partitioning, in a dynamic single system image environment.
Because, as described below, cpumemmaps do not have system-wide names, one cannot create them ahead of time, during system initialization, and then later attach to them by name.
Rather, the following scenarios provide examples of how to attach cpumemmaps to major system services.
1. Some boot script starts up a major service, on some particular subset of the machine (its own cpumemmap). That script could set its *child* Map to the cpumemmap desired for the major service it was spawning, and then fork/exec the service. Or if the service has root privilege, it could modify its own cpumemmaps as it saw fit.
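A minimal sketch of the first scenario, assuming a root-privileged launcher, a hypothetical service binary /usr/sbin/bigserv, and already-known system CPU and memory block numbers. The *child* cpumemset is trimmed first, so that installing the smaller *child* map does not fail with ENOENT:

#include <stdio.h>
#include <unistd.h>
#include "cpumemsets.h"

int main(void)
{
        /* Restrict the *child* cpumemset first, so it no longer references
         * application CPUs or memory blocks beyond the smaller map below. */
        cms_acpu_t acpus[] = { 0, 1, 2, 3 };
        cms_acpu_t lcpus[] = { 0, 1, 2, 3, CMS_DEFAULT_CPU };
        cms_amem_t amems[] = { 0 };
        cms_memory_list_t list = { 5, lcpus, 1, amems };
        cpumemset_t cms = { CMS_DEFAULT, 4, acpus, 1, &list };

        /* Then give the service system CPUs 4-7 and memory block 1; root
         * privilege is needed to add CPUs or memory not already in our map. */
        cms_scpu_t scpus[] = { 4, 5, 6, 7 };
        cms_smem_t smems[] = { 1 };
        cpumemmap_t map = { 4, scpus, 1, smems };

        if (cmsSetCMS(CMS_CHILD, 0, NULL, 0, &cms) < 0 ||
            cmsSetCMM(CMS_CHILD, 0, NULL, 0, &map) < 0) {
                perror("cmsSetCMS/cmsSetCMM (CMS_CHILD)");
                return 1;
        }

        /* The forked service inherits these as its *current* and *child*
         * map and set. */
        if (fork() == 0) {
                execl("/usr/sbin/bigserv", "bigserv", (char *)0);
                _exit(127);
        }
        return 0;
}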
2. Some higher level API maintains its own named space of "virtual systems", and its own notion of what users or applications are permitted to run on which virtual system. Such a "virtual system" might include a certain number of CPUs and memory blocks, and perhaps other system resources managed by other means. Perhaps permissions depend on the requester's ability to read or write a file; perhaps they rest on other mechanisms not obvious to the kernel.
Perhaps some root-privileged daemon is running that is responsible for managing these virtual systems defined by this API, or perhaps some non-root daemon is running with access to all the CPUs and memory blocks that might be used for this service, which it can parcel out to service users as it sees fit.
When some process (user's agent or application) asks to run on one of these named virtual systems, and is granted permission to do so by the daemon, then the daemon either:
The Bulk Remap call rewrites any affected map to replace particular system CPU or memory block numbers with other numbers. The Bulk Remap operation can affect just the maps referenced by one process (*current*, *child* and any attached vm areas), or all sharing those maps, or the kernel map, or all maps in the system. The processing of this call forces the kernel to recompute any cached hints in the scheduler (for example, the "cpus_allowed" setting) or allocator (for example, the zone lists), so this call, particularly with the CMS_BULK_SHARE or CMS_BULK_ALL options, can impact system performance for a brief time, while it is being executed.
In other words, the Bulk Remap operation changes the system CPU and memory block numbers that are in the affected maps. You pass in a list of substitutions to be made. The change is made in place, affecting all tasks and vm areas sharing the affected map.
For example, you could ask that each appearance of system CPU 7 in the affected maps be changed to CPU 5. If you did that example to all the Maps in the system (using CMS_BULK_ALL) then immediately nothing would be scheduled on CPU 7 anymore. Bit 7 in cpus_allowed would be cleared in all tasks, and any cpus_allowed that had bit 7 on would get bit 5 set instead.
The CPU substitutions to be made are passed in as a pair of equal length lists, one with the old system CPU numbers, the other with the corresponding new numbers. The memory block substitutions are passed in with another such pair of equal length lists.
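For example, the CPU 7 to CPU 5 substitution described above could be requested as in the following sketch (CMS_BULK_ALL requires root privilege; that the pid argument is ignored for CMS_BULK_ALL is an assumption here):

#include <stdio.h>
#include "cpumemsets.h"

int main(void)
{
        cms_scpu_t oldcpus[] = { 7 };   /* every appearance of system CPU 7 ...    */
        cms_scpu_t newcpus[] = { 5 };   /* ... becomes system CPU 5                */
        cms_remap_vector_t v = { 1, oldcpus, newcpus,   /* one CPU substitution    */
                                 0, NULL, NULL };       /* no memory substitutions */

        /* Apply to every cpumemmap in the system. */
        if (cmsBulkRemap(CMS_BULK_ALL, 0, &v) < 0) {
                perror("cmsBulkRemap");
                return 1;
        }
        return 0;
}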
Cpumemmaps and cpumemsets have "copy-on-write" semantics within the kernel. When they are propagated across fork, exec, vm area creation and other such kernel operations, usually just another kernel link is created to them, and their reference counter incremented. Most operations to change (set) maps and sets cause a copy to be made and the changes to be applied to that copy, with just the current link, while the reference counter on the original copy is decremented. The Bulk Remap feature, above, takes advantage of this natural sharing of maps and sets, and allows for changing a map in place, affecting all tasks and vm areas linked to that map.
Each cpumemset has an associated cpumemmap. When changing a cpumemmap, you select which one to change by specifying the same choices and related parameters (optional virtual address or pid) as when changing a cpumemset.
After changing a single cpumemmap with a cmsSetCMM() call, then that cpumemmap will no longer be shared by any other cpumemset. Only the cpumemset you went through to get to the cpumemmap will have a reference to the new changed cpumemmap. It would be an error if the changed cpumemmap didn't supply enough CPUs or memory blocks to meet the needs of the single cpumemset using it.
When changing multiple cpumemmaps with a Bulk operation, the changed cpumemmap must have enough CPUs and memory blocks for all the cpumemsets that will be sharing that cpumemmap after the change.
Yes, though constructing properly sorted memory lists for cpumemsets is tedious, most applications need not notice this, because the default memory list for an application, unless it knows better, should be the one it inherited.
Presumably, the inherited memory lists will most often be sorted to provide memory close to the faulting CPU. But it is the responsibility of the system administrator or service to determine this, not the typical application.
Applications that have some specific memory access pattern for a particular address range may want to construct memory lists to control placement of that memory.
Cpumemmap and cpumemset calls that specify a range of memory (CMS_VMAREA) apply to all pages in the specified range. The internal kernel data structures tracking each vm area in an address space are automatically split if a cpumemmap or cpumemset is applied to only part of the range of pages in that vm area. This splitting happens transparently to the application, and subsequent remerging of two such neighboring vm areas may occur, if the two vm areas no longer differ. This same behavior is seen in the system calls madvise, msync and mincore.
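For example, an application could give one megabyte in the middle of a mapping a different allocation policy than the rest, letting the kernel split the vm area transparently. A minimal sketch, reusing the *current* cpumemset as a template:

#include <stdio.h>
#include <sys/mman.h>
#include "cpumemsets.h"

#define MB (1024 * 1024)

int main(void)
{
        char *buf = mmap(NULL, 4 * MB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        cpumemset_t *cms = cmsQueryCMS(CMS_CURRENT, 0, NULL);

        if (buf == MAP_FAILED || cms == NULL)
                return 1;

        /* Round-robin the second megabyte across its memory list; the kernel
         * splits the vm area around [buf+MB, buf+2*MB) transparently. */
        cms->policy = CMS_ROUND_ROBIN;
        if (cmsSetCMS(CMS_VMAREA, 0, buf + MB, MB, cms) < 0)
                perror("cmsSetCMS(CMS_VMAREA)");

        cmsFreeCMS(cms);
        return 0;
}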
CpuMemSets don't have system-wide names. They are not like files or processes, with well known names (paths and pids). Rather they are like classic Unix(tm) anonymous pipes or anonymous shared memory regions, which are identifiable within an individual process (by file descriptor or virtual address), but not by a common name space visible to all processes on the system.
In other words, cpumemmaps and cpumemsets can be "named" with the following tuples:
The system numbers for CPUs and memory blocks are system wide. But the application numbers are relative to the individual processes sharing a particular cpumemmap.
Had there been an important need, cpumemmaps and cpumemsets could have been made a separately named, allocated and protected system resource. But this would have required additional work, a more complex API, and more software.
No compelling requirement for naming CpuMemSets has been discovered, so far at least.
Granted, this has been one of the more surprising aspects of this Design.
The cmsGetCPU() call resembles the getcpu() call supported on some systems, except that cmsGetCPU() returns the currently executing application CPU number, as found in the current process's cpumemmap. This information, along with the results of the cmsQuery*() calls, which any application may perform, may be helpful on some architectures in discovering topology and current system utilization. If a process can be scheduled on two or more CPUs, then the results of cmsGetCPU() may become invalid even before the query returns to the invoking user code.
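A small sketch of translating the application CPU number into the corresponding system CPU number, with the usual caveat that the answer may already be stale when it is printed:

#include <stdio.h>
#include "cpumemsets.h"

int main(void)
{
        cms_acpu_t acpu = cmsGetCPU();                          /* application CPU number */
        cpumemmap_t *cmm = cmsQueryCMM(CMS_CURRENT, 0, NULL);   /* this process's map */

        if (cmm) {
                if (acpu < cmm->nr_cpus)
                        printf("running on application CPU %u = system CPU %u\n",
                               (unsigned)acpu, (unsigned)cmm->cpus[acpu]);
                cmsFreeCMM(cmm);
        }
        return 0;
}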
The library code implementing cmsQuery*() calls constructs the returned cpumemmaps and cpumemsets by using malloc(3) to allocate each distinct structure and array element in the return value, and linking them together. The cmsFree*() calls assume this layout, and call free(3) on each element.
If you construct your own cpumemmap or cpumemset, using some other memory layout, don't pass that to cmsFree*().
You may alter in place and replace malloc'd elements of a cpumemmap or cpumemset returned by a cmsQuery*() call, and pass the result back into a corresponding cmsSet*() or cmsFree*() call. You will have to explicitly free(3) any elements of the data structure that you disconnect in this fashion, to avoid a memory leak.
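For example, to restrict the *current* cpumemset of a process to application CPU 0 while keeping its inherited memory lists, one could swap in a new top-level cpus array, taking care to free(3) the array that is disconnected (a sketch, with most error handling elided):

#include <stdio.h>
#include <stdlib.h>
#include "cpumemsets.h"

int main(void)
{
        cpumemset_t *cms = cmsQueryCMS(CMS_CURRENT, 0, NULL);
        cms_acpu_t *only0;

        if (cms == NULL)
                return 1;
        only0 = malloc(sizeof(*only0));
        if (only0 == NULL)
                return 1;
        only0[0] = 0;                   /* application CPU 0 only */

        free(cms->cpus);                /* explicitly free the array we disconnect */
        cms->cpus = only0;
        cms->nr_cpus = 1;               /* memory lists are left as inherited */

        if (cmsSetCMS(CMS_CURRENT, 0, NULL, 0, cms) < 0)
                perror("cmsSetCMS(CMS_CURRENT)");

        cmsFreeCMS(cms);                /* frees only0 along with the rest */
        return 0;
}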
The following display (contents copyright by SGI and subject to the GNU Library General Public License, as noted in the displayed contents) provides the C language header file containing details of the cpumemmap and cpumemset APIs.
/* * CpuMemSets Library * Copyright (C) 2001 Silicon Graphics, Inc. * All rights reserved. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Library General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This library is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * Library General Public License for more details. * * You should have received a copy of the GNU Library General Public * License along with this library; if not, write to the * Free Software Foundation, Inc., 59 Temple Place - Suite 330, * Boston, MA 02111-1307 USA. */ /* * cpumemsets.h - CpuMemSet application interface for managing * system scheduling and memory allocation across * the various CPUs and memory blocks in a system. */ #ifndef __CPUMEMSET_H #define __CPUMEMSET_H /* * The CpuMemSet interface provides general purpose processor and * memory placement facilities to applications, system services * and emulations of other CPU and memory placement interfaces. * * It is not the objective that CpuMemSets provide the various * placement policies and scheduling heuristics needed for * efficient operation of SMP and NUMA systems. Nor are * CpuMemSets intended to replace existing API's that have been * developed to provide such solutions, on Linux and other * vendor systems. Rather it is the objective of CpuMemSets * that they provide a common Linux kernel mechanism suitable * to support the implementation of various such solutions and * provide emulations of existing API's, with minimal impact * on existing (or future) kernel scheduling and allocator code. * * CpuMemSets were born of the following realizations: * * 1) The kernel should support scheduling choice and memory * placement mechanisms sufficiently generic and policy * neutral to support a variety of solutions and policies, * without having to make constant kernel changes. * * 2) There are too many existing and anticipated solutions to * these static scheduling and placement problems to all fit * in the kernel, so one kernel mechanism is needed to support * them all. * * 3) The ongoing rate of evolution of the more dynamic aspects * of scheduling and allocation mandate that the relatively * static aspects addressed by CpuMemSets be kept "off to the * side" as much as possible, with very minimal impact on the * existing, or future, scheduling and allocation code. * * 4) In the long run, it is untenable to have separate scheduling * or allocation kernel code (as in the current mm/numa.c) * for large systems (multiple memory blocks or NUMA), as * opposed to single memory systems. Maintenance costs go * up, bugs are introduced, fixes to one are missed in the * other, and semantics diverge gratuitously. Rather one * body of kernel code is required, which is optimal (both * readability and performance) for normal systems but still * entirely suitable for large systems. * * CpuMemSets are implemented using two somewhat separate layers. * * 1) cpumemmap (cmm): * * The bottom layer provides a simple pair of maps, mapping * system CPU and memory block numbers to application CPU * and memory block numbers. System numbers are those used * by the kernel task scheduling and memory allocation code, * and typically include all CPU and memory in the system. 
* Application numbers are those used by an application in * its cpumemset to specify its CPU and memory affinity * for those CPU and memory blocks available in its map. * Each process, each virtual memory area, and the kernel has * such a map. These maps are inherited across fork, exec * and the various ways to create vm areas. Only a process * with root privileges can extend cpumemmaps to include * additional system CPUs or memory blocks. Changing a map * will cause kernel scheduling code to immediately start * using the new system CPUs, and cause kernel allocation * code to allocate additional memory pages using the new * system memory blocks, but memory already allocated on old * blocks will not be migrated, unless by some other means. * * The bulk of the kernel is still written using whatever * system CPU and memory block numbers are appropriate for * a system. Changes to cpumemmaps are converted at the time * of the cmsSet*() calls into changes to the system masks * (such as cpus_allowed) and lists (such as zone lists) * used by the existing scheduler and allocator. * * 2) cpumemset (cms): * * The upper layer specifies on which of the application * CPUs known to that process a task can be scheduled, and * in which application memory blocks known to the kernel * or that vm area, memory can be allocated. The kernel * allocators search the memory block lists in the given * order for available memory, and a different list is * specified for each CPU that may execute the request. * An application may change the cpumemset of its tasks * and vm areas, and root may change the cpumemset used * for kernel memory allocation. Also root may change the * cpumemsets of any process, and any process may change the * cpumemsets of other processes with the same uid (kill(2) * permissions). * * * Each task has two cpumemsets, one defining its *current* CPU * allocation and created vm areas, and one that is inherited by * any *child* process it forks. Both the *current* and *child* * cpumemsets of a newly forked process are set to copies of * the *child* cpumemset of the parent process. Allocations of * memory to existing vm areas visible to a process depend on * the cpumemset of that vm area (as acquired from its creating * process at creation, and possibly modified since), not on * the cpumemset of the currently accessing task. * * During system boot, the kernel creates and attaches a * default cpumemmap and cpumemset that is used everywhere. * By default this initial map and set contain all CPUs and * all memory blocks. The memory blocks are not necessarily * sorted in any particular order, though provision is made for * an architecture specific hook to code that can rearrange * this initial cpumemset and cpumemmap. An optional kernel * boot command line parameter causes this initial cpumemmap * and cpumemset to contain only the first CPU and one memory * block, rather than all of them, for the convenience of system * management services that wish to take greater control of * the system. * * The kernel will only schedule a task on the CPUs in the tasks * cpumemset, and only allocate memory to a user virtual memory * area from the list of memories in that areas memory list. * The kernel allocates kernel memory only from the list of * memories in the cpumemset attached to the CPU executing the * allocation request, except for specific calls with the kernel * that specify some other CPU or memory block. 
* * Both the *current* and *child* cpumemmaps and cpumemsets of * a newly forked process are taken from the *child* settings * of its parent, and memory allocated during the creation of * the new process is allocated according to the parents *child* * cpumemset and associated cpumemmap, because that cpumemset is * acquired by the new process and then by any vm area created * by that process. * * The cpumemset (and associated cpumemmap) of a newly created * virtual memory area is taken from the *current* cpumemset * of the task creating it. In the case of attaching to an * existing vm area, things get more complicated. Both mmap'd * memory objects and System V shared memory regions can be * attached to by multiple processes, or even attached to * multiple times by the same process at different addresses. * If such an existing memory region is attached to, then by * default the new vm area describing that attachment inherits * the *current* cpumemset of the attaching process. If however * the policy flag CMS_SHARE is set in the cpumemset currently * linked to from each vm area for that region, then the new * vm area will also be linked to this same cpumemset. * * When allocating another page to an area, the kernel will * choose the memory list for the CPU on which the current * task is being executed, if that CPU is in the cpumemset of * that memory area, else it will choose the memory list for * the default CPU (see CMS_DEFAULT_CPU) in that memory areas * cpumemset. The kernel then searches the chosen memory list * in order, from the beginning of that memory list, looking * for available memory. Typical kernel allocators search the * same list multiple times, with increasingly aggressive search * criteria and memory freeing actions. * * The cpumemmap and cpumemset calls with the CMS_VMAREA apply * to all future allocation of memory by any existing vm area, * for any pages overlapping any addresses in the range [start, * start+len), similar to the behavior of madvise, mincore * and msync. * * Interesting Error Cases: * * If a request is made to set a cpumemmap that has fewer CPUs * or memory blocks listed than needed by any cpumemsets that * will be using that cpumemmap after the change, then that * cmsSetCMM() will fail, with errno set to ENOENT. That is, * you cannot remove elements of a cpumemmap that are in use. * * If a request is made to set a cpumemset that references CPU * or memory blocks not available in its current cpumemmap, * then that cmsSetCMS() will fail, with errno set to ENOENT. * That is, you cannot reference unmapped application CPUs * or memory blocks in a cpumemset. * * If a request is made to set a cpumemmap by a process * without root privileges, and that request attempts to * add any system CPU or memory block number not currently * in the map being changed, then that request will fail, * with errno set to EPERM. * * If a cmsSetCMS() request is made on another * process, then the requesting process must either have * root privileges, or the real or effective user ID of * the sending process must equal the real or saved * set-user-ID of the other process, or else the request * will fail, with errno set to EPERM. These permissions * are similar to those required by the kill(2) system call. * * Every cpumemset must specify a memory list for the * CMS_DEFAULT_CPU, to ensure that regardless of which CPU * a memory request is executed on, a memory list will * be available to search for memory. 
Attempts to set * a cpumemset without a memory list specified for the * CMS_DEFAULT_CPU will fail, with errno set to EINVAL. * * If a request is made to set a cpumemset that has the same * CPU (application number) listed in more than one array * "cpus" of CPUs sharing any cms_memory_list_t, then the * request will fail, with errno set to EINVAL. Otherwise, * duplicate CPU or memory block numbers are harmless, except * for minor inefficiencies. * * The operations to query and set cpumemmaps and cpumemsets * can be applied to any process (any pid). If the pid is * zero, then the operation is applied to the current process. * If the specified pid does not exist, then the operation * wil fail with errno set to ESRCH. * * Not all portions of a cpumemset are useful in all cases. * For example the CPU portion of a vm area cpumemset is unused. * It is not clear as of this writing whether CPU portions of the * kernels cpumemset are useful. When setting a CMS_KERNEL or * CMS_VMAREA cpumemset, it is acceptable to pass in a cpumemset * structure with an empty cpu list (nr_cpus == 0 and *cpus == * NULL), and such an empty cpu list will be taken as equivalent * to passing in the cpu list from the *current* cpumemset of * the requesting process. * * A /proc interface should be provided to display the cpumemset * and cpumemmap structures, settings and connection to tasks, * vm areas, the kernel, and system and application CPUs * and memory blocks. This /proc interface is to be used by * system utilities that report on system activity and settings. * The CpuMemSet interface described in this file is independent * of that /proc reporting interface. * * None of this CpuMemSet apparatus has knowledge of distances * between nodes or memory blocks in a NUMA system. Presumably * other mechanisms exist on such large machines to report * to system services and tools in user space the topology * and distances of the system processor, memory and I/O * architecture, thus enabling such user space services to * construct cpumemmaps and cpumemsets with the desired structure. * * System services and utilities that query and modify cpumemmaps * identify maps by one of: * CMS_CURRENT - specifying a process id, for the *current* * map attached to that process * CMS_CHILD - specifying a process id, for the *child* * map attached to that process * CMS_VMAREA - specifying a process id and virtual address * range [start, start+len], for the map attached * to the pages in that address range of that process * CMS_KERNEL - for the kernel (pid, start and len args not used) * * System services and utilities that query and modify cpumemsets * identify maps by one of: * CMS_CURRENT - specifying a process id, for the *current* * set attached to that process * CMS_CHILD - specifying a process id, for the *child* * set attached to that process * CMS_VMAREA - specifying a process id and virtual address * range [start, start+len], for the map attached * to the pages in that address range of that process * CMS_KERNEL - for the kernel (pid, start and len args not used) * * If a cpumemset has a policy of CMS_ROUND_ROBIN, the kernel * searches memory lists beginning one past where the last search * on that same Memory List of that same cpumemset concluded, * instead of always from the beginning of the memory list. 
 *
 * If a cpumemset has a policy of CMS_EARLY_BIRD, the
 * kernel first searches the first Memory Block on the memory
 * list, then if that doesn't provide the required memory,
 * the kernel searches the memory list beginning one past
 * where the last search on that same Memory List of that
 * same cpumemset concluded.  "EARLY_BIRD" comes from "FIRST_ROBIN",
 * a contraction of FIRST_TOUCH (aka DEFAULT) and ROUND_ROBIN.
 *
 * This API is not directly implemented by dedicated system
 * calls, but rather by adding options to a lower level general
 * purpose system call.  That low level API (currently using
 * prctl) should not be used by applications, and is subject
 * to change.  Rather use this CpuMemSet API, which should
 * be stable over time.  To the extent consistent with the
 * evolution of Linux and as resources permit, changes to this
 * API will preserve forward and backward, source and binary
 * compatibility for both kernel and application.
 *
 * The cpumemmaps and cpumemsets returned by the cmsQuery*()
 * routines are constructed using a malloc() for each separate
 * structure and array, and should, when no longer needed, be
 * freed with a cmsFreeCMM() or cmsFreeCMS() call, to free()
 * that memory.
 */

#if defined(sgi)
typedef unsigned short int uint16_t;
typedef int pid_t;
typedef unsigned int size_t;
#else
#include "stdint.h"
#endif

#define CMS_DEFAULT     0x00    /* Memory list order (first-touch, typically) */
#define CMS_ROUND_ROBIN 0x01    /* Memory allocation round-robin on each list */
#define CMS_SHARE       0x02    /* Inherit virtual memory area CMS, not task */
#define CMS_EARLY_BIRD  0x03    /* First touch (default), then round-robin */
typedef int cms_setpol_t;       /* Type of policy argument for sets */

/* 16 bits gets us 64K CPUs ... no one will ever need more than that! */
typedef uint16_t cms_acpu_t;    /* Type of application CPU number */
typedef uint16_t cms_amem_t;    /* Type of application memory block number */
typedef uint16_t cms_scpu_t;    /* Type of system CPU number */
typedef uint16_t cms_smem_t;    /* Type of system memory block number */

#define CMS_DEFAULT_CPU ((cms_acpu_t)-1)        /* Marks default Memory List */

/* Calls to query and set cmm and cms need to specify which one ... */
#define CMS_CURRENT     0       /* cmm or *current* cms of this process */
#define CMS_CHILD       1       /* *child* cms of this process */
#define CMS_VMAREA      2       /* cmm or cms of vmarea at given virtual addr */
#define CMS_KERNEL      3       /* cmm or cms of kernel (root-only) */
typedef int cms_choice_t;       /* Type of cmm/cms choice argument */

#define CMS_BULK_PID    0       /* Bulk remap just this pid (current+child) */
#define CMS_BULK_SHARE  1       /* Bulk remap all sharing with this pid */
#define CMS_BULK_KERNEL 2       /* Bulk remap kernel cpumemmap */
#define CMS_BULK_ALL    3       /* Bulk remap all cpumemmaps (likely slow!) */
typedef int cms_remap_choice_t; /* Type of Bulk Remap operation */

/* cpumemmap: Type for the pair of maps ... */
typedef struct cpumemmap {
        int nr_cpus;            /* number of CPUs in map */
        cms_scpu_t *cpus;       /* array maps application to system CPU num */
        int nr_mems;            /* number of mems in map */
        cms_smem_t *mems;       /* array maps application to system mem num */
} cpumemmap_t;

/*
 * How memory looks to (typically) a set of equivalent CPUs,
 * including which memory blocks to search for memory, in what order,
 * and the list of CPUs to which this list of memory blocks applies.
 * The cpumemset is sufficiently complex that this portion of the
 * data structure type is specified separately, then an array of
 * cms_memory_list_t structures is included in the main cpumemset type.
 */
typedef struct cms_memory_list {
        int nr_cpus;            /* Number of CPUs sharing this memory list */
        cms_acpu_t *cpus;       /* Array of CPUs sharing this memory list */
        int nr_mems;            /* Number of memory blocks in this list */
        cms_amem_t *mems;       /* Array of 'nr_mems' memory blocks */
} cms_memory_list_t;

/*
 * Specify a single cpumemset, describing on which CPUs to
 * schedule tasks, from which memory blocks to allocate memory,
 * and in what order to search these memory blocks.
 */
typedef struct cpumemset {
        cms_setpol_t policy;    /* or'd CMS_* set policy flags */
        int nr_cpus;            /* Number of CPUs in this cpumemset */
        cms_acpu_t *cpus;       /* Array of 'nr_cpus' processor numbers */
        int nr_mems;            /* Number of Memory Lists in this cpumemset */
        cms_memory_list_t *mems;/* Array of 'nr_mems' Memory Lists */
} cpumemset_t;

/*
 * Remap Vector: used in bulk remap to request changing multiple
 * system CPU and/or memory block values to other values, for one
 * process, or all sharing the same map as one process, or for all
 * processes in a system.  Useful for isolating a CPU or memory by
 * remapping prior existing users to other resources, as well for
 * offloading a CPU prior to hot swapping it out.
 */
typedef struct cms_remap_vector {
        int nr_cpus;            /* number of CPUs being remapped */
        cms_scpu_t *oldcpus;    /* for any system CPU listed here in a map: */
        cms_scpu_t *newcpus;    /* ... change it to corresponding CPU here */
        int nr_mems;            /* number of mems being remapped */
        cms_smem_t *oldmems;    /* for any system mem listed here in a map: */
        cms_smem_t *newmems;    /* ... change it to corresponding mem here */
} cms_remap_vector_t;

/* Manage cpumemmaps (need perms like kill(2), must be root to grow map) */
cpumemmap_t *cmsQueryCMM (cms_choice_t c, pid_t pid, void *start);
int cmsSetCMM (cms_choice_t c, pid_t pid, void *start, size_t len,
        cpumemmap_t *cmm);

/* Manage cpumemsets (need perms like kill(2), must be root to grow map) */
cpumemset_t *cmsQueryCMS (cms_choice_t c, pid_t pid, void *start);
int cmsSetCMS (cms_choice_t c, pid_t pid, void *start, size_t len,
        cpumemset_t *cms);

/* Bulk remap - change system CPUs/mems for multiple cpumemmaps (root-only) */
int cmsBulkRemap (cms_remap_choice_t c, pid_t pid, cms_remap_vector_t *v);

/* Return application CPU number currently executing on */
cms_acpu_t cmsGetCPU(void);

/* Free results from above cmsQuery*() calls */
void cmsFreeCMM (cpumemmap_t *cmm);
void cmsFreeCMS (cpumemset_t *cms);

#endif
One way to understand these data structures is to look at an example.
Given the following hardware configuration:

    Let's say we have a four node system, with four CPUs per node,
    and one memory block per node, named as follows:

        Name the 16 CPUs:    c0, c1, ..., c15    # 'c' for CPU
        and number them:     0, 1, 2, ..., 15    # cms_scpu_t

        Name the 4 memories: mb0, mb1, mb2, mb3  # 'mb' for memory block
        and number them:     0, 1, 2, 3          # cms_smem_t

cpumemmap:

    Now let's say the administrator (root) chooses to set up a Map
    containing just the 2nd and 3rd node (CPUs and memory thereon).
    The cpumemmap for this would contain:

        {
            8,      # nr_cpus (length of CPUs array)
            p1,     # CPUs    (ptr to array of cms_scpu_t)
            2,      # nr_mems (length of mems array)
            p2      # mems    (ptr to array of cms_smem_t)
        }

    where p1, p2 point to arrays of system CPU + mem numbers:

        p1 = [ 4,5,6,7,8,9,10,11 ]  # CPUs (array of cms_scpu_t)
        p2 = [ 1,2 ]                # mems (array of cms_smem_t)

    This map shows, for example, that for this Map, application CPU 0
    corresponds to system CPU 4 (c4).

cpumemset:

    Further let's say that an application running within this map
    chooses to restrict itself to just the odd-numbered CPUs, and to
    search memory in the common "first-touch" manner (local node first).
    It would establish a cpumemset containing:

        {
            CMS_DEFAULT,    # cms_policy
            4,              # nr_cpus (length of CPUs array)
            q1,             # CPUs    (ptr to array of cms_acpu_t)
            2,              # nr_mems (length of mems array)
            q2,             # mems    (ptr to array of cms_memory_list_t)
        }

    where q1 points to an array of 4 application CPU numbers and
    q2 to an array of 2 memory lists:

        q1 = [ 1,3,5,7 ]    # CPUs (array of cms_acpu_t)
        q2 = [              # See "Verbalization examples" below
                { 3, r1, 2, s1 }
                { 2, r2, 2, s2 }
             ]

    where r1, r2 are arrays of application CPUs:

        r1 = [ 1, 3, CMS_DEFAULT_CPU ]
        r2 = [ 5, 7 ]

    and s1, s2 are arrays of memory blocks:

        s1 = [ 0, 1 ]
        s2 = [ 1, 0 ]
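The same Map and Set can also be written down directly as C initializers against the header above; this sketch simply mirrors the names p1, p2, q1, q2, r1, r2, s1 and s2 used in the example:

#include "cpumemsets.h"

static cms_scpu_t p1[] = { 4, 5, 6, 7, 8, 9, 10, 11 };  /* system CPUs c4..c11    */
static cms_smem_t p2[] = { 1, 2 };                       /* system blocks mb1, mb2 */
static cpumemmap_t map = { 8, p1, 2, p2 };

static cms_acpu_t q1[] = { 1, 3, 5, 7 };                 /* odd application CPUs   */
static cms_acpu_t r1[] = { 1, 3, CMS_DEFAULT_CPU };
static cms_acpu_t r2[] = { 5, 7 };
static cms_amem_t s1[] = { 0, 1 };                       /* local block first      */
static cms_amem_t s2[] = { 1, 0 };
static cms_memory_list_t q2[] = {
        { 3, r1, 2, s1 },
        { 2, r2, 2, s2 },
};
static cpumemset_t set = { CMS_DEFAULT, 4, q1, 2, q2 };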
Verbalization examples:
To read item q1 out loud:
To read item q2 out loud:
Interpretations of the above:
Observation:
The following kernel, library, configuration and related changes (amongst others, no doubt) will be needed to implement CpuMemSets:
October 8, 2001 Revision
Copyright 2001