Linux VM/IO balancing (fwd to linux-mm?) (fwd)
Hi, here's an interesting tidbit I got from Matthew Dillon. A lot of this is very interesting indeed, and I guess we want some of it before kernel 2.4 and most of it in kernel 2.5 ...

Rik

---------- Forwarded message ----------
Date: Mon, 22 May 2000 23:32:20 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Subject: Linux VM/IO balancing (fwd to linux-mm?)

I sent this to Alan, who suggested that I send it to you! I've fleshed it out a bit more from the version I sent to Alan.

I've been following the Linux VM/memory subsystem issues closely and I have a few suggestions on how to clean it up. Unfortunately, being FreeBSD-centric these days, I do not want to create controversy. But at this point I think even Linux developers know that the algorithms being used in the Linux kernel are struggling with the issue. The time may be ripe for some outside input (though I will note that I was heavily into Linux in earlier years, as Alan's mail archives attest!).

What I do below is essentially describe the FreeBSD VM/memory subsystem from an algorithmic point of view, minus some of the fluff. It is very straightforward in concept, and the benefits should be obvious to anyone who has worked on the VM system. I think it would be a fine blueprint to use in Linux.

I would like you to consider posting this to linux-kernel. I leave it up to you -- I think if I were to post it independently it would simply result in controversy and seem more like a FreeBSD-vs-Linux thing rather than a 'how to fix the VM system' thing. But if you post it and preface it appropriately, the result might be better. So I am sending it to you. It may help kickstart the creation of a formal algorithmic blueprint rather than hacking up patches that solve one problem only to create others. Please feel free to edit it in any way you think might reduce potential us-vs-them issues even more. I think there is a chance here to get the whole of the Linux developer community on the same page in regards to the memory-load issue.

I didn't intend to turn this into a paper, but I've spent so much time describing it that I think I am going to submit it to Daemon News in a cleaned-up form :-). Thanks!

-Matt

---------

Even though I don't do much coding for Linux these days, I've always avidly tracked the kernel mailing list. I've been tracking the memory-balancing thread for a while now and, having dealt with similar issues in FreeBSD, I believe I can offer a suggestion.

First of all, I am not trying to turn this into a comparison or anything. Don't think of the FreeBSD memory subsystem as being 'FreeBSD' so much as being the work of a handful of very good theorists and programmers (John Dyson being the main element there), plus years of testing under varying loads. The algorithms FreeBSD uses are NOT 15 years old; they are more like 3 years old, and much of the work I myself have done is less than one year old. Also, keep in mind that the standard FUD about FreeBSD 'swapping more' than Linux is just that --- FUD. FreeBSD only swaps when it needs to, or when it thinks it may benefit by freeing up an idle dirty page for more active reuse. I make an attempt to describe in exact detail how and why FreeBSD pages things out, and why it works so well, somewhere down below :-).

My suggestions are as follows:

First, stop treating the buffer cache as an entity separate from the VM system.
The two are inexorably bound together, especially considering the massive use of mmap() (both file-backed and anonymous mmaps) in modern programming. Depending on what you are running, a properly balanced system might have anywhere from 80% of its memory assigned as file cache to 80% of its memory assigned to hold anonymous memory for processes. It is NOT possible to impose limitations and still get a reasonably scalable, balanced system. DO NOT TREAT THESE AS TWO DIFFERENT CACHES!

Second, start keeping real statistics on memory use on a physical-page basis. That means tracking how often VM pages are referenced (statistically) as well as how often filesystem pages are referenced by discrete I/O calls (deterministically). Keep track of a real per-physical-page statistical 'weight'. (What this means for Linux is that you really need to test the pte's associated with physical pages by iterating through the physical pages in your outer loop, NOT by trying to iterate through every page table of every process!)

FreeBSD keeps a center-weighted statistic for every page of memory (buffer cache or VM cache, it makes no difference). This has turned out to be a nearly perfect balancing algorithm, and I strongly recommend that Linux adopt a similar model. What makes it work is that FreeBSD is willing to eat a couple of CPU cycles to keep accurate statistics of page use by the VM system, in order to avoid the bad things that happen when one would otherwise choose the wrong page to reuse or clean.

What I describe below is the essential core of the algorithm FreeBSD uses. It's not an exact representation, but it gets to the heart of FreeBSD's memory-balancing success. The algorithm is a *modified* LRU. Let's say you decide on a weighting between 0 and 10. When a page is first allocated (either to the buffer cache or for anonymous memory), its statistical weight is set to the middle (5). If the page is used often, the statistical weight slowly rises to its maximum (10). If the page remains idle (or was just used once), the statistical weight slowly drops to its minimum (0). The statistical weight is updated in real time by I/O system calls, and updated statistically (by checking and clearing the page-referenced bit in the pte's) for mapped memory. When you mmap() a file and issue syscalls on the descriptor, the weight may be updated by BOTH methods.

The rate at which the statistical page-reference updating operates depends on the perceived memory load. A lightly loaded system (unstressed memory) doesn't bother to scan the page-referenced bit all that often, while a heavily loaded one scans the page-referenced bit quite often to keep the statistical weight accurate.

When memory is allocated and no free pages are available, a clean page is discarded from the cache (all 'clean' pages are considered to be cache, pretty much), lowest weight first. This in itself does NOT contribute to the memory load calculation. That is, if you are scanning a 10GB file you are not creating any memory stress on the system. The LRU nature of the ordering of pages in the queue is not strict. The real parameter is the statistic; the ordering of the pages in the queue uses a heuristic -- the pages 'migrate' over time so they are reasonably well ordered within the queue, but no great effort is made to order them exactly. The VM system will scan a portion of the queue to locate a reasonable page to reuse (for example, it will look for a page with a weighting less than 2).
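To make this concrete, here is a minimal illustrative sketch in C. It is not FreeBSD (or Linux) source; every name in it (page_stat, WEIGHT_INIT, page_scan_update, and so on) is invented for the example. It only models the center-weighted statistic described above: each physical page carries a weight between 0 and 10, starts in the middle, gets bumped when the page is seen referenced, and decays when it is not.

    /*
     * Illustrative sketch only -- not actual FreeBSD or Linux code.
     * It models the center-weighted per-physical-page statistic.
     */
    #include <stdbool.h>

    #define WEIGHT_MIN   0
    #define WEIGHT_MAX   10
    #define WEIGHT_INIT  5      /* center-weighted starting point */
    #define REUSE_THRESH 2      /* allocator looks for weights below this */

    struct page_stat {
        int  weight;            /* per-physical-page use statistic */
        bool dirty;
    };

    /* A page is first placed in the (unified) cache: start at the middle. */
    static void page_init_stat(struct page_stat *p)
    {
        p->weight = WEIGHT_INIT;
        p->dirty = false;
    }

    /* Deterministic update: a read()/write() syscall touched this page. */
    static void page_note_io(struct page_stat *p)
    {
        if (p->weight < WEIGHT_MAX)
            p->weight++;
    }

    /*
     * Statistical update: the scanner walks physical pages (not every
     * process's page tables), samples and clears the hardware
     * referenced bit, and nudges the weight toward 10 or toward 0.
     */
    static void page_scan_update(struct page_stat *p, bool referenced)
    {
        if (referenced && p->weight < WEIGHT_MAX)
            p->weight++;
        else if (!referenced && p->weight > WEIGHT_MIN)
            p->weight--;
    }

    /* A clean page with a low weight is a good candidate for reuse. */
    static bool page_is_reusable(const struct page_stat *p)
    {
        return !p->dirty && p->weight < REUSE_THRESH;
    }

The point is simply that both update paths feed the same per-physical-page number: syscalls bump it deterministically, while the scanner samples the referenced bit at a rate that depends on the perceived memory load.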
The pagedaemon's scan rate is based on the perceived memory load, and ONLY the perceived memory load. It is perfectly acceptable to have most of the system memory in 'active' use if allocations are not occurring often; perfectly acceptable to have most of the system memory backing file pages if processes aren't doing a lot of pageins; perfectly acceptable for the system memory to be mostly dedicated to process anonymous memory if processes have big ACTIVE footprints; and perfectly acceptable for most of the pages to be dirty if they are all in active use and the memory subsystem is not otherwise being stressed. The reason FreeBSD's memory subsystem works so well is precisely because it does not impose any artificial limitations on the balance point.

Memory load is calculated in two ways: first, by whether the memory system finds itself reusing active pages (in my example, any page with a statistical weight greater than 5); second, by the dirty:clean page ratio. A high ratio does not itself cause paging to occur, but a high ratio combined with the system reusing active pages does.

The dirty/clean ratio is treated as an INDEPENDENT problem. The same statistic is kept for dirty pages as for clean pages, but dirty pages are placed on their own independent LRU-ish queue and do not take part in the 'normal' memory allocation algorithm. A separate algorithm (also part of the pageout daemon) controls the cleaning of dirty pages. When the memory load increases, an attempt is made to balance the dirty/clean ratio by 'cleaning' dirty pages, which of course means paging them out. FreeBSD makes NO distinction between writing a dirty file-backed page and allocating swap for a dirty anonymous-memory page. The same per-page memory-use statistic is also used to determine which dirty pages to clean first.

In effect, it is precisely this attempt to balance the dirty/clean ratio which increases the number of clean pages available to reuse. The idea here is to increase the number of clean pages to the point where the system is no longer being forced to reuse 'active' pages. Once this is achieved, there is no longer any need to clean the remaining dirty pages.

Under extreme memory loads the balance point moves on its own to a point where FreeBSD tries to keep as many pages in a clean state as possible. When the memory load gets to this point the system is considered to be thrashing, and we start taking anti-thrashing measures, such as swapping out whole processes and holding them idle for 20-second spans. It rarely gets to this point, but even when it does, the system is still kept reasonably balanced.

It should be noted that the center-weighting algorithm works in virtually all situations, including working WONDERFULLY when you have I/O-centric programs (i.e. a program that reads or writes gigabytes of data). By making slight adjustments to the initial weight (or even no adjustments at all), the VM system will tend to reuse used-once memory (think of scanning a file) before it tries to reuse more actively used memory.

Now, of course, there are other kernel processes messing with memory -- the filesystem update daemon, for example. But these processes are not designed to handle heavy memory loads, and we do it that way on purpose. At most, the update daemon will speed up a little under intense filesystem loads, but that is as far as it goes. Only one process is designed to handle heavy memory loads, and that is the pageout daemon.
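Continuing the same invented names, the decision the pageout daemon makes on each pass might be sketched roughly as follows. Again this is only an illustration of the description above, not real code: clean_lowest_weight_dirty_pages() and swap_out_idle_processes() are hypothetical stand-ins, and the thresholds are arbitrary.

    /*
     * Illustrative sketch only.  Load is inferred from the allocator
     * having been forced to reuse pages whose weight was still above
     * the 5 midpoint; a high dirty:clean ratio matters only in
     * combination with that.  The only response is to clean the
     * lowest-weight dirty pages until the pressure stops.
     */
    #include <stdbool.h>
    #include <stddef.h>

    struct vm_stats {
        size_t reused_active;   /* reclaims that hit pages with weight > 5 */
        size_t dirty_pages;
        size_t clean_pages;
    };

    /* Hypothetical hooks -- a real kernel obviously does far more. */
    size_t clean_lowest_weight_dirty_pages(size_t max_pages);
    void   swap_out_idle_processes(void);   /* hold them idle a while */

    static bool should_clean(const struct vm_stats *s)
    {
        bool reusing_active = s->reused_active > 0;
        /* A high dirty:clean ratio alone is not stress (the 1:1
         * threshold here is made up for the example) ...             */
        bool dirty_heavy = s->dirty_pages > s->clean_pages;
        /* ... it triggers cleaning only when combined with the
         * allocator being forced onto still-active pages.            */
        return reusing_active && dirty_heavy;
    }

    static void pageout_daemon_pass(struct vm_stats *s)
    {
        if (!should_clean(s))
            return;             /* unstressed: scan rate stays low */

        /* Page out dirty pages, lowest weight first, to grow the pool
         * of clean, reusable pages (64 is an arbitrary batch size).  */
        size_t cleaned = clean_lowest_weight_dirty_pages(64);
        s->dirty_pages -= cleaned;
        s->clean_pages += cleaned;
        s->reused_active = 0;   /* measured afresh on the next pass */

        /* Extreme, persistent stress is treated as thrashing: park
         * whole processes for a while (the 4x ratio is made up).    */
        if (s->dirty_pages > 4 * s->clean_pages)
            swap_out_idle_processes();
    }

The balance point is simply wherever that feedback settles; nothing in the loop caps how much memory the file cache or anonymous memory may take.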
--- Stress Cases

* Stressing dirty pages in the system via I/O calls (read/write)

  The algorithm tends to cause sequential I/O calls to give pages a middling weight, and since the pages are not reused they tend to be recycled within their own domain (so you don't blow the rest of the cache).

* Stressing dirty pages in the system via mmap (shared R+W)

  The system tends to run low on clean pages, detected by the fact that new allocations are reusing clean pages which have high weights. When this occurs, the pageout daemon attempts to 'clean' dirty pages (page them out) in order to increase the number of clean pages available. Having a larger number of clean pages available tends to give them more time to age, thus reducing the average weight the allocator sees. This is a negative feedback loop which results in balance.

* I/O (read/shared-mmap) stress

  The algorithm tends to weight the clean pages according to use. The weightings for filesystem cache pages read via read() are adjusted at the time of the read(), while VM pages are adjusted statistically (the VM page scan rate depends on the level of stress). Since in modern systems mmap() is used heavily, no special consideration is given to one access method versus the other.

* VM (anonymous memory) stress

  Anonymous swap-backed memory is treated no differently from file-backed (filesystem buffers / mmap) memory. Clean anonymous pages (most likely with swap already allocated if they are clean) can be reused just the same as pages belonging to the filesystem buffer cache. Swap is assigned to dirty anonymous pages on the fly, only when the pageout daemon decides to actually clean the page. Once swap is assigned, the clean page can be reused. If a swap-backed page is brought back into memory, it is brought back in clean (swap is left assigned). Swap is only freed if the page is re-dirtied by the process. Thus most anonymous-memory pages in a heavily loaded system tend to remain clean, allowing them to be reused more easily and extending the life of the system further along the curve before it reaches a thrashing state.

* Write Clustering

  Whenever the system decides to clean a dirty page it will, on the fly, attempt to locate nearby dirty pages. FreeBSD is actually quite sophisticated in this regard, in that it goes and does the calculation to ensure that only pages physically contiguous on the disk are clustered for the write. The cluster is then written and marked clean all in one go (the cluster size limit is 64-128K).

* Sequential Detection Heuristic for read clustering (read())

  A heuristic detects sequential read behavior and implements two optimizations: (1) read-ahead (as long as the blocks are reasonably contiguous on the physical media -- we explicitly do not issue read-aheads if it would cause an extra disk seek), and (2) priority-depression read-behind (reduce by 1 the statistical weight of pages that have already been read). Reuse of the pages can still cause the statistical weighting to increase to the maximum, but this optimization has a tendency to greatly reduce the stress that large sequential reads place on the rest of the memory subsystem.

* Sequential Detection Heuristic for read clustering (VM fault)

  A heuristic detects sequential VM fault operation, either forwards or backwards, and adjusts the cluster window around the fault taken, either shifting it forwards or backwards, or making the window smaller (e.g. if random fault operation is detected).
  Fault-ahead I/O is initiated based on the algorithm, and anything found cached is pre-faulted into the page table. (The window size in FreeBSD is approximately 64 KBytes for this particular algorithm.) The window is further restricted to ensure that only media-contiguous blocks are clustered.

* Sequential Detection Heuristic for write clustering (write())

  In the case of write() I/O (the write system call), in order to avoid saturating the memory system with dirty pages, if the sequential detection heuristic determines that writes are occurring sequentially, FreeBSD implements write-behind. That is, it issues the I/O on the dirty buffers preceding the write point immediately (and asynchronously), in order to get the pages into a clean state and thus reusable, thereby avoiding stress on the memory system. In this case there is also a limit placed on the number of dirty filesystem buffers allowed to accumulate (since the I/O is slower than the write() calls creating the dirty buffers). What you wind up with in this case is maximum disk throughput for the sequential write without thousands of unnecessary dirty pages: asynchronous up to a reasonable point, and then blocking to give the I/O a chance to catch up a little in order to avoid starving the clean page cache.

* Sequential Detection Heuristic for write clustering (mmap)

  Currently not implemented under FreeBSD. This used to be a big problem because you could completely saturate the VM system with dirty pages before the system even realized it. To fix this we threw a memory-stress check into vm_fault to block when dirtying pages in the face of having too many dirty pages already, giving the I/O a chance to catch up a little. This actually improved performance because it left a greater number of clean pages available, and so the page selection algorithm in the allocator worked better (it tended to select idle pages rather than active pages).

-Matt

Matthew Dillon <dillon@backplane.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/