From: r...@conectiva.com.br (Rik van Riel) Subject: RFC: design for new VM Date: 2000/08/02 Message-ID: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva> X-Deja-AN: 653785019 Sender: owner-linux-ker...@vger.rutgers.edu X-Sender: r...@duckman.distro.conectiva X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

[Linus: I'd really like to hear some comments from you on this idea]

Hi,

here is a (rough) draft of the design for the new VM, as discussed at UKUUG and OLS. The design is heavily based on the FreeBSD VM subsystem - a proven design - with some tweaks where we think things can be improved. Some of the ideas in this design are not fully developed, but none of those "new" ideas are essential to the basic design.

The design is based around the following ideas:
- center-balanced page aging, using
  - multiple lists to balance the aging
  - a dynamic inactive target to adjust the balance to memory pressure
- physical page based aging, to avoid the "artifacts" of virtual page scanning
- separated page aging and dirty page flushing
  - kupdate flushing "old" data
  - kflushd syncing out dirty inactive pages
  - as long as there are enough (dirty) inactive pages, never mess up aging by searching for clean active pages ... even if we have to wait for disk IO to finish
- very light background aging under all circumstances, to avoid half-hour old referenced bits hanging around

Center-balanced page aging:
- goals
  - always know which pages to replace next
  - don't spend too much overhead aging pages
  - do the right thing when the working set is big but swapping is very very light (or none)
  - always keep the working set in memory in favour of use-once cache
- page aging almost like in 2.0, only on a physical page basis
  - page->age starts at PAGE_AGE_START for new pages
  - if (referenced(page)) page->age += PAGE_AGE_ADV;
  - else page->age is made smaller (linear or exponential?)
  - if page->age == 0, move the page to the inactive list
  - NEW IDEA: age pages with a lower page age more often
- data structures (page lists)
  - active list
    - per node/pgdat
    - contains pages with page->age > 0
    - pages may be mapped into processes
    - scanned and aged whenever we are short on free + inactive pages
    - maybe multiple lists for different ages, to be better resistant against streaming IO (and for lower overhead)
  - inactive_dirty list
    - per zone
    - contains dirty, old pages (page->age == 0)
    - pages are not mapped in any process
  - inactive_clean list
    - per zone
    - contains clean, old pages
    - can be reused by __alloc_pages, like free pages
    - pages are not mapped in any process
  - free list
    - per zone
    - contains pages with no useful data
    - we want to keep a few (dozen) of these around for recursive allocations
- other data structures
  - int memory_pressure
    - on page allocation or reclaim, memory_pressure++
    - on page freeing, memory_pressure-- (keep it >= 0, though)
    - decayed on a regular basis (eg. every second x -= x>>6)
    - used to determine inactive_target
  - inactive_target == one (two?) second(s) worth of memory_pressure, which is the amount of page reclaims we'll do in one second
  - free + inactive_clean >= zone->pages_high
  - free + inactive_clean + inactive_dirty >= zone->pages_high + one_second_of_memory_pressure * (zone_size / memory_size)
  - inactive_target will be limited to some sane maximum (like, num_physpages / 4)

The idea is that when we have enough old (inactive + free) pages, we will NEVER move pages from the active list to the inactive lists. We do that because we'd rather wait for some IO completion than evict the wrong page.

Kflushd / bdflush will have the honourable task of syncing the pages in the inactive_dirty list to disk before they become an issue. We'll run balance_dirty over the set of free + inactive_clean + inactive_dirty AND we'll try to keep free + inactive_clean > pages_high ... failing either of these conditions will cause bdflush to kick into action and sync some pages to disk.

If memory_pressure is high and we're doing a lot of dirty disk writes, the bdflush percentage will kick in and we'll be doing extra-aggressive cleaning. In that case bdflush will automatically become more aggressive the more page replacement is going on, which is a good thing.

Physical page based page aging

In the new VM we'll need to do physical page based page aging for a number of reasons. Ben LaHaise said he already has code to do this and it's "dead easy", so I take it this part of the code won't be much of a problem. The reasons we need to do aging on a physical page basis are:
- avoid the virtual address based aging "artifacts"
- more efficient, since we'll only scan what we need to scan (especially when we'll test the idea of aging pages with a low age more often than pages we know to be in the working set)
- more direct feedback loop, so less chance of screwing up the page aging balance

IO clustering

IO clustering is not done by the VM code, but nicely abstracted away into a page->mapping->flush(page) callback. This means that:
- each filesystem (and swap) can implement their own, isolated IO clustering scheme
- (in 2.5) we'll no longer have the buffer head list, but a list of pages to be written back to disk; this means doing stuff like delayed allocation (allocate on flush) or kiobuf based extents is fairly trivial to do

Misc

Page aging and flushing are completely separated in this scheme. We'll never end up aging and freeing a "wrong" clean page because we're waiting for IO completion of old and to-be-freed pages.

Write throttling comes quite naturally in this scheme. If we have too many dirty inactive pages we'll write throttle. We don't have to take dirty active pages into account since those are no candidate for freeing anyway. Under light write loads we will never write throttle (good) and under heavy write loads the inactive_target will be bigger and write throttling is more likely to kick in.

Some background page aging will always be done by the system. We need to do this to clear away referenced bits every once in a while. If we don't do this we can end up in the situation where, once memory pressure kicks in, pages which haven't been referenced in half an hour still have their referenced bit set and we have no way of distinguishing between newly referenced pages and ancient pages we really want to free. (I believe this is one of the causes of the "freeze" we can sometimes see in current kernels)

Over the next weeks (months?)
I'll be working on implementing the new VM subsystem for Linux, together with various other people (Andrea Arcangeli??, Ben LaHaise, Juan Quintela, Stephen Tweedie). I hope to have it ready in time for 2.5.0, but if the code turns out to be significantly more stable under load than the current 2.4 code I won't hesitate to submit it for 2.4.bignum... regards, Rik -- "What you're running that piece of s*** Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
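To make the memory_pressure arithmetic in the RFC above concrete, here is a minimal C sketch of that bookkeeping. The helper names, the once-per-second call site and the use of ">> 6" as "one second's worth" (consistent with the x -= x>>6 decay) are illustrative assumptions, not code from an actual patch.

    /*
     * Sketch of the memory_pressure bookkeeping described in the RFC.
     * All names and call sites here are illustrative assumptions.
     */
    extern unsigned long num_physpages;        /* total pages in the box */

    static unsigned long memory_pressure;      /* decaying count of page steals */

    static void pressure_page_stolen(void)     /* on page allocation / reclaim */
    {
            memory_pressure++;
    }

    static void pressure_page_freed(void)      /* on page freeing */
    {
            if (memory_pressure > 0)
                    memory_pressure--;
    }

    static void pressure_decay(void)           /* run once per second */
    {
            memory_pressure -= memory_pressure >> 6;
    }

    /*
     * With a 1/64-per-second decay the steady-state value is roughly 64
     * times the per-second steal rate, so ">> 6" approximates one
     * second's worth of reclaims; the num_physpages/4 clamp is the
     * "sane maximum" from the RFC.
     */
    static unsigned long inactive_target(void)
    {
            unsigned long target = memory_pressure >> 6;

            if (target > num_physpages / 4)
                    target = num_physpages / 4;
            return target;
    }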
From: torva...@transmeta.com (Linus Torvalds) Subject: Re: RFC: design for new VM Date: 2000/08/03 Message-ID: <Pine.LNX.4.10.10008031020440.6384-100000@penguin.transmeta.com> X-Deja-AN: 654115657 Sender: owner-linux-ker...@vger.rutgers.edu References: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

On Wed, 2 Aug 2000, Rik van Riel wrote:
>
> [Linus: I'd really like to hear some comments from you on this idea]

I am completely and utterly baffled on why you think that the multi-list approach would help balancing. Every single indication we have ever had is that balancing gets _harder_ when you have multiple sources of pages, not easier.

As far as I can tell, the only advantage of multiple lists compared to the current one is to avoid overhead in walking extra pages, no? And yet you claim that you see no way to fix the current VM behaviour. This is illogical, and sounds like complete crap to me.

Why don't you just do it with the current scheme (the only thing that needs to be added to the current scheme being the aging, which we've had before), and prove that the _balancing_ works. If you can prove that the balancing works but that we spend unnecessary time in scanning the pages, then you've proven that the basic VM stuff is right, and then the multiple queues become a performance optimization.

Yet you seem to sell the "multiple queues" idea as some fundamental change. I don't see that. Please explain what makes your ideas so radically different?

> The design is based around the following ideas:
> - center-balanced page aging, using
>   - multiple lists to balance the aging
>   - a dynamic inactive target to adjust the balance to memory pressure
> - physical page based aging, to avoid the "artifacts" of virtual page scanning
> - separated page aging and dirty page flushing
>   - kupdate flushing "old" data
>   - kflushd syncing out dirty inactive pages
>   - as long as there are enough (dirty) inactive pages, never mess up aging by searching for clean active pages ... even if we have to wait for disk IO to finish
> - very light background aging under all circumstances, to avoid half-hour old referenced bits hanging around

As far as I can tell, the above is _exactly_ equivalent to having one single list, and multiple "scan-points" on that list.

A "scan-point" is actually very easy to implement: anybody at all who needs to scan the list can just include his own "anchor-page": a "struct page_struct" that is purely local to that particular scanner, and that nobody else will touch because it has an artificially elevated usage count (and because there is actually no real page associated with that virtual "struct page" the page count will obviously never decrease ;).

Then, each scanner just advances its own anchor-page around the list, and does whatever it is that the scanner is designed to do on the page it advances over. So "bdflush" would do

  ..
  lock_list();
  struct page *page = advance(&bdflush_entry);
  if (page->buffer) {
          get_page(page);
          unlock_list();
          flush_page(page);
          continue;
  }
  unlock_list();
  ..

while the page ager would do

  lock_list();
  struct page *page = advance(&ager_entry);
  page->age = page->age >> 1;
  if (PageReferenced(page))
          page->age += PAGE_AGE_REF;
  unlock_list();

etc..

Basically, you can have any number of virtual "clocks" on a single list.
No radical changes necessary. This is something we can easily add to 2.4.x.

The reason I'm unconvinced about multiple lists is basically:

- they are inflexible. Each list has a meaning, and a page cannot easily be on more than one list. It's really hard to implement overlapping meanings: you get exponential expansion of combinations, and everybody has to be aware of them.

  For example, imagine that the definition of "dirty" might be different for different filesystems. Imagine that you have a filesystem with its own specific "walk the pages to flush out stuff", with special logic that is unique to that filesystem ("you cannot write out this page until you've done 'Y'" or whatever). This is hard to do with your approach. It is trivial to do with the single-list approach above.

  More realistic (?) example: starting write-back of pages is very different from waiting on locked pages. We may want to have a "dirty but not yet started" list, and a "write-out started but not completed" locked list. Right now we use the same "clock" for them (the head of the LRU queue with some ugly heuristic to decide whether we want to wait on anything). But we potentially really want to have separate logic for this: we want to have a background "start writeout" that goes on all the time, and then we want to have a separate "start waiting" clock that uses different principles on which point in the list to _wait_ on stuff.

  This is what we used to have in the old buffer.c code (the 2.0 code that Alan likes). And it was _horrible_ to have separate lists, because in fact pages can be both dirty and locked and they really should have been on both lists etc..

- in contrast, scan-points (without LRU, but instead working on the basis of the age of the page - which is logically equivalent) offer the potential for specialized scanners. You could have "statistics gathering robots" that you add dynamically. Or you could have per-device flush daemons.

  For example, imagine a common problem with floppies: we have a timeout for the floppy motor because it's costly to start them up again. And they are removable. A perfect floppy driver would notice when it is idle, and instead of turning off the motor it might decide to scan for dirty pages for the floppy on the (correct) assumption that it would be nice to have them all written back instead of turning off the motor and making the floppy look idle.

  With a per-device "dirty list" (which you can test out with a page scanner implementation to see if it ends up really improving floppy behaviour) you could essentially have a guarantee: whenever the floppy motor is turned off, the filesystem on that floppy is synced. Test implementation: a floppy daemon that walks the list and turns off the engine only after having walked it without having seen any dirty blocks.

  In the end, maybe you realize that you _really_ don't want a dirty list at all. You want _multiple_ dirty lists, one per device. And that's really my point. I think you're too eager to rewrite things, and not interested enough in verifying that it's the right thing. Which I think you can do with the current one-list thing easily enough.

- In the end, even if you don't need the extra flexibility of multiple clocks, splitting them up into separate lists doesn't change behaviour, it's "only" a CPU time optimization. Which may well be worth it, don't get me wrong. But I don't see why you tout this as being something radically needed in order to get better VM behaviour.
Sure, multiple lists avoid the unnecessary walking over pages that we don't care about for some particular clock. And they may well end up being worth it for that reason. But it's not a very good way of doing prototyping of the actual _behaviour_ of the lists.

To make a long story short, I'd rather see a proof-of-concept thing. And I distrust your notion that "we can't do it with the current setup, we'll have to implement something radically different". Basically, IF you think that your newly designed VM should work, then you should be able to prototype and prove it easily enough with the current one.

I'm personally of the opinion that people see that page aging etc is hard, so they try to explain the current failures by claiming that it needs a completely different approach. And in the end, I don't see what's so radically different about it - it's just a re-organization. And as far as I can see it is pretty much logically equivalent to just minor tweaks of the current one.

(The _big_ change is actually the addition of a proper "age" field. THAT is conceptually a very different approach to the matter. I agree 100% with that, and the reason I don't get all that excited about it is just that we _have_ done page aging before, and we dropped it for probably bad reasons, and adding it back should not be that big of a deal. Probably less than 50 lines of diff).

Read Dilbert about the effectiveness of (and reasons for) re-organizations.

		Linus

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
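Linus's two fragments above presuppose an advance() primitive. The following self-contained C sketch shows one way such an anchor-page "clock" could look, using a toy page structure and a hand-rolled circular list instead of the kernel's own types; everything here (tpage, is_anchor, the aging constant) is invented for illustration only.

    #include <stddef.h>

    /* toy stand-in for struct page: a list node plus the fields we age */
    struct tpage {
            struct tpage *next, *prev;
            int is_anchor;          /* dummy entry owned by one scanner */
            int age;
            int referenced;
    };

    /* one circular list holding every page plus any number of anchors */
    static struct tpage page_list = { &page_list, &page_list, 1, 0, 0 };

    static void move_after(struct tpage *entry, struct tpage *where)
    {
            /* unlink */
            entry->prev->next = entry->next;
            entry->next->prev = entry->prev;
            /* relink just behind 'where' */
            entry->next = where->next;
            entry->prev = where;
            where->next->prev = entry;
            where->next = entry;
    }

    /*
     * Step this scanner's anchor over the next real page and return that
     * page, skipping the list head and other scanners' anchors.  The
     * caller is assumed to hold whatever lock protects the list.
     */
    static struct tpage *advance(struct tpage *anchor)
    {
            struct tpage *next = anchor->next;

            while (next->is_anchor && next != anchor)
                    next = next->next;
            if (next == anchor)
                    return NULL;    /* nothing but anchors on the list */
            move_after(anchor, next);
            return next;
    }

    /*
     * One tick of a page-aging clock, in the spirit of the fragment
     * above (the anchor is assumed to have been inserted into page_list
     * beforehand).
     */
    static void age_one_page(struct tpage *ager_anchor)
    {
            struct tpage *page = advance(ager_anchor);

            if (!page)
                    return;
            page->age >>= 1;
            if (page->referenced) {
                    page->age += 3;         /* stand-in for PAGE_AGE_REF */
                    page->referenced = 0;
            }
    }

A bdflush-style writeback clock would simply be another advance() caller with its own anchor, which is the point of the exercise: any number of scanners can walk the one list independently.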
From: Chris Wedgwood <c...@f00f.org> Subject: Re: RFC: design for new VM Date: 2000/08/03 Message-ID: <linux.kernel.20000803191906.B562@metastasis.f00f.org>#1/1 X-Deja-AN: 653924407 Approved: n...@nntp-server.caltech.edu X-To: Rik van Riel <r...@conectiva.com.br> Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 X-Cc: linux...@kvack.org, linux-ker...@vger.rutgers.edu, Linus Torvalds <torva...@transmeta.com> Newsgroups: mlist.linux.kernel

On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:

    here is a (rough) draft of the design for the new VM, as discussed at UKUUG and OLS. The design is heavily based on the FreeBSD VM subsystem - a proven design - with some tweaks where we think things can be improved.

Can the differences between your system and what FreeBSD has be isolated or contained -- I ask this because the FreeBSD VM works _very_ well compared to recent linux kernels; if/when the new system is implemented it would be nice to know if performance differences are tuning related or because of 'tweaks'.


  --cw

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Theodore Y. Ts'o" <ty...@MIT.EDU> Subject: Re: RFC: design for new VM Date: 2000/08/05 Message-ID: <linux.kernel.200008052248.SAA00643@tsx-prime.MIT.EDU>#1/1 X-Deja-AN: 654919808 Approved: n...@nntp-server.caltech.edu X-To: Rik van Riel <r...@conectiva.com.br> X-CC: Chris Wedgwood <c...@f00f.org>, linux...@kvack.org, linux-ker...@vger.rutgers.edu, Matthew Dillon <dil...@apollo.backplane.com> Newsgroups: mlist.linux.kernel Date: Thu, 3 Aug 2000 13:01:56 -0300 (BRST) From: Rik van Riel <r...@conectiva.com.br> You're right, the differences between FreeBSD VM and the new Linux VM should be clearly indicated. > I ask this because the FreeBSD VM works _very_ well compared to > recent linux kernels; if/when the new system is implement it > would nice to know if performance differences are tuning related > or because of 'tweaks'. Indeed. The amount of documentation (books? nah..) on VM is so sparse that it would be good to have both systems properly documented. That would fill a void in CS theory and documentation that was painfully there while I was trying to find useful information to help with the design of the new Linux VM... ... and you know, once written, it would make a *wonderful* paper to present at Freenix or for ALS.... (speaking as someone who has been on program committees for both conferences :-) - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: dil...@apollo.backplane.com (Matthew Dillon) Subject: Re: RFC: design for new VM Date: 2000/08/04 Message-ID: <200008041541.IAA88364@apollo.backplane.com>#1/1 X-Deja-AN: 654461672 Sender: owner-linux-ker...@vger.rutgers.edu References: <Pine.LNX.4.21.0008031243070.24022-100000@duckman.distro.conectiva> Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

:> here is a (rough) draft of the design for the new VM, as discussed at UKUUG and OLS. The design is heavily based on the FreeBSD VM subsystem - a proven design - with some tweaks where we think things can be improved.
:>
:> Can the differences between your system and what FreeBSD has be isolated or contained
:
:You're right, the differences between FreeBSD VM and the new Linux VM should be clearly indicated.
:
:> I ask this because the FreeBSD VM works _very_ well compared to recent linux kernels; if/when the new system is implemented it would be nice to know if performance differences are tuning related or because of 'tweaks'.
:
:Indeed. The amount of documentation (books? nah..) on VM is so sparse that it would be good to have both systems properly documented. That would fill a void in CS theory and documentation that was painfully there while I was trying to find useful information to help with the design of the new Linux VM...
:
:regards,
:
:Rik

Three or four times in the last year I've gotten emails from people looking for 'VM documentation' or 'books they could read'. I couldn't find a blessed thing! Oh, sure, there are papers strewn about, but most are very focused on single aspects of a VM design. I have yet to find anything that covers the whole thing. I've written up an occasional 'summary piece' for FreeBSD, e.g. the Jan 2000 Daemon News article, but that really isn't adequate.

The new Linux VM design looks exciting! I will be paying close attention to your progress with an eye towards reworking some of FreeBSD's code. Except for one or two eyesores (1) the FreeBSD code is algorithmically sound, but pieces of the implementation are rather messy from years of patching. When I first started working on it the existing crew had a big bent towards patching rather than rewriting and I had to really push to get some of my rewrites through. The patching had reached the limits of the original code-base's flexibility.

note(1) - the one that came up just last week was the O(N) nature of the FreeBSD VM maps (linux uses an AVL tree here). These work fine for 95% of the apps out there but turn into a sludgepile for things like malloc debuggers and distributed shared memory systems which want to mprotect() on a page-by-page basis.

The second eyesore is the lack of physically shared page table segments for 'standard' processes. At the moment, it's an all (rfork/RFMEM/clone) or nothing (fork) deal. Physical segment sharing outside of clone is something Linux could use too; I don't think it does it either. It's not easy to do right.

-Matt
Matthew Dillon <dil...@backplane.com>

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: torva...@transmeta.com (Linus Torvalds) Subject: Re: RFC: design for new VM Date: 2000/08/04 Message-ID: <Pine.LNX.4.10.10008041033230.813-100000@penguin.transmeta.com>#1/1 X-Deja-AN: 654504126 Sender: owner-linux-ker...@vger.rutgers.edu References: <200008041541.IAA88364@apollo.backplane.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 4 Aug 2000, Matthew Dillon wrote: > > The second eyesore > is the lack of physically shared page table segments for 'standard' > processes. At the moment, it's an all (rfork/RFMEM/clone) or nothing > (fork) deal. Physical segment sharing outside of clone is something > Linux could use to, I don't think it does it either. It's not easy to > do right. It's probably impossible to do right. Basically, if you do it, you do it wrong. As far as I can tell, you basically screw yourself on the TLB and locking if you ever try to implement this. And frankly I don't see how you could avoid getting screwed. There are architecture-specific special cases, of course. On ia64, the page table is not really one page table, it's a number of pretty much independent page tables, and it would be possible to extend the notion of fork vs clone to be a per-page-table thing (ie the single-bit thing would become a multi-bit thing, and the single "struct mm_struct" would become an array of independent mm's). You could do similar tricks on x86 by virtually splitting up the page directory into independent (fixed-size) pieces - this is similar to what the PAE stuff does in hardware, after all. So you could have (for example) each process be quartered up into four address spaces with the top two address bits being the address space sub-ID. Quite frankly, it tends to be a nightmare to do that. It's also unportable: it works on architectures that either support it natively (like the ia64 that has the split page tables because of how it covers large VM areas) or by "faking" the split on regular page tables. But it does _not_ work very well at all on CPU's where the native page table is actually a hash (old sparc, ppc, and the "other mode" in IA64). Unless the hash happens to have some of the high bits map into a VM ID (which is common, but not really something you can depend on). And even when it "works" by emulation, you can't share the TLB contents anyway. Again, it can be possible on a per-architecture basis (if the different regions can have different ASI's - ia64 again does this, and I think it originally comes from the 64-bit PA-RISC VM stuff). But it's one of those bad ideas that if people start depending on it, it simply won't work that well on some architectures. And one of the beauties of UNIX is that it truly is fairly architecture-neutral. And that's just the page table handling. The SMP locking for all this looks even worse - you can't share a per-mm lock like with the clone() thing, so you have to create some other locking mechanism. I'd be interested to hear if you have some great idea (ie "oh, if you look at it _this_ way all your concerns go away"), but I suspect you have only looked at it from 10,000 feet and thought "that would be a cool thing". And I suspect it ends up being anything _but_ cool once actually implemented. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: dil...@apollo.backplane.com (Matthew Dillon) Subject: Re: RFC: design for new VM Date: 2000/08/04 Message-ID: <200008042351.QAA89101@apollo.backplane.com>#1/1 X-Deja-AN: 654613804 Sender: owner-linux-ker...@vger.rutgers.edu References: <Pine.LNX.4.10.10008041033230.813-100000@penguin.transmeta.com> Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

:> (fork) deal. Physical segment sharing outside of clone is something Linux could use too; I don't think it does it either. It's not easy to do right.
:
:It's probably impossible to do right. Basically, if you do it, you do it wrong.
:
:As far as I can tell, you basically screw yourself on the TLB and locking if you ever try to implement this. And frankly I don't see how you could avoid getting screwed.
:
:There are architecture-specific special cases, of course. On ia64, the
:..

I spent a weekend a few months ago trying to implement page table sharing in FreeBSD -- and gave up, but it left me with the feeling that it should be possible to do without polluting the general VM architecture.

For IA32, what it comes down to is that the page table generated by any segment-aligned mmap() (segment == 4MB) made by two processes should be shareable, simply by sharing the page directory entry (and thus the physical page representing 4MB worth of mappings). This would be restricted to MAP_SHARED mappings with the same protections, but the two processes would not have to map the segments at the same VM address, they need only be segment-aligned.

This would be a transparent optimization wholly invisible to the process, something that would be optionally implemented in the machine-dependent part of the VM code (with general support in the machine-independent part for the concept). If the process did anything to create a mapping mismatch, such as call mprotect(), the shared page table would be split.

The problem being solved for FreeBSD is actually quite serious -- due to FreeBSD's tracking of individual page table entries, being able to share a page table would radically reduce the amount of tracking information required for any large shared areas (shared libraries, large shared file mappings, large sysv shared memory mappings). For linux the problem is relatively minor - linux would save considerable page table memory. Linux is still reasonably scalable without the optimization while FreeBSD currently falls on its face for truly huge shared mappings (e.g. 300 processes all mapping a shared 1GB memory area, aka Oracle 8i). (Linux falls on its face for other reasons, mainly the fact that it maps all of physical memory into KVM in order to manage it.)

I think the loss of MP locking for this situation is outweighed by the benefit of a huge reduction in page faults -- rather than see 300 processes each take a page fault on the same page, only the first process would and the pte would already be in place when the others got to it. When it comes right down to it, page faults on shared data sets are not really an issue for MP scalability.

In any case, this is a 'dream' for me for FreeBSD right now. It's a very difficult problem to solve.

-Matt

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
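As a rough illustration of the sharing rule Dillon describes (MAP_SHARED, same protections, segment-aligned, same 4MB chunk of the same object), here is a small C sketch of the compatibility test. The types and the decision to compare whole mappings rather than individual 4MB segments are simplifications invented for the example, not FreeBSD or Linux code.

    #include <stdint.h>

    #define SEGMENT_SIZE (4UL * 1024 * 1024)   /* VM covered by one IA32 pde */

    /* invented, simplified description of one process's mapping */
    struct mapping_desc {
            uint64_t object_id;        /* backing file / SysV shm object */
            uint64_t file_offset;      /* offset of the mapping in that object */
            unsigned long vaddr;       /* start address in this process */
            unsigned long length;
            unsigned int prot;         /* PROT_READ | PROT_WRITE | ... */
            int map_shared;            /* MAP_SHARED rather than MAP_PRIVATE */
    };

    /*
     * Could these two mappings share the physical page(s) of ptes?  Both
     * must be MAP_SHARED views of the same object with the same
     * protection, both must start on a 4MB boundary in memory *and* in
     * the file, and (in this simplified version) they must cover the
     * same range of the object.  The virtual addresses themselves do
     * not have to match, which is exactly Dillon's point.
     */
    static int can_share_page_table(const struct mapping_desc *a,
                                    const struct mapping_desc *b)
    {
            if (!a->map_shared || !b->map_shared)
                    return 0;
            if (a->object_id != b->object_id || a->prot != b->prot)
                    return 0;
            if ((a->vaddr | b->vaddr) & (SEGMENT_SIZE - 1))
                    return 0;
            if ((a->file_offset | b->file_offset) & (SEGMENT_SIZE - 1))
                    return 0;
            return a->file_offset == b->file_offset && a->length == b->length;
    }

A real implementation would make this decision per 4MB segment rather than per mapping, and would have to split the sharing again on mprotect(), as the post notes.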
From: torva...@transmeta.com (Linus Torvalds) Subject: Re: RFC: design for new VM Date: 2000/08/05 Message-ID: <Pine.LNX.4.10.10008041655420.11340-100000@penguin.transmeta.com>#1/1 X-Deja-AN: 654617173 Sender: owner-linux-ker...@vger.rutgers.edu References: <200008042351.QAA89101@apollo.backplane.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

On Fri, 4 Aug 2000, Matthew Dillon wrote:
> :
> :There are architecture-specific special cases, of course. On ia64, the
> :..
>
> I spent a weekend a few months ago trying to implement page table sharing in FreeBSD -- and gave up, but it left me with the feeling that it should be possible to do without polluting the general VM architecture.
>
> For IA32, what it comes down to is that the page table generated by any segment-aligned mmap() (segment == 4MB) made by two processes should be shareable, simply by sharing the page directory entry (and thus the physical page representing 4MB worth of mappings). This would be restricted to MAP_SHARED mappings with the same protections, but the two processes would not have to map the segments at the same VM address, they need only be segment-aligned.

I agree that from a page table standpoint you should be correct.

I don't think that the other issues are as easily resolved, though. Especially with address space ID's on other architectures it can get _really_ interesting to do TLB invalidates correctly to other CPU's etc (you need to keep track of who shares parts of your page tables etc).

> This would be a transparent optimization wholly invisible to the process, something that would be optionally implemented in the machine-dependent part of the VM code (with general support in the machine-independent part for the concept). If the process did anything to create a mapping mismatch, such as call mprotect(), the shared page table would be split.

Right. But what about the TLB?

It's not a problem on the x86, because the x86 doesn't have ASN's anyway. But for it to be a valid notion, I feel that it should be able to be portable too.

You have to have some page table locking mechanism for SMP eventually: I think you miss some of the problems because the current FreeBSD SMP stuff is mostly still "big kernel lock" (outdated info?), and you'll end up kicking yourself in a big way when you have the 300 processes sharing the same lock for that region.. (Not that I think you'd necessarily have much contention on the lock - the problem tends to be more in the logistics of keeping track of the locks of partial VM regions etc).

> (Linux falls on its face for other reasons, mainly the fact that it maps all of physical memory into KVM in order to manage it).

Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;)

> I think the loss of MP locking for this situation is outweighed by the benefit of a huge reduction in page faults -- rather than see 300 processes each take a page fault on the same page, only the first process would and the pte would already be in place when the others got to it. When it comes right down to it, page faults on shared data sets are not really an issue for MP scalability.

I think you'll find that there are all these small details that just cannot be solved cleanly. Do you want to be stuck with an x86-only solution?
That said, I cannot honestly say that I have tried very hard to come up with solutions. I just have this feeling that it's a dark ugly hole that I wouldn't want to go down.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
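To illustrate the bookkeeping Linus is objecting to, here is a small invented C sketch of what invalidating one page of a shared pte page could involve once several address spaces (possibly mapping the segment at different virtual addresses) have to be reached. None of these types or functions exist in either kernel; they only make the extra synchronization visible.

    #define MAX_SHARERS 8              /* arbitrary for the example */

    /* one address space currently sharing the pte page */
    struct sharer {
            unsigned long cpu_mask;    /* CPUs this address space is running on */
            unsigned long seg_base;    /* where it maps the shared 4MB segment */
    };

    struct shared_pte_page {
            int nr_sharers;            /* protected by some lock, not shown */
            struct sharer *sharer[MAX_SHARERS];
    };

    /* hypothetical primitive: flush one virtual address on a set of CPUs */
    extern void tlb_flush_others(unsigned long cpu_mask, unsigned long vaddr);

    /*
     * Invalidating a single page now means one shootdown per sharing
     * address space, each at a (possibly different) virtual address,
     * and the sharer list has to be kept stable while we walk it -- the
     * kind of extra tracking and locking the reply above is worried
     * about.
     */
    static void flush_shared_pte(const struct shared_pte_page *sp,
                                 unsigned long offset_in_segment)
    {
            int i;

            for (i = 0; i < sp->nr_sharers; i++)
                    tlb_flush_others(sp->sharer[i]->cpu_mask,
                                     sp->sharer[i]->seg_base + offset_in_segment);
    }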
From: dil...@apollo.backplane.com (Matthew Dillon) Subject: Re: RFC: design for new VM Date: 2000/08/05 Message-ID: <200008050152.SAA89298@apollo.backplane.com>#1/1 X-Deja-AN: 654639522 Sender: owner-linux-ker...@vger.rutgers.edu References: <Pine.LNX.4.10.10008041655420.11340-100000@penguin.transmeta.com> Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu :I agree that from a page table standpoint you should be correct. : :I don't think that the other issues are as easily resolved, though. :Especially with address space ID's on other architectures it can get :_really_ interesting to do TLB invalidates correctly to other CPU's etc :(you need to keep track of who shares parts of your page tables etc). : :... :> mismatch, such as call mprotect(), the shared page table would be split. : :Right. But what about the TLB? I'm not advocating trying to share TLB entries, that would be a disaster. I'm contemplating just the physical page table structure. e.g. if you mmap() a 1GB file shared (or private read-only) into 300 independant processes, it should be possible to share all the meta-data required to support that mapping except for the TLB entries themselves. ASNs shouldn't make a difference... presumably the tags on the TLB entries are added on after the metadata lookup. I'm also not advocating attempting to share intermediate 'partial' in-memory TLB caches (hash tables or other structures). Those are typically fixed in size, per-cpu, and would not be impacted by scale. :You have to have some page table locking mechanism for SMP eventually: I :think you miss some of the problems because the current FreeBSD SMP stuff :is mostly still "big kernel lock" (outdated info?), and you'll end up :kicking yourself in a big way when you have the 300 processes sharing the :same lock for that region.. If it were a long-held lock I'd worry, but if it's a lock on a pte I don't think it can hurt. After all, even with separate page tables if 300 processes fault on the same backing file offset you are going to hit a bottleneck with MP locking anyway, just at a deeper level (the filesystem rather then the VM system). The BSDI folks did a lot of testing with their fine-grained MP implementation and found that putting a global lock around the entire VM system had absolutely no impact on MP performance. :> (Linux falls on its face for other reasons, mainly the fact that it :> maps all of physical memory into KVM in order to manage it). : :Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;) Oh, that's cool! I don't think anyone in FreeBSDland has bothered with large-memory (> 4GB) memory configurations, there doesn't seem to be much demand for such a thing on IA32. :> I think the loss of MP locking for this situation is outweighed by the :> benefit of a huge reduction in page faults -- rather then see 300 :> processes each take a page fault on the same page, only the first process :> would and the pte would already be in place when the others got to it. :> When it comes right down to it, page faults on shared data sets are not :> really an issue for MP scaleability. : :I think you'll find that there are all these small details that just :cannot be solved cleanly. Do you want to be stuck with a x86-only :solution? : :That said, I cannot honestly say that I have tried very hard to come up :with solutions. I just have this feeling that it's a dark ugly hole that I :wouldn't want to go down.. : : Linus Well, I don't think this is x86-specific. 
Or, that is, I don't think it would pollute the machine-independant code. FreeBSD has virtually no notion of 'page tables' outside the i386-specific VM files... it doesn't use page tables (or two-level page-like tables... is Linux still using those?) to store meta information at all in the higher levels of the kernel. It uses architecture-independant VM objects and vm_map_entry structures for that. Physical page tables on FreeBSD are throw-away-at-any-time entities. The actual implementation of the 'page table' in the IA32 sense occurs entirely in the machine-dependant subdirectory for IA32. A page-table sharing mechanism would have to implement the knowledge -- the 'potential' for sharing at a higher level (the vm_map_entry structure), but it would be up to the machine-dependant VM code to implement any actual sharing given that knowledge. So while the specific implementation for IA32 is definitely machine-specific, it would have no effect on other OS ports (of course, we have only one other working port at the moment, to the alpha, but you get the idea). -Matt Matthew Dillon <dil...@backplane.com> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: torva...@transmeta.com (Linus Torvalds) Subject: Re: RFC: design for new VM Date: 2000/08/05 Message-ID: <Pine.LNX.4.10.10008041854240.1727-100000@penguin.transmeta.com>#1/1 X-Deja-AN: 654642040 Sender: owner-linux-ker...@vger.rutgers.edu References: <200008050152.SAA89298@apollo.backplane.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 4 Aug 2000, Matthew Dillon wrote: > : > :Right. But what about the TLB? > > I'm not advocating trying to share TLB entries, that would be > a disaster. You migth have to, if the machine has a virtually mapped cache.. Ugh. That gets too ugly to even contemplate, actually. Just forget the idea. > If it were a long-held lock I'd worry, but if it's a lock on a pte > I don't think it can hurt. After all, even with separate page tables > if 300 processes fault on the same backing file offset you are going > to hit a bottleneck with MP locking anyway, just at a deeper level > (the filesystem rather then the VM system). The BSDI folks did a lot > of testing with their fine-grained MP implementation and found that > putting a global lock around the entire VM system had absolutely no > impact on MP performance. Hmm.. That may be load-dependent, but I know it wasn't true for Linux. The kernel lock for things like brk() were some of the worst offenders, and people worked hard on making mmap() and friends not need the BKL exactly because it showed up very clearly in the lock profiles. > :> (Linux falls on its face for other reasons, mainly the fact that it > :> maps all of physical memory into KVM in order to manage it). > : > :Not true any more.. Trying to map 64GB of RAM convinced us otherwise ;) > > Oh, that's cool! I don't think anyone in FreeBSDland has bothered with > large-memory (> 4GB) memory configurations, there doesn't seem to be > much demand for such a thing on IA32. Not normally no. Linux didn't start seeing the requirement until last year or so, when running big databases and big benchmarks just required it because the working set was so big. "dbench" with a lot of clients etc. Now, whether such a working set is realistic or not is another issue, of course. 64GB isn't as much memory as it used to be, though, and we couldn't have beated the mindcraft NT numbers without large memory support. > Well, I don't think this is x86-specific. Or, that is, I don't think it > would pollute the machine-independant code. FreeBSD has virtually no > notion of 'page tables' outside the i386-specific VM files... it doesn't > use page tables (or two-level page-like tables... is Linux still using > those?) to store meta information at all in the higher levels of the > kernel. It uses architecture-independant VM objects and vm_map_entry > structures for that. Physical page tables on FreeBSD are > throw-away-at-any-time entities. The actual implementation of the > 'page table' in the IA32 sense occurs entirely in the machine-dependant > subdirectory for IA32. It's not the page tables themselves I worry about, but all the meta-data synchronization requirements. But hey. Go wild, prove me wrong. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: l...@veszprog.hu (Gabor Lenart) Subject: Re: RFC: design for new VM Date: 2000/08/07 Message-ID: <20000807121145.D2872@veszprog.hu>#1/1 X-Deja-AN: 655365329 X-Operating-System: galaxy Linux 2.2.16 i686 Content-Transfer-Encoding: QUOTED-PRINTABLE Sender: owner-linux-ker...@vger.rutgers.edu References: <Pine.LNX.4.21.0008021212030.16377-100000@duckman.distro.conectiva> <20000803191906.B562@metastasis.f00f.org> Content-Type: text/plain; charset=iso-8859-2 MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

On Thu, Aug 03, 2000 at 07:19:06PM +1200, Chris Wedgwood wrote:
> On Wed, Aug 02, 2000 at 07:08:52PM -0300, Rik van Riel wrote:
>
>     here is a (rough) draft of the design for the new VM, as discussed at UKUUG and OLS. The design is heavily based on the FreeBSD VM subsystem - a proven design - with some tweaks where we think things can be improved.
>
> Can the differences between your system and what FreeBSD has be isolated or contained -- I ask this because the FreeBSD VM works _very_ well compared to recent linux kernels; if/when the new system is implemented it would be nice to know if performance differences are tuning related or because of 'tweaks'.

A little question. AFAIK Linux needs less memory than FreeBSD. The new FreeBSD-like VM will cause Linux not to work on the little machines at our Univ which use Linux, because they're too underpowered to run FreeBSD (the previous sysadmin ran FreeBSD everywhere, and only those machines couldn't run FreeBSD as fast as Linux).

--
+-[ Lénárt Gábor ]----[ http://lgb.supervisor.hu/ ]------[ +36 30 2270823 ]--+
|--UNIX--OpenSource--> The future is in our hands. <--LME--Linux--|
+-----[ Veszprog Kft ]------[ Supervisor BT ]-------[ Expertus Kft ]--------+

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: a...@lxorguk.ukuu.org.uk (Alan Cox) Subject: Re: RFC: design for new VM Date: 2000/08/07 Message-ID: <E13Lld0-0003XX-00@the-village.bc.nu>#1/1 X-Deja-AN: 655392425 Content-Transfer-Encoding: 7bit Sender: owner-linux-ker...@vger.rutgers.edu References: <20000807121145.D2872@veszprog.hu> Content-Type: text/plain; charset=us-ascii MIME-Version: 1.0 Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu

> A little question. AFAIK Linux needs less memory than FreeBSD. The new

It depends what you are doing. Especially with newer BSD

> FreeBSD-like VM will cause Linux not to work on the little machines at
> our Univ which use Linux, because they're too underpowered to run FreeBSD
> (the previous sysadmin ran FreeBSD everywhere, and only those machines
> couldn't run FreeBSD as fast as Linux).

The VM changes will make the small boxes run faster if done right. At least page aging worked right on 2.0 !

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Gerrit.Huize...@us.ibm.com Subject: Re: RFC: design for new VM Date: 2000/08/07 Message-ID: <200008071740.KAA25895@eng2.sequent.com> X-Deja-AN: 655502734 Sender: owner-linux-ker...@vger.rutgers.edu References: <8725692F.0079E22B.00@d53mta03h.boulder.ibm.com> Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Reply-To: Gerrit.Huize...@us.ibm.com Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu Hi Rik, I have a few comments on your RFC for VM. Some are simply observational, some are based on our experience locally with the development, deployment and maintenance of a VM subsystem here at IBM NUMA-Q (formerly Sequent Computer Systems, Inc.). As you may remember, our VM subsystem was initially designed in ~1982-1984 to operate on 30 processor SMP machines, and in roughly 1993-1995 it was updated to support NUMA systems up to 64 processors. Our machines started with ~1 GB of physical memory, and today support up to 64 GB of physical memory on a 32-64 processor machine. These machines run a single operating system (DYNIX/ptx) which is derived originally from BSD 4.2, although the VM subsystem has been completely rewritten over the years. Along the way, we learned many things about memory latency, large memory support, SMP & NUMA issues, some of which may be useful to you in your current design effort. First, and perhaps foremost, I believe your design deals almost exclusively with page aging & page replacement algorithms, rather than being a complete VM redesign, although feel free to correct me if I have misconstrued that. For instance, I don't believe you are planning to redo the 3 or 4 tier page table layering as part of your effort, nor are you changing memory allocation routines in any kernel-visible way. I also don't see any modifications to kernel pools, general memory management of free pages (e.g. AVL trees vs. linked lists), any changes to the PAE mechanism currently in use, no reference to alternate page sizes (e.g. Intel PSE), buffer/page cache organization, etc. I also see nothing in the design which reduces the needs for global TLB flushes across this system, which is one area where I believe Linux is starting to suffer as CPU counts increase. I believe a full VM redesign would tend to address all of these issues, even if it did so in a completely modular fashion. I also note that you intend to draw heavily from the FreeBSD implementation. Two areas in which to be very careful here have already been mentioned, but they are worth restating: FreeBSD has little to no SMP experience (e.g. kernel big lock) and little to no large memory experience. I believe Linux is actually slighly more advanced in both of these areas, and a good redesign should preserve and/or improve on those capabilities. I believe that your current proposed aging mechanism, while perhaps a positive refinement of what currently exists, still suffers from a fundamental problem in that you are globally managing page aging. In both large memory systems and in SMP systems, scaleability is greatly enhanced if major capabilities like page aging can in some way be localized. One mechanism might be to use something like per-CPU zones from which private pages are typically allocated from and freed to. This, in conjunction with good scheduler affinity, maximizes the benefits of any CPU L1/L2 cache. Another mechanism, and the one that we chose in our operating system, was to use a modified process resident set sizes as the machanism for page management. 
The basic modifications are to make the RSS tuneable system wide as well as per process. The RSS size "flexes" based on available memory and a processes page fault frequency (PFF). Frequent page faults force the RSS to increase, infrequent page faults cause a processes resident size to shrink. When memory pressure mounts, the running process manages itself a little more agressively; processes which have "flexed" their resident set size beyond their system or per process recommended maxima are among the first to lose pages. And, when pressure can not be addressed to RSS management, swapping starts. Another fundamental flaw I see with both the current page aging mechanism and the proposed mechanism is that workloads which exhaust memory pay no penalty at all until memory is full. Then there is a sharp spike in the amount of (slow) IO as pages are flushed, processes are swapped, etc. There is no apparent smoothing of spikes, such as increasing the rate of IO as the rate of memory pressure increases. With the exception of laptops, most machines can sustain a small amount of background asynchronous IO without affecting performance (laptops may want IO batched to maximize battery life). I would propose that as memory pressure increases, paging/swapping IO should increase somewhat proportionally. This provides some smoothing for the bursty nature of most single user or small ISP workloads. I believe databases style loads on larger machines would also benefit. Your current design does not address SMP locking at all. I would suggest that a single VM lock would provide reasonable scaleability up to about 16 processors, depending on page size, memory size, processor speed, and the ratio of processor speed to memory bandwidth. One method for stretching that lock is to use zoned, per-processor (or per-node) data for local page allocations whenever possible. Then local allocations can use minimal locking (need only to protect from memory allocations in interrupt code). Further, the layout of memory in a bitmaped, power of 2 sized "buddy system" can speed allocations, reducing the amount of time during which a critical lock needs to be held. AVL trees will perform similarly well, with the exception that a resource bitmap tends to be easier on TLB entries and processor cache. A bitmaped allocator may also be useful in more efficiently allocating pages of variable sizes on a CPU which supports variable sized pages in hardware. Also, I note that your filesys->flush() mechanism utilizes a call per page. This is an interesting capability, although I'd question the processor efficiency of a page granularity here. On large memory systems, with large processes starting (e.g. Netscape, StarOffice, or possible a database client), it seems like a callback to a filesystem which said something like flush("I must have at least 10 pages from you", "and I'd really like 100 pages") might be a better way to use this advisory capability. You've already pointed out that you may request that a specific page might be requested but other pages may be freed; this may be a more explicit way to code the policy you really want. It would also be interesting to review the data structure you intend to use in terms of cache line layout, as well as look at the algorithms which use those structures with an eye towards minimizing page & cache hits for both SMP *and* single processor efficiency. 
Hope this is of some help, Gerrit Huizenga IBM NUMA-Q (nee' Sequent) Gerrit.Huize...@us.ibm.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
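For readers unfamiliar with the page-fault-frequency approach Gerrit describes, the following C fragment sketches how a resident-set target could "flex" with the fault rate. The thresholds, growth factors and field names are assumptions made up for the illustration, not DYNIX/ptx code.

    /* invented per-process bookkeeping for the example */
    struct rss_state {
            unsigned long rss;         /* current resident-set target (pages) */
            unsigned long rss_min;     /* never shrink below this */
            unsigned long rss_max;     /* recommended per-process maximum */
            unsigned long faults;      /* page faults in the last interval */
    };

    /*
     * Called once per sampling interval for each process.  A high page
     * fault frequency lets the resident set grow (even past rss_max
     * while memory is plentiful); a low one shrinks it back toward the
     * working footprint.  Under pressure, processes that flexed beyond
     * their recommended maximum are trimmed first.
     */
    static void flex_rss(struct rss_state *p,
                         unsigned long high_pff, unsigned long low_pff,
                         int memory_tight)
    {
            if (p->faults > high_pff)
                    p->rss += p->rss / 8 + 1;
            else if (p->faults < low_pff && p->rss > p->rss_min)
                    p->rss -= (p->rss - p->rss_min) / 4;

            if (memory_tight && p->rss > p->rss_max)
                    p->rss = p->rss_max;

            p->faults = 0;             /* start a new sampling interval */
    }

The page stealer would then treat pages beyond a process's flexed target as the first candidates for reclaim, which is what localizes the aging work per process rather than per machine.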
From: c...@monkey.org (Chuck Lever) Subject: Re: RFC: design for new VM Date: 2000/08/07 Message-ID: <Pine.BSO.4.20.0008071641300.2595-100000@naughty.monkey.org>#1/1 X-Deja-AN: 655572257 Sender: owner-linux-ker...@vger.rutgers.edu References: <200008071740.KAA25895@eng2.sequent.com> Content-Type: TEXT/PLAIN; charset=US-ASCII MIME-Version: 1.0 Reply-To: chuckle...@bigfoot.com Newsgroups: linux.dev.kernel X-Loop: majord...@vger.rutgers.edu hi gerrit- good to see you on the list. On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote: > Another fundamental flaw I see with both the current page aging mechanism > and the proposed mechanism is that workloads which exhaust memory pay > no penalty at all until memory is full. Then there is a sharp spike > in the amount of (slow) IO as pages are flushed, processes are swapped, > etc. There is no apparent smoothing of spikes, such as increasing the > rate of IO as the rate of memory pressure increases. With the exception > of laptops, most machines can sustain a small amount of background > asynchronous IO without affecting performance (laptops may want IO > batched to maximize battery life). I would propose that as memory > pressure increases, paging/swapping IO should increase somewhat > proportionally. This provides some smoothing for the bursty nature of > most single user or small ISP workloads. I believe databases style > loads on larger machines would also benefit. 2 comments here. 1. kswapd runs in the background and wakes up every so often to handle the corner cases that smooth bursty memory request workloads. it executes the same code that is invoked from the kernel's memory allocator to reclaim pages. 2. i agree with you that when the system exhausts memory, it hits a hard knee; it would be better to soften this. however, the VM system is designed to optimize the case where the system has enough memory. in other words, it is designed to avoid unnecessary work when there is no need to reclaim memory. this design was optimized for a desktop workload, like the scheduler or ext2 "async" mode. if i can paraphrase other comments i've heard on these lists, it epitomizes a basic design philosophy: "to optimize the common case gains the most performance advantage." can a soft-knee swapping algorithm be demonstrated that doesn't impact the performance of applications running on a system that hasn't exhausted its memory? - Chuck Lever -- corporate: <chu...@netscape.com> personal: <chuckle...@bigfoot.com> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel <r...@conectiva.com.br>
Subject: Re: RFC: design for new VM
Date: 2000/08/07
Message-ID: <linux.kernel.Pine.LNX.4.21.0008071844100.25008-100000@duckman.distro.conectiva>
X-Deja-AN: 655621932
Approved: n...@nntp-server.caltech.edu
X-To: chuckle...@bigfoot.com
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: Gerrit.Huize...@us.ibm.com, linux...@kvack.org, linux-ker...@vger.rutgers.edu, Linus Torvalds <torva...@transmeta.com>
Newsgroups: mlist.linux.kernel

On Mon, 7 Aug 2000, Chuck Lever wrote:
> On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote:
> > Another fundamental flaw I see with both the current page aging mechanism
> > and the proposed mechanism is that workloads which exhaust memory pay
> > no penalty at all until memory is full. Then there is a sharp spike
> > in the amount of (slow) IO as pages are flushed, processes are swapped,
> > etc. There is no apparent smoothing of spikes, such as increasing the
> > rate of IO as the rate of memory pressure increases. With the exception
> > of laptops, most machines can sustain a small amount of background
> > asynchronous IO without affecting performance (laptops may want IO
> > batched to maximize battery life). I would propose that as memory
> > pressure increases, paging/swapping IO should increase somewhat
> > proportionally. This provides some smoothing for the bursty nature of
> > most single user or small ISP workloads. I believe database-style
> > loads on larger machines would also benefit.
>
> 2 comments here.
>
> 1. kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads. it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.

*nod* The idea is that the memory_pressure variable indicates how much
page stealing is going on (on average) so every time kswapd wakes up it
knows how many pages to steal. That way it should (if we're "lucky")
free enough pages to get us along until the next time kswapd wakes up.

> 2. i agree with you that when the system exhausts memory, it
> hits a hard knee; it would be better to soften this.

The memory_pressure variable is there to ease this. If the load is more
or less bursty, but constant on a somewhat longer timescale (say one
minute), then we'll average the inactive_target to somewhere between one
and two seconds' worth of page steals.

> can a soft-knee swapping algorithm be demonstrated that doesn't
> impact the performance of applications running on a system that
> hasn't exhausted its memory?

The algorithm we're using (dynamic inactive target w/ aggressively
trying to meet that target) will eat disk bandwidth in the case of one
application filling memory really fast but not swapping, but since the
data is kept in memory, it shouldn't be a very big performance penalty
in most cases.

About NUMA scalability: we'll have different memory pools per NUMA node.
So if you have a 32-node, 64GB NUMA machine, it'll partly function like
32 independent 2GB machines. We'll have to find a solution for the
pagecache_lock (how do we make this more scalable?), but the
pagecache_lru_lock, the memory queues/lists and kswapd will be per
_node_.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
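[A minimal C sketch of the memory_pressure bookkeeping described above,
as an illustration of the idea only - it is not Rik's patch. The decay
factor, the helper names and the clamping are assumptions; the decayed
counter averages bursts out over roughly a minute, matching the
"constant on a somewhat longer timescale" case above.]

/* All names and constants are illustrative assumptions. */
extern unsigned long num_physpages;

static unsigned long memory_pressure;   /* decaying count of page steals/allocations */

/* bumped from the allocator and from the page stealing code */
static inline void note_page_steal(void)
{
        memory_pressure++;
}

/* run roughly once per second, e.g. whenever kswapd wakes up */
static unsigned long recalc_inactive_target(void)
{
        unsigned long one_second, target;

        /* exponential decay: with a ~1/64 decay per second the counter
         * holds about a minute's worth of events, so a one-second
         * slice of it is roughly memory_pressure/64 */
        memory_pressure -= memory_pressure >> 6;
        one_second = memory_pressure >> 6;

        /* keep one to two seconds' worth of page steals inactive */
        target = 2 * one_second;

        /* clamp the target to something sane */
        if (target > num_physpages / 4)
                target = num_physpages / 4;

        return target;
}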
From: Gerrit.Huize...@us.ibm.com
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <200008080048.RAA13326@eng2.sequent.com>
X-Deja-AN: 655647982
Sender: owner-linux-ker...@vger.rutgers.edu
References: <87256934.0078DADB.00@d53mta03h.boulder.ibm.com>
Reply-To: Gerrit.Huize...@us.ibm.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

> On Mon, 7 Aug 2000, Rik van Riel wrote:
> The idea is that the memory_pressure variable indicates how
> much page stealing is going on (on average) so every time
> kswapd wakes up it knows how many pages to steal. That way
> it should (if we're "lucky") free enough pages to get us
> along until the next time kswapd wakes up.

Seems like you could signal kswapd when either the page fault rate
increases or the rate of (memory allocations / memory frees) hits a
tuneable(?) ratio (I hate relying on luck, simply because so much luck
is bad ;-)

> About NUMA scalability: we'll have different memory pools
> per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> it'll partly function like 32 independent 2GB machines.

One lesson we learned early on is that anything you can possibly do on
a per-CPU basis helps both SMP and NUMA activity. This includes memory
management, scheduling, TCP performance counters, any kind of system
counters, etc. Once you have the basic SMP hierarchy in place, adding a
NUMA hierarchy (or more than one for architectures that need it) is
much easier.

Also, is there a kswapd per pool? Or does one kswapd oversee all of the
pools (in the NUMA world, that is)?

gerrit
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
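[A tiny sketch of the trigger Gerrit suggests, purely as an
illustration: the counters, the tuneable and wakeup_kswapd() are
invented names, and real code would want to decay or window the
counters rather than let them grow forever.]

/* Illustrative only; all names are assumptions. */
extern void wakeup_kswapd(void);

static unsigned long recent_allocs, recent_frees;
static unsigned long alloc_free_ratio = 4;      /* tuneable */

static void note_page_free(void)
{
        recent_frees++;
}

static void note_page_alloc(void)
{
        recent_allocs++;

        /* wake kswapd early once allocations outpace frees by the
         * tuneable ratio, rather than waiting for a free page shortage */
        if (recent_allocs >= alloc_free_ratio * (recent_frees + 1))
                wakeup_kswapd();
}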
From: Gerrit.Huize...@us.ibm.com
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <200008080036.RAA03032@eng2.sequent.com>
X-Deja-AN: 655658668
Sender: owner-linux-ker...@vger.rutgers.edu
References: <87256934.0072FA16.00@d53mta04h.boulder.ibm.com>
Reply-To: Gerrit.Huize...@us.ibm.com
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

Hi Chuck,

> 1. kswapd runs in the background and wakes up every so often to handle
> the corner cases that smooth bursty memory request workloads. it executes
> the same code that is invoked from the kernel's memory allocator to
> reclaim pages.

yep... We do the same, although primarily through RSS management and
our pageout daemon (separate from swapout). One possible difference -
dirty pages are scheduled for asynchronous flush to disk and then moved
to the end of the free list after IO is complete. If the process faults
on that page, either before it is paged out or afterwards, it can be
"reclaimed" either from the dirty list or the free list, without
re-reading from disk. The pageout daemon runs when the dirty list
reaches a tuneable size and shrinks it back down to a (second) tuneable
size, moving all written pages to the free list.

In many ways, similar to what Rik is proposing, although I don't see
any "fast reclaim" capability. Also, the method by which pages are aged
is quite different (global phys memory scan vs. processes maintaining
their own LRU set). Having a list of prime candidates to flush makes
the kswapd/pageout overhead lower than using a global clock hand, but
the global clock hand *may* perform better global optimisation of page
aging.

> 2. i agree with you that when the system exhausts memory, it hits a hard
> knee; it would be better to soften this. however, the VM system is
> designed to optimize the case where the system has enough memory. in
> other words, it is designed to avoid unnecessary work when there is no
> need to reclaim memory. this design was optimized for a desktop workload,
> like the scheduler or ext2 "async" mode. if i can paraphrase other
> comments i've heard on these lists, it epitomizes a basic design
> philosophy: "optimizing the common case gains the most performance
> advantage."

This works fine until I have a stable load on my system and then start
{Netscape, StarOffice, VMware, etc.}, which then causes IO for demand
paging of the executable, as well as paging/swapping activity to make
room for the piggish footprints of these bigger applications. This is
where it might help to pre-write dirty pages when the system is more
idle, without fully returning those pages to the free list.

> can a soft-knee swapping algorithm be demonstrated that doesn't impact the
> performance of applications running on a system that hasn't exhausted its
> memory?
>
> - Chuck Lever

Our VM doesn't exhibit a strong knee, but its method of avoiding that
is again the flexing RSS management. Inactive processes tend to shrink
to their working footprint; larger processes tend to grow to expand
their footprint but still self-manage within the limits of available
memory.

I think it is possible to soften the knee on a per-workload basis, and
that's probably a spot for some tuneables (e.g. when to flush dirty old
pages, and how many to flush), and I think Rik has already talked about
having those tuneables. Despite the fact that our systems have been
primarily deployed for a single workload type (databases), we still
have found that (the right!) VM tuneables can have an enormous impact
on performance. I think the same will be much more true of an OS like
Linux, which tries to be many things to all people.

gerrit
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
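[The "reclaimed ... without re-reading from disk" scheme Gerrit
describes could be sketched roughly as below. This is not Dynix/ptx
source and not Linux code; every structure and helper here is invented
purely to illustrate the fault path he outlines.]

/* Everything here is a hypothetical illustration. */
struct vpage {
        struct vpage    *next, *prev;   /* linkage on the dirty or free list */
        void            *object;        /* which mapping the cached data belongs to */
        unsigned long   offset;
        int             data_valid;     /* contents still intact after pageout? */
};

extern struct vpage *lookup_dirty_or_free(void *object, unsigned long offset);
extern void unlink_page(struct vpage *page);    /* take it off its current list */
extern struct vpage *take_free_page(void);
extern void read_page_from_disk(struct vpage *page, void *object,
                                unsigned long offset);

/* Fault path: before taking a fresh page and going to disk, see whether
 * the old page is still sitting on the dirty list (waiting to be
 * written) or on the free list (already written, data not yet reused). */
struct vpage *fault_in_page(void *object, unsigned long offset)
{
        struct vpage *page = lookup_dirty_or_free(object, offset);

        if (page && page->data_valid) {
                unlink_page(page);      /* fast reclaim, no disk read */
                return page;
        }

        page = take_free_page();
        read_page_from_disk(page, object, offset);
        return page;
}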
From: r...@conectiva.com.br (Rik van Riel)
Subject: Re: RFC: design for new VM
Date: 2000/08/08
Message-ID: <Pine.LNX.4.21.0008081216090.5200-100000@duckman.distro.conectiva>
X-Deja-AN: 655876173
Sender: owner-linux-ker...@vger.rutgers.edu
References: <200008080048.RAA13326@eng2.sequent.com>
X-Sender: r...@duckman.distro.conectiva
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: linux.dev.kernel
X-Loop: majord...@vger.rutgers.edu

On Mon, 7 Aug 2000 Gerrit.Huize...@us.ibm.com wrote:
> > On Mon, 7 Aug 2000, Rik van Riel wrote:
> > The idea is that the memory_pressure variable indicates how
> > much page stealing is going on (on average) so every time
> > kswapd wakes up it knows how many pages to steal. That way
> > it should (if we're "lucky") free enough pages to get us
> > along until the next time kswapd wakes up.
>
> Seems like you could signal kswapd when either the page fault
> rate increases or the rate of (memory allocations / memory
> frees) hits a tuneable(?) ratio

We will. Each page steal and each allocation will increase the
memory_pressure variable, and because of that, also the inactive_target.

Whenever either
  - one zone gets low on free memory, *OR*
  - all zones get more or less low on free+inactive_clean pages, *OR*
  - we get low on inactive pages (inactive_shortage > inactive_target/2),
THEN kswapd gets woken up immediately. We do this both from the page
allocation code and from __find_page_nolock (which gets hit every time
we reclaim an inactive page back for its original purpose).

> > About NUMA scalability: we'll have different memory pools
> > per NUMA node. So if you have a 32-node, 64GB NUMA machine,
> > it'll partly function like 32 independent 2GB machines.
>
> One lesson we learned early on is that anything you can
> possibly do on a per-CPU basis helps both SMP and NUMA
> activity. This includes memory management, scheduling,
> TCP performance counters, any kind of system counters, etc.
> Once you have the basic SMP hierarchy in place, adding a NUMA
> hierarchy (or more than one for architectures that need it)
> is much easier.
>
> Also, is there a kswapd per pool? Or does one kswapd oversee
> all of the pools (in the NUMA world, that is)?

Currently we have none of this, but once 2.5 is forked off, I'll submit
a patch which shuffles all variables into per-node (per pgdat)
structures.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
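[A compact sketch of how the three wakeup conditions Rik lists might
combine; the per-zone fields, the watermark names and the helpers are
assumptions for illustration, not the eventual implementation.]

/* Illustrative only; field and helper names are assumptions. */
struct zone_info {
        unsigned long free_pages;
        unsigned long inactive_clean_pages;
        unsigned long pages_min;        /* "low on free memory" watermark */
        unsigned long pages_high;       /* free + inactive_clean target */
};

extern struct zone_info *first_zone(void);
extern struct zone_info *next_zone(struct zone_info *z);
extern unsigned long inactive_shortage(void);   /* pages below the inactive target */
extern unsigned long inactive_target;

static int kswapd_should_wake(void)
{
        struct zone_info *z;
        int all_zones_lowish = 1;

        for (z = first_zone(); z != NULL; z = next_zone(z)) {
                /* 1. one zone is low on free memory */
                if (z->free_pages < z->pages_min)
                        return 1;
                if (z->free_pages + z->inactive_clean_pages >= z->pages_high)
                        all_zones_lowish = 0;
        }

        /* 2. all zones are more or less low on free + inactive_clean pages */
        if (all_zones_lowish)
                return 1;

        /* 3. low on inactive pages overall */
        return inactive_shortage() > inactive_target / 2;
}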