[PATCH] Recent VM fiasco - fixed
Hi to all!

After I _finally_ got tired of the constantly worsening VM behaviour in the recent kernels, I thought I could spare a few hours this weekend just to see what's going on. I was quite surprised to see that the VM subsystem, while at its worst (at least in 2.3.x), is quite easily repairable even by unskilled people... I compiled and checked a few kernels back to 2.3.51, and found that new code was constantly being added just to make things worse.

Short history:

  2.3.51            - mostly OK, but reading from disk takes too much CPU (kswapd)
  2.3.99-pre1, 2    - as .51 + aggressive swap-out during writing
  2.3.99-pre3, 4, 5 - reading better
  2.3.99-pre5, 6    - both reading and writing take 100% CPU!!!

I also tried some pre7-x (forgot which one) but that one was f****d up beyond recognition (read: it was killing my processes, including X11, like mad every time I started writing to disk). Thus the patch that follows, which fixes all the above-mentioned problems, was made against pre6, sorry. I'll make another patch when pre7 gets out, if things are still not properly fixed.

BTW, this patch mostly *removes* cruft recently added, and returns to the known state of operation. Once that is achieved it is easy to selectively add back the good things I might have removed, and change behaviour as wanted, but I would like to urge people to test things thoroughly before releasing patches this close to 2.4.

Then again, I might have introduced bugs in this patch, too. :) But I *tried* to break it (spent some time doing that), and testing didn't reveal any bad behaviour.

Enjoy!

Patch
-- Zlatko
Re: [PATCH] Recent VM fiasco - fixed
On 8 May 2000, Zlatko Calusic wrote:

> BTW, this patch mostly *removes* cruft recently added, and
> returns to the known state of operation.

Which doesn't work.

Think of a 1GB machine which has a 16MB DMA zone, a 950MB normal zone and a very small HIGHMEM zone. With the old VM code the HIGHMEM zone would be swapping like mad while the other two zones are idle.

It's Not That Kind Of Party(tm)

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/    http://www.surriel.com/
Re: [PATCH] Recent VM fiasco - fixed
Rik van Riel <riel@conectiva.com.br> writes:

> On 8 May 2000, Zlatko Calusic wrote:
>
> > BTW, this patch mostly *removes* cruft recently added, and
> > returns to the known state of operation.
>
> Which doesn't work.
>
> Think of a 1GB machine which has a 16MB DMA zone,
> a 950MB normal zone and a very small HIGHMEM zone.
>
> With the old VM code the HIGHMEM zone would be
> swapping like mad while the other two zones are
> idle.
>
> It's Not That Kind Of Party(tm)

OK, I see now what you have in mind, and I'll try to test it when I get home (yes, late worker... my only connection to the Net :)). If only I could buy 1GB to test in the real setup. ;)

But still, optimizing for 1GB while at the same time completely killing performance and even *usability* for 99% of users doesn't look like a good solution, does it?

There were a lot of VM changes recently (>100K of patches) in which we went further and further away from the (IMHO) mostly stable code base, trying to fix zone balancing. Maybe it's time we try again, fresh from the "start"?

I'll admit I didn't understand most of the recent conversation about zone balancing on linux-mm. And I know it's because I haven't had much time lately to hack the kernel, unfortunately. But after a few hours spent dealing with the horrible VM that is in pre6, I'm not scared anymore. And I think that the solution to all our problems with zone balancing must be very simple. But it's probably hard to find, so it will need lots of modeling and testing. I don't think adding a few lines here and there all the time will take us anywhere.

Regards,
--
Zlatko
Re: [PATCH] Recent VM fiasco - fixed
On 8 May 2000, Zlatko Calusic wrote:
>
> But still, optimizing for 1GB while at the same time completely
> killing performance and even *usability* for 99% of users doesn't
> look like a good solution, does it?

Oh, definitely. I'll make a new pre7 that has a lot of the simplifications discussed here over the weekend, and that seems to work for me (tested both on a 512MB setup and a 64MB setup for some sanity). This pre7 almost certainly won't be all that perfect either, but it gives a better starting point.

> But after a few hours spent dealing with the horrible VM that is in
> pre6, I'm not scared anymore.

Good. This is really not scary stuff. Much of it is quite straightforward, and it is mainly just a matter of getting the right "feel". It's really easy to make mistakes here, but they tend to be mistakes that just make the system act badly, not the kind of _really_ scary mistakes (the ones that make it corrupt disks randomly ;)

Linus
Re: [PATCH] Recent VM fiasco - fixed
On 8 May 2000, Zlatko Calusic wrote:

> Rik van Riel <riel@conectiva.com.br> writes:
> > On 8 May 2000, Zlatko Calusic wrote:
> >
> > > BTW, this patch mostly *removes* cruft recently added, and
> > > returns to the known state of operation.
> >
> > Which doesn't work.
> >
> > Think of a 1GB machine which has a 16MB DMA zone,
> > a 950MB normal zone and a very small HIGHMEM zone.
> >
> > With the old VM code the HIGHMEM zone would be
> > swapping like mad while the other two zones are
> > idle.
> >
> > It's Not That Kind Of Party(tm)
>
> OK, I see now what you have in mind, and I'll try to test it when I
> get home (yes, late worker... my only connection to the Net :))
> If only I could buy 1GB to test in the real setup. ;)
>
> But still, optimizing for 1GB while at the same time completely
> killing performance and even *usability* for 99% of users doesn't
> look like a good solution, does it?

20MB and 24MB machines will be in the same situation, if that's of any help to you ;)

> But after a few hours spent dealing with the horrible VM that is
> in pre6, I'm not scared anymore. And I think that the solution
> to all our problems with zone balancing must be very simple.

It is. Linus is working on a conservative & simple solution while I'm trying a bit more "far-out" code (active and inactive lists a'la BSD, etc...). We should have at least one good VM subsystem within the next few weeks ;)

regards,

Rik
Re: [PATCH] Recent VM fiasco - fixed
Rik van Riel <riel@conectiva.com.br> writes:

> 20MB and 24MB machines will be in the same situation, if
> that's of any help to you ;)

Yes, you are right. And thanks for that tip (booting with mem=24m), because that will be my first test case later tonight.

> > But after a few hours spent dealing with the horrible VM that is
> > in pre6, I'm not scared anymore. And I think that the solution
> > to all our problems with zone balancing must be very simple.
>
> It is. Linus is working on a conservative & simple solution
> while I'm trying a bit more "far-out" code (active and inactive
> lists a'la BSD, etc...). We should have at least one good VM
> subsystem within the next few weeks ;)

Nice. I'm also in favour of some kind of active/inactive list solution (it looks promising), but that is probably 2.5.x stuff. I would be happy to see 2.4 out ASAP. Later, when it stabilizes, we will have lots of fun in 2.5, that's for sure.

Regards,
--
Zlatko
Re: [PATCH] Recent VM fiasco - fixed
On 8 May 2000, Zlatko Calusic wrote:

> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > But after a few hours spent dealing with the horrible VM that is
> > > in pre6, I'm not scared anymore. And I think that the solution
> > > to all our problems with zone balancing must be very simple.
> >
> > It is. Linus is working on a conservative & simple solution
> > while I'm trying a bit more "far-out" code (active and inactive
> > lists a'la BSD, etc...). We should have at least one good VM
> > subsystem within the next few weeks ;)
>
> Nice. I'm also in favour of some kind of active/inactive list
> solution (it looks promising), but that is probably 2.5.x stuff.

I have it booting (against pre7-4) and it seems almost stable ;) (with _low_ overhead)

> I would be happy to see 2.4 out ASAP. Later, when it stabilizes,
> we will have lots of fun in 2.5, that's for sure.

Of course, this has the highest priority.

regards,

Rik
Re: [PATCH] Recent VM fiasco - fixed
Rik,

That's astonishing, I'm sure, but think of us poor bastards who DON'T have an SMP machine with >1GB of RAM.

This is a P120 with 32MB. Lately, "fine" has degenerated into bad, into worse, into absolutely obscene. It even kills my PGSQL compiles. And I killed *EVERYTHING* there was to kill - the only processes left were init, bash and gcc/cc1. The VM still wiped it out.

d

On Mon, 8 May 2000, Rik van Riel wrote:

> On 8 May 2000, Zlatko Calusic wrote:
>
> > BTW, this patch mostly *removes* cruft recently added, and
> > returns to the known state of operation.
>
> Which doesn't work.
>
> Think of a 1GB machine which has a 16MB DMA zone,
> a 950MB normal zone and a very small HIGHMEM zone.
>
> With the old VM code the HIGHMEM zone would be
> swapping like mad while the other two zones are
> idle.
>
> It's Not That Kind Of Party(tm)
>
> cheers,
>
> Rik
Re: [PATCH] Recent VM fiasco - fixed
On Tue, 9 May 2000, Daniel Stone wrote:

> That's astonishing, I'm sure, but think of us poor bastards who
> DON'T have an SMP machine with >1GB of RAM.
>
> This is a P120 with 32MB.

The old zoned VM code will run that machine as efficiently as if it had 16MB of RAM. See my point now?

Rik
Re: [PATCH] Recent VM fiasco - fixed
Daniel Stone <tamriel@ductape.net> writes:

> That's astonishing, I'm sure, but think of us poor bastards who
> DON'T have an SMP machine with >1GB of RAM.

He has to care about us fortunate guys with e.g. 8GB of memory also. The recent kernels are broken for that, too.

Greetings
Christoph
Re: [PATCH] Recent VM fiasco - fixed
On 9 May 2000, Christoph Rohland wrote:

> Daniel Stone <tamriel@ductape.net> writes:
>
> > That's astonishing, I'm sure, but think of us poor bastards who
> > DON'T have an SMP machine with >1GB of RAM.
>
> He has to care about us fortunate guys with e.g. 8GB of memory also.
> The recent kernels are broken for that, too.

Try out the really recent one - pre7-8. So far it has some good reviews, and I've tested it both on a 20MB machine and a 512MB one..

Linus
Re: [PATCH] Recent VM fiasco - fixed
Linus Torvalds <torvalds@transmeta.com> writes:

> Try out the really recent one - pre7-8. So far it has some good
> reviews, and I've tested it both on a 20MB machine and a 512MB one..

Nope, it more or less locks up after the first attempt to swap something out. I can still run ls and free, but as soon as something touches /proc it locks up. Also, my test programs don't do anything any more.

I append the mem and task info from sysrq. The mem info seems not to change after the lockup.

Greetings
Christoph

Show Memory
Re: [PATCH] Recent VM fiasco - fixed
On 9 May 2000, Christoph Rohland wrote:

> Linus Torvalds <torvalds@transmeta.com> writes:
>
> > Try out the really recent one - pre7-8. So far it has some good
> > reviews, and I've tested it both on a 20MB machine and a 512MB one..
>
> Nope, it more or less locks up after the first attempt to swap
> something out. I can still run ls and free, but as soon as something
> touches /proc it locks up. Also, my test programs don't do anything
> any more.

This may be due to an unrelated bug with the task_lock() fixing (see the separate patch from Manfred for that one).

> I append the mem and task info from sysrq. The mem info seems not to
> change after the lockup.

I suspect that if you do right-alt + scrolllock, you'll see it looping on a spinlock. Which is why the memory info isn't changing ;)

But I'll double-check the shm code (I didn't test anything that did any shared memory, for example).

Linus
Re: [PATCH] Recent VM fiasco - fixed
Linus Torvalds <torvalds@transmeta.com> writes:

> On 9 May 2000, Christoph Rohland wrote:
>
> > Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > > Try out the really recent one - pre7-8. So far it has some good
> > > reviews, and I've tested it both on a 20MB machine and a 512MB one..
> >
> > I append the mem and task info from sysrq. The mem info seems not to
> > change after the lockup.
>
> I suspect that if you do right-alt + scrolllock, you'll see it looping on
> a spinlock. Which is why the memory info isn't changing ;)
>
> But I'll double-check the shm code (I didn't test anything that did any
> shared memory, for example).

Juan Quintela's patch fixes the lockup. shm paging locked up on the page lock.

Now I can give more data about pre7-8. After a short run I can say the following: the machine seems to be stable, but the VM is mainly unbalanced:

[root@ls3016 /root]# vmstat 5
   procs          memory              swap          io      system       cpu
 r  b  w   swpd    free   buff  cache   si    so    bi    bo   in    cs us  sy id
[...]
 9  3  0      0 1460016  1588  11284    0     0     0     0  109 23524  4  96  0
 9  3  1   7552  557432  1004  19320    0  1607     0   402  186 42582  2  89  9
11  1  1  41972  111368   424  53740    0  6884     2  1721  277 25904  0  89 10
11  1  0  48084   11896   276  59404    0  1133     1   284  181  4439  0  95  5
13  2  2  48352  466952   180  52960    5   158     4    39  230  6381  2  98  0
10  3  1  53400  934204   248  59940  498  1442   128   363  272  3953  1  99  0
11  3  1  52624  878696   300  59820  248    50    81    13  148   971  0 100  0
11  1  0   4556  883852   316  16164  855     0   214     1  127 25188  3  97  0
12  0  0   3936  525620   316  15544    0     0     0     0  109 33969  4  96  0
12  0  0   3936 2029556   316  15544    0     0     0     0  123 19659  4  96  0
11  1  0   3936  686856   316  15544    0     0     0     0  117 14370  3  97  0
12  0  0   3936  388176   320  15544    0     0     0     0  121  7477  3  97  0
10  3  1  47660    5216    88  19992    0  9353     0  2341  757  1267  0  97  3
VM: killing process ipctst
 6  6  1  36792  484880   152  26892   65 12307    21  3078 1619  2184  0  94  6
   procs          memory              swap          io      system       cpu
 r  b  w   swpd    free   buff  cache   si    so    bi    bo   in    cs us  sy id
10  1  1  39620   66736   148  29364    8   494     2   125  327  1980  0 100  0
VM: killing process ipctst
 9  2  1  46536  627356   116  31072   87  8675    23  2169 1784  1412  0  96  4
10  0  1  46664  617368   116  31200    0    26     0     6  258   112  0 100  0
10  0  1  47300  607184   116  31832    0   126     0    32  291   110  0 100  0

So we are swapping out with lots of free memory and killing random processes. The machine also becomes quite unresponsive compared to pre4 on the same tests.

Greetings
Christoph
--
Christoph Rohland        Tel: +49 6227 748201
SAP AG                   Fax: +49 6227 758201
LinuxLab                 Email: cr@sap.com
Re: [PATCH] Recent VM fiasco - fixed
Christoph Rohland <cr@sap.com> writes:

> Juan Quintela's patch fixes the lockup. shm paging locked up on the
> page lock.
>
> Now I can give more data about pre7-8. After a short run I can say
> the following: the machine seems to be stable, but the VM is mainly
> unbalanced:
>
> [root@ls3016 /root]# vmstat 5
>    procs          memory              swap          io      system       cpu
>  r  b  w   swpd    free   buff  cache   si    so    bi    bo   in    cs us  sy id
> [...]
>  9  3  0      0 1460016  1588  11284    0     0     0     0  109 23524  4  96  0
>  9  3  1   7552  557432  1004  19320    0  1607     0   402  186 42582  2  89  9
> 11  1  1  41972  111368   424  53740    0  6884     2  1721  277 25904  0  89 10
[ too many lines error, truncating... ]
>  9  2  1  46536  627356   116  31072   87  8675    23  2169 1784  1412  0  96  4
> 10  0  1  46664  617368   116  31200    0    26     0     6  258   112  0 100  0
> 10  0  1  47300  607184   116  31832    0   126     0    32  291   110  0 100  0
>
> So we are swapping out with lots of free memory and killing random
> processes. The machine also becomes quite unresponsive compared to
> pre4 on the same tests.

I'll second this! I checked pre7-8 briefly, but I/O & MM interaction is bad. Lots of swapping, lots of wasted CPU cycles and lots of dead writer processes (write(2): out of memory, while there is 100MB in the page cache).

Back to my patch and working on a solution for the 20-24MB & 1GB machines. Anybody with a spare 1GB of RAM to help development? :)
--
Zlatko
Re: [PATCH] Recent VM fiasco - fixed
>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes: Linus> Try out the really recent one - pre7-8. So far it hassome good Linus> reviews, and I've tested it both on a 20MB machine and a 512MB Linus> one.. pre7-8 still isn't completely fixed, but it is better than pre6. Try doing something like 'cp -a linux-2.3.99-pre7-8 foobar' and watching kswapd in top (or qps, el al). On my dual-proc box, kswapd still maxes out one of the cpus. Tar doesn't seem to show it, but bzcat can get an occasional segfault on large files. The filesystem, though, has 1k rather than 4k blocks. Yeah, just tested again on a fs w/ 4k blocks. kswapd only used 50% to 65% of a cpu, but that was an ide drive and the former was on a scsi drive.[1] OTOH, in pre6 X would hit (or at least report) 2^32-1 major faults after only a few hours of usage. That bug is gone in pre7-8. [1] asus p2b-ds mb using onboard adaptec scsi and piix ide; drives are all IBM ultrastars and deskstars. -JimC -- James H. Cloos, Jr. <URL:http://jhcloos.com/public_key> 1024D/ED7DAEA6 <cloos@jhcloos.com> E9E9 F828 61A4 6EA9 0F2B 63E7 997A 9F17 ED7D AEA6 Save Trees: Get E-Gold! <URL:http://jhcloos.com/go?e-gold> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/
Re: [PATCH] Recent VM fiasco - fixed
Ok, there's a pre7-9 out there, and the biggest change versus pre7-8 is actually how block fs dirty data is flushed out. Instead of just waking up kflushd and hoping for the best, we actually just write it out (and even wait on it, if absolutely required).

Which makes the whole process much more streamlined, and makes the numbers more repeatable. It also fixes the problem with dirty buffer cache data much more efficiently than the kflushd approach, and mmap002 is not a problem any more. At least for me.

[ I noticed that mmap002 finishes a whole lot faster if I never actually
  wait for the writes to complete, but that had some nasty behaviour under
  low memory circumstances, so it's not what pre7-9 actually does. I
  _suspect_ that I should start actually waiting for pages only when
  priority reaches 0 - comments welcomed, see fs/buffer.c and the
  sync_page_buffers() function ]

kswapd is still quite aggressive, and will show higher CPU time than before. This is a tweaking issue - I suspect it is too aggressive right now, but it needs more testing and feedback.

Just the dirty buffer handling made quite an enormous difference, so please do test this if you hated earlier pre7 kernels.

Linus
Re: [PATCH] Recent VM fiasco - fixed
Some more explanations of the differences between pre7-8 and pre7-9..

Basically pre7-9 survives mmap002 quite gracefully, and I think it does so for all the right reasons. It's not tuned for that load at all; it's just that mmap002 was really good at showing two weak points of the mm layer:

 - try_to_free_pages() could actually return success without freeing a
   single page (just moving pages around to the swap cache). This was
   bad, because it could cause us to get into a situation where we
   "successfully" freed pages without ever adding any to the free list.
   Which would, for all the obvious reasons, cause problems later when
   we couldn't allocate a page after all..

 - The "sync_page_buffers()" thing to sync pages directly to disk
   rather than wait for bdflush to do it for us (and have people run
   out of memory before bdflush got around to the right pages). Sadly,
   as it was set up, try_to_free_buffers() doesn't even get the
   "urgency" flag, so right now it doesn't know whether it should wait
   for previous write-outs or not. So it always does, even though for
   non-critical allocations it should just ignore locked buffers.

Fixing these things suddenly made mmap002 behave quite well. I'll make the change to pass the priority in to sync_page_buffers() so that I get the increased performance from not waiting when I don't have to, but it starts to look like pre7 is getting into shape.

Linus

On Wed, 10 May 2000, Linus Torvalds wrote:
>
> Ok, there's a pre7-9 out there, and the biggest change versus pre7-8 is
> actually how block fs dirty data is flushed out. Instead of just waking
> up kflushd and hoping for the best, we actually just write it out (and
> even wait on it, if absolutely required).
>
> [...]
>
> Just the dirty buffer handling made quite an enormous difference, so
> please do test this if you hated earlier pre7 kernels.
>
> Linus
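To make the "urgency" point concrete, here is a minimal sketch of a priority-aware sync_page_buffers() - an illustration only, not the actual fs/buffer.c code. It assumes the 2.3-era buffer helpers (buffer_locked(), buffer_dirty(), wait_on_buffer(), ll_rw_block()) and a hypothetical priority parameter, where 0 means the caller is desperate and may sleep:

    /*
     * Hedged sketch, not real kernel code: start write-out on dirty
     * buffers, and only wait on locked (in-flight) buffers when the
     * caller is the most urgent one (priority 0).
     */
    static int sync_page_buffers(struct buffer_head *bh, int priority)
    {
            struct buffer_head *tmp = bh;
            int busy = 0;

            do {
                    struct buffer_head *p = tmp;
                    tmp = tmp->b_this_page;   /* buffers on a page form a ring */
                    if (buffer_locked(p)) {
                            /* write-out in flight: only priority-0
                             * (critical) callers wait for it */
                            if (priority > 0)
                                    busy = 1;
                            else
                                    wait_on_buffer(p);
                    } else if (buffer_dirty(p)) {
                            ll_rw_block(WRITE, 1, &p);  /* async write-out */
                            busy = 1;
                    }
            } while (tmp != bh);

            return !busy;   /* 1 only if all buffers were already clean */
    }

The design point is simply that only the most urgent reclaim waits on in-flight I/O; everyone else starts the write-out and moves on to another page.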
[patch] balanced highmem subsystem under pre7-9
IMO high memory should not be balanced. Stock pre7-9 tried to balance high memory once it got below the threshold (causing very bad VM behavior and high kswapd usage) - this is incorrect because there is nothing special about the highmem zone; it's more like an 'extension' of the normal zone, from which specific caches can draw. (patch attached)

Another problem is that even during a mild test the DMA zone gets emptied easily - but on a big-RAM box kswapd has to work _a lot_ to fill it up. In fact on an 8GB box it's completely futile to fill up the DMA zone. What worked for me is this zone-chainlist trick in the zone setup code:

	case ZONE_NORMAL:
		zone = pgdat->node_zones + ZONE_NORMAL;
		if (zone->size)
			zonelist->zones[j++] = zone;
++			break;
	case ZONE_DMA:
		zone = pgdat->node_zones + ZONE_DMA;
		if (zone->size)
			zonelist->zones[j++] = zone;

No 'normal' allocation chain leads to the ZONE_DMA zone, except GFP_DMA and GFP_ATOMIC - both of them rightfully access the DMA zone. This is a RL problem: without the above, an 8GB box under load crashes pretty quickly due to failed SCSI-layer DMA allocations. (I think those allocations are silly in the first place.)

The above is suboptimal on boxes whose total RAM is within one order of magnitude of 16MB (the DMA zone stays empty most of the time and is inaccessible to various caches) - so maybe the following (not yet implemented) solution would be generic and acceptable: allocate 5% of total RAM or 16MB to the DMA zone (via fixing up zone sizes on bootup), whichever is smaller, in 2MB increments.

Disadvantage of this method: e.g. it wastes 2MB of RAM on an 8MB box. We could probably live with 64kB increments (there are 64kB ISA DMA constraints the sound drivers and some SCSI drivers are hitting) - is this really true? If nobody objects I'll implement this later on (together with the asymmetric allocation chain trick) - there will be a 64kB DMA pool allocated on the smallest boxes, which should be acceptable even on a 4MB box.

We could turn off the DMA zone altogether on most boxes, if it wasn't for the SCSI layer allocating DMA pages even for PCI drivers...

Comments?

Ingo
--- linux/mm/page_alloc.c.orig	Thu May 11 02:10:34 2000
+++ linux/mm/page_alloc.c	Thu May 11 16:03:48 2000
@@ -553,9 +566,14 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
-		zone->pages_low = mask*2;
-		zone->pages_high = mask*3;
+		if (j == ZONE_HIGHMEM) {
+			zone->pages_low = zone->pages_high =
+				zone->pages_min = 0;
+		} else {
+			zone->pages_min = mask;
+			zone->pages_low = mask*2;
+			zone->pages_high = mask*3;
+		}
 		zone->low_on_memory = 0;
 		zone->zone_wake_kswapd = 0;
 		zone->zone_mem_map = mem_map + offset;
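For illustration, the sizing rule Ingo proposes above (min(5% of RAM, 16MB), in 2MB increments) works out as in this sketch - a hypothetical helper, since the proposal was never implemented; the name and the round-up direction are assumptions:

    /*
     * Hypothetical helper for the proposed DMA-zone sizing rule:
     * reserve min(5% of total RAM, 16MB), rounded up to 2MB.
     */
    static unsigned long dma_zone_pages(unsigned long total_pages)
    {
            unsigned long five_pct = total_pages / 20;            /* 5% of RAM */
            unsigned long cap      = (16UL << 20) >> PAGE_SHIFT;  /* 16MB in pages */
            unsigned long grain    = (2UL << 20) >> PAGE_SHIFT;   /*  2MB in pages */
            unsigned long size     = five_pct < cap ? five_pct : cap;

            /* round up - which is why an 8MB box still "wastes" 2MB */
            return ((size + grain - 1) / grain) * grain;
    }

With 4kB pages this gives the numbers in the email: an 8MB box has 5% = ~400kB, rounded up to a full 2MB increment, matching the stated disadvantage.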
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 12 May 2000, Ingo Molnar wrote:

> IMO high memory should not be balanced. Stock pre7-9 tried to balance
> high memory once it got below the threshold (causing very bad VM
> behavior and high kswapd usage) - this is incorrect because there is
> nothing special about the highmem zone; it's more like an 'extension'
> of the normal zone, from which specific caches can draw. (patch
> attached)

IMHO that is a hack to work around the currently broken design of the MM. And it will also produce bad effects, since you won't age and recycle the cache in the highmem zone correctly. Without the classzone design you will always have kswapd and the page allocator shrinking memory even when it's not necessary. Please check as reference the very detailed explanation I posted around two weeks ago on linux-mm in reply to Linus.

What you're trying to work around on the highmem part is exactly the same problem you also have between the normal zone and the DMA zone. Why don't you also just keep 3MB always free in the DMA zone and never shrink the normal zone?

Andrea
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 12 May 2000, Andrea Arcangeli wrote:

> On Fri, 12 May 2000, Ingo Molnar wrote:
>
> > IMO high memory should not be balanced. Stock pre7-9 tried to
> > balance high memory once it got below the threshold (causing very
> > bad VM behavior and high kswapd usage) - this is incorrect because
> > there is nothing special about the highmem zone; it's more like an
> > 'extension' of the normal zone, from which specific caches can
> > draw. (patch attached)
>
> IMHO that is a hack to work around the currently broken design of the
> MM. And it will also produce bad effects, since you won't age and
> recycle the cache in the highmem zone correctly.

What bad effects? The LRU list of the pagecache is a completely independent mechanism. Highmem pages are LRU-freed just as effectively as normal pages. The pagecache LRU list is not per-zone but (IMHO correctly) global, so the particular zone of highmem pages is completely transparent and irrelevant to the LRU mechanism. I cannot see any bad effects wrt. LRU recycling and the highmem zone here. (Let me know if you meant some different recycling mechanism.)

> What you're trying to work around on the highmem part is exactly the
> same problem you also have between the normal zone and the DMA zone.
> Why don't you also just keep 3MB always free in the DMA zone and
> never shrink the normal zone?

I'm not working around anything. Highmem _should not be balanced_, period. It's a superset of normal memory, and by just balancing normal memory (and adding the highmem free count to the total) we are completely fine. Highmem is also a temporary phenomenon; it will probably disappear in a few years once 64-bit systems and proper 64-bit DMA become commonplace. (And small devices will do 32-bit + 32-bit DMA.)

'Balanced' means: 'keep X amount of highmem free'. What is your point in keeping free highmem around?

The DMA zone resizing suggestion from yesterday is, I believe, conceptually correct as well - we _want to_ isolate normal allocators from these 'emergency pools'. IRQ handlers cannot wait for more free RAM.

About classzone: this was the initial idea of how to do balancing when the zoned allocator was implemented (along with per-zone kswapd threads or per-zone queues), but it just gets too complex IMHO. Why don't you give the simpler suggestion from yesterday a thought? We have essentially only one zone which has to be balanced, ZONE_NORMAL. ZONE_DMA is and should become special, because it also serves as an atomic pool for IRQ allocations. (ZONE_HIGHMEM is special and uninteresting as far as memory balance goes, as explained above.) So we only have ZONE_NORMAL to worry about. Zone-chains are perfect ways of defining fallback routes.

I've had a nicely balanced (heavily loaded) 8GB box for the past couple of weeks, just by making (yesterday's) slight trivial changes to the zone-chains and watermarks. The default settings in the stock kernel were not tuned, but all the mechanism is there. LRU is working, there was always DMA RAM around, no classzones necessary here. So what exactly is the case you are trying to balance?

Ingo
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 12 May 2000, Ingo Molnar wrote:

> What bad effects? The LRU list of the pagecache is a completely
> independent mechanism. Highmem pages are LRU-freed just as
> effectively as normal pages. The pagecache LRU list is not per-zone
> but (IMHO correctly) global, so the particular zone of highmem pages
> is completely transparent

It shouldn't be global but per-NUMA-node, as I have in the classzone patch.

> and irrelevant to the LRU mechanism. I cannot see any bad effects
> wrt. LRU recycling and the highmem zone here. (Let me know if you
> meant some different recycling mechanism.)

See line 320 of filemap.c in 2.3.99-pre7-pre9. (Ignore the fact that it will recycle 1 page; it's just because they didn't expect pages_high to be zero.)

> 'Balanced' means: 'keep X amount of highmem free'. What is your point
> in keeping free highmem around?

Assuming there is no point, you still want to free also from the highmem zone while doing LRU aging of the cache. And if you don't keep X amount of highmem free, you'll break if an irq does a GFP_HIGHMEM allocation.

Note also that by highmem I don't mean the memory between 1GB and 64GB, but the memory between 0 and 64GB. When you allocate with GFP_HIGHUSER you ask the MM for a page between 0 and 64GB.

And in turn, what is the point of keeping X amount of normal/regular memory free? You just try to keep such an X amount of memory free in the DMA zone, so why do you also try to keep it free in the normal zone? The problem is the same.

Please read my emails on linux-mm of a few weeks ago about the classzone approach. I can forward them to linux-kernel if there is interest (I don't know if there's a web archive, but I guess there is).

If the current strict zone approach weren't broken, we could just as well choose to split ZONE_HIGHMEM into 10/20 zones to scale 10/20 times better during allocations, no? Is this argument enough to at least ring a bell that the current design is flawed? The flaw is that we pay for it with drawbacks and with a VM that does the wrong thing because it has no information beyond a little part of the picture. You can't fix it without looking at the whole picture (the classzone).

Andrea
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 12 May 2000, Andrea Arcangeli wrote:

> On Fri, 12 May 2000, Ingo Molnar wrote:
>
> > What bad effects? The LRU list of the pagecache is a completely
> > independent mechanism. Highmem pages are LRU-freed just as
> > effectively as normal pages. The pagecache LRU list is not per-zone
> > but (IMHO correctly) global, so the particular zone of highmem
> > pages is completely transparent
>
> It shouldn't be global but per-NUMA-node, as I have in the classzone
> patch.

*nod* This change is in my source tree too (but the active/inactive page list thing doesn't work yet).

> > and irrelevant to the LRU mechanism. I cannot see any bad effects
> > wrt. LRU recycling and the highmem zone here. (Let me know if you
> > meant some different recycling mechanism.)
>
> See line 320 of filemap.c in 2.3.99-pre7-pre9. (Ignore the fact that
> it will recycle 1 page; it's just because they didn't expect
> pages_high to be zero.)

Indeed, pages_high for the highmem zone probably shouldn't be zero. pages_min and pages_low: 0; pages_high: 128??? (free up to 512kB of high memory)

> > 'Balanced' means: 'keep X amount of highmem free'. What is your
> > point in keeping free highmem around?
>
> Assuming there is no point, you still want to free also from the
> highmem zone while doing LRU aging of the cache.

True, but this just involves setting the watermarks right. The current code supports the balancing just fine.

> And if you don't keep X amount of highmem free, you'll break if
> an irq does a GFP_HIGHMEM allocation.

GFP_HIGHMEM will automatically fall back to the NORMAL zone. There's no problem here.

> Note also that by highmem I don't mean the memory between 1GB
> and 64GB, but the memory between 0 and 64GB.

Why do you keep insisting on meaning other things with words than what everybody else means with them? ;)

> Please read my emails on linux-mm of a few weeks ago about the
> classzone approach.

I've read them, and it's overly complex and doesn't make much sense for what we need.

> I can forward them to linux-kernel if there is interest (I don't
> know if there's a web archive, but I guess there is).

http://mail.nl.linux.org/linux-mm/
http://www.linux.eu.org/Linux-MM/

> If the current strict zone approach weren't broken, we could just as
> well choose to split ZONE_HIGHMEM into 10/20 zones to scale 10/20
> times better during allocations, no?

This would work just fine, except for the fact that we have only one pagecache_lock... maybe we want to have multiple pagecache_locks based on a hash of the inode number? ;)

> Is this argument enough to at least ring a bell that the current
> design is flawed?

But we *can* split the HIGHMEM zone into a bunch of smaller ones without affecting performance. Just set zone->pages_min and zone->pages_low to 0 and zone->pages_high to some smallish value. Then we can teach the allocator to skip a zone if:

 1) it has no obscenely large amount of free pages
 2) the zone is locked by somebody else (TryLock(zone->lock))

This will work just fine with the current code (plus these two minor tweaks). No big changes are needed to support this idea.

regards,

Rik
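A minimal sketch of that zone-skipping rule, assuming the 2.3-era types (zone_t, and zonelist_t with a NULL-terminated zones[] array) and spin_trylock(); the helper itself and the exact surplus test are illustrative, not code from any tree:

    /*
     * Illustrative only: prefer zones with a clear surplus of free
     * pages whose lock can be taken without spinning.
     */
    static zone_t *pick_uncontended_zone(zonelist_t *zonelist)
    {
            zone_t *zone, **zp;

            for (zp = zonelist->zones; (zone = *zp) != NULL; zp++) {
                    if (zone->free_pages <= zone->pages_high)
                            continue;   /* 1) no obscene surplus here */
                    if (!spin_trylock(&zone->lock))
                            continue;   /* 2) zone busy - avoid contention */
                    return zone;        /* caller allocates, then unlocks */
            }
            return NULL;                /* fall back to the regular slow path */
    }

The trylock is the point: on big SMP boxes a busy zone is simply skipped instead of becoming a lock-contention hot spot.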
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 12 May 2000, Rik van Riel wrote:

> But we *can* split the HIGHMEM zone into a bunch of smaller
> ones without affecting performance. Just set zone->pages_min
> and zone->pages_low to 0 and zone->pages_high to some smallish
> value. Then we can teach the allocator to skip a zone if:
>
>  1) it has no obscenely large amount of free pages
>  2) the zone is locked by somebody else (TryLock(zone->lock))

What's the point of this split-up? (I suspect there is a point, I just cannot see it now. Thanks.)

Ingo
Re: [patch] balanced highmem subsystem under pre7-9
[ sorry for the late reply ]

On Fri, 12 May 2000, Ingo Molnar wrote:

> On Fri, 12 May 2000, Rik van Riel wrote:
>
> > But we *can* split the HIGHMEM zone into a bunch of smaller
> > ones without affecting performance. Just set zone->pages_min
> > and zone->pages_low to 0 and zone->pages_high to some smallish
> > value. Then we can teach the allocator to skip a zone if:
> >
> >  1) it has no obscenely large amount of free pages
> >  2) the zone is locked by somebody else (TryLock(zone->lock))
>
> What's the point of this split-up? (I suspect there is a point, I
> just cannot see it now. Thanks.)

I quote an email from Rik of 25 Apr 2000 23:10:56 on linux-mm:

-- Message-ID: <Pine.LNX.4.21.0004252240280.14340-100000@duckman.conectiva> --

  We can do this just fine. Splitting a box into a dozen more zones
  than what we have currently should work just fine, except for (as
  you say) higher CPU use by kswapd. If I get my balancing patch
  right, most of that disadvantage should be gone as well.

  Maybe we *do* want to do this on bigger SMP boxes so each processor
  can start out with a separate zone and check the other zones later
  to avoid lock contention?

--------------------------------------------------------------

I still strongly think that the current strict-zone memory balancing design is very broken (and I also think I'm right, since I believe I see the whole picture), but I don't think I can explain my arguments better and/or more extensively than I already did in linux-mm some weeks ago. If you see anything wrong in my reasoning, please let me know. The interesting thread was "Re: 2.3.x mem balancing" (the start was off-list) in linux-mm.

Andrea
Re: [patch] balanced highmem subsystem under pre7-9
On Thu, 18 May 2000, Andrea Arcangeli wrote:

> I still strongly think that the current strict-zone memory
> balancing design is very broken (and I also think I'm right,
> since I believe I see the whole picture), but I don't think I
> can explain my arguments better and/or more extensively than I
> already did in linux-mm some weeks ago.

The balancing as of pre9-2 works like this:

 - there is one LRU list per pgdat

 - kswapd runs and makes sure every zone has more than
   zone->pages_low free pages; after that it stops

 - kswapd frees up to zone->pages_high pages, depending on which
   pages we encounter in the LRU queue; this makes sure that the
   zone with the most least-recently-used pages ends up with more
   free pages

 - __alloc_pages() allocates all pages down to zone->pages_low on
   every zone before waking up kswapd; this makes sure more pages
   are used from the least loaded zone than from more loaded zones,
   which makes balancing between zones happen

I'm curious what would be so "very broken" about this? AFAICS it does most of what the classzone patch would achieve, at lower complexity and with better readability.

regards,

Rik
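A rough sketch of the allocation-side rule from that list (the watermark fields, the NULL-terminated zonelist and the internal rmqueue() buddy-removal step match the 2.3 allocator; the function itself and its control flow are simplified assumptions, not the actual pre9-2 code):

    /*
     * Simplified sketch: take pages only from zones still above
     * pages_low; once every zone in the fallback chain has been
     * drained to that watermark, wake kswapd to rebalance.
     */
    static struct page *alloc_pages_sketch(zonelist_t *zonelist, unsigned long order)
    {
            zone_t *zone, **zp;

            for (zp = zonelist->zones; (zone = *zp) != NULL; zp++) {
                    if (zone->free_pages > zone->pages_low) {
                            struct page *page = rmqueue(zone, order);
                            if (page)
                                    return page;
                    }
            }

            /* every zone is at or below pages_low: time to balance */
            wake_up_interruptible(&kswapd_wait);
            return NULL;    /* the real allocator retries rather than failing here */
    }

This is what makes the balancing emergent: allocations naturally drain the emptiest (least loaded) zone first, and kswapd only gets involved once the whole chain is at its watermark.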
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 19 May 2000, Rik van Riel wrote:

> I'm curious what would be so "very broken" about this?

You start eating from ZONE_DMA before you have emptied ZONE_NORMAL.

> AFAICS it does most of what the classzone patch would achieve,
> at lower complexity and with better readability.

I disagree.

Andrea
Re: [patch] balanced highmem subsystem under pre7-9
On Fri, 19 May 2000, Andrea Arcangeli wrote:

> On Fri, 19 May 2000, Rik van Riel wrote:
>
> > I'm curious what would be so "very broken" about this?
>
> You start eating from ZONE_DMA before you have emptied ZONE_NORMAL.

THIS IS NOT A BUG!

It's a feature. I don't see why you insist on calling this a problem.

We do NOT keep free memory around just for DMA allocations. We fundamentally keep free memory around because the buddy allocator (_any_ allocator, in fact) needs some slop in order to do a reasonable job at allocating contiguous page regions, for example. We keep free memory around because that way we have a "buffer" to allocate from atomically, so that when network traffic occurs or there is other behaviour that requires memory without being able to free it on the spot, we have memory to give.

Keeping only DMA memory around would be =bad=. It would mean, for example, that when a new packet comes in on the network, it would always be allocated from the DMA region, because the normal zone hasn't even been balanced ("why balance it when we still have DMA memory?"). And that would be a huge mistake, because it would mean, for example, that by selecting the right allocation patterns and by opening sockets without reading the data they receive, somebody could force all of DMA memory to be used up by network allocations that would never be freed.

In short, your very fundamental premise is BROKEN, Andrea. We want to keep normal memory around, even if there is low memory available. The same is true of high memory, for similar reasons.

Face it. The original zone-only code had problems. One of the worst problems was that it would try to free up a lot of "normal" memory if it got low on DMA memory. Those problems have pretty much been fixed, and they had _nothing_ to do with your "class" patches. They were bugs, plain and simple, not design mistakes.

If you think you should have zero free normal pages, YOU have a design mistake. We should not be that black-and-white. The whole point of having the min/low/max stuff is to make memory allocation less susceptible to border conditions, and to turn a black-and-white situation into more of a "levels of gray" situation.

Linus