From: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
Subject: 2.2.0 Bug summary
Date: 1998/12/29
Message-ID: <fa.m25408v.141ks8r@ifi.uio.no>#1/1
X-Deja-AN: 426631192
Original-Date: Tue, 29 Dec 1998 01:46:20 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <199812290146.BAA12687@terrorserver.swansea.linux.org.uk>
To: linux-ker...@vger.rutgers.edu, torva...@transmeta.com
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Ok this is the collated 'bad bug set', also the -ac diffs divided up into
the relevant sections

Unfixed and definitely needing fixes

o  TCP slow performance problem is still not merged from DaveM
o  Run two processes that keep rejoining multicast groups on an SMP
   box - crash
o  Spam the box remotely with syn floods and other crap, it leaks memory
o  select/poll magically break at some number of handles without an error
o  Procfs has locking errors on mm's
o  IDE probe often guesses wrong. Linux is impossible to install on these
   ranges of PCs. Needs fixing badly.
o  isdn4linux is old, not the CVS version. Basically unusable. If it's not
   changing for 2.2.0 it should be commented out or deleted
o  eata-dma driver crashes the machine if at any instant it can't grab
   atomic isa dma memory. (Possible fix: mark it obsolete and use eata.c
   which works fine)
o  Video4linux bttv tends to crash machines grabbing - fix around needs
   merging and the driver updating
o  You can't mount an ext2fs cdrom. (Block size error). Works in 2.0
o  generic_file_mmap and MSDOS/UMSDOS disagree over who clears blocks
o  bootp autobooting stuff corrupts other hosts' arp stuff it seems
o  DaveM reports a pile of VMA operations done without locks held.
o  IDE defaults to multimode on causing serial problems, corruption with
   some drives, and hangs on boot with others.
o  Dual 486 boards won't boot SMP kernels
o  Tulip driver/fast routing stuff needs to be resolved. If they can't be
   merged the default tulip should be a current one.

Unfixed but not vital

o  NFS client over tcp doesn't work
o  NFS readahead is too low
o  NFS performance to 8K page sized BSD boxes sucks rocks, 2.0.x is about
   5 times faster
o  Linus VM is still 20% slower than sct vm on an 8MB machine
   [benchmarks kernel build and netscape]
o  fchmod on AF_UNIX sockets doesn't work like BSD
o  IPv6 calls set_multicast_list in the wrong context
o  TCP fails to handle small SO_SNDBUF/RCVBUF settings
o  Make xconfig needs layout fixes
o  Need to review all CONFIG_EXPERIMENTAL tags

Fixed in -ac patches

For Linus:

o  AVL tree vm avoids bad performance problems
o  MediaGX crashes on boot
o  Certain numbers of scsi disks don't seem to work
o  VFS clears setuid/gid flags wrongly on directories
o  COSA credited twice
o  string.h egcs fixes
o  Some further time fixes
o  Various time fixes submitted
o  KNFSD patches. With them knfsd seems to work ok. With the current
   tree it doesn't work at all. Probably this is "Experimental for 2.2"
o  AMD stepping ident, K6 ident
o  What the hell is going on in time.c, on a low memory box picking 586
   gives better performance for a 486 and several other chips without
   TSC registers. That patch piece is a bad way to save 1K
o  Various config combinations don't build
o  FTAPE doesn't work in .132/2.2.0pre
o  Various of the time_* changes to net/* are one out
o  Ted's last serial patch is missing (setserial crashes box)
o  IBMMCA doesn't work on the model 77 internal scsi
o  Trond's last NFS fix
o  include/linux/sysctl.h is exposed to user tasks even with glibc, but
   isn't strictly ANSI compliant
o  SYS5 shm debugging slows stuff down measurably - ifdef it
o  DVD's trip an isofs sanity check wrongly

Unsure:

o  Large file array support (will be required by vendors for several big
   name products). This is a tricky one. I'm wearing too many hats to
   judge this objectively. Vendors will probably ship this anyway or
   something similar.

Linus doesn't want:

o  QlogicFC - no big problem, it's separate, it's clean and vendors can
   ship it and other driver addons easily as they do now. It's a
   no-brainer to install off the net.

Stale ?:

o  ADFS updates
o  Load unversioned modules into versioned kernels when doing
   request_module etc.
o  Crashes and zero page scribbles using ptrace.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: 2.2.0 Bug summary
Date: 1999/01/01
Message-ID: <fa.iqgpdmv.1d0sp15@ifi.uio.no>
X-Deja-AN: 427612386
Original-Date: Thu, 31 Dec 1998 19:00:18 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.981231182534.658A-100000@laser.bogus>
References: <fa.m25408v.141ks8r@ifi.uio.no>
To: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 29 Dec 1998, Alan Cox wrote:

> o  Linus VM is still 20% slower than sct vm on an 8Mb machine
>    [benchmarks kernel build and netscape]

Today I started playing with Linus's VM in 2.2.0-pre1. I changed the
semantics of many things and added a heuristic so that one process
thrashing memory does not hang other "normal" processes.

The new VM I developed today is _far_ better than sct's ac11 VM and
anything I tried before. I would like somebody to try it on low memory
machines too and report what happens there. I don't have enough spare
time to test it on many kinds of hardware myself.

The same benchmark, which dirties 160 MB of virtual memory (run with
128 MB of RAM and 72 MB of swap), took 106 sec on clean 2.2.0-pre1 and
now runs in 90 sec. But this is not the most important thing. The good
point is that the cache/buffer/swap levels are now perfectly stable, and
all other processes run fine and do not get pushed out of the cache even
if there's a memory thrasher running at the same time.

Comments?

Ah, the shrink_mmap limit was wrong since we account only for
unreferenced pages.

Patch against 2.2.0-pre1: Patch

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.iqghe6v.1d08ph7@ifi.uio.no>#1/1
X-Deja-AN: 427617980
Original-Date: Thu, 31 Dec 1998 19:34:40 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.981231193257.330B-100000@laser.bogus>
References: <fa.iqgpdmv.1d0sp15@ifi.uio.no>
To: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 31 Dec 1998, Andrea Arcangeli wrote:

> Comments?
>
> Ah, the shrink_mmap limit was wrong since we account only not referenced
> pages.
>
> Patch against 2.2.0-pre1:

Whoops, in the last email I forgot to change the subject a bit (adding
[patch]) and to remove this printk:

Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.1.2.43 linux/mm/vmscan.c:1.1.1.1.2.45
--- linux/mm/vmscan.c:1.1.1.1.2.43	Thu Dec 31 17:56:27 1998
+++ linux/mm/vmscan.c	Thu Dec 31 19:41:06 1998
@@ -449,11 +449,7 @@
 		case 0:
 			/* swap_out() failed to swapout */
 			if (shrink_mmap(priority, gfp_mask))
-			{
-				printk("swapout 0 shrink 1\n");
 				return 1;
-			}
-			printk("swapout 0 shrink 0\n");
 			return 0;
 		case 1:
 			/* this would be the best but should not happen right now */

Andrea Arcangeli

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.iqv61tv.16gg7bi@ifi.uio.no>#1/1
X-Deja-AN: 427757351
Original-Date: Fri, 1 Jan 1999 17:44:55 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990101171008.1145B-100000@laser.bogus>
References: <fa.iqghe6v.1d08ph7@ifi.uio.no>
To: Benjamin Redelings I <brede...@ucsd.edu>, "Stephen C. Tweedie" <s...@redhat.com>, Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I'll try to comment on my latest VM patch.

The patch basically does two things. It adds a heuristic to block
thrashing tasks in try_to_free_pages() and allow normal tasks to run
fine in the meantime. And it returns to the old do_try_to_free_pages()
way of doing things. I think the reason the old way was no longer
working well is that we were using swap_out() like the other freeing
methods, while swapout really has nothing to do with them. To get VM
stability under low memory we must use both swap_out() (which moves
pages from the user process's virtual memory to the swap cache) and
shrink_mmap(), combined in a new method.

My new method puts user pages in the swap cache because there we can
handle aging very well. Then shrink_mmap() can free an unreferenced page
to make real progress in freeing memory (and not only in the swapout).

So basically my patch surely causes the system to swap out more than it
used to, but most of the time we will not need a swapin to put the pages
back into the process's virtual memory.

Somebody reported a big slowdown of the thrashing application. Right now
I don't know which bit of the patch caused this slowdown (yesterday my
benchmark here didn't show it). My new trashing_memory heuristic will
probably decrease performance for the thrashing application (but hey,
you know that if you need performance you can always buy more RAM ;),
but it will improve performance a lot for normal non-thrashing tasks.

I'll try to change do_free_user_and_cache() to see if I can achieve
something better.

I also changed swap_out(), since I think the best way to choose a
process is to compare the raw RSS. And I don't want swap_cnt to be
decreased by something every time something is swapped out. I want the
kernel to continue passing through all the pages of one process once it
has started playing with it (if it still exists, of course ;).

I also changed the pressure of swap_out(), since it makes no sense to me
to pass more than once over the VM of all tasks in the system. Now at
priority 6 swap_out() tries to swap out something from at most
nr_tasks/7 tasks (with a lower bound of 1 task).

I also changed the pressure of shrink_mmap(), because it made no sense
to me to do two passes over pages that are already unreferenced.

I also changed swap_out() to allow it to return 0, 1 or more:

0  means that swap_out() has not been able to put anything in the swap
   cache.
1  means that swap_out() has been able to swap out something and has
   also freed up one page (how??? it can't right now, because the page
   should always still be present at least in the swap cache).
2  means that swap_out() has swapped out 1 page and that the page is
   still referenced somewhere (probably by the swap cache).

So in case 2 and case 0 we must use shrink_mmap() to make real progress
in freeing pages. This is the idea that my new do_free_user_and_cache()
follows.

Comments?
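
The return-code convention described in the message above can be illustrated with a small user-space sketch. This is not the patch itself (the patch is not part of this archive); `struct toy_vm`, `toy_swap_out()`, `toy_shrink_mmap()` and `toy_free_user_and_cache()` are hypothetical stand-ins, and the only thing taken from the message is the rule that results 0 and 2 fall through to the cache-shrinking step while result 1 counts as progress on its own.

```c
#include <assert.h>

/* Toy state; every name in this sketch is illustrative, not kernel code. */
struct toy_vm {
    int swappable_pages;  /* user pages that could move to the swap cache */
    int cache_pages;      /* unreferenced pages sitting in the cache */
    int free_pages;       /* pages that are really free */
};

/* Stand-in for swap_out(): returns 0 or 2 as in the description above.
 * It never returns 1, matching the remark that a swapped-out page is
 * still held by the swap cache. */
static int toy_swap_out(struct toy_vm *vm)
{
    assert(vm);
    if (vm->swappable_pages == 0)
        return 0;             /* nothing could be put in the swap cache */
    vm->swappable_pages--;
    vm->cache_pages++;        /* the page now lives in the swap cache */
    return 2;                 /* swapped out, but not freed yet */
}

/* Stand-in for shrink_mmap(): frees one unreferenced cache page. */
static int toy_shrink_mmap(struct toy_vm *vm)
{
    assert(vm);
    if (vm->cache_pages == 0)
        return 0;
    vm->cache_pages--;
    vm->free_pages++;
    return 1;
}

/* Returns 1 if a page was really freed, 0 otherwise. */
static int toy_free_user_and_cache(struct toy_vm *vm)
{
    switch (toy_swap_out(vm)) {
    case 1:
        return 1;             /* swap_out() freed a page by itself */
    case 0:
    case 2:
    default:
        return toy_shrink_mmap(vm);  /* real progress comes from here */
    }
}
```

Calling `toy_free_user_and_cache()` repeatedly on a `struct toy_vm` with two swappable pages frees two pages and then reports 0, since by then neither step has anything left to work on.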

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.ing1dev.1e00p93@ifi.uio.no>
X-Deja-AN: 427806048
Original-Date: Fri, 1 Jan 1999 21:02:29 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990101203728.301B-100000@laser.bogus>
References: <fa.iqv61tv.16gg7bi@ifi.uio.no>
To: Benjamin Redelings I <brede...@ucsd.edu>, "Stephen C. Tweedie" <s...@redhat.com>, Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I rediffed my VM patch against test1-patch-2.2.0-pre3.gz. I also fixed
some bugs (not totally critical but..) that Linus pointed out in my last
code. I also changed shrink_mmap(0) to shrink_mmap(priority), because it
was costing a lot of performance. There is no need to do a
shrink_mmap(0) if, for example, the cache/buffers are under their
minimum. In that case we must allow swap_out() to grow the cache before
we start shrinking it.

So basically this new patch is _far_ more efficient than the last one (I
have never seen such good/stable/fast behavior before!).

This new patch is against testing/test1-patch-2.2.0-pre3.gz, which is
against v2.1/2.2.0-pre2, which is against patch-2.2.0-pre1-vs-2.1.132.gz
(where is this last one now?).

Ah, testing/test1-patch-2.2.0-pre3.gz was missing the trashing memory
initialization that allows every process to get a fast start.

Patch

If this patch decreases performance for you (possibly due to too much
memory being swapped out) you can try this incremental patch (I never
tried it here, btw):

Index: mm//vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.1.2.49
diff -u -r1.1.1.1.2.49 vmscan.c
--- vmscan.c	1999/01/01 19:29:19	1.1.1.1.2.49
+++ linux/mm/vmscan.c	1999/01/01 19:51:22
@@ -441,6 +441,9 @@
 
 static int do_free_user_and_cache(int priority, int gfp_mask)
 {
+	if (shrink_mmap(priority, gfp_mask))
+		return 1;
+
 	switch (swap_out(priority, gfp_mask))
 	{
 		default:

I wrote a swap benchmark that dirties 160 MB of VM. For the first loop
2.2-pre1 was taking 106 sec, for the second loop 120 sec, and then
worse. test1-pre3 + the new patch in this email instead takes 120 sec in
the first loop (since it's allocating, it's probably slowed down a bit
by the trashing_memory heuristic, and that's right), then 90 sec in the
second loop and 77 sec in the third loop!! And the system was far from
being idle (as it was when I measured 2.2-pre1); I was using it without
special care and it was perfectly usable (2.2-pre1 was unusable
instead).

Comments?
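
The swap benchmark mentioned above was not posted in the thread. As a rough idea of what "dirtying" a region of virtual memory per loop could look like, here is a minimal sketch; `dirty_pass()` and `dirty_benchmark()` are hypothetical names, the page size is passed in by the caller rather than queried from the system, and the timing around each loop is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Write one byte into every page of buf so that each page becomes dirty.
 * The pointer is volatile so the stores are not optimized away.
 * Returns the number of pages touched. */
static size_t dirty_pass(volatile unsigned char *buf, size_t size,
                         size_t page_size, unsigned char value)
{
    size_t touched = 0;

    assert(buf && page_size > 0);
    for (size_t off = 0; off < size; off += page_size) {
        buf[off] = value;
        touched++;
    }
    return touched;
}

/* Run `loops` passes over a freshly allocated region of `size` bytes.
 * Returns the total number of page writes, or 0 if allocation failed.
 * A real measurement would time each pass separately. */
static size_t dirty_benchmark(size_t size, size_t page_size, int loops)
{
    unsigned char *buf = malloc(size);
    size_t total = 0;

    if (!buf)
        return 0;
    for (int i = 0; i < loops; i++)
        total += dirty_pass(buf, size, page_size, (unsigned char)(i + 1));
    free(buf);
    return total;
}
```

To approximate the setup in the message, the region would have to be larger than physical RAM (160 MB on a 128 MB machine there), so that every pass after the first forces pages through swap.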

From: Steve Bergman <st...@netplus.net>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.coatldv.pi2vin@ifi.uio.no>#1/1
X-Deja-AN: 427867234
Original-Date: Fri, 01 Jan 1999 17:46:26 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <368D5E52.FE8B7B8@netplus.net>
References: <fa.ing1dev.1e00p93@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <Pine.LNX.3.96.990101203728.301B-100...@laser.bogus>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Andrea Arcangeli wrote:
>
> Please stop and try my new patch against Linus's test1-pre3 (that just
> merge some of my new stuff).

I got the patch and I must say I'm impressed. I ran my "117 image" test
and got these results:

[Note: This loads 117 different images at the same time using 117
separate instances of 'xv' started in the background and results in
~ 165 MB of swap area usage. The machine is an AMD K6-2 300 with 128MB]

2.1.131-ac11                          172 sec  (This was previously the best)
2.2.0-pre1 + Arcangeli's 1st patch    400 sec
test1-pre  + Arcangeli's 2nd patch    119 sec  (!)

Processor utilization was substantially greater with the new patch
compared to either of the others. Before it starts using swap, memory is
being consumed at ~ 4MB/sec. After it starts to swap out, it streams out
at ~ 2MB/sec.

The performance is ~ 45% better than ac11 and ~ 70% better than
2.2.0-pre1 in this test. I was going to test the low memory case but got
sidetracked.

Thanks,
Steve

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.no4l96v.1j7mk8i@ifi.uio.no>
X-Deja-AN: 427932608
Original-Date: Fri, 1 Jan 1999 22:55:09 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990101225111.16066K-100000@penguin.transmeta.com>
References: <fa.coatldv.pi2vin@ifi.uio.no>
To: Steve Bergman <st...@netplus.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 1 Jan 1999, Steve Bergman wrote:
>
> I got the patch and I must say I'm impressed. I ran my "117 image" test
> and got these results:
>
> 2.1.131-ac11                          172 sec  (This was previously the best)
> 2.2.0-pre1 + Arcangeli's 1st patch    400 sec
> test1-pre  + Arcangeli's 2nd patch    119 sec  (!)

Would you care to do some more testing? In particular, I'd like to hear
how basic 2.2.0pre3 works (that's essentially the same as test1-pre,
with only minor updates)? I'd like to calibrate the numbers against
that, rather than against kernels that I haven't actually ever run
myself.

The other thing I'd like to hear is how pre3 looks with this patch,
which should behave basically like Andrea's latest patch but without the
obfuscation he put into his patch..

		Linus

-----
Code

From: Steve Bergman <st...@netplus.net>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.fkjva6v.42carq@ifi.uio.no>#1/1
X-Deja-AN: 427952017
Original-Date: Sat, 02 Jan 1999 02:33:50 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <368DD9EE.D19A4D61@netplus.net>
References: <fa.no4l96v.1j7mk8i@ifi.uio.no>
To: unlisted-recipients:; (no To-header on input)
Original-References: <Pine.LNX.3.95.990101225111.16066K-100...@penguin.transmeta.com>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Linus Torvalds wrote:
>
> On Fri, 1 Jan 1999, Steve Bergman wrote:
> >
> > I got the patch and I must say I'm impressed. I ran my "117 image" test
> > and got these results:
> >
> > 2.1.131-ac11                          172 sec  (This was previously the best)
> > 2.2.0-pre1 + Arcangeli's 1st patch    400 sec
> > test1-pre  + Arcangeli's 2nd patch    119 sec  (!)
>
> Would you care to do some more testing? In particular, I'd like to hear
> how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
> only minor updates)? I'd like to calibrate the numbers against that,
> rather than against kernels that I haven't actually ever run myself.
>
> The other thing I'd like to hear is how pre3 looks with this patch, which
> should behave basically like Andrea's latest patch

Hi Linus,

Andrea sent another patch to correct a problem with i/o bound processes,
which he also posted to linux-kernel. The performance in this test is
unchanged. Here are the results:

2.1.131-ac11                          172 sec
2.2.0-pre1 + Arcangeli's 1st patch    400 sec
test1-pre  + Arcangeli's 2nd patch    119 sec
test1-pre  + Arcangeli's 3rd patch    119 sec
test1-pre  + Arcangeli's 3rd patch    117 sec
    (changed to priority = 9 in mm/vmscan.c)
2.2.0-pre3                            175 sec
2.2.0-pre3 + Linus's patch            129 sec
RH5.2 Stock (2.0.36-0.7)              280 sec

Watching 'vmstat 1' during the test, I noticed that '2.2.0+Linus patch'
was not *quite* as smooth as the Arcangeli patches, in that there were
periods of 2 or 3 seconds in which the swap out rate would fall to
~800k/sec and then jump back up to 1.8-2.5MB/sec. I have only run your
patch once though. I'll check it further tomorrow to confirm that that
is really the case.

Note how much better 2.2 is doing compared to 2.0.36-0.7 in this
situation.

I should be available for a good part of this weekend for further
testing; just let me know.

As a reference:

AMD K6-2 300
128MB ram
2GB seagate scsi2 dedicated to swap
Data drive is 6.5GB UDMA

Steve Bergman
st...@netplus.net
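
When comparing rows of the table above, note that "N% better" can mean two different things: how much faster the new run is relative to its own time, or how much wall-clock time was saved relative to the baseline. The helper names below are hypothetical; the numbers fed to them in the usage note come straight from the table.

```c
#include <assert.h>

/* How much faster the new run is, as a percentage of the new time:
 * (baseline - new) / new * 100. */
static double speedup_percent(double baseline_sec, double new_sec)
{
    assert(baseline_sec > 0.0 && new_sec > 0.0);
    return (baseline_sec / new_sec - 1.0) * 100.0;
}

/* How much wall-clock time was saved, as a percentage of the baseline:
 * (baseline - new) / baseline * 100. */
static double time_saved_percent(double baseline_sec, double new_sec)
{
    assert(baseline_sec > 0.0 && new_sec > 0.0);
    return (1.0 - new_sec / baseline_sec) * 100.0;
}
```

For 2.1.131-ac11 (172 sec) against the 2nd patch (119 sec), `speedup_percent` gives about 45% while `time_saved_percent` gives about 31%. For the 400 sec row against 119 sec, `time_saved_percent` gives about 70%. So the "~ 45%" and "~ 70%" figures quoted earlier in the thread appear to use different measures. For 2.2.0-pre3 (175 sec) against pre3 + Linus's patch (129 sec), the two measures give roughly 36% and 26%.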

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm improvement , [Re: 2.2.0 Bug summary]]
Date: 1999/01/05
Message-ID: <fa.j1f026v.100m7ju@ifi.uio.no>
X-Deja-AN: 429190288
Original-Date: Mon, 4 Jan 1999 19:08:00 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990104183954.1944B-100000@laser.bogus>
References: <fa.in0bdmv.1fg2oh2@ifi.uio.no>
To: Steve Bergman <st...@netplus.net>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I have a new revolutionary patch. The main thing is that I killed kswapd
just to make Rik happy ;).

Ah, and my last patches had a little bug that was surely hurting
performance compared to Linus's VM: I was stopping kswapd when
nr_free_pages > freepages.high was true, and not, as Linus was rightly
doing, when nr_free_pages > freepages.high + swap_cluster. So I was
causing a lot of kswapd wakeups.

There was also something not yet improved in the trashing_memory
heuristic, namely removing the trashing bit only if PF_MEMALLOC is not
set.

Ah, and the swapout code seems to like linear rather than exponential
priority handling. Probably it likes to succeed more than shrink_mmap()
does.

If you try it, let me know. I am interested in the image load test
(which should be the nearest to the real world).

With this patch the swapout performance is doubled. The swapout
benchmark that used to take 100 sec with my old code and with Linus's VM
now runs in 50 sec!

Now I get 6 MB/sec (3 so and 3 si) instead of 3 MB/sec (1.5 so, 1.5 si).
6 MB/sec is the performance reported by hdparm -t, btw ;). And the whole
system is perfectly fluid (far more fluid than with the old code). I can
open an xterm without waiting for seconds. The cache does not get kicked
out. It seems really great here. When the system goes OOM it seems to
recover fine.

Here is arca-vm-6 against 2.2.0-pre4:

Patch
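
The effect of the stop threshold on the number of wakeups can be shown with a toy model. The sketch below is not kswapd code: `count_wakeups()` is a hypothetical function, the wake-up condition (free pages dropping below a `low` mark) is an assumption, and allocation is modelled as exactly one page per tick. It only shows why stopping a little above the high mark, as with the extra swap_cluster slack, means the daemon is woken less often.

```c
#include <assert.h>

/* Toy model: one page is allocated per tick. When free pages drop below
 * `low` the daemon is woken and frees pages until the count is above
 * `stop_above`. Returns how many times the daemon was woken. */
static int count_wakeups(int low, int stop_above, int ticks)
{
    int free_pages = stop_above + 1;
    int wakeups = 0;

    assert(low > 0 && stop_above >= low && ticks >= 0);
    for (int t = 0; t < ticks; t++) {
        free_pages--;                    /* one allocation per tick */
        if (free_pages < low) {
            wakeups++;
            while (free_pages <= stop_above)
                free_pages++;            /* free until above the stop level */
        }
    }
    return wakeups;
}
```

With the made-up values low = 10 and a stop level of 20, the toy daemon is woken 83 times in 1000 ticks; raising the stop level by 32 pages (to 52) cuts that to 22 wakeups, because each run leaves more headroom before the next one is needed.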

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm improvement , [Re: 2.2.0 Bug summary]]
Date: 1999/01/05
Message-ID: <fa.ns557uv.1t7ulgm@ifi.uio.no>#1/1
X-Deja-AN: 429053804
Original-Date: Mon, 4 Jan 1999 12:56:27 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990104125147.32215U-100000@penguin.transmeta.com>
References: <fa.j1f026v.100m7ju@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Mon, 4 Jan 1999, Andrea Arcangeli wrote:
>
> I have a new revolutionary patch. The main thing is that I killed kswapd
> just to make Rik happy ;).

Ehh.. You may have made Rik happy, but you totally missed the reason for
kswapd. And while your patch looked interesting (a lot cleaner than the
previous ones, and I _like_ patches that remove code), the fact that you
killed kswapd means that it is essentially useless.

Basically, we _have_ to have kswapd, and I'll tell you why:

 - imagine running low on memory due to GFP_ATOMIC
 - imagine not having any normal processes that do memory allocation.

Boom. You just killed the machine with your patch, because maybe the
GFP_ATOMIC things are what the machine is doing. Imagine a machine that
acts as a router - it might not even be running any normal user
processes at _all_, but it had damn well better make sure that memory is
always available some way.

"kswapd" did that for us, and Rik's happiness counts as nothing in face
of basic facts of life like that.

Sorry.

		Linus
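
The router scenario above can be made concrete with a toy simulation. Nothing below is kernel code: `struct toy_pool`, `toy_alloc_atomic()`, `toy_background_reclaim()` and `count_failed_atomic()` are hypothetical names, the pool sizes and watermarks are made up, and allocated pages are never returned. The one property borrowed from the message is that an atomic allocation cannot stop to reclaim memory itself, so something else has to keep free pages available.

```c
#include <assert.h>

/* Toy memory pool; all names and numbers here are illustrative. */
struct toy_pool {
    int free_pages;    /* pages available right now */
    int reclaimable;   /* pages a reclaimer could free given the chance */
};

/* Atomic-style allocation: it may not sleep or reclaim, so it simply
 * fails when nothing is free. Returns 1 on success, 0 on failure. */
static int toy_alloc_atomic(struct toy_pool *p)
{
    assert(p);
    if (p->free_pages == 0)
        return 0;
    p->free_pages--;
    return 1;
}

/* Stand-in for a background daemon: once free pages drop below `low`,
 * refill up to `high` from the reclaimable pages. */
static void toy_background_reclaim(struct toy_pool *p, int low, int high)
{
    assert(p && low <= high);
    if (p->free_pages >= low)
        return;
    while (p->free_pages < high && p->reclaimable > 0) {
        p->reclaimable--;
        p->free_pages++;
    }
}

/* One atomic allocation per tick and no ordinary process that would
 * ever enter reclaim on its own. Returns the number of failures. */
static int count_failed_atomic(int use_daemon, int ticks)
{
    struct toy_pool p = { 8, 1000 };
    int failed = 0;

    for (int t = 0; t < ticks; t++) {
        if (!toy_alloc_atomic(&p))
            failed++;
        if (use_daemon)
            toy_background_reclaim(&p, 4, 8);
    }
    return failed;
}
```

In this toy, 100 ticks without the daemon give 92 failed allocations (only the first 8 succeed), while with the daemon running none fail.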

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/07
Message-ID: <fa.ofpdegv.k4a0g7@ifi.uio.no>#1/1
X-Deja-AN: 429891336
Original-Date: Wed, 6 Jan 1999 15:35:01 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990106153252.7800D-100000@penguin.transmeta.com>
References: <fa.ingbd7v.1f0qph0@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Oh, well.. Based on what the arca-[678] patches did, there's now a pre-5
out there. Not very similar, but it should incorporate the basic idea:
namely much more aggressively asynchronous swap-outs from a process
context.

Comment away,

		Linus

From: ebiederm+e...@ccr.net (Eric W. Biederman)
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/08
Message-ID: <fa.g9iq3pv.snq0bp@ifi.uio.no>#1/1
X-Deja-AN: 430009088
Original-Date: 06 Jan 1999 22:30:59 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <m1aezvg0vw.fsf@flinx.ccr.net>
References: <fa.ofpdegv.k4a0g7@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990106153252.7800D-100...@penguin.transmeta.com>
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

>>>>> "LT" == Linus Torvalds <torva...@transmeta.com> writes:

LT> Oh, well.. Based on what the arca-[678] patches did, there's now a pre-5
LT> out there. Not very similar, but it should incorporate the basic idea:
LT> namely much more aggressively asynchronous swap-outs from a process
LT> context.

LT> Comment away,

1) With your comments on PG_dirty/(what shrink_mmap should do) you
   have worked out what needs to happen for the mapped in memory case,
   and I haven't quite gotten there. Thank You.

2) I have tested using PG_dirty from shrink_mmap and it is a
   performance problem because it loses all locality of reference,
   and because it forces shrink_mmap into a dual role, of freeing and
   writing pages, which need separate tuning.

Linus, is this a case you feel is important to tune for 2.2?
If so I would be happy to play with it.

Eric

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/08
Message-ID: <fa.obqbegv.h4k0gc@ifi.uio.no>#1/1
X-Deja-AN: 430359625
Original-Date: Thu, 7 Jan 1999 09:56:03 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990107093746.4270H-100000@penguin.transmeta.com>
References: <fa.g9iq3pv.snq0bp@ifi.uio.no>
To: "Eric W. Biederman" <ebiederm+e...@ccr.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 6 Jan 1999, Eric W. Biederman wrote:
>
> 1) With your comments on PG_dirty/(what shrink_mmap should do) you
>    have worked out what needs to happen for the mapped in memory case,
>    and I haven't quite gotten there. Thank You.

Note that it is not finalized. That's why I didn't write the code (which
should be fairly simple), because it has some fairly subtle issues and
thus becomes a 2.3.x thing, I very much suspect.

Basically, my rule of thumb for the changes I did was: "it should have
the same code paths as the old code". What that means is that I didn't
actually do any changes that changed real code: I did only changes that
changed _behaviour_.

That way I can be reasonably hopeful that there are no new bugs
introduced even though performance is very different.

I _do_ have some early data that seems to say that this _has_ uncovered
a very old deadlock condition: something that could happen before but
was almost impossible to trigger.

The deadlock I suspect is:
 - we're low on memory
 - we allocate or look up a new block on the filesystem. This involves
   getting the ext2 superblock lock, and doing a "bread()" of the free
   block bitmap block.
 - this causes us to try to allocate a new buffer, and we are so low on
   memory that we go into try_to_free_pages() to find some more memory.
 - try_to_free_pages() finds a shared memory file to page out.
 - trying to page that out, it looks up the buffers on the filesystem it
   needs, but deadlocks on the superblock lock.

Note that this could happen before too (I've not removed any of the
codepaths that could lead to it), but it was dynamically _much_ less
likely to happen.

I'm not even sure it really exists, but I have some really old reports
that _could_ be due to this, and a few more recent ones (that I never
could explain). And I have a few _really_ recent ones from here
internally at transmeta that look like it's triggering more easily
these days.

(Note that this is not actually pre5-related: I've been chasing this on
and off for some time, and it seems to have just gotten easier to
trigger, which is why I finally have a theory on what is going on - just
a theory though, and I may be completely off the mark).

The positive news is that if I'm right in my suspicions it can only
happen with shared writable mappings or shared memory segments. The bad
news is that the bug appears rather old, and no immediate solution
presents itself.

> 2) I have tested using PG_dirty from shrink_mmap and it is a
>    performance problem because it loses all locality of reference,
>    and because it forces shrink_mmap into a dual role, of freeing and
>    writing pages, which need separate tuning.

Exactly. This is part of the complexity.

The right solution (I _think_) is to conceptually always mark it
PG_dirty in vmscan, and basically leave all the nasty cases to the
filemap physical page scan. But in the simple cases (ie a swap-cached
page that is only mapped by one process and doesn't have any other
users), you'd start the IO "early".

That would essentially mean that normal single mappings get the good
locality, while the case we really suck at right now (multiple mappings
which can all dirty the page) would not cause excessive page-outs.

Basically, I think that the stuff we handle now with the swap-cache we
do well on already, and we'd only really want to handle the shared
memory case with PG_dirty.

But I think this is a 2.3 issue, and I only added the comment (and the
PG_dirty define) for now.

> Linus is this a case you feel is important to tune for 2.2?
> If so I would be happy to play with it.

It might be something good to test out, but I really don't want patches
at this date (unless your patches also fix the above deadlock problem,
which I can't see them doing ;)

		Linus
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , improvement , [Re: 2.2.0 Bug summary]]] Date: 1999/01/09 Message-ID: <fa.oa9tg0v.jkq00l@ifi.uio.no>#1/1 X-Deja-AN: 430431821 Original-Date: Thu, 7 Jan 1999 14:57:34 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990107144729.5025P-100000@penguin.transmeta.com> References: <fa.obqbegv.h4k0gc@ifi.uio.no> To: "Eric W. Biederman" <ebiederm+e...@ccr.net> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 7 Jan 1999, Linus Torvalds wrote: > > The deadlock I suspect is: > - we're low on memory > - we allocate or look up a new block on the filesystem. This involves > getting the ext2 superblock lock, and doing a "bread()" of the free > block bitmap block. > - this causes us to try to allocate a new buffer, and we are so low on > memory that we go into try_to_free_pages() to find some more memory. > - try_to_free_pages() finds a shared memory file to page out. > - trying to page that out, it looks up the buffers on the filesystem it > needs, but deadlocks on the superblock lock. Confirmed. 
Hpa was good enough to reproduce this, and my debugging code caught the (fairly deep) deadlock:

	system_call ->
	sys_write ->
	ext2_file_write ->
	ext2_getblk ->
	ext2_alloc_block -> ** gets superblock lock **
	ext2_new_block ->
	getblk ->
	refill_freelist ->
	grow_buffers ->
	__get_free_pages ->
	try_to_free_pages ->
	swap_out ->
	swap_out_process ->
	swap_out_vma ->
	try_to_swap_out ->
	filemap_swapout ->
	filemap_write_page ->
	ext2_file_write ->
	ext2_getblk ->
	ext2_alloc_block ->
	__wait_on_super ** BOOM - we want the superblock lock again **

and I suspect the fix is fairly simple: I'll just add back the __GFP_IO bit (we kind of used to have one that did something similar) which will make the swap-out code not write out shared pages when it allocates buffers.

The better fix would actually be to make sure that filesystems do not hold locks around these kinds of blocking operations, but that is harder to do at this late stage.

Linus
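The effect of an allocation flag like the one described above can be modelled in user space. What follows is a minimal sketch under stated assumptions, not the kernel patch: every `sketch_` name, the flag values, and the three-way result code are hypothetical, and the superblock lock is reduced to a boolean so that the would-be deadlock is reported instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real kernel values are not taken from the thread. */
#define SKETCH_GFP_WAIT 0x01u  /* caller may sleep */
#define SKETCH_GFP_IO   0x02u  /* caller permits write-out of shared pages */

struct sketch_fs {
    bool super_locked;  /* stands in for the ext2 superblock lock */
    int  clean_pages;   /* pages that can be freed without any I/O */
    int  dirty_shared;  /* pages that can only be freed by writing via the fs */
};

enum sketch_result { SKETCH_OK, SKETCH_NOMEM, SKETCH_DEADLOCK };

/* Writing out a shared page allocates blocks, so it needs the superblock lock. */
static enum sketch_result sketch_write_shared_page(struct sketch_fs *fs)
{
    if (fs->super_locked)
        return SKETCH_DEADLOCK;  /* the real code would sleep here forever */
    fs->dirty_shared--;
    return SKETCH_OK;
}

/* Reclaim one page; only touch shared dirty pages if the caller allows I/O. */
static enum sketch_result sketch_try_to_free_page(struct sketch_fs *fs,
                                                  unsigned gfp_mask)
{
    if (fs->clean_pages > 0) {
        fs->clean_pages--;
        return SKETCH_OK;
    }
    if (fs->dirty_shared > 0 && (gfp_mask & SKETCH_GFP_IO))
        return sketch_write_shared_page(fs);
    return SKETCH_NOMEM;
}

/* Models block allocation: take the lock, then allocate a buffer under it. */
static enum sketch_result sketch_alloc_block(struct sketch_fs *fs,
                                             unsigned gfp_mask)
{
    enum sketch_result r;

    assert(!fs->super_locked);
    fs->super_locked = true;
    r = sketch_try_to_free_page(fs, gfp_mask);
    fs->super_locked = false;
    return r;
}
```

Calling `sketch_alloc_block()` with the I/O bit set while only shared dirty pages remain reports the recursion; clearing the bit turns that into an ordinary allocation failure, which is the trade-off the message describes.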
From: Savochkin Andrey Vladimirovich <s...@msu.ru> Subject: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/10 Message-ID: <fa.e9ql71v.9no40g@ifi.uio.no>#1/1 X-Deja-AN: 430851879 Original-Date: Sat, 9 Jan 1999 12:43:04 +0300 Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <19990109124304.C26523@castle.nmd.msu.ru> References: <fa.oa9tg0v.jkq00l@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <Pine.LNX.3.95.990107093746.4270H-100...@penguin.transmeta.com> <Pine.LNX.3.95.990107144729.5025P-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list Mime-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu

Hi,

I've found another deadlock. Two processes were locked trying to grab an inode write semaphore. Their call traces are (in diff format):

 Using `map-2.2.0pre5-1' to map addresses to symbols.
 Trace: c010f038 <__down+58/90>
 Trace: c018d080 <__down_failed+8/c>
 Trace: c011abaa <filemap_write_page+a6/15c>
 Trace: c011acad <filemap_swapout+4d/60>
 Trace: c011e2ae <try_to_swap_out+10a/1ac>
 Trace: c011e45a <swap_out_vma+10a/174>
 Trace: c011e521 <swap_out_process+5d/8c>
 Trace: c011e60b <swap_out+bb/e4>
 Trace: c011e75b <try_to_free_pages+4b/70>
 Trace: c011ef61 <__get_free_pages+b5/1dc>
-Trace: c0119cd7 <try_to_read_ahead+2f/124>
-Trace: c011a970 <filemap_nopage+170/304>
-Trace: c0118888 <do_no_page+54/e4>
-Trace: c01189e4 <handle_mm_fault+cc/168>
+Trace: c0118375 <do_wp_page+19/210>
+Trace: c0118a3a <handle_mm_fault+122/168>
 Trace: c010ce9f <do_page_fault+143/364>

I suspect that one of the processes grabbed the semaphore and then deadlocked trying to do it again. Probably the process invoked write() with the data having been swapped out. The page fault handler tried to free some memory and try_to_free_pages decided to write out dirty pages of a shared mapping.
By accident the dirty pages happened to belong to the file the process had started to write to.

A simple solution would be to check whether the inode semaphore is held before trying to write pages out, and to skip the mapping if it is. However, it doesn't seem to be a very good solution, because if most of the memory is occupied by dirty pages of a shared mapping then writing those pages out is exactly the right thing to do.

Best wishes
Andrey V. Savochkin

On Thu, Jan 07, 1999 at 02:57:34PM -0800, Linus Torvalds wrote:
[snip]
> Confirmed. Hpa was good enough to reproduce this, and my debugging code
> caught the (fairly deep) deadlock:
>
> system_call ->
> sys_write ->
> ext2_file_write ->
> ext2_getblk ->
> ext2_alloc_block -> ** gets superblock lock **
> ext2_new_block ->
> getblk ->
> refill_freelist ->
> grow_buffers ->
> __get_free_pages ->
> try_to_free_pages ->
> swap_out ->
> swap_out_process ->
> swap_out_vma ->
> try_to_swap_out ->
> filemap_swapout ->
> filemap_write_page ->
> ext2_file_write ->
> ext2_getblk ->
> ext2_alloc_block ->
> __wait_on_super ** BOOM - we want the superblock lock again **
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/10 Message-ID: <fa.oca1fgv.hkq1gf@ifi.uio.no>#1/1 X-Deja-AN: 430944742 Original-Date: Sat, 9 Jan 1999 10:00:27 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990109095521.2572A-100000@penguin.transmeta.com> References: <fa.e9ql71v.9no40g@ifi.uio.no> To: Savochkin Andrey Vladimirovich <s...@msu.ru> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sat, 9 Jan 1999, Savochkin Andrey Vladimirovich wrote: > > I've found an another deadlock. Yes. This is a case I knew about, and that Alan already mentioned. Trying to write from a shared mapping has a path that can take the write semaphore twice. This one is a whole lot harder to fix - the previous one needed only a simple extra flag, this one is truly nasty. The cleanest solution I can think of is actually to allow semaphores to be recursive. I can do that with minimal overhead (just one extra instruction in the non-contention case), so it's not too bad, and I've wanted to do it for certain other things, but it's still a nasty piece of code to mess around with. Oh, well. I don't think I have much choice. Making the swap-out routines refuse to touch an inode that is busy is a sure way to allow people to let bad users lock down infinite amounts of memory. Linus
From: Steve Bergman <st...@netplus.net> Subject: Re: Results: pre6 vs pre6+zlatko's_patch vs pre5 vs arcavm13 Date: 1999/01/10 Message-ID: <fa.flid25v.55mbrj@ifi.uio.no>#1/1 X-Deja-AN: 431007061 Original-Date: Sat, 09 Jan 1999 18:28:50 -0600 Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-ID: <3697F442.222A2301@netplus.net> References: <fa.fd4f7mv.1tlkl86@ifi.uio.no> Original-References: <Pine.LNX.3.96.990107001448.1242B-100...@laser.bogus> <36942ACA.3F8C0...@netplus.net> <3697DA94.F0F32...@netplus.net> X-Accept-Language: en Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu

Steve Bergman wrote:

I ran the "image test" (loading 116 jpg images simultaneously) on the latest patches and got these results in 128MB (I end up with ~ 160MB in swap):

pre6+zlatko's_patch  2:35
pre6                 2:27
pre5                 1:58
arcavm13             9:13

Arcavm13 (the star performer in the low memory test) is having problems here. Pre5, which performed about the same as the others in my low memory test and which I ignored in my even lower 12MB test, looks quite good here. Based on its good performance here, I decided to run the 12MB kernel compile test on it as well. (See what happens when I try to cut corners...)

In 12MB:

pre6+zlatko_patch  22:14  383206  204482  57823
pre6               20:54  352934  191210  48678
pre5               19:35  334680  183732  93427
arcavm13           19:45  344452  180243  38977

Pre5 is looking good. Based upon the tests that I have run, anyway.

I agree with the person who expressed a distrust of benchmarks. But numbers are necessary for tuning. "Feels faster" is just not a very trustworthy thing. So I also agree with one of the responses: "Try out your favorite apps and time some portion of them and post any interesting numbers." (paraphrased) Benchmarks are not the problem.
The problem is the lack of comprehensiveness, or the tunnel-vision if you prefer, that benchmarks can lead one into. Find a way to quantify the things that you do everyday and post the results. -Thanks -Steve
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: Results: pre6 vs pre6+zlatko's_patch vs pre5 vs arcavm13 Date: 1999/01/11 Message-ID: <fa.odpnfov.h401o8@ifi.uio.no>#1/1 X-Deja-AN: 431067627 Original-Date: Sat, 9 Jan 1999 21:35:40 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990109213225.4665G-100000@penguin.transmeta.com> References: <fa.flid25v.55mbrj@ifi.uio.no> To: Steve Bergman <st...@netplus.net> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu

On Sat, 9 Jan 1999, Steve Bergman wrote:
>
> I ran the "image test" (loading 116 jpg images simultaneously) on the latest
> patches and got these results in 128MB (I end up with ~ 160MB in swap):
>
> pre6+zlatko's_patch 2:35
> pre6 2:27
> pre5 1:58
> arcavm13 9:13

Can you run pre6+zlatko with just the mm/page_alloc.c one-liner reverted to pre5? That is, take pre6+zlatko, and just change

	try_to_free_pages(gfp_mask, freepages.high - nr_free_pages);

back to

	try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);

That particular one-liner was almost certainly a mistake, it was done on the mistaken assumption that the clustering problem was due to insufficient write-time clustering - while zlatko found that it was actually due to fragmentation in the swap area. With zlatkos patch, the original SWAP_CLUSTER_MAX is probably better and almost certainly results in smoother behaviour due to less extreme free_pages..

Linus
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.ocppgov.h4a389@ifi.uio.no>#1/1 X-Deja-AN: 431087276 Original-Date: Sat, 9 Jan 1999 13:50:14 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990109134233.3478A-100000@penguin.transmeta.com> References: <fa.oca1fgv.hkq1gf@ifi.uio.no> To: Savochkin Andrey Vladimirovich <s...@msu.ru> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sat, 9 Jan 1999, Linus Torvalds wrote: > > The cleanest solution I can think of is actually to allow semaphores to be > recursive. I can do that with minimal overhead (just one extra instruction > in the non-contention case), so it's not too bad, and I've wanted to do it > for certain other things, but it's still a nasty piece of code to mess > around with. > > Oh, well. I don't think I have much choice. Does anybody know semaphores by heart? I've got code that may well work, but the race conditions for semaphores are nasty. As mentioned, this only adds a single instruction to the common non-contended case, and I really do believe it should be correct, but it is completely untested (so it might not work at all), and it would be good to have somebody with some theory go through this.. Basically, these simple changes should make it ok to do recursive semaphore grabs, so down(&sem); down(&sem); up(&sem); up(&sem); should work and leave the semaphore unlocked. Anybody? Semaphore theory used to be really popular at Universities, so there must be somebody who has some automated proving program somewhere.. 
Linus ----- Code
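The patch that followed this message is not reproduced in this archive. To make the intended semantics concrete, here is a minimal single-threaded sketch of an owner-plus-depth recursive lock. It is deliberately not the lock-free scheme described above: it has no atomic operations and no sleeping, the caller passes a hypothetical integer task id, and all `sketch_` names are invented for illustration. It only shows what "down, down, up, up leaves the semaphore unlocked" means.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NO_OWNER (-1)

struct sketch_rsem {
    int owner;  /* id of the task holding the lock, or SKETCH_NO_OWNER */
    int depth;  /* number of nested down() calls made by the owner */
};

static void sketch_rsem_init(struct sketch_rsem *s)
{
    s->owner = SKETCH_NO_OWNER;
    s->depth = 0;
}

/*
 * Returns true if the lock was taken (or re-taken by its owner).
 * Returns false where a real semaphore would put the caller to sleep.
 */
static bool sketch_rsem_trydown(struct sketch_rsem *s, int task)
{
    if (s->depth > 0 && s->owner != task)
        return false;
    s->owner = task;
    s->depth++;
    return true;
}

/* Release one level; the lock becomes free only when depth reaches zero. */
static void sketch_rsem_up(struct sketch_rsem *s, int task)
{
    assert(s->depth > 0 && s->owner == task);
    if (--s->depth == 0)
        s->owner = SKETCH_NO_OWNER;
}

static bool sketch_rsem_locked(const struct sketch_rsem *s)
{
    return s->depth > 0;
}
```

The hard part raised in the message, doing the owner check without races and without a spinlock on the uncontended path, is exactly what this sketch leaves out.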
From: Steve Bergman <st...@netplus.net> Subject: Re: Results: pre6 vs pre6+zlatko's_patch vs pre5 vs arcavm13 Date: 1999/01/11 Message-ID: <fa.fg2j6lv.1ulomob@ifi.uio.no>#1/1 X-Deja-AN: 431148214 Original-Date: Sun, 10 Jan 1999 12:43:45 -0600 Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-ID: <3698F4E1.715105C6@netplus.net> References: <fa.odpnfov.h401o8@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <Pine.LNX.3.95.990109213225.4665G-100...@penguin.transmeta.com> X-Accept-Language: en Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu

Linus Torvalds wrote:
> Can you run pre6+zlatko with just the mm/page_alloc.c one-liner reverted
> to pre5? That is, take pre6+zlatko, and just change
>
> try_to_free_pages(gfp_mask, freepages.high - nr_free_pages);
>
> back to
>
> try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);
>

OK, here are the updated results:

'Image test' in 128MB:

pre6+zlatko's_patch          2:35
  and with requested change  3:09
pre6                         2:27
pre5                         1:58
arcavm13                     9:13

I also ran the kernel compile test:

In 12MB:
                             Elapsed  Maj.    Min.    Swaps
                             -------  ------  ------  -----
pre6+zlatko_patch            22:14    383206  204482  57823
  and with requested change  22:23    378662  198194  51445
pre6                         20:54    352934  191210  48678
pre5                         19:35    334680  183732  93427
arcavm13                     19:45    344452  180243  38977

The change seems to have hurt it in both cases. What I am seeing on pre6 and its derivatives is a *lot* of *swapin* activity. Pre5 almost exclusively swaps *out* during the image test, averaging about 1.25MB/sec (spends a lot of time at around 2000k/sec) with very little swapping in. All the pre6 derivatives swap *in* quite heavily during the test. The 'so' number sometimes drops to 0 for seconds at a time. It also looks like pre6 swaps out slightly more overall (~165MB vs 160MB).
-Steve
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.j3h0m5v.ikge1s@ifi.uio.no>#1/1 X-Deja-AN: 431164481 Original-Date: Sun, 10 Jan 1999 16:59:43 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901101659.QAA00922@dax.scot.redhat.com> References: <fa.ocppgov.h4a389@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <Pine.LNX.3.95.990109095521.2572A-100...@penguin.transmeta.com> <Pine.LNX.3.95.990109134233.3478A-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Sat, 9 Jan 1999 13:50:14 -0800 (PST), Linus Torvalds <torva...@transmeta.com> said: > On Sat, 9 Jan 1999, Linus Torvalds wrote: >> >> The cleanest solution I can think of is actually to allow semaphores to be >> recursive. I can do that with minimal overhead (just one extra instruction >> in the non-contention case), so it's not too bad, and I've wanted to do it >> for certain other things, but it's still a nasty piece of code to mess >> around with. Ack. I've been having a closer look, and making the superblock lock recursive doesn't work: the ext2fs allocation code is definitely not reentrant. In particular, the bitmap buffers can get evicted out from under our feet if we reenter the block allocation code, leading to nasty filesystem and/or memory corruption. The allocation code can also get confused if the bitmap contents change between checking the group descriptor for a block group and reading in the bitmap itself, leading to potential ENOSPC errors turning up wrongly. Preventing recursive VM access to the filesystem while we have the superblock lock seems the only easy way out short of making the allocation/truncate code fully reentrant. 
On the other hand, it does look as if the inode deadlock is dealt with OK if we just make that semaphore recursive; I can't see anywhere that dies if we make that change. This does somewhat imply that we may need to make a distinction between reentrant and non-reentrant semaphores if we go down this route. --Stephen.
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.oc9dgov.hka2o1@ifi.uio.no>#1/1 X-Deja-AN: 431148217 Original-Date: Sun, 10 Jan 1999 10:35:10 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990110103201.7668D-100000@penguin.transmeta.com> References: <fa.j3h0m5v.ikge1s@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sun, 10 Jan 1999, Stephen C. Tweedie wrote: > > Ack. I've been having a closer look, and making the superblock lock > recursive doesn't work That's fine - the superblock lock doesn't need to be re-entrant, because __GFP_IO is quite sufficient for that one. The thing I want to make re-entrant is just semaphore accesses: at the point where we would otherwise deadlock on the writer semaphore it's much better to just allow nested writes. I suspect all filesystems can already handle nested writes - they are a lot easier to handle than truly concurrent ones. Linus
From: Savochkin Andrey Vladimirovich <s...@msu.ru> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/12 Message-ID: <fa.iham6bv.1v0esiv@ifi.uio.no>#1/1 X-Deja-AN: 431454453 Original-Date: Mon, 11 Jan 1999 17:11:38 +0300 Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <19990111171138.A9675@castle.nmd.msu.ru> References: <fa.oc9dgov.hka2o1@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <199901101659.QAA00...@dax.scot.redhat.com> <Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list Mime-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sun, Jan 10, 1999 at 10:35:10AM -0800, Linus Torvalds wrote: > The thing I want to make re-entrant is just semaphore accesses: at the > point where we would otherwise deadlock on the writer semaphore it's much > better to just allow nested writes. I suspect all filesystems can already > handle nested writes - they are a lot easier to handle than truly > concurrent ones. You're an optimist, aren't you? :-) In any case I've checked your recursive semaphore code on a news server which reliably deadlocked with the previous kernels. The code seems to work well. Best wishes Andrey V. Savochkin
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.j516l6v.g4mdhm@ifi.uio.no>#1/1 X-Deja-AN: 431198177 Original-Date: Sun, 10 Jan 1999 22:49:47 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901102249.WAA01684@dax.scot.redhat.com> References: <fa.oc9dgov.hka2o1@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <199901101659.QAA00...@dax.scot.redhat.com> <Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Sun, 10 Jan 1999 10:35:10 -0800 (PST), Linus Torvalds <torva...@transmeta.com> said: > On Sun, 10 Jan 1999, Stephen C. Tweedie wrote: >> >> Ack. I've been having a closer look, and making the superblock lock >> recursive doesn't work > That's fine - the superblock lock doesn't need to be re-entrant, because > __GFP_IO is quite sufficient for that one. I'm no longer convinced about that. I think it's much much worse. A bread() on an ext2 bitmap buffer with the superblock held is only safe if the IO can complete without _ever_ relying on a GFP_IO allocation. That means that any interrupt allocations required in that space have to be satisfiable by kswapd without GFP_IO, or kswapd could deadlock on us. It means that if our superblock-locked IO has to stall waiting for an nbd server process or a raid daemon, then those daemons cannot safely do GFP_IO. It's really gross. I think it's actually ugly enough that we cannot make it safe: we can really only be sure if we prevent all GFP_IO from any process which might be involved in our deadlock loop, or if we avoid doing any IO with the superblock lock held. 
It really looks as if the right way around this is to prevent GFP_IO from deadlocking in the first place, by moving the asynchronous page writes out of kswapd/try_to_free_page and into a separate worker thread. That way we can continue to try to reclaim memory somewhere else without deadlocking.

In that case the only thing we are left having to worry about is doing a synchronous swapout, where we end up blocking waiting for the IO thread to complete. In fact, to make it really safe we'd need to avoid synchronous swapout altogether: otherwise we can have

A                     kswiod                  nbd server process

lock_super();
bread(nbd device);
                                              try_to_free_page();
                                              rw_swap_page_async();
                      filemap_write_page();
                      lock_super();
wait_on_buffer();
                                              try_to_free_page();
                                              rw_swap_page_sync();
                                              Oops, kswiod is stalled.

Can we get away without synchronous swapout? Notice that in this case, kswiod may be blocked but kswapd itself will not be. As long as the nbd server does not try to do a synchronous swap, it won't deadlock on kswiod. In other words, it is safe to wait for availability of another free page, but it is not safe to wait for completion of any single, specific swap IO. If kswapd itself no longer performs the IO, then we can always free more memory, until we get to the complete death stage where there are absolutely no clean pages left in the system.

If we do this, then both the inode and the superblock deadlocks disappear.

--Stephen.
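The split proposed above, a reclaimer that never performs I/O plus a separate writer, can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not code from any kernel tree: the page states, the fixed-size FIFO, and every `sketch_` name are hypothetical, and the worker is stepped by hand rather than run as a thread.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NPAGES 4

enum sketch_state { SK_FREE, SK_CLEAN, SK_DIRTY, SK_QUEUED };

struct sketch_vm {
    enum sketch_state page[SK_NPAGES];
    size_t queue[SK_NPAGES];  /* FIFO of page indexes awaiting write-out */
    size_t q_head, q_tail;    /* monotonically increasing positions */
};

/*
 * Reclaimer: frees the first clean page it finds. Dirty pages met on the
 * way are handed to the I/O worker; the reclaimer never writes or waits.
 * Returns the index of the freed page, or -1 if nothing was clean.
 */
static int sketch_reclaim_one(struct sketch_vm *vm)
{
    size_t i;

    for (i = 0; i < SK_NPAGES; i++) {
        if (vm->page[i] == SK_CLEAN) {
            vm->page[i] = SK_FREE;
            return (int)i;
        }
        if (vm->page[i] == SK_DIRTY) {
            vm->page[i] = SK_QUEUED;
            vm->queue[vm->q_tail++ % SK_NPAGES] = i;
        }
    }
    return -1;
}

/*
 * I/O worker: completes one queued write, making that page clean.
 * In a real system this is the only place that could block on fs locks.
 * Returns 1 if a page was written, 0 if the queue was empty.
 */
static int sketch_io_worker_step(struct sketch_vm *vm)
{
    size_t i;

    if (vm->q_head == vm->q_tail)
        return 0;
    i = vm->queue[vm->q_head++ % SK_NPAGES];
    vm->page[i] = SK_CLEAN;
    return 1;
}
```

A caller that gets -1 from `sketch_reclaim_one()` simply retries later; it waits for any page to become free rather than for one specific write, which is the distinction the message draws.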
From: ebiederm+e...@ccr.net (Eric W. Biederman) Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.f9uhn3v.1s3akh8@ifi.uio.no>#1/1 X-Deja-AN: 431269781 Original-Date: 11 Jan 1999 00:04:11 -0600 Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <m1aezq4a78.fsf@flinx.ccr.net> References: <fa.j516l6v.g4mdhm@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> Original-References: <199901101659.QAA00...@dax.scot.redhat.com> <Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> <199901102249.WAA01...@dax.scot.redhat.com> X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu >>>>> "ST" == Stephen C Tweedie <s...@redhat.com> writes: ST> Hi, ST> On Sun, 10 Jan 1999 10:35:10 -0800 (PST), Linus Torvalds ST> <torva...@transmeta.com> said: >> On Sun, 10 Jan 1999, Stephen C. Tweedie wrote: >>> >>> Ack. I've been having a closer look, and making the superblock lock >>> recursive doesn't work >> That's fine - the superblock lock doesn't need to be re-entrant, because >> __GFP_IO is quite sufficient for that one. ST> I'm no longer convinced about that. I think it's much much worse. A ST> bread() on an ext2 bitmap buffer with the superblock held is only safe ST> if the IO can complete without _ever_ relying on a GFP_IO allocation. ST> That means that any interrupt allocations required in that space have to ST> be satisfiable by kswapd without GFP_IO, or kswapd could deadlock ST> on us. Well interrupts use GFP_ATOMIC . . . ST> It means that if our superblock-locked IO has to stall waiting for an ST> nbd server process or a raid daemon, then those daemons cannot safely do ST> GFP_IO. It's really gross. Right. And the flag not to do I/O doesn't propagate across processes. This sounds like a variation of the priority inheritance problem. I wonder if this is why there are some known deadlocks with raid?
ST> I think it's actually ugly enough that we cannot make it safe: we can ST> really only be sure if we prevent all GFP_IO from any process which ST> might be involved in our deadlock loop, or if we avoid doing any IO with ST> the superblock lock held. ST> In fact, to make it really safe we'd need to avoid synchronous swapout ST> altogether: otherwise we can have ST> Can we get away without synchronous swapout? Notice that in this case, ST> kswiod may be blocked but kswapd itself will not be. As long as the nbd ST> server does not try to do a synchronous swap, it won't deadlock on ST> kswiod. In other words, it is safe to wait for availability of another ST> free page, but it is not safe to wait for completion of any single, ST> specific swap IO. If kswapd itself no longer performs the IO, then we ST> can always free more memory, until we get to the complete death stage ST> where there are absolutely no clean pages left in the system. ST> If we do this, then both the inode and the superblock deadlocks ST> disappear. Sounds good. I have a daemon just about ready to go, hopefully I can post it tomorrow for preliminary testing. It looks like my work for 2.3 can, in a small part, help with deadlocks after all. It walks the page tables and just writes out dirty pages, and marks them clean, but it doesn't remove them from processes. So it can get an early jump on writing things out. Then if we are hitting a low memory situation (because pages become dirty quickly), we can just wake it up more often. Currently we are doing totally asynchronous swapping but from the context of the process that needs memory (so the locks are in different processes). Adding a second daemon will play havoc on our balancing but it shouldn't affect anything else. Grr. I forgot about sysv shm. It is the only thing doing synchronous swapping right now. Oh, and just as a side note we are currently unfairly penalizing threaded programs by doing for_each_task instead of for_each_mm in the swapout code...
Eric
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/11 Message-ID: <fa.odadgnv.mka0o6@ifi.uio.no>#1/1 X-Deja-AN: 431421183 Original-Date: Mon, 11 Jan 1999 09:55:59 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990111095116.4886B-100000@penguin.transmeta.com> References: <fa.iham6bv.1v0esiv@ifi.uio.no> To: Savochkin Andrey Vladimirovich <s...@msu.ru> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Mon, 11 Jan 1999, Savochkin Andrey Vladimirovich wrote: > On Sun, Jan 10, 1999 at 10:35:10AM -0800, Linus Torvalds wrote: > > The thing I want to make re-entrant is just semaphore accesses: at the > > point where we would otherwise deadlock on the writer semaphore it's much > > better to just allow nested writes. I suspect all filesystems can already > > handle nested writes - they are a lot easier to handle than truly > > concurrent ones. > > You're an optimist, aren't you? :-) No, drugged to my eye-brows. > In any case I've checked your recursive semaphore code on a news server > which reliably deadlocked with the previous kernels. > The code seems to work well. I found a rather nasty race in my implementation - it's basically impossible to trigger in real life, but quite frankly I don't want to have semaphores that have a really subtle bug in them. However much I tried, I couldn't make the race go away without using a spinlock in the critical path of the semaphore, something which I very much want to avoid.
Unless I find a good recursive semaphore implementation (and I'm starting to despair about finding one that is lock-free for the non-contention case), I'll have to come up with something else (like letting only kswapd swap out pages as has been discussed here). Linus
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/12 Message-ID: <fa.j2h8klv.jkof1j@ifi.uio.no>#1/1 X-Deja-AN: 431732138 Original-Date: Tue, 12 Jan 1999 16:06:38 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901121606.QAA04800@dax.scot.redhat.com> References: <fa.f9uhn3v.1s3akh8@ifi.uio.no> To: ebiederm+e...@ccr.net (Eric W. Biederman) Original-References: <199901101659.QAA00...@dax.scot.redhat.com> <Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> <199901102249.WAA01...@dax.scot.redhat.com> <m1aezq4a78....@flinx.ccr.net> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman) said: > Oh, and just as a side note we are currently unfairly penalizing > threaded programs by doing for_each_task instead of for_each_mm in the > swapout code... I know, on my TODO list... --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/12 Message-ID: <fa.ni5h7uv.1n7elgn@ifi.uio.no>#1/1 X-Deja-AN: 431760196 Original-Date: Tue, 12 Jan 1999 09:54:50 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990112095401.17705A-100000@penguin.transmeta.com> References: <fa.j2h8klv.jkof1j@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Tue, 12 Jan 1999, Stephen C. Tweedie wrote: > > On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman) > said: > > > Oh, and just as a side note we are currently unfairly penalizing > > threaded programs by doing for_each_task instead of for_each_mm in the > > swapout code... > > I know, on my TODO list... Actually, this one is _really_ easy to fix. The truly trivial fix is to just move "swap_cnt" into the mm structure, and you're all done. You'd still walk the list with for_each_task(), but it no longer matters. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
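
A toy model of why moving the counter helps. It assumes, as a simplification, that the swapper gives each counter a budget equal to the address space's resident size and scans one page per unit of budget. The struct and function names are invented, and this is not the kernel's swap_out() code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm {
    int rss;       /* resident pages in this address space */
    int swap_cnt;  /* scan budget when the counter lives in the mm */
    int scanned;   /* pages the toy swapper examined in this mm */
};

struct toy_task {
    struct toy_mm *mm;  /* threads of one program share this pointer */
    int swap_cnt;       /* scan budget when the counter lives in the task */
};

/* Counter per task: a shared mm receives one budget per thread. */
void scan_with_task_counter(struct toy_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].mm != NULL);
        tasks[i].swap_cnt = tasks[i].mm->rss;
    }
    for (size_t i = 0; i < n; i++)
        for (; tasks[i].swap_cnt > 0; tasks[i].swap_cnt--)
            tasks[i].mm->scanned++;
}

/* Counter per mm: every task is still walked, but a shared mm has one budget. */
void scan_with_mm_counter(struct toy_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].mm != NULL);
        tasks[i].mm->swap_cnt = tasks[i].mm->rss;
    }
    for (size_t i = 0; i < n; i++)
        for (; tasks[i].mm->swap_cnt > 0; tasks[i].mm->swap_cnt--)
            tasks[i].mm->scanned++;
}

/* Three threads share one mm; returns the pages scanned in that mm. */
int shared_mm_scans(int counter_in_mm)
{
    struct toy_mm threaded = { .rss = 10 }, single = { .rss = 10 };
    struct toy_task tasks[] = {
        { .mm = &threaded }, { .mm = &threaded }, { .mm = &threaded },
        { .mm = &single },
    };
    size_t n = sizeof tasks / sizeof tasks[0];

    if (counter_in_mm)
        scan_with_mm_counter(tasks, n);
    else
        scan_with_task_counter(tasks, n);
    return threaded.scanned;
}
```

With three threads sharing an address space of 10 resident pages, shared_mm_scans(0) reports 30 scans and shared_mm_scans(1) reports 10, even though both variants walk the same task list.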
From: Zlatko Calusic <Zlatko.Calu...@CARNet.hr> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/12 Message-ID: <fa.g9nljkv.tiatab@ifi.uio.no>#1/1 X-Deja-AN: 431785322 Original-Date: 12 Jan 1999 19:44:45 +0100 Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <87d84kl49u.fsf@atlas.CARNet.hr> References: <fa.ni5h7uv.1n7elgn@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <Pine.LNX.3.95.990112095401.17705A-100...@penguin.transmeta.com> X-Orcpt: rfc822;linux-kernel-outgoing-dig X-Face: -{{$jeB1W-K.U*M}?5mPbqpi4lh3mpjD9T,~LDH/7U]*Xf9["_k>Ijnnce{CZ-ZK_%]g=vL cAZD>] jb0OwfLx4*;XgFN0=P7\,5a(k;szUfM0\sKEv?*MLehyoE@!M1mY:`P1w)s7WHkOg8&8oE"; 0_&*NFyrQMzNv^NW2}:Ifyx`#Rc%]7kazg49XSW>[Pe)s-0^O!Lttfv9-EYr,M2fp)VEE8p]GOiMzA 6Zad, 9ZXunk1k9MO'Yamy(?el@B8Fj1 Organization: Internet mailing list MIME-Version: 1.0 User-Agent: Gnus/5.070069 (Pterodactyl Gnus v0.69) XEmacs/21.2(beta8) (Artemis) Reply-To: Zlatko.Calu...@CARNet.hr Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Linus Torvalds <torva...@transmeta.com> writes: > On Tue, 12 Jan 1999, Stephen C. Tweedie wrote: > > > > On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman) > > said: > > > > > Oh, and just as a side note we are currently unfairly penalizing > > > threaded programs by doing for_each_task instead of for_each_mm in the > > > swapout code... > > > > I know, on my TODO list... > > Actually, this one is _really_ easy to fix. > > The truly trivial fix is to just move "swap_cnt" into the mm structure, > and you're all done. You'd still walk the list with for_each_task(), but > it no longer matters. > > Linus > Not related to this, but I (hopefully correctly) observed that SHM swap I/O is done synchronously. Could somebody spare a minute to explain why is that so, and what needs to be done to make SHM swapping asynchronous? Also, while we're at MM fixes, I'm appending below a small patch that will improve interactive feel. 
After number of async pages gets bigger than pager_daemon.swap_cluster
(= SWAP_CLUSTER_MAX), swapin readahead becomes synchronous, and that
hurts performance. It is better to skip readahead in such situations,
and that is also more fair to swapout. Andrea came to exactly the same
conclusion, independent of me (on the same day :)).

diff -urN linux-pre-7/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-pre-7/mm/page_alloc.c	Tue Jan 11 07:28:06 1999
+++ linux/mm/page_alloc.c	Tue Jan 11 07:29:44 1999
@@ -358,6 +358,8 @@
 	for (i = 1 << page_cluster; i > 0; i--) {
 		if (offset >= swapdev->max)
 			return;
+		if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
+			return;
 		if (!swapdev->swap_map[offset] ||
 		    swapdev->swap_map[offset] == SWAP_MAP_BAD ||
 		    test_bit(offset, swapdev->swap_lockmap))

Regards,
--
Zlatko

From: Rik van Riel <r...@humbolt.geo.uu.nl> Subject: Re: MM deadlock [was: Re: arca-vm-8...] Date: 1999/01/13 Message-ID: <fa.lgh1brv.kahgm@ifi.uio.no>#1/1 X-Deja-AN: 431953647 Original-Date: Tue, 12 Jan 1999 22:46:08 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.03.9901122245090.4656-100000@mirkwood.dummy.home> References: <fa.g9nljkv.tiatab@ifi.uio.no> To: Zlatko Calusic <Zlatko.Calu...@CARNet.hr> X-Sender: r...@mirkwood.dummy.home X-Authentication-Warning: mirkwood.dummy.home: riel owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On 12 Jan 1999, Zlatko Calusic wrote: > After number of async pages gets bigger than > pager_daemon.swap_cluster (= SWAP_CLUSTER_MAX), swapin readahead > becomes synchronous, and that hurts performance. It is better to > skip readahead in such situations, and that is also more fair to > swapout. Andrea came to exactly the same conclusion, independent > of me (on the same day :)). IIRC this facility was in the original swapin readahead implementation. That only leaves the question who removed it and why :)) cheers, Rik -- If a Microsoft product fails, who do you sue? +-------------------------------------------------------------------+ | Linux memory management tour guide. r...@humbolt.geo.uu.nl | | Scouting Vries cubscout leader. http://humbolt.geo.uu.nl/~riel | +-------------------------------------------------------------------+ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.iq05e6v.1agkoh5@ifi.uio.no>#1/1
X-Deja-AN: 432177938
Original-Date: Wed, 13 Jan 1999 14:45:09 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990113144203.284C-100000@laser.bogus>
References: <fa.lgh1brv.kahgm@ifi.uio.no>
To: Rik van Riel <r...@humbolt.geo.uu.nl>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 12 Jan 1999, Rik van Riel wrote:
> IIRC this facility was in the original swapin readahead
> implementation. That only leaves the question who removed
> it and why :))

There's another thing I completely disagree with and that I just removed
here. It's the alignment of the offset field. I see no point in going back
instead of only doing real read_ahead_.

Maybe I am missing something?

Index: page_alloc.c
===================================================================
RCS file: /var/cvs/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.8
retrieving revision 1.1.1.1.2.29
diff -u -r1.1.1.8 -r1.1.1.1.2.29
--- page_alloc.c	1999/01/11 21:24:23	1.1.1.8
+++ linux/mm/page_alloc.c	1999/01/12 23:00:04	1.1.1.1.2.29
@@ -353,10 +352,10 @@
 	unsigned long offset = SWP_OFFSET(entry);
 	struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
 
-	offset = (offset >> page_cluster) << page_cluster;
-
 	for (i = 1 << page_cluster; i > 0; i--) {
-		if (offset >= swapdev->max)
+		if (offset >= swapdev->max ||
+		    /* don't block on I/O for doing readahead -arca */
+		    atomic_read(&nr_async_pages) > pager_daemon.max_async_pages)
 			return;
 		if (!swapdev->swap_map[offset] ||
 		    swapdev->swap_map[offset] == SWAP_MAP_BAD ||

Andrea Arcangeli

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.j5hgktv.gkod9m@ifi.uio.no>#1/1
X-Deja-AN: 432211188
Original-Date: Wed, 13 Jan 1999 17:55:56 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901131755.RAA06476@dax.scot.redhat.com>
References: <fa.iq05e6v.1agkoh5@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>, Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.4.03.9901122245090.4656-100...@mirkwood.dummy.home> <Pine.LNX.3.96.990113144203.284C-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Wed, 13 Jan 1999 14:45:09 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> On Tue, 12 Jan 1999, Rik van Riel wrote:
>> IIRC this facility was in the original swapin readahead
>> implementation. That only leaves the question who removed
>> it and why :))

> There's another thing I completely disagree with and that I just removed
> here. It's the alignment of the offset field. I see no point in going back
> instead of only doing real read_ahead_.

> Maybe I am missing something?

Yes, very much so. When paging in binaries, you often have locality of
reference in both directions --- a set of functions compiled from a
single source file will occupy adjacent pages in VM, but you are as
likely to call a function at the end of the region first as one at the
beginning. It is very common to get backwards locality as a result.

The big advantage of doing aligned clusters for readin is twofold:
first, it means that you get as much of a readahead advantage for these
backwards access patterns as for forward accesses.
Secondly, it means that you are reading in complete tiles which are
guaranteed to have no gaps between them, so any two accesses in adjacent
tiles are sufficient to read in the complete set of nearby pages without
missing any gaps between them: it avoids having to do yet another IO to
fill in the few pages missed by a strictly forward-looking readahead
function.

> +		    /* don't block on I/O for doing readahead -arca */
> +		    atomic_read(&nr_async_pages) > pager_daemon.max_async_pages)
>  			return;

I think this is the wrong solution: far better to do the patch below,
which simply exempts reads from nr_async_pages altogether. I originally
added nr_async_pages to serve two functions: to allow kswapd to
determine how much memory it was already in the process of freeing, and
to act as a throttle on the number of write IOs submitted when swapping.

We don't need a similar throttling action for reads, because every place
where we do VM readahead, each readahead IO cluster is followed by a
synchronous read on one page. We don't throttle the async readaheads on
normal file IO, for example.

--Stephen

----------------------------------------------------------------
--- mm/page_io.c~	Mon Dec 28 21:56:29 1998
+++ mm/page_io.c	Tue Jan 12 16:45:55 1999
@@ -58,7 +58,8 @@
 	}
 
 	/* Don't allow too many pending pages in flight.. */
-	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
+	if (rw == WRITE &&
+	    atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
 		wait = 1;
 
 	p = &swap_info[type];
@@ -170,7 +171,7 @@
 		atomic_dec(&page->count);
 		return;
 	}
-	if (!wait) {
+	if (rw == WRITE && !wait) {
 		set_bit(PG_decr_after, &page->flags);
 		atomic_inc(&nr_async_pages);
 	}

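
The alignment defended here is the `(offset >> page_cluster) << page_cluster` line that Andrea's patch removes. Below is a small sketch of the difference between an aligned window and a forward-only one. The helper names are made up for illustration, and page_cluster = 4 (16-page clusters) in the example is just a convenient value, not one taken from the thread.

```c
#include <assert.h>

/* First swap offset of the aligned cluster that contains `offset`. */
unsigned long aligned_cluster_start(unsigned long offset, unsigned int page_cluster)
{
    assert(page_cluster < sizeof(unsigned long) * 8);
    return (offset >> page_cluster) << page_cluster;
}

/* Does a window of (1 << page_cluster) pages starting at `start`
 * include the page at `target`? */
int window_covers(unsigned long start, unsigned int page_cluster,
                  unsigned long target)
{
    assert(page_cluster < sizeof(unsigned long) * 8);
    return target >= start && target - start < (1UL << page_cluster);
}
```

For a fault at offset 37 with page_cluster = 4, aligned_cluster_start() gives 32, so the aligned window spans offsets 32..47 and a later backwards access to offset 33 is already covered. A forward-only window starting at 37 spans 37..52 and misses it.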
From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.ir0dduv.1bg8o90@ifi.uio.no>#1/1
X-Deja-AN: 432240066
Original-Date: Wed, 13 Jan 1999 19:52:03 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990113191421.185E-100000@laser.bogus>
References: <fa.j5hgktv.gkod9m@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 13 Jan 1999, Stephen C. Tweedie wrote:
> I think this is the wrong solution: far better to do the patch below,
> which simply exempts reads from nr_async_pages altogether. I
> originally added nr_async_pages to serve two functions: to allow
> kswapd to determine how much memory it was already in the process of
> freeing, and to act as a throttle on the number of write IOs submitted
> when swapping.
>
> We don't need a similar throttling action for reads, because every
> place where we do VM readahead, each readahead IO cluster is followed
> by a synchronous read on one page. We don't throttle the async
> readaheads on normal file IO, for example.

Note that we don't need nr_async_pages at all. Here, when the limit of
nr_async_pages is low, it's only a bottleneck for swapout performance. I
have not removed it (because it could be useful to decrease swapout I/O if
somebody needs this strange feature), but I have added a
pager_daemon.max_async_pages and set it to something like 256. Now I check
nr_async_pages against the new max_async_pages.

I _guess_ (not checked) that the _only_ reason Steve saw arca-vm-16
improve so much when changing SWAP_CLUSTER_MAX to 512 instead of 32 is the
removal of the nr_async_pages bottleneck.

Andrea Arcangeli

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.j3hijlv.ikqdht@ifi.uio.no>#1/1
X-Deja-AN: 432292951
Original-Date: Wed, 13 Jan 1999 22:10:12 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901132210.WAA07391@dax.scot.redhat.com>
References: <fa.ir0dduv.1bg8o90@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <199901131755.RAA06...@dax.scot.redhat.com> <Pine.LNX.3.96.990113191421.185E-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Wed, 13 Jan 1999 19:52:03 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> Note that we don't need nr_async_pages at all. Here, when the limit of
> nr_async_pages is low, it's only a bottleneck for swapout performance. I
> have not removed it (because it could be useful to decrease swapout I/O if
> somebody needs this strange feature), but I have added a
> pager_daemon.max_async_pages and set it to something like 256. Now I check
> nr_async_pages against the new max_async_pages.

The problem is that if you do this, it is easy for the swapper to
generate huge amounts of async IO without actually freeing any real
memory: there's a question of balancing the amount of free memory we
have available right now with the amount which we are in the process of
freeing. Setting the nr_async_pages bound to 256 just makes the swapper
keen to send a whole 1MB of memory out to disk at a time, which is a bit
steep on an 8MB box.

--Stephen

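
The 1MB figure follows from the page size. Here is a quick check, assuming the 4KB pages of i386 (the page size is an assumption, the thread does not state it). The function name is invented for illustration.

```c
#include <assert.h>

/* Bytes of swap I/O that may be in flight for a given bound on async pages. */
unsigned long async_bytes_in_flight(unsigned long max_async_pages,
                                    unsigned long page_size)
{
    assert(page_size > 0);
    return max_async_pages * page_size;
}
```

async_bytes_in_flight(256, 4096) is 1048576 bytes, i.e. 1MB, or one eighth of an 8MB machine. The bound of 32 mentioned earlier in the thread for SWAP_CLUSTER_MAX works out to 128KB under the same assumption.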
From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.ob97fgv.jkc1g5@ifi.uio.no>#1/1
X-Deja-AN: 432306316
Original-Date: Wed, 13 Jan 1999 14:30:32 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990113142730.6104G-100000@penguin.transmeta.com>
References: <fa.j3hijlv.ikqdht@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 13 Jan 1999, Stephen C. Tweedie wrote:
>
> The problem is that if you do this, it is easy for the swapper to
> generate huge amounts of async IO without actually freeing any real
> memory: there's a question of balancing the amount of free memory we
> have available right now with the amount which we are in the process of
> freeing. Setting the nr_async_pages bound to 256 just makes the swapper
> keen to send a whole 1MB of memory out to disk at a time, which is a bit
> steep on an 8MB box.

Note that this should be much less of a problem with the current swapout
strategies, but yes, basically we definitely do want to have _some_ way of
maintaining a sane "maximum number of pages in flight" thing.

The right solution may be to do the check in some other place, rather than
fairly deep inside the swap logic. It's not a big deal, I suspect.

Anyway, there's a real pre7 out there now, and it doesn't change a lot of
the issues discussed here. I wanted to get something stable and working. I
still need to get the recursive semaphore thing (or other approach) done,
but basically I think we're at 2.2.0 already apart from that issue, and
that we can continue this discussion as an "occasional tweaks" thing.

Linus