Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk!
 small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: Fri, 5 Oct 2001 11:21:34 -0700 (PDT)
From: Ben Smith <b...@google.com>
Reply-To: b...@google.com
To: linux-ker...@vger.kernel.org
cc: Gerald Aigner <ger...@google.com>
Subject: kswapd problems with 2.4.{9,10}
Original-Message-ID: <Pine.LNX.4.21.0110051119070.19656-100000@tide.corp.google.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 5 Oct 2001 18:22:55 GMT
Message-ID: <fa.nbn0tov.202lq7@ifi.uio.no>
Lines: 1089

I've run into what I think is a problem with linux 2.4.{9,10} and a
memory intensive application we use here. The application works by
readonly mmapping large chunks of disk into memory and then mlocking
them into place. As the application runs, it replaces old chunks
mapped in with new chunks that it wants to map, after unlocking the
old chunks.

The problem occurs after the program runs for a while. kswapd starts
going crazy, consuming all of one CPU's cycles. If I quit my
application and restart it, kswapd goes crazy immediately. The
machine behaves this way until I reboot (it never recovers). This
application works without problems on 2.2.19.

For example, I have a machine with 2G of RAM, 2 1Ghz PIII's, no swap,
and a stock linux 2.4.9 kernel. My application mmaps two 407M chunks
of memory into RAM, and then mlocks them to force them to load from
disk. After accessing the first chunk for a while, it unlocks the
first chunk, un-maps it, and then loads a new file in its place,
mmaping and mlocking the new file.
After loading some number (around 5) of these chunks (only 2 at a
time), kswapd starts consuming 100% of one CPU, and the mmapping and
mlocking of new chunks slows down significantly. If I let the program
continue running, the machine eventually hangs. I've tried 2.4.10 and
it behaves much worse, locking the machine up sooner.

I've created a simple test program that reproduces this behavior.
WARNING, this program will slowly make your machine unusable. It
needs to be run as root to be able to mlock memory. The test program
needs blocks of data on disk to mmap -- the following shell command
will create sample blocks:

bash# for i in `seq 0 19`; do dd if=/dev/zero of=/export/hda3/tmp/chunk$i count=6682555 bs=64; done

I've attached the test program at the bottom of this message, as well
as the kernel config I've used to build my kernel. Please contact me
if you have further questions.

 - Ben

Ben Smith
Google Inc

Test program

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
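The attached test program is not reproduced in this archive, and the
following is not that program. It is only a minimal sketch of the workload
Ben describes (read-only mmap of a chunk file, mlock to force it in, then
munlock and munmap before the slot is reused). The names struct chunk,
chunk_load and chunk_unload are hypothetical, and the use of MAP_SHARED is
an assumption, since the message only says "readonly mmapping".

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped chunk (not from the original program). */
struct chunk {
    void  *addr;    /* start of the mapping, or NULL when the slot is empty */
    size_t len;     /* length of the mapping in bytes */
    int    locked;  /* nonzero if mlock() succeeded on this mapping */
};

/*
 * Map `path` read-only and try to pin it with mlock().
 * Returns 0 if the file was mapped, -1 otherwise. A failed mlock() (for
 * example because of RLIMIT_MEMLOCK) is recorded in c->locked, not fatal.
 */
int chunk_load(struct chunk *c, const char *path)
{
    struct stat st;
    int fd;

    assert(c != NULL && path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    c->len = (size_t)st.st_size;
    /* MAP_SHARED is an assumption; the thread only says "read-only mmap". */
    c->addr = mmap(NULL, c->len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (c->addr == MAP_FAILED) {
        c->addr = NULL;
        c->len = 0;
        return -1;
    }

    c->locked = (mlock(c->addr, c->len) == 0);
    return 0;
}

/* Unlock (if locked) and unmap the chunk so the slot can be reused. */
int chunk_unload(struct chunk *c)
{
    int rc = 0;

    assert(c != NULL);
    if (c->addr == NULL)
        return 0;
    if (c->locked && munlock(c->addr, c->len) != 0)
        rc = -1;
    if (munmap(c->addr, c->len) != 0)
        rc = -1;
    c->addr = NULL;
    c->len = 0;
    c->locked = 0;
    return rc;
}
```

A driver along the lines of the report would keep two such slots, call
chunk_load() on chunk0 and chunk1, and then repeatedly chunk_unload() one
slot and chunk_load() the next file into it.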
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk!
 small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: Tue, 9 Oct 2001 02:59:15 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Ben Smith <b...@google.com>
Cc: linux-ker...@vger.kernel.org, Gerald Aigner <ger...@google.com>
Subject: Re: kswapd problems with 2.4.{9,10}
Original-Message-ID: <20011009025915.H726@athlon.random>
Original-References: <Pine.LNX.4.21.0110051119070.19656-100...@tide.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.21.0110051119070.19656-100000@tide.corp.google.com>; from ben@google.com on Fri, Oct 05, 2001 at 11:21:34AM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Tue, 9 Oct 2001 01:01:13 GMT
Message-ID: <fa.g8c6ogv.1igcg29@ifi.uio.no>
References: <fa.nbn0tov.202lq7@ifi.uio.no>
Lines: 47

On Fri, Oct 05, 2001 at 11:21:34AM -0700, Ben Smith wrote:
> I've attached the test program at the bottom of this message, as well
> as the kernel config I've used to build my kernel. Please contact me
> if you have further questions.

I started looking at this problem a few minutes ago. Thanks for the
testcase, it really helps in understanding your problem.

First, note that using mlock for this purpose sounds wrong. Linux
provides madvise(start,len,MADV_WILLNEED), which will page in the data
efficiently. You can just page in the whole thing at once, without the
hack you're doing of mlocking in chunks to avoid running into the ram/2
locked-RAM limit and to avoid bringing down the machine with all RAM
locked, etc., if you know you have enough cache.

However, the current madvise implementation won't map the pagetables in
(so you'll still generate minor faults on cache hits), and the cache
could be reclaimed if you run other apps. But it sounds a lot saner
than mlock. In theory we could also change madvise to map the
pagetables in; it wouldn't be painful, but before doing so I guess we'd
prefer some real-world numbers that show a noticeable improvement.

Anyway, it would be interesting to know if the problem goes away with
madvise. While rewriting the vma lookup engine I took care of
mmap/mprotect/mremap but not mlock, so at the moment mlock is also not
doing vma merging. So you're generating many unmerged vmas with your
mlock/munlock around the vma areas; madvise would be more efficient in
this sense too, because even if mlock merged vmas it would still need
to create new vmas during your pagein loop. I suspect not many people
are stressing the code that generates the new vma with mlock, so I
can't rule out that you're triggering a core bug in the mlock.c file
rather than a VM problem (I haven't checked the code yet; once I do,
I'll implement the vma merging there too). It's too early to be sure
though.

Also make sure to run it on top of a 2.4.10-based kernel, where kswapd
should know when it has useful work to do or not (you said you tested
2.4.10 but your .config was for 2.4.9). You may also want to try again
on top of my next -aa, which should provide more reliable allocations
for highmem systems. I'm still in the testing stage on my 128 MB RAM
box with an emulated, very unbalanced highmem setup (but the kswapd
logic is unchanged at the moment since I don't see anything wrong
there).

Andrea
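Andrea's suggestion of madvise(start,len,MADV_WILLNEED) in place of mlock
could look roughly like the sketch below. The helper name map_willneed and
its out-parameters are hypothetical, and MAP_SHARED is an assumption. The
advice is treated as best-effort: if madvise() fails, the mapping is still
returned and the failure is only reported through advice_rc.

```c
#define _DEFAULT_SOURCE 1  /* exposes madvise() on glibc under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map `path` read-only and hint that the whole range will be needed soon.
 * Returns the mapping (length in *len_out) or NULL on failure. Nothing is
 * pinned: unlike mlock(), the pages remain reclaimable by the kernel.
 */
void *map_willneed(const char *path, size_t *len_out, int *advice_rc)
{
    struct stat st;
    void *addr;
    size_t len;
    int fd, rc;

    assert(path != NULL && len_out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    len = (size_t)st.st_size;

    addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return NULL;

    /* Advisory only; the result is reported but does not fail the call. */
    rc = madvise(addr, len, MADV_WILLNEED);
    if (advice_rc != NULL)
        *advice_rc = rc;

    *len_out = len;
    return addr;
}
```

The caller releases the chunk with munmap(addr, len); there is no munlock
step because nothing was locked. As the message notes, first touches may
still take minor faults even when the data is already in the page cache.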
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!news.tele.dk! small.news.tele.dk!4.1.16.34!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: Daniel Phillips <phill...@bonn-fries.net> Newsgroups: lucky.linux.kernel Subject: Google's mm problem - not reproduced on 2.4.13 Date: Wed, 31 Oct 2001 18:09:18 +0000 (UTC) Organization: unknown Lines: 15 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <E15yzlQ-00021P-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Trace: solar.carrier.kiev.ua 1004551759 213 193.193.193.124 (31 Oct 2001 18:09:19 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Wed, 31 Oct 2001 18:09:19 +0000 (UTC) X-Mailer: KMail [version 1.3.2] X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Rik van Riel Good morning Ben, I just tried your test program with 2.4.13, 2 Gig, and it ran without problems. Could you try that over there and see if you get the same result? If it does run, the next move would be to check with 3.5 Gig. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: Daniel Phillips <phill...@bonn-fries.net> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Wed, 31 Oct 2001 20:41:47 +0000 (UTC) Organization: unknown Lines: 16 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <E15z28m-0000vb-00@starship.berlin> References: <E15yzlQ-00021P-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Trace: solar.carrier.kiev.ua 1004560908 7094 193.193.193.124 (31 Oct 2001 20:41:48 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Wed, 31 Oct 2001 20:41:48 +0000 (UTC) X-Mailer: KMail [version 1.3.2] In-Reply-To: <E15yzlQ-00021P-00@starship.berlin> X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Rik van Riel On October 31, 2001 07:06 pm, Daniel Phillips wrote: > I just tried your test program with 2.4.13, 2 Gig, and it ran without > problems. Could you try that over there and see if you get the same result? > If it does run, the next move would be to check with 3.5 Gig. Ben reports that his test with 2 Gig memory runs fine, as it does for me, but that it locks up tight with 3.5 Gig, requiring power cycle. Since I only have 2 Gig here I can't reproduce that (yet). -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: Andrea Arcangeli <and...@suse.de> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Wed, 31 Oct 2001 20:47:53 +0000 (UTC) Organization: unknown Lines: 20 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <20011031214540.D1291@athlon.random> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: solar.carrier.kiev.ua 1004561274 7404 193.193.193.124 (31 Oct 2001 20:47:54 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Wed, 31 Oct 2001 20:47:54 +0000 (UTC) Content-Disposition: inline User-Agent: Mutt/1.3.12i In-Reply-To: <E15z28m-0000vb-00@starship.berlin>; from phillips@bonn-fries.net on Wed, Oct 31, 2001 at 09:39:12PM +0100 X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Daniel Phillips On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote: > On October 31, 2001 07:06 pm, Daniel Phillips wrote: > > I just tried your test program with 2.4.13, 2 Gig, and it ran without > > problems. Could you try that over there and see if you get the same result? > > If it does run, the next move would be to check with 3.5 Gig. > > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only > have 2 Gig here I can't reproduce that (yet). are you sure it isn't an oom condition. can you reproduce on 2.4.14pre5aa1? 
mainline (at least before pre6) could deadlock with too much mlocked memory.

Andrea
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!feed2.news.rcn.net!rcn!dca6-feed2.news.digex.net! intermedia!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua! carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Daniel Phillips <phill...@bonn-fries.net> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Wed, 31 Oct 2001 21:06:20 +0000 (UTC) Organization: unknown Lines: 24 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <E15z2WJ-0000wc-00@starship.berlin> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Trace: solar.carrier.kiev.ua 1004562381 8159 193.193.193.124 (31 Oct 2001 21:06:21 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Wed, 31 Oct 2001 21:06:21 +0000 (UTC) X-Mailer: KMail [version 1.3.2] In-Reply-To: <20011031214540.D1291@athlon.random> X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Andrea Arcangeli On October 31, 2001 09:45 pm, Andrea Arcangeli wrote: > On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote: > > On October 31, 2001 07:06 pm, Daniel Phillips wrote: > > > I just tried your test program with 2.4.13, 2 Gig, and it ran without > > > problems. Could you try that over there and see if you get the same result? > > > If it does run, the next move would be to check with 3.5 Gig. > > > > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but > > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only > > have 2 Gig here I can't reproduce that (yet). > > are you sure it isn't an oom condition. can you reproduce on > 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too > much mlocked memory. 
I don't know, I can't reproduce it here, I don't have enough memory. Ben?

--
Daniel
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: Ben Smith <b...@google.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Wed, 31 Oct 2001 22:17:02 +0000 (UTC) Organization: unknown Lines: 36 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <3BE07730.60905@google.com> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> <E15z2WJ-0000wc-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004566623 11279 193.193.193.124 (31 Oct 2001 22:17:03 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Wed, 31 Oct 2001 22:17:03 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012 X-Accept-Language: en-us X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Daniel Phillips > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote: > >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote: >> >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote: >>> >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran >>>>without problems. Could you try that over there and see if you >>>>get the same result? If it does run, the next move would be to >>>>check with 3.5 Gig. >>>> >>>Ben reports that his test with 2 Gig memory runs fine, as it does >>>for me, but that it locks up tight with 3.5 Gig, requiring power >>>cycle. Since I only have 2 Gig here I can't reproduce that (yet). >>> >>are you sure it isn't an oom condition. can you reproduce on >>2.4.14pre5aa1? 
mainline (at least before pre6) could deadlock with
>>too much mlocked memory.
>>
>
> I don't know, I can't reproduce it here, I don't have enough memory.
> Ben?

My test application gets killed (I believe by the oom handler). dmesg
complains about a lot of 0-order allocation failures. For this test,
I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.

 - Ben

Ben Smith
Google, Inc
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: Daniel Phillips <phill...@bonn-fries.net> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Thu, 1 Nov 2001 01:13:12 +0000 (UTC) Organization: unknown Lines: 42 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <E15z5Zm-000067-00@starship.berlin> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Trace: solar.carrier.kiev.ua 1004577193 21294 193.193.193.124 (1 Nov 2001 01:13:13 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Thu, 1 Nov 2001 01:13:13 +0000 (UTC) X-Mailer: KMail [version 1.3.2] In-Reply-To: <20011031214540.D1291@athlon.random> X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Andrea Arcangeli On October 31, 2001 09:45 pm, Andrea Arcangeli wrote: > On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote: > > On October 31, 2001 07:06 pm, Daniel Phillips wrote: > > > I just tried your test program with 2.4.13, 2 Gig, and it ran without > > > problems. Could you try that over there and see if you get the same result? > > > If it does run, the next move would be to check with 3.5 Gig. > > > > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but > > that it locks up tight with 3.5 Gig, requiring power cycle. Since I only > > have 2 Gig here I can't reproduce that (yet). > > are you sure it isn't an oom condition. 
The way the test code works is, it keeps mlocking more blocks of memory until
one of the mlocks fails, and then it does the rest of its work with that many
blocks of memory. It's hard to see how we could get a legitimate oom with
that strategy.

> can you reproduce on
> 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
> much mlocked memory.

OK, he tried it with pre5aa1:

ben> My test application gets killed (I believe by the oom handler). dmesg
ben> complains about a lot of 0-order allocation failures. For this test,
ben> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.

*Just in case* it's oom-related I've asked Ben to try it with one less than
the maximum number of memory blocks he can allocate. If it does turn out to
be oom, it's still a bug, right?

--
Daniel
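The probing strategy Daniel describes (keep mlocking blocks until one mlock
fails, then work with that many) might be sketched as below. This is not
the test program itself: the function names are hypothetical, and anonymous
memory is used here only to keep the sketch self-contained, whereas the
real test maps chunk files. The max_blocks cap is also an addition so the
probe cannot run away.

```c
#define _DEFAULT_SOURCE 1  /* exposes MAP_ANONYMOUS on glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map and mlock blocks of block_len bytes until one attempt fails or
 * max_blocks is reached. Locked blocks are stored in blocks[0..n-1];
 * the return value n is how many blocks could be locked.
 */
size_t lock_until_failure(void **blocks, size_t max_blocks, size_t block_len)
{
    size_t n = 0;

    assert(blocks != NULL && block_len > 0);

    while (n < max_blocks) {
        void *p = mmap(NULL, block_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            break;
        if (mlock(p, block_len) != 0) {
            munmap(p, block_len);  /* the first failed mlock ends the probe */
            break;
        }
        blocks[n++] = p;
    }
    return n;
}

/* Unlock and unmap the first n blocks and clear their slots. */
void release_blocks(void **blocks, size_t n, size_t block_len)
{
    size_t i;

    for (i = 0; i < n; i++) {
        munlock(blocks[i], block_len);
        munmap(blocks[i], block_len);
        blocks[i] = NULL;
    }
}
```

The "one less than the maximum" experiment then amounts to releasing the
last block after the probe and running the workload with n - 1 blocks.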
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Ben Smith <b...@google.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Thu, 1 Nov 2001 01:21:10 +0000 (UTC) Organization: unknown Lines: 18 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <3BE0A2C1.70600@google.com> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> <E15z5Zm-000067-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004577671 21480 193.193.193.124 (1 Nov 2001 01:21:11 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Thu, 1 Nov 2001 01:21:11 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012 X-Accept-Language: en-us X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Daniel Phillips > *Just in case* it's oom-related I've asked Ben to try it with one less than > the maximum number of memory blocks he can allocate. I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks, and it has the same behavior (my app gets killed, 0-order allocation failures, and the system stays up. - Ben Ben Smith Google, Inc - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Rik van Riel <r...@conectiva.com.br> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Thu, 1 Nov 2001 01:48:19 +0000 (UTC) Organization: unknown Lines: 26 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <Pine.LNX.4.33L.0110312341030.2963-100000@imladris.surriel.com> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Trace: solar.carrier.kiev.ua 1004579300 22748 193.193.193.124 (1 Nov 2001 01:48:20 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Thu, 1 Nov 2001 01:48:20 +0000 (UTC) X-X-Sender: <r...@imladris.surriel.com> In-Reply-To: <3BE0A2C1.70600@google.com> X-spambait: aardv...@kernelnewbies.org X-spammeplease: aardv...@nl.linux.org X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Ben Smith On Wed, 31 Oct 2001, Ben Smith wrote: > > *Just in case* it's oom-related I've asked Ben to try it with one less than > > the maximum number of memory blocks he can allocate. > > I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks, > and it has the same behavior (my app gets killed, 0-order allocation > failures, and the system stays up. If you still have swap free at the point where the process gets killed, or if the memory is file-backed, then we are positive it's a kernel bug. regards, Rik -- DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ http://www.surriel.com/ http://distro.conectiva.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Ben Smith <b...@google.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Thu, 1 Nov 2001 01:58:50 +0000 (UTC) Organization: unknown Lines: 28 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <3BE0AB8D.3040400@google.com> References: <Pine.LNX.4.33L.0110312341030.2963-100000@imladris.surriel.com> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004579931 23109 193.193.193.124 (1 Nov 2001 01:58:51 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Thu, 1 Nov 2001 01:58:51 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012 X-Accept-Language: en-us X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Rik van Riel >>>*Just in case* it's oom-related I've asked Ben to try it with one less than >>>the maximum number of memory blocks he can allocate. >>> >>I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks, >>and it has the same behavior (my app gets killed, 0-order allocation >>failures, and the system stays up. >> > > If you still have swap free at the point where the process > gets killed, or if the memory is file-backed, then we are > positive it's a kernel bug. This machine is configured without a swap file. The memory is file backed, though (read-only mmap, followed by a mlock). - Ben Ben Smith Google, Inc - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Sven Heinicke <s...@research.nj.nec.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 17:55:18 +0000 (UTC) Organization: unknown Lines: 67 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <15330.56589.291830.542215@abasin.nj.nec.com> References: <E15yzlQ-00021P-00@starship.berlin> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004723719 6568 193.193.193.124 (2 Nov 2001 17:55:19 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 17:55:19 +0000 (UTC) In-Reply-To: <3BE07730.60905@google.com> X-Mailer: VM 6.72 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Daniel Phillips Ben Smith writes: > > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote: > > > >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote: > >> > >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote: > >>> > >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran > >>>>without problems. Could you try that over there and see if you > >>>>get the same result? If it does run, the next move would be to > >>>>check with 3.5 Gig. > >>>> > >>>Ben reports that his test with 2 Gig memory runs fine, as it does > >>>for me, but that it locks up tight with 3.5 Gig, requiring power > >>>cycle. Since I only have 2 Gig here I can't reproduce that (yet). > >>> > >>are you sure it isn't an oom condition. can you reproduce on > >>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with > >>too much mlocked memory. 
> >>
> >
> > I don't know, I can't reproduce it here, I don't have enough memory.
> > Ben?
>
> My test application gets killed (I believe by the oom handler). dmesg
> complains about a lot of 0-order allocation failures. For this test,
> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
> - Ben
>
> Ben Smith
> Google, Inc
>

This is a System with 4G of memory and regular swap. With 2 Pentium
III 1Ghz processors. On 2.4.14-pre6aa1 it happily runs until:

munmap'ed 7317d000
Loading data at 7317d000 for slot 2
Load (/mnt/sdb/sven/chunk10) succeeded!
mlocking slot 2, 7317d000
mlocking at 7317d000 of size 1048576
Connection to hera closed by remote host.
Connection to hera closed.

Where it kills my ssh and other programs. It fills my /var/log/messages
with:

Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov 2 11:29:07 ps2 syslogd: select: Cannot allocate memory
Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov 2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
Nov 2 11:29:07 ps2 last message repeated 2 times

a bunch of times. Then doesn't free the mmaped memory until file
system is unmounted. It never starts going into swap.

2.4.14-pre5aa1 does about the same.

Sven
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Andrea Arcangeli <and...@suse.de> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 18:03:32 +0000 (UTC) Organization: unknown Lines: 16 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <20011102190046.B6003@athlon.random> References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> <E15z2WJ-0000wc-00@starship.berlin> <3BE07730.60905@google.com> <15330.56589.291830.542215@abasin.nj.nec.com> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: solar.carrier.kiev.ua 1004724212 6984 193.193.193.124 (2 Nov 2001 18:03:32 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 18:03:32 +0000 (UTC) Content-Disposition: inline User-Agent: Mutt/1.3.12i In-Reply-To: <15330.56589.291830.542215@abasin.nj.nec.com>; from sven@research.nj.nec.com on Fri, Nov 02, 2001 at 12:51:09PM -0500 X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Sven Heinicke On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote: > a bunch of times. Then doesn't free the mmaped memory until file > system is unmounted. It never starts going into swap. thanks for testing. This matches the idea that those pages doesn't want to be unmapped for whatever reason (and because there's an mlock in our way at the moment I'd tend to point my finger in that direction rather than into the vm direction). I'll look more closely into this testcase shortly. 
Andrea
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! news.tele.dk!small.news.tele.dk!newsfeed4.cidera.com!newsfeed1.cidera.com! Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua! not-for-mail From: Daniel Phillips <phill...@bonn-fries.net> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 18:20:41 +0000 (UTC) Organization: unknown Lines: 28 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <20011102181758Z16039-4784+420@humbolt.nl.linux.org> References: <E15yzlQ-00021P-00@starship.berlin> <15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Trace: solar.carrier.kiev.ua 1004725242 7533 193.193.193.124 (2 Nov 2001 18:20:42 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 18:20:42 +0000 (UTC) X-Mailer: KMail [version 1.3.2] In-Reply-To: <20011102190046.B6003@athlon.random> X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Andrea Arcangeli On November 2, 2001 07:00 pm, Andrea Arcangeli wrote: > On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote: > > a bunch of times. Then doesn't free the mmaped memory until file > > system is unmounted. It never starts going into swap. > > thanks for testing. This matches the idea that those pages doesn't want > to be unmapped for whatever reason (and because there's an mlock in our > way at the moment I'd tend to point my finger in that direction rather > than into the vm direction). I'll look more closely into this testcase > shortly. The mlock handling looks dead simple: vmscan.c 227 if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) 228 return count; It's hard to see how that could be wrong. Plus, this test program does run under 2.4.9, it just uses way too much CPU on that kernel. 
So I'd say mm bug. -- Daniel
Path: archiver1.google.com!news2.google.com!news1.google.com!sn-xit-02! supernews.com!newsfeed.direct.ca!look.ca!netnews.com!xfer02.netnews.com! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: torva...@transmeta.com (Linus Torvalds) Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 20:33:52 +0000 (UTC) Organization: Transmeta Corporation Lines: 29 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <9ruvkd$jh1$1@penguin.transmeta.com> References: <E15yzlQ-00021P-00@starship.berlin> <15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random> <20011102181758Z16039-4784+420@humbolt.nl.linux.org> NNTP-Posting-Host: solar.carrier.kiev.ua X-Trace: solar.carrier.kiev.ua 1004733233 13317 193.193.193.124 (2 Nov 2001 20:33:53 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 20:33:53 +0000 (UTC) X-Authentication-Warning: palladium.transmeta.com: mail set sender to n...@transmeta.com using -f X-Orig-X-Trace: palladium.transmeta.com 1004733026 25483 127.0.0.1 (2 Nov 2001 20:30:26 GMT) X-Orig-X-Complaints-To: n...@transmeta.com X-Orig-NNTP-Posting-Date: 2 Nov 2001 20:30:26 GMT Cache-Post-Path: palladium.transmeta.com!unkn...@penguin.transmeta.com X-Cache: nntpcache 2.4.0b5 (see http://www.nntpcache.org/) X-Mailing-List: linux-kernel@vger.kernel.org In article <20011102181758Z16039-4784+...@humbolt.nl.linux.org>, Daniel Phillips <phill...@bonn-fries.net> wrote: > >It's hard to see how that could be wrong. Plus, this test program does run >under 2.4.9, it just uses way too much CPU on that kernel. So I'd say mm >bug. So how much memory is mlocked? The locked memory will stay in the inactive list (it won't even ever be activated, because we don't bother even scanning the mapped locked regions), and the inactive list fills up with pages that are completely worthless. 
And the kernel will decide that because most of the unfreeable pages are mapped, it needs to do VM scanning, which obviously doesn't help. Why _does_ this thing do mlock, anyway? What's the point? And how much does it try to lock? If root wants to shoot himself in the head by mlocking all of memory, that's not a VM problem, that's a stupid administrator problem. Linus
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua! bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Ben Smith <b...@google.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 21:13:01 +0000 (UTC) Organization: unknown Lines: 36 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <3BE30B3D.1080505@google.com> References: <E15yzlQ-00021P-00@starship.berlin> <15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random> <20011102181758Z16039-4784+420@humbolt.nl.linux.org> <9ruvkd$jh1$1@penguin.transmeta.com> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004735582 15308 193.193.193.124 (2 Nov 2001 21:13:02 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 21:13:02 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012 X-Accept-Language: en-us X-Mailing-List: linux-kernel@vger.kernel.org X-Comment-To: Linus Torvalds

> So how much memory is mlocked?

In the 3.5G case, we lock 4 blocks (4 * 427683520 bytes, or about 1631M). There is code in the kernel that prevents more than 1/2 of all physical pages from being mlocked:

mlock.c:215-218: (in do_mlock)
    /* we may lock at most half of physical memory... */
    /* (this check is pretty bogus, but doesn't hurt) */
    if (locked > num_physpages/2)
        goto out;

For 2.2 we have a patch that increases this to 90% or 60M, but we don't use this patch on 2.4 yet.

> Why _does_ this thing do mlock, anyway? What's the point? And how much
> does it try to lock?

Latency. We know exactly what data should remain in memory, so we're trying to prevent the vm from paging out the wrong data. It makes a huge difference in performance.
- Ben Ben Smith Google, Inc.
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net! newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua! solar.carrier.kiev.ua!not-for-mail From: torva...@transmeta.com (Linus Torvalds) Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 21:27:37 +0000 (UTC) Organization: Transmeta Corporation Lines: 30 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <9rv2nc$kgi$1@penguin.transmeta.com> References: <E15yzlQ-00021P-00@starship.berlin> <20011102181758Z16039-4784+420@humbolt.nl.linux.org> <9ruvkd$jh1$1@penguin.transmeta.com> <3BE30B3D.1080505@google.com> NNTP-Posting-Host: solar.carrier.kiev.ua X-Trace: solar.carrier.kiev.ua 1004736458 15805 193.193.193.124 (2 Nov 2001 21:27:38 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 21:27:38 +0000 (UTC) X-Authentication-Warning: palladium.transmeta.com: mail set sender to n...@transmeta.com using -f X-Orig-X-Trace: palladium.transmeta.com 1004736194 27011 127.0.0.1 (2 Nov 2001 21:23:14 GMT) X-Orig-X-Complaints-To: n...@transmeta.com X-Orig-NNTP-Posting-Date: 2 Nov 2001 21:23:14 GMT Cache-Post-Path: palladium.transmeta.com!unkn...@penguin.transmeta.com X-Cache: nntpcache 2.4.0b5 (see http://www.nntpcache.org/) X-Mailing-List: linux-kernel@vger.kernel.org In article <3BE30B3D.1080...@google.com>, Ben Smith <b...@google.com> wrote: > >For 2.2 we were have a patch that increases this to 90% or 60M, but we >don't use this patch on 2.4 yet. Well, you'll also deadlock your machine if you happen to lock down the lowmemory area on x86. Sounds like a _bad_ idea. 
Anyway, I posted a suggested patch that should fix the behaviour, but it doesn't fix the fundamental problem with locking the wrong kinds of pages (ie you're definitely on your own if you happen to lock down most of the low 1GB of an intel machine).

>Latency. We know exactly what data should remain in memory, so we're
>trying to prevent the vm from paging out the wrong data. It makes a huge
>difference in performance.

It would be interesting to hear whether that is equally true in the new VM that doesn't necessarily page stuff out unless it can show that the memory pressure is actually from VM mappings.

How big is your mlock area during real load? Still the "max the kernel will allow"? Or is that just a benchmark/test kind of thing?

Linus
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! newsfeed.direct.ca!look.ca!feed2.news.rcn.net!rcn!dca6-feed2.news.digex.net! intermedia!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua! carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail From: Ben Smith <b...@google.com> Newsgroups: lucky.linux.kernel Subject: Re: Google's mm problem - not reproduced on 2.4.13 Date: Fri, 2 Nov 2001 22:45:44 +0000 (UTC) Organization: unknown Lines: 52 Sender: n...@solar.carrier.kiev.ua Approved: newsmas...@lucky.net Message-ID: <3BE3215A.9000302@google.com> References: <E15yzlQ-00021P-00@starship.berlin> <20011102181758Z16039-4784+420@humbolt.nl.linux.org> <9ruvkd$jh1$1@penguin.transmeta.com> <3BE30B3D.1080505@google.com> <9rv2nc$kgi$1@penguin.transmeta.com> NNTP-Posting-Host: solar.carrier.kiev.ua Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Trace: solar.carrier.kiev.ua 1004741145 19470 193.193.193.124 (2 Nov 2001 22:45:45 GMT) X-Complaints-To: usenet@solar.carrier.kiev.ua NNTP-Posting-Date: Fri, 2 Nov 2001 22:45:45 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012 X-Accept-Language: en-us X-Mailing-List: linux-kernel@vger.kernel.org > Anyway, I posted a suggested patch that should fix the behaviour, but it > doesn't fix the fundamental problem with locking the wrong kinds of > pages (ie you're definitely on your own if you happen to lock down most > of the low 1GB of an intel machine). I've tried the patch you sent and it doesn't help. I applied the patch to 2.4.13-pre7 and it hung the machine in the same way (ctrl-alt-del didn't work). 
The last few lines of vmstat before the machine hung look like this:

 0  1  0      0 133444   5132 3367312   0   0 31196     0 1121  2123   0   6  94
 0  1  0      0  63036   5216 3435920   0   0 34338    14 1219  2272   0   5  95
 2  0  1      0   6156   1828 3494904   0   0 31268     0 1130  2198   0  23  77
 1  0  1      0   3596    864 3498488   0   0  2720    16 1640  1068   0  88  12

> It would be interesting to hear whether that is equally true in the new
> VM that doesn't necessarily page stuff out unless it can show that the
> memory pressure is actually from VM mappings.
>
> How big is your mlock area during real load? Still the "max the kernel
> will allow"? Or is that just a benchmark/test kind of thing?

I haven't had a chance to try my real app yet, but my test application is a good simulation of what the real program does, minus any of the accessing of the data that it maps. Since it's the only application running, and for performance reasons we'd need all of our data in memory, we map the "max the kernel will allow".

As another note, I've re-written my test application to use madvise instead of mlock, on a suggestion from Andrea. It also doesn't work. For 2.4.13, after running for a while, my test app hangs, using one CPU, and kswapd consumes the other CPU. I was eventually able to kill my test app.

I've also re-written my test app to use anonymous mmap, followed by a mlock and read()'s. This actually does work without problems, but doesn't really do what we want for other reasons.

- Ben

Ben Smith
Google, Inc.
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu! news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!colt.net!diablo.theplanet.net! easynet-monga!easynet.net!news1.ebone.net!news.ebone.net!news.net.uni-c.dk! uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist Newsgroups: fa.linux.kernel Return-Path: <linux-kernel-ow...@vger.kernel.org> Original-Date: Sun, 18 Nov 2001 09:24:34 +0100 From: Andrea Arcangeli <and...@suse.de> To: linux-ker...@vger.kernel.org Cc: b...@google.com, brown...@irridia.com, phill...@bonn-fries.net, Linus Torvalds <torva...@transmeta.com>, Marcelo Tosatti <marc...@conectiva.com.br> Subject: 2.4.15pre6aa1 (fixes google VM problem) Original-Message-ID: <20011118092434.A1331@athlon.random> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.12i X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: linux-kernel-ow...@vger.kernel.org Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Organization: Internet mailing list Date: Sun, 18 Nov 2001 08:27:13 GMT Message-ID: <fa.gjhdfjv.mu790@ifi.uio.no> Lines: 137 It would be interesting if people experiencing the VM problems originally reported by google (but also trivially reproducible with simple cache operations) could verify that this update fixes those troubles. I wrote some documentation on the bug and the relevant fix in the vm-14 section below. Thanks. If all works right on Monday I will port the fix to mainline (it's basically only a matter of extracting a few bits from the vm-14 patch, it's not really controversial but I didn't had much time to extract it yet, the reason it's not in a self contained patch from the first place is because of the way it was written). 
Comments are welcome of course. I don't think there's another way around it though: even if we generated a logical swap cache that is not a function of the swap entry, that still wouldn't solve the problem of mlocked highmem users [or very frequently accessed ptes] in the lowmem zones. The lowmem ram wasted for this purpose is very minor compared to the total waste of all the highmem zones, and the algorithm I implemented adapts to the amount of highmem, so the lowmem waste is proportional to the potential highmem waste. However the lower_zone_reserve defaults could be changed; I chose the current defaults in a conservative manner.

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1/

Only in 2.4.15pre1aa1: 00_lvm-1.0.1-rc4-3.bz2
Only in 2.4.15pre6aa1: 00_lvm-1.0.1-rc4-4.bz2

Rest of the rc4 diffs rediffed.

Only in 2.4.15pre1aa1: 00_rwsem-fair-23
Only in 2.4.15pre6aa1: 00_rwsem-fair-24
Only in 2.4.15pre1aa1: 00_rwsem-fair-23-recursive-4
Only in 2.4.15pre6aa1: 00_rwsem-fair-24-recursive-5

Rediffed.

Only in 2.4.15pre1aa1: 00_strnlen_user-x86-ret1-1

Merged in mainline.

Only in 2.4.15pre1aa1: 10_lvm-deadlock-fix-1

Now in mainline.

Only in 2.4.15pre1aa1: 10_lvm-incremental-1
Only in 2.4.15pre6aa1: 10_lvm-incremental-2

Part of it in mainline, rediffed the rest.

Only in 2.4.15pre1aa1: 10_vm-13
Only in 2.4.15pre6aa1: 10_vm-14

This should be the first kernel out there without the google VM troubles (which affect more than just the google testcase). The broken piece of VM was this kind of loop in the allocator:

    for (;;) {
        zone_t *z = *(zone++);
        if (!z)
            break;

        if (zone_free_pages(z, order) > z->pages_low) {
            page = rmqueue(z, order);
            if (page)
                return page;
        }
    }

and the above logic is present in all 2.4 kernels out there (2.3 as well).
So the bug has almost nothing to do with the memory balancing engine, as most of us would have expected; it's an allocator zone balancing bug instead, in a piece of code that one would assume to be obviously correct.

The problem comes from the fact that all of ZONE_NORMAL can be allocated to unfreeable highmem users (like anon pages when no swap is available). If that happens the machine runs out of memory no matter what (even if there are 63G of clean cache ready to be freed). Mainline deadlocks because of the infinite loop in the allocator; -aa was "correctly" just killing tasks as soon as the normal zone was filled with mlocked cache or anon pages with no swap.

The fix is to have a per-classzone per-zone set of watermarks (see the zone->watermarks[class_idx] array). Seems to work fine here. Of course this means potentially wasting some memory when the highmem zone is huge, but there's no other way around it, and the potential waste of all the highmem memory is huge compared to a very small waste of the normal zone (it could be more fine-grained of course, for example we don't keep track of whether an allocation will generate a page freeable by the VM or not, but those are minor issues and not easily solvable anyway [we pin pages with a get_page and we certainly don't want to migrate pages across zones within get_page], and the core problem should already be fixed).

Since the logic is generic and also applies to zone dma vs zone normal (not only zone normal vs zone highmem), this should be tested a bit on the lowmem boxes too (I took care of the lowmem boxes in theory, but I haven't tested it in practice).

In short, we now reserve a part of the lower zones for the lower classzone allocations. The algorithm I wrote calculates the "reserved portion" as a function of the size of the higher zone (higher zone means the "zone" that matches the "classzone"). For example a 1G machine will reserve a very small part of zone_normal.
A 64G machine is going to reserve all the 800mbyte of zone normal for allocations from the normal classzone instead (this is fine because it would be a total waste if a 64G machine risked running OOM because zone normal is all occupied by unfreeable highmem users that would much better stay in the highmem zone instead).

The ratio between higher zone size and reserved lower zone size is selectable via a boot option a la memfrac= (the new option is called lower_zone_reserve=). Default values should work well (as usual they don't need to be perfect, but they can be changed if you have suggestions); the boot option is there just in case.

Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1-simple-cache-1

Backed out the anon pages from the lru again, mainly to avoid swapping out too easily and because this is going to be tested on the big boxes with no swap at all anyway.

Only in 2.4.15pre1aa1: 50_uml-patch-2.4.13-5.bz2
Only in 2.4.15pre6aa1: 50_uml-patch-2.4.14-2.bz2

Latest Jeff's uml update.

Only in 2.4.15pre1aa1: 60_tux-2.4.13-ac5-B0.bz2
Only in 2.4.15pre6aa1: 60_tux-2.4.13-ac5-B1.bz2

Latest Ingo's tux update.

Andrea
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk! small.news.tele.dk!195.158.233.21!news1.ebone.net!news.ebone.net! news.net.uni-c.dk!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist Newsgroups: fa.linux.kernel Return-Path: <linux-kernel-ow...@vger.kernel.org> Original-Date: Mon, 19 Nov 2001 18:40:27 +0100 From: Andrea Arcangeli <and...@suse.de> To: linux-ker...@vger.kernel.org Cc: b...@google.com, brown...@irridia.com, phill...@bonn-fries.net, Linus Torvalds <torva...@transmeta.com>, Marcelo Tosatti <marc...@conectiva.com.br> Subject: Re: 2.4.15pre6aa1 (fixes google VM problem) Original-Message-ID: <20011119184027.Q1331@athlon.random> Original-References: <20011118092434.A1...@athlon.random> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="Md/poaVZ8hnGTzuv" Content-Disposition: inline User-Agent: Mutt/1.3.12i In-Reply-To: <20011118092434.A1331@athlon.random>; from andrea@suse.de on Sun, Nov 18, 2001 at 09:24:34AM +0100 X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: linux-kernel-ow...@vger.kernel.org Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Organization: Internet mailing list Date: Mon, 19 Nov 2001 17:44:01 GMT Message-ID: <fa.gjhhg4v.mq6pl@ifi.uio.no> References: <fa.gjhdfjv.mu790@ifi.uio.no> Lines: 276 On Sun, Nov 18, 2001 at 09:24:34AM +0100, Andrea Arcangeli wrote: > If all works right on Monday I will port the fix to mainline (it's Ok here it is against 2.4.15pre6: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/ zone-watermarks-1 (also attached to the email). Untested on top of mainline but should be safe to apply. also avoids GFP_ATOMIC from interrupts to eat the PF_MEMALLOC (longstanding fix from Manfred). 
Andrea Fixes
Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com! news.tele.dk!small.news.tele.dk!193.213.112.26!newsfeed1.ulv.nextra.no! nextra.com!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist Newsgroups: fa.linux.kernel Return-Path: <linux-kernel-ow...@vger.kernel.org> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Original-Date: Mon, 19 Nov 2001 10:57:35 -0800 (PST) From: Linus Torvalds <torva...@transmeta.com> To: Andrea Arcangeli <and...@suse.de> cc: <linux-ker...@vger.kernel.org>, <b...@google.com>, <brown...@irridia.com>, <phill...@bonn-fries.net>, Marcelo Tosatti <marc...@conectiva.com.br> Subject: Re: 2.4.15pre6aa1 (fixes google VM problem) In-Reply-To: <20011119184027.Q1331@athlon.random> Original-Message-ID: <Pine.LNX.4.33.0111191036010.8281-100000@penguin.transmeta.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Scanned-By: MIMEDefang 2.1 (www dot roaringpenguin dot com slash mimedefang) Sender: linux-kernel-ow...@vger.kernel.org Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Organization: Internet mailing list Date: Mon, 19 Nov 2001 19:04:39 GMT Message-ID: <fa.ocp7evv.l4c732@ifi.uio.no> References: <fa.gjhhg4v.mq6pl@ifi.uio.no> Lines: 82 On Mon, 19 Nov 2001, Andrea Arcangeli wrote: > > Ok here it is against 2.4.15pre6: > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/ > zone-watermarks-1 Hmm.. I see what you're trying to do, but it seems overly complicated to me. Correct me if I'm wrong, but what you really want to say is basically: If we're doing a allocation from many zones, we don't want to allow an allocator that can use big zones to deplete the small zones. You do this by building up this "per-zone-per-classzone" array, which basically says that "if you had a big classzone, your minimum requirements for the next zonelist are higher". 
Now, I'd rather look at it from another angle: the fact is that the simple for-loop that allows any allocator to allocate equal amounts of memory from any zone it wants is kind of unfair. So the for-loop is arguably broken.

So we currently have a for-loop that looks like

    for (;;) {
        zone_t *z = *(zone++);
        ..
        min = z->pages_low;
        ..
    }

and the basic problem is that the above loop doesn't have any "memory": we really want it to remember the fact that it has had an earlier zone that was perhaps large, and not just see each new zone as an independent allocation decision.

So why not have the much simpler patch to just say:

    min = 0;
    for (;;) {
        zone_t *z = *(zone++);
        ..
        min = (min >> 2) + z->pages_low;
        ..
    }

or similar that simply _ages_ the "min" according to previous zones that we've already tried. That makes the data structures much simpler, and shows much more clearly what it is we are actually trying to do. We're trying to say that the size of the previous zones in the allocation list _does_ matter.

Basically now we have a "history" of how much memory we have already looked at.

(The "(min >> 2) + new" is obviously just a first try, I'm not claiming it's a particularly good aging function, but it's the standard kind of exponential aging approach).

With something like the above, the threshold of allocation in smaller zones is much higher: let's say that your HIGHMEM zone is four times as big as your NORMAL zone, then a HIGHMEM allocation will want to see twice as much memory in the NORMAL zone than a NORMAL allocation would want to.

See what I'm saying? The above algorithm more closely follows what we really want to do, and by doing so it makes the code much simpler to follow (no "What does this 'z->watermarks[class_idx].low' thing mean?" questions), not to mention causing simpler data structures etc.
The actual _behaviour_ should be pretty close to yours (modulo the differences in calculating the watermarks - your "lower_zone_reserve_ratio" setup is not quite the same thing as just shifting by 2 every time). Linus
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu! logbridge.uoregon.edu!hub1.nntpserver.com!news-out.spamkiller.net! propagator-la!news-in-la.newsfeeds.com!news-in.superfeed.net! newsfeed.media.kyoto-u.ac.jp!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist Newsgroups: fa.linux.kernel Return-Path: <linux-kernel-ow...@vger.kernel.org> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Original-Date: Mon, 19 Nov 2001 12:38:25 -0800 (PST) From: Linus Torvalds <torva...@transmeta.com> To: Andrea Arcangeli <and...@suse.de> cc: <linux-ker...@vger.kernel.org>, <b...@google.com>, <brown...@irridia.com>, <phill...@bonn-fries.net>, Marcelo Tosatti <marc...@conectiva.com.br> Subject: Re: 2.4.15pre6aa1 (fixes google VM problem) In-Reply-To: <Pine.LNX.4.33.0111191036010.8281-100000@penguin.transmeta.com> Original-Message-ID: <Pine.LNX.4.33.0111191229270.8501-100000@penguin.transmeta.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Scanned-By: MIMEDefang 2.1 (www dot roaringpenguin dot com slash mimedefang) Sender: linux-kernel-ow...@vger.kernel.org Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Organization: Internet mailing list Date: Mon, 19 Nov 2001 21:46:02 GMT Message-ID: <fa.odptcnv.k4a5bd@ifi.uio.no> References: <fa.ocp7evv.l4c732@ifi.uio.no> Lines: 38 On Mon, 19 Nov 2001, Linus Torvalds wrote: > > So why not have the much simpler patch to just say: > > min = 0; > for (;;) { > zone_t *z = *(zone++); > .. > min = (min >> 2) + z->pages_low; > .. Actually, as we already limit "pages_low" (for _all_ zones) through the use of zone_balance_max[], I don't think we need to even age the minimum pages. And instead of doing "zone->free_pages - (1UL << order)" in zone_free_pages(), we can do it much more efficiently just once for the for-loop by initializing "min" to "(1UL << order)" instead of zero. So we'd just make the loop be min = (1UL << order); for (;;) { zone_t *z = *(zone++); .. 
min += z->pages_low; ... instead, which is even simpler (and then just compare z->free_pages against "min" directly). Linus