kswapd problems with 2.4.{9,10}

Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk!
small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 5 Oct 2001 11:21:34 -0700 (PDT)
From: Ben Smith <b...@google.com>
Reply-To: b...@google.com
To: linux-ker...@vger.kernel.org
cc: Gerald Aigner <ger...@google.com>
Subject: kswapd problems with 2.4.{9,10}
Original-Message-ID: <Pine.LNX.4.21.0110051119070.19656-100000@tide.corp.google.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 5 Oct 2001 18:22:55 GMT
Message-ID: <fa.nbn0tov.202lq7@ifi.uio.no>
Lines: 1089

I've run into what I think is a problem with linux 2.4.{9,10} and a
memory intensive application we use here. The application works by
readonly mmapping large chunks of disk into memory and then mlocking
them into place. As the application runs, it replaces old chunks
mapped in with new chunks that it wants to map, after unlocking the
old chunks. The problem occurs after the program runs for a
while. kswapd starts going crazy, consuming all of one CPU's
cycles. If I quit my application and restart it, kswapd goes crazy
immediately. The machine behaves this way until I reboot (it never
recovers). This application works without problems on 2.2.19.

For example, I have a machine with 2G of RAM, 2 1Ghz PIII's, no swap,
and a stock linux 2.4.9 kernel. My application mmaps two 407M chunks
of memory into RAM, and then mlocks them to force them to load from
disk. After accessing the first chunk for a while, it unlocks the
first chunk, un-maps it, and then loads a new file in its place,
mmapping and mlocking the new file. After loading some number (around
5) of these chunks (only 2 at a time), kswapd starts consuming 100% of
one CPU, and the mmapping and mlocking of new chunks slows down
significantly. If I let the program continue running, the machine
eventually hangs. I've tried 2.4.10 and it behaves much worse, locking
the machine up sooner.
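The load/unload cycle described above can be sketched in C roughly as follows. This is only an illustration of the mmap/mlock/munlock/munmap pattern, not the attached test program; the helper names load_chunk and drop_chunk are hypothetical, and error handling is kept minimal.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only and try to lock it in RAM.
 * Returns the mapping (length in *len_out) or NULL on failure.
 * *locked_out is set to 1 if mlock succeeded, 0 otherwise. */
static void *load_chunk(const char *path, size_t *len_out, int *locked_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return NULL;

    /* mlock faults every page in and pins it; it may fail without
     * sufficient privileges or a large enough RLIMIT_MEMLOCK. */
    *locked_out = (mlock(addr, len) == 0);
    *len_out = len;
    return addr;
}

/* Hypothetical helper: unlock and unmap a chunk so a new one can be
 * loaded in its place. Returns the result of munmap. */
static int drop_chunk(void *addr, size_t len)
{
    munlock(addr, len); /* result ignored; the region is unmapped next */
    return munmap(addr, len);
}
```

An application following the pattern above would call load_chunk for two files, work on the first, then call drop_chunk on it and load_chunk on the next file, keeping two chunks resident at a time.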

I've created a simple test program that reproduces this
behavior. WARNING, this program will slowly make your machine
unusable. It needs to be run as root to be able to mlock memory. The
test program needs blocks of data on disk to mmap -- the following
shell command will create sample blocks:

bash# for i in `seq 0 19`; 
do dd if=/dev/zero of=/export/hda3/tmp/chunk$i count=6682555 bs=64; 
done
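As a quick check on the numbers in that command, the size of each file dd writes is count times bs. The small C helper below (dd_bytes is a hypothetical name) just does that arithmetic.

```c
#include <stdint.h>

/* Size in bytes of a file written by dd with the given count and bs. */
static uint64_t dd_bytes(uint64_t count, uint64_t bs)
{
    return count * bs;
}
```

With count=6682555 and bs=64 this gives 427,683,520 bytes per chunk, about 407.9 MiB, which matches the 407M chunks mentioned earlier. The 20 chunks together need roughly 8 GB of disk.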

I've attached the test program at the bottom of this message, as well
as the kernel config I've used to build my kernel. Please contact me
if you have further questions.
 - Ben

Ben Smith
Google Inc

Test program

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk!
small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Tue, 9 Oct 2001 02:59:15 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Ben Smith <b...@google.com>
Cc: linux-ker...@vger.kernel.org, Gerald Aigner <ger...@google.com>
Subject: Re: kswapd problems with 2.4.{9,10}
Original-Message-ID: <20011009025915.H726@athlon.random>
Original-References: <Pine.LNX.4.21.0110051119070.19656-100...@tide.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.21.0110051119070.19656-100000@tide.corp.google.com>; 
from ben@google.com on Fri, Oct 05, 2001 at 11:21:34AM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Tue, 9 Oct 2001 01:01:13 GMT
Message-ID: <fa.g8c6ogv.1igcg29@ifi.uio.no>
References: <fa.nbn0tov.202lq7@ifi.uio.no>
Lines: 47

On Fri, Oct 05, 2001 at 11:21:34AM -0700, Ben Smith wrote:
> I've attached the test program at the bottom of this message, as well
> as the kernel config I've used to build my kernel. Please contact me
> if you have further questions.

I started looking into this problem a few minutes ago. Thanks for the
testcase, it really helps in understanding your problem.

First note: using mlock for this purpose sounds wrong. Linux
provides madvise(start,len,MADV_WILLNEED), which will page in the data
efficiently. You can page in the whole thing at once, without the hack
you're doing of mlocking in chunks to avoid running into the ram/2
locked ram limit and to avoid bringing down the machine with all ram
locked etc... if you know you have enough cache. However, the current
madvise implementation won't map the pagetables in (so you'll still
generate minor faults on cache hits), and the cache could be reclaimed
if you run other apps. But it sounds a lot saner than mlock. And in
theory we could also change madvise to map the pagetables in; it
wouldn't be painful, but before doing so I guess we'd prefer some real
world numbers that show a noticeable improvement.
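For reference, the madvise approach suggested above might look something like the following C sketch. It assumes the chunk is a regular file mapped read-only in one piece; the helper name map_willneed is hypothetical, and MADV_WILLNEED is only a hint, so the pages are not pinned the way mlock pins them.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and ask the kernel to read it
 * ahead. Returns the mapping (length in *len_out) or NULL on failure.
 * The madvise return value (0 or -1) is stored in *advise_rc. */
static void *map_willneed(const char *path, size_t *len_out, int *advise_rc)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return NULL;

    /* A hint only: the kernel may start reading the file into the page
     * cache, but nothing is locked and the pages can still be reclaimed. */
    *advise_rc = madvise(addr, len, MADV_WILLNEED);
    *len_out = len;
    return addr;
}
```

Unlike the mlock variant, nothing here needs to be unlocked before munmap, and no locked-memory limit applies.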

Anyway, it would be interesting to know if the problem goes away with
madvise. While rewriting the vma lookup engine I cared about
mmap/mprotect/mremap but I didn't care about mlock, so at the moment it is
also not doing the vma merging (so you're generating many unmerged vmas
with your mlock/munlock around the vma areas; madvise would be more
efficient in this sense too, because even if mlock merged vmas it
would still need to create new vmas during your pagein loop). I
suspect not many people are stressing the code that generates the new
vma with mlock, so I don't exclude that you're triggering a core bug in
the mlock.c file rather than a VM problem (I haven't checked the code yet;
once I check it I'll implement the vma merging there too). It's too
early to be sure though.  Also make sure to run it on top of a 2.4.10
based kernel, where kswapd should know whether it has useful work to do or
not (you said you tested 2.4.10 but your .config was for 2.4.9). You may
also want to try again on top of my next/future -aa, which should provide
more reliable allocations for highmem systems. I'm still in the testing
stage on my 128mbyte ram box with an emulated, very unbalanced highmem
setup (but the kswapd logic is unchanged at the moment since I don't see
anything wrong there).

Andrea

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!news.tele.dk!
small.news.tele.dk!4.1.16.34!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: Daniel Phillips <phill...@bonn-fries.net>
Newsgroups: lucky.linux.kernel
Subject: Google's mm problem - not reproduced on 2.4.13
Date: Wed, 31 Oct 2001 18:09:18 +0000 (UTC)
Organization: unknown
Lines: 15
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <E15yzlQ-00021P-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Trace: solar.carrier.kiev.ua 1004551759 213 193.193.193.124 
(31 Oct 2001 18:09:19 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Wed, 31 Oct 2001 18:09:19 +0000 (UTC)
X-Mailer: KMail [version 1.3.2]
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Rik van Riel

Good morning Ben,

I just tried your test program with 2.4.13, 2 Gig, and it ran without 
problems.  Could you try that over there and see if you get the same result?
If it does run, the next move would be to check with 3.5 Gig.

Regards,

Daniel


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: Daniel Phillips <phill...@bonn-fries.net>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Wed, 31 Oct 2001 20:41:47 +0000 (UTC)
Organization: unknown
Lines: 16
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <E15z28m-0000vb-00@starship.berlin>
References: <E15yzlQ-00021P-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Trace: solar.carrier.kiev.ua 1004560908 7094 193.193.193.124 
(31 Oct 2001 20:41:48 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Wed, 31 Oct 2001 20:41:48 +0000 (UTC)
X-Mailer: KMail [version 1.3.2]
In-Reply-To: <E15yzlQ-00021P-00@starship.berlin>
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Rik van Riel

On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> I just tried your test program with 2.4.13, 2 Gig, and it ran without 
> problems.  Could you try that over there and see if you get the same result?
> If it does run, the next move would be to check with 3.5 Gig.

Ben reports that his test with 2 Gig memory runs fine, as it does for me, but 
that it locks up tight with 3.5 Gig, requiring power cycle.  Since I only 
have 2 Gig here I can't reproduce that (yet).

--
Daniel

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: Andrea Arcangeli <and...@suse.de>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Wed, 31 Oct 2001 20:47:53 +0000 (UTC)
Organization: unknown
Lines: 20
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <20011031214540.D1291@athlon.random>
References: <E15yzlQ-00021P-00@starship.berlin> <E15z28m-0000vb-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: solar.carrier.kiev.ua 1004561274 7404 193.193.193.124 
(31 Oct 2001 20:47:54 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Wed, 31 Oct 2001 20:47:54 +0000 (UTC)
Content-Disposition: inline
User-Agent: Mutt/1.3.12i
In-Reply-To: <E15z28m-0000vb-00@starship.berlin>; 
from phillips@bonn-fries.net on Wed, Oct 31, 2001 at 09:39:12PM +0100
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Daniel Phillips

On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > I just tried your test program with 2.4.13, 2 Gig, and it ran without 
> > problems.  Could you try that over there and see if you get the same result?
> > If it does run, the next move would be to check with 3.5 Gig.
> 
> Ben reports that his test with 2 Gig memory runs fine, as it does for me, but 
> that it locks up tight with 3.5 Gig, requiring power cycle.  Since I only 
> have 2 Gig here I can't reproduce that (yet).

Are you sure it isn't an oom condition? Can you reproduce on
2.4.14pre5aa1? Mainline (at least before pre6) could deadlock with too
much mlocked memory.

Andrea

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!feed2.news.rcn.net!rcn!dca6-feed2.news.digex.net!
intermedia!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!
carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Daniel Phillips <phill...@bonn-fries.net>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Wed, 31 Oct 2001 21:06:20 +0000 (UTC)
Organization: unknown
Lines: 24
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <E15z2WJ-0000wc-00@starship.berlin>
References: <E15yzlQ-00021P-00@starship.berlin> 
<E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Trace: solar.carrier.kiev.ua 1004562381 8159 193.193.193.124 
(31 Oct 2001 21:06:21 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Wed, 31 Oct 2001 21:06:21 +0000 (UTC)
X-Mailer: KMail [version 1.3.2]
In-Reply-To: <20011031214540.D1291@athlon.random>
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Andrea Arcangeli

On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> > On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > > I just tried your test program with 2.4.13, 2 Gig, and it ran without 
> > > problems.  Could you try that over there and see if you get the same result?
> > > If it does run, the next move would be to check with 3.5 Gig.
> > 
> > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but 
> > that it locks up tight with 3.5 Gig, requiring power cycle.  Since I only 
> > have 2 Gig here I can't reproduce that (yet).
> 
> are you sure it isn't an oom condition. can you reproduce on
> 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
> much mlocked memory.

I don't know, I can't reproduce it here, I don't have enough memory.  Ben?

--
Daniel

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: Ben Smith <b...@google.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Wed, 31 Oct 2001 22:17:02 +0000 (UTC)
Organization: unknown
Lines: 36
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <3BE07730.60905@google.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> 
<E15z2WJ-0000wc-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004566623 11279 193.193.193.124 
(31 Oct 2001 22:17:03 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Wed, 31 Oct 2001 22:17:03 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012
X-Accept-Language: en-us
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Daniel Phillips

 > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
 >
 >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
 >>
 >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote:
 >>>
 >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran
 >>>>without problems.  Could you try that over there and see if you
 >>>>get the same result?  If it does run, the next move would be to
 >>>>check with 3.5 Gig.
 >>>>
 >>>Ben reports that his test with 2 Gig memory runs fine, as it does
 >>>for me, but that it locks up tight with 3.5 Gig, requiring power
 >>>cycle.  Since I only have 2 Gig here I can't reproduce that (yet).
 >>>
 >>are you sure it isn't an oom condition. can you reproduce on
 >>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with
 >>too much mlocked memory.
 >>
 >
 > I don't know, I can't reproduce it here, I don't have enough memory.
 > Ben?

My test application gets killed (I believe by the oom handler). dmesg
complains about a lot of 0-order allocation failures. For this test,
I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
  - Ben

Ben Smith
Google, Inc


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: Daniel Phillips <phill...@bonn-fries.net>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Thu, 1 Nov 2001 01:13:12 +0000 (UTC)
Organization: unknown
Lines: 42
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <E15z5Zm-000067-00@starship.berlin>
References: <E15yzlQ-00021P-00@starship.berlin> 
<E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Trace: solar.carrier.kiev.ua 1004577193 21294 193.193.193.124 
(1 Nov 2001 01:13:13 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Thu, 1 Nov 2001 01:13:13 +0000 (UTC)
X-Mailer: KMail [version 1.3.2]
In-Reply-To: <20011031214540.D1291@athlon.random>
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Andrea Arcangeli

On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
> On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
> > On October 31, 2001 07:06 pm, Daniel Phillips wrote:
> > > I just tried your test program with 2.4.13, 2 Gig, and it ran without 
> > > problems.  Could you try that over there and see if you get the same result?
> > > If it does run, the next move would be to check with 3.5 Gig.
> > 
> > Ben reports that his test with 2 Gig memory runs fine, as it does for me, but 
> > that it locks up tight with 3.5 Gig, requiring power cycle.  Since I only 
> > have 2 Gig here I can't reproduce that (yet).
> 
> are you sure it isn't an oom condition.

The way the test code works is, it keeps mlocking more blocks of memory until 
one of the mlocks fails, and then it does the rest of its work with that many 
blocks of memory.  It's hard to see how we could get a legitimate oom with 
that strategy.
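The strategy just described (lock blocks one by one until an mlock fails, then work with however many succeeded) can be sketched in C along these lines. This is not the actual test program; lock_until_failure is a hypothetical helper and the regions are assumed to be already mapped.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: try to mlock each of n already-mapped regions in
 * order, stop at the first failure, and return how many were locked. */
static size_t lock_until_failure(void *const addrs[], const size_t lens[],
                                 size_t n)
{
    size_t locked = 0;

    while (locked < n && mlock(addrs[locked], lens[locked]) == 0)
        locked++;

    return locked;
}
```

The caller would then restrict itself to the first `locked` regions, so it never asks for more locked memory than the kernel was willing to grant.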

> can you reproduce on
> 2.4.14pre5aa1? mainline (at least before pre6) could deadlock with too
> much mlocked memory.

OK, he tried it with pre5aa1:

ben> My test application gets killed (I believe by the oom handler). dmesg
ben> complains about a lot of 0-order allocation failures. For this test,
ben> I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.

*Just in case* it's oom-related I've asked Ben to try it with one less than 
the maximum number of memory blocks he can allocate.

If it does turn out to be oom, it's still a bug, right?

--
Daniel

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Ben Smith <b...@google.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Thu, 1 Nov 2001 01:21:10 +0000 (UTC)
Organization: unknown
Lines: 18
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <3BE0A2C1.70600@google.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> 
<E15z5Zm-000067-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004577671 21480 193.193.193.124 (1 Nov 2001 01:21:11 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Thu, 1 Nov 2001 01:21:11 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012
X-Accept-Language: en-us
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Daniel Phillips

> *Just in case* it's oom-related I've asked Ben to try it with one less than 
> the maximum number of memory blocks he can allocate.

I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks, 
and it has the same behavior (my app gets killed, 0-order allocation 
failures, and the system stays up).
  - Ben

Ben Smith
Google, Inc


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Rik van Riel <r...@conectiva.com.br>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Thu, 1 Nov 2001 01:48:19 +0000 (UTC)
Organization: unknown
Lines: 26
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <Pine.LNX.4.33L.0110312341030.2963-100000@imladris.surriel.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Trace: solar.carrier.kiev.ua 1004579300 22748 193.193.193.124 
(1 Nov 2001 01:48:20 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Thu, 1 Nov 2001 01:48:20 +0000 (UTC)
X-X-Sender:  <r...@imladris.surriel.com>
In-Reply-To: <3BE0A2C1.70600@google.com>
X-spambait: aardv...@kernelnewbies.org
X-spammeplease: 	aardv...@nl.linux.org
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Ben Smith

On Wed, 31 Oct 2001, Ben Smith wrote:

> > *Just in case* it's oom-related I've asked Ben to try it with one less than
> > the maximum number of memory blocks he can allocate.
>
> I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
> and it has the same behavior (my app gets killed, 0-order allocation
> failures, and the system stays up.

If you still have swap free at the point where the process
gets killed, or if the memory is file-backed, then we are
positive it's a kernel bug.
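One way to check the first condition from user space is to read the free swap figure with sysinfo(2). The C sketch below is only an illustration; free_swap_bytes is a hypothetical helper, and it assumes a Linux system whose struct sysinfo provides the freeswap and mem_unit fields.

```c
#include <sys/sysinfo.h>

/* Hypothetical helper: return the amount of free swap in bytes,
 * or -1 if sysinfo() fails. A machine with no swap reports 0. */
static long long free_swap_bytes(void)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return -1;

    /* freeswap is counted in units of mem_unit bytes */
    return (long long)si.freeswap * (long long)si.mem_unit;
}
```

Logging this value just before each mlock call would show whether any swap was still free at the point where the process was killed.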

regards,

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Ben Smith <b...@google.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Thu, 1 Nov 2001 01:58:50 +0000 (UTC)
Organization: unknown
Lines: 28
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <3BE0AB8D.3040400@google.com>
References: <Pine.LNX.4.33L.0110312341030.2963-100000@imladris.surriel.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004579931 23109 193.193.193.124 
(1 Nov 2001 01:58:51 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Thu, 1 Nov 2001 01:58:51 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012
X-Accept-Language: en-us
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Rik van Riel

>>>*Just in case* it's oom-related I've asked Ben to try it with one less than
>>>the maximum number of memory blocks he can allocate.
>>>
>>I've run this test with my 3.5G machine, 3 blocks instead of 4 blocks,
>>and it has the same behavior (my app gets killed, 0-order allocation
>>failures, and the system stays up.
>>
> 
> If you still have swap free at the point where the process
> gets killed, or if the memory is file-backed, then we are
> positive it's a kernel bug.

This machine is configured without a swap file. The memory is file backed,
though (read-only mmap, followed by a mlock).

  - Ben

Ben Smith
Google, Inc


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Sven Heinicke <s...@research.nj.nec.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 17:55:18 +0000 (UTC)
Organization: unknown
Lines: 67
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <15330.56589.291830.542215@abasin.nj.nec.com>
References: <E15yzlQ-00021P-00@starship.berlin>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004723719 6568 193.193.193.124 
(2 Nov 2001 17:55:19 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 17:55:19 +0000 (UTC)
In-Reply-To: <3BE07730.60905@google.com>
X-Mailer: VM 6.72 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Daniel Phillips

Ben Smith writes:
 >  > On October 31, 2001 09:45 pm, Andrea Arcangeli wrote:
 >  >
 >  >>On Wed, Oct 31, 2001 at 09:39:12PM +0100, Daniel Phillips wrote:
 >  >>
 >  >>>On October 31, 2001 07:06 pm, Daniel Phillips wrote:
 >  >>>
 >  >>>>I just tried your test program with 2.4.13, 2 Gig, and it ran
 >  >>>>without problems.  Could you try that over there and see if you
 >  >>>>get the same result?  If it does run, the next move would be to
 >  >>>>check with 3.5 Gig.
 >  >>>>
 >  >>>Ben reports that his test with 2 Gig memory runs fine, as it does
 >  >>>for me, but that it locks up tight with 3.5 Gig, requiring power
 >  >>>cycle.  Since I only have 2 Gig here I can't reproduce that (yet).
 >  >>>
 >  >>are you sure it isn't an oom condition. can you reproduce on
 >  >>2.4.14pre5aa1? mainline (at least before pre6) could deadlock with
 >  >>too much mlocked memory.
 >  >>
 >  >
 >  > I don't know, I can't reproduce it here, I don't have enough memory.
 >  > Ben?
 > 
 > My test application gets killed (I believe by the oom handler). dmesg
 > complains about a lot of 0-order allocation failures. For this test,
 > I'm running with 2.4.14pre5aa1, 3.5gb of RAM, 2 PIII 1Ghz.
 >   - Ben
 > 
 > Ben Smith
 > Google, Inc
 > 

This is a system with 4G of memory and regular swap, with 2 Pentium
III 1Ghz processors.

On 2.4.14-pre6aa1 it happily runs until:

munmap'ed 7317d000
Loading data at 7317d000 for slot 2
Load (/mnt/sdb/sven/chunk10) succeeded!
mlocking slot 2, 7317d000
mlocking at 7317d000 of size 1048576
Connection to hera closed by remote host.
Connection to hera closed.

At that point it kills my ssh and other programs, and fills my
/var/log/messages with:

Nov  2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov  2 11:29:07 ps2 syslogd: select: Cannot allocate memory
Nov  2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
Nov  2 11:29:07 ps2 kernel: __alloc_pages: 0-order allocation failed (gfp=0x1f0/0)
Nov  2 11:29:07 ps2 last message repeated 2 times

a bunch of times.  It then doesn't free the mmapped memory until the file
system is unmounted.  It never starts going into swap.

2.4.14-pre5aa1 does about the same.

	       Sven

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Andrea Arcangeli <and...@suse.de>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 18:03:32 +0000 (UTC)
Organization: unknown
Lines: 16
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <20011102190046.B6003@athlon.random>
References: <E15yzlQ-00021P-00@starship.berlin> 
<E15z28m-0000vb-00@starship.berlin> <20011031214540.D1291@athlon.random> 
<E15z2WJ-0000wc-00@starship.berlin> <3BE07730.60905@google.com> 
<15330.56589.291830.542215@abasin.nj.nec.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: solar.carrier.kiev.ua 1004724212 6984 193.193.193.124 
(2 Nov 2001 18:03:32 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 18:03:32 +0000 (UTC)
Content-Disposition: inline
User-Agent: Mutt/1.3.12i
In-Reply-To: <15330.56589.291830.542215@abasin.nj.nec.com>; 
from sven@research.nj.nec.com on Fri, Nov 02, 2001 at 12:51:09PM -0500
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Sven Heinicke

On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote:
> a bunch of times.  Then doesn't free the mmaped memory until file
> system is unmounted.  It never starts going into swap.

Thanks for testing. This matches the idea that those pages don't want
to be unmapped for whatever reason (and because there's an mlock in our
way at the moment, I'd tend to point my finger in that direction rather
than in the vm direction). I'll look more closely into this testcase
shortly.

Andrea

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!newsfeed4.cidera.com!newsfeed1.cidera.com!
Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!
not-for-mail
From: Daniel Phillips <phill...@bonn-fries.net>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 18:20:41 +0000 (UTC)
Organization: unknown
Lines: 28
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <20011102181758Z16039-4784+420@humbolt.nl.linux.org>
References: <E15yzlQ-00021P-00@starship.berlin> 
<15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Trace: solar.carrier.kiev.ua 1004725242 7533 193.193.193.124 (2 Nov 2001 18:20:42 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 18:20:42 +0000 (UTC)
X-Mailer: KMail [version 1.3.2]
In-Reply-To: <20011102190046.B6003@athlon.random>
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Andrea Arcangeli

On November 2, 2001 07:00 pm, Andrea Arcangeli wrote:
> On Fri, Nov 02, 2001 at 12:51:09PM -0500, Sven Heinicke wrote:
> > a bunch of times.  Then doesn't free the mmaped memory until file
> > system is unmounted.  It never starts going into swap.
> 
> Thanks for testing. This matches the idea that those pages don't want
> to be unmapped for whatever reason (and because there's an mlock in our
> way at the moment I'd tend to point my finger in that direction rather
> than in the vm direction). I'll look more closely into this testcase
> shortly.

The mlock handling looks dead simple:

vmscan.c:227-228:

	if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
		return count;

It's hard to see how that could be wrong.  Plus, this test program does run 
under 2.4.9, it just uses way too much CPU on that kernel.  So I'd say mm 
bug.

--
Daniel

Path: archiver1.google.com!news2.google.com!news1.google.com!sn-xit-02!
supernews.com!newsfeed.direct.ca!look.ca!netnews.com!xfer02.netnews.com!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: torva...@transmeta.com (Linus Torvalds)
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 20:33:52 +0000 (UTC)
Organization: Transmeta Corporation
Lines: 29
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <9ruvkd$jh1$1@penguin.transmeta.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random> 
<20011102181758Z16039-4784+420@humbolt.nl.linux.org>
NNTP-Posting-Host: solar.carrier.kiev.ua
X-Trace: solar.carrier.kiev.ua 1004733233 13317 193.193.193.124 (2 Nov 2001 20:33:53 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 20:33:53 +0000 (UTC)
X-Authentication-Warning: palladium.transmeta.com: 
mail set sender to n...@transmeta.com using -f
X-Orig-X-Trace: palladium.transmeta.com 1004733026 25483 127.0.0.1 
(2 Nov 2001 20:30:26 GMT)
X-Orig-X-Complaints-To: n...@transmeta.com
X-Orig-NNTP-Posting-Date: 2 Nov 2001 20:30:26 GMT
Cache-Post-Path: palladium.transmeta.com!unkn...@penguin.transmeta.com
X-Cache: nntpcache 2.4.0b5 (see http://www.nntpcache.org/)
X-Mailing-List: 	linux-kernel@vger.kernel.org

In article <20011102181758Z16039-4784+...@humbolt.nl.linux.org>,
Daniel Phillips  <phill...@bonn-fries.net> wrote:
>
>It's hard to see how that could be wrong.  Plus, this test program does run 
>under 2.4.9, it just uses way too much CPU on that kernel.  So I'd say mm 
>bug.

So how much memory is mlocked?

The locked memory will stay in the inactive list (it won't even ever be
activated, because we don't bother even scanning the mapped locked
regions), and the inactive list fills up with pages that are completely
worthless. 

And the kernel will decide that because most of the unfreeable pages are
mapped, it needs to do VM scanning, which obviously doesn't help.

Why _does_ this thing do mlock, anyway? What's the point? And how much
does it try to lock?

If root wants to shoot himself in the head by mlocking all of memory,
that's not a VM problem, that's a stupid administrator problem.

		Linus

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!
bn.utel.com.ua!carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Ben Smith <b...@google.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 21:13:01 +0000 (UTC)
Organization: unknown
Lines: 36
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <3BE30B3D.1080505@google.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<15330.56589.291830.542215@abasin.nj.nec.com> <20011102190046.B6003@athlon.random> 
<20011102181758Z16039-4784+420@humbolt.nl.linux.org> 
<9ruvkd$jh1$1@penguin.transmeta.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004735582 15308 193.193.193.124 
(2 Nov 2001 21:13:02 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 21:13:02 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012
X-Accept-Language: en-us
X-Mailing-List: 	linux-kernel@vger.kernel.org
X-Comment-To: Linus Torvalds

> So how much memory is mlocked?

In the 3.5G case, we lock 4 blocks (4 * 427683520 bytes = 1710734080
bytes, or about 1631M). There is code in the kernel that prevents more
than 1/2 of all physical pages from being mlocked:

mlock.c:215-218: (in do_mlock)

	/* we may lock at most half of physical memory... */
	/* (this check is pretty bogus, but doesn't hurt) */
	if (locked > num_physpages/2)
		goto out;

For 2.2 we have a patch that increases this to 90% or 60M, but we 
don't use this patch on 2.4 yet.
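
A worked check of the numbers above, as a minimal sketch rather than
kernel code. It assumes 4 KiB pages and treats the "3.5G" machine as
having exactly 3584M of RAM (917504 pages); the real num_physpages on
that box may differ. The helper names pages_for_bytes() and
mlock_allowed() are hypothetical; only the comparison mirrors the
quoted do_mlock() check.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, as on x86. */
#define PAGE_SIZE_BYTES 4096ULL

/* Number of pages needed to hold `bytes`, rounding up. */
unsigned long long pages_for_bytes(unsigned long long bytes)
{
	return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/*
 * Mirrors the quoted check "if (locked > num_physpages/2) goto out;".
 * Returns 1 if the lock request would pass the check, 0 if it would be
 * refused.
 */
int mlock_allowed(unsigned long long locked_pages,
		  unsigned long long num_physpages)
{
	return locked_pages <= num_physpages / 2;
}
```

Under those assumptions one block is 104415 pages, so four blocks
(417660 pages) stay below the limit of 458752 pages, while a fifth
block (522075 pages in total) would be refused.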

> Why _does_ this thing do mlock, anyway? What's the point? And how much
> does it try to lock?

Latency. We know exactly what data should remain in memory, so we're 
trying to prevent the vm from paging out the wrong data. It makes a huge 
difference in performance.
  - Ben

Ben Smith
Google, Inc.


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!cpk-news-hub1.bbnplanet.com!news.gtei.net!
newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!carrier.kiev.ua!
solar.carrier.kiev.ua!not-for-mail
From: torva...@transmeta.com (Linus Torvalds)
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 21:27:37 +0000 (UTC)
Organization: Transmeta Corporation
Lines: 30
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <9rv2nc$kgi$1@penguin.transmeta.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<20011102181758Z16039-4784+420@humbolt.nl.linux.org> 
<9ruvkd$jh1$1@penguin.transmeta.com> <3BE30B3D.1080505@google.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
X-Trace: solar.carrier.kiev.ua 1004736458 15805 193.193.193.124 
(2 Nov 2001 21:27:38 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 21:27:38 +0000 (UTC)
X-Authentication-Warning: palladium.transmeta.com: 
mail set sender to n...@transmeta.com using -f
X-Orig-X-Trace: palladium.transmeta.com 1004736194 27011 127.0.0.1 
(2 Nov 2001 21:23:14 GMT)
X-Orig-X-Complaints-To: n...@transmeta.com
X-Orig-NNTP-Posting-Date: 2 Nov 2001 21:23:14 GMT
Cache-Post-Path: palladium.transmeta.com!unkn...@penguin.transmeta.com
X-Cache: nntpcache 2.4.0b5 (see http://www.nntpcache.org/)
X-Mailing-List: 	linux-kernel@vger.kernel.org

In article <3BE30B3D.1080...@google.com>, Ben Smith  <b...@google.com> wrote:
>
>For 2.2 we have a patch that increases this to 90% or 60M, but we 
>don't use this patch on 2.4 yet.

Well, you'll also deadlock your machine if you happen to lock down the
lowmemory area on x86. Sounds like a _bad_ idea.

Anyway, I posted a suggested patch that should fix the behaviour, but it
doesn't fix the fundamental problem with locking the wrong kinds of
pages (ie you're definitely on your own if you happen to lock down most
of the low 1GB of an intel machine).

>Latency. We know exactly what data should remain in memory, so we're 
>trying to prevent the vm from paging out the wrong data. It makes a huge 
>difference in performance.

It would be interesting to hear whether that is equally true in the new
VM that doesn't necessarily page stuff out unless it can show that the
memory pressure is actually from VM mappings.

How big is your mlock area during real load? Still the "max the kernel
will allow"? Or is that just a benchmark/test kind of thing?

		Linus

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!feed2.news.rcn.net!rcn!dca6-feed2.news.digex.net!
intermedia!newsfeed1.cidera.com!Cidera!news2.dg.net.ua!bn.utel.com.ua!
carrier.kiev.ua!solar.carrier.kiev.ua!not-for-mail
From: Ben Smith <b...@google.com>
Newsgroups: lucky.linux.kernel
Subject: Re: Google's mm problem - not reproduced on 2.4.13
Date: Fri, 2 Nov 2001 22:45:44 +0000 (UTC)
Organization: unknown
Lines: 52
Sender: n...@solar.carrier.kiev.ua
Approved: newsmas...@lucky.net
Message-ID: <3BE3215A.9000302@google.com>
References: <E15yzlQ-00021P-00@starship.berlin> 
<20011102181758Z16039-4784+420@humbolt.nl.linux.org> 
<9ruvkd$jh1$1@penguin.transmeta.com> <3BE30B3D.1080505@google.com> 
<9rv2nc$kgi$1@penguin.transmeta.com>
NNTP-Posting-Host: solar.carrier.kiev.ua
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: solar.carrier.kiev.ua 1004741145 19470 193.193.193.124 
(2 Nov 2001 22:45:45 GMT)
X-Complaints-To: usenet@solar.carrier.kiev.ua
NNTP-Posting-Date: Fri, 2 Nov 2001 22:45:45 +0000 (UTC)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.5) Gecko/20011012
X-Accept-Language: en-us
X-Mailing-List: 	linux-kernel@vger.kernel.org

> Anyway, I posted a suggested patch that should fix the behaviour, but it
> doesn't fix the fundamental problem with locking the wrong kinds of
> pages (ie you're definitely on your own if you happen to lock down most
> of the low 1GB of an intel machine).

I've tried the patch you sent and it doesn't help. I applied the patch 
to 2.4.13-pre7 and it hung the machine in the same way (ctrl-alt-del 
didn't work). The last few lines of vmstat before the machine hung look 
like this:
  0  1  0      0 133444   5132 3367312   0   0 31196     0 1121  2123   0   6  94
  0  1  0      0  63036   5216 3435920   0   0 34338    14 1219  2272   0   5  95
  2  0  1      0   6156   1828 3494904   0   0 31268     0 1130  2198   0  23  77
  1  0  1      0   3596    864 3498488   0   0  2720    16 1640  1068   0  88  12

> It would be interesting to hear whether that is equally true in the new
> VM that doesn't necessarily page stuff out unless it can show that the
> memory pressure is actually from VM mappings.
> 
> How big is your mlock area during real load? Still the "max the kernel
> will allow"? Or is that just a benchmark/test kind of thing?

I haven't had a chance to try my real app yet, but my test application 
is a good simulation of what the real program does, minus any of the 
accessing of the data that it maps. Since it's the only application 
running, and for performance reasons we'd need all of our data in 
memory, we map the "max the kernel will allow".

As another note, I've re-written my test application to use madvise 
instead of mlock, on a suggestion from Andrea. It also doesn't work. For 
2.4.13, after running for a while, my test app hangs, using one CPU, and 
kswapd consumes the other CPU. I was eventually able to kill my test app.

I've also re-written my test app to use anonymous mmap, followed by a 
mlock and read()'s. This actually does work without problems, but 
doesn't really do what we want for other reasons.
  - Ben

Ben Smith
Google, Inc.
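
The load/replace cycle described in this thread (read-only mmap of a
file, mlock, then munlock and munmap before the next chunk) could be
sketched roughly as below. This is not the actual test program: the
struct chunk type, the function names, and the choice of MAP_SHARED
are assumptions made for illustration. A failed mlock() is reported
but not treated as fatal, since it may simply hit RLIMIT_MEMLOCK for an
unprivileged user.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped chunk. */
struct chunk {
	void  *addr;	/* start of the mapping */
	size_t len;	/* length of the mapping in bytes */
	int    locked;	/* 1 if mlock() succeeded */
};

/* Map a whole file read-only and try to lock it. Returns 0 or -1. */
int load_chunk(const char *path, struct chunk *c)
{
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) != 0 || st.st_size <= 0) {
		close(fd);
		return -1;
	}
	c->len = (size_t)st.st_size;
	c->addr = mmap(NULL, c->len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	if (c->addr == MAP_FAILED)
		return -1;

	/* mlock() faults the pages in and pins them in RAM. */
	c->locked = (mlock(c->addr, c->len) == 0);
	if (!c->locked)
		perror("mlock");
	return 0;
}

/* Unlock (if locked) and unmap a chunk. Returns 0 or -1. */
int release_chunk(struct chunk *c)
{
	if (c->locked && munlock(c->addr, c->len) != 0)
		return -1;
	c->locked = 0;
	return munmap(c->addr, c->len);
}
```

A driver would call load_chunk() for each file it wants resident and
release_chunk() on the oldest chunk before loading a replacement.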


Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!colt.net!diablo.theplanet.net!
easynet-monga!easynet.net!news1.ebone.net!news.ebone.net!news.net.uni-c.dk!
uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Sun, 18 Nov 2001 09:24:34 +0100
From: Andrea Arcangeli <and...@suse.de>
To: linux-ker...@vger.kernel.org
Cc: b...@google.com, brown...@irridia.com, phill...@bonn-fries.net,
        Linus Torvalds <torva...@transmeta.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: 2.4.15pre6aa1 (fixes google VM problem)
Original-Message-ID: <20011118092434.A1331@athlon.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.3.12i
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Sun, 18 Nov 2001 08:27:13 GMT
Message-ID: <fa.gjhdfjv.mu790@ifi.uio.no>
Lines: 137

It would be interesting if people experiencing the VM problems
originally reported by google (but also trivially reproducible with
simple cache operations) could verify that this update fixes those
troubles. I wrote some documentation on the bug and the relevant fix in
the vm-14 section below. Thanks.

If all works right on Monday I will port the fix to mainline (it's
basically only a matter of extracting a few bits from the vm-14 patch;
it's not really controversial, but I haven't had much time to extract it
yet, and the reason it's not in a self-contained patch in the first place
is the way it was written). Comments are welcome of course. I
don't think there's another way around it though: even if we
generated a logical swap cache that is not a function of the swap entry,
that still wouldn't solve the problem of mlocked highmem users [or very
frequently accessed ptes] in the lowmem zones. The lowmem RAM wasted for
this purpose is very minor compared to the total waste of all the
highmem zones, and the algorithm I implemented adapts as a function of the
amount of highmem, so the lowmem waste is proportional to the potential
highmem waste. However, the lower_zone_reserve defaults could be changed;
I chose the current defaults in a conservative manner.

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1/

Only in 2.4.15pre1aa1: 00_lvm-1.0.1-rc4-3.bz2
Only in 2.4.15pre6aa1: 00_lvm-1.0.1-rc4-4.bz2

	Rest of the rc4 diffs rediffed.

Only in 2.4.15pre1aa1: 00_rwsem-fair-23
Only in 2.4.15pre6aa1: 00_rwsem-fair-24
Only in 2.4.15pre1aa1: 00_rwsem-fair-23-recursive-4
Only in 2.4.15pre6aa1: 00_rwsem-fair-24-recursive-5

	Rediffed.

Only in 2.4.15pre1aa1: 00_strnlen_user-x86-ret1-1

	Merged in mainline.

Only in 2.4.15pre1aa1: 10_lvm-deadlock-fix-1

	Now in mainline.

Only in 2.4.15pre1aa1: 10_lvm-incremental-1
Only in 2.4.15pre6aa1: 10_lvm-incremental-2

	Part of it in mainline, rediffed the rest.

Only in 2.4.15pre1aa1: 10_vm-13
Only in 2.4.15pre6aa1: 10_vm-14

	This should be the first kernel out there without the google VM
	troubles (that are affecting more than just google testcase). The
	broken piece of VM was this kind of loop in the allocator:

	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;

		if (zone_free_pages(z, order) > z->pages_low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}

	and the above logic is present in all 2.4 kernels out there (2.3 as well).
	So the bug has nearly nothing to do with the memory balancing engine as
	most of us would expect, it's an allocator zone balancing bug instead in
	a piece of code that one would assume to be obviously correct.

	The problem comes from the fact that all of ZONE_NORMAL can be filled by
	unfreeable highmem users (like anon pages when no swap is available).
	If that happens the machine runs out of memory no matter what (even if
	there are 63G of clean cache ready to be freed).  Mainline deadlocks
	because of the infinite loop in the allocator; -aa was ""correctly""
	just killing tasks as soon as the normal zone was filled with mlocked
	cache or anon pages with no swap.
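
The failure mode described above can be reproduced in a small
user-space model. This is only a sketch under simplifying assumptions:
struct zone_model, alloc_page_old() and pin_pages() are hypothetical
names, only order-0 allocations are modelled, and a failed allocation
returns -1 where the real allocator would go on to try to free memory.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two zone_t fields that matter here. */
struct zone_model {
	unsigned long free_pages;
	unsigned long pages_low;
};

/*
 * Old policy: walk the NULL-terminated zonelist and take a page from
 * the first zone whose free count is above its own pages_low.
 * Returns the index of the zone used, or -1 if every zone refused.
 */
int alloc_page_old(struct zone_model **zonelist)
{
	for (size_t i = 0; zonelist[i] != NULL; i++) {
		struct zone_model *z = zonelist[i];

		if (z->free_pages > z->pages_low) {
			z->free_pages--;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Issue up to n allocations that are never freed (standing in for
 * mlocked or swapless anonymous pages). Returns how many succeeded.
 */
unsigned long pin_pages(struct zone_model **zonelist, unsigned long n)
{
	unsigned long done = 0;

	while (done < n && alloc_page_old(zonelist) >= 0)
		done++;
	return done;
}
```

With a highmem zonelist of { highmem, normal }, pinned highmem-class
allocations first drain the highmem zone and then drain the normal zone
down to its pages_low, after which a normal-class allocation has
nothing left to take.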

	The fix is to have a per-classzone, per-zone set of watermarks (see the
	zone->watermarks[class_idx] array). It seems to work fine here. Of course
	this means potentially wasting some memory when the highmem zone is
	huge, but there's no other way around it, and the potential waste of all the
	highmem memory is huge compared to a very small waste of the normal
	zone. (It could be more fine-grained of course: for example, we don't keep
	track of whether an allocation will generate a page freeable by the VM or
	not. But those are minor issues and not easily solvable anyway [we pin
	pages with a get_page and we certainly don't want to migrate pages
	across zones within get_page], and the core problem should now be fixed.)

	Since the logic is generic and also applies to zone dma vs zone
	normal (not only zone normal vs zone highmem), this should be tested a
	bit on the lowmem boxes too (I took care of the lowmem boxes in
	theory, but I didn't test it in practice).

	In short, we now reserve a part of the lower zones for the lower
	classzone allocations. The algorithm I wrote calculates the "reserved
	portion" as a function of the size of the higher zone (higher zone means
	the "zone" that matches the "classzone"). For example, a 1G machine will
	reserve a very small part of the zone_normal. A 64G machine is going
	to reserve all the 800 Mbytes of zone normal for allocations from
	the normal classzone instead (this is fine because it would be a total
	waste if a 64G machine risked running OOM because the zone normal
	is all occupied by unfreeable highmem users that would be much better off
	in the highmem zone). The ratio between the higher zone size and the
	reserved lower zone size is selectable via a boot option a la memfrac=
	(the new option is called lower_zone_reserve=). The default values should
	work well (as usual they don't need to be perfect, but they can be
	changed if you have suggestions); the boot option is there just in case.
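
One way to picture the per-classzone watermarks is the user-space
model below. It is a guess at the shape of the idea, not the vm-14
code: the struct layout, the function names, the exact reserve formula
(size of the classzone's own zone divided by a ratio, capped at the
lower zone's size) and the ratio value used in any example are all
assumptions.

```c
#include <stddef.h>

#define MAX_ZONES 3	/* e.g. DMA, NORMAL, HIGHMEM, lowest first */

/* Hypothetical model of a zone with one low watermark per classzone. */
struct zone_model {
	unsigned long size;
	unsigned long free_pages;
	unsigned long pages_low;
	unsigned long watermark_low[MAX_ZONES];
};

/*
 * For zone j and classzone class_idx > j, raise the watermark by
 * size(classzone) / ratio. Assumed formula, for illustration only.
 */
void build_watermarks(struct zone_model zones[], int nzones,
		      unsigned long ratio)
{
	for (int j = 0; j < nzones; j++) {
		for (int class_idx = 0; class_idx < nzones; class_idx++) {
			unsigned long wm = zones[j].pages_low;

			if (class_idx > j)
				wm += zones[class_idx].size / ratio;
			if (wm > zones[j].size)
				wm = zones[j].size;
			zones[j].watermark_low[class_idx] = wm;
		}
	}
}

/* Allocate one page for a classzone, falling back to lower zones. */
int alloc_page_class(struct zone_model zones[], int class_idx)
{
	for (int j = class_idx; j >= 0; j--) {
		if (zones[j].free_pages > zones[j].watermark_low[class_idx]) {
			zones[j].free_pages--;
			return j;
		}
	}
	return -1;
}

/* Pin up to n pages for a classzone; returns how many succeeded. */
unsigned long pin_pages(struct zone_model zones[], int class_idx,
			unsigned long n)
{
	unsigned long done = 0;

	while (done < n && alloc_page_class(zones, class_idx) >= 0)
		done++;
	return done;
}
```

In this model a highmem-class allocation stops taking pages from the
normal zone well before the zone's own pages_low is reached, so
normal-class allocations keep a reserve that highmem users cannot pin.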

Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1-simple-cache-1

	Backed out the anon pages from the lru again, mainly to avoid
	swapping out too easily and because this is going to be tested on the
	big boxes with no swap at all anyway.

Only in 2.4.15pre1aa1: 50_uml-patch-2.4.13-5.bz2
Only in 2.4.15pre6aa1: 50_uml-patch-2.4.14-2.bz2

	Latest Jeff's uml update.

Only in 2.4.15pre1aa1: 60_tux-2.4.13-ac5-B0.bz2
Only in 2.4.15pre6aa1: 60_tux-2.4.13-ac5-B1.bz2

	Latest Ingo's tux update.

Andrea

Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!news.tele.dk!
small.news.tele.dk!195.158.233.21!news1.ebone.net!news.ebone.net!
news.net.uni-c.dk!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Mon, 19 Nov 2001 18:40:27 +0100
From: Andrea Arcangeli <and...@suse.de>
To: linux-ker...@vger.kernel.org
Cc: b...@google.com, brown...@irridia.com, phill...@bonn-fries.net,
        Linus Torvalds <torva...@transmeta.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: 2.4.15pre6aa1 (fixes google VM problem)
Original-Message-ID: <20011119184027.Q1331@athlon.random>
Original-References: <20011118092434.A1...@athlon.random>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="Md/poaVZ8hnGTzuv"
Content-Disposition: inline
User-Agent: Mutt/1.3.12i
In-Reply-To: <20011118092434.A1331@athlon.random>; 
from andrea@suse.de on Sun, Nov 18, 2001 at 09:24:34AM +0100
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Mon, 19 Nov 2001 17:44:01 GMT
Message-ID: <fa.gjhhg4v.mq6pl@ifi.uio.no>
References: <fa.gjhdfjv.mu790@ifi.uio.no>
Lines: 276


On Sun, Nov 18, 2001 at 09:24:34AM +0100, Andrea Arcangeli wrote:
> If all works right on Monday I will port the fix to mainline (it's

Ok here it is against 2.4.15pre6:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

(also attached to the email). Untested on top of mainline but should be
safe to apply. It also prevents GFP_ATOMIC allocations from interrupts
from eating the PF_MEMALLOC reserve (longstanding fix from Manfred).

Andrea



Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!193.213.112.26!newsfeed1.ulv.nextra.no!
nextra.com!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Mon, 19 Nov 2001 10:57:35 -0800 (PST)
From: Linus Torvalds <torva...@transmeta.com>
To: Andrea Arcangeli <and...@suse.de>
cc: <linux-ker...@vger.kernel.org>, <b...@google.com>, <brown...@irridia.com>,
        <phill...@bonn-fries.net>, Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: 2.4.15pre6aa1 (fixes google VM problem)
In-Reply-To: <20011119184027.Q1331@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0111191036010.8281-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Scanned-By: MIMEDefang 2.1 (www dot roaringpenguin dot com slash mimedefang)
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Mon, 19 Nov 2001 19:04:39 GMT
Message-ID: <fa.ocp7evv.l4c732@ifi.uio.no>
References: <fa.gjhhg4v.mq6pl@ifi.uio.no>
Lines: 82

On Mon, 19 Nov 2001, Andrea Arcangeli wrote:

>
> Ok here it is against 2.4.15pre6:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.15pre6/zone-watermarks-1

Hmm.. I see what you're trying to do, but it seems overly complicated to
me.

Correct me if I'm wrong, but what you really want to say is basically:

   If we're doing an allocation from many zones, we don't want to allow an
   allocator that can use big zones to deplete the small zones.

You do this by building up this "per-zone-per-classzone" array, which
basically says that "if you had a big classzone, your minimum requirements
for the next zonelist are higher".

Now, I'd rather look at it from another angle: the fact is that the simple
for-loop that allows any allocator to allocate equal amounts of memory
from any zone it wants is kind of unfair. So the for-loop is arguably
broken.

So we currently have a for-loop that looks like

	for (;;) {
		zone_t *z = *(zone++);
		..
		min = z->pages_low;
		..
	}

and the basic problem is that the above loop doesn't have any "memory": we
really want it to remember the fact that it has had an earlier zone that
was perhaps large, and not just see each new zone as an independent
allocation decision.

So why not have the much simpler patch to just say:

	min = 0;
	for (;;) {
		zone_t *z = *(zone++);
		..
		min = (min >> 2) + z->pages_low;
		..
	}

or similar that simply _ages_ the "min" according to previous zones that
we've already tried. That makes the data structures much simpler, and
shows much more clearly what it is we are actually trying to do. We're
trying to say that the size of the previous zones in the allocation list
_does_ matter. Basically now we have a "history" of how much memory we
have already looked at.

(The "(min >> 2) + new" is obviously just a first try, I'm not claiming
it's a particularly good aging function, but it's the standard kind of
exponential aging approach).

With something like the above, the threshold of allocation in smaller
zones is much higher: let's say that your HIGHMEM zone is four times as
big as your NORMAL zone, then a HIGHMEM allocation will want to see twice
as much memory in the NORMAL zone as a NORMAL allocation would want to.
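
The "twice as much" figure can be checked with a few lines of C. This
is a sketch, not allocator code: aged_thresholds() is a hypothetical
helper, and the example assumes pages_low scales with zone size, so a
HIGHMEM zone four times the size of NORMAL also has four times the
pages_low.

```c
#include <stddef.h>

/*
 * Apply the aging rule "min = (min >> 2) + pages_low" along a zonelist.
 * pages_low[i] is the low watermark of the i-th zone tried, and
 * thresholds[i] receives the "min" that zone i is compared against.
 */
void aged_thresholds(const unsigned long *pages_low, size_t nzones,
		     unsigned long *thresholds)
{
	unsigned long min = 0;

	for (size_t i = 0; i < nzones; i++) {
		min = (min >> 2) + pages_low[i];
		thresholds[i] = min;
	}
}
```

For a HIGHMEM allocation with pages_low of { 400, 100 } (HIGHMEM first,
then NORMAL) the thresholds come out as { 400, 200 }, while a NORMAL
allocation with { 100 } sees a threshold of 100 in the same NORMAL zone.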

See what I'm saying? The above algorithm more closely follows what we
really want to do, and by doing so it makes the code much simpler to
follow (no "What does this 'z->watermarks[class_idx].low' thing mean?"
questions), not to mention causing simpler data structures etc.

The actual _behaviour_ should be pretty close to yours (modulo the
differences in calculating the watermarks - your
"lower_zone_reserve_ratio" setup is not quite the same thing as just
shifting by 2 every time).

		Linus


Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
logbridge.uoregon.edu!hub1.nntpserver.com!news-out.spamkiller.net!
propagator-la!news-in-la.newsfeeds.com!news-in.superfeed.net!
newsfeed.media.kyoto-u.ac.jp!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Mon, 19 Nov 2001 12:38:25 -0800 (PST)
From: Linus Torvalds <torva...@transmeta.com>
To: Andrea Arcangeli <and...@suse.de>
cc: <linux-ker...@vger.kernel.org>, <b...@google.com>, <brown...@irridia.com>,
        <phill...@bonn-fries.net>, Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: 2.4.15pre6aa1 (fixes google VM problem)
In-Reply-To: <Pine.LNX.4.33.0111191036010.8281-100000@penguin.transmeta.com>
Original-Message-ID: <Pine.LNX.4.33.0111191229270.8501-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Scanned-By: MIMEDefang 2.1 (www dot roaringpenguin dot com slash mimedefang)
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Mon, 19 Nov 2001 21:46:02 GMT
Message-ID: <fa.odptcnv.k4a5bd@ifi.uio.no>
References: <fa.ocp7evv.l4c732@ifi.uio.no>
Lines: 38


On Mon, 19 Nov 2001, Linus Torvalds wrote:
>
> So why not have the much simpler patch to just say:
>
> 	min = 0;
> 	for (;;) {
> 		zone_t *z = *(zone++);
> 		..
> 		min = (min >> 2) + z->pages_low;
> 		..

Actually, as we already limit "pages_low" (for _all_ zones) through the
use of zone_balance_max[], I don't think we need to even age the minimum
pages.

And instead of doing "zone->free_pages - (1UL << order)" in
zone_free_pages(), we can do it much more efficiently just once for the
for-loop by initializing "min" to "(1UL << order)" instead of zero. So
we'd just make the loop be

	min = (1UL << order);
	for (;;) {
		zone_t *z = *(zone++);
		..
		min += z->pages_low;
		...

instead, which is even simpler (and then just compare z->free_pages
against "min" directly).
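
Filled out as a user-space model, the cumulative loop might look like
the sketch below. The struct and function names are hypothetical, the
strict ">" comparison is an assumption (the message only says to
compare against "min"), and a refusal simply returns -1 here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two zone_t fields used by the loop. */
struct zone_model {
	unsigned long free_pages;
	unsigned long pages_low;
};

/*
 * Walk a NULL-terminated zonelist, accumulating pages_low into "min"
 * so that each later (smaller) zone must also cover the watermarks of
 * the zones tried before it. Returns the index of the zone used, or
 * -1 if no zone has enough free pages.
 */
int alloc_pages_cumulative(struct zone_model **zonelist, unsigned int order)
{
	unsigned long min = 1UL << order;

	for (size_t i = 0; zonelist[i] != NULL; i++) {
		struct zone_model *z = zonelist[i];

		min += z->pages_low;
		if (z->free_pages > min) {
			z->free_pages -= 1UL << order;
			return (int)i;
		}
	}
	return -1;
}
```

With highmem at 5 free pages and normal at 15 free pages (pages_low of
10 each), an order-0 highmem allocation is refused in both zones, while
a normal-class allocation from the same normal zone still succeeds.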

		Linus
