2.2.0 Bug summary

From: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
Subject: 2.2.0 Bug summary
Date: 1998/12/29
Message-ID: <fa.m25408v.141ks8r@ifi.uio.no>#1/1
X-Deja-AN: 426631192
Original-Date: Tue, 29 Dec 1998 01:46:20 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <199812290146.BAA12687@terrorserver.swansea.linux.org.uk>
To: linux-ker...@vger.rutgers.edu, torva...@transmeta.com
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Ok this is the collated 'bad bug set', also the -ac diffs divided up
into the relevant sections


Unfixed and definitely needing fixes

o	TCP slow performance problem is still not merged from DaveM
o	Run two processes that keep rejoining multicast groups on an
	SMP box - crash
o	Spam the box remotely with syn floods and other crap, it leaks
	memory
o	select/poll magically break at some number of handles without
	an error
o	Procfs has locking errors on mm's
o	IDE probe often guesses wrong. Linux is impossible to install
	on these ranges of PCs. Needs fixing badly.
o	isdn4linux is the old version, not the CVS one. Basically unusable.
	If it's not changing for 2.2.0 it should be commented out or deleted
o	eata-dma driver crashes the machine if at any instant it cant
	grab atomic isa dma memory. (Possible fix mark it obsolete and
	use eata.c which works fine)
o	Video4linux bttv tends to crash machines grabbing - fix around
	needs merging and the driver updating
o	You can't mount an ext2fs cdrom. (Block size error). Works in 2.0
o	generic_file_mmap and MSDOS/UMSDOS disagree over who clears
	blocks
o	bootp autobooting stuff corrupts other hosts arp stuff it seems
o	DaveM reports a pile of VMA operations done without locks held.
o	IDE defaults to multimode on causing serial problems, corruption
	with some drives, and hangs on boot with others.
o	Dual 486 boards won't boot SMP kernels
o	Tulip driver/fast routing stuff needs to be resolved. If they cant
	be merged the default tulip should be a current one.

Unfixed but not vital

o	NFS client over tcp doesnt work
o	NFS readahead is too low
o	NFS performance to 8K page sized BSD boxes sucks rocks, 2.0.x
	is about 5 times faster
o	Linus VM is still 20% slower than sct vm on an 8Mb machine
	[benchmarks kernel build and netscape]
o	fchmod on AF_UNIX sockets doesnt work like BSD
o	IPv6 calls set_multicast_list in the wrong context
o	TCP fails to handle small SO_SNDBUF/RCVBUF settings
o	Make xconfig needs layout fixes


o	Need to review all CONFIG_EXPERIMENTAL tags
	
Fixed in -ac patches

For Linus:

o	AVL tree vm avoids bad performance problems
o	MediaGX crashes on boot
o	Certain numbers of scsi disks dont seem to work
o	VFS clears setuid/gid flags wrongly on directories
o	COSA credited twice
o	string.h egcs fixes
o	Some further time fixes
o	Various time fixes submitted
o	KNFSD patches. With them knfsd seems to work ok. With the current
	tree it doesnt work at all. Probably this is "Experimental for 2.2"
o	AMD stepping ident, K6 ident
o	What the hell is going on in time.c, on a low memory box picking
	586 gives better performance for a 486 and several other chips
	without TSC registers. That patch piece is a bad way to save 1K
o	Various config combinations don't build
o	FTAPE doesnt work in .132/2.2.0pre
o	Various of the time_* changes to net/* are one out
o	Ted's last serial patch is missing (setserial crashes box)
o	IBMMCA doesnt work on the model 77 internal scsi
o	Trond's last NFS fix
o	include/linux/sysctl.h is exposed to user tasks even with glibc,
	but isnt strictly ANSI compliant
o	SYS5 shm debugging slows stuff down measurably -ifdef it
o	DVD's trip an isofs sanity check wrongly


Unsure:

o	Large file array support (will be required by vendors for several
	big name products). This is a tricky one. Im wearing too many hats
	to judge this objectively. Vendors will probably ship this anyway
	or something similar.

Linus doesnt want:

o	QlogicFC - no big problem, it's separate, it's clean and vendors
	can ship it and other driver addons easily as they do now. It's a
	no-brainer to install off the net.

Stale ?:

o	ADFS updates
o	Load unversioned modules into versioned kernels when doing
	request_module etc.
o	Crashes and zero page scribbles using ptrace.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: 2.2.0 Bug summary
Date: 1999/01/01
Message-ID: <fa.iqgpdmv.1d0sp15@ifi.uio.no>
X-Deja-AN: 427612386
Original-Date: Thu, 31 Dec 1998 19:00:18 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.981231182534.658A-100000@laser.bogus>
References: <fa.m25408v.141ks8r@ifi.uio.no>
To: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 29 Dec 1998, Alan Cox wrote:

> o	Linus VM is still 20% slower than sct vm on an 8Mb machine
> 	[benchmarks kernel build and netscape]

Today I started playing with Linus's VM in 2.2.0-pre1. I changed the
semantics of many things and added a heuristic so that one process
thrashing memory will not hang other "normal" processes. The new VM I
developed today is _far_ better than sct's ac11 VM and anything I tried
before. I would like somebody to try it on low-memory machines too and
report what happens there. I don't have enough spare time to test it on
many kinds of hardware myself.

The same benchmark that took 106 sec on clean 2.2.0-pre1 to dirty
160 Mbyte of virtual memory (run with 128 Mbyte of RAM and 72 Mbyte of
swap as physical memory) now runs in 90 sec, but this is not the most
important thing. The good point is that the cache/buffer/swap levels are
now perfectly stable, and all other processes run fine and do not get
pushed out of cache even if there's a memory thrasher running at the same
time.

Comments?

Ah, the shrink_mmap limit was wrong since we account only for unreferenced
pages.

Patch against 2.2.0-pre1:

Patch

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.iqghe6v.1d08ph7@ifi.uio.no>#1/1
X-Deja-AN: 427617980
Original-Date: Thu, 31 Dec 1998 19:34:40 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.981231193257.330B-100000@laser.bogus>
References: <fa.iqgpdmv.1d0sp15@ifi.uio.no>
To: Alan Cox <a...@terrorserver.swansea.linux.org.uk>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 31 Dec 1998, Andrea Arcangeli wrote:

> Comments?
> 
> Ah, the shrink_mmap limit was wrong since we account only for unreferenced
> pages.
> 
> Patch against 2.2.0-pre1:

Whoops, in the last email I forgot to adjust the subject (adding
[patch]) and to remove these printk calls:

Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.1.2.43 linux/mm/vmscan.c:1.1.1.1.2.45
--- linux/mm/vmscan.c:1.1.1.1.2.43	Thu Dec 31 17:56:27 1998
+++ linux/mm/vmscan.c	Thu Dec 31 19:41:06 1998
@@ -449,11 +449,7 @@
 	case 0:
 		/* swap_out() failed to swapout */
 		if (shrink_mmap(priority, gfp_mask))
-		{
-			printk("swapout 0 shrink 1\n");
 			return 1;
-		}
-		printk("swapout 0 shrink 0\n");
 		return 0;
 	case 1:
 		/* this would be the best but should not happen right now */



Andrea Arcangeli


From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.iqv61tv.16gg7bi@ifi.uio.no>#1/1
X-Deja-AN: 427757351
Original-Date: Fri, 1 Jan 1999 17:44:55 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990101171008.1145B-100000@laser.bogus>
References: <fa.iqghe6v.1d08ph7@ifi.uio.no>
To: Benjamin Redelings I <brede...@ucsd.edu>, "Stephen C. Tweedie" 
<s...@redhat.com>, Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I'll try to comment on my latest VM patch.

The patch basically does two things.

It adds a heuristic to block thrashing tasks in try_to_free_pages() and
allow normal tasks to run fine in the meantime.

It returns to the old do_try_to_free_pages() way of doing things. I think
the reason the old way was no longer working well is that we were using
swap_out() like the other freeing methods, while swap_out() really has
nothing to do with them.

To get VM stability under low memory we must use both swap_out() (which
puts pages from the user process's virtual memory into the swap cache) and
shrink_mmap() in a new way. My new method puts user pages in the swap
cache because there we can handle aging very well. Then shrink_mmap() can
free an unreferenced page to make real progress in freeing memory (and not
only in the swapout).

So basically my patch will certainly cause the system to swap out more
than it used to, but most of the time we will not need a swapin to put the
pages back into the process's virtual memory.

Somebody reported a big slowdown of the thrashing application. Right now I
don't know which bit of the patch caused this slowdown (yesterday my
benchmark here didn't show it). My new trashing_memory heuristic will
probably decrease performance for the thrashing application (but hey, you
know that if you need performance you can always buy more RAM ;), but it
will improve performance a lot for normal non-thrashing tasks.

I'll try to change do_free_user_and_cache() to see if I can achieve
something better.

I also changed swap_out(), since I think the best way to choose a process
is to compare the raw RSS. And I don't want swap_cnt to be decreased by
something every time something is swapped out. I want the kernel to
continue passing through all the pages of one process once it has started
playing with it (if it still exists, of course ;). I also changed the
pressure of swap_out(), since it makes no sense to me to pass more than
once over the VM of all tasks in the system. Now at priority 6 swap_out()
tries to swap out something from at most nr_tasks/7 tasks (lower bound
1 task). I also changed the pressure of shrink_mmap() because it made no
sense to me to do two passes over just the unreferenced pages.
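
The task budget described above can be sketched as a small function. This
is only an illustration: the message gives the priority-6 case (nr_tasks/7,
with a lower bound of one task), while the general divisor priority + 1 and
the helper name swap_out_task_budget are assumptions, not code taken from
the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many tasks one swap_out() call may look at.
 * Only the priority-6 case (nr_tasks / 7, at least 1) comes from the
 * message; the general divisor (priority + 1) is an assumption.
 */
static int swap_out_task_budget(int nr_tasks, int priority)
{
	int budget;

	assert(nr_tasks >= 0 && priority >= 0);

	budget = nr_tasks / (priority + 1);	/* linear in priority */
	if (budget < 1)
		budget = 1;			/* lower bound: one task */
	return budget;
}
```

Under these assumptions, 70 tasks at priority 6 give a budget of 10 tasks,
and a box with only 3 tasks still scans 1.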

I also changed swap_out(), allowing it to return 0, 1 or more.

0 means that swap_out() was not able to put anything in the swap cache.

1 means that swap_out() was able to swap something out and has also
freed one page (how??? it can't right now, because the page should
always still be present at least in the swap cache).

2 means that swap_out() has swapped out 1 page and that the page is still
referenced somewhere (probably by the swap cache).

So in case 2 and case 0 we must use shrink_mmap() to make real progress
in the page freeing. This is the idea that my new
do_free_user_and_cache() follows.
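
The return-code handling described above can be modelled in userspace as
follows. This is a sketch, not the patch itself: struct vm_stubs,
stub_shrink_mmap() and free_user_and_cache() are hypothetical stand-ins for
the kernel's swap_out(), shrink_mmap() and do_free_user_and_cache(), and
returning 0 when shrink_mmap() fails after a code-2 swapout is an
assumption.

```c
#include <assert.h>

/* Result codes as described in the message. */
enum swap_result {
	SWAP_NOTHING = 0,	/* nothing could be put in the swap cache */
	SWAP_FREED = 1,		/* swapped out and a page was really freed */
	SWAP_STILL_CACHED = 2	/* swapped out, page still referenced */
};

/* Hypothetical stubs standing in for the kernel functions. */
struct vm_stubs {
	int swap_out_result;	/* what the stubbed swap_out() reports */
	int shrink_mmap_result;	/* nonzero if shrink_mmap() frees a page */
	int shrink_mmap_calls;	/* how often shrink_mmap() was needed */
};

static int stub_shrink_mmap(struct vm_stubs *vm)
{
	vm->shrink_mmap_calls++;
	return vm->shrink_mmap_result;
}

/* Returns 1 if a page was really freed, 0 otherwise. */
static int free_user_and_cache(struct vm_stubs *vm)
{
	assert(vm != 0);

	switch (vm->swap_out_result) {
	case SWAP_FREED:
		return 1;	/* best case: nothing more to do */
	case SWAP_NOTHING:
	default:		/* SWAP_STILL_CACHED or more */
		/* only shrink_mmap() can make real freeing progress here */
		return stub_shrink_mmap(vm) ? 1 : 0;
	}
}
```

In this model shrink_mmap() is called for codes 0 and 2 and skipped for
code 1, which matches the rule stated in the text.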

Comments?

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/01
Message-ID: <fa.ing1dev.1e00p93@ifi.uio.no>
X-Deja-AN: 427806048
Original-Date: Fri, 1 Jan 1999 21:02:29 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990101203728.301B-100000@laser.bogus>
References: <fa.iqv61tv.16gg7bi@ifi.uio.no>
To: Benjamin Redelings I <brede...@ucsd.edu>, "Stephen C. Tweedie" 
<s...@redhat.com>, Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I rediffed my VM patch against test1-patch-2.2.0-pre3.gz. I also fixed
some bugs (not totally critical, but..) pointed out by Linus in my last
code. I also changed the shrink_mmap(0) to shrink_mmap(priority) because
it was costing a lot of performance. There is no need to do a
shrink_mmap(0) if, for example, the cache/buffers are under min. In such
a case we must allow swap_out() to grow the cache before starting to
shrink it.

So basically this new patch is _far_ more efficient than the last
one (I have never seen such good/stable/fast behavior before!).

This new patch is against testing/test1-patch-2.2.0-pre3.gz, which is
against v2.1/2.2.0-pre2, which is against patch-2.2.0-pre1-vs-2.1.132.gz
(where is this last one now?).

Ah, testing/test1-patch-2.2.0-pre3.gz was missing the trashing memory
initialization that allows every process to do a fast start.

Patch

If this patch decreases performance for you (possibly due to too much
memory being swapped out) you can try this incremental patch (I never
tried it here, btw):

Index: mm//vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.1.2.49
diff -u -r1.1.1.1.2.49 vmscan.c
--- vmscan.c	1999/01/01 19:29:19	1.1.1.1.2.49
+++ linux/mm/vmscan.c	1999/01/01 19:51:22
@@ -441,6 +441,9 @@

 static int do_free_user_and_cache(int priority, int gfp_mask)
 {
+	if (shrink_mmap(priority, gfp_mask))
+		return 1;
+
 	switch (swap_out(priority, gfp_mask))
 	{
 	default:

I wrote a swap benchmark that dirties 160 Mbyte of VM. For the first
loop 2.2-pre1 took 106 sec, for the second loop 120, and then worse.

test1-pre3 + my new patch in this email instead takes 120 sec in the
first loop (since it's allocating, it's probably slowed down a bit by the
trashing_memory heuristic, and that's right), then 90 sec in the second
loop and 77 sec in the third loop!! And the system was far from idle (as
when I measured 2.2-pre1): I was using it without special care and it was
perfectly usable (2.2-pre1 was unusable instead).

Comments?

From: Steve Bergman <st...@netplus.net>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.coatldv.pi2vin@ifi.uio.no>#1/1
X-Deja-AN: 427867234
Original-Date: Fri, 01 Jan 1999 17:46:26 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <368D5E52.FE8B7B8@netplus.net>
References: <fa.ing1dev.1e00p93@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <Pine.LNX.3.96.990101203728.301B-100...@laser.bogus>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Andrea Arcangeli wrote:


> 
> Please stop and try my new patch against Linus's test1-pre3 (that just
> merge some of my new stuff).

I got the patch and I must say I'm impressed.  I ran my "117 image" test
and got these results:

[Note: This loads 117 different images at the same time using 117
separate instances of 'xv' started in the background and results in ~
165 MB of swap area usage.  The machine is an AMD K6-2 300 with 128MB]


2.1.131-ac11                         172 sec  (This was previously the best)
2.2.0-pre1 + Arcangeli's 1st patch   400 sec
test1-pre  + Arcangeli's 2nd patch   119 sec (!)

Processor utilization was substantially greater with the new patch
compared to either of the others.  Before it starts using swap, memory
is being consumed at ~ 4MB/sec.  After it starts to swap out, it streams
out at ~ 2MB/sec.

The performance is ~ 45% better than ac11 and ~ 70% better than
2.2.0-pre1 in this test.  

I was going to test the low memory case but got sidetracked.


Thanks,
Steve

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.no4l96v.1j7mk8i@ifi.uio.no>
X-Deja-AN: 427932608
Original-Date: Fri, 1 Jan 1999 22:55:09 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990101225111.16066K-100000@penguin.transmeta.com>
References: <fa.coatldv.pi2vin@ifi.uio.no>
To: Steve Bergman <st...@netplus.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Fri, 1 Jan 1999, Steve Bergman wrote:
>
> I got the patch and I must say I'm impressed.  I ran my "117 image" test
> and got these results:
> 
> 2.1.131-ac11                         172 sec  (This was previously the best)
> 2.2.0-pre1 + Arcangeli's 1st patch   400 sec
> test1-pre  + Arcangeli's 2nd patch   119 sec (!)

Would you care to do some more testing? In particular, I'd like to hear
how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
only minor updates)? I'd like to calibrate the numbers against that,
rather than against kernels that I haven't actually ever run myself. 

The other thing I'd like to hear is how pre3 looks with this patch, which
should behave basically like Andrea's latest patch but without the
obfuscation he put into his patch..

		Linus

-----
Code


From: Steve Bergman <st...@netplus.net>
Subject: Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Date: 1999/01/02
Message-ID: <fa.fkjva6v.42carq@ifi.uio.no>#1/1
X-Deja-AN: 427952017
Original-Date: Sat, 02 Jan 1999 02:33:50 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <368DD9EE.D19A4D61@netplus.net>
References: <fa.no4l96v.1j7mk8i@ifi.uio.no>
To: unlisted-recipients:; (no To-header on input)
Original-References: <Pine.LNX.3.95.990101225111.16066K-100...@penguin.transmeta.com>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Linus Torvalds wrote:
> 
> On Fri, 1 Jan 1999, Steve Bergman wrote:
> >
> > I got the patch and I must say I'm impressed.  I ran my "117 image" test
> > and got these results:
> >
> > 2.1.131-ac11                         172 sec  (This was previously the best)
> > 2.2.0-pre1 + Arcangeli's 1st patch   400 sec
> > test1-pre  + Arcangeli's 2nd patch   119 sec (!)
> 
> Would you care to do some more testing? In particular, I'd like to hear
> how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
> only minor updates)? I'd like to calibrate the numbers against that,
> rather than against kernels that I haven't actually ever run myself.
> 
> The other thing I'd like to hear is how pre3 looks with this patch, which
> should behave basically like Andrea's latest patch 

Hi Linus,

Andrea sent another patch to correct a problem with i/o bound processes,
which he also posted to linux-kernel.  The performance in this test is
unchanged.

Here are the results:


2.1.131-ac11                                    172 sec  

2.2.0-pre1 + Arcangeli's 1st patch              400 sec
test1-pre  + Arcangeli's 2nd patch              119 sec 
test1-pre  + Arcangeli's 3rd patch              119 sec
test1-pre  + Arcangeli's 3rd patch              117 sec 
(changed to priority = 9 in mm/vmscan.c)

2.2.0-pre3                                      175 sec
2.2.0-pre3 + Linus's patch                      129 sec

RH5.2 Stock (2.0.36-0.7)                        280 sec



I noticed while watching 'vmstat 1' during the test that
'2.2.0+Linus patch' was not *quite* as smooth as the Arcangeli patches,
in that there were periods of 2 or 3 seconds in which the swap out rate
would fall to ~800k/sec and then jump back up to 1.8-2.5MB/sec.  I have
only run your patch once though.  I'll check it further tomorrow to
confirm that that is really the case.  Note how much better 2.2 is doing
compared to 2.0.36-0.7 in this situation.

I should be available for a good part of this weekend for further
testing; Just let me know.

As a reference:

AMD K6-2 300
128MB ram
2GB seagate scsi2 dedicated to swap
Data drive is 6.5GB UDMA


Steve Bergman
st...@netplus.net

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm improvement , 
[Re: 2.2.0 Bug summary]]
Date: 1999/01/05
Message-ID: <fa.j1f026v.100m7ju@ifi.uio.no>
X-Deja-AN: 429190288
Original-Date: Mon, 4 Jan 1999 19:08:00 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990104183954.1944B-100000@laser.bogus>
References: <fa.in0bdmv.1fg2oh2@ifi.uio.no>
To: Steve Bergman <st...@netplus.net>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I have a new revolutionary patch. The main thing is that I killed kswapd
just to make Rik happy ;).

Ah, and my last patches had a little bug that was surely hurting
performance against Linus's VM: I was stopping kswapd when nr_free_pages >
freepages.high was true, and not, as Linus was rightly doing, when
nr_free_pages > freepages.high + swap_cluster. So I was causing a lot of
kswapd wakeups.
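
The effect of the stop threshold on the number of wakeups can be
illustrated with a toy simulation. Everything in it is an assumption made
for the sketch: the wake condition (nr_free_pages <= high), one page
consumed per allocation, one page reclaimed per step, and the function name
count_wakeups. Only the two stop conditions come from the message.

```c
#include <assert.h>

/*
 * Toy model: count how often a kswapd-like daemon is woken while
 * `allocations` pages are consumed one at a time.
 * stop_margin == 0 models stopping at nr_free_pages > high;
 * stop_margin == swap_cluster models the high + swap_cluster variant.
 */
static int count_wakeups(int high, int stop_margin, int allocations)
{
	int nr_free_pages = high + stop_margin + 1;	/* start just above stop */
	int wakeups = 0;

	assert(high >= 0 && stop_margin >= 0);

	for (int i = 0; i < allocations; i++) {
		nr_free_pages--;			/* one page allocated */
		if (nr_free_pages <= high) {		/* assumed wake condition */
			wakeups++;
			/* reclaim until the stop condition holds */
			while (!(nr_free_pages > high + stop_margin))
				nr_free_pages++;
		}
	}
	return wakeups;
}
```

With high = 100 and 320 allocations, this model wakes the daemon 320 times
with no margin but only 9 times with a margin of 32 (an illustrative value
for swap_cluster), because each wakeup then rebuilds a reserve that lasts
for 33 allocations.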

There was also something to improve in the trashing_memory heuristic: the
trashing bit should be removed only if PF_MEMALLOC is not set.

Ah, and the swapout code seems to like linear rather than exponential
priority handling. Probably it likes to succeed more than shrink_mmap()
does.

If you try it, let me know. I am interested in the image load test
(which should be the closest to the real world).

With this patch the swapout performance is doubled. The swapout benchmark
that used to take 100 sec with my old code and with Linus's VM now runs
in 50 sec! Now I get 6 Mbyte/sec (3 so and 3 si) instead of 3 Mbyte/sec
(1.5 so, 1.5 si). 6 Mbyte/sec is the performance reported by hdparm -t,
btw ;). And the whole system is perfectly fluid (far more fluid than with
the old code). I can open an xterm without waiting for seconds. The cache
does not get kicked out. It seems really great here. When the system goes
OOM it seems to recover fine.

Here is arca-vm-6 against 2.2.0-pre4:

Patch

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm improvement , 
[Re: 2.2.0 Bug summary]]
Date: 1999/01/05
Message-ID: <fa.ns557uv.1t7ulgm@ifi.uio.no>#1/1
X-Deja-AN: 429053804
Original-Date: Mon, 4 Jan 1999 12:56:27 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990104125147.32215U-100000@penguin.transmeta.com>
References: <fa.j1f026v.100m7ju@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Mon, 4 Jan 1999, Andrea Arcangeli wrote:
>
> I have a new revolutionary patch. The main thing is that I killed kswapd
> just to make Rik happy ;).

Ehh..

You may have made Rik happy, but you totally missed the reason for kswapd. 
And while your patch looked interesting (a lot cleaner than the previous
ones, and I _like_ patches that remove code), the fact that you killed
kswapd means that it is essentially useless. 

Basically, we _have_ to have kswapd, and I'll tell you why:
 - imagine running low on memory due to GFP_ATOMIC
 - imagine not having any normal processes that do memory allocation.

Boom. You just killed the machine with your patch, because maybe the
GFP_ATOMIC things are what the machine is doing. Imagine a machine that
acts as a router - it might not even be running any normal user processes
at _all_, but it had damn well better make sure that memory is always
available some way. "kswapd" did that for us, and Rik's happiness counts
as nothing in face of basic facts of life like that. Sorry.

		Linus
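
The router argument above can be sketched with a toy userspace model. It
assumes that atomic allocations can never reclaim memory themselves and
that nothing else on the box allocates; struct mem_model, alloc_atomic(),
background_reclaim() and failed_atomic_allocs() are hypothetical names, and
the watermark and batch size are arbitrary.

```c
#include <assert.h>

/* Toy memory model; every name here is hypothetical, not kernel code. */
struct mem_model {
	int nr_free_pages;	/* pages available right now */
	int reclaimable;	/* pages a reclaim pass could still free */
};

/* GFP_ATOMIC-style allocation: it may not sleep, so it cannot reclaim. */
static int alloc_atomic(struct mem_model *m)
{
	if (m->nr_free_pages == 0)
		return -1;
	m->nr_free_pages--;
	return 0;
}

/* Background pass in the spirit of kswapd: refill when memory is low. */
static void background_reclaim(struct mem_model *m, int low, int batch)
{
	if (m->nr_free_pages >= low)
		return;
	while (batch-- > 0 && m->reclaimable > 0) {
		m->reclaimable--;
		m->nr_free_pages++;
	}
}

/* Count failed atomic allocations on a box with no ordinary allocators. */
static int failed_atomic_allocs(struct mem_model m, int ticks, int with_daemon)
{
	int failures = 0;

	assert(ticks >= 0);

	for (int t = 0; t < ticks; t++) {
		if (alloc_atomic(&m) < 0)
			failures++;
		if (with_daemon)
			background_reclaim(&m, 4, 8);	/* low = 4, batch = 8 */
	}
	return failures;
}
```

Starting from 8 free and 100 reclaimable pages, 50 atomic allocations fail
42 times without the background pass and never with it, since in this model
nothing but the daemon ever refills the free list.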

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , 
improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/07
Message-ID: <fa.ofpdegv.k4a0g7@ifi.uio.no>#1/1
X-Deja-AN: 429891336
Original-Date: Wed, 6 Jan 1999 15:35:01 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990106153252.7800D-100000@penguin.transmeta.com>
References: <fa.ingbd7v.1f0qph0@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu


Oh, well.. Based on what the arca-[678] patches did, there's now a pre-5
out there. Not very similar, but it should incorporate the basic idea: 
namely much more aggressively asynchronous swap-outs from a process
context. 

Comment away,

		Linus


From: ebiederm+e...@ccr.net (Eric W. Biederman)
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , 
improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/08
Message-ID: <fa.g9iq3pv.snq0bp@ifi.uio.no>#1/1
X-Deja-AN: 430009088
Original-Date: 06 Jan 1999 22:30:59 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <m1aezvg0vw.fsf@flinx.ccr.net>
References: <fa.ofpdegv.k4a0g7@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990106153252.7800D-100...@penguin.transmeta.com>
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

>>>>> "LT" == Linus Torvalds <torva...@transmeta.com> writes:

LT> Oh, well.. Based on what the arca-[678] patches did, there's now a pre-5
LT> out there. Not very similar, but it should incorporate the basic idea: 
LT> namely much more aggressively asynchronous swap-outs from a process
LT> context. 

LT> Comment away,

1) With your comments on PG_dirty/(what shrink_mmap should do) you
   have worked out what needs to happen for the mapped in memory case,
   and I haven't quite gotten there.  Thank You.

2) I have tested using PG_dirty from shrink_mmap and it is a
   performance problem because it loses all locality of reference,
   and because it forces shrink_mmap into a dual role, of freeing and
   writing pages, which need separate tuning.

Linus, is this a case you feel is important to tune for 2.2?
If so I would be happy to play with it.

Eric

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , 
improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/08
Message-ID: <fa.obqbegv.h4k0gc@ifi.uio.no>#1/1
X-Deja-AN: 430359625
Original-Date: Thu, 7 Jan 1999 09:56:03 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990107093746.4270H-100000@penguin.transmeta.com>
References: <fa.g9iq3pv.snq0bp@ifi.uio.no>
To: "Eric W. Biederman" <ebiederm+e...@ccr.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 6 Jan 1999, Eric W. Biederman wrote:
> 
> 1) With your comments on PG_dirty/(what shrink_mmap should do) you
>    have worked out what needs to happen for the mapped in memory case,
>    and I haven't quite gotten there.  Thank You.

Note that it is not finalized. That's why I didn't write the code (which
should be fairly simple), because it has some fairly subtle issues and
thus becomes a 2.3.x thing, I very much suspect.

Basically, my rule of thumb for the changes I did was: "it should have the
same code paths as the old code". What that means is that I didn't
actually do any changes that changed real code: I did only changes that
changed _behaviour_.

That way I can be reasonably hopeful that there are no new bugs introduced
even though performance is very different. I _do_ have some early data
that seems to say that this _has_ uncovered a very old deadlock condition: 
something that could happen before but was almost impossible to trigger. 

The deadlock I suspect is:
 - we're low on memory
 - we allocate or look up a new block on the filesystem. This involves
   getting the ext2 superblock lock, and doing a "bread()" of the free
   block bitmap block.
 - this causes us to try to allocate a new buffer, and we are so low on
   memory that we go into try_to_free_pages() to find some more memory.
 - try_to_free_pages() finds a shared memory file to page out.
 - trying to page that out, it looks up the buffers on the filesystem it
   needs, but deadlocks on the superblock lock.
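
The suspected call chain can be modelled in a few lines of single-threaded
C. This is a sketch of the scenario only: the lock is a plain flag, a second
acquisition on the same path is reported as -1 instead of hanging, and all
function names (model_lock_super() and friends) are hypothetical stand-ins
rather than the kernel's own code.

```c
#include <assert.h>

/* Non-recursive lock modelled as a flag; names are hypothetical. */
struct sb_lock {
	int held;
};

/* Returns 0 on success, -1 where the real code would block forever. */
static int model_lock_super(struct sb_lock *sb)
{
	if (sb->held)
		return -1;	/* same path already holds it: deadlock */
	sb->held = 1;
	return 0;
}

static void model_unlock_super(struct sb_lock *sb)
{
	sb->held = 0;
}

/* Paging out a shared file needs the filesystem, hence the lock. */
static int model_page_out_shared(struct sb_lock *sb)
{
	if (model_lock_super(sb) < 0)
		return -1;
	model_unlock_super(sb);
	return 0;
}

/* Reading the bitmap block needs a buffer; low memory forces reclaim. */
static int model_bread(struct sb_lock *sb, int low_on_memory)
{
	if (low_on_memory)
		return model_page_out_shared(sb);
	return 0;
}

/* Block allocation takes the superblock lock before reading the bitmap. */
static int model_allocate_block(struct sb_lock *sb, int low_on_memory)
{
	int ret;

	assert(sb != 0);

	if (model_lock_super(sb) < 0)
		return -1;
	ret = model_bread(sb, low_on_memory);
	model_unlock_super(sb);
	return ret;
}
```

With enough memory the allocation succeeds; under memory pressure the
reclaim path comes back to the lock that the allocation path already holds,
which is the step where the real code would hang.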

Note that this could happen before too (I've not removed any of the
codepaths that could lead to it), but it was dynamically _much_ less
likely to happen.

I'm not even sure it really exists, but I have some really old reports
that _could_ be due to this, and a few more recent ones (that I never
could explain). And I have a few _really_ recent ones from here internally
at Transmeta that look like it's triggering more easily these days.

(Note that this is not actually pre5-related: I've been chasing this on
and off for some time, and it seems to have just gotten easier to trigger,
which is why I finally have a theory on what is going on - just a theory
though, and I may be completely off the mark). 

The positive news is that if I'm right in my suspicions it can only happen
with shared writable mappings or shared memory segments. The bad news is
that the bug appears rather old, and no immediate solution presents
itself. 

> 2) I have tested using PG_dirty from shrink_mmap and it is a
>    performance problem because it loses all locality of reference,
>    and because it forces shrink_mmap into a dual role, of freeing and
>    writing pages, which need separate tuning.

Exactly. This is part of the complexity.

The right solution (I _think_) is to conceptually always mark it PG_dirty
in vmscan, and basically leave all the nasty cases to the filemap physical
page scan. But in the simple cases (ie a swap-cached page that is only
mapped by one process and doesn't have any other users), you'd start the
IO "early".

That would essentially mean that normal single mappings get the good
locality, while the case we really suck at right now (multiple mappings
which can all dirty the page) would not cause excessive page-outs. 

Basically, I think that the stuff we handle now with the swap-cache we do
well on already, and we'd only really want to handle the shared memory
case with PG_dirty. But I think this is a 2.3 issue, and I only added the
comment (and the PG_dirty define) for now. 

> Linus is this a case you feel is important to tune for 2.2?
> If so I would be happy to play with it.

It might be something good to test out, but I really don't want patches at
this date (unless your patches also fix the above deadlock problem, which
I can't see them doing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: arca-vm-8 [Re: [patch] arca-vm-6, killed kswapd [Re: [patch] new-vm , 
improvement , [Re: 2.2.0 Bug summary]]]
Date: 1999/01/09
Message-ID: <fa.oa9tg0v.jkq00l@ifi.uio.no>#1/1
X-Deja-AN: 430431821
Original-Date: Thu, 7 Jan 1999 14:57:34 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990107144729.5025P-100000@penguin.transmeta.com>
References: <fa.obqbegv.h4k0gc@ifi.uio.no>
To: "Eric W. Biederman" <ebiederm+e...@ccr.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 7 Jan 1999, Linus Torvalds wrote:
> 
> The deadlock I suspect is:
>  - we're low on memory
>  - we allocate or look up a new block on the filesystem. This involves
>    getting the ext2 superblock lock, and doing a "bread()" of the free
>    block bitmap block.
>  - this causes us to try to allocate a new buffer, and we are so low on
>    memory that we go into try_to_free_pages() to find some more memory.
>  - try_to_free_pages() finds a shared memory file to page out.
>  - trying to page that out, it looks up the buffers on the filesystem it
>    needs, but deadlocks on the superblock lock.

Confirmed. Hpa was good enough to reproduce this, and my debugging code
caught the (fairly deep) deadlock: 

	system_call ->
	sys_write ->
	ext2_file_write ->
	ext2_getblk ->
	ext2_alloc_block ->	** gets superblock lock **
	ext2_new_block ->
	getblk ->
	refill_freelist ->
	grow_buffers ->
	__get_free_pages ->
	try_to_free_pages ->
	swap_out ->
	swap_out_process ->
	swap_out_vma ->
	try_to_swap_out ->
	filemap_swapout ->
	filemap_write_page ->
	ext2_file_write ->
	ext2_getblk ->
	ext2_alloc_block ->
	__wait_on_super		** BOOM - we want the superblock lock again **

and I suspect the fix is fairly simple: I'll just add back the __GFP_IO
bit (we kind of used to have one that did something similar) which will
make the swap-out code not write out shared pages when it allocates
buffers. 

The better fix would actually be to make sure that filesystems do not hold
locks around these kinds of blocking operations, but that is harder to do
at this late stage.
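
As a rough illustration of the __GFP_IO idea, here is a toy user-space
model in C. It is not kernel code: SIM_GFP_IO, struct sim_fs and the sim_*
functions are made-up names, and the "deadlock" is only counted rather than
actually hanging. The assumption is simply that an allocation made while
the superblock lock is held passes a mask without the IO bit, so reclaim
never re-enters the filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag modelled on the __GFP_IO idea:
 * "reclaim is allowed to start filesystem IO". */
#define SIM_GFP_IO 0x01u

struct sim_fs {
    bool super_locked;  /* non-recursive superblock lock */
    int  deadlocks;     /* times we would have slept on our own lock */
    int  dirty_shared;  /* dirty shared pages that need the fs to write out */
    int  clean_pages;   /* pages that can be dropped without any IO */
};

/* Writing a shared page back needs the superblock lock. */
static bool sim_write_shared_page(struct sim_fs *fs)
{
    if (fs->super_locked) {
        fs->deadlocks++;    /* in the real trace this is __wait_on_super */
        return false;
    }
    fs->dirty_shared--;
    return true;
}

/* Reclaim one page; only touches the filesystem if the mask allows IO. */
static bool sim_try_to_free_page(struct sim_fs *fs, unsigned gfp_mask)
{
    /* As in the trace above, reclaim reaches the dirty shared page first. */
    if ((gfp_mask & SIM_GFP_IO) && fs->dirty_shared > 0)
        return sim_write_shared_page(fs);
    if (fs->clean_pages > 0) {
        fs->clean_pages--;
        return true;
    }
    return false;
}

/* Block allocation: takes the lock, then needs a page for a buffer. */
bool sim_alloc_block(struct sim_fs *fs, unsigned gfp_mask)
{
    assert(!fs->super_locked);
    fs->super_locked = true;
    bool ok = sim_try_to_free_page(fs, gfp_mask);
    fs->super_locked = false;
    return ok;
}
```

With SIM_GFP_IO set, sim_alloc_block() runs into its own lock; with the bit
cleared it falls back to a clean page and never calls into the filesystem.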

		Linus


From: Savochkin Andrey Vladimirovich <s...@msu.ru>
Subject: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/10
Message-ID: <fa.e9ql71v.9no40g@ifi.uio.no>#1/1
X-Deja-AN: 430851879
Original-Date: Sat, 9 Jan 1999 12:43:04 +0300
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <19990109124304.C26523@castle.nmd.msu.ru>
References: <fa.oa9tg0v.jkq00l@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990107093746.4270H-100...@penguin.transmeta.com> 
<Pine.LNX.3.95.990107144729.5025P-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

I've found another deadlock.
Two processes were locked trying to grab an inode write semaphore.
Their call traces are (in diff format):

 Using `map-2.2.0pre5-1' to map addresses to symbols.
 
 Trace: c010f038 <__down+58/90>
 Trace: c018d080 <__down_failed+8/c>
 Trace: c011abaa <filemap_write_page+a6/15c>
 Trace: c011acad <filemap_swapout+4d/60>
 Trace: c011e2ae <try_to_swap_out+10a/1ac>
 Trace: c011e45a <swap_out_vma+10a/174>
 Trace: c011e521 <swap_out_process+5d/8c>
 Trace: c011e60b <swap_out+bb/e4>
 Trace: c011e75b <try_to_free_pages+4b/70>
 Trace: c011ef61 <__get_free_pages+b5/1dc>
-Trace: c0119cd7 <try_to_read_ahead+2f/124>
-Trace: c011a970 <filemap_nopage+170/304>
-Trace: c0118888 <do_no_page+54/e4>
-Trace: c01189e4 <handle_mm_fault+cc/168>
+Trace: c0118375 <do_wp_page+19/210>
+Trace: c0118a3a <handle_mm_fault+122/168>
 Trace: c010ce9f <do_page_fault+143/364>

I suspect that one of the processes grabbed the semaphore and then deadlocked
trying to do it again.  Probably the process invoked write()
with the data having been swapped out.  The page fault handler
tried to free some memory and try_to_free_pages decided to write
out dirty pages of a shared mapping.  By accident the dirty pages
happened to belong to the file the process had started to write to.

A simple solution would be to check whether the inode semaphore is held
before trying to write pages out, and skip the mapping if it is.
However, it doesn't seem to be a very good solution, because if most
memory is occupied by dirty pages of a shared mapping, then
writing the pages out is the right thing to do.

Best wishes
					Andrey V.
					Savochkin

On Thu, Jan 07, 1999 at 02:57:34PM -0800, Linus Torvalds wrote:
[snip]
> Confirmed. Hpa was good enough to reproduce this, and my debugging code
> caught the (fairly deep) deadlock: 
> 
> 	system_call ->
> 	sys_write ->
> 	ext2_file_write ->
> 	ext2_getblk ->
> 	ext2_alloc_block ->	** gets superblock lock **
> 	ext2_new_block ->
> 	getblk ->
> 	refill_freelist ->
> 	grow_buffers ->
> 	__get_free_pages ->
> 	try_to_free_pages ->
> 	swap_out ->
> 	swap_out_process ->
> 	swap_out_vma ->
> 	try_to_swap_out ->
> 	filemap_swapout ->
> 	filemap_write_page ->
> 	ext2_file_write ->
> 	ext2_getblk ->
> 	ext2_alloc_block ->
> 	__wait_on_super		** BOOM - we want the superblock lock again **


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/10
Message-ID: <fa.oca1fgv.hkq1gf@ifi.uio.no>#1/1
X-Deja-AN: 430944742
Original-Date: Sat, 9 Jan 1999 10:00:27 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990109095521.2572A-100000@penguin.transmeta.com>
References: <fa.e9ql71v.9no40g@ifi.uio.no>
To: Savochkin Andrey Vladimirovich <s...@msu.ru>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sat, 9 Jan 1999, Savochkin Andrey Vladimirovich wrote:
> 
> I've found another deadlock.

Yes. This is a case I knew about, and that Alan already mentioned. Trying
to write from a shared mapping has a path that can take the write
semaphore twice.

This one is a whole lot harder to fix - the previous one needed only a
simple extra flag, this one is truly nasty.

The cleanest solution I can think of is actually to allow semaphores to be
recursive. I can do that with minimal overhead (just one extra instruction
in the non-contention case), so it's not too bad, and I've wanted to do it
for certain other things, but it's still a nasty piece of code to mess
around with. 

Oh, well. I don't think I have much choice. Making the swap-out routines
refuse to touch an inode that is busy is a sure way to let bad users lock
down infinite amounts of memory.

		Linus


From: Steve Bergman <st...@netplus.net>
Subject: Re: Results: pre6 vs pre6+zlatko's_patch  vs pre5 vs arcavm13
Date: 1999/01/10
Message-ID: <fa.flid25v.55mbrj@ifi.uio.no>#1/1
X-Deja-AN: 431007061
Original-Date: Sat, 09 Jan 1999 18:28:50 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <3697F442.222A2301@netplus.net>
References: <fa.fd4f7mv.1tlkl86@ifi.uio.no>
Original-References: <Pine.LNX.3.96.990107001448.1242B-100...@laser.bogus> 
<36942ACA.3F8C0...@netplus.net> <3697DA94.F0F32...@netplus.net>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Steve Bergman wrote:

I ran the "image test" (loading 116 jpg images simultaneously) on the latest
patches and got these results in 128MB (I end up with ~ 160MB in swap):

pre6+zlatko's_patch	2:35
pre6			2:27
pre5			1:58
arcavm13		9:13

Arcavm13 (the star performer in the low memory test) is having problems here. 
Pre5, which performed about the same as the others in my low memory test
and which I ignored in my even lower 12MB test, looks quite good here.  Based on
its good performance here, I decided to run the 12MB kernel compile test on it
as well.  (See what happens when I try to cut corners...)

In 12MB:
                        Elapsed Maj.    Min.    Swaps
                        ------- ------  ------  -----
pre6+zlatko_patch       22:14   383206  204482  57823
pre6                    20:54   352934  191210  48678
pre5                    19:35   334680  183732  93427
arcavm13                19:45   344452  180243  38977

Pre5 is looking good.  Based upon the tests that I have run, anyway.  I agree
with the person who expressed a distrust of benchmarks.  But numbers are
necessary for tuning.  "Feels faster" is just not a very trustworthy thing.  So
I also agree with one of the responses:

"Try out your favorite apps and time some portion of them and post any
interesting numbers." (paraphrased)

Benchmarks are not the problem.  The problem is the lack of comprehensiveness,
or the tunnel-vision if you prefer, that benchmarks can lead one into.  Find a
way to quantify the things that you do everyday and post the results.

-Thanks
-Steve


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: Results: pre6 vs pre6+zlatko's_patch  vs pre5 vs arcavm13
Date: 1999/01/11
Message-ID: <fa.odpnfov.h401o8@ifi.uio.no>#1/1
X-Deja-AN: 431067627
Original-Date: Sat, 9 Jan 1999 21:35:40 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990109213225.4665G-100000@penguin.transmeta.com>
References: <fa.flid25v.55mbrj@ifi.uio.no>
To: Steve Bergman <st...@netplus.net>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Sat, 9 Jan 1999, Steve Bergman wrote:
> 
> I ran the "image test" (loading 116 jpg images simultaneously) on the latest
> patches and got these results in 128MB (I end up with ~ 160MB in swap):
> 
> pre6+zlatko's_patch	2:35
> pre6			2:27
> pre5			1:58
> arcavm13		9:13

Can you run pre6+zlatko with just the mm/page_alloc.c one-liner reverted
to pre5? That is, take pre6+zlatko, and just change 

	try_to_free_pages(gfp_mask, freepages.high - nr_free_pages);

back to

	try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);

That particular one-liner was almost certainly a mistake, it was done on
the mistaken assumption that the clustering problem was due to
insufficient write-time clustering - while zlatko found that it was
actually due to fragmentation in the swap area. With zlatko's patch, the
original SWAP_CLUSTER_MAX is probably better and almost certainly results
in smoother behaviour due to less extreme free_pages.. 

		Linus



From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.ocppgov.h4a389@ifi.uio.no>#1/1
X-Deja-AN: 431087276
Original-Date: Sat, 9 Jan 1999 13:50:14 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990109134233.3478A-100000@penguin.transmeta.com>
References: <fa.oca1fgv.hkq1gf@ifi.uio.no>
To: Savochkin Andrey Vladimirovich <s...@msu.ru>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Sat, 9 Jan 1999, Linus Torvalds wrote:
> 
> The cleanest solution I can think of is actually to allow semaphores to be
> recursive. I can do that with minimal overhead (just one extra instruction
> in the non-contention case), so it's not too bad, and I've wanted to do it
> for certain other things, but it's still a nasty piece of code to mess
> around with. 
> 
> Oh, well. I don't think I have much choice.

Does anybody know semaphores by heart? I've got code that may well work,
but the race conditions for semaphores are nasty. As mentioned, this only
adds a single instruction to the common non-contended case, and I really
do believe it should be correct, but it is completely untested (so it
might not work at all), and it would be good to have somebody with some
theory go through this.. 

Basically, these simple changes should make it ok to do recursive
semaphore grabs, so

	down(&sem);
	down(&sem);
	up(&sem);
	up(&sem);

should work and leave the semaphore unlocked.
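
To make the intended semantics concrete, here is a minimal user-space
sketch in C of owner-plus-depth bookkeeping. It is only an illustration:
struct rsem and the rsem_* names are hypothetical, tasks are plain integer
ids, and nothing here is atomic, so it deliberately sidesteps the race
conditions and the lock-free fast path that make the real thing hard.

```c
#include <assert.h>
#include <stdbool.h>

#define RSEM_NO_OWNER (-1)

struct rsem {
    int owner;  /* id of the task holding the semaphore, or RSEM_NO_OWNER */
    int depth;  /* number of nested down() calls made by the owner */
};

void rsem_init(struct rsem *s)
{
    s->owner = RSEM_NO_OWNER;
    s->depth = 0;
}

/* Returns true if acquired, false if the caller would have to sleep. */
bool rsem_down(struct rsem *s, int task)
{
    if (s->owner == RSEM_NO_OWNER) {
        s->owner = task;        /* uncontended: take it */
        s->depth = 1;
        return true;
    }
    if (s->owner == task) {
        s->depth++;             /* nested grab by the same task */
        return true;
    }
    return false;               /* held by somebody else */
}

/* Only the owner may release; the last up() really unlocks. */
void rsem_up(struct rsem *s, int task)
{
    assert(s->owner == task && s->depth > 0);
    if (--s->depth == 0)
        s->owner = RSEM_NO_OWNER;
}

bool rsem_is_free(const struct rsem *s)
{
    return s->owner == RSEM_NO_OWNER;
}
```

In this model two rsem_down() calls by the same task followed by two
rsem_up() calls leave the semaphore free, while a second task is refused
for as long as the first one holds it at any depth.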

Anybody? Semaphore theory used to be really popular at Universities, so
there must be somebody who has some automated proving program somewhere..

		Linus

-----
Code



From: Steve Bergman <st...@netplus.net>
Subject: Re: Results: pre6 vs pre6+zlatko's_patch  vs pre5 vs arcavm13
Date: 1999/01/11
Message-ID: <fa.fg2j6lv.1ulomob@ifi.uio.no>#1/1
X-Deja-AN: 431148214
Original-Date: Sun, 10 Jan 1999 12:43:45 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <3698F4E1.715105C6@netplus.net>
References: <fa.odpnfov.h401o8@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990109213225.4665G-100...@penguin.transmeta.com>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Linus Torvalds wrote:

> Can you run pre6+zlatko with just the mm/page_alloc.c one-liner reverted
> to pre5? That is, take pre6+zlatko, and just change
> 
>         try_to_free_pages(gfp_mask, freepages.high - nr_free_pages);
> 
> back to
> 
>         try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);
> 

OK, here are the updated results:

'Image test' in 128MB:

pre6+zlatko's_patch     	2:35
and with requested change	3:09
pre6                    	2:27
pre5                    	1:58
arcavm13                	9:13


I also ran the kernel compile test:

In 12MB:
				Elapsed	Maj.	Min.	Swaps
				-----	------	------	-----
pre6+zlatko_patch       	22:14   383206  204482  57823
and with requested change	22:23	378662	198194	51445
pre6                    	20:54   352934  191210  48678
pre5                    	19:35   334680  183732  93427 
arcavm13                	19:45   344452  180243  38977

The change seems to have hurt it in both cases.  What I am seeing on pre6 and
its derivatives is a *lot* of *swapin* activity.  Pre5 almost exclusively swaps
*out* during the image test, averaging about 1.25MB/sec (spends a lot of time at
around 2000k/sec) with very little swapping in.  All the pre6 derivatives swap
*in* quite heavily during the test.  The 'so' number sometimes drops to 0 for
seconds at a time.  It also looks like pre6 swaps out slightly more overall
(~165MB vs 160MB).

-Steve


From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.j3h0m5v.ikge1s@ifi.uio.no>#1/1
X-Deja-AN: 431164481
Original-Date: Sun, 10 Jan 1999 16:59:43 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901101659.QAA00922@dax.scot.redhat.com>
References: <fa.ocppgov.h4a389@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990109095521.2572A-100...@penguin.transmeta.com> 
<Pine.LNX.3.95.990109134233.3478A-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Sat, 9 Jan 1999 13:50:14 -0800 (PST), Linus Torvalds
<torva...@transmeta.com> said:

> On Sat, 9 Jan 1999, Linus Torvalds wrote:
>> 
>> The cleanest solution I can think of is actually to allow semaphores to be
>> recursive. I can do that with minimal overhead (just one extra instruction
>> in the non-contention case), so it's not too bad, and I've wanted to do it
>> for certain other things, but it's still a nasty piece of code to mess
>> around with. 

Ack.  I've been having a closer look, and making the superblock lock
recursive doesn't work: the ext2fs allocation code is definitely not
reentrant.  In particular, the bitmap buffers can get evicted out from
under our feet if we reenter the block allocation code, leading to nasty
filesystem and/or memory corruption.  The allocation code can also get
confused if the bitmap contents change between checking the group
descriptor for a block group and reading in the bitmap itself, leading
to potential ENOSPC errors turning up wrongly.

Preventing recursive VM access to the filesystem while we have the
superblock lock seems the only easy way out short of making the
allocation/truncate code fully reentrant.

On the other hand, it does look as if the inode deadlock is dealt with
OK if we just make that semaphore recursive; I can't see anywhere that
dies if we make that change.  This does somewhat imply that we may need
to make a distinction between reentrant and non-reentrant semaphores if
we go down this route.

--Stephen.


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.oc9dgov.hka2o1@ifi.uio.no>#1/1
X-Deja-AN: 431148217
Original-Date: Sun, 10 Jan 1999 10:35:10 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990110103201.7668D-100000@penguin.transmeta.com>
References: <fa.j3h0m5v.ikge1s@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 10 Jan 1999, Stephen C. Tweedie wrote:
> 
> Ack.  I've been having a closer look, and making the superblock lock
> recursive doesn't work

That's fine - the superblock lock doesn't need to be re-entrant, because
__GFP_IO is quite sufficient for that one.

The thing I want to make re-entrant is just semaphore accesses: at the
point where we would otherwise deadlock on the writer semaphore it's much
better to just allow nested writes. I suspect all filesystems can already
handle nested writes - they are a lot easier to handle than truly
concurrent ones.

		Linus


From: Savochkin Andrey Vladimirovich <s...@msu.ru>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/12
Message-ID: <fa.iham6bv.1v0esiv@ifi.uio.no>#1/1
X-Deja-AN: 431454453
Original-Date: Mon, 11 Jan 1999 17:11:38 +0300
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <19990111171138.A9675@castle.nmd.msu.ru>
References: <fa.oc9dgov.hka2o1@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <199901101659.QAA00...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, Jan 10, 1999 at 10:35:10AM -0800, Linus Torvalds wrote:
> The thing I want to make re-entrant is just semaphore accesses: at the
> point where we would otherwise deadlock on the writer semaphore it's much
> better to just allow nested writes. I suspect all filesystems can already
> handle nested writes - they are a lot easier to handle than truly
> concurrent ones.

You're an optimist, aren't you? :-)

In any case I've checked your recursive semaphore code on a news server
which reliably deadlocked with the previous kernels.
The code seems to work well.

Best wishes
					Andrey V.
					Savochkin


From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.j516l6v.g4mdhm@ifi.uio.no>#1/1
X-Deja-AN: 431198177
Original-Date: Sun, 10 Jan 1999 22:49:47 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901102249.WAA01684@dax.scot.redhat.com>
References: <fa.oc9dgov.hka2o1@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <199901101659.QAA00...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Sun, 10 Jan 1999 10:35:10 -0800 (PST), Linus Torvalds
<torva...@transmeta.com> said:

> On Sun, 10 Jan 1999, Stephen C. Tweedie wrote:
>> 
>> Ack.  I've been having a closer look, and making the superblock lock
>> recursive doesn't work

> That's fine - the superblock lock doesn't need to be re-entrant, because
> __GFP_IO is quite sufficient for that one.

I'm no longer convinced about that.  I think it's much much worse.  A
bread() on an ext2 bitmap buffer with the superblock held is only safe
if the IO can complete without _ever_ relying on a GFP_IO allocation.
That means that any interrupt allocations required in that space have to
be satisfiable by kswapd without GFP_IO, or kswapd could deadlock on us.
It means that if our superblock-locked IO has to stall waiting for an
nbd server process or a raid daemon, then those daemons cannot safely do
GFP_IO.  It's really gross.

I think it's actually ugly enough that we cannot make it safe: we can
really only be sure if we prevent all GFP_IO from any process which
might be involved in our deadlock loop, or if we avoid doing any IO with
the superblock lock held.  

It really looks as if the right way around this is to prevent GFP_IO
from deadlocking in the first place, by moving the asynchronous page
writes out of kswapd/try_to_free_page and into a separate worker thread.
That way we can continue to try to reclaim memory somewhere else without
deadlocking.  In that case the only thing we are left having to worry
about is doing a synchronous swapout, where we end up blocking waiting
for the IO thread to complete.  

In fact, to make it really safe we'd need to avoid synchronous swapout
altogether: otherwise we can have

	    A			kswiod		nbd server process
	    lock_super();
	    bread(nbd device);
	    try_to_free_page();
	    rw_swap_page_async();
				filemap_write_page();
				lock_super();
	    wait_on_buffer();
						try_to_free_page();
						rw_swap_page_sync();
						Oops, kswiod is stalled.

Can we get away without synchronous swapout?  Notice that in this case,
kswiod may be blocked but kswapd itself will not be.  As long as the nbd
server does not try to do a synchronous swap, it won't deadlock on
kswiod.  In other words, it is safe to wait for availability of another
free page, but it is not safe to wait for completion of any single,
specific swap IO.  If kswapd itself no longer performs the IO, then we
can always free more memory, until we get to the complete death stage
where there are absolutely no clean pages left in the system.

If we do this, then both the inode and the superblock deadlocks
disappear.
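
A toy single-threaded model in C may help show the shape of this: the
reclaim pass only frees clean pages and queues dirty ones, and a separate
worker step does the writing. All names (struct sim_page, struct io_queue,
reclaim_pass, io_worker_step) are invented for the sketch, and a real
implementation would of course need an actual thread and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QMAX 8

struct sim_page {
    bool dirty;   /* needs IO before it can be freed */
    bool queued;  /* already handed to the IO worker */
    bool freed;   /* reclaimed */
};

struct io_queue {
    struct sim_page *slot[QMAX];
    size_t head, tail;
};

static bool q_push(struct io_queue *q, struct sim_page *p)
{
    if (q->tail - q->head == QMAX)
        return false;                   /* queue full, try again later */
    q->slot[q->tail++ % QMAX] = p;
    return true;
}

/* Reclaim pass: never waits for IO. Returns the number of pages freed. */
size_t reclaim_pass(struct sim_page *pages, size_t n, struct io_queue *q)
{
    size_t freed = 0;
    for (size_t i = 0; i < n; i++) {
        struct sim_page *p = &pages[i];
        if (p->freed)
            continue;
        if (!p->dirty) {
            p->freed = true;            /* clean page: free it right away */
            freed++;
        } else if (!p->queued && q_push(q, p)) {
            p->queued = true;           /* dirty page: queue it and move on */
        }
    }
    return freed;
}

/* IO worker: writes one queued page, making it clean. False when idle. */
bool io_worker_step(struct io_queue *q)
{
    if (q->head == q->tail)
        return false;
    struct sim_page *p = q->slot[q->head++ % QMAX];
    assert(p != NULL);
    p->dirty = false;
    p->queued = false;
    return true;
}
```

The point of the split is that reclaim_pass() can always make progress on
whatever clean pages exist, even while the worker is stuck on one write.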

--Stephen.


From: ebiederm+e...@ccr.net (Eric W. Biederman)
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.f9uhn3v.1s3akh8@ifi.uio.no>#1/1
X-Deja-AN: 431269781
Original-Date: 11 Jan 1999 00:04:11 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <m1aezq4a78.fsf@flinx.ccr.net>
References: <fa.j516l6v.g4mdhm@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
Original-References: <199901101659.QAA00...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> 
<199901102249.WAA01...@dax.scot.redhat.com>
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

>>>>> "ST" == Stephen C Tweedie <s...@redhat.com> writes:

ST> Hi,
ST> On Sun, 10 Jan 1999 10:35:10 -0800 (PST), Linus Torvalds
ST> <torva...@transmeta.com> said:

>> On Sun, 10 Jan 1999, Stephen C. Tweedie wrote:
>>> 
>>> Ack.  I've been having a closer look, and making the superblock lock
>>> recursive doesn't work

>> That's fine - the superblock lock doesn't need to be re-entrant, because
>> __GFP_IO is quite sufficient for that one.

ST> I'm no longer convinced about that.  I think it's much much worse.  A
ST> bread() on an ext2 bitmap buffer with the superblock held is only safe
ST> if the IO can complete without _ever_ relying on a GFP_IO allocation.
ST> That means that any interrupt allocations required in that space have to
ST> be satisfiable by kswapd without GFP_IO, or kswapd could deadlock
ST> on us.

Well interrupts use GFP_ATOMIC . . . 

ST> It means that if our superblock-locked IO has to stall waiting for an
ST> nbd server process or a raid daemon, then those daemons cannot safely do
ST> GFP_IO.  It's really gross.

Right.  And the flag not to do I/O doesn't propagate across processes.
This sounds like a variation of the priority inheritance problem.

I wonder if this is why there are some known deadlocks with raid?

ST> I think it's actually ugly enough that we cannot make it safe: we can
ST> really only be sure if we prevent all GFP_IO from any process which
ST> might be involved in our deadlock loop, or if we avoid doing any IO with
ST> the superblock lock held.  

ST> In fact, to make it really safe we'd need to avoid synchronous swapout
ST> altogether: otherwise we can have

ST> Can we get away without synchronous swapout?  Notice that in this case,
ST> kswiod may be blocked but kswapd itself will not be.  As long as the nbd
ST> server does not try to do a synchronous swap, it won't deadlock on
ST> kswiod.  In other words, it is safe to wait for availability of another
ST> free page, but it is not safe to wait for completion of any single,
ST> specific swap IO.  If kswapd itself no longer performs the IO, then we
ST> can always free more memory, until we get to the complete death stage
ST> where there are absolutely no clean pages left in the system.

ST> If we do this, then both the inode and the superblock deadlocks
ST> disappear.

Sounds good.

I have a daemon just about ready to go; hopefully I can post it
tomorrow for preliminary testing.  It looks like my work for 2.3
can, in some small part, help with the deadlocks after all.

It walks the page tables and just writes out dirty pages, and marks
them clean but it doesn't remove them from processes.  So it can get
an early jump on writing things out.

Then if we are hitting a low memory situation (because pages become
dirty quickly), we can just wake it up more often.
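
Very roughly, and only as a user-space sketch in C with made-up names
(struct sim_pte, writeback_pass), one pass of such a daemon might look
like this. The actual disk write is left out; the sketch only shows the
dirty bit being cleared while the mapping stays in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_pte {
    bool present;  /* page is mapped into the process */
    bool dirty;    /* page has been modified since the last write-out */
};

/* Write back every present dirty entry and mark it clean, but leave it
 * mapped. Returns the number of pages written. */
size_t writeback_pass(struct sim_pte *ptes, size_t n)
{
    size_t written = 0;

    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ptes[i].present && ptes[i].dirty) {
            /* a real daemon would start the disk write here */
            ptes[i].dirty = false;
            written++;
        }
    }
    return written;
}
```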

Currently we are doing totally asynchronous swapping but from the
context of the process that needs memory, (so the locks are in
different processes).  Adding a second daemon will play havoc on our
balancing but it shouldn't affect anything else. 

Grr.  I forgot about sysv shm.  It is the only thing doing synchronous
swapping right now.  

Oh, and just as a side note we are currently unfairly penalizing
threaded programs by doing for_each_task instead of for_each_mm in the
swapout code...
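
A small sketch in C of what I mean, with hypothetical stand-in types
(struct sim_mm, struct sim_task) rather than the real kernel structures:
walking tasks scans a shared address space once per thread, while walking
address spaces scans each one exactly once.

```c
#include <assert.h>
#include <stddef.h>

struct sim_mm   { int scans; };         /* how often this mm got scanned */
struct sim_task { struct sim_mm *mm; }; /* threads share one sim_mm */

/* Per-task walk: every thread sharing an mm triggers another scan of it. */
void scan_per_task(struct sim_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].mm != NULL);
        tasks[i].mm->scans++;
    }
}

/* Per-mm walk: each address space is scanned once, however many
 * threads are attached to it. */
void scan_per_mm(struct sim_mm **mms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(mms[i] != NULL);
        mms[i]->scans++;
    }
}
```

With three threads on one mm and a single-threaded process next to it, the
per-task walk hits the threaded mm three times as often as the other one.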

Eric


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/11
Message-ID: <fa.odadgnv.mka0o6@ifi.uio.no>#1/1
X-Deja-AN: 431421183
Original-Date: Mon, 11 Jan 1999 09:55:59 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990111095116.4886B-100000@penguin.transmeta.com>
References: <fa.iham6bv.1v0esiv@ifi.uio.no>
To: Savochkin Andrey Vladimirovich <s...@msu.ru>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Mon, 11 Jan 1999, Savochkin Andrey Vladimirovich wrote:
> On Sun, Jan 10, 1999 at 10:35:10AM -0800, Linus Torvalds wrote:
> > The thing I want to make re-entrant is just semaphore accesses: at the
> > point where we would otherwise deadlock on the writer semaphore it's much
> > better to just allow nested writes. I suspect all filesystems can already
> > handle nested writes - they are a lot easier to handle than truly
> > concurrent ones.
> 
> You're an optimist, aren't you? :-)

No, drugged to my eye-brows.

> In any case I've checked your recursive semaphore code on a news server
> which reliably deadlocked with the previous kernels.
> The code seems to work well.

I found a rather nasty race in my implementation - it's basically
impossible to trigger in real life, but quite frankly I don't want to have
semaphores that have a really subtle bug in them.

However much I tried, I couldn't make the race go away without using a
spinlock in the critical path of the semaphore, something which I very
much want to avoid.

Unless I find a good recursive semaphore implementation (and I'm starting
to despair about finding one that is lock-free for the non-contention
case), I'll have to come up with something else (like letting only kswapd
swap out pages as has been discussed here).
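
For reference, the semantics being asked of a recursive semaphore can
be sketched as below. This is a single-threaded model only, with
hypothetical names (toy_rsem and friends) and integer task ids assumed
for illustration. It deliberately leaves out the hard part discussed
above, namely making the owner check and the count update race-free
without a spinlock in the fast path.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_OWNER (-1)

/* Hypothetical model of a recursive semaphore; NOT safe for real concurrency. */
struct toy_rsem {
	int owner;	/* task id holding the semaphore, or TOY_NO_OWNER */
	int depth;	/* how many times the owner has taken it */
};

static void toy_rsem_init(struct toy_rsem *s)
{
	s->owner = TOY_NO_OWNER;
	s->depth = 0;
}

/*
 * Returns true if the caller now holds the semaphore (first or nested
 * acquisition), false if a real implementation would have to sleep.
 */
static bool toy_rsem_try_down(struct toy_rsem *s, int task)
{
	assert(task != TOY_NO_OWNER);
	if (s->owner == TOY_NO_OWNER) {
		s->owner = task;
		s->depth = 1;
		return true;
	}
	if (s->owner == task) {
		s->depth++;	/* nested write by the same task: allowed */
		return true;
	}
	return false;		/* held by somebody else */
}

/* Release one level; the semaphore becomes free when depth reaches zero. */
static void toy_rsem_up(struct toy_rsem *s, int task)
{
	assert(s->owner == task && s->depth > 0);
	if (--s->depth == 0)
		s->owner = TOY_NO_OWNER;
}
```

The nested toy_rsem_try_down() by the current owner is the case that
would otherwise deadlock on the writer semaphore.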

			Linus



From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/12
Message-ID: <fa.j2h8klv.jkof1j@ifi.uio.no>#1/1
X-Deja-AN: 431732138
Original-Date: Tue, 12 Jan 1999 16:06:38 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901121606.QAA04800@dax.scot.redhat.com>
References: <fa.f9uhn3v.1s3akh8@ifi.uio.no>
To: ebiederm+e...@ccr.net (Eric W. Biederman)
Original-References: <199901101659.QAA00...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990110103201.7668D-100...@penguin.transmeta.com> 
<199901102249.WAA01...@dax.scot.redhat.com> <m1aezq4a78....@flinx.ccr.net>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman)
said:

> Oh, and just as a side note we are currently unfairly penalizing
> threaded programs by doing for_each_task instead of for_each_mm in the
> swapout code...

I know, on my TODO list...

--Stephen


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/12
Message-ID: <fa.ni5h7uv.1n7elgn@ifi.uio.no>#1/1
X-Deja-AN: 431760196
Original-Date: Tue, 12 Jan 1999 09:54:50 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990112095401.17705A-100000@penguin.transmeta.com>
References: <fa.j2h8klv.jkof1j@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 12 Jan 1999, Stephen C. Tweedie wrote:
> 
> On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman)
> said:
> 
> > Oh, and just as a side note we are currently unfairly penalizing
> > threaded programs by doing for_each_task instead of for_each_mm in the
> > swapout code...
> 
> I know, on my TODO list...

Actually, this one is _really_ easy to fix.

The truly trivial fix is to just move "swap_cnt" into the mm structure,
and you're all done. You'd still walk the list with for_each_task(), but
it no longer matters.
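
A rough user-space model of why this works, assuming swap_cnt acts as
a per-pass budget of pages to take from an address space. The
structures and toy_swapout_pass() are hypothetical simplifications,
not the kernel's task_struct or mm_struct. Because the counter lives
in the shared mm, visiting the same mm once per thread no longer
multiplies the pressure on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address space: the budget lives here, shared by all threads. */
struct toy_mm {
	int swap_cnt;		/* pages we still intend to take from this mm */
	int pages_taken;	/* pages actually taken so far */
};

/* Hypothetical task: threads of one program point at the same toy_mm. */
struct toy_task {
	struct toy_mm *mm;
};

/*
 * One walk over the task list, in the spirit of for_each_task().
 * Each visit takes one page from the task's mm while that mm's shared
 * budget lasts.  Returns the number of pages taken in this walk.
 */
static int toy_swapout_pass(struct toy_task *tasks, size_t n)
{
	int total = 0;

	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++) {
		struct toy_mm *mm = tasks[i].mm;

		if (mm->swap_cnt > 0) {
			mm->swap_cnt--;
			mm->pages_taken++;
			total++;
		}
	}
	return total;
}
```

With a three-thread program and a single-threaded one each starting at
swap_cnt = 2, both end up losing two pages; with a per-task counter
the threaded program would have been charged once per thread.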

		Linus


From: Zlatko Calusic <Zlatko.Calu...@CARNet.hr>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/12
Message-ID: <fa.g9nljkv.tiatab@ifi.uio.no>#1/1
X-Deja-AN: 431785322
Original-Date: 12 Jan 1999 19:44:45 +0100
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <87d84kl49u.fsf@atlas.CARNet.hr>
References: <fa.ni5h7uv.1n7elgn@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.3.95.990112095401.17705A-100...@penguin.transmeta.com>
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
User-Agent: Gnus/5.070069 (Pterodactyl Gnus v0.69) XEmacs/21.2(beta8) (Artemis)
Reply-To: Zlatko.Calu...@CARNet.hr
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Linus Torvalds <torva...@transmeta.com> writes:

> On Tue, 12 Jan 1999, Stephen C. Tweedie wrote:
> > 
> > On 11 Jan 1999 00:04:11 -0600, ebiederm+e...@ccr.net (Eric W. Biederman)
> > said:
> > 
> > > Oh, and just as a side note we are currently unfairly penalizing
> > > threaded programs by doing for_each_task instead of for_each_mm in the
> > > swapout code...
> > 
> > I know, on my TODO list...
> 
> Actually, this one is _really_ easy to fix.
> 
> The truly trivial fix is to just move "swap_cnt" into the mm structure,
> and you're all done. You'd still walk the list with for_each_task(), but
> it no longer matters.
> 
> 		Linus
> 

Not related to this, but I (hopefully correctly) observed that SHM
swap I/O is done synchronously.

Could somebody spare a minute to explain why that is so, and what
needs to be done to make SHM swapping asynchronous?


Also, while we're at MM fixes, I'm appending below a small patch that
will improve interactive feel.

After the number of async pages gets bigger than pager_daemon.swap_cluster
(= SWAP_CLUSTER_MAX), swapin readahead becomes synchronous, and that
hurts performance. It is better to skip readahead in such situations,
and that is also more fair to swapout. Andrea came to exactly the same
conclusion, independent of me (on the same day :)).

diff -urN linux-pre-7/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-pre-7/mm/page_alloc.c	Tue Jan 11 07:28:06 1999
+++ linux/mm/page_alloc.c	Tue Jan 11 07:29:44 1999
@@ -358,6 +358,8 @@
 	for (i = 1 << page_cluster; i > 0; i--) {
 	      if (offset >= swapdev->max)
 		      return;
+	      if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
+		      return;
 	      if (!swapdev->swap_map[offset] ||
 		  swapdev->swap_map[offset] == SWAP_MAP_BAD ||
 		  test_bit(offset, swapdev->swap_lockmap))

Regards,
-- 
Zlatko


From: Rik van Riel <r...@humbolt.geo.uu.nl>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.lgh1brv.kahgm@ifi.uio.no>#1/1
X-Deja-AN: 431953647
Original-Date: Tue, 12 Jan 1999 22:46:08 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9901122245090.4656-100000@mirkwood.dummy.home>
References: <fa.g9nljkv.tiatab@ifi.uio.no>
To: Zlatko Calusic <Zlatko.Calu...@CARNet.hr>
X-Sender: r...@mirkwood.dummy.home
X-Authentication-Warning: mirkwood.dummy.home: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 12 Jan 1999, Zlatko Calusic wrote:

> After the number of async pages gets bigger than
> pager_daemon.swap_cluster (= SWAP_CLUSTER_MAX), swapin readahead
> becomes synchronous, and that hurts performance. It is better to
> skip readahead in such situations, and that is also more fair to
> swapout. Andrea came to exactly the same conclusion, independent
> of me (on the same day :)).

IIRC this facility was in the original swapin readahead
implementation. That only leaves the question who removed
it and why :))

cheers,

Rik -- If a Microsoft product fails, who do you sue?
+-------------------------------------------------------------------+
| Linux memory management tour guide.        r...@humbolt.geo.uu.nl |
| Scouting Vries cubscout leader.    http://humbolt.geo.uu.nl/~riel |
+-------------------------------------------------------------------+



From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.iq05e6v.1agkoh5@ifi.uio.no>#1/1
X-Deja-AN: 432177938
Original-Date: Wed, 13 Jan 1999 14:45:09 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990113144203.284C-100000@laser.bogus>
References: <fa.lgh1brv.kahgm@ifi.uio.no>
To: Rik van Riel <r...@humbolt.geo.uu.nl>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 12 Jan 1999, Rik van Riel wrote:

> IIRC this facility was in the original swapin readahead
> implementation. That only leaves the question who removed
> it and why :))

There's another thing I completely disagree with and that I just removed
here: the alignment of the offset field. I see no point in going back
instead of only doing real read _ahead_.

Maybe I am missing something?

Index: page_alloc.c
===================================================================
RCS file: /var/cvs/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.8
retrieving revision 1.1.1.1.2.29
diff -u -r1.1.1.8 -r1.1.1.1.2.29
--- page_alloc.c	1999/01/11 21:24:23	1.1.1.8
+++ linux/mm/page_alloc.c	1999/01/12 23:00:04	1.1.1.1.2.29
@@ -353,10 +352,10 @@
 	unsigned long offset = SWP_OFFSET(entry);
 	struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
 	
-	offset = (offset >> page_cluster) << page_cluster;
-	
 	for (i = 1 << page_cluster; i > 0; i--) {
-	      if (offset >= swapdev->max)
+	      if (offset >= swapdev->max ||
+		  /* don't block on I/O for doing readahead -arca */
+		  atomic_read(&nr_async_pages) > pager_daemon.max_async_pages)
 		      return;
 	      if (!swapdev->swap_map[offset] ||
 		  swapdev->swap_map[offset] == SWAP_MAP_BAD ||



Andrea Arcangeli



From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.j5hgktv.gkod9m@ifi.uio.no>#1/1
X-Deja-AN: 432211188
Original-Date: Wed, 13 Jan 1999 17:55:56 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901131755.RAA06476@dax.scot.redhat.com>
References: <fa.iq05e6v.1agkoh5@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>, Linus Torvalds <torva...@transmeta.com>
Original-References: <Pine.LNX.4.03.9901122245090.4656-100...@mirkwood.dummy.home> 
<Pine.LNX.3.96.990113144203.284C-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Wed, 13 Jan 1999 14:45:09 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> On Tue, 12 Jan 1999, Rik van Riel wrote:
>> IIRC this facility was in the original swapin readahead
>> implementation. That only leaves the question who removed
>> it and why :))

> There's another thing I completely disagree with and that I just removed
> here: the alignment of the offset field. I see no point in going back
> instead of only doing real read _ahead_.

> Maybe I am missing something?

Yes, very much so.

When paging in binaries, you often have locality of reference in both
directions --- a set of functions compiled from a single source file
will occupy adjacent pages in VM, but you are as likely to call a
function at the end of the region first as one at the beginning.  It
is very common to get backwards locality as a result.

The big advantage of doing aligned clusters for readin is twofold:
first, it means that you get as much of a readahead advantage for
these backwards access patterns as for forward accesses.  Secondly, it
means that you are reading in complete tiles which are guaranteed to
have no gaps between them, so any two accesses in adjacent tiles are
sufficient to read in the complete set of nearby pages without missing
any gaps between them: it avoids having to do yet another IO to fill
in the few pages missed by a strictly forward-looking readahead
function.
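
A small worked sketch of the difference, using the same alignment
expression as the line Andrea removed, (offset >> page_cluster) <<
page_cluster. The helper names are hypothetical and the value
page_cluster = 4 (16-page tiles) in the note below is only an assumed
example.

```c
#include <assert.h>
#include <stdbool.h>

/* Start of the aligned tile containing 'offset' (tile size 1 << page_cluster). */
static unsigned long toy_aligned_start(unsigned long offset,
				       unsigned int page_cluster)
{
	assert(page_cluster < 16);	/* keep the toy shifts well defined */
	return (offset >> page_cluster) << page_cluster;
}

/* Does a readahead window of 1 << page_cluster pages from 'start' cover 'page'? */
static bool toy_window_covers(unsigned long start, unsigned int page_cluster,
			      unsigned long page)
{
	return page >= start && page - start < (1UL << page_cluster);
}
```

With 16-page tiles, a fault at swap offset 37 gives an aligned window
starting at 32, which also brings in page 33 behind the faulting page.
A forward-only window starting at 37 does not cover 33. Faults at 47
and 48 land in the tiles starting at 32 and 48, which together cover
pages 32 to 63 with no gap.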

> +		  /* don't block on I/O for doing readahead -arca */
> +		  atomic_read(&nr_async_pages) > pager_daemon.max_async_pages)
>  		      return;

I think this is the wrong solution: far better to do the patch below,
which simply exempts reads from nr_async_pages altogether.  I
originally added nr_async_pages to serve two functions: to allow
kswapd to determine how much memory it was already in the process of
freeing, and to act as a throttle on the number of write IOs submitted
when swapping.

We don't need a similar throttling action for reads, because every
place where we do VM readahead, each readahead IO cluster is followed
by a synchronous read on one page.  We don't throttle the async
readaheads on normal file IO, for example.

--Stephen

----------------------------------------------------------------
--- mm/page_io.c~	Mon Dec 28 21:56:29 1998
+++ mm/page_io.c	Tue Jan 12 16:45:55 1999
@@ -58,7 +58,8 @@
 	}

 	/* Don't allow too many pending pages in flight.. */
-	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
+	if (rw == WRITE &&
+	    atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
 		wait = 1;

 	p = &swap_info[type];
@@ -170,7 +171,7 @@
 		atomic_dec(&page->count);
 		return;
 	}
- 	if (!wait) {
+ 	if (rw == WRITE && !wait) {
  		set_bit(PG_decr_after, &page->flags);
  		atomic_inc(&nr_async_pages);
  	}


From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.ir0dduv.1bg8o90@ifi.uio.no>#1/1
X-Deja-AN: 432240066
Original-Date: Wed, 13 Jan 1999 19:52:03 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990113191421.185E-100000@laser.bogus>
References: <fa.j5hgktv.gkod9m@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 13 Jan 1999, Stephen C. Tweedie wrote:

> I think this is the wrong solution: far better to do the patch below,
> which simply exempts reads from nr_async_pages altogether.  I
> originally added nr_async_pages to serve two functions: to allow
> kswapd to determine how much memory it was already in the process of
> freeing, and to act as a throttle on the number of write IOs submitted
> when swapping.
> 
> We don't need a similar throttling action for reads, because every
> place where we do VM readahead, each readahead IO cluster is followed
> by a synchronous read on one page.  We don't throttle the async
> readaheads on normal file IO, for example.

Note that we don't need nr_async_pages at all. Here when the limit of
nr_async_pages is low it's only a bottleneck for swapout performance. I
have not removed it (because it could be useful to decrease swapout I/O if
somebody needs this strange feature), but I have added a
pager_daemon.max_async_pages and set it to something like 256. Now I check
nr_async_pages against the new max_async_pages.

I _guess_ (not checked) that the _only_ reason Steve saw arca-vm-16
improve so much when changing SWAP_CLUSTER_MAX to 512 instead of 32 is the
removal of the nr_async_pages bottleneck.

Andrea Arcangeli


From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.j3hijlv.ikqdht@ifi.uio.no>#1/1
X-Deja-AN: 432292951
Original-Date: Wed, 13 Jan 1999 22:10:12 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901132210.WAA07391@dax.scot.redhat.com>
References: <fa.ir0dduv.1bg8o90@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <199901131755.RAA06...@dax.scot.redhat.com> 
<Pine.LNX.3.96.990113191421.185E-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Wed, 13 Jan 1999 19:52:03 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> Note that we don't need nr_async_pages at all. Here when the limit of
> nr_async_pages is low it's only a bottleneck for swapout performance. I
> have not removed it (because it could be useful to decrease swapout I/O if
> somebody needs this strange feature), but I have added a
> pager_daemon.max_async_pages and set it to something like 256. Now I check
> nr_async_pages against the new max_async_pages.

The problem is that if you do this, it is easy for the swapper to
generate huge amounts of async IO without actually freeing any real
memory: there's a question of balancing the amount of free memory we
have available right now with the amount which we are in the process of
freeing.  Setting the nr_async_pages bound to 256 just makes the swapper
keen to send a whole 1MB of memory out to disk at a time, which is a bit
steep on an 8MB box.
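
The arithmetic behind that figure, as a small sketch. It assumes 4 KB
pages (as on i386); the macro and function names are hypothetical.

```c
#include <assert.h>

/* Assumed page size: 4 KB, as on i386.  Other architectures differ. */
#define TOY_PAGE_SIZE 4096UL

/* Bytes of swap IO that may be in flight for a given async page bound. */
static unsigned long toy_inflight_bytes(unsigned long max_async_pages)
{
	return max_async_pages * TOY_PAGE_SIZE;
}

/* Share of total RAM, in percent rounded down, tied up in in-flight IO. */
static unsigned long toy_inflight_percent(unsigned long max_async_pages,
					  unsigned long ram_bytes)
{
	assert(ram_bytes > 0);
	return toy_inflight_bytes(max_async_pages) * 100UL / ram_bytes;
}
```

A bound of 256 pages is 256 * 4096 = 1048576 bytes, i.e. 1MB, or one
eighth of an 8MB machine. A bound of 32 pages is 128KB.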

--Stephen


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [PATCH] Re: MM deadlock [was: Re: arca-vm-8...]
Date: 1999/01/13
Message-ID: <fa.ob97fgv.jkc1g5@ifi.uio.no>#1/1
X-Deja-AN: 432306316
Original-Date: Wed, 13 Jan 1999 14:30:32 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990113142730.6104G-100000@penguin.transmeta.com>
References: <fa.j3hijlv.ikqdht@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 13 Jan 1999, Stephen C. Tweedie wrote:
> 
> The problem is that if you do this, it is easy for the swapper to
> generate huge amounts of async IO without actually freeing any real
> memory: there's a question of balancing the amount of free memory we
> have available right now with the amount which we are in the process of
> freeing.  Setting the nr_async_pages bound to 256 just makes the swapper
> keen to send a whole 1MB of memory out to disk at a time, which is a bit
> steep on an 8MB box.

Note that this should be much less of a problem with the current swapout
strategies, but yes, basically we definitely do want to have _some_ way of
maintaining a sane "maximum number of pages in flight" thing. 

The right solution may be to do the check in some other place, rather than
fairly deep inside the swap logic. 

It's not a big deal, I suspect.

Anyway, there's a real pre7 out there now, and it doesn't change a lot of
the issues discussed here. I wanted to get something stable and working. I
still need to get the recursive semaphore thing (or other approach) done,
but basically I think we're at 2.2.0 already apart from that issue, and
that we can continue this discussion as an "occasional tweaks" thing.

		Linus
