2.2.13aa6 (bugfix release II)


Andrea Arcangeli (andrea@suse.de)
Fri, 17 Dec 1999 16:34:21 +0100 (CET)

I released 2.2.13aa6 mainly to include my latest fs corruption fixes.

The main features of 2.2.13aa6 are:

o Support for 4 gigabytes of RAM (me and Gerhard Wichert)
o Improved VM for high end machines with enough RAM and doing
heavy I/O under high memory pressure (me)
o RAW-IO (also on bigmem) (Stephen C. Tweedie)

o updated with all showstopper/necessary bugfixes discovered in
the 2.2.x kernels over time.

NOTE (2.2.14pre): if you don't need the 4g support and raw-io and your
machine has a workstation load (so you don't do heavy I/O) you should
ignore 2.2.13aa6; I suggest using 2.2.14pre14 plus my
block_dev-fs-corruption patch.

NOTE (raid): if you want to use the latest raid patches
(raid0145-19990824-2.2.11) over 2.2.13aa6 simply apply the raid patch over
2.2.13aa6 and then apply this incremental patch on the resulting kernel:

ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14pre11/set_blocksize-1-raid0145-19990824-2.2.11.gz

Then raid will work just fine.

Side note: I am not including the new raid code in 2.2.x because at least
raid0 is just rock solid in the stock 2.2.13 kernel and I don't want to
force people to convert the on-disk format of their raid device in order
to run 2.2.13aa6. People who want to use raid can go with my incremental
raid fix.

Incremental description of 2.2.13aa6:

------ 2.2.13aa6 --------
diff -u --exclude version.gz 2.2.13aa5 2.2.13aa6
Only in 2.2.13aa6: block_dev-fs-corruption-1.gz

block_dev-fs-corruption-1.gz -> fixes fs corruption in the
blockdevice layer (me)

------ 2.2.13aa5 --------
diff -u --exclude version.gz 2.2.13aa4 2.2.13aa5
Only in 2.2.13aa4: buffer-races-2.2.10-A.gz
Only in 2.2.13aa5: buffer-races-2.2.13-3.gz
Only in 2.2.13aa5: ext2-1.gz

buffer-races-2.2.13-3.gz -> includes the
buffer-races-2.2.10-A.gz
features and it also fixes fs
corruption generated by hdparm or
flushb on an active filesystem and
a minor problem in sync_dev (me)

ext2-1.gz -> fixes bugs that may lead to
ext2 fs corruption or
fsync errors (me)

------ 2.2.13aa4 --------
diff -u --exclude version.gz 2.2.13aa3 2.2.13aa4
Only in 2.2.13aa4: inode-recycle-fixes.gz
Only in 2.2.13aa4: java-proc.gz
Only in 2.2.13aa4: signal-race.gz
Only in 2.2.13aa4: syncookies.gz

inode-recycle-fixes.gz -> fixes an inode leakage (me)

syncookies.gz -> fixed syncookie bug (without the
fix, at the first synflood
the machine will forbid
connections to all hosts; it must
check only the SYN/ACK/FIN
bits and not the data offset
and window of the incoming
packet ;). (Alan Cox)

signal-race.gz -> fixes a race in the send sig path
(David Miller)

java-proc.gz -> reverted the semantic change that
distinguishes between
/proc/00000$$ and /proc/$$; this
preserves backwards compatibility with
a misfeature and it _won't_ hurt
security. There's no downside
in reverting the 2.2.13 semantic
change.

------ 2.2.13aa3 --------
diff -u --exclude version.gz 2.2.13aa2 2.2.13aa3
Only in 2.2.13aa3: dcache-hashfn.gz
Only in 2.2.13aa3: fdset-fix.gz
Only in 2.2.13aa2: z-bigmem-2.2.13aa2-6.gz
Only in 2.2.13aa3: z-bigmem-2.2.13aa3-7.gz

z-bigmem-2.2.13aa3-7.gz -> fixed an obvious silly bigmem
bug that leads
to processes being killed randomly.
(all the credit goes to Leonard N.
Zubkoff)
fdset-fix.gz -> fixed an fdset bug that may lead to
memory corruption and Oopses
(credit goes to
Savochkin Andrey Vladimirovich,
I only backported the 2.3.x patch
to a four liner against 2.3.13)
dcache-hashfn.gz -> use only the dentry noise for
randomizing the dcache hashfn
(all the credit goes to David S.
Miller)

------ 2.2.13aa2 --------

SMP-scheduler-2.2.11-E.gz -> rewrite of reschedule_idle. (me)
buffer-hash.gz -> fixes lowmem box hash size. (me)
buffer-races-2.2.10-A.gz -> fixes race conditions that may lead
to bad things in invalidate_buffers()
and set_blocksize(). (me)
clear-backlog-2.gz -> fixes for a SMP race condition in
the main network backlog handling. (me)
dcache-hash.gz -> dcache hash dynamic (with my
own heuristic). (started from 2.2.13ac1
but then reimplemented by me)
free_page.gz -> cleanup of the __free_pages
interface. (me)
hashed-buffers-2.2.10.gz -> minor fix to increase the debugging
information in the right place. (me)
inode-leak-2.2.10-A.gz -> make sure not to leak memory
by allocating lots of sockets (DoS),
and tell the admin to enlarge
the max-inodes if the admin really
wants more unfreeable memory in the
icache. (me)
kupdate-sigstop-2.2.11-1.gz -> allow kupdate to be stopped via
SIGSTOP (currently it must be stopped
by setting interval to zero via
sysctl). (me)
no-swapout-2.2.10-B.gz -> avoid swapin/swapouts during heavy
I/O (strictly necessary for decent
performance on very I/O and MM loaded
servers). (me)
oom-2.2.12-I.gz -> assorted OOM fixes (deadlocks in
pagein, Alpha SIGBUS fix, avoid
sigkilling iopl() applications and send
a sigterm instead, avoid killing
init); it's the same
patch merged by Alan into 2.2.14pre2. (me)
pagecache-hash.gz -> pagecache hash dynamic (I think
it's DaveM's work, literally I took it
from 2.2.13ac1). I agree with the
heuristic used. It allocates
num_physpages buckets for the pagecache
and this basically means all the
buckets will be filled supposing a
perfect hash distribution with all the memory
allocated in the cache. (all credits
to David S. Miller)
probe-irq-2.3.14-pre2-1.gz -> avoid a pending irq to be mistaken
for a spurious irq. (me)
shrink_all_cache-2.2.10-A.gz -> make sure that big memory boxes will
shrink the cache well enough. (me)
trashing-mem-2.2.10-A.gz -> heuristic to penalize memory hogs,
so the system will remain responsive
even during heavy swapout. (me)
version.gz -> set the EXTRAVERSION to aa2 ;)
wait-event-smp-races.gz -> Put the two mb() after setting the
task state as blocking and before
checking if the event has already happened
(SMP race fix). (me)
wait4-smp-race.gz -> _Critical_ SMP race fix.
Without this one liner each time you
run `ls` from bash, the bash is going
to deadlock in wait4 if you are unlucky
enough. The race window is very small
but there are machines under heavy
fork load that reproduced this
race regularly after some days of load.
The SMP race can happen only
with an SMP kernel on SMP hardware. (me)
wakeup_bdflush-2.2.10-A.gz -> avoid deadlocking in wakeup_bdflush
(the run_task_queue() can sleep for
example while running the loop
request function). (me)
z-bigmem-2.2.13aa2-6.gz -> 4GB support on x86. (me and
Gerhard Wichert)
z-bigmem-nodebug.gz -> turn the bigmem code into production
mode.
z-bigmem-rawio-2.2.13aa2-1.gz -> rawio working even with bigmem memory
(I started with rawio from 2.2.13ac1
and SCT's 2.3.x rawio bounce buffers,
all the credits go to Stephen C.
Tweedie)
zmagic-all-blocksize.gz -> allow zmagic binaries to run
also on 4k filesystems (it's the same
fix that went into 2.2.14pre2). (me)
----------------------------------------------------------------------

To go in sync with 2.2.13aa6 you can:

mkdir 2.2.13aa6
cd 2.2.13aa6
wget --retr-symlinks -A\*.gz ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.13aa6/\*
cd ..

and now you'll have all the interesting patches in the directory
2.2.13aa6.

At this point rename the 2.2.13 sources to 2.2.13aa6:

mv linux-2.2.13 linux-2.2.13aa6
cd linux-2.2.13aa6

and apply all the 2.2.13aa6 patches that you previously downloaded from
the ftp site:

apply-patches.sh ../2.2.13aa6

At this point your tree will be in sync with 2.2.13aa6. Just configure,
recompile and boot the new kernel.

You can find the `apply-patches.sh` bash script I wrote to easily apply
my kernel patches here:

ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/tools/apply-patches/apply-patches.sh.gz

There is also a README on how to use it:

ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/tools/apply-patches/README.gz

The 2.2.13aa6 kernel is placed here:

ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.13aa6/

Have fun! ;)

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: ursus <ur...@usa.net>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 1999/12/20
Message-ID: <fa.l5tasev.1u742a6@ifi.uio.no>#1/1
X-Deja-AN: 563124487
Original-Date: 20 Dec 99 14:51:32 EST
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 8BIT
Original-Message-ID: <19991220195132.10068.qmail@nw179.netaddress.usa.net>
To: Andrea Arcangeli <and...@suse.de>
X-Priority: 1
Content-Type: text/plain; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Read-Receipt: ur...@usa.net
X-MSMail-Priority: High
Disposition-Notification-To: ur...@usa.net
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

In newsgroup fa.linux.kernel, Andrea Arcangeli wrote:

> Date:  Fri, 17 Dec 1999 16:34:21 +0100 (CET)
> From: Andrea Arcangeli <and...@suse.de>
> Subject: 2.2.13aa6 (bugfix release II)
>
> [...] 
> The main features of 2.2.13aa6 are:
> 
> o       Support for 4Gigabyte of RAM (me and Gerhard.Wichert)
> o       Improved VM for high end machines with enough ram and doing
>         heavy I/O under high memory pressure (me)
> o       RAW-IO (also on bigmem) (Stephen C. Tweedie)
> 
> o       updated with all showstopper/necessary bugfixes discovered into
>         the 2.2.x kernels over the time.
>

Andrea:

Thanks for the updated 2.2.13aa6 patchset, especially that
it works with the raid-0.90 patches cleanly! I've been using
Alan Cox's 2.2.13ac3 patches for the raid-0.90 support,
but really wanted to run with your SMP scheduling changes,
since they would seem to help performance/stability with
my application (high-load webserver on dual-PIII machine).
Also I was getting errors regarding "Out of memory" which
you have a couple of patches for in aa6 ...

I upgraded a cluster of servers (Compaq 6400R, 2 x PIII-500)
from 2.2.13ac3 to 2.2.13aa6+raid-0.90 (and the incremental
"set_blocksize" patch you kindly provided) and Don Becker's
eepro.c 1.09l (not sure if this is latest?) in hopes I can
finally have a really stable setup ... these had been running
well for about 12 hours, but I just had one of the servers
crash with the following error (seen before under 2.2.13ac3):

  wait_on_bh, CPU 3:	(this is the first processor)
  irq:	0 [0 0]
  bh:	1 [0 0]
  <[8010b39d]> <[80150daa]> <[80150d46]> <[8012912b]> \
  <[8012a367]> <[801291a6]> <[8012921f]> <[801092ac]>

I tried to correlate the registers above with System.map:

  8010b360 T synchronize_bh
  8010b3b0 T synchronize_irq

  80150d20 t sock_close
  80150d5c t sock_fasync

  8012910c T __fput
  80129154 T filp_close

  8012a350 T fput
  8012a398 T put_filp

  80129154 T filp_close
  801291b0 T sys_close

  801291b0 T sys_close
  80129238 T sys_vhangup

  80109278 T system_call
  801092b0 T ret_from_sys_call

If I press ALT+SysRq+P, the EIP shows "0010:[<80166671>]"
which appears to be related to functions (from System.map):

  80166660 T tcp_send_delayed_ack
  801666b4 T tcp_send_ack
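
The address-to-symbol correlation done by hand above can be sketched in C. This is only a minimal illustration, not a kernel tool: the table holds just the System.map entries quoted in this message, and `struct sym`, `symtab`, `check_sorted`, `lookup_symbol` and `print_symbol` are hypothetical names. The rule is to pick the last symbol whose start address is not greater than the address being resolved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory copy of the System.map lines quoted above,
 * sorted by ascending start address. */
struct sym {
    uint32_t addr;
    const char *name;
};

static const struct sym symtab[] = {
    { 0x80109278u, "system_call" },
    { 0x801092b0u, "ret_from_sys_call" },
    { 0x8010b360u, "synchronize_bh" },
    { 0x8010b3b0u, "synchronize_irq" },
    { 0x8012910cu, "__fput" },
    { 0x80129154u, "filp_close" },
    { 0x801291b0u, "sys_close" },
    { 0x80129238u, "sys_vhangup" },
    { 0x8012a350u, "fput" },
    { 0x8012a398u, "put_filp" },
    { 0x80150d20u, "sock_close" },
    { 0x80150d5cu, "sock_fasync" },
    { 0x80166660u, "tcp_send_delayed_ack" },
    { 0x801666b4u, "tcp_send_ack" },
};

#define NSYMS (sizeof(symtab) / sizeof(symtab[0]))

/* Sanity check: the lookup below relies on strictly ascending addresses. */
static void check_sorted(void)
{
    for (size_t i = 1; i < NSYMS; i++)
        assert(symtab[i - 1].addr < symtab[i].addr);
}

/* Return the last symbol starting at or below addr, or NULL if addr
 * lies below the first entry. Binary search over the sorted table. */
static const struct sym *lookup_symbol(uint32_t addr)
{
    size_t lo = 0, hi = NSYMS;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (symtab[mid].addr <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? &symtab[lo - 1] : NULL;
}

/* Print "address -> symbol+offset" for one address from the trace. */
static void print_symbol(uint32_t addr)
{
    const struct sym *s = lookup_symbol(addr);

    if (s)
        printf("%08x -> %s+0x%x\n", (unsigned)addr, s->name,
               (unsigned)(addr - s->addr));
    else
        printf("%08x -> (unknown)\n", (unsigned)addr);
}
```

Calling `print_symbol(0x80166671u)` resolves the EIP from the SysRq dump to `tcp_send_delayed_ack`, matching the manual lookup. A real System.map has many more entries between these, so an address near the end of a function here could belong to a neighbour that is not in this reduced table.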

In some earlier posts I read that "wait_on_bh"
means that the system is waiting on the bottom half
(SMP-specific), so I've edited my /etc/lilo.conf
to add "nosmp noapic", and I'll see if the servers
run stable w/o SMP ... this isn't a real solution
of course.

Any help/pointers/patches would be greatly appreciated.
In an earlier post I mentioned this is part of a larger
project to upgrade about 100 webservers based on 2.0.36
kernel to 2.2.13+ ... the overall load is 1 billion hits
per day currently. This would be yet another testament
to Linux's viability in the enterprise environment,
assuming I can nail down this SMP problem :)

PS:	in your directory on the ftp.*.kernel.org mirrors,
	I see a patch regarding bh_latency for 2.2.14pre;
	does this address the above "wait_on_bh" problem?

Thanks in advance

--
ur...@usa.net

____________________________________________________________________
Get free email and a permanent address at http://www.netaddress.com/?N=1


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 1999/12/21
Message-ID: <fa.m8ag9hv.vi2dpo@ifi.uio.no>#1/1
X-Deja-AN: 563335229
Original-Date: Tue, 21 Dec 1999 10:42:38 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.9912211015000.24670-100000@Fibonacci.suse.de>
References: <fa.l5tasev.1u742a6@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Authentication-Warning: Fibonacci.suse.de: andrea owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 20 Dec 1999, ursus wrote:

>I upgraded a cluster of servers (Compaq 6400R, 2 x PIII-500)
>from 2.2.13ac3 to 2.2.13aa6+raid-0.90 (and the incremental
>"set_blocksize" patch you kindly provided) and Don Becker's

That's a fine kernel ;)

>[..] I just had one of the servers
>crash with the following error (seen before under 2.2.13ac3):

I changed nothing in my aa patches related to the problem you have, so
it's normal you get it as in the 2.2.13ac3 kernel.

Your report gives interesting info, thanks.

>to add "nosmp noapic", and I'll see if the servers
>run stable w/o SMP ... this isn't a real solution

I bet it will be rock solid in UP. This looks like a genuine SMP race (of
course trusting it's not a hardware issue).

>assuming I can nail down this SMP problem :)

We'll nail it down ;).

>	I see a patch regarding bh_latency for 2.2.14pre;
>	does this address the above "wait_on_bh" problem?

It won't help you, it's performance stuff (and it's not complete yet).

Andrea



From: ur...@usa.net
Subject: Re: [2.2.13aa6 (bugfix release II)]
Date: 1999/12/23
Message-ID: <83tdcd$hr8$1@nnrp1.deja.com>#1/1
X-Deja-AN: 564153809
To: and...@suse.de
X-Http-User-Agent: Mozilla (X11; I; Linux 2.0.32 i586)
X-Http-Proxy: 1.0 x38.deja.com:80 (Squid/1.1.22) for client 199.95.209.163
Organization: Deja.com - Before you buy.
X-Article-Creation-Date: Thu Dec 23 14:59:58 1999 GMT
X-MyDeja-Info: XMYDJUIDray450
Reply-To: ur...@usa.net
Newsgroups: fa.linux.kernel

In article <fa.m8ag9hv.vi2...@ifi.uio.no>,
  Andrea Arcangeli <and...@suse.de> wrote:

> I changed nothing in my aa patches related to the problem
> you have, so it's normal you get it as in the 2.2.13ac3 kernel.

Andrea:

This explains why I'm still having the hangs :(

> I bet it will be rock solid in UP.
> This looks like a genuine SMP race

I've been testing the same servers with "nosmp noapic"
appended to the bootprompt (via LILO) without success;
while I don't see the wait_on_bh crash anymore, instead
the system just hangs without any errors whatsoever,
I'll see a "normal" login prompt at the console except
no characters are echoed and only SysRq [partially] works.
I can't successfully Sync, Unmount via SysRq but reBoot
does work. Also nothing is logged in /var/log/messages
regarding the crash.

I also recompiled 2.2.13aa6 for true UniProcessor mode
and ran the UP kernel on the same servers for 2 days,
with the same errorless crashes under heavy network load.
I haven't been able to get your IKD patchset into this
kernel (and boot successfully, that is).

If I press ALT+SysRq+P after one of these crashes,
the EIP always points to an address which maps to
"timer_bh" (according to System.map). Is this the same
timer_bh problem William Montgomory was discussing
with you in another recent thread?

> We'll nail it down ;).

Thanks for your and the list-members' assistance ...
Please let me know if I can help in any way,
whether to test patches, provide crash traces,
etc.

--
ur...@usa.net

Sent via Deja.com http://www.deja.com/
Before you buy.

From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 2000/01/09
Message-ID: <fa.jrefbqv.a0ulib@ifi.uio.no>#1/1
X-Deja-AN: 570505709
Original-Date: Sun, 9 Jan 2000 21:43:01 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001092136450.11394-100000@alpha.random>
References: <fa.l5tasev.1u742a6@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 20 Dec 1999, ursus wrote:

>I upgraded a cluster of servers (Compaq 6400R, 2 x PIII-500)
>from 2.2.13ac3 to 2.2.13aa6+raid-0.90 (and the incremental
>"set_blocksize" patch you kindly provided) and Don Becker's
>eepro.c 1.09l (not sure if this is latest?) in hopes I can
>finally have a really stable setup ... these had been running
>well for about 12 hours, but I just had one of the servers
>crash with the following error (seen before under 2.2.13ac3):
>
>  wait_on_bh, CPU 3:	(this is the first processor)
>  irq:	0 [0 0]
>  bh:	1 [0 0]
>  <[8010b39d]> <[80150daa]> <[80150d46]> <[8012912b]> \
>  <[8012a367]> <[801291a6]> <[8012921f]> <[801092ac]>
>
>I tried to correlate the registers above with System.map:
>
>  8010b360 T synchronize_bh
>  8010b3b0 T synchronize_irq
>
>  80150d20 t sock_close
>  80150d5c t sock_fasync
>  
>  8012910c T __fput
>  80129154 T filp_close
>
>  8012a350 T fput
>  8012a398 T put_filp
>
>  80129154 T filp_close
>  801291b0 T sys_close
>
>  801291b0 T sys_close
>  80129238 T sys_vhangup
>
>  80109278 T system_call
>  801092b0 T ret_from_sys_call
>
>If I press ALT+SysRq+P, the EIP shows "0010:[<80166671>]"
>which appears to be related to functions (from System.map):
>
>  80166660 T tcp_send_delayed_ack
>  801666b4 T tcp_send_ack
>
>In some earlier posts I read that "wait_on_bh"
>means that the system is waiting on the bottom half
>(SMP-specific), so I've edited my /etc/lilo.conf
>to add "nosmp noapic", and I'll see if the servers
>run stable w/o SMP ... this isn't a real solution
>of course.
>
>Any help/pointers/patches would be greatly appreciated.

I think I spotted and fixed the bug that is soft-deadlocking your 2.2.x
compaq cluster (all seems to make sense :). Could you try the below patch
against 2.2.14 (or 2.2.14aa1 or 2.2.13 or 2.2.13aa6)?

--- 2.2.14/net/ipv4/tcp_output.c.~1~	Fri Jan  7 18:19:25 2000
+++ 2.2.14/net/ipv4/tcp_output.c	Sun Jan  9 21:32:04 2000
@@ -1004,7 +1004,7 @@
 	unsigned long timeout;
 
 	/* Stay within the limit we were given */
-	timeout = tp->ato;
+	timeout = (tp->ato << 1) >> 1;
 	if (timeout > max_timeout)
 		timeout = max_timeout;
 	timeout += jiffies;
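
The effect of `(tp->ato << 1) >> 1` in the patch above can be checked in isolation. The sketch below assumes that `ato` is a 32-bit unsigned field and that the quickack flag lives in its top bit (bit 31); neither detail is spelled out in this thread, and `QUICKACK_BIT`, `strip_top_bit` and `show` are names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: the quickack flag is bit 31 of a 32-bit ato. */
#define QUICKACK_BIT (UINT32_C(1) << 31)

/* Same expression as the patch: shifting left then right by one on an
 * unsigned 32-bit value discards the top bit and keeps the low 31 bits. */
static uint32_t strip_top_bit(uint32_t ato)
{
    uint32_t timeout = (uint32_t)(ato << 1) >> 1;

    assert((timeout & QUICKACK_BIT) == 0);
    return timeout;
}

/* Print one before/after pair. */
static void show(uint32_t ato)
{
    printf("ato=0x%08x -> timeout=%u\n", (unsigned)ato,
           (unsigned)strip_top_bit(ato));
}
```

For example, `show(QUICKACK_BIT | 5)` reports a timeout of 5, and a value without the flag passes through unchanged. The shift pair is equivalent to masking with `~QUICKACK_BIT` under the 32-bit assumption.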

I uploaded the above patch here too:

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-timer-1.gz

Have fun!

Andrea



From: "David S. Miller" <da...@redhat.com>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 2000/01/09
Message-ID: <fa.hesm80v.157e704@ifi.uio.no>#1/1
X-Deja-AN: 570541496
Original-Date: Sun, 9 Jan 2000 15:36:09 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <200001092336.PAA11026@pizda.ninka.net>
References: <fa.jrefbqv.a0ulib@ifi.uio.no>
To: and...@suse.de
Original-References: <Pine.LNX.4.21.0001092136450.11394-100...@alpha.random>
X-Authentication-Warning: pizda.ninka.net: davem set sender to da...@redhat.com using -f
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date:   Sun, 9 Jan 2000 21:43:01 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   I think I spotted and fixed the bug that is soft-deadlocking your
   2.2.x compaq cluster (all seems to make sense :). Could you try the
   below patch against 2.2.14 (or 2.2.14aa1 or 2.2.13 or 2.2.13aa6)?

Wrong, all callers of tcp_send_delayed_ack _guarantee_ that the
quickack bit is clear.  Your patch does nothing, put an assert
there if you don't believe me.

Later,
David S. Miller
da...@redhat.com


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 2000/01/10
Message-ID: <fa.jof3chv.a0elqb@ifi.uio.no>#1/1
X-Deja-AN: 570872471
Original-Date: Mon, 10 Jan 2000 16:11:28 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001100246170.18403-100000@alpha.random>
References: <fa.hesm80v.157e704@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: "David S. Miller" <da...@redhat.com>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 9 Jan 2000, David S. Miller wrote:

>   Date:   Sun, 9 Jan 2000 21:43:01 +0100 (CET)
>   From: Andrea Arcangeli <and...@suse.de>
>
>   I think I spotted and fixed the bug that is soft-deadlocking your
>   2.2.x compaq cluster (all seems to make sense :). Could you try the
>   below patch against 2.2.14 (or 2.2.14aa1 or 2.2.13 or 2.2.13aa6)?
>
>Wrong, all callers of tcp_send_delayed_ack _guarentee_ that the
>quickack bit is clear.  Your patch does nothing, put an assert

tcp_delack_timer doesn't really guarantee that, see:

void tcp_delack_timer(unsigned long data)
{
	struct sock *sk = (struct sock*)data;

	if(!sk->zapped &&
	   sk->tp_pinfo.af_tcp.delayed_acks &&
	   sk->state != TCP_CLOSE) {
		/* If socket is currently locked, defer the ACK. */
		if (!atomic_read(&sk->sock_readers))
			tcp_send_ack(sk);
		else
			tcp_send_delayed_ack(&(sk->tp_pinfo.af_tcp), HZ/10);
	}
}

You should also guarantee by design that no timer is pending after
you turn on the quickack bit and before dropping the bh or sock lock. It
seems to me you are guaranteeing that by always sending an ack on the wire
after you set the quickack bit, but it's not trivial to prove, and right
now the only explanation for the deadlock reported by urban and that other
people are experiencing is that a delack timer triggers while delayed_acks
is > 0 and the quickack bit is set.

If the quickack bit is set while calling tcp_send_delayed_ack the kernel
will lock up immediately in a way that matches the reports from ursus.

The reason for the deadlock is that the expires field of the timer will be
set in the past, so the timer will reinsert itself in the first heap
slot and will continue to be reinserted and re-executed in an infinite
loop -> soft deadlock. I'd also like to make the timer code robust against
this kind of subsystem bug later, but for now I am focused only on fixing
the offending code in TCP.
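
The re-arm loop described above can be shown with a toy model. This is a sketch under loud assumptions, not the 2.2 timer code: `struct toy_timer`, `is_due`, `rearm` and `run_tick` are invented for the illustration, ticks are 32-bit, and the loop carries an artificial `cap` so that it terminates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy timer, loosely modelled on the description above; not kernel code. */
struct toy_timer {
    uint32_t expires;   /* tick at which the handler should run */
    int pending;        /* non-zero while the timer is queued */
};

/* Wraparound-safe "expires is now or in the past" test on 32-bit ticks. */
static int is_due(uint32_t expires, uint32_t now)
{
    return (uint32_t)(now - expires) < UINT32_C(0x80000000);
}

/* Handler body: re-arm the timer `timeout` ticks after `now`. */
static void rearm(struct toy_timer *t, uint32_t now, uint32_t timeout)
{
    t->expires = now + timeout;
    t->pending = 1;
}

/* Process one tick. The handler runs while the timer is pending and due.
 * `cap` bounds the loop so that the sketch terminates; the scenario
 * described above has no such bound. Returns the number of handler runs. */
static int run_tick(uint32_t now, uint32_t timeout, int cap)
{
    struct toy_timer t = { now, 1 };
    int runs = 0;

    assert(cap > 0);
    while (t.pending && is_due(t.expires, now) && runs < cap) {
        t.pending = 0;
        runs++;
        rearm(&t, now, timeout);
    }
    return runs;
}
```

With a timeout of 10 ticks the handler runs once and the timer waits for a later tick. With a timeout whose top bit is set, the wrapped expiry compares as already due, so the handler is re-run within the same tick until the cap stops it.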

I have to admit that I can't yet see exactly the path that sets the
quickack bit without sending data on the wire, but you'll agree with me that
the tcp_send_delayed_ack function is not interested in the quickack bit
and is interested only in the real "ato" information, so my patch is
obviously correct and in the worst case it won't change anything. I
believe it will also fix the lockup, and that it's the right
approach to avoid this kind of subtle mistake. It makes more sense than
destroying the quickack information before reinserting the delack
timer from the delack timer, no?

BTW, while reading the code I found a lockup-unrelated bug in the delack
handling:

--- 2.2.14/net/ipv4/tcp_input.c	Fri Jan  7 18:19:25 2000
+++ /tmp/tcp_input.c	Mon Jan 10 03:41:10 2000
@@ -1428,6 +1428,7 @@
 	if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 		/* A retransmit, 2nd most common case.  Force an imediate ack. */
 		SOCK_DEBUG(sk, "retransmit received: seq %X\n", TCP_SKB_CB(skb)->seq);
+		tp->delayed_acks++;
 		tcp_enter_quickack_mode(tp);
 		kfree_skb(skb);
 		return;

Anyway the above fix is not interesting for the real world because it seems
impossible to me to reach such a path in the TCP code (so basically such a
check is useless, but after all I like it for completeness of the
function). This is because we call tcp_data only when we know the packet is
in our receive window (otherwise we force an ack by hand prior to calling
tcp_queue).

And really tp->delayed_acks is meaningless as far as I can tell and the right
thing to do is to remove the delayed_acks field completely (this must be
done at least in 2.3.x). Removing it will avoid wasting time in the TCP
code and will remove half of the delacks code braindamage :).

Ursus, please also apply this patch on top of my fix in the previous
email, as per David's correct suggestion of putting an assert there. If
you see a printk with the below patch applied, we'll have proof that my
theory about the source of your deadlocks is correct and that my fix
made the difference for you. Without the below patch applied you could
think you are not deadlocking anymore because of luck :).

--- 2.2.14/net/ipv4/tcp_timer.c	Fri Jan  7 18:19:25 2000
+++ /tmp/tcp_timer.c	Mon Jan 10 16:02:14 2000
@@ -173,7 +173,12 @@
 		if (!atomic_read(&sk->sock_readers))
 			tcp_send_ack(sk);
 		else
+		{
+			struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
+			if (tcp_in_quickack_mode(tp))
+				printk(KERN_ERR "quickack bit set!!!!\n");
 			tcp_send_delayed_ack(&(sk->tp_pinfo.af_tcp), HZ/10);
+		}
 	}
 }

If my fix doesn't fix the deadlock completely I have really no other
reasonable ideas on what can be going wrong right now. Thinking thinking...

Comments?

Andrea


From: "David S. Miller" <da...@redhat.com>
Subject: Re: [2.2.13aa6 (bugfix release II) ]
Date: 2000/01/11
Message-ID: <fa.hfcqagv.13m24g2@ifi.uio.no>#1/1
X-Deja-AN: 570965301
Original-Date: Mon, 10 Jan 2000 11:43:40 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <200001101943.LAA11888@pizda.ninka.net>
References: <fa.jof3chv.a0elqb@ifi.uio.no>
To: and...@suse.de
Original-References: <Pine.LNX.4.21.0001100246170.18403-100...@alpha.random>
X-Authentication-Warning: pizda.ninka.net: davem set sender to da...@redhat.com using -f
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date: Mon, 10 Jan 2000 16:11:28 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   tcp_delack_timer doesn't really guarantee that, see:

Ok, thanks for catching this.

   If the quickack bit is set while calling tcp_send_delayed_ack the kernel
   will lockup immediatly in a way that matches the reports from ursus.

How?  Always in such a case, timeout > max_timeout because this bit is
set and the values are unsigned.
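
The argument above can be tried out with plain unsigned arithmetic. The sketch below mirrors the clamp from the patch context earlier in the thread (`if (timeout > max_timeout) timeout = max_timeout;`), assuming a 32-bit `ato` with the quickack flag in bit 31; `QUICKACK_BIT`, `clamp_timeout`, `delay_unmasked` and `delay_masked` are illustrative names, and the sample values are made up.

```c
#include <assert.h>
#include <stdint.h>

#define QUICKACK_BIT (UINT32_C(1) << 31)   /* assumed flag position */

/* Clamp as in the patch context: never exceed the caller's limit. */
static uint32_t clamp_timeout(uint32_t timeout, uint32_t max_timeout)
{
    if (timeout > max_timeout)
        timeout = max_timeout;
    assert(timeout <= max_timeout);
    return timeout;
}

/* Unpatched path: ato is used as-is, flag bit included. */
static uint32_t delay_unmasked(uint32_t ato, uint32_t max_timeout)
{
    return clamp_timeout(ato, max_timeout);
}

/* Patched path: the top bit is stripped before the clamp. */
static uint32_t delay_masked(uint32_t ato, uint32_t max_timeout)
{
    return clamp_timeout((uint32_t)(ato << 1) >> 1, max_timeout);
}
```

With the flag set, `delay_unmasked(QUICKACK_BIT | 3, 10)` yields 10, because the flagged value is huge as an unsigned number and the clamp brings it down, while `delay_masked(QUICKACK_BIT | 3, 10)` yields the real value 3. Both results stay within `max_timeout`, provided `max_timeout` itself has the top bit clear.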

   The reason for the deadlock is that the expired field of the timer will be
   set in the past and so the timer will reinsert inself in the first heap
   slot and so it will continue to reinsert and rexecute it in an infinite
   loop -> soft deadlock.

Not if the timeout>max_timeout test passes, which I think it will.

   I'd like to also make the timer code robust against
   these kind of subsystem bugs later but actually I am only focused to fix
   the offending code in TCP.

Agreed, so I want your fix to go in anyways.

But I do want to discuss where the error comes from and
why the timeout>max_timeout test does not prevent it.

   And really tp->delayed_acks is meaningless as far I can tell and the right
   thing to do is to remove the delayed_acks field completly (this must be
   done at least in 2.3.x). Removing it will avoid wasting time in the TCP
   code and will decrease half of the the delacks code braindamage :).

It's already done in the patch sets I've been feeding Linus for 2.3.x
It has died in our sources, it will be no more :-)

Later,
David S. Miller
da...@redhat.com


From: ursus <ur...@usa.net>
Subject: Re: [Re: [2.2.13aa6 (bugfix release II) ]]
Date: 2000/01/10
Message-ID: <fa.nsvi2rv.1e4401n@ifi.uio.no>#1/1
X-Deja-AN: 570902240
Original-Date: 10 Jan 00 13:30:13 EST
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 8BIT
Original-Message-ID: <20000110183013.13273.qmail@nwcst314.netaddress.usa.net>
To: Andrea Arcangeli <and...@suse.de>, "David S. Miller" <da...@redhat.com>
Content-Type: text/plain; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Andrea Arcangeli <and...@suse.de> wrote:

> please apply also this patch on the top of my fix
> in the previous email as for David [Miller's] correct
> suggestion of putting an assert there. If you see
> a printk with the below patch applyed, we'll have 
> [proof that] my theory about the source of your deadlocks
> is correct and that my fix made the difference for you.

ok, I'll apply the previous "delack-timer-1" patch,
as well as the one below. However, can you upload
the below patch to the ftp.*.kernel.org mirrors also,
just so I can ensure the spacings and such are correct.
Thanks.

> --- 2.2.14/net/ipv4/tcp_timer.c	Fri Jan  7 18:19:25 2000

As a side note, the machines still have not crashed,
so almost certainly the TCP_DELAY_ACK bug is the culprit,
at least in the UniProcessor case. Going to try using an SMP
kernel again after I'm sure these patches do the job.

Thanks for your help ...

--
RW <ur...@usa.net>

____________________________________________________________________
Get free email and a permanent address at http://www.netaddress.com/?N=1


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Re: [2.2.13aa6 (bugfix release II) ]]
Date: 2000/01/10
Message-ID: <fa.inbme8v.1a58kjb@ifi.uio.no>#1/1
X-Deja-AN: 570914675
Original-Date: Mon, 10 Jan 2000 19:53:05 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001101945490.4259-100000@alpha.random>
References: <fa.nsvi2rv.1e4401n@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On 10 Jan 2000, ursus wrote:

>as well as the one below. However, can you upload
>the below patch to the ftp.*.kernel.org mirrors also,

ok, I put it here:

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-assert-1.gz

>at least in UniProcessor case. Going to try using SMP
>kernel again after I'm sure these patches do the job.

Fine.

BTW, assuming TCP does its locking right (either doing lock_sock or
running in atomic bh context) SMP/UP shouldn't matter. And if TCP is
missing a lock_sock, a race can also trigger on UP. So hopefully if we get
it fixed on UP we should be fine on SMP too later.

Andrea



From: Andrea Arcangeli <and...@suse.de>
Subject: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/12
Message-ID: <fa.hpj3vlv.752u12@ifi.uio.no>#1/1
X-Deja-AN: 571586341
Original-Date: Wed, 12 Jan 2000 02:00:02 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001120135430.312-100000@alpha.random>
References: <fa.nsvi2rv.1e4401n@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I fixed the timer code to be robust against the bad scenario I discovered
in the last few days. The bad scenario consists of a timer that reinserts
itself with expires <= jiffies (or more precisely < timer_jiffies).

In the current 2.2.x and 2.3.x this scenario will lead to a plain
deadlock.

I verified the correctness of the patch in a userspace simulation.
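
A minimal userspace sketch of the failure mode described above. This is not
the simulation mentioned in the message (its code was not posted); it assumes
a single 256-slot wheel indexed directly by expiry time, and the names
sim_timer, run_slot and simulate are hypothetical. The drain loop mirrors the
unpatched "pop the head until the slot is empty" logic, with a budget added
only so that the simulation terminates:

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 256                 /* hypothetical wheel size */

struct sim_timer {
    struct sim_timer *next;
    unsigned long expires;
    void (*fn)(struct sim_timer *);
};

static struct sim_timer *wheel[SLOTS];
static unsigned long sim_jiffies;

/* Push a timer onto the slot chosen by its expiry time. */
static void sim_add_timer(struct sim_timer *t)
{
    unsigned long idx;

    assert(t != NULL);
    idx = t->expires & (SLOTS - 1);
    t->next = wheel[idx];
    wheel[idx] = t;
}

/* Handler that re-arms itself with expires == jiffies (the bad case). */
static void rearm_now(struct sim_timer *t)
{
    t->expires = sim_jiffies;
    sim_add_timer(t);
}

/* Handler that re-arms itself one tick in the future (the sane case). */
static void rearm_next_tick(struct sim_timer *t)
{
    t->expires = sim_jiffies + 1;
    sim_add_timer(t);
}

/*
 * Drain the current slot by popping its head until the slot is empty.
 * 'budget' bounds the loop; the return value is how many handlers ran.
 */
static unsigned long run_slot(unsigned long budget)
{
    unsigned long idx = sim_jiffies & (SLOTS - 1);
    unsigned long ran = 0;
    struct sim_timer *t;

    while ((t = wheel[idx]) != NULL && ran < budget) {
        wheel[idx] = t->next;     /* detach from the slot */
        t->next = NULL;
        t->fn(t);                 /* may re-insert into wheel[idx] */
        ran++;
    }
    return ran;
}

/* Run one slot holding a single self-rearming timer. */
static unsigned long simulate(void (*fn)(struct sim_timer *),
                              unsigned long budget)
{
    static struct sim_timer t;
    size_t i;

    for (i = 0; i < SLOTS; i++)
        wheel[i] = NULL;
    sim_jiffies = 1000;
    t.expires = sim_jiffies;
    t.fn = fn;
    sim_add_timer(&t);
    return run_slot(budget);
}
```

With rearm_next_tick the slot drains after one handler call; with rearm_now
the loop only stops because the budget runs out, which is the userspace
analogue of never leaving the timer loop.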

Ursus, could you please check whether you can still deadlock your machine
with this patch against 2.2.14 applied on top of your current tree?

--- 2.2.14/kernel/sched.c	Wed Jan  5 14:16:56 2000
+++ /tmp/sched.c	Wed Jan 12 00:45:15 2000
@@ -535,6 +535,15 @@
 		/* can happen if you add a timer with expires == jiffies,
 		 * or you set a timer to go off in the past
 		 */
+		if ((signed long) idx < -50)
+			/* Nobody should set a timer so insanely in the past or
+			   waiting so many timer interrupts between reading
+			   jiffies and calling the timer code. The timer code
+			   is completly robust against this condition but
+			   a printk may let us know about bugs in the
+			   caller we might not notice otherwise. */
+			printk(KERN_WARNING
+			       "timer inserted in the past, idx = %ld\n", idx);
 		insert_timer(timer, tv1.vec, tv1.index);
 	} else if (idx <= 0xffffffffUL) {
 		int i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
@@ -1124,10 +1133,31 @@
         tv->index = (tv->index + 1) & TVN_MASK;
 }
 
+/* defer current timers to the next pass */
+static void cascade_current_timers(void)
+{
+	struct timer_list * timer;
+	int index = tv1.index;
+
+	timer = tv1.vec[index];
+	tv1.index = (tv1.index + 1) & TVR_MASK;
+
+	while (timer)
+	{
+		struct timer_list *tmp = timer;
+		timer = timer->next;
+		insert_timer(tmp, tv1.vec, tv1.index);
+	}
+	tv1.vec[index] = NULL;
+}
+
 static inline void run_timer_list(void)
 {
+	long passes;
+
 	spin_lock_irq(&timerlist_lock);
-	while ((long)(jiffies - timer_jiffies) >= 0) {
+	passes = jiffies - timer_jiffies;
+	while (passes-- >= 0) {
 		struct timer_list *timer;
 		if (!tv1.index) {
 			int n = 1;
@@ -1135,17 +1165,21 @@
 				cascade_timers(tvecs[n]);
 			} while (tvecs[n]->index == 1 && ++n < NOOF_TVECS);
 		}
-		while ((timer = tv1.vec[tv1.index])) {
+		timer = tv1.vec[tv1.index];
+		tv1.vec[tv1.index] = 0;
+		while (timer) {
 			void (*fn)(unsigned long) = timer->function;
 			unsigned long data = timer->data;
-			detach_timer(timer);
-			timer->next = timer->prev = NULL;
+			struct timer_list * tmp = timer;
+			timer = timer->next;
+			detach_timer(tmp);
+			tmp->next = tmp->prev = NULL;
 			spin_unlock_irq(&timerlist_lock);
 			fn(data);
 			spin_lock_irq(&timerlist_lock);
 		}
 		++timer_jiffies; 
-		tv1.index = (tv1.index + 1) & TVR_MASK;
+		cascade_current_timers();
 	}
 	spin_unlock_irq(&timerlist_lock);
 }


The same patch also applies cleanly to 2.3.38 if you specify
linux/kernel/timer.c as the file to patch.

Or you can download the patch from here:

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/timer_bh-deadlock-1.gz
	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.38/timer_bh-deadlock-1.gz

Andrea




From: Andrea Arcangeli <and...@suse.de>
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/12
Message-ID: <fa.hs37vlv.2l2u1e@ifi.uio.no>#1/1
X-Deja-AN: 571572295
Original-Date: Wed, 12 Jan 2000 02:49:56 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001120239430.822-100000@alpha.random>
References: <fa.hpj3vlv.752u12@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 12 Jan 2000, Andrea Arcangeli wrote:

>I fixed the timer code to be robust against the bad scenario I discovered
>in the last few days. The bad scenario consists of a timer that reinserts
>itself with expires <= jiffies (or more precisely < timer_jiffies).

As TyTso suggested, the problem could be ato being zero. I checked and I
finally also spotted the offending bug that was causing the above
condition to happen (second chunk of the patch):

--- 2.2.14-tcp/net/ipv4/tcp_output.c.~1~	Fri Jan  7 18:19:25 2000
+++ 2.2.14-tcp/net/ipv4/tcp_output.c	Wed Jan 12 02:47:32 2000
@@ -1004,7 +1004,7 @@
 	unsigned long timeout;
 
 	/* Stay within the limit we were given */
-	timeout = tp->ato;
+	timeout = (tp->ato << 1) >> 1;
 	if (timeout > max_timeout)
 		timeout = max_timeout;
 	timeout += jiffies;
@@ -1044,6 +1044,8 @@
 			 */
 			if(tcp_in_quickack_mode(tp))
 				tcp_exit_quickack_mode(tp);
+			if (!tp->ato)
+				tp->ato = tp->rto;
 			tcp_send_delayed_ack(tp, HZ/2);
 			return;
 		}

An incoming synack doesn't carry any data in the packet, so
tcp_delack_estimator does not get called from tcp_ack, and the ato stays
zero. Then tcp_send_ack (the one we send to put the other end in the
established state) hits an oom condition and queues the delack timer while
ato is still zero. Then the timer gets reinserted in the current queue from
run_timer_list and boom!

The fact that an oom condition was necessary to trigger the bug perfectly
explains why it wasn't reproducible on most machines.

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-timer-2.gz
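
The arithmetic behind the explanation above can be sketched in userspace. This
is only a model under stated assumptions: HZ = 100, a sample rto of 3*HZ, and
the hypothetical helper names delack_expires and ato_or_rto. The quickack
masking from the first hunk is left out to keep the sketch short.

```c
#include <assert.h>

#define HZ 100UL  /* assumed tick rate for the example */

/*
 * Model of the expiry computation in tcp_send_delayed_ack(): clamp the
 * ack timeout to max_timeout, then add the current tick count.
 */
static unsigned long delack_expires(unsigned long ato,
                                    unsigned long max_timeout,
                                    unsigned long now)
{
    unsigned long timeout = ato;

    assert(max_timeout > 0);
    if (timeout > max_timeout)
        timeout = max_timeout;
    return timeout + now;
}

/* Second hunk of the patch: fall back to rto while ato is still zero. */
static unsigned long ato_or_rto(unsigned long ato, unsigned long rto)
{
    return ato ? ato : rto;
}
```

With ato == 0 the computed expiry equals the current tick count, so the timer
lands in the slot that is being processed right now. After the fallback to
rto, the expiry is capped at max_timeout ticks in the future instead.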

Andrea



From: "David S. Miller" <da...@redhat.com>
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/12
Message-ID: <fa.hacs8ov.17nc6oe@ifi.uio.no>#1/1
X-Deja-AN: 571586340
Original-Date: Tue, 11 Jan 2000 18:40:23 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <200001120240.SAA02150@pizda.ninka.net>
References: <fa.hs37vlv.2l2u1e@ifi.uio.no>
To: and...@suse.de
Original-References: <Pine.LNX.4.21.0001120239430.822-100...@alpha.random>
X-Authentication-Warning: pizda.ninka.net: davem set sender to da...@redhat.com using -f
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date: Wed, 12 Jan 2000 02:49:56 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   @@ -1044,6 +1044,8 @@
			    */
			   if(tcp_in_quickack_mode(tp))
				   tcp_exit_quickack_mode(tp);
   +			if (!tp->ato)
   +				tp->ato = tp->rto;
			   tcp_send_delayed_ack(tp, HZ/2);
			   return;
		   }

Yep, I bet this is it.  Good spotting.

Both of these fixes to tcp_output.c are in my tree now.

Later,
David S. Miller
da...@redhat.com


From: ty...@valinux.com
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/14
Message-ID: <fa.ikttrov.2ks7bu@ifi.uio.no>#1/1
X-Deja-AN: 572547617
Original-Date: Wed, 12 Jan 2000 11:23:45 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E128TMv-0007md-00@dcl.su.varesearch.com>
References: <fa.ikaud7v.1b50lj0@ifi.uio.no>
To: and...@suse.de
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Phone: (781) 391-3464
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date: Wed, 12 Jan 2000 20:12:54 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   >It adds one conditional inside a rare 'if' case, so it's not a
   >performance issue, and it means that the next time something like this
   >happens, the machine will cleanly panic, and leave a very easy to
   >understand indication of what went wrong.

   I don't like a panic for something that we can recover from gracefully,
   while also allowing the user to see the message even if he was running
   under X 8).

Fine, so make it set a standard timeout and do a printk instead.  This is
a "never can happen" situation, right?


	if (!timeout) {
		timeout = tp->rto;
		if (!timeout) {
			printk("Bugcheck: tcp_send_delayed_ack ato and rto are 0\n");
			timeout = HZ/50;
		}
	}
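
The fallback chain suggested here can be exercised in userspace. The sketch
below is an illustration only: it assumes HZ = 100, replaces printk with
fprintf, and uses a hypothetical counter (bugcheck_count) so the "never can
happen" branch is observable.

```c
#include <assert.h>
#include <stdio.h>

#define HZ 100UL            /* assumed tick rate for the example */

static int bugcheck_count;  /* hypothetical: counts "impossible" cases */

/*
 * Pick the delayed-ack timeout: prefer ato, fall back to rto, and as a
 * last resort warn and use a small default so the timer never fires
 * with a zero delay.
 */
static unsigned long delack_timeout_or_default(unsigned long ato,
                                               unsigned long rto)
{
    unsigned long timeout = ato;

    if (!timeout) {
        timeout = rto;
        if (!timeout) {
            fprintf(stderr,
                    "Bugcheck: tcp_send_delayed_ack ato and rto are 0\n");
            bugcheck_count++;
            timeout = HZ / 50;
        }
    }
    assert(timeout > 0);    /* the whole point of the fallback chain */
    return timeout;
}
```

The final assert documents the property the check is after: whatever the
inputs, the returned timeout is strictly positive.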

							- Ted



From: ty...@valinux.com
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/14
Message-ID: <fa.iib9h8v.3609ru@ifi.uio.no>#1/1
X-Deja-AN: 572550514
Original-Date: Wed, 12 Jan 2000 07:59:35 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E128QBL-0007kE-00@dcl.su.varesearch.com>
References: <fa.ilr2dov.18l4l35@ifi.uio.no>
To: and...@suse.de
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Phone: (781) 391-3464
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date: Wed, 12 Jan 2000 04:02:35 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   Ok, moving the check inside tcp_send_delayed_ack is fine for me.

   But it has to be said that the only other way the delack timer could be
   posted is via __tcp_ack_snd_check. And if __tcp_ack_snd_check is using
   tcp_send_delayed_ack instead of tcp_send_ack before in-order packets with
   data have arrived (so before the ato has been initialized to something
   different from zero), it probably means there's a genuine bug in tcp.

True; but my paranoia says that even if there isn't a problem *now*,
there may be later.  Which is why I'll suggest one more change to
your patch:

	if (!timeout) {
		timeout = tp->rto;
		if (!timeout) 
			panic("Bugcheck: tcp_send_delayed_ack ato and rto are 0");
	}

It adds one conditional inside a rare 'if' case, so it's not a
performance issue, and it means that the next time something like this
happens, the machine will cleanly panic, and leave a very easy to
understand indication of what went wrong.

						- Ted


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/14
Message-ID: <fa.ikreegv.1bl0kr2@ifi.uio.no>#1/1
X-Deja-AN: 572766913
Original-Date: Fri, 14 Jan 2000 15:56:55 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001141552150.2653-100000@alpha.random>
References: <fa.hpj3vlv.752u12@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: ursus <ur...@usa.net>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 12 Jan 2000, Andrea Arcangeli wrote:

>+		while (timer) {
> 			void (*fn)(unsigned long) = timer->function;
> 			unsigned long data = timer->data;
>-			detach_timer(timer);
>-			timer->next = timer->prev = NULL;
>+			struct timer_list * tmp = timer;
>+			timer = timer->next;
>+			detach_timer(tmp);
>+			tmp->next = tmp->prev = NULL;
> 			spin_unlock_irq(&timerlist_lock);

If at this point the timer pointed to by "timer" gets detached while "fn" is
running, on the next loop iteration the machine is going to fail. I am sorry.
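
One way this can go wrong, sketched in userspace. The list layout, the
sentinel head node and all names here (sim_timer, sim_del_timer,
run_saved_next, run_reread_head) are assumptions made for illustration, not
the kernel code. Timer a's handler deletes timer b while the loop already
holds a saved pointer to b:

```c
#include <assert.h>
#include <stddef.h>

struct sim_timer {
    struct sim_timer *next, *prev;
    void (*fn)(struct sim_timer *);
    struct sim_timer *victim;   /* timer this handler deletes, if any */
    int fired;
};

/* Unlink a timer; a no-op on the links of an already detached timer. */
static void sim_del_timer(struct sim_timer *t)
{
    if (t->prev) {
        t->prev->next = t->next;
        if (t->next)
            t->next->prev = t->prev;
    }
    t->next = t->prev = NULL;
}

/* Insert right after the sentinel head node. */
static void sim_add_timer(struct sim_timer *head, struct sim_timer *t)
{
    t->next = head->next;
    t->prev = head;
    if (head->next)
        head->next->prev = t;
    head->next = t;
}

static void handler(struct sim_timer *t)
{
    t->fired++;
    if (t->victim)
        sim_del_timer(t->victim);
}

/* Patched-loop style: remember the successor before calling the handler. */
static void run_saved_next(struct sim_timer *head)
{
    struct sim_timer *t = head->next;

    while (t) {
        struct sim_timer *cur = t;
        t = t->next;            /* saved before the handler runs */
        sim_del_timer(cur);
        cur->fn(cur);           /* may delete the timer 't' points at */
    }
}

/* Original-loop style: re-read the list head on every iteration. */
static void run_reread_head(struct sim_timer *head)
{
    struct sim_timer *t;

    while ((t = head->next) != NULL) {
        sim_del_timer(t);
        t->fn(t);
    }
}

/* Returns how many times the deleted timer b fired (0 is correct). */
static int deleted_timer_fired(void (*run)(struct sim_timer *))
{
    struct sim_timer head = {0}, a = {0}, b = {0};

    a.fn = b.fn = handler;
    a.victim = &b;
    sim_add_timer(&head, &b);
    sim_add_timer(&head, &a);   /* list is now: head -> a -> b */
    run(&head);
    assert(a.fired == 1);
    return b.fired;
}
```

In the saved-pointer variant the deleted timer still fires. In a kernel the
owner of b may already have freed it by then, so the loop would be walking
stale memory.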

>Or you can download the patch from here:
>
>	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/timer_bh-deadlock-1.gz
>	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.38/timer_bh-deadlock-1.gz

Please don't use the two timer_bh patches quoted above (nor the -2
optimized version). Having the timer robust against buggy users is not
necessary, only desirable, so you don't actually need it.
Nevertheless I'll fix the problem soon.

FYI: the delack-timer-3 patch here:

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-timer-3.gz

seems to fix the popular wait_on_bh deadlock on UP/SMP webservers fine :).

Andrea



From: ty...@valinux.com
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/15
Message-ID: <fa.i6sti0v.cjc93u@ifi.uio.no>#1/1
X-Deja-AN: 572800041
Original-Date: Tue, 11 Jan 2000 18:03:40 -0800
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E128D8O-0007ai-00@dcl.su.varesearch.com>
References: <fa.hs37vlv.2l2u1e@ifi.uio.no>
To: and...@suse.de
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Phone: (781) 391-3464
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

   Date: Wed, 12 Jan 2000 02:49:56 +0100 (CET)
   From: Andrea Arcangeli <and...@suse.de>

   As TyTso suggested, the problem could be ato being zero. I checked and I
   finally also spotted the offending bug that was causing the above
   condition to happen (second chunk of the patch):

   --- 2.2.14-tcp/net/ipv4/tcp_output.c.~1~	Fri Jan  7 18:19:25 2000
   +++ 2.2.14-tcp/net/ipv4/tcp_output.c	Wed Jan 12 02:47:32 2000
   @@ -1004,7 +1004,7 @@
	   unsigned long timeout;

	   /* Stay within the limit we were given */
   -	timeout = tp->ato;
   +	timeout = (tp->ato << 1) >> 1;
	   if (timeout > max_timeout)
		   timeout = max_timeout;
	   timeout += jiffies;

Note that max_timeout is always a small positive number (HZ/2 or HZ/10),
and timeout is an unsigned long.  Hence if the quickack bit is set,
timeout is > max_timeout, and so timeout gets capped to max_timeout.
Without the patch, we simply delay the ack by the max_timeout instead
of the current value of ato.  Probably not the best, but not a disaster,
either.
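
A worked example of the point above, as a userspace sketch. It assumes a
32-bit ato whose top bit carries the quickack flag (which is what the
shift-left-then-right in the patch would clear), HZ = 100, and the
hypothetical names QUICKACK_BIT, timeout_unpatched and timeout_patched:

```c
#include <assert.h>
#include <stdint.h>

#define HZ 100UL                            /* assumed tick rate */
#define QUICKACK_BIT (UINT32_C(1) << 31)    /* assumed flag position */

/* Before the patch: the flag bit is treated as part of the timeout. */
static unsigned long timeout_unpatched(uint32_t ato,
                                       unsigned long max_timeout)
{
    unsigned long timeout = ato;

    assert(max_timeout > 0);
    return timeout > max_timeout ? max_timeout : timeout;
}

/* After the patch: shifting left then right drops the top bit first. */
static unsigned long timeout_patched(uint32_t ato,
                                     unsigned long max_timeout)
{
    unsigned long timeout = (uint32_t)(ato << 1) >> 1;

    assert(max_timeout > 0);
    return timeout > max_timeout ? max_timeout : timeout;
}
```

With the flag set and an ato of 4 ticks, the unpatched version returns the
cap (HZ/2) while the patched version returns 4. With the flag clear both
agree.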

   @@ -1044,6 +1044,8 @@
			    */
			   if(tcp_in_quickack_mode(tp))
				   tcp_exit_quickack_mode(tp);
   +			if (!tp->ato)
   +				tp->ato = tp->rto;
			   tcp_send_delayed_ack(tp, HZ/2);
			   return;
		   }

This fixes the bug, but I'd be much happier if we put a
belt-and-suspenders check in tcp_send_delayed_ack.  If there's some
other place which allows tcp_send_delayed_ack() to be called with
tp->ato set to zero, we shouldn't lock up the entire kernel.  

So I'd propose adding to the tcp_send_delayed_ack() something like this:

#define min_timeout (HZ/50)
	if (timeout < min_timeout)
		timeout = min_timeout;	/* This prevents an endless kernel loop */

(who me, paranoid?)
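
The lower bound proposed above, combined with the existing upper bound, can be
sketched as a small userspace function. HZ = 100 and the names MIN_TIMEOUT and
clamp_delack_timeout are assumptions made for the example:

```c
#include <assert.h>

#define HZ 100UL                 /* assumed tick rate */
#define MIN_TIMEOUT (HZ / 50)    /* parenthesized so it composes safely */

/* Clamp the delayed-ack timeout into [MIN_TIMEOUT, max_timeout]. */
static unsigned long clamp_delack_timeout(unsigned long timeout,
                                          unsigned long max_timeout)
{
    assert(max_timeout >= MIN_TIMEOUT);
    if (timeout > max_timeout)
        timeout = max_timeout;
    if (timeout < MIN_TIMEOUT)
        timeout = MIN_TIMEOUT;   /* never schedule with a zero delay */
    return timeout;
}
```

Because the result is at least MIN_TIMEOUT, adding it to the current tick
count always yields an expiry strictly in the future, regardless of how the
caller ended up with a zero ato.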

							- Ted


From: <arij...@valinux.com>
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/15
Message-ID: <fa.lcv2c3v.1r0obru@ifi.uio.no>#1/1
X-Deja-AN: 572878309
Original-Date: Fri, 14 Jan 2000 17:43:21 -0500 (EST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.10001141737580.1356-100000@lap.valinuxny.net>
References: <fa.ikreegv.1bl0kr2@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Sender: arij...@lap.valinuxny.net
X-Authentication-Warning: lap.valinuxny.net: arijort owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



Andrea,

Just to be clear...

Are you saying that the entire timer_bh-deadlock patch is bad?
Or simply the hunk that you refer to below, which begins like this:


@@ -1135,17 +1165,20 @@


ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/timer_bh-deadlock-1.gz

I'm assuming it's the whole patch.


ari




On Fri, 14 Jan 2000, Andrea Arcangeli wrote:

> On Wed, 12 Jan 2000, Andrea Arcangeli wrote:
> 
> >+		while (timer) {
> > 			void (*fn)(unsigned long) = timer->function;
> > 			unsigned long data = timer->data;
> >-			detach_timer(timer);
> >-			timer->next = timer->prev = NULL;
> >+			struct timer_list * tmp = timer;
> >+			timer = timer->next;
> >+			detach_timer(tmp);
> >+			tmp->next = tmp->prev = NULL;
> > 			spin_unlock_irq(&timerlist_lock);
> 
> If at this point the timer pointed to by "timer" gets detached while "fn" is
> running, on the next loop iteration the machine is going to fail. I am sorry.
> 
> >Or you can download the patch from here:
> >
> >	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/timer_bh-deadlock-1.gz
> >	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.38/timer_bh-deadlock-1.gz
> 
> Please don't use the two timer_bh patches quoted above (nor the -2
> optimized version). Having the timer robust against buggy users is not
> necessary, only desirable, so you don't actually need it.
> Nevertheless I'll fix the problem soon.
> 
> FYI: the delack-timer-3 patch here:
> 
> 	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-timer-3.gz
> 
> seems to fix the popular wait_on_bh deadlock on UP/SMP webservers fine :).
> 
> Andrea
> 



From: Andrea Arcangeli <and...@suse.de>
Subject: Re: timer_bh robusteness fix against potential deadlocks
Date: 2000/01/15
Message-ID: <fa.jmunc1v.bgikqb@ifi.uio.no>#1/1
X-Deja-AN: 572924322
Original-Date: Sat, 15 Jan 2000 01:46:12 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.21.0001150145061.14161-100000@alpha.random>
References: <fa.lcv2c3v.1r0obru@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: arij...@valinux.com
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 14 Jan 2000 arij...@valinux.com wrote:

>Are you saying that the entire timer_bh-deadlock patch is bad?
>Or simply the hunk that you refer to below, which begins like this:

All of it. You can safely reverse that patch completely because it's not
necessary for now.

>ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/timer_bh-deadlock-1.gz
>
>I'm assuming it's the whole patch.

You are correct. The whole patch.

Andrea

