To: linux-kernel@
Subject: 2.2.13 wait_on_bh lockups on SMP
From: Mark van Walraven
Date: Mon, 3 Jan 2000 21:59:14 +1300
Sender: owner-linux-kernel@

Hi,

I am getting lock ups on a production server with a high network load
once or twice a day. After a hang earlier this evening, the same message
repeating off the console screen:

wait_on_bh, CPU 0:
irq: 0 [0 0]
bh: 1 [0 1]
<[c010af05]> <[c015c994]> <[c016a18e]> <[c014c4e6]>

(This is slightly different to what ursus@xxxxxxx reported last month
http://www.deja.com/=dnc/[ST_rn=ps]/getdoc.xp?AN=563124486&fmt=text .)

Another hang happened about an hour later, this time with nothing
written to the console.

From my system map:

c010aecc T synchronize_bh
c014c4ac T sock_recvmsg
c015c800 T tcp_recvmsg
c016a100 T inet_recvmsg

Another system, with identical hardware and kernel, but not quite so
heavily loaded, has been running flawlessly for a couple of weeks.

Probably-irrelevant details: Dell PowerEdge 2300 with AMI MegaRAID;
kernel built from the Debian kernel-source-2.2.13_2.2.13-2 package, to
which I added freeswan-1.1 - otherwise a vanilla Debian 2.1 system;
eth0 and eth1 are eepro100 (module), though only eth0 is up;
CONFIG_M686, CONFIG_X86_GOOD_APIC, CONFIG_1GB, CONFIG_MTRR, CONFIG_SMP.
Of course, further config details are available on request.

I'd really appreciate any assistance - this is interrupting our
services, as well as my so-called holiday.

Thanks,

Mark.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxxx
Please read the FAQ at http://www.tux.org/lkml/
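The four addresses in the wait_on_bh trace can be matched by hand against the System.map lines quoted in the report. A minimal Python sketch of that lookup follows. It assumes only the four symbols listed in the message, so "nearest symbol at or below the address" is an approximation of what a full System.map (or ksymoops) would give, and the helper name `resolve` is made up for illustration.

```python
import bisect

# Symbols copied from the System.map excerpt in the report (address, name).
SYSTEM_MAP = [
    (0xC010AECC, "synchronize_bh"),
    (0xC014C4AC, "sock_recvmsg"),
    (0xC015C800, "tcp_recvmsg"),
    (0xC016A100, "inet_recvmsg"),
]

# Addresses printed by wait_on_bh in the report, in the order shown.
TRACE = [0xC010AF05, 0xC015C994, 0xC016A18E, 0xC014C4E6]


def resolve(address, symbols):
    """Return (name, offset) for the nearest symbol at or below address."""
    ordered = sorted(symbols)
    starts = [start for start, _ in ordered]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        raise ValueError(f"{address:#x} is below the first known symbol")
    start, name = ordered[index]
    return name, address - start


for addr in TRACE:
    name, offset = resolve(addr, SYSTEM_MAP)
    print(f"{addr:08x}  {name}+{offset:#x}")
```

With the excerpt above this resolves the trace to synchronize_bh+0x39, tcp_recvmsg+0x194, inet_recvmsg+0x8e and sock_recvmsg+0x3a, which is consistent with a process caught in the TCP receive path. Addresses inside modules would not resolve this way, since modules do not appear in System.map.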
From: Manfred Spraul <manfr...@colorfullife.com>
Subject: Re: 2.2.13 wait_on_bh lockups on SMP
Date: 2000/01/03
Message-ID: <fa.gvuju1v.1u6kdj6@ifi.uio.no>#1/1
X-Deja-AN: 567737709
Original-Date: Mon, 03 Jan 2000 14:05:24 +0100
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <38709E94.B6AAAB2@colorfullife.com>
References: <fa.g1p84pv.1074t81@ifi.uio.no>
To: Mark van Walraven <ma...@wave.co.nz>
Original-References: <20000103215914.A3...@mail.wave.co.nz>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Mark van Walraven wrote:
>
> Hi,
>
> I am getting lock ups on a production server with a high network load
> once or twice a day. After a hang earlier this evening, the same message
> repeating off the console screen:
>
> wait_on_bh, CPU 0:
> irq: 0 [0 0]
> bh: 1 [0 1]
> <[c010af05]> <[c015c994]> <[c016a18e]> <[c014c4e6]>
>

Unfortunately wait_on_bh() doesn't print a complete back trace if you
use modules. Could you apply the patch below? It will print a complete
back trace. Parse the result through ksymoops.

--
Manfred
From: Mark van Walraven <ma...@wave.co.nz>
Subject: Re: 2.2.13 wait_on_bh lockups on SMP
Date: 2000/01/06
Message-ID: <fa.g59o39v.16ngv86@ifi.uio.no>#1/1
X-Deja-AN: 568878982
Original-Date: Thu, 6 Jan 2000 12:24:48 +1300
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <20000106122448.C8464@mail.wave.co.nz>
References: <fa.gvuju1v.1u6kdj6@ifi.uio.no>
To: linux-ker...@vger.rutgers.edu
Original-References: <20000103215914.A3...@mail.wave.co.nz> <38709E94.B6AA...@colorfullife.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Mon, Jan 03, 2000 at 02:05:24PM +0100, Manfred Spraul wrote:
> Unfortunately wait_on_bh() doesn't print a complete back trace if you
> use modules. Could you apply the patch below? It will print a complete
> back trace.

I did this several days ago. After going into hiding for a bit, the
hangs have re-appeared, but I get nothing on the console at all.

It's possible that there are two (or more) separate problems. I suspect
the eepro100 driver and will investigate that. I'll post here if I get
any more oopses.

Thanks,

Mark.
From: Alan Cox <a...@lxorguk.ukuu.org.uk>
Subject: Re: 2.2.13 wait_on_bh lockups on SMP
Date: 2000/01/06
Message-ID: <fa.fll6vmv.tlmnqs@ifi.uio.no>#1/1
X-Deja-AN: 568924160
Original-Date: Thu, 6 Jan 2000 00:39:20 +0000 (GMT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E1260xV-0005Cb-00@the-village.bc.nu>
Content-Transfer-Encoding: 7bit
References: <fa.g59o39v.16ngv86@ifi.uio.no>
To: ma...@wave.co.nz (Mark van Walraven)
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

> It's possible that there are two (or more) separate problems. I suspect
> the eepro100 driver and will investigate that. I'll post here if I get
> any more oopses.

Someone else reported a tcp hang in wait_on_bh and is using an eepro100.
Right now I can't see a cause in the eepro100 driver, just a happens to
match.
From: "Terry Katz" <k...@advanced.org>
Subject: RE: 2.2.13 wait_on_bh lockups on SMP
Date: 2000/01/06
Message-ID: <fa.mjjdh7v.1p5kjgt@ifi.uio.no>#1/1
X-Deja-AN: 568944613
Original-Date: Wed, 5 Jan 2000 21:42:07 -0500
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <NDBBIHAIKLCAPCNCPEJNIELHCBAA.katz@advanced.org>
References: <fa.fll6vmv.tlmnqs@ifi.uio.no>
To: "Alan Cox" <a...@lxorguk.ukuu.org.uk>, "Mark van Walraven" <ma...@wave.co.nz>
X-Priority: 3 (Normal)
Content-Type: text/plain; charset="iso-8859-1"
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Importance: Normal
Organization: Internet mailing list
X-MSMail-Priority: Normal
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Were there updates to the eepro driver from 2.2.12 to 2.2.13? We have
been running a whole bunch of SMP systems since 2.2.12 was released and
haven't had a single crash ... in fact, we've had a system up for 50
days straight, which receives about a million web-hits a day...

-Terry

> -----Original Message-----
> From: owner-linux-ker...@vger.rutgers.edu
> [mailto:owner-linux-ker...@vger.rutgers.edu]On Behalf Of Alan Cox
> Sent: Wednesday, January 05, 2000 7:39 PM
> To: Mark van Walraven
> Cc: linux-ker...@vger.rutgers.edu; Manfred Spraul
> Subject: Re: 2.2.13 wait_on_bh lockups on SMP
>
>
> > It's possible that there are two (or more) separate problems. I suspect
> > the eepro100 driver and will investigate that. I'll post here if I get
> > any more oopses.
>
> Someone else reported a tcp hang in wait_on_bh and is using an
> eepro100. Right now I cant see a cause in the eepro100 driver,
> just a happens to match..
From: ur...@usa.net
Subject: Re: 2.2.13 wait_on_bh lockups on SMP
Date: 2000/01/09
Message-ID: <s7gn1tiqoj8159@corp.supernews.com>
X-Deja-AN: 570289565
References: <fa.g59o39v.16ngv86@ifi.uio.no> <fa.fll6vmv.tlmnqs@ifi.uio.no>
Organization: Posted via Supernews, http://www.supernews.com
Newsgroups: fa.linux.kernel
X-Complaints-To: newsabuse@supernews.com

In article <fa.fll6vmv.tlm...@ifi.uio.no>, Alan Cox wrote ...
>
> Someone else reported a tcp hang in wait_on_bh and is using an eepro100.
> Right now I can't see a cause in the eepro100 driver, just a happens to match.

Alan:

Been hammering away on this &@#%$ timer_bh problem for weeks :(

I reported the same "wait_on_bh" hangs under heavy network load with
linux 2.2.13[stock/ac3/aa6]-SMP as Manfred Spraul reported. I'm using
Compaq 6400R (2xPIII-500/2G memory) for webserving (need to handle 10+
million hits/day/server) with Linux+Apache; used Apache 1.3.9 with an
updated version of Dean Gaudet's TOP_FUEL patch. Also need raid-0.90
support (meaning the patch of course) as my web content is on a
raid-0.90 style 2-disk RAID-0 stripe. Plus using Intel EtherExpress
Pro/100 NIC with eepro100.c-1.09l at first, and later v1.09t (more
explanation on eepro100 below).

OT: I needed to increase the size of rt_cache via /proc/sys/net,
otherwise I was getting TONS of "dst_cache_overflow" in syslog during
heavy webserver load. Thanks to Alexey Kuznetsov for an older USENET
post on the SysCtl stuff to tune routing code. But the ipv4 rt_cache
tuning needs much better documentation!

Anyway, back to the "wait_on_bh" issue ...

Was getting too many crashes in SMP mode, even with Andrea's "aa6"
patchset for kernel 2.2.13, plus raid-0.90, plus the incremental
"set-blocksize" patch on top of the raid-0.90 patch. So I tried to run
the SMP kernel with "nosmp noapic" options on the bootprompt (via LILO
append="nosmp noapic").

Started seeing kernel crashes without any console or syslog errors and
keyboard mostly dead, except I could still get some EIP's (via
<ALT>+<SysRq>+P) ... all of which seem to refer to "timer_bh" according
to the System.map matching this kernel. Also contacted Andrea Arcangeli
directly about this problem; he has a lot of crash traces and info I fed
him via E-mail. If anyone thinks they can make use of the traces I'll
repost, or he can simply summarize them [better than I could anyway].

Decided to compile for true UniProcessor as a last-ditch option; worst
case is I "waste" the second CPU until the kernel bug is squashed. No
SMP "wait_on_bh"-style hangs, but immediately saw retransmit errors and
hangs from eepro100 ... same thing people have been complaining about on
the eepro100 mailing-list for months. Didn't see it myself until now
(2.2.13aa6+raid-up). Strange.

Upgraded to the "test" v1.09t of the eepro100.c driver from CESDIS, due
to a posting mentioning some possible workarounds Don Becker put into
the 1.09t version. Recompiled 2.2.13aa6+raid-0.90-up. Voila! The
eepro100 retransmit errors went away completely. This incident makes me
uneasy about the stability/performance of eepro100.c in general. Going
to try testing 3Com cards ...

Started getting crashes under 2.2.13aa6+raid-0.90-up under load, again
without any usable output to console/syslog and only some SysRq still
working (e.g. <ALT>+<SysRq>+P); got some EIP's which I sent to Andrea.
He suggested I grab his IKD patchset and enable the in-kernel debugger
(KDB) to get better trace info.

I applied Andrea's 2.2.13-ikd1 patchset pretty successfully (albeit with
a few small rejects I fixed by hand due to BIGMEM breaking the patch)
and ran the new kernel on 10 servers. Although they were still crashing
under load every few days, I was able to press <PAUSE> to get into the
kernel debugger (KDB) and see the current process backtrace. Sent these
to Andrea Arcangeli as well.

From KDB backtraces, the hangs under kernel 2.2.13aa6+raid+ikd-up
looked to him like an IRQ flood and possibly related to TCP_DELAY_ACK
... Andrea sent me a one-line patch to disable TCP_DELAY_ACK, which I
applied to see if this is the offending code.

I compiled 2.2.13aa6+raid+ikd+NO_DELAY_ACK for UniProcessor, and have
been running this kernel for 4 days straight under load without any
crashes. I'm not assuming the patch solved the problem; maybe I've just
been lucky this week (and these servers could crash any moment, so says
Murphy).

Recently I found some disturbing errors in /var/log/messages on a few of
these machines, regarding the RAID device:

"kernel: fs warning (device md(9,1)): ext2_free_blocks: bit already
cleared for block 4408088"

These repeat a few times every hour or so, during heavy filesystem
activity on the md device (RAID-0 stripe).

I'm going to watch this webserver cluster like a hawk, and if/when I can
capture a crash (and KDB backtrace) I will copy this information to
Andrea and the list.

Also I want to find out if any changes 2.2.13->2.2.14 will help me
(which haven't been addressed by the "aa6" patchset already). Are there
any of Andrea's fixes in aa6 which have NOT been folded into 2.2.14?

Finally, I want to try a different netcard (3Com?) in case eepro100.c is
buggy.

As mentioned, these servers need to run rock-stable, handling in excess
of 10 million hits per day (per server); I will be deploying 100 of
these Linux/Apache webservers to handle > 1 billion transactions per
day.

Any/all ideas/comments/patches/advice greatly appreciated.

Thanks to all of those who've actually read this far :)

PS: As a high-performance linux/webserver issue ... Has anyone here had
success with the "10xpatches" from SGI for Apache
(http://oss.sgi.com/apache/)? They seem similar to the TOP_FUEL Apache
patches (from Dean Gaudet or Dan Kegel? can't remember now)

--
RW <ur...@usa.net>
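As a rough check on the load figures quoted in the last message, the per-day numbers can be turned into average request rates. The sketch below is plain arithmetic in Python. The only inputs are the two figures stated in the post (10 million hits per day per server, 100 servers), and the function name `average_rate` is made up for illustration. These are averages over a full day, so real peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the post; everything derived from them is an average.
hits_per_server_per_day = 10_000_000  # "in excess of 10 million hits per day"
servers = 100                         # planned number of webservers


def average_rate(hits_per_day):
    """Average hits per second for a given daily total."""
    return hits_per_day / SECONDS_PER_DAY


cluster_hits_per_day = hits_per_server_per_day * servers

print(f"per server: {average_rate(hits_per_server_per_day):.1f} hits/s")
print(f"cluster:    {cluster_hits_per_day:,} hits/day, "
      f"{average_rate(cluster_hits_per_day):.1f} hits/s")
```

At exactly 10 million hits per day a server averages about 116 hits per second, and 100 such servers reach exactly 1 billion hits per day, so the "> 1 billion" target depends on each server handling somewhat more than 10 million.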