From: torva...@transmeta.com (Linus Torvalds)
Subject: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980507192644.21545A-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351312390
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
Newsgroups: muc.lists.linux-kernel

Ok, I just released 2.1.100, which does:
 - fix an ugly lockup on SMP that could fairly easily happen if you used
   your floppies.
 - various irq/apic fixes - this should get us back to booting on the
   machines that had problems with the earlier versions.
 - capabilities stuff - get rid of many suser() calls to instead use the
   more finegrained capabilities.
 - IDE driver updates
 - Coda FS update
 - various network fixes from David (the oops in the TCP hashing stuff
   fixed etc)

As has already been found out by earlier testers, pppd has problems with
newer kernels. The problems are:
 - using "strcmp()" to do numeric comparisons. That's a no-no, pppd needs
   to be fixed (patches have floated around). It breaks because it thinks
   100 is smaller than 16.
 - doing a route on a downed device doesn't work in recent kernels (sanely
   enough). Again, a patch to pppd is available.

Do people still have problems with lockups or bootup under SMP with this?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
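
The strcmp() point in the announcement above can be shown in a few lines of C. This is only a sketch of the failure mode, not pppd's actual code (which is not shown in this thread); the function names `patch_is_older_strcmp` and `release_cmp`, and the assumption that the release string has the form "major.minor.patch", are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Buggy style of check: strcmp() orders strings character by character,
 * so "100" sorts before "16" because '0' < '6' at the second character. */
int patch_is_older_strcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) < 0;
}

/* Compare "major.minor.patch" release strings numerically.
 * Returns <0, 0 or >0 in the style of strcmp().
 * Simplification: returns 0 if either string fails to parse; real code
 * should report that as an error instead. */
int release_cmp(const char *a, const char *b)
{
    int va[3], vb[3];

    assert(a != NULL && b != NULL);
    if (sscanf(a, "%d.%d.%d", &va[0], &va[1], &va[2]) != 3 ||
        sscanf(b, "%d.%d.%d", &vb[0], &vb[1], &vb[2]) != 3)
        return 0;

    /* Compare component by component, as integers. */
    for (int i = 0; i < 3; i++)
        if (va[i] != vb[i])
            return va[i] < vb[i] ? -1 : 1;
    return 0;
}
```

With these definitions, `patch_is_older_strcmp("100", "16")` is true, while `release_cmp("2.1.100", "2.1.16")` is positive, which is the ordering a version check presumably wants.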
From: ste...@eecs.umich.edu (Steve Hsieh)
Subject: Re: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.96.980508002933.16483C-100000@bigfoot.eecs.umich.edu>#1/1
X-Deja-AN: 351332097
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.95.980507192644.21545A-100000@penguin.transmeta.com>
Newsgroups: muc.lists.linux-kernel

> Do people still have problems with lockups or bootup under SMP with this?
>
> 		Linus

Yes, on my quad ppro alder, I still get lockups with 2.1.100.
The kernel is still running, but I can't start any new processes.
Existing ones hang if they require disk access. Existing shells, login
are still active, but will hang as soon as you try to do something.
alt-sysreq still works, although I don't know how to interpret any of that
info either.
From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980507224726.22916B-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351342903
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.96.980508002933.16483C-100000@bigfoot.eecs.umich.edu>
Newsgroups: muc.lists.linux-kernel

On Fri, 8 May 1998, Steve Hsieh wrote:
>
> Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> The kernel is still running, but I can't start any new processes.
> Existing ones hang if they require disk access. Existing shells, login
> are still active, but will hang as soon as you try to do something.
> alt-sysreq still works, although I don't know how to interpret any of that
> info either.

Ok, it appears that we have a bad case of lost interrupts, where
everything gets stuck in disk-wait. Does a Ctrl+ScrollLock show processes
in "D" state?

		Linus
From: ste...@eecs.umich.edu (Steve Hsieh)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <Pine.LNX.3.96.980508183951.13743C-100000@bigfoot.eecs.umich.edu>#1/1
X-Deja-AN: 351686140
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.95.980507224726.22916B-100000@penguin.transmeta.com>
Newsgroups: muc.lists.linux-kernel

On Thu, 7 May 1998, Linus Torvalds wrote:

> On Fri, 8 May 1998, Steve Hsieh wrote:
> >
> > Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> > The kernel is still running, but I can't start any new processes.
> > Existing ones hang if they require disk access. Existing shells, login
> > are still active, but will hang as soon as you try to do something.
> > alt-sysreq still works, although I don't know how to interpret any of that
> > info either.
>
> Ok, it appears that we have a bad case of lost interrupts, where
> everything gets stuck in disk-wait. Does a Ctrl+ScrollLock show processes
> in "D" state?

Yes, I see quite a few processes stuck in "D" state. It was first
'update', and then after that every process I tried got stuck in "D" until
I ran out of windows that worked. :)
From: dledf...@dialnet.net (Doug Ledford)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <3553971C.F5B7109A@dialnet.net>#1/1
X-Deja-AN: 351688834
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.96.980508183951.13743C-100000@bigfoot.eecs.umich.edu>
Newsgroups: muc.lists.linux-kernel

Steve Hsieh wrote:
>
> On Thu, 7 May 1998, Linus Torvalds wrote:
>
> > On Fri, 8 May 1998, Steve Hsieh wrote:
> > >
> > > Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> > > The kernel is still running, but I can't start any new processes.
> > > Existing ones hang if they require disk access. Existing shells, login
> > > are still active, but will hang as soon as you try to do something.
> > > alt-sysreq still works, although I don't know how to interpret any of that
> > > info either.
> >
> > Ok, it appears that we have a bad case of lost interrupts, where
> > everything gets stuck in disk-wait. Does a Ctrl+ScrollLock show processes
> > in "D" state?
>
> Yes, I see quite a few processes stuck in "D" state.

He has processes stuck in a "D" state, but I don't think lost interrupts are
his problem. He has 4 PPro processors in an Alder MB, and three aic7xxx
controllers on three different IRQs. The aic7xxx driver doesn't care if we
lose an interrupt as long as one will come along later to alleviate the
problem. IOW, we clear a complete queue on each interrupt, regardless of the
number of entries in that complete queue. Besides, the aic7xxx PCI hardware
uses level sensitive interrupts, and if we don't turn those interrupts off
then we simply get more interrupts even without further completion events
pending. We actually depend on this behavior on the PCI cards to detect PCI
bus parity problems as well.

I would be more suspicious that there is some area somewhere that has been
overlooked in the mid level SCSI code that allows something along the lines
of the enable_IOapic_irq() problem to occur (specifically, that while inside
of a spin lock, we can attempt a recursive entry on the spin lock) or that
something is causing our local cli() state to get lost while in the spin
lock, then we take an interrupt from another IRQ on the same processor that
also wants to grab the io_request_lock, resulting in deadlock.

--
 Doug Ledford  <dledf...@dialnet.net>
  Opinions expressed are my own, but they should be everybody's.
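
The second scenario Doug describes (local cli() state lost while a spin lock is held, then an interrupt on the same processor wants the same lock) can be sketched as a small single-threaded state model. This is a toy, not kernel code: `struct toy_cpu` and all `toy_*` functions are hypothetical names invented here, and the model only tracks two flags rather than spinning on a real lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one CPU; both fields are illustrative assumptions. */
struct toy_cpu {
    bool irqs_enabled;  /* models the local cli()/sti() state */
    bool holds_lock;    /* this CPU currently holds the (non-recursive) lock */
};

enum toy_irq_result { TOY_IRQ_DEFERRED, TOY_IRQ_HANDLED, TOY_IRQ_DEADLOCK };

/* Take the lock with local interrupts disabled. */
void toy_lock_irqsave(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->irqs_enabled = false;
    cpu->holds_lock = true;
}

/* Model the bug: something re-enables interrupts while the lock is held. */
void toy_lose_cli_state(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->irqs_enabled = true;
}

/* Drop the lock and re-enable local interrupts. */
void toy_unlock_irqrestore(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->holds_lock = false;
    cpu->irqs_enabled = true;
}

/* An interrupt arrives on this CPU and its handler needs the same lock. */
enum toy_irq_result toy_deliver_irq(const struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    if (!cpu->irqs_enabled)
        return TOY_IRQ_DEFERRED;  /* masked: handler runs after the unlock */
    if (cpu->holds_lock)
        return TOY_IRQ_DEADLOCK;  /* handler would spin on a lock its own CPU holds */
    return TOY_IRQ_HANDLED;
}
```

In this model, an interrupt delivered between `toy_lock_irqsave()` and `toy_unlock_irqrestore()` is merely deferred, but once `toy_lose_cli_state()` has run the same delivery reports a deadlock: the handler can never get the lock because the code it interrupted can never release it.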
From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980508163902.25393I-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351688800
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <3553971C.F5B7109A@dialnet.net>
Newsgroups: muc.lists.linux-kernel

On Fri, 8 May 1998, Doug Ledford wrote:
>
> He has processes stuck in a "D" state, but I don't think lost interrupts are
> his problem. He has 4 PPro processors in an Alder MB, and three aic7xxx
> controllers on three different IRQs. The aic7xxx driver doesn't care if we
> lose an interrupt as long as one will come along later to alleviate the
> problem.

Note that this problem sounds like either of:
 - egcs problem. Make sure to compile with a standard gcc or at least with
   a plain -O2 (check your gcc options file whether that contains
   additional default options), there is one confirmed report that egcs
   didn't boot with certain options to egcs even though it works fine with
   others.
 - something makes us lose the io-apic completely for a certain interrupt.
   I don't see anything that could do that, but the behaviour sounds like
   we just simply no longer get interrupts from the controller - at all.

   For example, let's assume that you have an interrupt on irq16 through a
   PCI device, and 2.1.100 for some reason doesn't ACK it. You'll still
   continue to get interrupts for higher priority events, but not for that
   irq or for anything lower.

   I don't think this is what 2.1.100 does, because the priorities are
   reverse from what this would indicate, but there may be something we've
   missed.

However, at this point I do know that the compiler makes a difference, so
I'd ask everybody to make sure they are using gcc-2.7.2 if they have the
problem.

		Linus
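
The "doesn't ACK it" example in the message above can be illustrated with a toy priority model. It is a deliberately simplified sketch and not how 2.1.100 or any real APIC is programmed: `struct toy_apic`, the `toy_*` functions and the 0..31 priority range are assumptions made here, with a larger number meaning higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy interrupt controller: one bit per priority level that has been
 * delivered to the CPU but not yet ACKed. */
struct toy_apic {
    uint32_t in_service;
};

/* Highest priority still awaiting an ACK, or -1 if none. */
static int toy_highest_in_service(const struct toy_apic *a)
{
    for (int p = 31; p >= 0; p--)
        if (a->in_service & (UINT32_C(1) << p))
            return p;
    return -1;
}

/* Try to deliver an interrupt at the given priority.
 * It is blocked if an un-ACKed interrupt of equal or higher priority exists. */
bool toy_raise(struct toy_apic *a, int prio)
{
    assert(a != NULL);
    if (prio < 0 || prio > 31)
        return false;
    if (prio <= toy_highest_in_service(a))
        return false;
    a->in_service |= UINT32_C(1) << prio;
    return true;
}

/* ACK the highest-priority interrupt currently in service. */
void toy_ack(struct toy_apic *a)
{
    assert(a != NULL);
    int p = toy_highest_in_service(a);
    if (p >= 0)
        a->in_service &= ~(UINT32_C(1) << p);
}
```

If an interrupt at priority 5 is delivered and never ACKed, later calls to `toy_raise()` at priority 5 or below return false, while one at priority 9 still gets through. That matches the symptom described: higher-priority events keep arriving, but the stuck irq and everything below it go silent.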