Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site stratus.UUCP
Path: utzoo!watmath!clyde!bonnie!akgua!gatech!stratus!spaf
From: spaf@stratus.UUCP (Gene Spafford)
Newsgroups: net.unix-wizards
Subject: What is this panic?
Message-ID: <110@stratus.UUCP>
Date: Wed, 28-Nov-84 13:57:40 EST
Article-I.D.: stratus.110
Posted: Wed Nov 28 13:57:40 1984
Date-Received: Fri, 30-Nov-84 04:38:10 EST
Distribution: net
Organization: The Clouds Project, School of ICS, Georgia Tech
Lines: 90

We've just recently brought up 4.2 BSD on 3 750 Vaxen. Each Vax is
configured with 2 or 3Mbyte memory, Rev 7 CPU boards, DEUNA ethernet
drivers, a DZ-32 board, an RL02 disk, and a UDA50 disk controller
running 1, 2, or 3 RA81 disks.

All three machines keep dying with a (claimed) tbuf parity problem.
However, the value in the mcesr register indicates a bus error rather
than a tbuf error. The PC of the last few faults was in some of the
UDA50 code (udrsp) or the hardclock routine, while others were in user
address space; this seems to rule out any likely direct correspondence
with any particular software module. The problem appears whenever the
machines are under load, but there is no sure way to bring on the
problem. Rebuilding 2 or 3 copies of Unix at the same time seems to
bring it on regularly, but not all the time.

This problem is rather annoying, to say the least, and I've had little
success either tracking the problem down or getting much co-operation
out of some of our local DEC people ("If it isn't a problem that occurs
with DEC software, it isn't our problem.").

Has anybody out there seen this before? Anybody have a fix or
suggestion where I go from here? If so, please drop me some mail
(I don't always have time to read my news in this group).

I'm enclosing some samples of the error summary printed on the console
(and log) whenever the problem occurs.

Thanks in advance.
--gene

machine check 2: cp tbuf par fault
        va 12f90 errpc 6dfc mdr aaaaaaaa smr b rdtimo 0 tbgpar 0 cacherr 5
        buserr 6 mcesr 9 pc 6df0 psl 3c00004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 7fffeb38 errpc 157d mdr 2d smr b rdtimo 0 tbgpar 0 cacherr 5
        buserr 6 mcesr 9 pc 157b psl 3c00004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 7fffec6c errpc c5f8 mdr 0 smr b rdtimo 0 tbgpar 0 cacherr 4
        buserr 6 mcesr 9 pc c5f8 psl 3c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 8017dfd4 errpc 8001c1f8 mdr 7c smr 8 rdtimo 0 tbgpar 0 cacherr 5
        buserr 6 mcesr 9 pc 8001c1f3 psl c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4
        buserr 6 mcesr 9 pc 800271f9 psl 4150004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 5
        buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4
        buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
        va 7ffff1a4 errpc 8000b188 mdr 7ffff184 smr 8 rdtimo 0 tbgpar 0 cacherr 5
        buserr 6 mcesr 9 pc 8000b186 psl c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

--
Off the Wall of Gene Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet:  Spaf @ GATech        ARPA:  Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp:   ...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-sally}!gatech!spaf
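For anyone wanting to tabulate dumps like the ones above, here is a rough Python sketch (obviously not something from the original post). It parses the register lines into name/value pairs and sorts the trapped PCs into user and system space, relying only on the fact that VAX system virtual addresses start at 0x80000000. The helper names (parse_dump, pc_space) are made up for illustration, and the sketch does not try to decode the individual bits of mcesr or buserr.

```python
from collections import Counter

# Register lines copied from the eight console summaries in the post above.
SAMPLES = """\
va 12f90 errpc 6dfc mdr aaaaaaaa smr b rdtimo 0 tbgpar 0 cacherr 5 buserr 6 mcesr 9 pc 6df0 psl 3c00004 mcsr 80016
va 7fffeb38 errpc 157d mdr 2d smr b rdtimo 0 tbgpar 0 cacherr 5 buserr 6 mcesr 9 pc 157b psl 3c00004 mcsr 80016
va 7fffec6c errpc c5f8 mdr 0 smr b rdtimo 0 tbgpar 0 cacherr 4 buserr 6 mcesr 9 pc c5f8 psl 3c00000 mcsr 80016
va 8017dfd4 errpc 8001c1f8 mdr 7c smr 8 rdtimo 0 tbgpar 0 cacherr 5 buserr 6 mcesr 9 pc 8001c1f3 psl c00000 mcsr 80016
va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4 buserr 6 mcesr 9 pc 800271f9 psl 4150004 mcsr 80016
va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 5 buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4 buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
va 7ffff1a4 errpc 8000b188 mdr 7ffff184 smr 8 rdtimo 0 tbgpar 0 cacherr 5 buserr 6 mcesr 9 pc 8000b186 psl c00000 mcsr 80016
"""

SYSTEM_SPACE = 0x80000000  # VAX system (kernel) virtual addresses start here


def parse_dump(line):
    """Turn 'name hex name hex ...' into a dict mapping name -> int."""
    tokens = line.split()
    return {name: int(value, 16) for name, value in zip(tokens[0::2], tokens[1::2])}


def pc_space(pc):
    """Classify a trapped PC as user or system space (hypothetical helper)."""
    return "system" if pc >= SYSTEM_SPACE else "user"


dumps = [parse_dump(line) for line in SAMPLES.splitlines() if line.strip()]

# Where was the machine executing when each check hit?
by_space = Counter(pc_space(d["pc"]) for d in dumps)

# Which values does the mcesr register take across the samples?
mcesr_values = Counter(d["mcesr"] for d in dumps)

print("trapped PCs by address space:", dict(by_space))
print("mcesr values seen:", {hex(k): n for k, n in mcesr_values.items()})
print("tbgpar always zero:", all(d["tbgpar"] == 0 for d in dumps))
```

The tally only restates what the post already says: the faults land in both user and system space, and the tb group parity register reads zero in every sample even though the summary line claims a tbuf parity fault.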
Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: Notesfiles $Revision: 1.6.2.17 $; site uiucdcs.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxj!ihnp4!inuxc!pur-ee!uiucdcs!irwin
From: irwin@uiucdcs.UUCP
Newsgroups: net.unix-wizards
Subject: Re: What is this panic?
Message-ID: <13700083@uiucdcs.UUCP>
Date: Fri, 30-Nov-84 16:49:00 EST
Article-I.D.: uiucdcs.13700083
Posted: Fri Nov 30 16:49:00 1984
Date-Received: Sun, 2-Dec-84 03:35:29 EST
References: <110@stratus.UUCP>
Lines: 14
Nf-ID: #R:stratus:-11000:uiucdcs:13700083:000:624
Nf-From: uiucdcs!irwin Nov 30 15:49:00 1984

We have 7 Vax-750s. Ours are all at Rev 3 CPU. We see the exact panic
you have described on 3 of the 7 quite frequently. We have a Ramtek on
one that can cause a crash from time to time, but this is a different
bug. We know that if the CPU were at Rev 7 the Ramtek bug would go
away. We have wondered whether the panic bug would disappear if our
machines were at Rev 7. Yours are at Rev 7 and still do it, so that
answers the question. What causes it, we still do not know.

Our disks are on CMI controllers: some CDC 300MB, some Eagles, some
CDC 80MB. Two of the 7 have hardware floating point; no two of the
seven are configured alike.
Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: Notesfiles $Revision: 1.6.2.17 $; site uiucdcs.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!inuxc!pur-ee!uiucdcs!irwin
From: irwin@uiucdcs.UUCP
Newsgroups: net.unix-wizards
Subject: Re: What is this panic?
Message-ID: <13700084@uiucdcs.UUCP>
Date: Tue, 4-Dec-84 10:06:00 EST
Article-I.D.: uiucdcs.13700084
Posted: Tue Dec 4 10:06:00 1984
Date-Received: Thu, 6-Dec-84 03:48:38 EST
References: <110@stratus.UUCP>
Lines: 26
Nf-ID: #R:stratus:-11000:uiucdcs:13700084:000:1539
Nf-From: uiucdcs!irwin Dec 4 09:06:00 1984

What is this panic?

I had the opportunity to sit down and have a talk with the manager of
our local branch of DEC. To quote, "The 750s have a problem with
translation buffer parity errors when running 4.2BSD, if the Rev 7 has
not been installed. (tbuf panics) These errors go away if the machine
is brought to Rev 7. In addition to this, 4.2BSD also has problems with
cache memory parity errors, which also cause panic type crashes. These
will be fixed with Rev 8, which will be available in the spring of '85.
There are fixes in our software, both VMS and our version of UNIX,
which gets around the bug, but not in 4.2."

I let him read the base note, to which this response is made, and
pointed out that the author stated that their machine was at Rev 7 and
was having tbuf panics. He said that this is the first case he knows of
where that type of crash was a problem on a Rev 7 machine running 4.2,
and that there may well be a hardware problem and a board may need to
be replaced.

As to the memory timeouts in the response before mine, it may well be
that the bug is a bad memory board. If there is a steady stream of
errors, the controller may be busy correcting errors while at the same
time getting more errors, which hangs the controller so long that it
does not respond in time and a timeout is declared.

If 4.2 is being run on this machine, it usually reports memory errors,
and a look in /usr/adm/messages will verify whether there is a bad
board. If there is, and the board can be removed, the timeout bug may
go away.