Big SMP machine hangs often

From: Miquel van Smoorenburg <miqu...@cistron.nl>
Subject: Big SMP machine hangs often [debug included]
Date: 2000/05/17
Message-ID: <fa.d8edb3v.1lk258q@ifi.uio.no>#1/1
X-Deja-AN: 624615735
Original-Date: Wed, 17 May 2000 19:06:43 +0200
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <20000517190643.A9020@cistron.nl>
To: linux-ker...@vger.rutgers.edu
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-NCC-RegID: nl.cistron
Organization: Internet mailing list
Mime-Version: 1.0
User-Agent: Mutt/1.0i
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

[Apologies if you see this twice, but after 3 hours I still haven't
 seen my original posting on this list]

We sold a customer a big AMI Megaplex (4xPIII/500, 2GB RAM) server
but as soon as they put any load on it, they see the following
problems:

- sometimes the I/O subsystem "hangs" for 10-20 seconds
- every few days the server just hangs. Doesn't respond to pings, nothing.
  We need to press the RESET switch....

Config is:

- AMI Megaplex 4xPIII/500 2 GB RAM
- AMI MegaRaid EC9F:1.24 controller with 4 18 GB disks in RAID5 mode
- Linux 2.2.13 kernel compiled with gcc 2.7.2.3 and SMP / 2GB support

Today the machine hung again, but it did still respond to SYSRQ, so
I got the following debug output. I'd appreciate it if someone could
take a look and say if this is something that 2.2.14/2.2.15 should
solve, or that it is something else. It looks like the kernel gets
stuck in add_timer/timer_bh somehow. Note also the strange "buffer hashed"
output.

Unfortunately there is no way to force an OOPS using sysrq right now,
so I do not have a complete stack trace.

[short System.map fragment]
80110cf8 T add_timer		<-----
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout

80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh		<-----
801123ac T do_timer
80112400 T sys_alarm
[.....]

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d6584 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: fbdeb72c EBX: aa4489c0 ECX: 801d6558 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d664c EBX: 00000246 ECX: aa448af4 EDX: 000000ec
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000287
EAX: f67045ac EBX: aa4489c0 ECX: 801d63c0 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6504 EBX: 00000246 ECX: aa448af4 EDX: 0000009a
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6318 EBX: 00000246 ECX: aa448af4 EDX: 0000001f
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d665c EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000
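
The two EIP values seen above can be matched against the System.map
fragment by taking the last symbol whose address is not greater than
the EIP. The sketch below does that in Python, using only the symbols
quoted in this post. The helper names (parse_system_map, resolve) are
made up for illustration. Because the fragment is partial and static or
inlined functions do not show up as separate symbols, the result only
says which listed symbol the address falls under.

```python
import bisect

# Symbols copied from the System.map fragment quoted in this post.
SYSTEM_MAP = """\
80110cf8 T add_timer
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout
80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh
801123ac T do_timer
80112400 T sys_alarm
"""


def parse_system_map(text):
    """Return a sorted list of (address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        addr, _kind, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, eip):
    """Return (name, offset) of the last symbol at or below eip, or None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, eip) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, eip - base


SYMBOLS = parse_system_map(SYSTEM_MAP)

# The two distinct EIP values from the "Show Regs" dumps.
for eip in (0x8011234C, 0x80110E78):
    name, offset = resolve(SYMBOLS, eip)
    print(f"{eip:08x} -> {name}+0x{offset:x}")
```

With the fragment as quoted, 8011234c falls under timer_bh and 80110e78
under add_timer, which is where the add_timer/timer_bh guess comes from.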

SysRq: Show Memory
Mem-info:
Free pages:     71316kB
 ( Free: 17829 (256 512 768)
NonDMA: 8179*4kB 3695*8kB 489*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 70772kB)
DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 544kB)
Swap cache: add 6339, delete 6330, find 329846/339376
Free swap:      129880kB
507904 pages of RAM
5456 reserved pages
52779 pages shared
9 pages swap cached
361640 pages in file cache
361649 pages in page cache
24 pages in page table cache
Buffer memory:  282680kB
Buffer heads:   71303
Buffer blocks:  71267
Buffer hashed:  -3666981
   CLEAN: 70927 buffers, 83 used (last=70758), 0 locked, 0 protected, 0 dirty
  LOCKED: 74 buffers, 0 used (last=0), 0 locked, 0 protected, 0 dirty
   DIRTY: 190 buffers, 9 used (last=177), 0 locked, 0 protected, 190 dirty
Networking buffers in use         : 780
Total network buffer allocations   : 2201836468
Total failed network buffer allocs : 0
IP fragment buffer size         : 0
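
As a sanity check on the dump, the free-list figures can be added up
and compared with the "Free pages" line. The short Python sketch below
does only that arithmetic, with the block counts copied from the output
above and a 4 kB page size assumed. It says nothing about why "Buffer
hashed" is negative; it only shows that the free-memory numbers agree
with each other.

```python
PAGE_KB = 4  # assumed page size on this i386 box

# Free blocks per block size in kB, copied from the Mem-info output.
NON_DMA = {4: 8179, 8: 3695, 16: 489, 32: 5, 64: 2,
           128: 1, 256: 1, 512: 0, 1024: 0, 2048: 0}
DMA = {4: 0, 8: 0, 16: 2, 32: 2, 64: 1,
       128: 1, 256: 1, 512: 0, 1024: 0, 2048: 0}


def free_kb(blocks):
    """Total free memory in kB for one zone's free lists."""
    return sum(size_kb * count for size_kb, count in blocks.items())


non_dma_kb = free_kb(NON_DMA)
dma_kb = free_kb(DMA)
total_kb = non_dma_kb + dma_kb

print("NonDMA kB:", non_dma_kb)
print("DMA kB:", dma_kb)
print("Total kB:", total_kb)
print("Total pages:", total_kb // PAGE_KB)
```

The totals come out at 70772 kB and 544 kB, together 71316 kB or 17829
pages, matching the "Free pages" and "Free:" lines.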

Mike.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Alan Cox <a...@lxorguk.ukuu.org.uk>
Subject: Re: Big SMP machine hangs often [debug included]
Date: 2000/05/17
Message-ID: <fa.fiiqv8v.12min8s@ifi.uio.no>#1/1
X-Deja-AN: 624650271
Original-Date: Wed, 17 May 2000 19:43:53 +0100 (BST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E12s8nT-00005F-00@the-village.bc.nu>
Content-Transfer-Encoding: 7bit
References: <fa.d8edb3v.1lk258q@ifi.uio.no>
To: miqu...@cistron.nl (Miquel van Smoorenburg)
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

> - AMI MegaRaid EC9F:1.24 controller with 4 18 GB disks in RAID5 mode
> - Linux 2.2.13 kernel compiled with gcc 2.7.2.3 and SMP / 2GB support

Get at least the 1.07b MegaRAID driver and also the 3.10 or higher firmware.

> solve, or that it is something else. It looks like the kernel gets
> stuck in add_timer/timer_bh somehow. Note also the strange "buffer hashed"
> output.

Continually adding a timer that expires now or in the past can cause this
hang. Try 2.2.15 and the delack-5 patch from Andrea. Andrea also did a patch
to stop timers that get queued this way from hanging the box, so you can
debug it.
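
The failure mode described here can be shown with a toy model. This is
not kernel code: it is a simplified Python sketch which assumes that the
timer bottom half keeps running handlers until no expired timer is left,
and that a timer added with an expiry of "now" or earlier counts as
already expired. The function names and the max_calls guard are
invented for the illustration; the real loop has no such guard.

```python
from collections import deque


def run_expired_timers(initial_timers, now, max_calls=1000):
    """Run every timer whose expiry is <= now.

    Returns (calls_made, drained). drained is False if expired timers
    were still being found after max_calls handler invocations.
    """
    pending = deque(initial_timers)  # (expires, handler) pairs
    later = []                       # timers that are really in the future
    calls = 0
    while pending:
        if calls >= max_calls:
            return calls, False      # give up; the model would spin forever
        expires, handler = pending.popleft()
        if expires > now:
            later.append((expires, handler))
            continue
        calls += 1
        # A handler re-arms itself by returning a new expiry (or None).
        new_expires = handler(now)
        if new_expires is not None:
            pending.append((new_expires, handler))
    return calls, True


def rearm_in_future(now):
    return now + 10   # well-behaved: next run is 10 ticks away


def rearm_immediately(now):
    return now        # buggy: new expiry is already due


print(run_expired_timers([(5, rearm_in_future)], now=5))
print(run_expired_timers([(5, rearm_immediately)], now=5))
```

The well-behaved timer runs once and the loop finishes. The timer that
re-arms itself for "now" is still being run when the guard trips, which
in this model corresponds to a CPU that never leaves the timer code.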

Alan


