From: Miquel van Smoorenburg <miqu...@cistron.nl>
Subject: Big SMP machine hangs often [debug included]
Date: 2000/05/17
Message-ID: <fa.d8edb3v.1lk258q@ifi.uio.no>#1/1
X-Deja-AN: 624615735
Original-Date: Wed, 17 May 2000 19:06:43 +0200
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <20000517190643.A9020@cistron.nl>
To: linux-ker...@vger.rutgers.edu
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-NCC-RegID: nl.cistron
Organization: Internet mailing list
Mime-Version: 1.0
User-Agent: Mutt/1.0i
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

[Apologies if you see this twice, but after 3 hours I still haven't
seen my original posting on this list]

We sold a customer a big AMI Megaplex (4xPIII/500, 2GB RAM) server
but as soon as they put any load on it, they see the following
problems:

- sometimes the I/O subsystem "hangs" for 10-20 seconds
- every few days the server just hangs. Doesn't respond to pings,
  nothing. We need to press the RESET switch....

Config is:

- AMI Megaplex 4xPIII/500 2 GB RAM
- AMI MegaRaid EC9F:1.24 controller with 4 18 GB disks in RAID5 mode
- Linux 2.2.13 kernel compiled with gcc 2.7.2.3 and SMP / 2GB support

Today the machine hung again, but it did still respond to SYSRQ, so I
got the following debug output. I'd appreciate it if someone could
take a look and say if this is something that 2.2.14/2.2.15 should
solve, or that it is something else. It looks like the kernel gets
stuck in add_timer/timer_bh somehow. Note also the strange "buffer hashed"
output.

Unfortunately there is no way to force an OOPS using sysrq right now,
so I do not have a complete stack trace.

[short System.map fragment]

80110cf8 T add_timer            <-----
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout
80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh             <-----
801123ac T do_timer
80112400 T sys_alarm

[.....]

SysRq: Show Regs
EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d6584 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: fbdeb72c EBX: aa4489c0 ECX: 801d6558 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d664c EBX: 00000246 ECX: aa448af4 EDX: 000000ec
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<8011234c>] EFLAGS: 00000287
EAX: f67045ac EBX: aa4489c0 ECX: 801d63c0 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6504 EBX: 00000246 ECX: aa448af4 EDX: 0000009a
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6318 EBX: 00000246 ECX: aa448af4 EDX: 0000001f
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs
EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d665c EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Memory
Mem-info:
Free pages: 71316kB
 ( Free: 17829 (256 512 768)
NonDMA: 8179*4kB 3695*8kB 489*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 70772kB)
DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 544kB)
Swap cache: add 6339, delete 6330, find 329846/339376
Free swap: 129880kB
507904 pages of RAM
5456 reserved pages
52779 pages shared
9 pages swap cached
361640 pages in file cache
361649 pages in page cache
24 pages in page table cache
Buffer memory: 282680kB
Buffer heads: 71303
Buffer blocks: 71267
Buffer hashed: -3666981
CLEAN: 70927 buffers, 83 used (last=70758), 0 locked, 0 protected, 0 dirty
LOCKED: 74 buffers, 0 used (last=0), 0 locked, 0 protected, 0 dirty
DIRTY: 190 buffers, 9 used (last=177), 0 locked, 0 protected, 190 dirty
Networking buffers in use : 780
Total network buffer allocations : 2201836468
Total failed network buffer allocs : 0
IP fragment buffer size : 0

Mike.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

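
[Editorial note] The conclusion that the kernel is stuck in add_timer/timer_bh comes from matching each EIP in the register dumps against the System.map fragment. A minimal sketch of that lookup follows. It assumes the quoted fragment lists every symbol in the address range (the post calls it a "short" fragment, so the full map may contain others), and the names parse_system_map and resolve are made up for this illustration.

```python
from bisect import bisect_right

# System.map fragment exactly as quoted in the post: address, type, name.
SYSTEM_MAP = """\
80110cf8 T add_timer
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout
80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh
801123ac T do_timer
80112400 T sys_alarm
"""


def parse_system_map(text):
    """Return a sorted list of (address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        addr, _sym_type, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, eip):
    """Map an address to (symbol, offset) using the nearest symbol at or below it."""
    addrs = [addr for addr, _name in symbols]
    i = bisect_right(addrs, eip) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    addr, name = symbols[i]
    return name, eip - addr


SYMBOLS = parse_system_map(SYSTEM_MAP)

# The two distinct EIP values that appear in the SysRq register dumps.
for eip in (0x8011234C, 0x80110E78):
    name, offset = resolve(SYMBOLS, eip)
    print(f"{eip:08x} -> {name}+0x{offset:x}")
```

Under that assumption, both EIP values from the dumps fall inside the two functions marked with arrows in the fragment.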
From: Alan Cox <a...@lxorguk.ukuu.org.uk>
Subject: Re: Big SMP machine hangs often [debug included]
Date: 2000/05/17
Message-ID: <fa.fiiqv8v.12min8s@ifi.uio.no>#1/1
X-Deja-AN: 624650271
Original-Date: Wed, 17 May 2000 19:43:53 +0100 (BST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <E12s8nT-00005F-00@the-village.bc.nu>
Content-Transfer-Encoding: 7bit
References: <fa.d8edb3v.1lk258q@ifi.uio.no>
To: miqu...@cistron.nl (Miquel van Smoorenburg)
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

> - AMI MegaRaid EC9F:1.24 controller with 4 18 GB disks in RAID5 mode
> - Linux 2.2.13 kernel compiled with gcc 2.7.2.3 and SMP / 2GB support

Get at least the 1.07b MegaRAID driver and also the 3.10 or higher firmware.

> solve, or that it is something else. It looks like the kernel gets
> stuck in add_timer/timer_bh somehow. Note also the strange "buffer hashed"
> output.

Adding a timer continually for now or past can cause this hang. Try 2.2.15
and the delack-5 patch from Andrea. Andrea also did a patch to stop timers
that get queued this way hanging the box so you can debug it

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
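
[Editorial note] Alan's remark that "adding a timer continually for now or past can cause this hang" can be pictured with a toy model. The sketch below is not the 2.2 kernel's timer code. It is a simplified user-space simulation, assuming only that an expiry pass keeps running handlers until no already-due timer remains. The names run_expired, well_behaved and rearm_for_now are hypothetical, and max_steps is added so the demonstration terminates instead of hanging.

```python
from collections import deque


def run_expired(timers, now, max_steps=1000):
    """Run every timer whose expiry is <= now.

    timers is a list of (expires, handler) pairs. A handler returns a new
    expiry to re-arm itself, or None to stop. Returns the number of handler
    calls, or None if max_steps calls were made without the queue of due
    timers draining (the toy equivalent of the hang).
    """
    pending = deque(timers)
    steps = 0
    while pending:
        expires, handler = pending.popleft()
        if expires > now:
            continue  # not due yet; a real timer wheel would keep it for later
        if steps == max_steps:
            return None  # due timers keep reappearing: the pass never finishes
        steps += 1
        new_expiry = handler(now)
        if new_expiry is not None:
            pending.append((new_expiry, handler))
    return steps


def well_behaved(now):
    """Re-arm ten ticks in the future."""
    return now + 10


def rearm_for_now(now):
    """Re-arm for the current tick, the pattern Alan warns about."""
    return now


print(run_expired([(100, well_behaved)], now=100))
print(run_expired([(100, rearm_for_now)], now=100))
```

In this model the well-behaved handler runs once and the pass ends, while the handler that re-arms for the current tick is due again immediately, so the pass only stops because of the artificial step limit.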