From: i-wear-spammers-skulls-jan-2...@paypc.com (Lord Apollyon)
Subject: YIKES!!! A very scary thing happened today [2.2.13] Memory bug?
Date: 2000/01/11
Message-ID: <fa.gs7dt1v.154qjgf@ifi.uio.no>
X-Deja-AN: 571002812
Original-Date: Tue, 11 Jan 2000 08:29:02 +1000
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <200001102229.IAA15164@quanta.paypc.com>
To: linux-ker...@vger.rutgers.edu
Followup-To: comp.os.linux.setup
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Acheron, Ninth Plane of HELL
Keywords: Linux 2.2.13 kernel possible memory bug
X-Warning: Spammers, you will face network disconnection, disembowellment, and eternal damnation. In that order.
Newsgroups: fa.linux.kernel
Original-Newsgroups: comp.os.linux.development.system
X-Loop: majord...@vger.rutgers.edu

(A copy of this message has also been posted to the following newsgroups: comp.os.linux.development.system)

[2.2.13+Solar-Designer's patches, libc5, AMD-K6-200, 64MB physical+64MB swap, started off with Slack 4.0]

Well, I get into work and my very hard-working and faithful Linux server which runs my entire organisation showed a large number of dead services. At first I suspected the worst - HACKERS.

klogd was dead, as were nmbd [Netbios name services], my checkups monitor, powerd, rpc.mountd, rpc.nfsd, and atalkd [AppleTalk protocol daemon for the netatalk file server]. The system was still operational of course, but limping along a bit.

I restarted the dead services without incident, and immediately scrutinised my logs. There were the usual hacker scans and probes of port 137 and 1080 - but no signs of intrusions, modified kernel, libraries, or system binaries.

I was still very nervous - and went to my router logs, and didn't see anything that would reveal signs of a successful break-in. No bizarre net connections or funky network activity at all during the night, other than the port 137 stuff.
[I have a set of default deny inbound firewall rules in force which block just about everything except web, ftp, ssh, very limited telnet, a secure POP3, and of course SMTP [patched sendmail 8.9.3 to remove the alias rebuild DoS]]

I was just calming down but still rather perplexed when I typed "dmesg" just to see if the kernel barfed any messages about the process closures... and I saw the final clue which unravelled the puzzle:

Out of memory for nmbd.
Out of memory for klogd.
Out of memory for atalkd.
Out of memory for checkups.
Out of memory for powerd.
Out of memory for rpc.mountd.
Out of memory for rpc.nfsd.
Out of memory for tiff2bin.

Aha! tiff2bin is a utility I wrote to rapidly print faxes to HP Laserjet printers at extremely high speed by simply converting the tiff images to 200x200 dpi [line doubling the low quality faxes], and then raster-compressing them using HP's FASST! algorithms. It works great, however, the TIFF library is a bit of a bitch to deal with so I use the "read entire file into core" method. These in-core images can require over 8MB chunks of memory to convert.

My incoming fax logs confirm that there was indeed a fax [I use HylaFax 4.1beta2] at the time of this meltdown.

I run squid 2.2 STABLE 5, and it was using nearly 18MB of memory at the time. It had died with a core swap failure at about this time as well.

My box's "resting" memory usage is as follows [with 10MB used by squid]:

Memory:    Total    Used    Free  Shared  Buffers  Cached
Mem:       63312   60580    2732    9196    21048   20076
Swap:      65988   25004   40984

I don't read all of that to mean I was RAM challenged, however... with all of this floating buffer/cache stuff, it can be hard to tell at times.

Now, my $64,000 question:

Why didn't the system kill the process that was being the pig (tiff2bin)? It seemed to kill processes "early" in the process table to free up memory or something.
powerd and checkups don't allocate memory once they've started up, so they didn't die because they were attempting to allocate memory and failed. And klogd! I've never seen klogd or syslogd ever die like this.

It seems like the kernel just up and decided to clean house or something, axing whichever processes seemed expedient. Hell, I don't mind it killing off processes, but WHY NOT HAVE IT KILL USER/NON-DAEMON processes [or at the very least, non-wheel-grouped/UID=0]?

This was a real pisser of a "mini-crash". Amazingly, most of the remaining services were OK despite no klogd etc. I have a separate logfile for kernel alerts [which is solely used by the Solar Designer no-stack-exec and other security log entries], and it's been empty and devoid of problems for the box's entire 33 day uptime. So I highly doubt it's a problem with his patches. [I do not have auto-trampolines enabled, but I've yet to see a daemon or piece of code on my box which requires them].

My kernel is gcc2.7.2.3 compiled, and it is conservatively configured [very little experimental code if any, other than the Solar D stuff].

I am VERY worried that one (non-root) process requesting a large chunk of memory could cause several vital system services to fail. I also find it extremely worrisome that the memory hog was the last process to get the chop. To me this is counter-intuitive. I'd think the kernel would nix the "youngest" non-root process as being the most anti-social. It killing klogd was the biggest surprise of all.

Is there a "memory fragmentation" issue with Linux 2.2 memory management that can sometimes arise in pathological cases?

It'd be extremely difficult for me to replicate the conditions that led to this situation on my system. It's seen heavy squid, Macintosh as well as Windows file sharing, and the usual things a modern net-using org does to an Intranet box.

Any ideas or explanations (or especially recommendations!)
would be very welcome,

=Rob=

--
The reply-to-address is real and will expire on 12:01AM 1-Feb-2000.
Spammers: You will lose your network access. Guaranteed.
102 domains, 376 web-accounts, and 568 dialup ISP accounts flushed.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
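
The "resting" memory figures quoted in the post above can be hard to read because buffers and cache are counted as used. A rough Python sketch of the usual arithmetic follows. It assumes the figures are in kilobytes (the post does not state the unit) and that buffers and cache could all be reclaimed, which is optimistic; the function names are made up for illustration.

```python
# Figures copied from the table in the post; the unit (KB) is an assumption.
mem = {"total": 63312, "used": 60580, "free": 2732,
       "shared": 9196, "buffers": 21048, "cached": 20076}
swap = {"total": 65988, "used": 25004, "free": 40984}


def used_by_processes(mem):
    """RAM counted as used once buffers and page cache are subtracted."""
    return mem["used"] - mem["buffers"] - mem["cached"]


def ram_headroom(mem):
    """Free RAM plus buffers and cache, assuming all of it is reclaimable."""
    return mem["free"] + mem["buffers"] + mem["cached"]


def total_headroom(mem, swap):
    """RAM headroom plus unused swap."""
    return ram_headroom(mem) + swap["free"]


if __name__ == "__main__":
    print("used by processes (KB):", used_by_processes(mem))
    print("RAM headroom (KB):     ", ram_headroom(mem))
    print("RAM + swap headroom:   ", total_headroom(mem, swap))
```

Under those assumptions the resting figures leave tens of megabytes of headroom, which fits the poster's reading that the box was not obviously short of RAM. It does not explain what the state was at the moment of the failure, when squid had grown and tiff2bin was running.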
From: Rik van Riel <r...@nl.linux.org>
Subject: Re: YIKES!!! A very scary thing happened today [2.2.13] Memory bug?
Date: 2000/01/11
Message-ID: <fa.lc0bbbv.d4gg08@ifi.uio.no>#1/1
X-Deja-AN: 571340916
Original-Date: Tue, 11 Jan 2000 15:16:27 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.10001111511380.329-100000@mirkwood.dummy.home>
References: <fa.gs7dt1v.154qjgf@ifi.uio.no>
To: Lord Apollyon <i-wear-spammers-skulls-jan-2...@paypc.com>
X-Sender: r...@mirkwood.dummy.home
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://www.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organisation: NL.linux.org (http://www.nl.linux.org/)
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, 11 Jan 2000, Lord Apollyon wrote:

> Out of memory for nmbd.
> Out of memory for klogd.
> Out of memory for atalkd.
> Out of memory for checkups.
> Out of memory for powerd.
> Out of memory for rpc.mountd.
> Out of memory for rpc.nfsd.
> Out of memory for tiff2bin.

And all of this with tiff2bin being the obvious `guilty party'...

> Now, my $64,000 question:
>
> Why didn't the system kill the process that was being the pig
> (tiff2bin)? It seemed to kill processes "early" in the process
> table to free up memory or something. powerd and checkups don't
> allocate memory once they've started up, so they didn't die
> because they were attempting to allocate memory and failed. And
> klogd! I've never seen klogd or syslogd ever die like this.

I have an (old) patch that does try to find the `guilty party' and get rid of that, and only that.

I'll soon move house and have a test machine (at the office); then I'll port the patch to 2.2.14 and 2.3.<new> and will fix the last few bugs in it.

People without that patience are welcome to take a look at my patches page, grab the (old) OOM killer patch and do the obvious things (compiling, reading it, sending me comments, etc)...
http://www.nl.linux.org/~riel/patches/

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
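
The reply above mentions a patch that tries to find the `guilty party' and kill only that. The patch itself is not reproduced in this thread, so what follows is only a rough Python sketch of what a victim-selection heuristic along the lines discussed (prefer large, young, non-root processes) might look like. It is not the patch's logic and not what 2.2.13 does. The Proc record, the badness() formula, the root discount, and every figure except the roughly 8MB for tiff2bin and 18MB for squid are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    # Hypothetical process record, not a kernel structure.
    name: str
    rss_kb: int   # resident memory in KB
    uid: int      # 0 means root
    age_s: int    # seconds since the process started


def badness(p: Proc) -> float:
    """Higher score means a more likely kill candidate (assumed formula)."""
    score = float(p.rss_kb)
    if p.uid == 0:
        score /= 4  # assumed discount: spare root-owned daemons
    # Assumed discount for long-running processes: the older, the safer.
    score /= (1 + p.age_s / 3600) ** 0.5
    return score


def pick_victim(procs):
    """Return the single process with the highest badness score."""
    return max(procs, key=badness)


UPTIME_S = 33 * 24 * 3600  # the 33-day uptime mentioned in the thread

procs = [
    # Sizes for klogd and powerd, all uids, and all ages are made up.
    Proc("klogd", 500, 0, UPTIME_S),
    Proc("powerd", 300, 0, UPTIME_S),
    Proc("squid", 18000, 99, UPTIME_S),
    Proc("tiff2bin", 8192, 10, 5),
]

if __name__ == "__main__":
    print("victim:", pick_victim(procs).name)
```

With these made-up numbers the young, large, non-root tiff2bin scores highest and the long-running daemons are left alone, which is the behaviour the original poster was asking for. The weights are arbitrary, and a real policy would need far more care.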