From: Jean Francois Martinez <j...@sidney.remcomp.fr> Subject: Bugs and wishes in memory management area Date: 1996/11/19 Message-ID: <199611192039.VAA00436@sidney.remcomp.fr>#1/1 X-Deja-AN: 197500883 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: j...@sidney.remcomp.fr x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

Well, the bug is that some modules fail to load with "unable to get DMA memory". That makes modules unreliable. Of course I don't see why modules are apparently using GFP_ATOMIC | GFP_DMA. When loading a module I think we can swap.

But GFP_ATOMIC allocation could be improved, because we could discard unmodified pages, I think. Correct me if I am wrong.

Wishes for memory management in 2.1:

First: Real swapping. When two processes are fighting for memory and memory is tight, the only way to make progress is to freeze one of them and let the other run unimpeded for a few seconds.

Second: Linux does not write to swap pages that have never been modified, like code pages. This way Linux does not lose time writing them. But because reading from swap is faster, I think that when a page is being recalled for the umpteenth time it would be wise to write it to swap; we will get it back faster next time. However it seems this would greatly complicate the code, so perhaps it is not a good idea. (Of course this does not apply to pages we are discarding to make a hole for DMA.)

-- Jean Francois Martinez
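For readers who have not hit this themselves, the failure being described comes from a driver's init path doing something along the lines of the sketch below. This is only an illustrative sketch of the pattern against 2.0-era interfaces; the module name, buffer size and message text are made up, and the header names are as found in 2.0-era trees:

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/errno.h>
    #include <linux/mm.h>       /* GFP_* flags */
    #include <linux/malloc.h>   /* kmalloc/kfree in 2.0-era trees */

    static char *dma_buf;

    int init_module(void)
    {
            /* Atomic request: the allocator may not sleep or swap anything
             * out to satisfy it, so if no free DMA-capable block of this
             * size exists right now, loading the module simply fails. */
            dma_buf = kmalloc(8192, GFP_ATOMIC | GFP_DMA);
            if (!dma_buf) {
                    printk("exampledrv: unable to get DMA memory\n");
                    return -ENOMEM;
            }
            return 0;
    }

    void cleanup_module(void)
    {
            kfree(dma_buf);
    }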
From: "David S. Miller" <da...@jenolan.rutgers.edu> Subject: Re: Bugs and wishes in memory management area Date: 1996/11/20 Message-ID: <199611200538.AAA05284@jenolan.caipgeneral>#1/1 X-Deja-AN: 197587531 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: da...@jenolan.rutgers.edu references: <199611192039.VAA00436@sidney.remcomp.fr> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel Date: Tue, 19 Nov 1996 21:39:19 +0100 From: Jean Francois Martinez <j...@sidney.remcomp.fr> Well the bug is about some modules failing loading with "unable to get DMA memory". That makes modules unreliable. Of course I don't see why modules are apparently using GFP_ATOMIC | GFP_DMA. When loading a module I think we can swap. No, you cannot page the memory a module uses. I leave the reader to consider the monsterous races assosciated with a kernel that can page itself. But GFP_ATOMIC allocation could be improved because we could discard unmodified pages I think. Correct me if I am wrong. It cannot do this reliably, because there is always the danger of sleeping for some reason. First: Real swapping. When two processes are fighting for memory and memory is tight only way to make progress is freezing one of them and let the other run unimpeded for a few seconds. Consider the live-lock situations that could lead to. Second: Linux does not write to swap, pages who have never been modified like code pages. This way Linux does not lose time wring them. But because reading from swap is faster I think when a page is being recalled for thez upteemth time it would be wise to write it to swap. We will get it faster next times. However it seems it would greatly complicate the code so perhaps it is not a good idea. (Of course this does not apply to pages we are dicarding to make a hole for DMA). We do this, if a page is unmodified and we have an up to date copy on either swap or the binary image from where it was executed. Code pages are modified! The dynamic linker when it performs relocations to the program text to use shared libraries, it does modify and dirty them. So they cannot be just dropped, they must be paged to disk. --------------------------------------------//// Yow! 10.49 MB/s remote host TCP bandwidth //// over 100Mb/s ethernet. Beat that! //// -----------------------------------------////__________ o David S. Miller, da...@caip.rutgers.edu /_____________/ / // /_/ ><
From: Keith Owens <k...@edison.dialix.com.au> Subject: Pageable kernel (was Bugs and wishes in memory management area) Date: 1996/11/20 Message-ID: <Pine.BSF.3.95.961120170907.12479A-100000@edison.dialix.com.au>#1/1 X-Deja-AN: 197596528 sender: owner-linux-ker...@vger.rutgers.edu references: <199611200538.AAA05284@jenolan.caipgeneral> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: k...@edison.dialix.com.au mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On Wed, 20 Nov 1996, David S. Miller wrote:
> No, you cannot page the memory a module uses. I leave the reader to
> consider the monstrous races associated with a kernel that can page
> itself.

It is not impossible in general, but the specific design of the Linux kernel makes it very difficult. For example, MVS can page most of the O/S pages. The MVS nucleus ("kernel") is split into pageable and non-pageable (fixed) areas; the former even use the MMU to map nucleus addresses to physical addresses, and the physical address changes every time they are paged in. Fixed areas are the paging code itself, the common page tables, data that is the subject of current I/O, and various bits of time-critical code that cannot tolerate the extra delay of a page fault or that run with interrupts disabled. Even the page tables for inactive address spaces ("processes") are paged out.

Not that I'm suggesting redesigning the Linux kernel this week to make it pageable. However it may be worth thinking about as a long-term project. Finer kernel locking is a prerequisite, replacing the current global kernel lock.

--- *** O . C. Software -- Consulting Services for Storage Management, *** *** Disaster Recovery, Security, TCP/IP, Intranets and Internet *** *** for mainframes (IBM, Fujitsu, Hitachi) and Unix. ***
From: "David S. Miller" <da...@jenolan.rutgers.edu> Subject: Re: Pageable kernel (was Bugs and wishes in memory management area) Date: 1996/11/20 Message-ID: <199611200628.BAA05367@jenolan.caipgeneral>#1/1 X-Deja-AN: 197716355 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: da...@jenolan.rutgers.edu references: <Pine.BSF.3.95.961120170907.12479A-100000@edison.dialix.com.au> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel Date: Wed, 20 Nov 1996 17:23:07 +1100 (EST) From: Keith Owens <k...@edison.dialix.com.au> On Wed, 20 Nov 1996, David S. Miller wrote: > No, you cannot page the memory a module uses. I leave the reader to > consider the monsterous races assosciated with a kernel that can page > itself. It is not impossible in general but the specific design of the Linux kernel makes it very difficult. For example, MVS can page most of the O/S pages. The MVS nucleus ("kernel") is split into pageable and non-pageable (fixed) areas, the former even use the MMU to map nucleus addresses to physical addresses and the physical address changes everytime they are paged in. Just because it is possible doesn't mean it is not a stupid idea. I think linux will become pagable as soon as it also begins to do system calls via RPC... Oh, and on fine grained locking, another example of how to "get there" but again a stupid way to do it. The answer is not spin locks all over the kernel, every textbook mutual exclusion primitive known to man, and whatnot, to get SMP. I leave such monstrosities to Solaris and others of it's kin, some of which need programs to scan their code for race conditions. I quit the day we need that sort of garbage for Linux. The answer is with self locking data structures and careful coding. A well thought out design, and not something meant to meet a deadline. --------------------------------------------//// Yow! 10.49 MB/s remote host TCP bandwidth //// over 100Mb/s ethernet. Beat that! //// -----------------------------------------////__________ o David S. Miller, da...@caip.rutgers.edu /_____________/ / // /_/ ><
From: Jean Francois Martinez <j...@sidney.remcomp.fr> Subject: Re: Bugs and wishes in memory management area Date: 1996/11/20 Message-ID: <199611202158.WAA01515@sidney.remcomp.fr>#1/1 X-Deja-AN: 197827432 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: j...@sidney.remcomp.fr references: <m0vPyaH-0005KgC@lightning.swansea.linux.org.uk> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

From: a...@lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 19 Nov 1996 22:24:00 +0000 (GMT)
Cc: linux-ker...@vger.rutgers.edu
Content-Type: text

> DMA memory". That makes modules unreliable. Of course I don't see
> why modules are apparently using GFP_ATOMIC | GFP_DMA. When loading a
> module I think we can swap.

GFP_DMA implies GFP_ATOMIC (rightly or wrongly)

When saying we can swap I was of course referring to stealing memory from user processes, not from modules. (MVS is pageable, but that is because it is so big that not even IBM mainframes can hold it.)

About modules, I still think that we do not need GFP_ATOMIC when loading them. We need contiguous memory, and we can get it either by swapping or at least by discarding unmodified pages from user processes. Swapping is less prone to failing to find enough DMAable memory, but it is too big a change to put into 2.0. And I don't think having unreliable modules can wait until 2.2.

> Wishes for memory management in 2.1
>
> First: Real swapping. When two processes are fighting for memory and
> memory is tight, the only way to make progress is to freeze one of them
> and let the other run unimpeded for a few seconds.

Questionable. The shared set of pages in most systems is a lot of the application. Also the 'less regularly used' algorithm behaves far more stably with heavy loads. I can see a case for trying to pick more on processes that have run a lot, but the whole VM/scheduling/IO interaction is an incredibly complex and subtle beast. Repeating the formulae of the early 1980's isn't always progress.

No. First, I am not talking about swapping out a whole process like BSD does, just freezing a process (of course the living one will steal a big chunk of memory from it).

Second: you have two programs using big arrays or linked lists. Each one needs 6 megs of memory to be able to run for 5 ms without faulting. A page fault will take 20 ms. You only have 8 megs of RAM. I had a situation like this; I used CTRL-Z on one of the processes.

Third: in its advertisements about FreeBSD, Infomagic claims (implicitly) that FreeBSD can handle heavy loads far better than Linux. BSD swaps. I would really like Linux to erase that stupid superiority smile of the BSDers. Compared to proprietary systems we can claim that Linux plus a lot of RAM is cheaper and faster than a proprietary OS with less RAM. That claim is not valid when comparing with BSD.

-- Jean Francois Martinez
From: m...@toaster.roan.co.uk (Mike Jagdis) Subject: Re: Bugs and wishes in memory management area Date: 1996/11/21 Message-ID: <slrn5988u1.6tu.mike@toaster.roan.co.uk>#1/1 X-Deja-AN: 197827802 x-nntp-posting-host: whthom.demon.co.uk references: <199611192039.VAA00436@sidney.remcomp.fr> x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway) x-hdr-sender: m...@toaster.roan.co.uk x-env-sender: m...@toaster.roan.co.uk newsgroups: linux.dev.kernel >But GFP_ATOMIC allocation could be improved because we could discard >unmodified pages I think. Correct me if I am wrong. Yes, try_to_free_page almost but not quite handles no wait requests suitably (ipc shrinking ignores no wait at least). There are one or two other stunts we can pull to make DMA requests for multiple pages more reliable. I've been working on page_alloc.c and kmalloc.c for a while now but I got side tracked with the Cyrix stuff which turned out to be more involved than I thought :-(. Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: Rob Riggs <rri...@tesser.com> Subject: Re: Bugs and wishes in memory management area Date: 1996/11/21 Message-ID: <XFMail.961121024422.rriggs@tesser.com>#1/1 X-Deja-AN: 198026129 sender: owner-linux-ker...@vger.rutgers.edu references: <m0vPyaH-0005KgC@lightning.swansea.linux.org.uk> content-type: text/plain; charset=us-ascii x-hdr-sender: rri...@tesser.com mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On 20-Nov-96 a...@lxorguk.ukuu.org.uk wrote:
>>
>> DMA memory". That makes modules unreliable. Of course I don't see
>> why modules are apparently using GFP_ATOMIC | GFP_DMA. When loading a
>> module I think we can swap.
>
> GFP_DMA implies GFP_ATOMIC (rightly or wrongly)

From my reading of the source and headers it seems that GFP_DMA implies GFP_BUFFER, not GFP_ATOMIC...

From mm.h:

> #define GFP_BUFFER      0x00
> #define GFP_ATOMIC      0x01
> #define GFP_USER        0x02
> #define GFP_KERNEL      0x03
> #define GFP_NOBUFFER    0x04
> #define GFP_NFS         0x05
>
> /* Flag - indicates that the buffer will be suitable for DMA.
>    Ignored on some platforms, used as appropriate on others */
>
> #define GFP_DMA         0x80
>
> #define GFP_LEVEL_MASK  0xf

From kmalloc.c:

> if (priority & GFP_DMA) {
>         dma = 1;
>         type = MF_DMA;
>         pg = &bucket->dmafree;
> }
>
> priority &= GFP_LEVEL_MASK;

This would seem to imply that one can do:

        ptr = kmalloc(size, GFP_DMA | GFP_KERNEL);

The comments in mm.h clearly state that "GFP_DMA" is a flag, not an actual priority level. However GFP_DMA & GFP_LEVEL_MASK = GFP_BUFFER, and GFP_BUFFER does not try very hard to free pages.

I think the original poster is correct in that we could try harder to free pages if DMAable RAM is not immediately available, even by paging if necessary. Can modules sleep during initialization? If they can't, then the affected modules may need to defer allocation of their DMA buffers until they can sleep.

Rob
(rri...@tesser.com)
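To spell out the arithmetic Rob is pointing at: with the values quoted from mm.h above, OR-ing GFP_DMA into a real priority survives the mask inside kmalloc(), but passing GFP_DMA on its own degenerates to GFP_BUFFER. A tiny stand-alone sketch (userland, just to show the masking; the constants are copied from the post):

    #include <stdio.h>

    #define GFP_BUFFER      0x00
    #define GFP_KERNEL      0x03
    #define GFP_DMA         0x80
    #define GFP_LEVEL_MASK  0x0f

    int main(void)
    {
            /* What kmalloc() is left with after "priority &= GFP_LEVEL_MASK" */
            printf("GFP_DMA alone        -> %#04x\n",
                   GFP_DMA & GFP_LEVEL_MASK);                   /* 0x00 == GFP_BUFFER */
            printf("GFP_DMA | GFP_KERNEL -> %#04x\n",
                   (GFP_DMA | GFP_KERNEL) & GFP_LEVEL_MASK);    /* 0x03 == GFP_KERNEL */
            return 0;
    }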
From: Jean Francois Martinez <j...@sidney.remcomp.fr> Subject: Re: Bugs and wishes in memory management area Date: 1996/11/22 Message-ID: <199611222047.VAA01260@sidney.remcomp.fr>#1/1 X-Deja-AN: 198276088 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: j...@sidney.remcomp.fr references: <Pine.LNX.3.91.961122002240.195B-100000@localhost> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

Date: Fri, 22 Nov 1996 01:07:29 +0000 (GMT)
From: Gerard Roudier <groud...@club-internet.fr>
X-Sender: groudier@localhost
cc: Rob Riggs <rri...@tesser.com>, Alan Cox <a...@lxorguk.ukuu.org.uk>, Jean Francois Martinez <j...@sidney.remcomp.fr>, linux-ker...@vger.rutgers.edu
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Thu, 21 Nov 1996, Jon Tombs wrote:
>
> Rob Riggs said:
>
> > This would seem to imply that one can do:
> >
> > ptr = kmalloc(size, GFP_DMA | GFP_KERNEL);
> >
> > The comments in mm.h clearly state that "GFP_DMA" is a flag, not an
> > actual priority level. However GFP_DMA & GFP_LEVEL_MASK = GFP_BUFFER.
> > And GFP_BUFFER does not try very hard to free pages.
>
> Yes you can do it, but it doesn't help. The problem is in get_free_pages, it

1) Opinion(s): Doing DMA through the ISA bus is the silliest thing you can do with a PCI system. Complicating the kernel in order to use PCI systems like outdated ones IsSillyAnyway(TM). 16MB of memory is enough for systems that do not use an EISA or PCI bus.

The problem is that all x86 machines have ISA connectors, so we will have to support them, like it or not. And there are people using older machines. Besides, if you don't use DMA, how do you transfer data in a multitasking OS?

> returns when there is enough free pages, or if called with the ATOMIC flag.

2) Agreement: Enough to make it happy, but sometimes not enough to satisfy the request.

> But it doesn't check if the free pages are actually suitable for the DMA
> request. You can fix it easily with an extra if() that avoids returning NULL
> when called (GFP_DMA | GFP_KERNEL), but this just results in randomly freeing
> pages until success.
> What is really needed is an algorithm to pick the pages, or you will swap
> several MB in order to create a single 128k DMA buffer (I know, I've
> tried...).

3) Irony: No need to try it. Can be guessed easily.

4) Suggestions: A suggestion, for > 16MB systems, is to sacrifice if necessary up to 1MB in a pool for ISA/DMA, but to keep the kernel simple. Kernel modules/drivers that allocate ISA/DMA memory when it is not needed are to be reworked (or rewritten).

5) Greetings: Gerard.

Interesting idea. But there are still boxes with less than 16 megs. Ah, the good old days when Linux was usable with 2 megs and was advertised as being small and mean.

-- Jean Francois Martinez

-What is the operating system used by Bill Gates?
-Linux of course. Do you think stupid people become millionaires?
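Gerard's 1MB-pool suggestion could look something like the sketch below: reserve a handful of DMA-capable pages early, while low memory is still easy to find, and hand them out only to drivers that genuinely need ISA DMA. This is purely illustrative; the names, the pool size and the single-page granularity are made up (real ISA DMA buffers often need physically contiguous multi-page runs), and it assumes the 2.0-era __get_dma_pages() helper:

    #include <linux/mm.h>

    #define DMA_POOL_PAGES 256                  /* ~1MB with 4K pages */

    static unsigned long dma_pool[DMA_POOL_PAGES];
    static int dma_pool_top;

    /* Called once at boot, before memory gets fragmented. */
    void dma_pool_init(void)
    {
            while (dma_pool_top < DMA_POOL_PAGES) {
                    unsigned long page = __get_dma_pages(GFP_KERNEL, 0);
                    if (!page)
                            break;              /* pool just ends up smaller */
                    dma_pool[dma_pool_top++] = page;
            }
    }

    /* Drivers that really need ISA/DMA memory draw from the pool instead
     * of hoping the buddy allocator has a suitable low page handy. */
    unsigned long dma_pool_get(void)
    {
            return dma_pool_top ? dma_pool[--dma_pool_top] : 0;
    }

    void dma_pool_put(unsigned long page)
    {
            if (dma_pool_top < DMA_POOL_PAGES)
                    dma_pool[dma_pool_top++] = page;
    }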
From: b...@stat.wisc.edu (Kevin Buhr) Subject: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/26 Message-ID: <vba7mn8bu70.fsf@mozart.stat.wisc.edu>#1/1 X-Deja-AN: 200888738 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: b...@stat.wisc.edu x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

An amusing anecdote:

One day, right out of the blue, my poor little 8 meg machine went loco. It began generating reams and reams of "Couldn't get a free page" messages. I was away from the console, and it churned madly away for several hours before I was able to power cycle it.

"Fortunately", I'd added the priority and size to the "Couldn't get a free page" message in my kernel (2.0.13 vintage, I believe), and I immediately realized that I was seeing request after request for a 2-page block at GFP_NFS priority. Eventually, I traced it back to this culprit in "fs/nfs/proc.c":

        static inline int *nfs_rpc_alloc(int size)
        {
                int *i;

                while (!(i = (int *)kmalloc(size+NFS_SLACK_SPACE,GFP_NFS))) {
                        schedule();
                }
                return i;
        }

Get it? *All* my runnable processes wanted a 2-page block of memory for nefarious NFS-read-related purposes, and the kernel was viciously failing to grant any of them. Why, you ask? Well, observe the following from the "mm/page_alloc.c" code:

        if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
                RMQUEUE(order, dma);
                restore_flags(flags);
                return 0;
        }
        restore_flags(flags);
        if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
                goto repeat;
        return 0;

Imagine that kernel memory is sufficiently fragmented that, even though there are lots of pages available (say more than "free_pages_high", which is 32 on my little box), there are no 2-page blocks around. Note that, provided "nr_free_pages" is large enough, we never even *get* to "try_to_free_page". Every multipage memory request will be flatly refused until the memory becomes magically "defragmented", which isn't always likely to happen, particularly if "kswapd" doesn't run.

Since that horrible experience, I've hacked up my kernel so that "kmalloc" retries multipage, non-GFP_ATOMIC, non-GFP_BUFFER requests (that can't even be satisfied by the "kmalloc" cache and so would otherwise result in "Couldn't get a free page" messages) at a magical "GFP_DEFRAG" priority that will "try_to_free_page" anyway, no matter how many "nr_free_pages" there may be.

My hack works like a dream: the only "Couldn't get a free page" messages I get now are at GFP_ATOMIC priority (though, disturbingly enough, they are multipage requests for 4388 bytes---anyone know what these are?), and instead I get dozens of "Free list fragmented" messages associated with 2-page NFS requests whenever the going gets tough. On the other hand, I'm well aware that I've merely traded one race condition for another: for one thing, "try_to_free_page" produces blocks of consecutive free pages by accident, not on purpose.

* * *

I remember a long time ago, some brave soul was criticizing the allocation buddy-system; he or she wanted to completely replace it with a kernel page-table that could produce multipage blocks automagically, thus eliminating the scourge of memory fragmentation forever. (Now you probably know why I chose the "Subject" line I did.) If I remember correctly, the primary argument against this was the performance penalty of invalidating the cache after every kernel memory allocation. Besides which, it was pretty gross compared to the superefficient buddy system.
Was this the only argument against the proposed scheme? Is it a bad idea to have the kernel use a page-table system to back up the buddy system? I'm thinking that a (non-GFP_DMA) multipage request that's in danger of failing merely because of fragmentation would be satisfied by RMQUEUEing a bunch of single pages and bundling them together in the kernel page table up past the physical memory high-water mark. On "free", the pages would be "unbundled" and returned to the queue. I have to claim ignorance here: I don't know how many drivers would break if there was virtual/physical address dichotomy in (non-GFP_DMA) kmalloced blocks. Perhaps we'd have to add a GFP_VIRTOK flag or something. Comments? Flames? Kevin <b...@stat.wisc.edu>
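For concreteness, the retry Kevin describes amounts to something like the sketch below. It is a rough approximation written from outside the allocator, using the 2.0-era calls quoted in his post; the function name, the retry cap and the placement are invented (his real hack lives inside kmalloc/get_free_pages and uses his own GFP_DEFRAG priority):

    /* Keep calling try_to_free_page() for a sleepable multi-page request
     * even when nr_free_pages looks healthy, in the hope that freeing a
     * few more pages opens up a physically contiguous block. */
    static unsigned long get_pages_defrag(int priority, int order)
    {
            unsigned long page;
            int tries = 16;                         /* arbitrary cap */

            for (;;) {
                    page = __get_free_pages(priority, order, 0);
                    if (page)
                            return page;
                    if (priority == GFP_ATOMIC || priority == GFP_BUFFER)
                            break;                  /* not allowed to sleep/free */
                    if (!try_to_free_page(priority, 0, 1) || !--tries)
                            break;                  /* nothing more reclaimable */
            }
            printk("Free list fragmented (order %d)\n", order);
            return 0;
    }

As Kevin notes himself, this only trades one race for another: try_to_free_page() frees "any old page", so contiguous runs appear by luck rather than by design.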
From: m...@toaster.roan.co.uk (Mike Jagdis) Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <slrn59o1ug.9hc.mike@toaster.roan.co.uk>#1/1 X-Deja-AN: 201001570 x-nntp-posting-host: roan.demon.co.uk references: <vba7mn8bu70.fsf@mozart.stat.wisc.edu> x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway) x-hdr-sender: m...@toaster.roan.co.uk x-env-sender: m...@toaster.roan.co.uk newsgroups: linux.dev.kernel >If I remember correctly, the primary argument against this was the >performance penalty of invalidating the cache after every kernel >memory allocation. Besides which, it was pretty gross compared to the >superefficient buddy system. The buddy system is pretty gross for what we want. You don't need to pull gross hacks with page tables or use complex algorithms to handle memory allocation though! I've been working on improved page allocation and kmalloc that doesn't have nasty power-of-two limits, doesn't thrash any more than necessary and is optimised for the common case. It's a slow job because at this stage each little change needs careful thought and a couple of hours benchmarking to see what effect it has. We are talking a few microseconds variation but this code is called a lot. Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: Ingo Molnar <mi...@pc5829.hil.siemens.at> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <Pine.LNX.3.95.961127103914.13206B-100000@pc5829.hil.siemens.at>#1/1 X-Deja-AN: 201006013 sender: owner-linux-ker...@vger.rutgers.edu references: <slrn59o1ug.9hc.mike@toaster.roan.co.uk> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: mi...@pc5829.hil.siemens.at mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On Wed, 27 Nov 1996, Mike Jagdis wrote:

> >If I remember correctly, the primary argument against this was the
> >performance penalty of invalidating the cache after every kernel
> >memory allocation. Besides which, it was pretty gross compared to the
> >superefficient buddy system.
>
> The buddy system is pretty gross for what we want. You don't need
> to pull gross hacks with page tables or use complex algorithms
> to handle memory allocation though!

IMHO, the problem is that once a pointer is given out, you cannot reorganize your logical->physical memory mappings. With the page table solution you can. It's a CPU hardware feature that is hard to emulate.

With the buddy system, once you are fragmented, you can do nothing about it (other than using double indirection pointers [memory handles] which basically emulate paging at a cost we probably don't want to pay?). And the memory handle stuff isn't good for interrupt handlers ... nor for SMP?

TLB invalidates are basically a hardware-implemented 'handle-invalidate' feature ... we cannot really implement this in software, can we?

-- mingo
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk>#1/1 X-Deja-AN: 201170011 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.95.961127103914.13206B-100000@pc5829.hil.siemens.at> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

> IMHO, the problem is that once a pointer is given out, you cannot
> reorganize your logical->physical memory mappings. With the page table
> solution you can. It's a CPU hardware feature that is hard to emulate.

But you don't need to reorganize, you don't need paging and you don't need to emulate it. Buddy is expensive because it *always* tries to coalesce pages to power-of-two blocks (despite the fact that 99.99% of requests are for single blocks anyway) and only ever coalesces neighbouring blocks of the same size (which makes it hideously vulnerable to fragmentation).

I use page arrays of arbitrary length (up to a suitably arbitrary maximum). I use lazy coalescing to optimise the common, single-page case - this imposes an extra overhead on the first miss, but this is not serious for infrequent requests, and frequent requests tend to keep the queues of larger sizes suitably topped up (so requests show the same overhead as the single-page case). Being able to add a single page to a 4-page array to get 5 is a _lot_ easier than finding two neighbouring 4s to join.

When coalescing pages I check to see if neighbouring pages are in the page cache but otherwise unused. If so I reclaim them immediately rather than waiting for shrink_mmap to clock through memory and get round to releasing them (we could do this at interrupt time too if I mutex the inode and page cache modifications).

I do kmalloc in a similar-ish way, treating pages as arrays of small (16 byte) "paragraphs" (although the optimum solution appears to be not to coalesce them but to allocate new page arrays and let the current ones fragment away, dropping them when they become entirely free - 99.99% of kmalloc action is below 1k).

These schemes allow good handling of requests for arbitrary amounts of memory, but the simplicity of the algorithm gives around zero cost. My current benchmarking shows the difference in raw performance (latency and bandwidth) between this and the old buddy code to be down in the noise - if anything I am a small percentage faster. I can still trim a few more instructions as well - plus I have a fair amount of statistics gathering and the like in there :-).

Hopefully it will be ready for the harsh light of day in a couple of weeks so you lot can break it :-).

Mike
-- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
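Mike's description suggests bookkeeping along the following lines. This is only a guess at the shape of the idea (his code was not posted); it is sketched just to show what "page runs of arbitrary length" without power-of-two rounding might look like, and pop_run()/split_run() are hypothetical helpers, not existing kernel functions:

    #define MAX_RUN 32                      /* arbitrary upper bound */

    struct page_run {
            struct page_run *next;
            unsigned long    start;         /* first physical page in the run */
            int              npages;        /* run length, 1..MAX_RUN */
    };

    /* Free runs kept per exact length: no power-of-two rounding. */
    static struct page_run *free_runs[MAX_RUN + 1];

    /* Hypothetical helpers: unlink the head of a list; split the first
     * npages off a longer run and requeue the remainder at its new length. */
    extern struct page_run *pop_run(struct page_run **list);
    extern struct page_run *split_run(struct page_run *run, int npages);

    static struct page_run *run_alloc(int npages)
    {
            int len;

            if (free_runs[npages])          /* exact fit: the common case */
                    return pop_run(&free_runs[npages]);

            for (len = npages + 1; len <= MAX_RUN; len++)
                    if (free_runs[len])
                            return split_run(pop_run(&free_runs[len]), npages);

            return 0;       /* caller reclaims/coalesces lazily, or fails */
    }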
From: Ingo Molnar <mi...@pc5829.hil.siemens.at> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <Pine.LNX.3.95.961127160222.2744A-100000@pc5829.hil.siemens.at>#1/1 X-Deja-AN: 201173693 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: mi...@pc5829.hil.siemens.at mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On Wed, 27 Nov 1996, Mike Jagdis wrote:

> > IMHO, the problem is that once a pointer is given out, you cannot
> > reorganize your logical->physical memory mappings. With the page table
> > solution you can. It's a CPU hardware feature that is hard to emulate.
>
> But you don't need to reorganize, you don't need paging and you don't
> need to emulate it.

I don't get it [it must be me]. How would you allocate 2 consecutive pages in this 5-page pool, which has 3 free pages?

        page 1 -------> free
        page 2 -------> used by the Ethernet driver
        page 3 -------> free
        page 4 -------> used by the SCSI driver
        page 5 -------> free

[ sure, we are not at page granularity with the buddy system, but the original poster proposed an >additional< mechanism for multipage vmalloc() calls, maybe by using paging ]

Now I need 2 pages in the above situation ... from an IRQ handler, to get 8192 bytes off the networking card, for example. How would you do it?

-- mingo
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <Pine.LNX.3.91.961127170819.17768A-100000@toaster.roan.co.uk>#1/1 X-Deja-AN: 201453123 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.95.961127160222.2744A-100000@pc5829.hil.siemens.at> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Wed, 27 Nov 1996, Ingo Molnar wrote: > i dont get it [it must be me]. How would you allocate 2 consecutive pages > in this 5-page pool which has 3 free pages? : > [...] > now i need 2 pages in the above situation ... from an IRQ handler, to get > 8192 bytes off the networking card for example. How would you do it? You don't. If the allocated pages were swappable _and_ you weren't in an interrupt you could page one or more out. If you can't do that you have a problem, sure. However this is a worst-case scenario. To get to that state you are probably short of memory and have a high memory load. Normally there are many pages easily reclaimable even from interrupts. If you seriously want to handle the worst case by making things worse it would be possible to fall back to shuffling pages using the MMU (at a price) but I don't think this is a good idea in the general case. (On the other hand the MMU fiddling required is along the lines of the sort of thing we need to do to support large, 1GB or more, memory systems...) Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: b...@stat.wisc.edu (Kevin Buhr) Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <vba917nl2ur.fsf@mozart.stat.wisc.edu>#1/1 X-Deja-AN: 201126172 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: b...@stat.wisc.edu references: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

Mike Jagdis <m...@roan.co.uk> writes:
|
| But you don't need to reorganize, you don't need paging and you don't
| need to emulate it.

If I understand correctly, the primary strength of the system you are working on is that it wastes less memory in satisfying allocation requests. As the user of a tiny little 8 meg machine, you can't imagine how much I look forward to getting my filthy hands on this code. ;)

However, the buddy system already does a good job of actually satisfying small-allocation requests, even if it wastes lots of the space it has available. That is to say, your new code has the effect of magically boosting the size of physical memory, but the actual failure rates of a "kmalloc(1024,GFP_KERNEL)" won't differ, even if that call is more likely to cause a page fault under the buddy system than under yours.

My problem is that the buddy system is very prone to fragmentation: when memory becomes fragmented, it *often* completely fails to satisfy large allocations (and worse yet, the current kernel mechanisms for freeing pages make no attempt to address fragmentation problems---they try to free "any old page", not "a page next to another free page").

It sounds like your system will be better at satisfying large allocation requests, simply because it will be able to find, for example, 9000 bytes starting *anywhere* (or almost anywhere) in memory instead of requiring four consecutive pages. This seems to be a happy side effect of your real goals: avoiding big chunks of wasted memory and optimising the small-allocation case.

The proof is in the pudding. If your system makes my fragmentation problem vanish, I will click my heels together with glee and do a dance of joy. ;)

However, even after your code is in the kernel, it will still be possible for a multipage allocation request to fail even though there are plenty of pages all over the place. If this situation is very unlikely, then it's not so bad. We can still imagine deadlock scenarios (where every runnable process needs a 4-page block to continue and yet allocated memory looks like a big checkerboard), but if they are incredibly improbable, it's no big deal.

My proposed scheme is ugly, but it's for emergencies. Click, click, click, click and four nonconsecutive pages have been mapped into the land of make believe. One cache invalidate, and we've prevented deadlock. Since the virtual address space is so damn big (ha ha), we can make fragmentation of the "imaginary" memory impossible, simply by mapping on 32-page boundaries (the size of the maximum allocation currently allowed by "kmalloc"), and we never have to deal with the ugliness (that Ingo was talking about) of reorganizing already-allocated mappings.

Its cost? (1) Even if an emergency never occurs, every "kfree" will need another check, but depending on the actual code your new allocator uses, perhaps we'd be able to get this check done for free; (2) It might break nasty virtual/physical address dependencies deep inside some device driver somewhere despite the existence of a GFP_VIRTOK flag; (3) It's ugly.
I guess we'll know whether these costs outweigh potential benefits when we get some idea of how likely bad fragmentation is with your new code in place. I certainly look forward to giving it a test run! Kevin <b...@stat.wisc.edu>
From: Ingo Molnar <mi...@pc5829.hil.siemens.at> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/27 Message-ID: <Pine.LNX.3.95.961127213022.10159C-100000@pc5829.hil.siemens.at>#1/1 X-Deja-AN: 201403769 sender: owner-linux-ker...@vger.rutgers.edu references: <vba917nl2ur.fsf@mozart.stat.wisc.edu> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: mi...@pc5829.hil.siemens.at mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On 27 Nov 1996, Kevin Buhr wrote:

> Its cost? (1) Even if an emergency never occurs, every "kfree" will
> need another check, but depending on the actual code your new
> allocator uses, perhaps we'd be able to get this check done for free;
> (2) It might break nasty virtual/physical address dependencies deep
> inside some device driver somewhere despite the existence of a
> GFP_VIRTOK flag; (3) It's ugly.

With double indirection pointers we could implement some kind of garbage collector kernel thread(s). Kernel code has to 'get' and 'put' such resources, to avoid reentrancy problems. The benefit would be soft, relocatable buffers. [This one could be made caching-aware too, based on access pattern detection in the get/put logic.] And this could coexist with no-MMU features like the 4MB pages or the Sparc bigpages or the similar Cyrix stuff?

The operations on these buffers would be: allocate(), get(), put(), free(). For once-only buffers it won't do any good, but for persistent buffers it is an option?

-- mingo
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/28 Message-ID: <Pine.LNX.3.91.961128094919.24362B-100000@toaster.roan.co.uk>#1/1 X-Deja-AN: 201456250 sender: owner-linux-ker...@vger.rutgers.edu references: <vba917nl2ur.fsf@mozart.stat.wisc.edu> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On 27 Nov 1996, Kevin Buhr wrote:

> If I understand correctly, the primary strength of the system you are
> working on is that it wastes less memory in satisfying allocation
> requests. As the user of a tiny little 8 meg machine, you can't
> imagine how much I look forward to getting my filthy hands on this
> code. ;)

Until a couple of months ago I was using a 386/33 with 4MB :-).

> However, the buddy system already does a good job of actually
> satisfying small-allocation requests, even if it wastes lots of the
> space it has available. That is to say, your new code has the effect
> of magically boosting the size of physical memory, but the actual
> failure rates of a "kmalloc(1024,GFP_KERNEL)" won't differ, even if
> that call is more likely to cause a page fault under the buddy system
> than under yours.

Small allocations, yes. The problem with buddy is that to make, say, a block of 8 pages it needs to combine two neighbouring blocks of 4 pages, and so on. This leads it to only allow certain alignments of blocks, which reduces the possibilities as you go to larger sizes (and increases the wastage).

Note that I talked about page allocation above. For kmalloc it seems better to forget about coalescing fragments. The dominant kmalloc activity is at relatively small sizes, so collecting fragments is a good thing (when we get a completely free page in the free lists we pull all the fragments and release the page - modulo a small delay to avoid thrashing). For larger requests we just grab more pages and peel off the fragment we want. With the lower-size queues being kept stuffed with fragments we will have a tendency to keep a few blocks on the active larger queues. It's like a cascade through a meat grinder - pages go in the top, get fragmented as needed, and when the little bits drop out the bottom we just sweep them up and recycle them. It's pretty much self-balancing.

> However, even after your code is in the kernel, it will still be
> possible for a multipage allocation request to fail even though there
> are plenty of pages all over the place.

Even single page requests fail eventually :-). Code should be prepared for things to fail. The likelihood of something failing is inversely proportional to the cost of making it work under extreme conditions. As with everything we just have to do our best within the budget available :-).

Of course, you can *already* use paging to "solve" the problem. Simply ask for memory with GFP_ATOMIC (to avoid a thrash if the request cannot trivially be satisfied) and if it fails use vmalloc to map pages into kernel memory. But remember that remapped memory is not always suitable - DMA...

Mike
-- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
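The fallback Mike mentions in his last paragraph looks roughly like the sketch below. It is illustrative only: the function name is made up, it assumes the 2.0-era __get_free_pages()/vmalloc() calls, and a real caller has to remember which path succeeded because the two are freed differently and vmalloc()ed memory is no good for DMA:

    #include <linux/mm.h>       /* __get_free_pages(), vmalloc() in 2.0-era trees */

    void *grab_big_buffer(unsigned long size, int order, int *is_virtual)
    {
            unsigned long page;

            /* First try for physically contiguous pages without thrashing:
             * GFP_ATOMIC never sleeps and never calls try_to_free_page(). */
            page = __get_free_pages(GFP_ATOMIC, order, 0);
            if (page) {
                    *is_virtual = 0;
                    return (void *)page;        /* free later with free_pages() */
            }

            /* Otherwise stitch scattered pages together in kernel virtual
             * space.  May sleep; not suitable for DMA. */
            *is_virtual = 1;
            return vmalloc(size);               /* free later with vfree() */
    }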
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/28 Message-ID: <Pine.LNX.3.91.961128094331.24362A-100000@toaster.roan.co.uk>#1/1 X-Deja-AN: 201453149 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.95.961127213022.10159C-100000@pc5829.hil.siemens.at> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Wed, 27 Nov 1996, Ingo Molnar wrote: > with double indirection pointers we could implement some kind of garbage > collector kernel thread(s). Kernel code has to 'get' and 'put' such > resources, to avoid reentrancy problems. This means mutexing each access through the pointer so it can't change while we are using the object at the other end. It might help the worst case but at the expense of extra overhead on every usage of an allocated object - expensive. Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: Ingo Molnar <mi...@pc5829.hil.siemens.at> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/28 Message-ID: <Pine.LNX.3.95.961128131827.13995A-100000@pc5829.hil.siemens.at>#1/1 X-Deja-AN: 201453126 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.91.961128094331.24362A-100000@toaster.roan.co.uk> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: mi...@pc5829.hil.siemens.at mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On Thu, 28 Nov 1996, Mike Jagdis wrote:

> On Wed, 27 Nov 1996, Ingo Molnar wrote:
>
> > with double indirection pointers we could implement some kind of garbage
> > collector kernel thread(s). Kernel code has to 'get' and 'put' such
> > resources, to avoid reentrancy problems.
>
> This means mutexing each access through the pointer so it can't
> change while we are using the object at the other end. It might
> help the worst case but at the expense of extra overhead on every
> usage of an allocated object - expensive.

Yup ... but note that this is basically a software-implemented paging/TLB mechanism, with all the advantages of byte granularity and arbitrary-size buffers. The 'mutex' is just for the garbage collector thread, not for normal usage. It could be signalled with a high bit in the address or something like that, and could be macroed heavily.

Once you have got() a buffer, you can code along as if it were a normal buffer, no speed penalty. The garbage collector thread then would clusterise buffers that aren't in use. [They are allocated, but no pointer is out somewhere in a context.]

But anyway, this was just a gedankenexperiment to avoid TLB invalidation.

-- mingo
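Ingo's get/put idea, reduced to code, might look like the sketch below. The names are entirely hypothetical (nothing like this exists in the kernel); the point is that code holds a handle rather than a raw pointer, and between get and put the garbage-collector thread must leave the buffer where it is, while outside those windows it is free to move the data and update the slot:

    struct khandle {
            void *addr;         /* current location; the GC may change this */
            int   pins;         /* > 0 while some context is using the buffer */
    };

    static inline void *kh_get(struct khandle *h)
    {
            h->pins++;          /* pin: the GC skips handles with pins != 0 */
            return h->addr;     /* use as a normal pointer until kh_put() */
    }

    static inline void kh_put(struct khandle *h)
    {
            h->pins--;          /* unpin: the GC may now relocate the buffer */
    }

As Mike points out in the preceding message, the pin counting itself would need to be made safe against interrupts and other processors, which is exactly the per-access overhead he objects to.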
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/28 Message-ID: <Pine.LNX.3.91.961128101411.24362C-100000@toaster.roan.co.uk>#1/1 X-Deja-AN: 201452850 sender: owner-linux-ker...@vger.rutgers.edu references: <m0vSqZb-0005FbC@lightning.swansea.linux.org.uk> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Wed, 27 Nov 1996, Alan Cox wrote: > There is an even worse problem with the buddy system, especially on CPU's > without 4 way caches. All our structures tend to have the used stuff at > the top and data below. Our kmalloc spends all day landing all the > critical structures on the same cache line.. That's why I started with the premise that kmalloc should deal in terms of 16 byte "paragraphs" (32 on an Alpha) and that it should not blindly pair groups of similar size. (As I've mentioned before, I gave up coalescing kmalloc fragments completely). After gathering some stats it seems useful to round requests to about 4*16 but just add one. > Some CPU's the TLB is software. On all CPU's I know an SMP CPU invalidation > needs software and is relatively expensive. The principle of keep-it-simple applies in buckets. There is a big price to pay for complexity whether algorithmic or hardware. Unless, possibly you have absolute control of both the software *and* hardware design. We wish... :-) Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/29 Message-ID: <Pine.LNX.3.91.961129092948.3799B@toaster.roan.co.uk>#1/1 X-Deja-AN: 201456293 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.95.961128131827.13995A-100000@pc5829.hil.siemens.at> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Thu, 28 Nov 1996, Ingo Molnar wrote: > once you have got() a buffer, you can code along as if it were a normal > buffer, no speed penalty. The garbage collector thread then would > clusterise buffers that arent in use. [they are allocated, but no pointer > is out somewhere in a context] This is pretty much what we have. Kmalloc gets the buffer, you use it, kfree releases it. The only thing it doesn't currently do is to shuffle allocated things around to make larger chunks. My experimentation suggests that coalescing in kmalloc is not worthwhile anyway. Mike -- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'
From: Mark Hemment <mar...@nextd.demon.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/11/29 Message-ID: <Pine.LNX.3.91.961129202137.142A-100000@nextd.demon.co.uk> X-Deja-AN: 201565405 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.91.961129093724.3799C-100000@toaster.roan.co.uk> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: mar...@nextd.demon.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

On Fri, 29 Nov 1996, Mike Jagdis wrote:
> On Thu, 28 Nov 1996, Mark Hemment wrote:
> > Unfortunately, when searching for a page to move into the free memory
> > pool no weight is given to whether the released page will improve the
> > quality of the memory-pool (such as allowing coalescing).
> 99.99% of the time we want a single page.

Yep, but...there are those occasions when more than one page is needed. These are the hardest requests to handle (or prepare for). I was suggesting 'priming' the free memory pool for these cases (without introducing a large overhead - fewer restrictions on selecting a page to reap == more chance of getting the page 'type' you need).

> > But they are not freed immediately. Instead, update the necessary pte's
> > (and anon structs), place the pages on a reclaim list, and tell the
> > paging out daemon to start.
> That is more or less what we have. Pages are kept in the page cache
> or buffer pool as long as possible even when unused. Kswapd keeps
> the actual free lists topped up to minimise overhead on allocation
> requests plus requests fall back to calling try_to_free_page
> themselves if necessary.

Think you might have missed my point (or I didn't make it v. well), or I'm mis-understanding your reply. What you say above is correct. My point was extending the 'swap-cache', to make it more general (into a "reclaim-list"). The current swap-cache is used to prevent a thrash case on the swap device after an anonymous page has been unmapped from an address space and the process references it v. soon after. (Remember, Linux has asynchronous writes to swap.)

> > OK, I've simplified the above quite a bit. But do you get the idea?
> > If you think you've seen the above before, you're correct. It's what
> > some other UNIX OSes do, and for good reasons.
>
> Linux is not entirely dissimilar but it is optimised more for the
> "sufficient memory" case - there is no way to go from a page to
> the page tables it is used in.

Where the lack of a physical-page to pte(s) mapping causes the most problems (for me anyway) is in a "make -j zImage" on the kernel source. When Linux identifies a named page to be reaped, it _only_ removes it from the address space under which it found the page. There may be other processes that still have a ref to the page. Hence, all the searching work has been done for zero gain - OK, not totally zero; at least when it stumbles over the other mappings for the page it will be released.

I'm being a bit harsh here. It's not very often that a single text page has many mappings and needs to be unmapped. Plus the text page usage patterns in my example probably mean releasing such a page is not good - it would soon be faulted in again.

Also note the line "if (page_map->count != 1) return 0;" in try_to_swap_out(). This prevents the paging out of shared anonymous pages. It only affects some pages, but I'd like to get rid of it (hence the need for anon structs).
> > and improves the performance of swap (when reading > > back, we read several pages back - locality of > > reference on a none page granularity). > Linux also does page ahead. Unless I've missed something, there is no read-ahead on a swap device! (No point as pages are not guaranteed any order on swap, as I was proposing). Clustering of swap pages, yes, but that's something different. Have a look at scan_swap_map() for the clustering (been sometime since I looked at this func, but I seem to remember the idea is to prevent a disk head from moving all over a swap device writing single pages in the holes left by fault-ins from the device). > > iii) Identify pages that allow coalescing in the free > > memory pool. > > I reclaim unused pages from the page cache when coalescing and it > seems worthwhile. It needs work to mutex things so we can do it > at interrupt time as well though. Good. I though it would be messey (is it?). I was thinking about using the age of pages in the cache when deciding what would coalesce with pages in the free memory pool. Perhaps using a random selection policy for a page is good enough (is this what you're doing?). By random I mean pseudo-random, ie. the first page we find, from the previous starting point, which allows a coalesce. > > See my slab allocator on http://www.nextd.demon.co.uk/ > Yes, I've been looking at it. There are some interesting ideas. But it > is relatively slow by my current standards :-). PCs have all too > little L1 cache, all but the oldest processors have enough pipelining > to be able to init a structure pretty quickly and conditional branches > are a killer - even with the Cyrix branch prediction. It may be > possible to optimise slab though... Slow? Slow! Maybe in the v. _worst_ case. The common, critical, path is short, with few branches (haven't checked the assembly - I should do! - just relied on how I believe gcc produces code. eg. "if() goto ???" when the 'if' is rarely taken, and "if () {common-case}" when the 'if' will normally be taken. But I should check what it produces...). If some of the handlers to allow different types of packing are removed, then the paths can be made even short. A compromise is always needed; using memory efficiently, against the instruction count. Yes, it needs a bit of tuning (already being done), but not much. As for the L1 cache? A structure is usually referenced many times before it is freed, so helping whatever L1 cache is available is well worthwhile. (Note, my critical paths have no extra instructions to allow for cache alignment). Init-ing a struct inline may be quick, but it will still place instructions into both types of caches (instruction and data). If a constructed state can be maintained for a structure (and hence, some code out of the 'hot' path), then it should be. > (It still does the silly power-of-two sizing though, doesn't it?) Most definitely not, walk the code (or ask me questions)! In my patch the 'general purpose caches' are 2^n because I was rushed in getting it out and couldn't be bothered to tune the sizes (not a good recommendation on my part...). The only power of twos are; the num of pages per slab (due to get_free_page()), bytes per page (h/w constraint), and the size of cache lines (because that's the way natural intended them). BTW, lets not knock power-of-two. They are nice numbers (often your're best friend when efficiency is concerned). Be carefully of using Fibonacci sequences, or some other sets, for allocator sizes. 
Their internal fragmentation tends to become high over time (not good for an OS). Non-coalescing, splitting allocators tend to show the same behaviour. If you know the usage patterns, it is possible to design efficient allocators for specific applications. Difficult to do in a kernel.

Regards, markhe
From: Mike Jagdis <m...@roan.co.uk> Subject: Re: Please don't beat me up (was Re: Bugs and wishes in memory management area) Date: 1996/12/02 Message-ID: <Pine.LNX.3.91.961202120038.1602A-100000@toaster.roan.co.uk> X-Deja-AN: 202046718 sender: owner-linux-ker...@vger.rutgers.edu references: <Pine.LNX.3.91.961129202137.142A-100000@nextd.demon.co.uk> content-type: TEXT/PLAIN; charset=US-ASCII organisation: Roan Technology Ltd x-hdr-sender: m...@roan.co.uk mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel

> Yep, but...there are those occasions when more than one page is
> needed. These are the hardest requests to handle (or prepare for).
> I was suggesting 'priming' the free memory pool for these cases (without
> introducing a large overhead - fewer restrictions on selecting a page to
> reap == more chance of getting the page 'type' you need).

My problem with preparing for multi-page allocations in advance is that you incur overheads coalescing pages into blocks and then splitting them again. This costs on things like network latency and bandwidth, for instance. My preferred approach at the moment is to take the hit on the first allocation of a particular size but to tend to keep requested sizes around for a while so that, if this is other than a one-off, further requests can be satisfied with due speed. I am also considering the wisdom of doing some coalescing during idle time to keep a few suitably large blocks around in order to minimize the miss on the first request in particular. I haven't implemented this however...

> Think you might have missed my point (or I didn't make it v. well), or
> I'm mis-understanding your reply. What you say above is correct. My point
> was extending the 'swap-cache', to make it more general (into a
> "reclaim-list").

Ah, I was thinking more of the page cache. However, aren't pages in the swap cache also in the page cache? I seem to remember that reclaiming from the page cache needs a delete_from_swap_cache. Are we getting the swap and page caches confused?

> When Linux identifies a named page to be reaped, it _only_ removes it
> from the address space under which it found the page. There
> may be other processes that still have a ref to the page. Hence, all the
> searching work has been done for zero gain - OK, not totally zero; at
> least when it stumbles over the other mappings for the page it will be
> released.

Yes, this could get expensive (the extra invalidates are probably not good for SMP either). I tend to look on this as a separate problem for now however.

> Good. I thought it would be messy (is it?). I was thinking about using
> the age of pages in the cache when deciding what would coalesce with
> pages in the free memory pool. Perhaps using a random selection policy
> for a page is good enough (is this what you're doing?). By random I
> mean pseudo-random, i.e. the first page we find, from the previous
> starting point, which allows a coalesce.

Essentially random, yes, since there is no way to predict the current distribution of pages and their uses. It's simplistic in that it doesn't look ahead to see if it will ultimately be able to build a block big enough, but it seems to balance out well, since over-coalescing on one request increases the chance of future requests trivially succeeding - so the spread of latencies widens but the average stays reasonably good. Of course, a one-off request for 1MB of contiguous DMA-able memory is always going to be somewhat anti-social :-).

> Slow? Slow!
> Maybe in the v. _worst_ case. The common, critical path is
> short, with few branches (haven't checked the assembly - I should do! -
> I just relied on how I believe gcc produces code, e.g. "if() goto ???" when
> the 'if' is rarely taken, and "if () {common-case}" when the 'if' will
> normally be taken. But I should check what it produces...).

My timings show it often slower than the current stock kernel for the simple, benchmarkable cases, whereas I can go faster fairly consistently (without modifying anything other than kmalloc.c and page_alloc.c). I haven't looked closely enough to say where the time goes though.

> As for the L1 cache? A structure is usually referenced many times before
> it is freed, so helping whatever L1 cache is available is well worthwhile.
> (Note, my critical paths have no extra instructions to allow for cache
> alignment.) Init-ing a struct inline may be quick, but it will
> still place instructions into both types of caches (instruction and
> data). If a constructed state can be maintained for a structure (and
> hence, some code kept out of the 'hot' path), then it should be.

True enough, but the instructions to init a structure are generally pretty fast, easily pipelined and inline, so in-cache. The structure they are writing to is almost certain to be either in cache or likely to be accessed in the very near future anyway. The assumption is that a "recently" freed structure will be reused in the "near" future while it is still in cache. I don't know the precise details of other chips, but on the 6x86 the L1 is "only" 16KB organised as 512 lines by 32 bytes. How long will an L1 line remain resident? I went for the "fast reuse of the right size block" rather than the instruction saving. Unfortunately I don't know any way to track cache hits. (I seem to remember that the Alpha has some cache statistics capabilities but that's out of my reach :-( ).

> > (It still does the silly power-of-two sizing though, doesn't it?)
>
> Most definitely not, walk the code (or ask me questions)!
> In my patch the 'general purpose caches' are 2^n because I was rushed in
> getting it out and couldn't be bothered to tune the sizes (not a good
> recommendation on my part...).

Ah, I saw the table and jumped to the conclusion that it was significant.

Mike
-- .----------------------------------------------------------------------. | Mike Jagdis | Internet: mailto:m...@roan.co.uk | | Roan Technology Ltd. | | | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 | | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 | `----------------------------------------------------------------------'