[Patch] shm bug introduced with pagecache in 2.3.11
Christoph Rohland (hans-christoph.rohland@sap.com)
11 Nov 1999 16:43:08 +0100

Hi Linus,

Finally shm swapping seems to work for me again. The following patch
fixes a refcounting bug which got introduced with 2.3.11. (Thanks to
Manfred who finally found the right point). It survived a lot of swap
stress testing on UP/32MB up to 8xSMP/8GB.

The patch also fixes some int/size_t issues.

Could you please apply this.

Greetings
		Christoph

Attachment: patch-27.6-shm4

--- 2.3.27-pre6/ipc/shm.c	Thu Nov 11 10:33:16 1999
+++ make27/ipc/shm.c	Thu Nov 11 14:47:57 1999
@@ -206,7 +206,7 @@
 	struct shmid_kernel *shp;
 	int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
 	int id, err;
-	unsigned int shmall, shmmni;
+	size_t shmall, shmmni;
 
 	shmall = shm_prm[1];
 	shmmni = shm_prm[2];
@@ -378,13 +378,16 @@
 	case IPC_INFO:
 	{
 		struct shminfo shminfo;
+		size_t shmmax;
+
 		spin_unlock(&shm_lock);
 		err = -EFAULT;
 		if (!buf)
 			goto out;
 
+		shmmax=shm_prm[0];
+		shminfo.shmmax = shmmax > UINT_MAX ? UINT_MAX : shmmax;
 		shminfo.shmmni = shminfo.shmseg = shm_prm[2];
-		shminfo.shmmax = shm_prm[0];
 		shminfo.shmall = shm_prm[1];
 		shminfo.shmmin = SHMMIN;
 
@@ -791,11 +794,14 @@
 			if (!page) {
 				lock_kernel();
 				swapin_readahead(entry);
+				if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx))) goto again;
 				page = read_swap_cache(entry);
 				unlock_kernel();
 				if (!page)
 					goto oom;
 			}
+			if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx)))
+				goto changed;
 			delete_from_swap_cache(page);
 			page = replace_with_highmem(page);
 			lock_kernel();
@@ -803,9 +809,6 @@
 			unlock_kernel();
 			spin_lock(&shm_lock);
 			shm_swp--;
-			pte = SHM_ENTRY(shp, idx);
-			if (pte_present(pte))
-				goto present;
 		}
 		shm_rss++;
 		pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
@@ -813,8 +816,6 @@
 	} else
 		--current->maj_flt;  /* was incremented in do_no_page */
 
-done:
-	/* pte_val(pte) == SHM_ENTRY (shp, idx) */
 	get_page(pte_page(pte));
 	spin_unlock(&shm_lock);
 	current->min_flt++;
@@ -823,10 +824,6 @@
 changed:
 	__free_page(page);
 	goto again;
-present:
-	if (page)
-		free_page_and_swap_cache(page);
-	goto done;
 oom:
 	return NOPAGE_OOM;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Andrea Arcangeli (andrea@suse.de)
Thu, 11 Nov 1999 22:12:20 +0100 (CET)

On 11 Nov 1999, Christoph Rohland wrote:

>The patch also fixes some int/size_t issues.

The patch is buggy. In this path:

 				swapin_readahead(entry);
+				if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx))) goto again;
 				page = read_swap_cache(entry);

you `goto again` without first releasing the big kernel lock and without
re-acquiring the shm lock.

But this is a minor implementation issue. There is a worse problem. The
real issue is that in SMP even removing the readahead is still racy. All
the checks for the pte you added are racy.

This patch of mine should fix all races (both UP and SMP). Please try it
out. It's against 2.3.27pre4. Just as the anonymous swapin is protected by
the per-mm semaphore, in shm we must protect ourselves with a
per-shm-segment semaphore to handle the swapin case safely. The design is
then almost the same as in the anonymous swapin.

diff -urN 2.3.27pre4/include/linux/swap.h shm/include/linux/swap.h
--- 2.3.27pre4/include/linux/swap.h	Thu Nov 11 18:23:09 1999
+++ shm/include/linux/swap.h	Thu Nov 11 21:20:36 1999
@@ -112,8 +112,10 @@
 extern struct swap_info_struct swap_info[];
 extern int is_swap_partition(kdev_t);
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
-extern void swap_free(swp_entry_t);
+extern swp_entry_t __get_swap_page(unsigned short);
+#define get_swap_page() __get_swap_page(1)
+extern void __swap_free(swp_entry_t, unsigned short);
+#define swap_free(entry) __swap_free((entry), 1)
 struct swap_list_t {
 	int head;	/* head of priority-ordered swapfile list */
 	int next;	/* swapfile to be used next */
diff -urN 2.3.27pre4/ipc/shm.c shm/ipc/shm.c
--- 2.3.27pre4/ipc/shm.c	Wed Nov 10 16:59:27 1999
+++ shm/ipc/shm.c	Thu Nov 11 21:58:28 1999
@@ -36,6 +36,7 @@
 	pte_t **shm_dir;	/* ptr to array of ptrs to frames -> SHMMAX */
 	struct vm_area_struct *attaches;	/* descriptors for attaches */
 	int id; /* backreference to id for shm_close */
+	struct semaphore sem;
 };
 
 static int findkey (key_t key);
@@ -61,6 +62,9 @@
 static unsigned int num_segs = 0;
 static unsigned short shm_seq = 0; /* incremented, for recognizing stale ids */
 
+/* locks order:
+	shm_lock -> pagecache_lock (end of shm_swap)
+	shp->sem -> other spinlocks (shm_nopage) */
 spinlock_t shm_lock = SPIN_LOCK_UNLOCKED;
 
 /* some statistics */
@@ -260,6 +264,7 @@
 	shp->u.shm_ctime = CURRENT_TIME;
 	shp->shm_npages = numpages;
 	shp->id = id;
+	init_MUTEX(&shp->sem);
 
 	spin_lock(&shm_lock);
 
@@ -770,10 +775,13 @@
 	idx = (address - shmd->vm_start) >> PAGE_SHIFT;
 	idx += shmd->vm_pgoff;
 
+	down(&shp->sem);
 	spin_lock(&shm_lock);
-again:
 	pte = SHM_ENTRY(shp,idx);
 	if (!pte_present(pte)) {
+		/* page not present so shm_swap can't race with us
+		   and the semaphore protects us by other tasks that
+		   could potentially fault on our pte under us */
 		if (pte_none(pte)) {
 			spin_unlock(&shm_lock);
 			page = get_free_highpage(GFP_HIGHUSER);
@@ -781,8 +789,6 @@
 				goto oom;
 			clear_highpage(page);
 			spin_lock(&shm_lock);
-			if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx)))
-				goto changed;
 		} else {
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
@@ -803,9 +809,6 @@
 			unlock_kernel();
 			spin_lock(&shm_lock);
 			shm_swp--;
-			pte = SHM_ENTRY(shp, idx);
-			if (pte_present(pte))
-				goto present;
 		}
 		shm_rss++;
 		pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
@@ -813,21 +816,15 @@
 	} else
 		--current->maj_flt;  /* was incremented in do_no_page */
 
-done:
 	/* pte_val(pte) == SHM_ENTRY (shp, idx) */
 	get_page(pte_page(pte));
 	spin_unlock(&shm_lock);
+	up(&shp->sem);
 	current->min_flt++;
 	return pte_page(pte);
 
-changed:
-	__free_page(page);
-	goto again;
-present:
-	if (page)
-		free_page_and_swap_cache(page);
-	goto done;
 oom:
+	up(&shp->sem);
 	return NOPAGE_OOM;
 }
 
@@ -851,7 +848,11 @@
 	if (!counter)
 		return 0;
 	lock_kernel();
-	swap_entry = get_swap_page();
+	/* subtle: preload the swap count for the swap cache. We can't
+	   increase the count inside the critical section as we can't release
+	   the shm_lock there. And we can't acquire the big lock with the
+	   shm_lock held (otherwise we would deadlock too easily). */
+	swap_entry = __get_swap_page(2);
 	if (!swap_entry.val) {
 		unlock_kernel();
 		return 0;
@@ -893,7 +894,7 @@
 failed:
 		spin_unlock(&shm_lock);
 		lock_kernel();
-		swap_free(swap_entry);
+		__swap_free(swap_entry, 2);
 		unlock_kernel();
 		return 0;
 	}
@@ -905,11 +906,16 @@
 	swap_successes++;
 	shm_swp++;
 	shm_rss--;
+
+	/* add the locked page to the swap cache before allowing
+	   the swapin path to run lookup_swap_cache(). This avoids
+	   reading a not yet uptodate block from disk.
+	   NOTE: we just accounted the swap space reference for this
+	   swap cache page at __get_swap_page() time. */
+	add_to_swap_cache(page_map, swap_entry);
 	spin_unlock(&shm_lock);
 
 	lock_kernel();
-	swap_duplicate(swap_entry);
-	add_to_swap_cache(page_map, swap_entry);
 	rw_swap_page(WRITE, page_map, 0);
 	unlock_kernel();
 
diff -urN 2.3.27pre4/mm/swapfile.c shm/mm/swapfile.c
--- 2.3.27pre4/mm/swapfile.c	Sun Nov  7 17:33:38 1999
+++ shm/mm/swapfile.c	Thu Nov 11 21:38:01 1999
@@ -25,7 +25,7 @@
 
 #define SWAPFILE_CLUSTER 256
 
-static inline int scan_swap_map(struct swap_info_struct *si)
+static inline int scan_swap_map(struct swap_info_struct *si, unsigned short count)
 {
 	unsigned long offset;
 	/*
@@ -73,7 +73,7 @@
 		si->lowest_bit++;
 	if (offset == si->highest_bit)
 		si->highest_bit--;
-	si->swap_map[offset] = 1;
+	si->swap_map[offset] = count;
 	nr_swap_pages--;
 	si->cluster_next = offset+1;
 	return offset;
@@ -81,7 +81,7 @@
 	return 0;
 }
 
-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(unsigned short count)
 {
 	struct swap_info_struct * p;
 	unsigned long offset;
@@ -94,11 +94,13 @@
 		goto out;
 	if (nr_swap_pages == 0)
 		goto out;
+	if (count >= SWAP_MAP_MAX)
+		goto bad_count;
 
 	while (1) {
 		p = &swap_info[type];
 		if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
-			offset = scan_swap_map(p);
+			offset = scan_swap_map(p, count);
 			if (offset) {
 				entry = SWP_ENTRY(type,offset);
 				type = swap_info[type].next;
@@ -123,10 +125,15 @@
 	}
 out:
 	return entry;
+
+bad_count:
+	printk(KERN_ERR "get_swap_page: bad count %hd from %p\n",
+	       count, __builtin_return_address(0));
+	goto out;
 }
 
 
-void swap_free(swp_entry_t entry)
+void __swap_free(swp_entry_t entry, unsigned short count)
 {
 	struct swap_info_struct * p;
 	unsigned long offset, type;
@@ -148,7 +155,9 @@
 	if (!p->swap_map[offset])
 		goto bad_free;
 	if (p->swap_map[offset] < SWAP_MAP_MAX) {
-		if (!--p->swap_map[offset]) {
+		if (p->swap_map[offset] < count)
+			goto bad_count;
+		if (!(p->swap_map[offset] -= count)) {
 			if (offset < p->lowest_bit)
 				p->lowest_bit = offset;
 			if (offset > p->highest_bit)
@@ -170,6 +179,9 @@
 	goto out;
 bad_free:
 	printk("VM: Bad swap entry %08lx\n", entry.val);
+	goto out;
+bad_count:
+	printk(KERN_ERR "VM: Bad count %hd current count %hd\n", count, p->swap_map[offset]);
 	goto out;
 }
 

The only ordering rule I added is that shm_lock must be acquired _before_
pagecache_lock.

I am stressing the code with your shmtst on SMP and it works fine here.

I suggest applying my race fixes to the stock kernel as the design looks
like the right one to me now.

Andrea

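The swap-count change in Andrea's patch (a swap slot can be allocated with an initial use count of 2 and released by 2, instead of always by 1) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `NSLOTS`, `model_get_swap_page()` and `model_swap_free()` are hypothetical names, the value given for `SWAP_MAP_MAX` is assumed, the rejection of a zero count is an addition of this sketch, and the patch's special handling of counts at or above `SWAP_MAP_MAX` is left out.

```c
#include <stdio.h>

#define NSLOTS        8        /* hypothetical size of the toy swap area */
#define SWAP_MAP_MAX  0x7fff   /* assumed value; the patch only names the constant */

/* One use count per swap slot; 0 means free. Slot 0 is never handed
   out, so a return value of 0 can signal failure as in the patch. */
static unsigned short swap_map[NSLOTS];

/* Model of __get_swap_page(count): claim a free slot and start its use
   count at `count` rather than at a fixed 1. Returns the slot, 0 on failure. */
unsigned long model_get_swap_page(unsigned short count)
{
	unsigned long offset;

	if (count == 0 || count >= SWAP_MAP_MAX)	/* count == 0 check is this sketch's own */
		return 0;
	for (offset = 1; offset < NSLOTS; offset++) {
		if (swap_map[offset] == 0) {
			swap_map[offset] = count;
			return offset;
		}
	}
	return 0;
}

/* Model of __swap_free(entry, count): drop `count` references at once.
   Returns 1 if the slot became free, 0 if references remain, and -1
   for a bad slot or a count larger than the current use count. */
int model_swap_free(unsigned long offset, unsigned short count)
{
	if (offset == 0 || offset >= NSLOTS || swap_map[offset] == 0) {
		fprintf(stderr, "bad swap entry %lu\n", offset);
		return -1;
	}
	if (swap_map[offset] < count) {
		fprintf(stderr, "bad count %hu, current count %hu\n",
			count, swap_map[offset]);
		return -1;
	}
	swap_map[offset] = (unsigned short)(swap_map[offset] - count);
	return swap_map[offset] == 0;
}
```

In this model, allocating with a count of 2 corresponds to reserving one reference for the shm page table entry and one for the swap cache page up front, so no second increment is needed later while shm_lock is held; the failure path gives both back with a single call.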
Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred (manfreds@colorfullife.com)
Fri, 12 Nov 1999 05:05:34 -0500 (EST)

> > On 11 Nov 1999, Christoph Rohland wrote:
> >
> >The patch also fixes some int/size_t issues.

> But this is a minor implementation issue. There is a worse problem. The
> real issue is that in SMP even removing the readahead is still racy. All
> the checks for the pte you added are racy.

The current code is UP only. There are new ipc helper functions in
ipc/util.h and I'll convert the code RSN.

>
> This patch of mine should fix all races (both UP and SMP). Please try it
> out. It's against 2.3.27pre4. Just as the anonymous swapin is protected by
> the per-mm semaphore, in shm we must protect ourselves with a
> per-shm-segment semaphore to handle the swapin case safely. The design is
> then almost the same as in the anonymous swapin.

Interesting idea. I thought about acquiring the kernel lock a bit earlier,
but perhaps I can avoid that with a semaphore.

>
> The only ordering rule I added is that shm_lock must be acquired _before_
> pagecache_lock.
>
Yes.

> I am stressing the code with your shmtst on SMP and it works fine here.
>
> I suggest applying my race fixes to the stock kernel as the design looks
> like the right one to me now.
>
I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
you single-thread the swapin code (per-segment, but still single thread).

I'll think about it,
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 12 Nov 1999 04:09:24 -0800 (PST)

On Fri, 12 Nov 1999, Manfred wrote:
>
> I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
> you single-thread the swapin code (per-segment, but still single thread)

I think the semaphore is a good idea, if only because it makes things much
more obviously correct - exactly because of the clear serialization.

And I don't think the serialization is a performance problem, because
by the time you start paging we're not talking about high performance
shared memory anyway, and because it's per-segment it is not going to make
"system" performance any worse.

In fact, my reaction to the semaphore is "do we actually need the
spinlock any more"?

		Linus

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Fri, 12 Nov 1999 16:05:22 +0100

Linus Torvalds wrote:
>
> On Fri, 12 Nov 1999, Manfred wrote:
> >
> > I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
> > you single-thread the swapin code (per-segment, but still single thread)
>
> I think the semaphore is a good idea, if only because it makes things much
> more obviously correct - exactly because of the clear serialization.

I agree that the current code is a total mess (I have converted it to
the ipc/util.h helper functions, and I found further SMP and UP races)
_if_ I find a simple serialization, then I'll kill the semaphore.

> And I don't think the serialization is a performance problem, because
> by the time you start paging we're not talking about high performance
> shared memory anyway, and because it's per-segment it is not going to make
> "system" performance any worse.
>

What about a 100-gigabyte shm segment (on a 64-bit platform) with a fast
scsi disk system? The semaphore will prevent any tagged commands, and it
will downgrade (performance wise) the scsi system to a slow ide disk.

Btw, I'm sure that for multi-threaded applications, the mmap performance
of Linux will be poor because everything is single-threaded. I'll
write a benchmark and compare it with WinNT/Win95.

>
> In fact, my reaction to the semaphore is "do we actually need the
> spinlock any more"?
>
shm_swap() must not acquire a semaphore, or we could lock-up during
low-memory.

--
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 12 Nov 1999 10:32:16 -0800 (PST)

On Fri, 12 Nov 1999, Manfred Spraul wrote:
>
> > And I don't think the serialization is a performance problem, because
> > by the time you start paging we're not talking about high performance
> > shared memory anyway, and because it's per-segment it is not going to make
> > "system" performance any worse.
>
> What about a 100-gigabyte shm segment (on a 64-bit platform) with a fast
> scsi disk system? The semaphore will prevent any tagged commands, and it
> will downgrade (performance wise) the scsi system to a slow ide disk.

Nope. The swap-in read-ahead still works - the _only_ thing the semaphore
does is serialize different processes accessing the same area, and that's
as likely to improve performance as to degrade it (potentially less
seeking).

> Btw, I'm sure that for multi-threaded applications, the mmap performance
> of Linux will be poor because everything is single-threaded. I'll
> write a benchmark and compare it with WinNT/Win95.

I will bet you 5 bucks we'll kick ass.

		Linus

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 02:09:25 +0100

Linus Torvalds wrote:
>
> > Btw, I'm sure that for multi-threaded applications, the mmap performance
> > of Linux will be poor because everything is single-threaded. I'll
> > write a benchmark and compare it with WinNT/Win95.
>
> I will bet you 5 bucks we'll kick ass.
>
You've lost:

Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
620,000,000 bytes test file, fat filesystem, the same disk is used for
NT and Linux.

command: "./pagein fill 150000 #"
where fill is the filename, 150000 means 150000 pages are trashed, and #
is the number of threads.

Linux:
	#	pages/sec
	1	13
	4	14
	64	14
	256	? [computer unresponsive]

NT:
	#	pages/sec
	1	18
	4	20
	64	28
	256	31
	512	33

Linux is slower, and it cannot use multiple threads to reorder the
sector reads; NT gets faster if I add further threads.

source code is at
http://colorfullife.com/~manfreds/pagein/pagein.cpp

--
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Sat, 13 Nov 1999 01:33:20 +0000 (GMT)

> You've lost:
> Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],

So you benchmarked with a very slow I/O device. Ok, that should mean it's
silly numbers for both, tied entirely to the seek rate of the media.

> 620,000,000 bytes test file, fat filesystem, the same disk is used for
> NT and Linux.

Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
be interesting.

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred (manfreds@colorfullife.com)
Sat, 13 Nov 1999 09:48:58 +0100

From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> > You've lost:
>
> > Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> > Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
>
> So you benchmarked with a very slow I/O device. Ok, that should mean it's
> silly numbers for both, tied entirely to the seek rate of the media.
>

Yes, intentionally, that was the slowest disk I found:
Linux single-threads the paging io, ie it cannot reorder the read
operations.
I wrote that this is a huge disadvantage, and the numbers show that.

> > 620,000,000 bytes test file, fat filesystem, the same disk is used for
> > NT and Linux.
>
> Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> be interesting.

I'll try it with a faster disk, but initial tests show that:
- NT gets faster if I add further threads
- Linux cannot reorder the disk io, and it remains at the same
  performance for 1 thread and for 64 threads.
- the benchmark is io bound, ie the internal efficiency of the os
  doesn't matter.

Jeff Garzik wrote:
> Is this test done on kernel 2.3.28?

2.3.27

--
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Sat, 13 Nov 1999 14:21:11 +0000 (GMT)

> Yes, intentionally, that was the slowest disk I found:
> Linux single-threads the paging io, ie it cannot reorder the read
> operations.
> I wrote that this is a huge disadvantage, and the numbers show that.

Ok, now I understand what you are trying to show. That would make sense.

Alan

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 16:15:47 +0100

Alan Cox wrote:
> So you benchmarked with a very slow I/O device.
>
> Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> be interesting.

Ok, I switched to a Seagate ST34520N (7200 rpm, scsi2 narrow, 4.5 GB),
and I added a new test: Linux-multi-thread vs Linux-multi-process. The
results are as I expected:

- Linux-multi-process is more or less on par with NT. The 20% difference
  could be the thread/process overhead.
- Linux-multi-thread is slow.

450000 pages test file, ext2 and NTFS, 128 MB ram, Sym810 controller,
AMD K6/200

# is the number of threads/processes which are running.

	#	Linux-threads	Linux-processes	NT (threads)
	1	51		51		60
	16	51		67		96
	64	50		73		105
	128	48		75		107

The modified source code is at
http://colorfullife.com/~manfreds/pagein/pagein.cpp

--
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Gerard Roudier (groudier@club-internet.fr)
Sat, 13 Nov 1999 18:49:10 +0100 (MET)

Hi Manfred,

Would it be possible for you to run benchmarks against O/Ses whose source
code we have access to, instead of binary-only ones? This would allow us
to learn a lot more from the differences. For example, FreeBSD is as
simple as Redhat to install and a base system will consume far less disk
space than NT.

Basically I am not interested at all in your benchmarks, because my
personal box has only free O/Ses installed.

Maybe you will reply that Linux is mostly competing against NT nowadays.
Anyway, ignoring other free O/Ses seems scornful to me, given the synergy
that existed and still exists in some places.

Gérard.

On Sat, 13 Nov 1999, Manfred Spraul wrote:

> Alan Cox wrote:
> > So you benchmarked with a very slow I/O device.
> >
> > Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> > be interesting.
>
> Ok, I switched to a Seagate ST34520N (7200 rpm, scsi2 narrow, 4.5 GB),
> and I added a new test: Linux-multi-thread vs Linux-multi-process. The
> results are as I expected:
>
> - Linux-multi-process is more or less on par with NT. The 20% difference
>   could be the thread/process overhead.
> - Linux-multi-thread is slow.
>
> 450000 pages test file, ext2 and NTFS, 128 MB ram, Sym810 controller,
> AMD K6/200
>
> # is the number of threads/processes which are running.
>
> 	#	Linux-threads	Linux-processes	NT (threads)
> 	1	51		51		60
> 	16	51		67		96
> 	64	50		73		105
> 	128	48		75		107
>
> The modified source code is at
> http://colorfullife.com/~manfreds/pagein/pagein.cpp
>
> --
> 	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 18:55:53 +0100

Gerard Roudier wrote:
>
> Hi Manfred,
>
> Could it be possible for you to run benchmarks against O/Ses we have
> access to the source code instead of binary-only available ones. This
> would allow to learn a lot better from the differences. For example
> FreeBSD is as simple as Redhat to install and a base system will consume
> far less disk space than NT.
>
Source code is at http://colorfullife.com/~manfreds/pagein/pagein.cpp;
I don't have FreeBSD.

--
	Manfred

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Dominik Kubla (dominik.kubla@uni-mainz.de)
Sat, 13 Nov 1999 19:11:49 +0100

On Sat, Nov 13, 1999 at 06:55:53PM +0100, Manfred Spraul wrote:
> Gerard Roudier wrote:
> >
> > Hi Manfred,
> >
> > Could it be possible for you to run benchmarks against O/Ses we have
> > access to the source code instead of binary-only available ones. This
> > would allow to learn a lot better from the differences. For example
> > FreeBSD is as simple as Redhat to install and a base system will consume
> > far less disk space than NT.
> >
> Source code is at http://colorfullife.com/~manfreds/pagein/pagein.cpp;
> I don't have FreeBSD.

Gerard was referring to the source code of the _OS_, not your benchmark!
And I have to agree with him: There is no way to understand what an OS
is really doing without looking at the source.

(Reminds me of our X11 benches back in the "old times": only by running
them on really slow hardware could we see that some commercial servers
were "optimized for benchmarks" - they simply skipped some drawing
operations. DOH!)

As for not having FreeBSD: simply look at www.freebsd.org...

Yours,
  Dominik Kubla

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 21:00:52 +0100

Dominik Kubla wrote:
> Gerard was referring to the source code of the _OS_, not your benchmark!
> And i have to agree with him: There is no way to understand what a OS
> is really doing without looking at the source.

In this case you don't need the source code: Do you have a really noisy
drive with a slow seek time? Then you would hear the difference:

- WinNT and Linux-fork sound 'round' with lots of threads/processes, and
  the performance increases.
- Linux-multithread always sounds identical (1 thread or 64); the
  performance doesn't change.

You don't need to be a rocket scientist to figure out that the cause is
the mmap semaphore, ie that Linux single threads the io for
multi-threaded applications. Linux with multiple processes or WinNT
reorder the disk io, and thus they get faster with more
processes/threads.

--
	Manfred

P.S.: if you prefer to look at the source, then compare Linux-fork and
Linux-multithread.

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Thu, 18 Nov 1999 19:29:15 -0800 (PST)

On Sat, 13 Nov 1999, Manfred Spraul wrote:
>
> Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
> 620,000,000 bytes test file, fat filesystem, the same disk is used for
> NT and Linux.

Re-do this without the ridiculous filesystem, and I'll bother to even
check the numbers.

That said, I don't think this can/will be fixed for a 2.4 timeframe,
especially as I haven't heard of any real-life usage where it would be an
issue..

		Linus

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Fri, 19 Nov 1999 12:33:55 +0000 (GMT)

> That said, I don't think this can/will be fixed for a 2.4 timeframe,
> especially as I haven't heard of any real-life usage where it would be an
> issue..

News servers like Typhoon, high performance threaded web servers (eg Zeus)

Fortunately these guys tend to be using pretty serious I/O subsystems not
M/O disks and they are fine with 2.2.

Alan

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Fri, 19 Nov 1999 15:36:54 +0100

Alan Cox wrote:
>
> > That said, I don't think this can/will be fixed for a 2.4 timeframe,
> > especially as I haven't heard of any real-life usage where it would be an
> > issue..
>
> News servers like Typhoon, high performance threaded web servers (eg Zeus)
>
Do you know if they are using mmap?

>
> Fortunately these guys tend to be using pretty serious I/O subsystems not
> M/O disks and they are fine with 2.2.
>
I did a second test with a faster disk (SCSI-2-narrow 4.5 GB seagate),
and the results were nearly identical: the mmap semaphore kills around
33% performance if I compare 64 threads with 64 processes. (33% slower
or 50% faster, depending on your point of view)

Please note that the test is extremely I/O bound, ie I defeat read-ahead
with a RNG, and I only read one byte in every page, and the file is far
larger than available memory.

I'll try to find a faster drive (I had somewhere an old 10kRPM wide SCSI
drive), but I would be surprised if the performance drop would be < 30%.

--
	Manfred

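The access pattern Manfred describes (map a large file, pick pages with a random number generator so that read-ahead does not help, and read a single byte from each chosen page) can be sketched as below. This is a hedged illustration, not the pagein.cpp referenced in the thread: `make_test_file()` and `touch_random_pages()` are hypothetical helper names, the xorshift generator is just one possible RNG, and the per-worker loop is shown without the thread or process fan-out that the real benchmark compares.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a sparse file that is `npages` pages long.
   Returns 0 on success, -1 on failure. */
int make_test_file(const char *path, long npages)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (page <= 0 || npages <= 0 ||
	    ftruncate(fd, (off_t)npages * page) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Map `path` read-only and read one byte from `touches` randomly chosen
   pages. Returns the number of pages touched, or -1 on failure. */
long touch_random_pages(const char *path, long touches, uint32_t seed)
{
	long page = sysconf(_SC_PAGESIZE);
	const volatile unsigned char *map;
	uint32_t x = seed ? seed : 1;	/* xorshift state must be nonzero */
	struct stat st;
	size_t length;
	void *addr;
	long npages, done;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (page <= 0 || fstat(fd, &st) != 0 || st.st_size < page) {
		close(fd);
		return -1;
	}
	npages = (long)(st.st_size / page);
	length = (size_t)npages * (size_t)page;
	addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping stays valid after close */
	if (addr == MAP_FAILED)
		return -1;

	map = addr;
	for (done = 0; done < touches; done++) {
		/* xorshift32: cheap pseudo-random page index, defeats read-ahead */
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		/* volatile read: one byte per page, forcing a fault if not resident */
		(void)map[(size_t)(x % (unsigned long)npages) * (size_t)page];
	}
	munmap(addr, length);
	return done;
}
```

To approximate the comparison in the thread, each thread (sharing one address space) or each forked process (with its own) would run `touch_random_pages()` with a different seed on a file much larger than RAM; a sparse file as created by `make_test_file()` only exercises the code path and does not generate real disk seeks.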
Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Fri, 19 Nov 1999 14:40:15 +0000 (GMT)

> > News servers like Typhoon, high performance threaded web servers (eg Zeus)
>
> Do you know if they are using mmap?

Yes. Typhoon uses threaded mmap so aggressively it became an unintentional
test suite for the Linux mm layer, and in 2.0/2.1 it found a lot of bugs.

> Please note that the test is extremely I/O bound, ie I defeat read-ahead
> with a RNG, and I only read one byte in every page, and the file is far
> larger than available memory.

I would expect Typhoon to show some reasonably sane locality

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 19 Nov 1999 11:25:45 -0800 (PST)

On Fri, 19 Nov 1999, Alan Cox wrote:
>
> News servers like Typhoon, high performance threaded web servers (eg Zeus)
>
> Fortunately these guys tend to be using pretty serious I/O subsystems not
> M/O disks and they are fine with 2.2.

Well, the more I look at a read-write semaphore, the more I like it: it
looks like something that once the semaphore implementation itself was
done, the MM side would be absolutely trivial. It does introduce a new
issue (multiple threads updating the page tables at the same time), but
that one doesn't look that horrible..

We don't ever export the page table handling to the low-level filesystems
any more (we used to a long time ago: the nopage() function got to touch
the page tables itself rather than just return the right page), so fixing
up the new issue is actually a very local fix in mm/memory.c.

Is anybody willing to take a stab at creating a read-write semaphore?

		Linus

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Marcelo Tosatti (marc...@conectiva.com.br)
Sat, 20 Nov 1999 09:40:07 -0200 (BRDT)

> >
> > News servers like Typhoon, high performance threaded web servers (eg Zeus)
> >
> > Fortunately these guys tend to be using pretty serious I/O subsystems not
> > M/O disks and they are fine with 2.2.
>
> Well, the more I look at a read-write semaphore, the more I like it: it
> looks like something that once the semaphore implementation itself was
> done, the MM side would be absolutely trivial. It does introduce a new
> issue (multiple threads updating the page tables at the same time), but
> that one doesn't look that horrible..
>
> We don't ever export the page table handling to the low-level filesystems
> any more (we used to a long time ago: the nopage() function got to touch
> the page tables itself rather than just return the right page), so fixing
> up the new issue is actually a very local fix in mm/memory.c.
>
> Is anybody willing to take a stab at creating a read-write semaphore?
>
> 		Linus

http://bazar.conectiva.com.br/~marcelo/rwsem-2.3.18ac7.patch

This code is a Linux "port" of the pseudo-code implementation found in the
"Unix Kernel Internals" book i wrote some time ago.

The patch also modifies the "uts_sem" semaphore in kernel/sys.c to a rw
semaphore.

I've not tested it extensively so there might be ugly bugs/races.

Any constructive comments/bug reports are welcome.

- Marcelo

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Sat, 20 Nov 1999 16:33:49 -0800 (PST)

On Sat, 20 Nov 1999, Marcelo Tosatti wrote:
>
> http://bazar.conectiva.com.br/~marcelo/rwsem-2.3.18ac7.patch
> This code is a Linux "port" of the pseudo-code implementation found in the
> "Unix Kernel Internals" book I wrote some time ago.

Well, if it's a port of that, then it won't have the 2-instruction
fast-path that is pretty much required, imho.

I'll see if I can get a free afternoon some day and try to port the
current x86 semaphore code over to a rw version too. The plan was
something like this:

 - read_down():
	lock ; incl mem
	js contention_rw

 - read_up():
	lock ; decl mem
	js wake_up_writer

 - write_down():
	lock ; btsl $31,mem
	jc contention_ww
	testl $0x7fffffff,mem
	jne contention_wr

 - write_up():
	lock ; andl $0x7fffffff,mem
	jne wake_up_reader_or_writer

where all the three contention cases grab a "contention spinlock" before
they then start sorting things out.

The only interesting part is making sure that the contention case gets the
wakeups, and the above counts on:

 - if a writer is waiting for readers (contention_wr), then the writer
   will have already set the high bit, and a reader will know to wake it
   up because the rw-semaphore value will be negative when it does
   read_up().

 - if a reader is waiting for a writer, then the reader will have
   incremented the semaphore, and the writer will know to wake it up
   because the semaphore value won't be zero after the "write_up()".

 - if a writer is waiting for another writer (contention_ww case), it will
   have to increment the "reader" part of the semaphore value, in order to
   get the other writer to wake it up on "write_up()".

All other races should be trivially handled by just having the spinlock,
so the only really hard cases are the fast-path stuff where we cannot get
the semaphore because it is too expensive.

Does anybody see any holes in the above pseudo-implementation? Please take
a look at the way the current x86 semaphores are implemented: they use
exactly the above kinds of single-atomic-instruction-plus-condition-codes
trickery to get the non-contention case without _any_ extra instructions.

Linus
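A minimal user-space sketch of the four fast paths in the plan above, for readers who want to experiment with the counter encoding. It is an illustration under stated assumptions, not the kernel code: `rwsem_sketch`, the `*_fast` helpers and the `wstate` enum are hypothetical names, the C11 `<stdatomic.h>` read-modify-write calls stand in for the `lock`-prefixed x86 instructions, and the contention paths (everything that would take the "contention spinlock" and sleep) are left out. Each helper only reports which branch the condition codes would have selected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RW_WRITER_BIT  0x80000000u  /* bit 31: a writer holds or wants the lock */
#define RW_READER_MASK 0x7fffffffu  /* low 31 bits: count of readers and waiters */

/* Hypothetical container for the single counter word ("mem" in the plan). */
typedef struct { _Atomic uint32_t count; } rwsem_sketch;

typedef enum { W_ACQUIRED, W_CONTENDED_WW, W_CONTENDED_WR } wstate;

/* read_down(): "lock ; incl mem ; js contention_rw".
 * Returns true when the result is non-negative (no writer bit). */
static bool read_down_fast(rwsem_sketch *s)
{
    uint32_t now = atomic_fetch_add(&s->count, 1) + 1;
    return (now & RW_WRITER_BIT) == 0;
}

/* read_up(): "lock ; decl mem ; js wake_up_writer".
 * Returns false when the result is negative, i.e. a writer may need waking. */
static bool read_up_fast(rwsem_sketch *s)
{
    uint32_t now = atomic_fetch_sub(&s->count, 1) - 1;
    return (now & RW_WRITER_BIT) == 0;
}

/* write_down(): "lock ; btsl $31,mem ; jc contention_ww ;
 *                testl $0x7fffffff,mem ; jne contention_wr". */
static wstate write_down_fast(rwsem_sketch *s)
{
    uint32_t old = atomic_fetch_or(&s->count, RW_WRITER_BIT);
    if (old & RW_WRITER_BIT)
        return W_CONTENDED_WW;          /* another writer already set the bit */
    if (atomic_load(&s->count) & RW_READER_MASK)
        return W_CONTENDED_WR;          /* readers are still inside */
    return W_ACQUIRED;
}

/* write_up(): "lock ; andl $0x7fffffff,mem ; jne wake_up_reader_or_writer".
 * Returns true when nothing is left in the low bits, so nobody needs waking. */
static bool write_up_fast(rwsem_sketch *s)
{
    uint32_t old = atomic_fetch_and(&s->count, RW_READER_MASK);
    assert(old & RW_WRITER_BIT);        /* caller must have set the writer bit */
    return (old & RW_READER_MASK) == 0;
}
```

Starting from a zero counter, one `read_down_fast()` followed by `write_down_fast()` reports `W_CONTENDED_WR`, and the matching `read_up_fast()` then returns false, which is the cue described above for the reader to wake the waiting writer.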
From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Original-Date: Thu, 25 Nov 1999 14:33:51 +0100 (CET)
Message-ID: <fa.jrengav.80uk26@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.10.9911250353370.21876-100000@alpha.random>
References: <fa.oa9df7v.ika7b9@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>

On Fri, 19 Nov 1999, Linus Torvalds wrote:

>Well, the more I look at a read-write semaphore, the more I like it: it
>looks like something that once the semaphore implementation itself was
>done, the MM side would be absolutely trivial. It does introduce a new
>issue (multiple threads updating the page tables at the same time), but
>that one doesn't look that horrible..

If you allow more than one task to fault, for example in the swapin path,
you'll get into trouble, as you can't solve this race cleanly with a
spinlock. That's why I added the semaphore to the shm segments in the
first place.

Only replacing the down() with a read_down() in do_page_fault is _not_
enough. The semaphore is not there only to protect from mmap and vma
changes under us; right now it's there mainly to protect against other
threads faulting under us.

IMHO the semaphore makes a performance difference only with threads doing
paging of mmapped files while fooling readahead. The swapin case is not
interesting IMHO (and we do readahead also for the swapins).

Maybe we can find a way to drop the semaphore in the nopage path. The read
semaphore in do_page_fault does not make much sense to me, as we would
need really tricky code to solve the races by hand without a performance
advantage in RL.

Andrea
From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Original-Date: Thu, 25 Nov 1999 09:20:57 -0800 (PST)
Message-ID: <fa.obabh7v.gkc5b0@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.10.9911250913590.4390-100000@penguin.transmeta.com>
References: <fa.jrengav.80uk26@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>

On Thu, 25 Nov 1999, Andrea Arcangeli wrote:
>
> If you allow more than one task to fault, for example in the swapin path,
> you'll get into trouble, as you can't solve this race cleanly with a
> spinlock. That's why I added the semaphore to the shm segments in the
> first place.

No, you can solve it cleanly by just changing the code: you only really
need to guarantee that the mapping doesn't change under you (that would be
disastrous and very hard to recover from). Somebody else filling in the
page before you is simple to check for.

> Only replacing the down() with a read_down() in do_page_fault is _not_
> enough. The semaphore is not there only to protect from mmap and vma
> changes under us; right now it's there mainly to protect against other
> threads faulting under us.

"mainly" is incorrect. The main protection is to maintain the vma list
sanely, that was always the case (it used to be easy to crash the kernel
by using threads that pagefaulted and mmap'ed at the same time).

Protecting against others paging in is trivial, and in fact we used to do
that as long ago as 1.2.x if I remember correctly (the mm code was very
different back then).

The way we used to do that was to remember the original pte value, and
before updating it with the new page that was just paged in we just check
that the pte value hasn't changed. In 1.2.x that protected us against
threads that paged in simultaneously, and the races introduced by the IO
waiting. But it was not enough to protect against mmap's changing the vma,
so we introduced the semaphore in 1.3.x, and because we had the semaphore
we could also remove the optimistic checking.

In 2.3.x, we can use the same trivial approach to protect against threads.
It adds basically no overhead at all - we have to get the spinlock anyway,
and the final check before changing the page tables is basically a single
load and compare.

Linus
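The "remember the original pte value, then check it before updating" scheme described above can be sketched in plain C. This is a user-space model under stated assumptions, not kernel code: `pte_sketch` is a hypothetical stand-in for a page-table entry, a C11 `atomic_flag` plays the role of the page-table spinlock, and the sleeping I/O between the two steps is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t pte_sketch;   /* hypothetical stand-in for a pte value */

/* Hypothetical stand-in for the spinlock guarding the page tables. */
static atomic_flag pt_lock = ATOMIC_FLAG_INIT;

static void pt_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&pt_lock))
        ;                      /* spin until the flag was previously clear */
}

static void pt_lock_release(void)
{
    atomic_flag_clear(&pt_lock);
}

/* Step 1: remember the original pte value under the lock.
 * The lock is dropped again so the caller may sleep for I/O. */
static pte_sketch pte_snapshot(const pte_sketch *slot)
{
    assert(slot);
    pt_lock_acquire();
    pte_sketch seen = *slot;
    pt_lock_release();
    return seen;
}

/* Step 2: after the page has been read in, retake the lock and install
 * the new entry only if nobody changed the slot in the meantime.
 * The final check is a single load and compare. */
static bool pte_install_if_unchanged(pte_sketch *slot, pte_sketch seen,
                                     pte_sketch fresh)
{
    assert(slot);
    pt_lock_acquire();
    bool unchanged = (*slot == seen);
    if (unchanged)
        *slot = fresh;
    pt_lock_release();
    return unchanged;
}
```

A caller that gets `false` back lost the race to another thread and would drop its freshly read page rather than overwrite the entry.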
From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Original-Date: Thu, 25 Nov 1999 18:18:44 +0100 (CET)
Message-ID: <fa.jqutfav.9g8l2b@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.10.9911251808140.22916-100000@alpha.random>
References: <fa.obabh7v.gkc5b0@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>

On Thu, 25 Nov 1999, Linus Torvalds wrote:

>In 2.3.x, we can use the same trivial approach to protect against threads.

For the allocation it is trivial of course (I was just doing that in
shm.c).

But I have not been trivially successful in fixing the shm swapin races
with "read pte with spinlock acquired, release the spinlock, reacquire the
spinlock and then check if the pte is changed". That's why I added the
spinlock. The _main_ problem I had is that to swapout we have to grab the
kernel lock and we'll sleep, and so I would need to acquire the spinlocks
in inverse order (deadlock prone). So I gave up and took the _trivial_
mainstream way of using the semaphore to protect against multiple thread
accesses (also, for shm.c using a semaphore is less interesting, as shm.c
can't do I/O in the nopage operation unless it's a swapin).

I hope I was missing something and that it's simpler...

Andrea
From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Original-Date: Thu, 25 Nov 1999 18:23:56 +0100 (CET)
Message-ID: <fa.jpv1fiv.8g4lq6@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.10.9911251823040.24875-100000@alpha.random>
References: <fa.jqutfav.9g8l2b@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>

On Thu, 25 Nov 1999, Andrea Arcangeli wrote:

>spinlock and then check if the pte is changed". That's why I added the
>spinlock. The _main_ problem I had is that to swapout we have to grab the
 ^^^^^^^^
of course I meant "semaphore" ;)

Andrea
From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Original-Date: Thu, 25 Nov 1999 09:57:10 -0800 (PST)
Message-ID: <fa.o99ri6v.iks4b3@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.10.9911250950110.4390-100000@penguin.transmeta.com>
References: <fa.jqutfav.9g8l2b@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>

On Thu, 25 Nov 1999, Andrea Arcangeli wrote:
>
> But I have not been trivially successful in fixing the shm swapin races
> with "read pte with spinlock acquired, release the spinlock, reacquire the
> spinlock and then check if the pte is changed". That's why I added the
> spinlock.

I was planning on just depending on the sanity of the page cache on this
one.

Basically we have two cases:

 - paging in something new ("no_page"), for which the final test is just
   to test that the page table is still zero (ie we don't even need to
   save any "original" value).

 - paging in something old ("swap_page"), in which case the final test is
   to check that the pte is still the same as swp_entry_to_pte(entry).

(we have the rw_page case too, but that is already protected by the
spinlock appropriately as far as I can tell, exactly because it already
has the same race wrt page_out rather than page_in).

No, I haven't checked the exact details. Maybe it's worse than I envision,
but it _looks_ like adding a simple spinlock and the test. If the test
fails, we just return and expect the fault to happen again..

Linus
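The two final tests listed above differ only in what the page-table slot is compared against. A small C sketch of that difference follows, again as a user-space model rather than kernel code: `pte_sketch`, `fault_result` and both `finish_*` helpers are hypothetical names, the caller is assumed to already hold the spinlock, and `swap_pte` stands for the value the message calls swp_entry_to_pte(entry).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t pte_sketch;   /* hypothetical stand-in for a pte value */

/* FAULT_RETRY models "just return and expect the fault to happen again". */
enum fault_result { FAULT_DONE, FAULT_RETRY };

/* "no_page" case: nothing to remember, the slot must simply still be zero.
 * Caller is assumed to hold the page-table spinlock. */
static enum fault_result finish_no_page(pte_sketch *slot, pte_sketch fresh)
{
    assert(fresh != 0);        /* a zero entry would look like "still empty" */
    if (*slot != 0)
        return FAULT_RETRY;    /* somebody else filled the slot first */
    *slot = fresh;
    return FAULT_DONE;
}

/* "swap_page" case: the slot must still hold the swap entry we read from.
 * Caller is assumed to hold the page-table spinlock. */
static enum fault_result finish_swap_page(pte_sketch *slot,
                                          pte_sketch swap_pte,
                                          pte_sketch fresh)
{
    if (*slot != swap_pte)
        return FAULT_RETRY;    /* entry changed while we were doing I/O */
    *slot = fresh;
    return FAULT_DONE;
}
```

In both helpers a failed test leaves the slot untouched, so the retried fault simply sees whatever the winning thread installed.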