From: Linus Torvalds <torva...@transmeta.com> Subject: Linux-2.3.15.. Date: 1999/08/26 Message-ID: <fa.o9abevv.iks73d@ifi.uio.no>#1/1 X-Deja-AN: 517263303 Original-Date: Wed, 25 Aug 1999 16:36:10 -0700 (PDT) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908251619060.5207-100000@penguin.transmeta.com> To: Kernel Mailing List <linux-ker...@vger.rutgers.edu> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu There's a rather huge patch-set out there now, taking the 2.3.x series to 2.3.15. This has a lot of the merge code I've been sent over the last two weeks, but I will invariably have missed some, if for no other reason than simply that I got absolutely _flooded_ by people sending me patches. One of the more interesting things was the SMP pipe cleanup sent by Richard, but try as I might it was never really stable under load on x86 - not with the plain semaphores in 2.3.14, and not with the patches Andrea had either. I assume Richard tested it on an alpha with the much more well-thought-out atomic operation that the alpha provides. I ended up rewriting the x86 semaphore code (and some of Richards pipe code too, for that matter, to get rid of some races in waking things up), and it doesn't show the problems I saw before, but hey, maybe I just exchanged one set of problems for another set that I can't trigger any more. Give me feedback, please. Other features that don't impact everybody, but are rather major: - ATM support merged in - firewalling is gone (again), replaced by an even more generic netfilter facility. - general networking merges and updates - Various driver updates (ISDN, ISA PnP, sound, fbcon, usb, intelliport, you name it) - make system call return type "long" even if the system call only returns valid data in the lower order bits - we use the high bits for error handling, and some 64-bit architectures care (read: the Merced calling conventions want this because they don't automatically extend the return type - I bet it will be a new portability issue for other programs than just the kernel) Have fun, Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
On Wed, 25 Aug 1999, Linus Torvalds wrote:
>I ended up rewriting the x86 semaphore code (and some of Richards pipe
>code too, for that matter, to get rid of some races in waking things up),
>and it doesn't show the problems I saw before, but hey, maybe I just
>exchanged one set of problems for another set that I can't trigger any
>more. Give me feedback, please.
I guess the problem is the pipe code since I understood the old semaphores
completly and there weren't SMP races there.
Your new semaphores seems completly buggy to me and I am surprised your
kernel works without crash or corruption with them.
task1 task2 task3 -> effect -> count sleepers
----- ----- ----- ----- --------
1 0
------- task 0 does a down() ------------------ 0 0
------- here task 1,2,3 try to get the lock ---
down() -1 1
(I avoided the details here)
schedule()
down() -2 1
spin_lock()
sleepers++ -2 2
add_neg(1) -1??? 2
sleepers = 1 -1 1
schedule()
down()
spin_lock()
sleepers++ -1 2
add_neg(1) 0???????????
ret not negative!!
sleepers = 0 0 0
wakeup()
spin_unlock()
two task got the lock at the same time!!!!!
the above isn't a subtle SMP race and I think it can trigger easily in
real-world. So I would be suprised if 2.3.15 would be rock solid.
Maybe I am missing something in your semaphores but I can't see what.
Tomorrow I'll continue thinking about this issue and (if I am not dreaming
about the above ;) I'll fix the new semaphores (so the new global set_mb()
patch will be delayed a bit).
Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: Linux-2.3.15.. Date: 1999/08/28 Message-ID: <fa.o99nfvv.ikg632@ifi.uio.no>#1/1 X-Deja-AN: 518049611 Original-Date: Fri, 27 Aug 1999 10:45:46 -0700 (PDT) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908271040440.1013-100000@penguin.transmeta.com> References: <fa.j2qsg0v.17limb5@ifi.uio.no> To: Andrea Arcangeli <and...@suse.de> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 27 Aug 1999, Andrea Arcangeli wrote: > > I guess the problem is the pipe code since I understood the old semaphores > completly and there weren't SMP races there. Well, I certainly saw strange behaviour. The trylock code seemed to be the prime culprit - it tried to decrement the "waking" count, but it could end up doing it too late so that people had already seen a increment from a concurrent "up()". I'm not saying the new code is bug-free, but it works for me where the old one did not - and your claim that it is obviously broken is also obviously wrong, see later.. > Your new semaphores seems completly buggy to me and I am surprised your > kernel works without crash or corruption with them. Well, you're not counting right.. > task1 task2 task3 -> effect -> count sleepers > ----- ----- ----- ----- -------- > 1 0 > > ------- task 0 does a down() ------------------ 0 0 > ------- here task 1,2,3 try to get the lock --- > > down() -1 1 > (I avoided the details here) > schedule() > down() -2 1 > spin_lock() > sleepers++ -2 2 > add_neg(1) -1??? 2 > sleepers = 1 -1 1 > schedule() > down() > spin_lock() > sleepers++ -1 2 Wrong counting. The "down()" decremented count, so you have -2 2 which is exactly right, and explains why there will _not_ be two users in the critical region at the same time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@suse.de> Subject: new semaphores Date: 1999/08/28 Message-ID: <fa.j3b6i0v.1650kb2@ifi.uio.no> X-Deja-AN: 518333748 Original-Date: Sun, 29 Aug 1999 00:53:11 +0200 (CEST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908282233070.4789-100000@laser.random> References: <fa.o99nfvv.ikg632@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.random Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list X-Public-Key-URL: http://e-mind.com/~andrea/aa.asc MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu It seems to me that there's no need of wake-up the sleepers in the success path of __down(). If you get the semaphore nobody else can do progresses. You'll sure wakeup the sleepers some time soon in up(). Then I changed the kind of the semaphores from wake-all to wake-one-FIFO. To do that I must always try to wakeup one sleeper if I giveup in down_interruptible() since I may got the wakeup from an up() while I was giving up while I was still in the waitqueue. So I must make sure to pass the wake-one wakeup from me to another sleeper before giving up. down_trylock seems buggy because it always returns 1 and it won't make difference between the case that held the semaphore and the case that find the semaphore busy. In trylock we must not care about the waitqueue at all since we don't add ourself in the waitqueue there. trylock can't be the cause of a missed wakeup from an up(). down_interruptible() is different since we giveup while we are inside the waitqueue. The only detail to care is to undo the sem->count with the spinlock held (so other sleepers will look at the sem->count only when it will be uptodate and so other sleepers will be able to grab the semaphore if an up() happened while we was giving up in trylock). If an up() happened this way: task1 task2 task3 ------------------------------------- down_trylock() up() wakeup() sleeper wakenup try to grab can't grab due the underlaying trylock sleep again __down_trylock() got the semaphore we'll be fine too since __down_trylock will happen some time soon and then it will grab the semaphore instead of the task2. Here it is a patch against clean 2.3.15 to fix all the above issues and to implement wake-one-FIFO: Patch Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: new semaphores Date: 1999/08/29 Message-ID: <fa.o9a7fvv.hkc639@ifi.uio.no>#1/1 X-Deja-AN: 518422412 Original-Date: Sat, 28 Aug 1999 23:19:53 -0700 (PDT) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908282313280.2144-100000@penguin.transmeta.com> References: <fa.j3b6i0v.1650kb2@ifi.uio.no> To: Andrea Arcangeli <and...@suse.de> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sun, 29 Aug 1999, Andrea Arcangeli wrote: > > It seems to me that there's no need of wake-up the sleepers in the success > path of __down(). If you get the semaphore nobody else can do progresses. > You'll sure wakeup the sleepers some time soon in up(). Maybe. Be careful that you don't get this wrong, though: you should NOT think that semaphores are always mutexes, and there could be multiple concurrent "up()" calls with semaphore counts > 1 etc, and I'd rather be safe than sorry. With the rule that you always wake somebody up when the count becomes non-negative, you're at least safe. I'm not sure your new code is. > down_trylock seems buggy because it always returns 1 and it won't make > difference between the case that held the semaphore and the case that find > the semaphore busy. down_trylock has already _failed_. It failed before the call - the down() has failed, and down_trylock just has to correct the counts and then _unconditionally_ tell the world that the failure happened. That part of your patch is definitely bad. There's no way I'll ever add this: you increment the semaphore count without waking people up, which is just a sure way to set yourself up for nasty surprises. Don't try to be clever when the trylock has already proven to have failed once. Th ewhole _point_ of trylock is to try once and give up if it didn't succeed. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@suse.de> Subject: Re: new semaphores Date: 1999/08/29 Message-ID: <fa.hkka27v.5gu1g@ifi.uio.no> X-Deja-AN: 518521130 Original-Date: Sun, 29 Aug 1999 17:08:22 +0200 (CEST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908291538230.665-100000@laser.random> References: <fa.o9a7fvv.hkc639@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.random Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list X-Public-Key-URL: http://e-mind.com/~andrea/aa.asc MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sat, 28 Aug 1999, Linus Torvalds wrote: >Maybe. Be careful that you don't get this wrong, though: you should NOT >think that semaphores are always mutexes, and there could be multiple >concurrent "up()" calls with semaphore counts > 1 etc, and I'd rather be >safe than sorry. The fact that up() doesn't wakeup if the resulting count is > 0, yes, it seems that I can have only one task at once in the critical section even if there could be two tasks (anyway it doesn't seems to be a deadlock). IMHO it would be better to have fast mutex than being forced to use slow semaphores (since everybody uses semaphores as mutex). Does somebody knows a place that uses the semaphores initialized with a count > 1 ? I don't. >down_trylock has already _failed_. Supposing the seamphores are mutex you have two choices in down_trylock(): 1) uncoditionally undo the count-decrease while syncing the count with the sleepers and unconditionally wakeup the sleepers if somebody may get the semaphore now. 2) avoid all wakeups calls and try to grab the semaphore. _Only_ if we can't grab the semaphore then undo the count-decrease. No wakeup in any case in down_trylock (in the mutex case at least). But the interesting point is that my (2) is more efficent than your (1) that will end always running a not necessary wakeup in the common case (this mean overscheduling) just because it's not clever. >That part of your patch is definitely bad. There's no way I'll ever add >this: you increment the semaphore count without waking people up, which >is just a sure way to set yourself up for nasty surprises. Woops I seen a problem in my implementation: instead of failed = atomic_add_negative(sem->sleepers, &sem->count); sem->sleepers = 0; if (failed) /* we didn't got the semaphore so undo the count decrement. */ atomic_inc(&sem->count); it seems to me I should do: failed = atomic_add_negative(sem->sleepers, &sem->count); sem->sleepers = 0; if (failed) sem->sleepers = 1; That will make sure that any underlaying up() will wakeup one sleeper. To fix also the real-semaphore case I should do: failed = atomic_add_negative(sem->sleepers, &sem->count); sem->sleepers = 0; if (failed) sem->sleepers = 1; else wake_up(&sem->wait); In general trying to get the semaphore in __down_trylock as I do is not bad since it will avoid you a wakeup in the common case. See what happens to your __down_trylock in the 99% of cases: task1 task2 count sleepers ----- ----- ----- -------- down() 0 0 down_trylock() -1 0 sleepers = 0 -1 0 add_neg(1) 0 0 wake_up() <- not needed in 99.9% of cases The above example has not sleepers so it won't cause overscheduling but only some waste of CPU and some lock on the bus in wake_up. But the same wakeup will happens even if there was sleepers and that will end in plain _overscheduling_, see below: task1 task2 task3 count sleepers ----- ----- ----- ----- -------- down() 0 0 down() -1 0 sleepers++ -1 1 add_neg(0) -1 1 sleepers = 1 -1 1 go to sleep down_trylock() -2 1 sleepers = 0 -2 0 add_neg(2) 0 0 wake_up() <- not needed in 99.9% of cases wakenup add_neg(-1) -1 0 sleepers = 1 -1 1 go to sleep My patch avoid completly the overscheduling that you "always" cause in __down_trylock. I agree that it's really annoying to rename all struct semaphore to struct mutext to do the right thing and probably it's nicer to have real semaphores called `struct semaphore'. So now I propose you this new patch against 2.3.15. It only does the __down_trylock optimization and the wake-one thing. Comments? --- 2.3.15/arch/i386/kernel/semaphore.c Thu Aug 26 02:54:55 1999 +++ 2.3.15-sem/arch/i386/kernel/semaphore.c Sun Aug 29 17:05:32 1999 @@ -49,8 +49,8 @@ { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); - tsk->state = TASK_UNINTERRUPTIBLE; - add_wait_queue(&sem->wait, &wait); + tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE; + add_wait_queue_exclusive(&sem->wait, &wait); spin_lock_irq(&semaphore_lock); sem->sleepers++; @@ -63,28 +63,28 @@ */ if (!atomic_add_negative(sleepers - 1, &sem->count)) { sem->sleepers = 0; - wake_up(&sem->wait); break; } sem->sleepers = 1; /* us - see -1 above */ spin_unlock_irq(&semaphore_lock); schedule(); - tsk->state = TASK_UNINTERRUPTIBLE; + tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE; spin_lock_irq(&semaphore_lock); } spin_unlock_irq(&semaphore_lock); remove_wait_queue(&sem->wait, &wait); tsk->state = TASK_RUNNING; + wake_up(&sem->wait); } int __down_interruptible(struct semaphore * sem) { - int retval; + int retval = 0; struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); - tsk->state = TASK_INTERRUPTIBLE; - add_wait_queue(&sem->wait, &wait); + tsk->state = TASK_INTERRUPTIBLE|TASK_EXCLUSIVE; + add_wait_queue_exclusive(&sem->wait, &wait); spin_lock_irq(&semaphore_lock); sem->sleepers ++; @@ -98,12 +98,10 @@ * it has contention. Just correct the count * and exit. */ - retval = -EINTR; if (signal_pending(current)) { + retval = -EINTR; sem->sleepers = 0; - if (atomic_add_negative(sleepers, &sem->count)) - break; - wake_up(&sem->wait); + atomic_add(sleepers, &sem->count); break; } @@ -114,8 +112,6 @@ * the lock. */ if (!atomic_add_negative(sleepers - 1, &sem->count)) { - wake_up(&sem->wait); - retval = 0; sem->sleepers = 0; break; } @@ -123,12 +119,13 @@ spin_unlock_irq(&semaphore_lock); schedule(); - tsk->state = TASK_INTERRUPTIBLE; + tsk->state = TASK_INTERRUPTIBLE|TASK_EXCLUSIVE; spin_lock_irq(&semaphore_lock); } spin_unlock_irq(&semaphore_lock); tsk->state = TASK_RUNNING; remove_wait_queue(&sem->wait, &wait); + wake_up(&sem->wait); return retval; } @@ -142,17 +139,18 @@ */ int __down_trylock(struct semaphore * sem) { - int retval, sleepers; + int failed; spin_lock_irq(&semaphore_lock); - sleepers = sem->sleepers + 1; - sem->sleepers = 0; - /* - * Add "everybody else" and us into it. They aren't + * Add "everybody else" into it. They aren't * playing, because we own the spinlock. */ - if (!atomic_add_negative(sleepers, &sem->count)) + failed = atomic_add_negative(sem->sleepers, &sem->count); + sem->sleepers = 0; + if (failed) + sem->sleepers = 1; + else wake_up(&sem->wait); spin_unlock_irq(&semaphore_lock); Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: new semaphores Date: 1999/08/29 Message-ID: <fa.o8ptg7v.g4i6bb@ifi.uio.no>#1/1 X-Deja-AN: 518573317 Original-Date: Sun, 29 Aug 1999 11:27:09 -0700 (PDT) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.10.9908291113210.2557-100000@penguin.transmeta.com> References: <fa.hkka27v.5gu1g@ifi.uio.no> To: Andrea Arcangeli <and...@suse.de> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sun, 29 Aug 1999, Andrea Arcangeli wrote: > > The fact that up() doesn't wakeup if the resulting count is > 0, yes, it > seems that I can have only one task at once in the critical section even > if there could be two tasks (anyway it doesn't seems to be a deadlock). > > IMHO it would be better to have fast mutex than being forced to use slow > semaphores (since everybody uses semaphores as mutex). We =have= a fast MUTEX already. It's the semaphore code. All the code you're looking at is almost completely uninteresting for performance handling: by the time the code is invoced you've already had contention on the semaphore, and you just need to make sure that you get the right result. And even if that code ever becomes a bottle-neck, the right approach is NOT to make it all that much faster: the right approach is to make sure that the semaphore that gets tons of contention is fixed up. So I want the semaphores to be stable and simple, because the fast path has already been done somewhere else. I agree with your wake-one semantic issue, but I disagree with just about all of the other changes. > Does somebody knows a place that uses the semaphores initialized with a > count > 1 ? I don't. So? > Supposing the seamphores are mutex you have two choices in down_trylock(): You have one choice: fix things up. It already failed, there's no point in doing anything else. We tried to be clever before. There was absolutely no data that it was ever a win, and there were lots of indications that it was buggy. Let's not make that mistake again. Don't optimize code that doesn't need optimization. Btw, the case you optimize for is the case that is supposed to be _extremely_ rare even in the presense of contention. You optimize not just for the contention-case, you optimize for the specific case where the values are racing and changing on different CPU's at the same time. Do you _really_ think that it is worth it, considering that you make the semaphore behaviour more complex? I really don't. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/