From: Linus Torvalds <torva...@transmeta.com>
Subject: Linux-2.3.15..
Date: 1999/08/26
Message-ID: <fa.o9abevv.iks73d@ifi.uio.no>#1/1
X-Deja-AN: 517263303
Original-Date: Wed, 25 Aug 1999 16:36:10 -0700 (PDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908251619060.5207-100000@penguin.transmeta.com>
To: Kernel Mailing List <linux-ker...@vger.rutgers.edu>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

There's a rather huge patch-set out there now, taking the 2.3.x series to
2.3.15. 

This has a lot of the merge code I've been sent over the last two weeks,
but I will invariably have missed some, if for no other reason than simply
that I got absolutely _flooded_ by people sending me patches.

One of the more interesting things was the SMP pipe cleanup sent by
Richard, but try as I might it was never really stable under load on x86 -
not with the plain semaphores in 2.3.14, and not with the patches Andrea
had either. I assume Richard tested it on an alpha with the much more
well-thought-out atomic operation that the alpha provides.

I ended up rewriting the x86 semaphore code (and some of Richards pipe
code too, for that matter, to get rid of some races in waking things up),
and it doesn't show the problems I saw before, but hey, maybe I just
exchanged one set of problems for another set that I can't trigger any
more. Give me feedback, please.

Other features that don't impact everybody, but are rather major:
 - ATM support merged in
 - firewalling is gone (again), replaced by an even more generic netfilter
   facility.
 - general networking merges and updates
 - Various driver updates (ISDN, ISA PnP, sound, fbcon, usb, intelliport,
   you name it)
 - make system call return type "long" even if the system call only
   returns valid data in the lower order bits - we use the high bits for
   error handling, and some 64-bit architectures care (read: the Merced
   calling conventions want this because they don't automatically extend
   the return type - I bet it will be a new portability issue for other
   programs than just the kernel)

Have fun,

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux-2.3.15..

Andrea Arcangeli (andrea@suse.de)
Fri, 27 Aug 1999 18:54:26 +0200 (CEST)

On Wed, 25 Aug 1999, Linus Torvalds wrote:

>I ended up rewriting the x86 semaphore code (and some of Richards pipe
>code too, for that matter, to get rid of some races in waking things up),
>and it doesn't show the problems I saw before, but hey, maybe I just
>exchanged one set of problems for another set that I can't trigger any
>more. Give me feedback, please.

I guess the problem is the pipe code since I understood the old semaphores
completly and there weren't SMP races there.

Your new semaphores seems completly buggy to me and I am surprised your
kernel works without crash or corruption with them.

task1 task2 task3 -> effect -> count sleepers
----- ----- ----- ----- --------
1 0

------- task 0 does a down() ------------------ 0 0
------- here task 1,2,3 try to get the lock ---

down() -1 1
(I avoided the details here)
schedule()
down() -2 1
spin_lock()
sleepers++ -2 2
add_neg(1) -1??? 2
sleepers = 1 -1 1
schedule()
down()
spin_lock()
sleepers++ -1 2
add_neg(1) 0???????????
ret not negative!!
sleepers = 0 0 0
wakeup()
spin_unlock()
two task got the lock at the same time!!!!!

the above isn't a subtle SMP race and I think it can trigger easily in
real-world. So I would be suprised if 2.3.15 would be rock solid.

Maybe I am missing something in your semaphores but I can't see what.

Tomorrow I'll continue thinking about this issue and (if I am not dreaming
about the above ;) I'll fix the new semaphores (so the new global set_mb()
patch will be delayed a bit).

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: Linux-2.3.15..
Date: 1999/08/28
Message-ID: <fa.o99nfvv.ikg632@ifi.uio.no>#1/1
X-Deja-AN: 518049611
Original-Date: Fri, 27 Aug 1999 10:45:46 -0700 (PDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908271040440.1013-100000@penguin.transmeta.com>
References: <fa.j2qsg0v.17limb5@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 27 Aug 1999, Andrea Arcangeli wrote:
> 
> I guess the problem is the pipe code since I understood the old semaphores
> completly and there weren't SMP races there.

Well, I certainly saw strange behaviour. The trylock code seemed to be the
prime culprit - it tried to decrement the "waking" count, but it could end
up doing it too late so that people had already seen a increment from a
concurrent "up()".

I'm not saying the new code is bug-free, but it works for me where the old
one did not - and your claim that it is obviously broken is also obviously
wrong, see later..

> Your new semaphores seems completly buggy to me and I am surprised your
> kernel works without crash or corruption with them.

Well, you're not counting right..

> task1	task2		task3	   -> effect ->	count		sleepers
> -----	-----		-----			-----		--------
> 						1		0
> 
> ------- task 0 does a down() ------------------	0		0
> ------- here task 1,2,3 try to get the lock ---
> 
> down()						-1		1
> (I avoided the details here)
> schedule()
> 	down()					-2		1
> 	spin_lock()
> 	sleepers++				-2		2
> 	add_neg(1)				-1???		2
> 	sleepers = 1				-1		1
> 	schedule()
> 			down()
> 			spin_lock()
> 			sleepers++		-1		2

Wrong counting. The "down()" decremented count, so you have

						-2		2

which is exactly right, and explains why there will _not_ be two users in
the critical region at the same time.

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@suse.de>
Subject: new semaphores
Date: 1999/08/28
Message-ID: <fa.j3b6i0v.1650kb2@ifi.uio.no>
X-Deja-AN: 518333748
Original-Date: Sun, 29 Aug 1999 00:53:11 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908282233070.4789-100000@laser.random>
References: <fa.o99nfvv.ikg632@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
X-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

It seems to me that there's no need of wake-up the sleepers in the success
path of __down(). If you get the semaphore nobody else can do progresses.
You'll sure wakeup the sleepers some time soon in up().

Then I changed the kind of the semaphores from wake-all to wake-one-FIFO.
To do that I must always try to wakeup one sleeper if I giveup in
down_interruptible() since I may got the wakeup from an up() while I was
giving up while I was still in the waitqueue. So I must make sure to pass
the wake-one wakeup from me to another sleeper before giving up.

down_trylock seems buggy because it always returns 1 and it won't make
difference between the case that held the semaphore and the case that find
the semaphore busy. In trylock we must not care about the waitqueue at all
since we don't add ourself in the waitqueue there. trylock can't be the
cause of a missed wakeup from an up(). down_interruptible() is different
since we giveup while we are inside the waitqueue. The only detail to care
is to undo the sem->count with the spinlock held (so other sleepers will
look at the sem->count only when it will be uptodate and so other sleepers
will be able to grab the semaphore if an up() happened while we was giving
up in trylock). If an up() happened this way:

	task1	task2			task3
	-------------------------------------
					down_trylock()
	up()
	wakeup()
		sleeper wakenup
		try to grab
		can't grab due the
		underlaying trylock
		sleep again
					__down_trylock()
					got the semaphore

we'll be fine too since __down_trylock will happen some time soon and then
it will grab the semaphore instead of the task2.

Here it is a patch against clean 2.3.15 to fix all the above issues and to
implement wake-one-FIFO:

Patch

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: new semaphores
Date: 1999/08/29
Message-ID: <fa.o9a7fvv.hkc639@ifi.uio.no>#1/1
X-Deja-AN: 518422412
Original-Date: Sat, 28 Aug 1999 23:19:53 -0700 (PDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908282313280.2144-100000@penguin.transmeta.com>
References: <fa.j3b6i0v.1650kb2@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 29 Aug 1999, Andrea Arcangeli wrote:
>
> It seems to me that there's no need of wake-up the sleepers in the success
> path of __down(). If you get the semaphore nobody else can do progresses.
> You'll sure wakeup the sleepers some time soon in up().

Maybe. Be careful that you don't get this wrong, though: you should NOT
think that semaphores are always mutexes, and there could be multiple
concurrent "up()" calls with semaphore counts > 1 etc, and I'd rather be
safe than sorry.

With the rule that you always wake somebody up when the count becomes
non-negative, you're at least safe. I'm not sure your new code is.

> down_trylock seems buggy because it always returns 1 and it won't make
> difference between the case that held the semaphore and the case that find
> the semaphore busy.

down_trylock has already _failed_.

It failed before the call - the down() has failed, and down_trylock just
has to correct the counts and then _unconditionally_ tell the world that
the failure happened.

That part of your patch is definitely bad. There's no way I'll ever add
this: you increment the semaphore count without waking people up, which is
just a sure way to set yourself up for nasty surprises.

Don't try to be clever when the trylock has already proven to have failed
once. Th ewhole _point_ of trylock is to try once and give up if it didn't
succeed.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@suse.de>
Subject: Re: new semaphores
Date: 1999/08/29
Message-ID: <fa.hkka27v.5gu1g@ifi.uio.no>
X-Deja-AN: 518521130
Original-Date: Sun, 29 Aug 1999 17:08:22 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908291538230.665-100000@laser.random>
References: <fa.o9a7fvv.hkc639@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
X-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sat, 28 Aug 1999, Linus Torvalds wrote:

>Maybe. Be careful that you don't get this wrong, though: you should NOT
>think that semaphores are always mutexes, and there could be multiple
>concurrent "up()" calls with semaphore counts > 1 etc, and I'd rather be
>safe than sorry.

The fact that up() doesn't wakeup if the resulting count is > 0, yes, it
seems that I can have only one task at once in the critical section even
if there could be two tasks (anyway it doesn't seems to be a deadlock).

IMHO it would be better to have fast mutex than being forced to use slow
semaphores (since everybody uses semaphores as mutex).

Does somebody knows a place that uses the semaphores initialized with a
count > 1 ? I don't.

>down_trylock has already _failed_.

Supposing the seamphores are mutex you have two choices in down_trylock():

1) uncoditionally undo the count-decrease while syncing the count with the
   sleepers and unconditionally wakeup the sleepers if somebody may get
   the semaphore now.
2) avoid all wakeups calls and try to grab the semaphore. _Only_ if we
   can't grab the semaphore then undo the count-decrease. No wakeup in any 
   case in down_trylock (in the mutex case at least).

But the interesting point is that my (2) is more efficent than your (1)
that will end always running a not necessary wakeup in the common case
(this mean overscheduling) just because it's not clever.

>That part of your patch is definitely bad. There's no way I'll ever add
>this: you increment the semaphore count without waking people up, which
>is just a sure way to set yourself up for nasty surprises.

Woops I seen a problem in my implementation: instead of

	failed = atomic_add_negative(sem->sleepers, &sem->count);
	sem->sleepers = 0;
	if (failed)
		/* we didn't got the semaphore so undo the count
		   decrement. */
		atomic_inc(&sem->count);

it seems to me I should do:

        failed = atomic_add_negative(sem->sleepers, &sem->count);
        sem->sleepers = 0;
        if (failed)
	        sem->sleepers = 1;

That will make sure that any underlaying up() will wakeup one sleeper.

To fix also the real-semaphore case I should do:

        failed = atomic_add_negative(sem->sleepers, &sem->count);
        sem->sleepers = 0;
        if (failed)
                sem->sleepers = 1;
	else
		wake_up(&sem->wait);

In general trying to get the semaphore in __down_trylock as I do is not
bad since it will avoid you a wakeup in the common case. See what happens
to your __down_trylock in the 99% of cases:

task1	task2		count	sleepers
-----	-----		-----	--------
down()			0	0
	down_trylock()	-1	0
	sleepers = 0	-1	0
	add_neg(1)	0	0
	wake_up() <- not needed in 99.9% of cases

The above example has not sleepers so it won't cause overscheduling but
only some waste of CPU and some lock on the bus in wake_up. But the same
wakeup will happens even if there was sleepers and that will end in
plain _overscheduling_, see below:

task1	task2	task3		count	sleepers
-----	-----	-----		-----	--------
down()				0	0

	down()			-1	0
	sleepers++		-1	1
	add_neg(0)		-1	1
	sleepers = 1		-1	1
	go to sleep

		down_trylock()	-2	1
		sleepers = 0	-2	0
		add_neg(2)	0	0
		wake_up() <- not needed in 99.9% of cases

	wakenup
	add_neg(-1)		-1	0
	sleepers = 1		-1	1
	go to sleep

My patch avoid completly the overscheduling that you "always" cause in
__down_trylock.

I agree that it's really annoying to rename all struct semaphore to struct
mutext to do the right thing and probably it's nicer to have real
semaphores called `struct semaphore'. So now I propose you this new patch
against 2.3.15. It only does the __down_trylock optimization and the
wake-one thing.

Comments?

--- 2.3.15/arch/i386/kernel/semaphore.c	Thu Aug 26 02:54:55 1999
+++ 2.3.15-sem/arch/i386/kernel/semaphore.c	Sun Aug 29 17:05:32 1999
@@ -49,8 +49,8 @@
 {
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
-	tsk->state = TASK_UNINTERRUPTIBLE;
-	add_wait_queue(&sem->wait, &wait);
+	tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
+	add_wait_queue_exclusive(&sem->wait, &wait);
 
 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers++;
@@ -63,28 +63,28 @@
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
 			sem->sleepers = 0;
-			wake_up(&sem->wait);
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
 
 		schedule();
-		tsk->state = TASK_UNINTERRUPTIBLE;
+		tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
 		spin_lock_irq(&semaphore_lock);
 	}
 	spin_unlock_irq(&semaphore_lock);
 	remove_wait_queue(&sem->wait, &wait);
 	tsk->state = TASK_RUNNING;
+	wake_up(&sem->wait);
 }
 
 int __down_interruptible(struct semaphore * sem)
 {
-	int retval;
+	int retval = 0;
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
-	tsk->state = TASK_INTERRUPTIBLE;
-	add_wait_queue(&sem->wait, &wait);
+	tsk->state = TASK_INTERRUPTIBLE|TASK_EXCLUSIVE;
+	add_wait_queue_exclusive(&sem->wait, &wait);
 
 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers ++;
@@ -98,12 +98,10 @@
 		 * it has contention. Just correct the count
 		 * and exit.
 		 */
-		retval = -EINTR;
 		if (signal_pending(current)) {
+			retval = -EINTR;
 			sem->sleepers = 0;
-			if (atomic_add_negative(sleepers, &sem->count))
-				break;
-			wake_up(&sem->wait);
+			atomic_add(sleepers, &sem->count);
 			break;
 		}
 
@@ -114,8 +112,6 @@
 		 * the lock.
 		 */
 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
-			wake_up(&sem->wait);
-			retval = 0;
 			sem->sleepers = 0;
 			break;
 		}
@@ -123,12 +119,13 @@
 		spin_unlock_irq(&semaphore_lock);
 
 		schedule();
-		tsk->state = TASK_INTERRUPTIBLE;
+		tsk->state = TASK_INTERRUPTIBLE|TASK_EXCLUSIVE;
 		spin_lock_irq(&semaphore_lock);
 	}
 	spin_unlock_irq(&semaphore_lock);
 	tsk->state = TASK_RUNNING;
 	remove_wait_queue(&sem->wait, &wait);
+	wake_up(&sem->wait);
 	return retval;
 }
 
@@ -142,17 +139,18 @@
  */
 int __down_trylock(struct semaphore * sem)
 {
-	int retval, sleepers;
+	int failed;
 
 	spin_lock_irq(&semaphore_lock);
-	sleepers = sem->sleepers + 1;
-	sem->sleepers = 0;
-
 	/*
-	 * Add "everybody else" and us into it. They aren't
+	 * Add "everybody else" into it. They aren't
 	 * playing, because we own the spinlock.
 	 */
-	if (!atomic_add_negative(sleepers, &sem->count))
+	failed = atomic_add_negative(sem->sleepers, &sem->count);
+	sem->sleepers = 0;
+	if (failed)
+		sem->sleepers = 1;
+	else
 		wake_up(&sem->wait);
 
 	spin_unlock_irq(&semaphore_lock);

Andrea


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: new semaphores
Date: 1999/08/29
Message-ID: <fa.o8ptg7v.g4i6bb@ifi.uio.no>#1/1
X-Deja-AN: 518573317
Original-Date: Sun, 29 Aug 1999 11:27:09 -0700 (PDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9908291113210.2557-100000@penguin.transmeta.com>
References: <fa.hkka27v.5gu1g@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 29 Aug 1999, Andrea Arcangeli wrote:
> 
> The fact that up() doesn't wakeup if the resulting count is > 0, yes, it
> seems that I can have only one task at once in the critical section even
> if there could be two tasks (anyway it doesn't seems to be a deadlock).
> 
> IMHO it would be better to have fast mutex than being forced to use slow
> semaphores (since everybody uses semaphores as mutex).

We =have= a fast MUTEX already. It's the semaphore code.

All the code you're looking at is almost completely uninteresting for
performance handling: by the time the code is invoced you've already had
contention on the semaphore, and you just need to make sure that you get
the right result. 

And even if that code ever becomes a bottle-neck, the right approach is
NOT to make it all that much faster: the right approach is to make sure
that the semaphore that gets tons of contention is fixed up. 

So I want the semaphores to be stable and simple, because the fast path
has already been done somewhere else.

I agree with your wake-one semantic issue, but I disagree with just about
all of the other changes.

> Does somebody knows a place that uses the semaphores initialized with a
> count > 1 ? I don't.

So?

> Supposing the seamphores are mutex you have two choices in down_trylock():

You have one choice: fix things up. It already failed, there's no point in
doing anything else. 

We tried to be clever before. There was absolutely no data that it was
ever a win, and there were lots of indications that it was buggy. Let's
not make that mistake again. Don't optimize code that doesn't need
optimization.

Btw, the case you optimize for is the case that is supposed to be
_extremely_ rare even in the presense of contention. You optimize not just
for the contention-case, you optimize for the specific case where the
values are racing and changing on different CPU's at the same time. Do you
_really_ think that it is worth it, considering that you make the
semaphore behaviour more complex?

I really don't.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/