Overscheduling DOES happen with high web server load

From: Phillip Ezolt <ez...@perf.zko.dec.com>
Subject: Overscheduling DOES happen with high web server load.
Date: 1999/05/05
Message-ID: <fa.lrbmocv.13gphs@ifi.uio.no>
X-Deja-AN: 474518895
Original-Date: Wed, 5 May 1999 14:54:40 -0400 (EDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.OSF.3.96.990505144826.13001A-100000@perf.zko.dec.com>
To: linux-ker...@vger.rutgers.edu
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi all,

In doing some performance work with SPECWeb96 on Alpha/Linux with Apache,
it looks like "schedule" is the main bottleneck. 

(Kernel v2.2.5, Apache 1.3.4, egcs-1.1.1, iprobe-4.1)

When running a SPECWeb96 strobe run on Alpha/linux, I found that when the
CPU is pegged, 18% of the time is spent in the scheduler.

Using Iprobe, I got the following function breakdown: (only functions >1%
are shown)

Begin            End                                    Sample Image Total
Address          Address          Name                   Count   Pct   Pct
-------          -------          ----                   -----   ---   ---
0000000000000000-00000000000029FC /usr/bin/httpd        127463        18.5 
00000001200419A0-000000012004339F   ap_vformatter        15061  11.8   2.2 
FFFFFC0000300000-00000000FFFFFFFF vmlinux               482385        70.1 
FFFFFC00003103E0-FFFFFC000031045F   entInt                7848   1.6   1.1
FFFFFC0000315E40-FFFFFC0000315F7F   do_entInt            48487  10.1   7.0
FFFFFC0000327A40-FFFFFC0000327D7F   schedule            124815  25.9  18.1
FFFFFC000033FAA0-FFFFFC000033FCDF   kfree                 7876   1.6   1.1
FFFFFC00003A9960-FFFFFC00003A9EBF   ip_queue_xmit         8616   1.8   1.3
FFFFFC00003B9440-FFFFFC00003B983F   tcp_v4_rcv           11131   2.3   1.6
FFFFFC0000441CA0-FFFFFC000044207F   do_csum_partial      43112   8.9   6.3 
                                    _copy_from_user 

I can't pin it down to the exact source line, but the cycles are spent in
close proximity of one another. 

FFFFFC0000327A40 schedule vmlinux
FFFFFC0000327C1C   01DC      2160 (  1.7) *
FFFFFC0000327C34   01F4     28515 ( 22.8) **********************
FFFFFC0000327C60   0220      1547 (  1.2) *
FFFFFC0000327C64   0224     26432 ( 21.2) *********************
FFFFFC0000327C74   0234     36470 ( 29.2) *****************************
FFFFFC0000327C9C   025C     24858 ( 19.9) *******************       

(For those interested, I have the disassembled code. )

Apache has a fairly even cycle distribution, but in the kernel, 'schedule' 
really sticks out as the CPU burner. 

I think that the linear search for next runnable process is where time is
being spent. 

As an independent test, I ran vmstat while SPECWeb was running.

The leftmost column is the number of processes waiting to run.  These numbers
are above the 3 or 4 that are normally quoted. 

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 0 21 0   208  5968  5240 165712   0   0 4001  303 10263 6519  31  66   4
26 27 1   208  6056  5240 165848   0   0 2984   96 5623 3440  29  60  11
 0 15 0   208  5096  5288 166384   0   0 4543  260 10850 7346  32  66   3
 0 17 0   208  6928  5248 164936   0   0 5741  309 13129 8052  32  65   3
37 19 1   208  5664  5248 166144   0   0 2502  142 6837 3896  33  63   5
 0 14 0   208  5984  5240 165656   0   0 3894  376 12432 7276  32  65   3
 0 19 1   208  4872  5272 166248   0   0 2247  124 7641 4514  32  64   4
 0 17 0   208  5248  5264 166336   0   0 4229  288 8786 5144  31  67   2
56 16 1   208  6512  5248 165592   0   0 2159  205 8098 4641  32  62   6
94 18 1   208  5920  5248 165896   0   0 1745  191 5288 2952  32  60   7
71 14 1   208  5920  5256 165872   0   0 2063  160 6493 3729  30  62   8
 0 25 1   208  5032  5256 166544   0   0 3008  112 5668 3612  31  60   9
62 22 1   208  5496  5256 165560   0   0 2512  109 5661 3392  28  62  11
43 22 1   208  4536  5272 166112   0   0 3003  202 7198 4813  30  63   7
 0 26 1   208  4800  5288 166256   0   0 2407   93 5666 3563  29  60  11
32 17 1   208  5984  5296 165632   0   0 2046  329 7296 4305  31  62   6
23  7 1   208  6744  5248 164904   0   0 1739  284 9496 5923  33  65   2
14 18 1   208  5128  5272 166416   0   0 3755  322 9663 6203  32  65   3
 0 22 1   208  4256  5304 167288   0   0 2593  156 5678 3219  31  60   9
44 20 1   208  3688  5264 167184   0   0 3010  149 7277 4398  31  62   7
29 24 1   208  5232  5264 166248   0   0 1954  104 5687 3496  31  61   9
26 23 1   208  5688  5256 165568   0   0 3029  169 7124 4473  30  60  10
 0 18 1   208  5576  5256 165656   0   0 3395  270 8464 5702  30  63   7      

It looks like the run queue is much longer than expected. 

I imagine this problem is compounded by the number of times "schedule" is
called. 

On a webserver that does not have all of the web pages in memory, an httpd
process's life is the following:

1. Wake up for a request from the network.
2. Figure out what web page to load.
3. Ask the disk for it.
4. Sleep (Schedule()) until the page is ready.

This means that schedule will be called a lot. In addition, a process will wake 
and sleep in a time much shorter than its allotted time slice. 

Each time we schedule, we have to walk through the entire run queue. This will
cause fewer requests to be serviced, which will leave more processes stuck
on the run queue, which will make the walk down the run queue even longer...
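
To make the cost of that walk concrete, here is a minimal sketch of a linear
best-task search. It is not the 2.2 kernel source: struct task, goodness_of()
and pick_next() are hypothetical names, and the weight is simplified to
"remaining ticks plus static priority".

```c
#include <stddef.h>

/* Hypothetical, simplified task record; not the kernel's task_struct. */
struct task {
    int counter;   /* ticks left in the current time slice */
    int priority;  /* static priority */
};

/* Simplified weight: a task with no ticks left is never preferred. */
static int goodness_of(const struct task *t)
{
    if (t->counter == 0)
        return 0;
    return t->counter + t->priority;
}

/*
 * Walk the whole run queue and return the index of the best task,
 * or -1 if the queue is empty. *visited reports how many entries were
 * examined; it always equals n, so the cost grows with queue length.
 */
static int pick_next(const struct task *rq, size_t n, size_t *visited)
{
    int best = -1;
    int best_weight = -1;
    size_t i;

    *visited = 0;
    for (i = 0; i < n; i++) {
        int w = goodness_of(&rq[i]);

        (*visited)++;
        if (w > best_weight) {
            best_weight = w;
            best = (int)i;
        }
    }
    return best;
}
```

With a run queue of 94 entries, as in the worst vmstat sample above, every
call to pick_next() touches all 94 of them, even if the caller is about to
sleep again almost immediately.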

Bottom line, under a heavy web load, the linux kernel seems to spend an
unnecessary amount of time scheduling processes.

Is it necessary to calculate the goodness of every process at every schedule?
Can't we make the goodnesses static?  Monkeying with the scheduler is big 
business, and I realize that this will not be a v2.2 issue, but what about 
v2.3? 

--Phil

Digital/Compaq:                     HPSD/Benchmark Performance Engineering
Phillip.Ez...@compaq.com                            ez...@perf.zko.dec.com

ps. <shameless plug> For those interested in more detail there will be a
WIP paper describing this work presented at Linux Expo. </shameless plug>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Richard Gooch <rgo...@atnf.csiro.au>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/06
Message-ID: <fa.i397bcv.k5i7ak@ifi.uio.no>#1/1
X-Deja-AN: 474602020
Original-Date: Thu, 6 May 1999 10:42:18 +1000
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <199905060042.KAA19778@vindaloo.atnf.CSIRO.AU>
References: <fa.lrbmocv.13gphs@ifi.uio.no>
To: Phillip Ezolt <ez...@perf.zko.dec.com>
Original-References: <Pine.OSF.3.96.990505144826.13001A-100...@perf.zko.dec.com>
Notfrom: spam...@atnf.csiro.au
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Phillip Ezolt writes:
> Hi all,
> 
> In doing some performance work with SPECWeb96 on Alpha/Linux with Apache,
> it looks like "schedule" is the main bottleneck. 
> 
> (Kernel v2.2.5, Apache 1.3.4, egcs-1.1.1, iprobe-4.1)
> 
> When running a SPECWeb96 strobe run on Alpha/linux, I found that when the
> CPU is pegged, 18% of the time is spent in the scheduler.

This is some very interesting data!

> Using Iprobe, I got the following function breakdown: (only functions >1%
> are shown)
> 
> Begin            End                                    Sample Image Total
> Address          Address          Name                   Count   Pct   Pct
> -------          -------          ----                   -----   ---   ---
> 0000000000000000-00000000000029FC /usr/bin/httpd        127463        18.5 
> 00000001200419A0-000000012004339F   ap_vformatter        15061  11.8   2.2 
> FFFFFC0000300000-00000000FFFFFFFF vmlinux               482385        70.1 
> FFFFFC00003103E0-FFFFFC000031045F   entInt                7848   1.6   1.1
> FFFFFC0000315E40-FFFFFC0000315F7F   do_entInt            48487  10.1   7.0
> FFFFFC0000327A40-FFFFFC0000327D7F   schedule            124815  25.9  18.1
> FFFFFC000033FAA0-FFFFFC000033FCDF   kfree                 7876   1.6   1.1
> FFFFFC00003A9960-FFFFFC00003A9EBF   ip_queue_xmit         8616   1.8   1.3
> FFFFFC00003B9440-FFFFFC00003B983F   tcp_v4_rcv           11131   2.3   1.6
> FFFFFC0000441CA0-FFFFFC000044207F   do_csum_partial      43112   8.9   6.3 
>                                     _copy_from_user 

Why don't we see the time taken by the goodness() function?

> Apache has a fairly even cycle distribution, but in the kernel, 'schedule' 
> really sticks out as the CPU burner. 
> 
> I think that the linear search for next runnable process is where time is
> being spent. 

Could well be, especially if the context switches are happening
between threads rather than separate processes. Thread switches are
*really* fast under Linux.

> As an independent test, I ran vmstat while SPECWeb was running.
> 
> The leftmost column is the number of processes waiting to run.  These numbers
> are above the 3 or 4 that are normally quoted. 
> 
>  procs                  memory    swap        io    system         cpu
>  r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
[...]
> 94 18 1   208  5920  5248 165896   0   0 1745  191 5288 2952  32  60   7
> 
> It looks like the run queue is much longer than expected. 

Indeed. As a separate question, we may wonder why so many
processes/threads are being used, and whether that number could/should
be reduced. Perhaps the server is doing something silly. But that's an
aside. Instead, I'd like to explore ways of reducing the (already low)
scheduler overhead.

In September last year I wrote a patch which put RT processes on a
separate run queue. While you don't have RT processes (I expect), one
of the benefits of this patch is that it cleans up some of the
scheduler code. Specifically, the goodness() function has some
special-casing for RT processes removed. For a short run queue, the
improvement is pretty marginal. However, for a long run queue, the
improvement may be significant. So I'd ask you to redo your tests
again with this patch applied. I've ported the patch to 2.2.7. It's
untested in 2.2.7, but it worked fine in 2.1.x.

See: http://www.atnf.csiro.au/~rgooch/linux/kernel-patches.html

				Regards,

					Richard....


From: Rik van Riel <r...@nl.linux.org>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/06
Message-ID: <fa.n4qmvjv.b400bh@ifi.uio.no>#1/1
X-Deja-AN: 474660725
Original-Date: Thu, 6 May 1999 09:32:39 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9905060929130.219-100000@mirkwood.dummy.home>
References: <fa.i397bcv.k5i7ak@ifi.uio.no>
To: Richard Gooch <rgo...@atnf.csiro.au>
X-Sender: r...@mirkwood.dummy.home
X-Authentication-Warning: mirkwood.nl.linux.org: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://humbolt.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-My-Own-Server: http://www.nl.linux.org/
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 6 May 1999, Richard Gooch wrote:
> Phillip Ezolt writes:

> > (Kernel v2.2.5, Apache 1.3.4, egcs-1.1.1, iprobe-4.1)
> > 
> > When running a SPECWeb96 strobe run on Alpha/linux, I found that when the
> > CPU is pegged, 18% of the time is spent in the scheduler.
> 
> This is some very interesting data!

Indeed.

> > It looks like the run queue is much longer than expected. 
> 
> Instead, I'd like to explore ways of reducing the (already low)
> scheduler overhead.
> 
> In September last year I wrote a patch which put RT processes on a
> separate run queue.

I'll start working on my scheduler bigpatch again. There
must be more things that can be incorporated in order to
improve the efficiency and performance of the Linux
scheduler...

regards,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


From: j...@pa.dec.com (Jim Gettys)
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/06
Message-ID: <fa.gsrd1gv.lkcv2h@ifi.uio.no>#1/1
X-Deja-AN: 474791689
Original-Date: Thu, 6 May 1999 06:30:31 -0700
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-Id: <9905061330.AA24536@pachyderm.pa.dec.com>
References: <fa.lrcap4v.34p9h@ifi.uio.no>
To: Phillip Ezolt <ez...@perf.zko.dec.com>
Content-Type: text/plain
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

While this discussion continues, I'd like to remind people of
the work of the Linux Scalability Project, at 
http://www.citi.umich.edu/projects/citi-netscape/

Some of these patches have already been applied to current kernels; but 
some have not, and there are some issues described there for which there 
are yet no solutions, including the poll() work there....
			- Jim

--
Jim Gettys
Industry Standards and Consortia
Compaq Computer Corporation
Visiting Scientist, World Wide Web Consortium, M.I.T.
http://www.w3.org/People/Gettys/
j...@w3.org, j...@pa.dec.com


From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/06
Message-ID: <fa.j5b0g0v.175qmbb@ifi.uio.no>#1/1
X-Deja-AN: 474816447
Original-Date: Thu, 6 May 1999 17:39:35 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.05.9905061734270.2228-100000@laser.random>
References: <fa.lrbmocv.13gphs@ifi.uio.no>
To: Phillip Ezolt <ez...@perf.zko.dec.com>
X-Sender: and...@laser.random
X-Authentication-Warning: laser.random: andrea owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
X-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 5 May 1999, Phillip Ezolt wrote:

>Hi all,
>
>In doing some performance work with SPECWeb96 on Alpha/Linux with Apache,
>it looks like "schedule" is the main bottleneck. 

Alt. Alpha uses HZ at 1024 so you get a scheduling rate by default 10
times higher than in all other archs.

>Is it necessary to calculate the goodness of every process at every schedule?

I think what we should do is to reschedule _only_ if a different process
will be scheduled. What I consider overscheduling is to schedule() and then
not switch to another task. Could you apply the patch below and run vmstat
1 again and see the rate of the overscheduling?

Index: kernel//sched.c
===================================================================
RCS file: /var/cvs/linux/kernel/sched.c,v
retrieving revision 1.1.2.33
diff -u -r1.1.2.33 sched.c
--- sched.c	1999/05/05 11:26:37	1.1.2.33
+++ sched.c	1999/05/06 15:37:16
@@ -764,12 +764,12 @@
 #ifdef __SMP__
 		sched_data->prev = prev;
 #endif
-	 	kstat.context_swtch++;
 		get_mmu_context(next);
 		switch_to(prev,next);

 		__schedule_tail();
 	}
+	 	kstat.context_swtch++;

 	reacquire_kernel_lock(current);
 	return;

If the rate is low there is no overscheduling and you should simply
decrease HZ to spend less time in the scheduler.
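
A minimal sketch of the measurement this enables, assuming the patch moves
the counter out of the "prev != next" branch so that it counts every pass
through schedule(). The struct sched_stats and both function names are
hypothetical; the patch itself reuses kstat.context_swtch and keeps no
separate switch count.

```c
/* Hypothetical counters, kept separately here for illustration only. */
struct sched_stats {
    unsigned long calls;     /* every pass through schedule() */
    unsigned long switches;  /* passes that picked a different task */
};

/* Record one schedule() call given the outgoing and incoming task ids. */
static void account_schedule(struct sched_stats *s, int prev_id, int next_id)
{
    s->calls++;
    if (prev_id != next_id)
        s->switches++;
}

/* Percentage of schedule() calls that did not change the running task. */
static unsigned long overschedule_pct(const struct sched_stats *s)
{
    if (s->calls == 0)
        return 0;
    return 100 * (s->calls - s->switches) / s->calls;
}
```

In practice the two numbers would come from two vmstat runs, one on the
unpatched kernel (real switches) and one on the patched kernel (all calls),
under the same load.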

Andrea Arcangeli


From: Kurt Garloff <garl...@suse.de>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/07
Message-ID: <fa.ed3pouv.ohimj6@ifi.uio.no>#1/1
X-Deja-AN: 475019424
X-Operating-System: Linux 2.2.5-ac3_devfs i686
Original-Date: Thu, 6 May 1999 21:02:04 +0200
Sender: owner-linux-ker...@vger.rutgers.edu
References: <fa.j5b0g0v.175qmbb@ifi.uio.no>
Mime-Version: 1.0
X-PGP-Version: 2.6.3i
Original-Message-ID: <19990506210204.B5468@kurt.kdt.de>
X-PGP-Fingerprint: 92 00 AC 56 59 50 13 83  3C 18 6F 1B 25 A0 3A 5F
X-PGP-Key: 1024/CEFC9215
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <Pine.OSF.3.96.990505144826.13001A-100...@perf.zko.dec.com> 
<Pine.LNX.4.05.9905061734270.2228-100...@laser.random>
X-PGP-Info: on http://www.garloff.de/kurt/pgp.public.key.kurt.home.asc
Content-Type: multipart/signed; boundary="2B/JsCI69OhZNC5r"; 
micalg=pgp-md5; protocol="application/pgp-signature"
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mail-Followup-To: Andrea Arcangeli <and...@e-mind.com>, Phillip Ezolt 
<ez...@perf.zko.dec.com>, linux-ker...@vger.rutgers.edu, j...@pa.dec.com, 
greg.ta...@digital.com
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Thu, May 06, 1999 at 05:39:35PM +0200, Andrea Arcangeli wrote:
> On Wed, 5 May 1999, Phillip Ezolt wrote:
> >
> >In doing some performance work with SPECWeb96 on Alpha/Linux with Apache,
> >it looks like "schedule" is the main bottleneck.
> 
> Alt. Alpha uses HZ at 1024 so you get a scheduling rate by default 10
> times higher than in all other archs.

From what I understand, the scheduling is not caused by the timer but by
blocking and woken processes.

-- 
Kurt Garloff  <garl...@suse.de>           SuSE GmbH, Nürnberg, FRG
Linux kernel development;    SCSI driver: DC390 (tmscsim/AM53C974)



From: Rik van Riel <r...@nl.linux.org>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/07
Message-ID: <fa.n7aov3v.9km0rq@ifi.uio.no>#1/1
X-Deja-AN: 475202853
Original-Date: Fri, 7 May 1999 17:15:25 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9905071703560.219-100000@mirkwood.dummy.home>
References: <fa.ed3pouv.ohimj6@ifi.uio.no>
To: Kurt Garloff <garl...@suse.de>
X-Sender: r...@mirkwood.dummy.home
X-Authentication-Warning: mirkwood.nl.linux.org: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://humbolt.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-My-Own-Server: http://www.nl.linux.org/
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 6 May 1999, Kurt Garloff wrote:

> From what I understand, the scheduling is not caused by the timer
> but by blocking and woken processes.

Not only that, the fact that we're using a traditional O(n)
run queue is somewhat of a performance bottleneck.

I'm currently looking into the scheduler used by Alliance
(www.allos.org) and it looks very promising. If I have any
time left, I'll port it to Linux.

It's based on the idea of a priority heap. Processes have
a quantum and a defer value. The quantum value is the length
of the process' time slice and the defer value is the time
at which it will be run next (sort of a deadline).

The process at the top of the heap is the process with the
lowest (run next) defer value. After a process finishes its
time slice, the defer value is incremented (say, by p->priority)
and the process is sorted down the B-tree. This means that the
algorithm for selecting the best process is O(1) on UP machines
and O(log(nr_cpus)) on SMP machines.

Schedule_timeout() can be replaced by a simple resorting of
the heap. Basically we only need to remove processes from
the heap if they do blocking I/O or wait for a signal
(otherwise they would clutter up the heap).
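
A minimal sketch of the quantum/defer idea as a binary min-heap keyed on the
defer value. This is not the Alliance code: every name here (qd_task,
qd_heap, qd_insert, qd_peek, qd_expire) is hypothetical, the heap is a
fixed-size array, and the "step" added on expiry stands in for the
p->priority increment mentioned above.

```c
#include <stddef.h>

#define QD_HEAP_MAX 64

/* Hypothetical task record for the quantum/defer model. */
struct qd_task {
    int id;
    unsigned long quantum; /* length of the time slice */
    unsigned long defer;   /* time at which the task should run next */
};

struct qd_heap {
    struct qd_task *slot[QD_HEAP_MAX];
    size_t n;
};

static void qd_swap(struct qd_heap *h, size_t a, size_t b)
{
    struct qd_task *tmp = h->slot[a];

    h->slot[a] = h->slot[b];
    h->slot[b] = tmp;
}

/* Move slot i towards the root until its parent has a smaller defer. */
static void qd_sift_up(struct qd_heap *h, size_t i)
{
    while (i > 0) {
        size_t p = (i - 1) / 2;

        if (h->slot[p]->defer <= h->slot[i]->defer)
            break;
        qd_swap(h, p, i);
        i = p;
    }
}

/* Move slot i away from the root until both children have a larger defer. */
static void qd_sift_down(struct qd_heap *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;

        if (l < h->n && h->slot[l]->defer < h->slot[m]->defer)
            m = l;
        if (r < h->n && h->slot[r]->defer < h->slot[m]->defer)
            m = r;
        if (m == i)
            break;
        qd_swap(h, m, i);
        i = m;
    }
}

/* Add a runnable task; returns -1 if the heap is full. */
static int qd_insert(struct qd_heap *h, struct qd_task *t)
{
    if (h->n == QD_HEAP_MAX)
        return -1;
    h->slot[h->n] = t;
    qd_sift_up(h, h->n);
    h->n++;
    return 0;
}

/* Selecting the next task is a constant-time look at the root. */
static struct qd_task *qd_peek(const struct qd_heap *h)
{
    return h->n ? h->slot[0] : NULL;
}

/* The running (root) task used up its slice: defer it and re-sort. */
static void qd_expire(struct qd_heap *h, unsigned long step)
{
    if (h->n == 0)
        return;
    h->slot[0]->defer += step;
    qd_sift_down(h, 0);
}
```

Note that in this sketch only the selection is constant time; re-sorting the
heap after a slice expires still costs a logarithmic number of swaps in the
number of runnable tasks.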

cheers,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


From: Ingo Molnar <mi...@chiara.csoma.elte.hu>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/07
Message-ID: <fa.lhohhov.f563o4@ifi.uio.no>#1/1
X-Deja-AN: 475225521
Original-Date: Fri, 7 May 1999 18:17:33 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990507181436.8938B-100000@chiara.csoma.elte.hu>
References: <fa.n7aov3v.9km0rq@ifi.uio.no>
To: Rik van Riel <r...@nl.linux.org>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 7 May 1999, Rik van Riel wrote:

> Schedule_timeout() can be replaced by a simple resorting of
> the heap. Basically we only need to remove processes from
> the heap if they do blocking I/O or wait for a signal
> (otherwise they would clutter up the heap).

schedule_timeout() is not really expected to have high performance. Its
functionality was moved out of the main scheduler partly for this reason. 
Btw., the timer code (which is used by schedule_timeout() internally) has a
sort of heap structure already.

-- mingo


From: Ingo Molnar <mi...@chiara.csoma.elte.hu>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/07
Message-ID: <fa.li8jhgv.cl43g4@ifi.uio.no>#1/1
X-Deja-AN: 475225522
Original-Date: Fri, 7 May 1999 18:09:43 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990507180747.8938A-100000@chiara.csoma.elte.hu>
References: <fa.n7aov3v.9km0rq@ifi.uio.no>
To: Rik van Riel <r...@nl.linux.org>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 7 May 1999, Rik van Riel wrote:

> It's based on the idea of a priority heap. Processes have
> a quantum and a defer value. The quantum value is the length
> of the process' time slice and the defer value is the time
> at which it will be run next (sort of a deadline).
> 
> The process at the top of the heap is the process with the
> lowest (run next) defer value. After a process finishes its
> time slice, the defer value is incremented (say, by p->priority)
> and the process is sorted down the B-tree. This means that the
> algorithm for selecting the best process is O(1) on UP machines
> and O(log(nr_cpus)) on SMP machines.
> 
> Schedule_timeout() can be replaced by a simple resorting of
> the heap. Basically we only need to remove processes from
> the heap if they do blocking I/O or wait for a signal
> (otherwise they would clutter up the heap).

so? And how do you handle VM and CPU (and who knows what type of other
future) affinity?

-- mingo


From: Rik van Riel <r...@nl.linux.org>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/07
Message-ID: <fa.n7a703v.8ks1rs@ifi.uio.no>#1/1
X-Deja-AN: 475316640
Original-Date: Fri, 7 May 1999 20:48:07 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9905072045440.219-100000@mirkwood.dummy.home>
References: <fa.li8jhgv.cl43g4@ifi.uio.no>
To: Ingo Molnar <mi...@chiara.csoma.elte.hu>
X-Sender: r...@mirkwood.dummy.home
X-Authentication-Warning: mirkwood.nl.linux.org: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://humbolt.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-My-Own-Server: http://www.nl.linux.org/
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 7 May 1999, Ingo Molnar wrote:
> On Fri, 7 May 1999, Rik van Riel wrote:
> 
> > It's based on the idea of a priority heap.
	[snip]

> so? And how do you handle VM and CPU (and who knows what type of
> other future) affinity?

I haven't thought about that yet, but I think we _will_
have to think up something to make the current scheduler
more responsive and more predictable under load and to
reduce the CPU time used by niced tasks.

I think that even without the priority heap we can use the
Quantum/Defer model in order to get better predictability
and responsiveness under load.

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


From: Ingo Molnar <mi...@chiara.csoma.elte.hu>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/08
Message-ID: <fa.lfo5h8v.95m087@ifi.uio.no>#1/1
X-Deja-AN: 475726641
Original-Date: Sat, 8 May 1999 17:14:39 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990508170633.5517A-100000@chiara.csoma.elte.hu>
References: <fa.n7a703v.8ks1rs@ifi.uio.no>
To: Rik van Riel <r...@nl.linux.org>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 7 May 1999, Rik van Riel wrote:

> > > It's based on the idea of a priority heap.
> 	[snip]
> 
> > [...] And how do you handle VM and CPU (and who knows what type of
> > other future) affinity?
> 
> I haven't thought about that yet [...]

that's one of the tough things IMO ... without those constraints we could
easily get rid of the linear behavior in the scheduler. We are trying to
avoid situations where there are lots of tasks on the runqueue _and_ there
are lots of reschedules happening; this way linearity doesn't matter. 

>                           [...], but I think we _will_
> have to think up something to make the current scheduler
> more responsive and more predictable under load [...]

do you still see responsiveness problems under pre4-2.2.8? If yes then i'd
be very interested in reproducing it.

>                                             [...] and to
> reduce the CPU time used by niced tasks.

this can be achieved by increasing (or changing) the scale of priorities. 
(this can be done seamlessly) But i doubt this matters in most cases.
(except when there are _lots_ of niced 100%-CPU using processes running)

-- mingo


From: Rik van Riel <r...@nl.linux.org>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/09
Message-ID: <fa.lge9onv.6gipos@ifi.uio.no>#1/1
X-Deja-AN: 475889559
Original-Date: Sun, 9 May 1999 18:21:55 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9905091817410.256-100000@mirkwood.nl.linux.org>
References: <fa.lfo5h8v.95m087@ifi.uio.no>
To: Ingo Molnar <mi...@chiara.csoma.elte.hu>
X-Authentication-Warning: mirkwood.nl.linux.org: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://humbolt.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-My-Own-Server: http://www.nl.linux.org/
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sat, 8 May 1999, Ingo Molnar wrote:
> On Fri, 7 May 1999, Rik van Riel wrote:
> 
> >                           [...], but I think we _will_
> > have to think up something to make the current scheduler
> > more responsive and more predictable under load [...]
> 
> do you still see responsiveness problems under pre4-2.2.8? If yes
> then i'd be very interested in reproducing it.

I've read the patch and think you're right about this one.
Under normal loads 2.2.8 will have no problems at all except
possibly the fact that it calls reschedule_idle() twice on
most reschedules and the fact that the kernel still recalculates
the priority of _all_ processes as soon as one process runs
out of its time slice.

> >                                             [...] and to
> > reduce the CPU time used by niced tasks.
> 
> this can be achieved by increasing (or changing) the scale of
> priorities.  (this can be done seamlessly)

More importantly, it can be done without an increase in
the scheduling overhead and with removal of the current
process recalculation ritual.

It will increase SMP scalability of the scheduler, allow
for a wider priority scale and maybe even decrease scheduler
overhead (I'm not sure about the last one being measurable
though)...

cheers,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+


From: Ingo Molnar <mi...@chiara.csoma.elte.hu>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/09
Message-ID: <fa.nkq90mv.1j4eios@ifi.uio.no>#1/1
X-Deja-AN: 475897443
Original-Date: Sun, 9 May 1999 19:33:44 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990509192416.17632G-100000@chiara.csoma.elte.hu>
References: <fa.lge9onv.6gipos@ifi.uio.no>
To: Rik van Riel <r...@nl.linux.org>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 9 May 1999, Rik van Riel wrote:

> I've read the patch and think you're right about this one.
> Under normal loads 2.2.8 will have no problems at all except
> possibly the fact that it calls reschedule_idle() twice on
> most reschedules [...]

what do you mean? We call reschedule_idle() only when a process 1) gets
runnable 2) gets scheduled away (but is still runnable). These are two
different situations.

>            [...] and the fact that the kernel still recalculates
> the priority of _all_ processes as soon as one process runs
> out of its time slice.

this recalculation argument is a complete red herring. I've explained this
earlier too: the recalculations happen at a _constant frequency_, with the
additional mechanism that there are fewer recalculations if there are more
CPU-bound processes. The goal is to have tasks increase their 'dynamic
priority' even if they are not on the runqueue. So any architectural change
that concentrates on removing these recalculations is seriously misguided.
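
A minimal sketch of the recalculation under discussion, assuming the
2.2-era rule in which every task keeps half of its unused time slice and
gains one fresh slice: counter = (counter >> 1) + priority. The struct, the
helper names and the value of 20 ticks for a nice-0 task are illustrative
assumptions, not kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's task_struct; only the two
 * fields that matter for the recalculation are modelled. */
struct task {
    int counter;   /* ticks left in the current slice ("dynamic priority") */
    int priority;  /* static priority in ticks (assumed: 20 for nice 0) */
};

/* One recalculation pass over ALL tasks, runnable or not: each task keeps
 * half of its unused counter and is granted a fresh slice on top. */
static void recalculate(struct task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}

/* Counter of a task that sleeps through `passes` recalculations without
 * consuming any ticks in between. */
static int counter_after_sleep(int counter, int priority, int passes)
{
    struct task t = { counter, priority };
    for (int i = 0; i < passes; i++)
        recalculate(&t, 1);
    return t.counter;
}
```

With priority 20, a task that sleeps through successive passes sees its
counter go 20, 30, 35, 37, 38, 39 and then stay at 39, while a task that
burns its whole slice restarts from 20 after each pass. That bounded growth
is the gain in 'dynamic priority' for tasks that are not on the runqueue.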

> 
> > >                                             [...] and to
> > > reduce the CPU time used by niced tasks.
> > 
> > this can be achieved by increasing (or changing) the scale of
> > priorities.  (this can be done seamlessly)
> 
> More importantly, it can be done without an increase in
> the scheduling overhead and with removal of the current
> process recalculation ritual.
> 
> It will increase SMP scalability of the scheduler, allow
> for a wider priority scale and maybe even decrease scheduler
> overhead (I'm not sure about the last one being measurable
> though)...


From: Rik van Riel <r...@nl.linux.org>
Subject: Re: Overscheduling DOES happen with high web server load.
Date: 1999/05/09
Message-ID: <fa.lfu3pfv.40cogo@ifi.uio.no>#1/1
X-Deja-AN: 475921984
Original-Date: Sun, 9 May 1999 20:49:23 +0200 (CEST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.03.9905092043260.256-100000@mirkwood.nl.linux.org>
References: <fa.nkq90mv.1j4eios@ifi.uio.no>
To: Ingo Molnar <mi...@chiara.csoma.elte.hu>
X-Authentication-Warning: mirkwood.nl.linux.org: riel owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Search-Engine-Bait: http://humbolt.nl.linux.org/
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-My-Own-Server: http://www.nl.linux.org/
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Sun, 9 May 1999, Ingo Molnar wrote:
> On Sun, 9 May 1999, Rik van Riel wrote:
> 
> > I've read the patch and think you're right about this one.
> > Under normal loads 2.2.8 will have no problems at all except
> > possibly the fact that it calls reschedule_idle() twice on
> > most reschedules [...]
> 
> what do you mean? We call reschedule_idle() only when a process 1)
> gets runnable 2) gets scheduled away (but is still runnable).
> These are two different situations.

OK, you're right about that one -- a slight brain fart on my
side. Then again, we probably want to make goodness() a bit
cheaper than it is now because we're calling it more often
than before....
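
A rough sketch of why the cost of goodness() matters, assuming the 2.2-era
design in which schedule() evaluates goodness() once for every runnable
task and picks the highest weight. The struct fields, the mm_id stand-in,
the realtime flag and the CPU_AFFINITY_BONUS value are simplifications
chosen for illustration, not the kernel's actual definitions.

```c
#include <stddef.h>

/* Assumed bonus for staying on the same CPU; the real kernel constant
 * is architecture specific. */
#define CPU_AFFINITY_BONUS 15

/* Hypothetical, stripped-down task descriptor. */
struct task {
    int counter;      /* ticks left in the current time slice */
    int priority;     /* static priority derived from the nice value */
    int rt_priority;  /* only meaningful for real-time tasks */
    int realtime;     /* nonzero for a real-time policy (simplified flag) */
    int last_cpu;     /* CPU the task last ran on */
    int mm_id;        /* stand-in for the address-space pointer */
};

/* Weight of candidate p when replacing prev on this_cpu. */
static int goodness(const struct task *p, const struct task *prev,
                    int this_cpu)
{
    if (p->realtime)
        return 1000 + p->rt_priority;  /* real-time always wins */

    int weight = p->counter;
    if (weight == 0)
        return 0;                      /* slice used up: not eligible */
    if (p->last_cpu == this_cpu)
        weight += CPU_AFFINITY_BONUS;  /* cache is probably still warm */
    if (p->mm_id == prev->mm_id)
        weight += 1;                   /* cheaper switch, same address space */
    return weight + p->priority;
}

/* Linear scan of the run queue: one goodness() call per runnable task.
 * Returns the index of the best task, or -1 if every task has a zero
 * weight, which is the point where a recalculation would be needed. */
static int pick_next(const struct task *rq, size_t n,
                     const struct task *prev, int this_cpu)
{
    int best = -1, best_weight = 0;
    for (size_t i = 0; i < n; i++) {
        int w = goodness(&rq[i], prev, this_cpu);
        if (w > best_weight) {
            best_weight = w;
            best = (int)i;
        }
    }
    return best;
}
```

Because the scan is linear in the number of runnable tasks, any extra work
inside goodness() is multiplied by the run queue length on every reschedule.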

> >            [...] and the fact that the kernel still recalculates
> > the priority of _all_ processes as soon as one process runs
> > out of its time slice.
> 
> this recalculation argument is a complete red herring. I've explained this
> earlier too: the recalculations happen at a _constant frequency_, with the
> additional mechanism that there are fewer recalculations if there are more
> CPU-bound processes. The goal is to have tasks increase their 'dynamic
> priority' even if they are not on the runqueue. So any architectural change
> that concentrates on removing these recalculations is seriously misguided.

I've got a new scheme in mind which uses a new variable in
the task struct in order to make goodness simpler, increase
the range of CPU usage (for niced tasks), favor interactive
performance better and get rid of the recalculation as a
nice side effect...

Btw, the rate of recalculation increases with more CPUs in
the machine so I wouldn't be surprised if it didn't scale
well to number crunching machines with a lot of nice +19
tasks.

Anyway, I think I can bring my point across much better in
C than in English, so I'll focus on some source code now...

cheers,

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+
