Linux Software

Robert Love Explains Variable HZ

An anonymous reader writes "Robert Love, author of the kernel preemption patch for Linux, has backported a new performance-boosting patch from the 2.5 development kernel to the 2.4 stable kernel. This patch allows one to tune the frequency of the timer interrupt, defined in 2.4 as "HZ=100". Robert explains 'The timer interrupt is at the heart of the system. Everything lives and dies based on it. Its period is basically the granularity of the system: timers hit on 10ms intervals, timeslices come due at 10ms intervals, etc.' The 2.5 kernel has bumped the HZ value up to 1000, boosting performance."
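For anyone who wants to see what that granularity means in numbers, here is a tiny standalone sketch (just the arithmetic, not code from the patch):

    #include <stdio.h>

    int main(void)
    {
        int hz_values[] = { 100, 1000 };   /* the 2.4 default vs. the 2.5 value */
        int i;

        for (i = 0; i < 2; i++) {
            int hz = hz_values[i];
            /* One jiffy (timer tick) lasts 1/HZ seconds, so timers and
               timeslice decisions get rounded to this granularity. */
            printf("HZ=%4d -> tick period %5d us (%g ms)\n",
                   hz, 1000000 / hz, 1000.0 / hz);
        }
        return 0;
    }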
  • In FreeBSD (Score:3, Interesting)

    by CounterZer0 ( 199086 ) on Tuesday October 15, 2002 @09:51PM (#4458447) Homepage
    This is actually an easy-to-tune kernel config variable. Quick and easy performance boosts to be had by all!
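    For reference, on FreeBSD 4.x the knob is a kernel configuration option along these lines (file path shown only as an example; you rebuild and install the kernel afterwards):

        # in the kernel config file, e.g. /usr/src/sys/i386/conf/MYKERNEL
        options         HZ=1000         # the default is 100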
  • Finally... (Score:5, Funny)

    by ActiveSX ( 301342 ) on Tuesday October 15, 2002 @09:52PM (#4458449) Homepage
    An overclockable operating system. You wouldn't believe how long I've waited for this. Now where do I get a software heatsink?
    • I just direct my heat to /dev/null, and use a regular heat sink on my pci bitbucket accelerator card.
      • The famous German computer magazine, c't, actually featured a hardware accelerated null device, the "Hypertronics 82C997 ENUL" in their 4/95 issue (as an April fool's joke, of course).

        The article is not available online unfortunately, but some of the amused reactions of their readers are here [heise.de] (in German), and you can even find a picture [heise.de] of the gizmo (note the photoshopped activity LED).
  • by Professor Collins ( 604482 ) on Tuesday October 15, 2002 @09:53PM (#4458456) Homepage
    One of the great paradoxes of computer science is that perceived performance and actual performance almost always trade off against each other. By raising the frequency of the timer interrupt, individual timeslices are shorter and the processor needs to make more context switches, resulting in less overall processing being performed. However, because these context switches occur more frequently, it appears to the user that apps are more responsive and fluid.

    To make a long story short, for number crunching machines, servers, and other applications which don't need much user interaction, larger timeslices are preferable because it doesn't matter how responsive the user interface is. For desktop systems, the timeslice can be decreased to improve the responsiveness of the user interface and give a better "feel" to the system at the expense of a minor performance loss. Being able to tune these parameters to meet your needs is one of Linux's great strengths.

    • by zenyu ( 248067 ) on Tuesday October 15, 2002 @11:16PM (#4458977)
      One of the great paradoxes of computer science is that perceived performance and actual performance almost always come at a tradeoff. By raising the frequency of the timer interrupt, individual timeslices are shorter and the processor needs to make more context switches, resulting in less overall processing being performed.

      This is not quite true. If you only have a single program running just one thread this is true. You have to do a context switch at each tick to Ring 0 and back, which takes maybe 50 cycles, or 1/20 of a microsecond on a 1GHz machine. Do this 1000 times and you've lost 50 microseconds of processing time.

      BUT once you have more than one program or thread running the situation is different. Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch. With 1000 ticks per second you lose 50us to context switching and 900 usec to waiting for the next context switch. So you get more work done.

      For someone who always runs at 100% processor utilization, 1000 ticks per second is probably a good setting, since you are probably just running one thread 99% of the time and only once in a while writing logs to disk or responding to some other events. If you are more like me and run at 1% of processor utilization most of the time, with the 100% utilization only happening when you compile--so you would rather be able to keep using the computer than save 1ms on a 5 minute compile--then an even higher value might make sense. 10000 maybe, assuming there aren't limitations in the kernel that prevent the higher value.
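      (To make those back-of-the-envelope numbers easy to play with, here is the same arithmetic as a throwaway program; the 1/20 usec per-tick cost is just the estimate above, not a measured figure:)

        #include <stdio.h>

        int main(void)
        {
            const double switch_cost_us = 0.05;    /* ~1/20 usec per tick, per the estimate above */
            const double work_us        = 100.0;   /* the thread with 100 usec of work to do      */
            int hz_values[] = { 100, 1000, 10000 };
            int i;

            for (i = 0; i < 3; i++) {
                int hz = hz_values[i];
                double overhead_us   = hz * switch_cost_us;        /* tick overhead per second          */
                double worst_wait_us = 1000000.0 / hz - work_us;   /* worst-case wait for the next tick */
                printf("HZ=%5d: %6.1f usec/sec overhead, up to %7.1f usec waiting\n",
                       hz, overhead_us, worst_wait_us);
            }
            return 0;
        }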

      Disclaimer: I've been applying Love's patches for a while now. They make a real difference in the responsiveness of X, especially if you're running stuff like Mozilla or Gnome/KDE on your box. I haven't applied it on any servers cuz the preempt patch is not quite stable.
      • I disagree with your analysis.

        If a process isn't doing processing, that's because it's blocked in the kernel. (Q: What does a HLT do in userland?) As soon as the kernel puts a process on a wait queue, it reschedules. So you don't have any loss 'waiting for the next context switch'; that's just time that another process is running, or if nothing has anything to do, that the kernel halts the processor.

        Note: I haven't studied how process scheduling is handled under Linux, but I can't imagine any OS that wouldn't do what I said here... or at least, I can't imagine one that would halt the processor after a process blocks, while it waits for a timer interrupt to schedule the next process.

        Okay, maybe one.

        • Okay, after re-reading the article, I did see one performance gain this could get: the case of select/poll. (This is blatantly stated in the article; I shot my mouth off before reading the article closely enough.)

          Under BSD, as I understand it (I don't have the Daemon Book handy, but a quick reading of the source seems to agree), select will put a process on the wait queue until something arrives. During a select, the kernel does nothing with the process-- timer or not.

          From the look of the article, under Linux, select actually does some sort of polling at or related to HZ. It may be on some sort of almost-run queue: a selecting process gets allocated timeslices; on its slice, it polls and either returns to userland or goes back onto the almost-run queue. I don't have time to verify that-- I don't know my way around the Linux kernel-- but it seems to be reasonable, based on the article. Can I get a Linux developer to confirm/deny my guess?

          So it seems that in the case of something selecting, primarily on an otherwise idle or near-idle system, increasing HZ may improve performance. This situation is less common than it used to be in today's world of multithreaded servers (since each thread typically blocks only on a single fd), but it's still potentially significant.

          • by pthisis ( 27352 ) on Thursday October 17, 2002 @03:17PM (#4472099) Homepage Journal
            From the look of the article, under Linux, select actually does some sort of polling at or related to HZ. It may be on some sort of almost-run queue: a selecting process gets allocated timeslices; on its slice, it polls and either returns to userland or goes back to onto the almost-run queue. I don't have time to verify that-- I don't know my way around the Linux kernel-- but it seems to be reasonable, based on the article. Can I get a Linux developer to confirm/deny my guess?

            Deny. It's actually the idle timeout that's affected by HZ. select() itself doesn't poll at all, and e.g. a select() call with an infinite timeout will be completely unaffected by HZ (select will wake up when the network gets an interrupt resulting in readable data/writeable buffer space).

            Example of the timeout effect: a game could have a select() loop that waits on user input, but also has a timeout argument so that it can go ahead and update the screen, do enemy AI, etc. The kernel, in the absence of interrupts, schedules on HZ boundaries. Suppose that you as a programmer put a 1/60 second timeout argument in the select loop (intending to update the screen with a 60 Hz refresh and figure out where everything's moving). If you call select() right after a HZ boundary, you could find yourself waiting until 1/50 second passes even on an idle machine with HZ=100; after 1/100 sec, your timeout hasn't expired yet. Next chance to schedule is at 2/100 (1/50) sec.

            With HZ=1000, you'll schedule no more than 1/1000 sec after the 1/60 sec boundary (on an idle machine).

            This example is really simplified; a real-life app would adjust for scheduling creep by keeping track of wall-time. But the same concept, with more complicated apps, can cause faster HZ ticks to give you better CPU utilization (especially in e.g. video editing apps and such) because you get around to using the CPU closer to when you want it.
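            A bare-bones sketch of that kind of loop, just to show where the rounding comes from (frame() is only a placeholder for the screen/AI work, and the 1/60 sec timeout is the figure from the example):

              #include <sys/select.h>
              #include <sys/time.h>
              #include <unistd.h>

              static void frame(void) { /* update the screen, run enemy AI, etc. */ }

              int main(void)
              {
                  for (;;) {
                      fd_set rfds;
                      struct timeval tv;

                      FD_ZERO(&rfds);
                      FD_SET(0, &rfds);           /* watch stdin for user input */
                      tv.tv_sec  = 0;
                      tv.tv_usec = 1000000 / 60;  /* ~16.7 ms requested; the kernel only wakes on
                                                     tick boundaries, so with HZ=100 this can
                                                     stretch to 20 ms even on an idle machine */

                      if (select(1, &rfds, NULL, NULL, &tv) > 0) {
                          /* handle the input */
                      }
                      frame();                    /* a real app would also track wall time to
                                                     compensate for the scheduling creep */
                  }
              }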

            The preempt kernel is an even better example of where decreasing latency can increase throughput, sometimes significantly. There you can really get around to dealing with I/O quickly, keeping CPU saturated (and saturated with cache-warm data) and benefiting things like heavily loaded web servers just as much as sound editing stations.

            Sumner
            • Okay, that makes much more sense, thank you!

              I don't yet see why the timeout of select would be such a big deal (outside of a few specific cases), but I'll have to think about it, and your examples, more carefully.

              FWIW: BSD has the same select setup as you described.

              So here's my thought: how expensive is it to reprogram the timer chip? Would it be possible to adjust it dynamically to create perfect granularity in sleep/select?

              • FWIW: BSD has the same select setup as you described.

                Yeah, pretty much every Unix has interrupt-driven returns for the non-timeout case, anything else would be pretty bogus--though some systems (e.g. Linux 2.5.x) do interrupt mitigation under high load, but that's more of an "above and beyond" thing. The timeout case is handled differently on several Unices.

                So here's my thought: how expensive is it to reprogram the timer chip? Would it be possible to adjust it dynamically to create perfect granularity in sleep/select?

                There is a tickless Linux implementation.

                I can't find the home page at the moment, but see e.g.
                http://www.uwsg.iu.edu/hypermail/linux/kernel/0104.1/0137.html

                There are a lot of other ways of dealing with this, and tickless has some negative attributes I don't fully understand (among them is that it's not portable to older hardware, and there is some overhead to programming timer interrupts). I think the nanosecond kernel patches (which are starting to go into 2.5) address the select/sleep granularity issue in a different way but I'm really fuzzy on the details.

                Sumner
                • There's a better link at LWN [lwn.net] explaining the approach and drawbacks. It links to the high-resolution timers project (Anzinger's), which I believe is going into 2.5.

                  Sumner
                  • Terrific, thanks! The IBM project it discusses sounds a lot like my half-verbalized idea. I'll have to delve deeper into this idea, and what they've done so far.

                    • Note that while the list of drawbacks is only addressed briefly, increased schedule() overhead and increased system call overhead are potentially large drawbacks.

                      Also, after further investigation the Anzinger solution is _not_ in 2.5.x yet; Linus has looked at the patch, asked for clarification, and Anzinger recently replied with an updated patch. Search linux-kernel archives for "high-res-timer" or "POSIX timer" patches for more info.

                      Sumner
      • AFAIK, the parent is not quite true. 8-)

        A reschedule does not happen only on the timer tick (100 or 1000 times a second depending on HZ setting), it happens on a number of occasions, timer tick being one of them. The other ones remove the concerns zenyu seems to be having:

        1. when a process sleeps - when a process calls the kernel in order to sleep, the kernel reschedules, because sleeping can be handled using a normal timer and in the meantime other processes may work
        2. when a process yields - when a process says that it's done its stuff in this tick, whatever that means
        3. when idle, on any interrupt - when no process wants to work, the first one that wants to work is scheduled right away

        The second point may seem a little weird, but a process can only become willing to do something as a result of some interrupt - a timer if the process was sleeping for a given amount of time; an I/O interrupt if the process handles the keyboard or the mouse. In any case, interrupts are handled by the kernel and so if a process is to wake up from its sleep or if a process gets something in some stream on which it is waiting (stdin on keyboard interrupt, socket on network card interrupt etc.), that process is just scheduled to wake up and work.

        So on an idle machine the HZ does not really have much impact, and on a utilized machine the smoothness of process interaction (like window manager vs. X server) increases with increased HZ but this also increases the overhead.

        Hope it's clearer.
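        If you want to see the tick granularity (as opposed to the interrupt-driven wakeups) from userland, one quick way is to time a short sleep; a rough sketch, assuming a 2.4-era kernel where short sleeps get rounded up to tick boundaries:

          #include <stdio.h>
          #include <sys/time.h>
          #include <time.h>

          int main(void)
          {
              struct timeval before, after;
              struct timespec req = { 0, 1000000 };   /* ask for a 1 ms sleep */
              long elapsed_us;

              gettimeofday(&before, NULL);
              nanosleep(&req, NULL);                  /* timer-driven, so rounded up to whole ticks */
              gettimeofday(&after, NULL);

              elapsed_us = (after.tv_sec - before.tv_sec) * 1000000L
                         + (after.tv_usec - before.tv_usec);
              printf("asked for 1 ms, slept %ld us\n", elapsed_us);
              /* Expect roughly 10-20 ms with HZ=100, 1-2 ms with HZ=1000. */
              return 0;
          }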

      • Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch. With 1000 ticks per second you lose 50us to context switching and 900 usec to waiting for the next context switch. So you get more work done.
        This is not true. When a process/thread has nothing to do, it does not just sit around waiting to be preempted. By definition, if it has "nothing to do", that is because it has yielded the CPU. For instance, if the process does 100us of work and then makes an IO call, it will immediately yield to another process.

        Sure, there are some poorly-written apps that do excessive busy-waiting, but they are the exception, and there's not much the OS can do about it anyway.

        The only benefit of increasing HZ is latency.

        <RANT>
        BTW, I'd just like to mention a pet peeve of mine. In the article, they mention that "RedHat shipped their 8.0 kernel at HZ=512". There is no reason whatsoever that this should be a power of two, so I believe it should not be. Powers of two have a magical status in the computer world, but I think you should not give your code this kind of connotation unless you have actually decided that a power of two is the best choice. Otherwise, you should pick a number that reflects the ad-hoc nature of your choice. Powers of ten reflect this better than powers of two. Thus, all else being equal, they should have chosen 500 over 512.
        </RANT>

        • The only benefit of increasing HZ is latency.

          Presumably you meant "The only benefit of increasing HZ is decreasing latency" which is not a bad thing unto itself. Most people run interactive desktop applications, not scientific number crunching jobs for days at a time.

          Having a minimum granularity of 1/50th of a second for a select() when HZ=100 really sucks, quite frankly.
          Music players and animation programs have to resort to busy wait loops to get good response and tie up all CPU in the process. This is completely unnecessary in a modern OS.
          It's 1/50th, not 1/100th, of a second with HZ=100 because of the way POSIX defines select(): you have to wait for two jiffies at a minimum, according to Linus [iu.edu].

          Anyway, HZ > 500 sure as hell is better than HZ=100.
          A HZ-less kernel with on-demand timer scheduling would be much better, though. IBM has such a kernel patch for their mainframe version of Linux to improve responsiveness when hundreds of Linux VMs are running concurrently.

          Pity about the USER_HZ = 100 thing to accommodate all the borken programs that pick up HZ from the Linux kernel header file and assume it is a) constant, or worse yet b) 100.
          Had HZ been a proper syscall instead of a #define for user-land programs in the first place, this would not have been a problem today.

          Can someone do me a big favor and post RedHat 8.0's asm-i386/param.h file so I can see how they defined HZ, USER_HZ and friends? I'd like to see it without actually going to the trouble of installing RedHat 8.0.
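          (In the meantime, the portable way for a userland program to get the tick rate, rather than trusting a compiled-in HZ, is sysconf; a minimal sketch:)

            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                /* Reports the USER_HZ value (clock ticks per second) that the kernel
                   exposes to userland, whatever HZ the kernel itself was built with. */
                long ticks = sysconf(_SC_CLK_TCK);
                printf("clock ticks per second: %ld\n", ticks);
                return 0;
            }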
      • BUT once you have more than one program or thread running the situation is different.

        Yes. Say you have one thread running flat out and another that needs to do 100microseconds of work. With 100 ticks per second you will lose 5 usec to context switching and 9900 usec to waiting for the next context switch.

        No! The task does 100 microseconds of work and then calls the sleep command, or does I/O or whatever. This ultimately goes through the kernel and the kernel does an early context switch. It certainly doesn't waste the rest of the timeslice.

        Incidentally, the overhead of doing the context switch is much bigger than you say here- one of the things that the kernel has to do is flush the caches as it swaps the virtual memory in and out- that will slow the system for tens of thousands of instructions afterwards.

        Anyway, you're wrong about it not improving performance; it certainly can improve latency, which is very definitely a performance metric; but obviously you'll lose some cpu time due to the more frequent context switches that will occur.

        • Incidentally, the overhead of doing the context switch is much bigger than you say here- one of the things that the kernel has to do is flush the caches as it swaps the virtual memory in and out- that will slow the system for tens of thousands of instructions afterwards.

          It is higher if you switch to another userland application. If you go to the scheduler and decide to keep running the same app, the TLB does not have to be flushed. Even if you do switch to another app it's unlikely that it's going to thrash the cache. Those gnome-apps aren't so data intensive. It's even more unlikely that you will have to page in virtual memory. I don't think I've even bothered to allocate virtual memory lately; when 2-4 gigs are cheap, why bother? (On a P4 you can even tell the processor not to flush the locals out of the TLB when you load a new LDT, and it never flushes the kernel's entries unless you change the GDT for some reason.) You can thrash the cache if you want to, just start a compile with -j# with # greater than the number of processors. But those little applications that need a small timeslice once in a while aren't gonna do it. There might be a security argument for flushing the cache, so that some app can't communicate with another by reading or not reading a memory location into the cache. But at that level of paranoia I wouldn't be using Linux anyway.

          If you're swapping in virtual memory from a hard drive, who cares what your timeslice is? It's going to take milliseconds just to get the page anyway! The only benefit of virtual memory is that it can swap out unused code so only the working set uses up RAM, in which case you still rarely actually swap things in since your working set is in RAM by definition. Overlays probably had a better granularity for that purpose. I'm always afraid virtual memory will be abandoned, even though it could be useful at some future date when you might have just 4GB of fast RAM, and 64TB of plain old DDR-RAM or something else the processor can't handle without OS help. (Yes I know there are machines that actually use virtual memory, but I'm not going to argue that they should have more ticks, they might benefit from fewer, in fact. I just haven't seen one of those machines in at least two years, so I think addressing the Athlons, and P3 & P4's of the world isn't such a bad idea.)
          • I think once you've switched on the memory management unit; you lose very little by using virtual memory. I don't think it's going to go away any time soon. And the cost of switching on the MMU is very small, given the relatively deep pipelines we have right now with these processors.
            • you lose very little by using virtual memory

              True, but if people stop even creating swap partitions in large numbers, who's going to want to maintain the code? Code dies when it's not maintained... I don't think this will happen to Linux or any of the free OS's anytime soon since they are still used on old hardware where it's hard to even find anyone selling compatible memory. And, as long as the embedded people don't all just switch over to uLinux/rtLinux it won't happen either. Something like PS2 Linux really needs swap files with just 32 megs of RAM. (The chip can support gigs of RAM, but the hack requires lots of soldering and expert knowledge of the MMU.)

              I think the chance of Linux abandoning the MMU is 0%, if only because you need memory protection on any general purpose machine. Also I think you'd be capped at 36 bits of address space on i386, or only 64 gigs of RAM; that sounds great now, but it won't in 5 years.
              • No, look swapping is not the same as virtual memory. Virtual memory is useful even in the absence of any disk or swap space at all.

                The point is that virtual memory reduces the amount of real memory you need for each thread- each only takes what it really needs. Sure, if memory is cheap, it may not matter so much. But even if it is cheap, do you really want to give each process 1 gig of space on the off-chance that it might need it? I don't think so.

                Virtual memory is when a process thinks it has 1 gigabyte of memory, but it actually only has, say, 128 megabytes. It can read or write to any bit of it, and the OS does what is necessary to ensure that it never notices the difference; obviously up to the actual system limits.

                Virtual memory and swap space go together very nicely, but one does not imply the other. You can use virtual memory to implement garbage collection for example; with no backing store at all.

                I guess there are other ways to do similar things- for example, don't use virtual memory, use real memory and set up the MMU so that each thread can only see its own map. But there are issues with that, and it isn't necessarily faster.

                • No, look swapping is not the same as virtual memory. Virtual memory is useful even in the absence of any disk or swap space at all.

                  I wasn't clear enough, I see 0% chance that virtual memory will disappear from Linux because it provides protection from one application playing with another's memory.

                  I do fear for swapping, but only years from now when it's not so common. I do not fear for the loss of MMU support including virtual memory.

                  It isn't clear this is what I'm saying from that post, but if you read what I said before I think it's clear. I was agreeing with you on the point of virtual memory not being a big deal, but adding that swapping was in dirge territory on the modern systems that will benefit from upping HZ. Your original comment on swapping is what inspired me to write the comment, cuz I thought you were making the point that it's not a performance loss to use virtual memory even if you never swap, while my point on swapping had nothing to do with performance, but code maintenance. If a signal never fires, who cares how long it takes to handle it, after all.

                  If you have to do any swapping to disk I don't care how much you try to tune HZ, you need to buy more memory or run fewer apps to get a snappier system.

                  But enough on this point, it's tangential and I think I agree with everything you said in this last comment without exception.
    • by jquirke ( 473496 ) on Tuesday October 15, 2002 @11:39PM (#4459091)
      Actually the last time I checked, the kernel had to be recompiled to change the HZ variable. Not trolling or anything, but it's been pointed out FreeBSD has this as a sysctl parameter. Hopefully Linux will offer this (correct me if I'm wrong!).

      Also, you don't necessarily have to increase the clock frequency by a whole order of magnitude. A fair compromise could be 200Hz, or 250Hz, or 500Hz. A typical workstation running X-Windows could use 250 or 500, for example.
    • Interestingly enough, that is basically the only major difference between NT4 Server and NT4 Workstation. I believe there was also a flag in the registry somewhere which apps could query to determine whether they were on a Server or Workstation. If you changed this setting, it inexplicably was reset when you next viewed the registry. After some research I discovered there is a tiny thread in the kernel specifically for checking and resetting that particular registry setting...

      NT Server has a larger timeslice and more caching for some system functions, while NT Workstation has a smaller timeslice with caching geared for user apps.

      I know NT is old technology, and I'm not sure if this still applies to the latest MS offerings. Hardly justifies the price difference between Server and Workstation!

      • by Anonymous Coward
        This setting is still there in W2K
        System Control Panel -> Performance -> Optimize Performance for 'Applications' or 'Background Services'.

        NT 4.0 Server's default was oddly "Applications"!

        NT also has a priority boost for interactive apps. However, if the GUI is 'dead' for a long period of time (such as on a server), it will stop doing this. That's why if you walk up to an even lightly loaded W2K server, it's got that X11-style laggy mouse that your workstation never has.

        AFAICT, there's no real operational difference between "Server" and "Workstation" at least for NT4 and 5, although at least W2K Server has some sane non-workstation default settings. The kernel thread/registry entries are for licensing purposes only.
      • Some information about the way NT handles timers can be found at Sysinternals, here [sysinternals.com] and here [sysinternals.com] (Quantums).
  • by GreyWolf3000 ( 468618 ) on Tuesday October 15, 2002 @10:06PM (#4458555) Journal
    ...since one of the biggest criticisms of X is how choppy window movement is due to the networked architecture of X (a signal gets sent to the server, the server responds, etc). When the timeslices are reduced, the "lag" gets significantly decreased, since the signal gets processed sooner, the server gets the message sooner, the server can report back sooner, etc.

    I tried recompiling the stock RedHat kernel, and sure enough there was an option in there to increase the HZ for the internal timer.

  • Note that although that looks like a tenfold change, by the time 2.6 is released processing power will have doubled about twice since 2.4, so the change is equivalent to running a 2.4 system with HZ=250 instead of 100.
    • Re:Moore's law (Score:4, Informative)

      by WolfWithoutAClause ( 162946 ) on Tuesday October 15, 2002 @10:19PM (#4458642) Homepage
      It's not quite that simple though. It's more tied to memory speed. The processors are improving at a faster rate than the memory is- and this clock tick is more related to memory speed.

      The reason is that across a scheduling tick the processor's cache gets flushed and reloaded. This means that you end up doing a burst of memory reads, and that will dominate if the clock tick is too short.

      • Re:Moore's law (Score:3, Informative)

        by zenyu ( 248067 )
        The reason is that across a scheduling tick the processor's cache gets flushed and reloaded.

        Whoa! What architecture is that!

        That just doesn't sound right. The register files get flushed (well, swapped), but if that 2 meg cache got flushed on every context switch there wouldn't be much point in having it at all. You can get cache thrashing if too many cache hungry programs are running simultaneously, but that's why you get a bigger cache if you run lots of those programs, so that their working set is saved across context switches.

        Perhaps you mean the L1 caches? They can get tossed out cuz they can only hold a few inner loops and a few small working sets at a time anyway, but all that stuff should still be in the L2 cache and get loaded very quickly into those puny L1 caches; the L1 data cache is practically a register file anyway on P4's, 64 bit moves to/from them happen in a cycle...

        Those L2->L1 moves might start to affect you at 1,000,000 ticks per second, but no one is proposing that, right? Even so in a typical environment the other context is just the scheduler which I can't imagine filling the L1 cache... It's not that complicated on a mostly idle machine. (Quick & Dirty schedulers have been written, some which looked through the entire process list. Erm, but on my machine there are less than 100 processes right now, still not so bad for L1 ;)

        Anyway I think 1000 is just fine; if you're doing real-time music synthesis on lotsa channels a larger number might be better. Someone in Europe is working on a music distro, so maybe they will discover that 8000 is the magic number for 16 channels at 48 kHz on a P4 at 2GHz.

        It would be neat if someone came up with metrics so that the tick was set so that 99.999% of the time the sound systems got their slices once every 500 usec, but otherwise the timeslices were as large as possible. Then you could just tune that 500 usec figure: make it longer if you're on a 386, shorter if you really need better than half-millisecond timings. I guess any program that needed frequent timeslices could write to some proc file how much more often it should be called, or whether it could afford to be called less often. For example, 1.2 if it wants to be called more often, 0.8 if its timing needs were met.

        The kernel would only have to ensure all the numbers it got were less than 1.0, and if the largest one were less than 0.95 it could even afford fewer timeslices. The kernel might also want to ensure through process accounting that the time-sensitive processes never got more than a certain percentage of the cycles available, even if it meant they got called less often. This is to prevent a denial of service where you just always write 10 to that proc file whenever you get run, so the tick rate grows until you spend all your time in the scheduler. It might also want to set a floor, so that a human can interact with the machine. Ticks should never be less than, say, 10 on a PC (or 250 if it's my machine).

        Though for some special-purpose interstellar Linux probe you might want to sleep for a whole second at a time before checking your direction once on your way, so a tick of 1 would be acceptable once out of your solar system. (You still want 64 bit uptimes for your interstellar probe; it would be so embarrassing if it arrived and the aliens were like, "Woah, this species can't develop an operating system with more than 3 day uptime for a space probe that took like 40 years to get here, what l0s3rs!")
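        Purely to illustrate the feedback rule I'm hand-waving at (nothing like this exists in the kernel; the names and thresholds are made up):

          /* Hypothetical sketch: each latency-sensitive task reports a factor via
           * some proc file after its slice: > 1.0 means "call me more often",
           * < 1.0 means "my timing needs were met". The tick rate gets nudged so
           * the largest reported factor stays just under 1.0, with hard limits so
           * userland can't drive HZ to silly values. */
          #define MIN_HZ 10       /* floor so a human can still interact with the box */
          #define MAX_HZ 10000    /* cap against the denial-of-service case */

          static int retune_hz(int current_hz, double max_reported_factor)
          {
              int new_hz = current_hz;

              if (max_reported_factor >= 1.0)
                  new_hz = (int)(current_hz * max_reported_factor);  /* someone is starved: tick more often */
              else if (max_reported_factor < 0.95)
                  new_hz = (int)(current_hz * max_reported_factor);  /* everyone has slack: tick less often */

              if (new_hz < MIN_HZ)
                  new_hz = MIN_HZ;
              if (new_hz > MAX_HZ)
                  new_hz = MAX_HZ;
              return new_hz;
          }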
        • Re:Moore's law (Score:4, Interesting)

          by WolfWithoutAClause ( 162946 ) on Wednesday October 16, 2002 @07:47AM (#4460536) Homepage
          Whoa! What architecture is that!

          All of them AFAIK.

          That just doesn't sound right.

          Well, it is. Deal ;-)

          The problem occurs when the memory management unit gets modified to maintain the virtual memory 'illusion'. Then you have to flush the caches to maintain consistency. Of course it doesn't happen on every clock tick, you hope.

          That means that all the caches above the memory management unit need to get flushed. This includes the program cache; and any other data too.

          I did a quick check on the web for this, but I haven't managed to find a good reference to where the MMU is placed in the different architectures yet.

          Anyway, that's one of the main reasons the OS scheduling isn't shorter, but any decent OS has to do quite a bit of dorking around at that time.

          • any decent OS has to do quite a bit of dorking around at that time.

            Sure, but flush the whole cache? The virtual memory argument justifies flushing the TLB cache if there is an actual switch to another running process. If it's just to the scheduler and back, doesn't that have a valid mapping in any process? (That whole "reserve 1G for the OS out of the 4GB directly addressable" must be for this purpose, right?) But while I'm not familiar with the actual implementation of these chips, I can't see why the cache wouldn't just be addressed by physical locations in memory, hence no need to invalidate their data just because you change their virtual addresses.

            I'm not an Intel expert, but I know they have a GDT and LDT. That is a Global Descriptor Table and a Local one, so the scheduler should be able to use the global one while the application uses the local one. I actually have the manuals so I looked, but it's a bit esoteric. What I found that supports the TLB flushing is that whenever you load a new LDT you invalidate all the local TLB entries. You can have over 8000 entries in an LDT, but the OS needs to use one for each user-level process in order to protect an application's memory from other applications. So if you're Amiga OS you just use the GDT for the kernel and an LDT for your apps, but if you're Linux each app gets its own LDT. The Pentium 4 has a PGE flag that can be used to prevent flushing frequently used entries. So you could prevent flushing the entries if you had some user-level app that was run frequently enough to get special treatment.

            I'm still not convinced the actual L2 & L3 caches get flushed, esp since you can even avoid TLB flushes. The TLB is small, which is why you would want to flush it before running a different process, the caches are relatively big...
            • I don't think the guy knows what he's talking about :) You never have to flush the TLB either, it just won't be very useful to the next program. I'm 99% sure that the cache is set associative to the physical address of RAM, not the virtual address. The TLB was invented (AFAIK) to help mostly with pre-cache translations, so that the processor isn't waiting on translations before it can get to the cache. This may not be accurate for L1 though.
          • Nope. Recent processors cache physical memory so you don't flush the cache on a context switch.
          • Given the speed of processors, how likely is a line in the cache to survive a context switch, even if it weren't flushed explicitly? I would imagine fairly small. (I'm punting on which L cache I mean: too lazy to think hard about it.)

            However, I've always wondered if there was a performance win to multiple threads running in the same memory space as compared to multiple processes, for this very reason.

            Anecdotally no: I spoke to the BeOS guys at a conference back in the days of wanting every cycle you could get, and they didn't give threads from the same process as the previous thread any higher probability of running next, which would be the natural thing to do if it were a performance win.
            • Given the speed of processors, how likely is a line in the cache to survive a context switch, even if it weren't flushed explicitly? I would imagine fairly small.

              It had better be astronomically small otherwise user programs will gradually screw up; and I don't think it is that small in fact.

    • Re:Moore's law (Score:3, Insightful)

      Timeslices didn't decrease in said time; those have been pretty constant for a while.

      I seriously doubt we are going to be needing 1/10th second slices for quite a few years, and by that time I expect the kernel to run something in idle time to auto-tune the slices for my current workload average. Remember, the higher HZ only improves "responsiveness"; it actually decreases system performance computation-wise. There is a specific number that is best for every system at any particular time, and going above or below that number hurts performance.
  • by Anonymous Coward
    My first thought is, "It's about time..." FreeBSD has had this for ages, and it struck me as strange that Linux was nailed to HZ=100 when I started porting some apps over.

    Among other things, streaming media is an important beneficiary of this change. Let's say you have a medium-bitrate video stream (about 2.5 to 5 megabits). That means that your packets should be spaced about 2 to 4 milliseconds apart. This is easy to schedule when your system has a 1 millisecond granularity, but is a disaster when your clocks are 10 milliseconds apart -- your packets end up going out in clumps. Your 100bT network may not care either way, but if you are pushing video over ADSL, 802.11b, or ATM, you may find your packets getting lost along the way.
    • That means that your packets should be spaced about 2 to 4 milliseconds apart. This is easy to schedule when your system has a 1 millisecond granularity, but is a disaster when your clocks are 10 milliseconds apart -- your packets end up going out in clumps. Your 100bT network may not care either way, but if you are pushing video over ADSL, 802.11b, or ATM, you may find your packets getting lost along the way.
      That's not true.
      When an application sends data over the network, it does a send() (or possibly a write()) on a socket. These are system calls, so the CPU switches context to the kernel, and the data sent by the program is placed in the kernel network buffers. Note that this happens immediately, without waiting for another timeslice.
      Then the kernel sends as much as possible (depends on the buffer size on the network card itself) of the data to the network card (after slapping on IP and TCP headers), after which the kernel returns to the application.

      Now comes the difference: you suggest that when the network card is done sending the data, it'll have to wait for the next timeslice (because then a context switch to kernelspace occurs and the kernel can do some work), but this is not true!
      When the network card is done sending the data, it immediately generates an interrupt (what do you think IRQs are for?). On interrupt, the CPU switches context to the kernel, and the kernel (still having the data to be sent in the network buffers) can immediately replenish the buffer on the network card, allowing packets to follow very closely on each other, regardless of timer granularity.


      By the way, somewhat modern network cards can burst packets. That is, they can receive a whole batch of packets from the kernel, which they will then send at the appropriate speed of the medium, so that not every packet will generate an interrupt. And that's a good thing (tm), because high interrupt loads (think towards 100,000 interrupts/sec for gigabit - without jumbo frames and bursts) are performance killers.
  • ...but Windows had this way back in '95. Ouch.
    • Yeah, but we're talking about a real O/S here. One of the reasons '95 crashed so much was because it didn't have virtual memory. Therefore the OS didn't have to page out all the memory at the context switch tick and they could afford to tick more often, because the costs were lower.

      I think they more than made up for it in reboot time ;-)

      • Therefore the OS didn't have to page out all the memory at the context switch tick and they could afford to tick more often, because the costs were lower.

        Wow. Are you saying that linux pages out the running process at every context switch? I think I might have found an explanation for X's choppiness.

        • Pretty much, although it's not paging to disk of course (unless you're really short of memory ;-) )

          It has to do stuff like that to keep the processes' address spaces separate- otherwise one rogue process would kill all the others, like in 95

      • Windows 95 absolutely does have virtual memory. (Are you thinking of Mac OS 9??) It's true that it crashed a lot, and that's because the protections afforded by a real OS were not in '95 (it was easy to turn off virtual memory protections and trample on the address space of another process). But each process definitely had its own virtual address space, and most of the things that a real OS does (page table, TLB, paging to disk, etc.) were in 95. I don't know what this business is about not having to page out all the memory -- I never saw the 95 source code but it probably does what any other real OS does: set the page table to the one of the process and flush the TLB.

        • Windows 95 absolutely does have virtual memory. (Are you thinking of Mac OS 9??)

          Mac OS 7 had virtual memory. It just wasn't protected virtual memory until Mac OS X.

          • I don't think this is true. What the classic Mac OSes called virtual memory wasn't really virtual memory like what I'm talking about. Yes, they had a menu item where you could make disk space into "virtual memory" (I'm not sure what this did, really), but processes still had one unified address space. (Why else did we have to set the amount of memory we wanted to allocate to each program?) It's not like they were using the MMU of the processor and actually doing virtual memory, but just had the protections turned off -- they were doing a software simulation of some aspects of VM (like they simulated multitasking, for instance). It wasn't really VM.

  • Way to go. Any binary that used the 'HZ' variable (a constant defined in a header file) will need to be recompiled for these new kernels. Way to go, Linux. Keep it up.
  • Robert Love will be giving a talk on 2.5 and the preemption patches at the Southern California Linux Expo.
    If you use the promo code: F633F you can get into the expo free.
  • If turning up the system tick rate helps, that's an indication of one of two problems.
    • Something is polling that should be event-driven. Some applications (Older versions of Netscape come to mind) like to do something on every tick. (For Netscape, that was a lousy architectural decision made so it would work on the classic MacOS and 16-bit Windows.) There are also some really crappy interprocess communication systems that are polled. Find and fix.
    • Thread scheduling priorities are wrong. This is a subtle issue, but basically, the threads that aren't CPU bound but have tight latency requirements have to have priority over the threads that are CPU bound and don't have tight latency requirements. Smarter schedulers try to achieve this automatically, but some of the guesses made in the UNIX world are tied, for historical reasons, to the TTY end of the system and are no longer appropriate.
    A useful exercise is to turn the tick rate way down (maybe 1HZ) and put a compute loop job in the background. Everything that's broken according to the above criteria will turn into a toad, which helps debug the problem.
