FBFS final benchmarks available

Why do I call these benchmarks final?

The GSoC 2011 program ends in a few days, so I am now publishing some benchmarks of the scheduler I have been developing. I need to take a break from working on this scheduler, but that does not mean the project is over for me. I hope to continue working on it in about half a year, perhaps as a university project.

What can you see in the benchmarks?

I think the benchmark on the 2-CPU machine shows that on a machine with a small number of CPUs, this scheduler is quite competitive with the other two FreeBSD schedulers. However, as you can see in the 8-CPU machine benchmark, 8 CPUs are already too many for the scheduler. The benchmark in a virtualized environment also shows that the scheduler performs poorly when running in virtual machines.

I think the reason the scheduler performs so poorly on the 8-core machine is lock contention, and I believe I could improve this area. For example, when adding a new thread to the scheduler, I probably do not need to grab the global lock for the whole scheduler. Because a new task is always inserted at the end of the list, it would probably suffice to lock only the last thread structure in the scheduler queue. I think I could find a few other places where such increased locking granularity could improve scheduler performance.
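
To illustrate the idea, here is a minimal, hypothetical sketch of such a tail-only enqueue. None of these names (fbfs_queue, fbfs_node, tail_lock, fbfs_enqueue_tail) come from the actual FBFS code; the point is only that the critical section shrinks from the whole scheduler to the queue tail.

/* Hypothetical sketch, not the actual FBFS code. */
struct fbfs_node {
        struct thread           *td;    /* the queued thread */
        struct fbfs_node        *next;
};

struct fbfs_queue {
        struct fbfs_node        *head;
        struct fbfs_node        *tail;
        struct mtx              tail_lock;      /* protects tail and tail->next only */
};

static void
fbfs_enqueue_tail(struct fbfs_queue *q, struct fbfs_node *n)
{
        n->next = NULL;
        mtx_lock(&q->tail_lock);        /* much narrower than a global scheduler lock */
        q->tail->next = n;              /* new threads always go to the end */
        q->tail = n;
        mtx_unlock(&q->tail_lock);
}

A real implementation would of course also have to keep the readers and the dequeue path consistent with this, which is where the remaining complexity lies.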

FBFS in action

Fidaj has created a short video that shows the scheduler in action on FreeBSD 9.0-Beta1. Enjoy it here.

Sysbench Postgresql benchmark on a 2 CPU machine

[benchmark graph: sysbench on a 2-CPU machine]

Sysbench Postgresql benchmark on a 8 CPU machine

[benchmark graph: sysbench on an 8-CPU machine]

Sysbench Postgresql benchmark on a 4 CPU virtual machine

[benchmark graph: sysbench on a 4-CPU virtual machine]

Problem with music solved

Thanks for testing my work

First, I would like to thank everyone who has tested my work. I especially want to thank Roman Bogorodskiy and Arne Meyer, who were the first to notice that my scheduler has problems playing audio under high CPU load. There is a good chance I would not have noticed it myself, because I do not usually listen to music while coding. And when I am running FreeBSD, I am usually coding:-)

What was the problem

The problem with the scheduler was that if you were playing audio and running some CPU-intensive applications at the same time, the music would become highly distorted – kind of slowed down.

Let’s now have a look at an application that plays music from the scheduler’s point of view. For the scheduler it is just an application that sleeps most of the time. Sometimes it wakes up, probably sends some data to the audio card and then quickly goes to sleep again.

The first thing I did was add some logging to my scheduler code. I logged scheduler-relevant events such as context switches, thread wakeups and the like. Each log entry includes a timestamp, so I could see where the delays came from. Here is a short excerpt from the logs I got:

493778161690 "mpg123" wakes up on cpu 0
493778162080 ipi send to cpu 1
493778163940 ast on cpu 1
493778165150 "skuska" switch off cpu 1
493778166760 "irq256: mskc0" switch on cpu 1 with prio: 16
493778246160 "irq256: mskc0" switch off cpu 1
493778246490 "swi4: clock" switch on cpu 1 with prio: 52
493778249780 "initial thread" wakes up on cpu 1 with deadline 188337
493778252570 "swi4: clock" switch off cpu 1
493778253890 "swi6: Giant taskq" switch on cpu 1 with prio: 60
493778267110 "initial thread" wakes up on cpu 1
493778271830 "swi6: Giant taskq" switch off cpu 1
493778274240 "mpg123" switch on cpu 1 with prio: 76
493778281920 "mpg123" sleep on cpu 1
493778283780 "mpg123" switch off cpu 1

The first column contains a nanosecond timestamp. skuska is the name of a CPU-intensive program I was running, and mpg123 was my music playing application.

This is what happened according to the log. First we see that the mpg123 application wakes up on cpu 0. Because the scheduler knows that this application last ran on cpu 1, and cpu 1 is currently running the skuska application, which has neither a high priority nor a small virtual deadline, it sends an inter-processor interrupt (IPI) to cpu 1. The third line shows that IPI being processed on cpu 1. The next line shows that the skuska application is descheduled – taken off the CPU. So far so good; there are no big delays between these events.

However, next we see that instead of the mpg123 application, the scheduler first chooses some high-priority kernel threads (irq256, swi4 and swi6). Only after these threads have run does the mpg123 application get the CPU. This is the cause of the delays in audio playback.

So this is the bug: some high-priority threads get queued up in the scheduler instead of being run immediately after they become schedulable. Then, when our music application (or some other interactive process) wakes up and preempts a thread, it does not get scheduled, because the queued high-priority kernel threads take precedence.

How I fixed the bug

Once I knew what the problem was, it was easy to fix. Now, whenever a thread is inserted into the scheduler, the scheduler compares the priority of that thread with the priorities of the currently running threads. If the new thread has high priority, it is not just registered in the scheduler data structures; it is sent directly to a CPU.
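
Here is a minimal sketch of that logic, assuming the fix works as described above. The fbfs_* functions are hypothetical names; pc_curthread, td_priority and the pcpu list are real FreeBSD kernel interfaces.

static void
fbfs_add_thread(struct thread *td)
{
        struct pcpu *pc;

        fbfs_enqueue(td);               /* register in the scheduler data structures */

        SLIST_FOREACH(pc, &cpuhead, pc_allcpu) {        /* walk all CPUs */
                struct thread *cur = pc->pc_curthread;

                /* A lower priority value means a higher priority in FreeBSD. */
                if (cur != NULL && td->td_priority < cur->td_priority) {
                        fbfs_preempt_cpu(pc->pc_cpuid); /* reschedule that CPU now */
                        break;
                }
        }
}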

Feedback is welcome

Although I think the bug is fixed, it would be nice if somebody could confirm it in the discussion. You can download the patch from the Latest patch available section. Also please note in the discussion of that section that fidaj has ported the scheduler to the CURRENT branch. I would like to thank him here. And if he is so kind as to create a new patch with this bug fixed, I will be happy to include it in the Latest patch available section so that people can find it easily.

Status report

What I have done last week

Last week I was working on the cpuset functionality. In FreeBSD, you can restrict a process to a mask of CPUs on which it may run. For more information, see the cpuset(1) man page. For example, on my home PC I can run the following command:

sudo cpuset -c -l 1 ./binary

This will restrict the binary process to run only on the CPU with ID 1. I can verify this in GNOME by viewing the Resources tab of the System Monitor.

What I will do this week

Some of you have reported problems with listening to music under high CPU load, and I was able to reproduce this behaviour. I think the problem might be connected with another issue: I suspect that the preemption mechanism in my scheduler does not currently work as expected. So I will now make some more precise measurements focused on tracking down how the preemption mechanism in my scheduler really works – not just how I intended it to work.

Depending on the results, I will either try to fix it or find that there is nothing to fix.

Per-process CPU accounting

How per-process CPU-usage percentage is computed.

Every hz tick, the scheduler increments a CPU tick counter of the currently running thread. This tick counter is used to estimate the CPU usage percentage of every thread. The scheduler works with a 5-second timing window, and the CPU usage percentage is computed over this window. That means that if a thread has its tick counter equal to half of 5 * hz, the CPU usage for that thread is reported as 50%. (hz is the frequency of the system clock.)

Let us now have a look at how this tick counter is updated. There are two ways this counter can change:

  1. Every hz tick, the value of the CPU tick counter of the currently running thread is compared with the value 5 * hz. If the counter value is less, it is incremented.
  2. If a context switch happens from an old thread to a new thread, the timestamp of the context switch is saved in the old thread. When a context switch later brings the old thread back, its counter is decremented by the length of time the thread was off the CPU.

So the whole algorithm is quite simple compared to the 4BSD scheduler's. The estimate is not affected by CPU utilization prior to the last 5-second timing window; it is determined solely by the CPU usage within the latest window. When a thread is taken off the processor for some time, the counter for that thread is decremented proportionally to that time. This should lead to the expected behaviour: threads not utilizing the CPU get 0%, and a thread using almost all CPU time gets 100%.
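
A minimal sketch of this accounting, assuming the 5-second window described above; the struct and function names are hypothetical, not the actual FBFS fields:

/* Hypothetical per-thread accounting state; hz is the system clock frequency. */
struct thread_acct {
        int     ta_cputicks;    /* ticks used within the 5-second window */
        int     ta_switchtime;  /* tick timestamp when switched off the CPU */
};

#define WINDOW_TICKS    (5 * hz)

/* Called every hz tick for the currently running thread (way 1 above). */
static void
fbfs_tick(struct thread_acct *ta)
{
        if (ta->ta_cputicks < WINDOW_TICKS)
                ta->ta_cputicks++;
}

/* Called when a thread is switched back onto a CPU (way 2 above). */
static void
fbfs_switch_in(struct thread_acct *ta, int now_ticks)
{
        ta->ta_cputicks -= now_ticks - ta->ta_switchtime;
        if (ta->ta_cputicks < 0)
                ta->ta_cputicks = 0;
}

/* The reported CPU usage percentage over the window. */
static int
fbfs_pctcpu(const struct thread_acct *ta)
{
        return (ta->ta_cputicks * 100 / WINDOW_TICKS);
}

Note how a thread that slept for the whole window ends up at 0 ticks (0%), while a thread that ran the whole window saturates at 5 * hz ticks (100%).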

This is not how the ULE or 4BSD schedulers handle it. I first thought I would implement it the way ULE does, because it seemed quite simple at first sight. But when I looked at it in more detail and started thinking about how and when the parameters involved can change, I decided to use an even simpler approach – and that's it.

What I will be working on this week.

This week I want to add the cpuset functionality to my scheduler. In FreeBSD, you have the ability to restrict certain tasks to certain CPUs. This scheduler currently ignores that, so I want to add support for it.

Kernel panic bug fixed

The bug

Last week I was dealing with a bug that caused a kernel panic during boot. However, it showed up only on the amd64 platform. The problem can be quickly summed up as a “null pointer dereference”. It happened when a sleeping thread woke up. At that point, a linked list of all CPUs was traversed, and for every CPU I looked at the location the pc_curthread pointer was pointing to. The pc_curthread pointer is part of struct pcpu, which holds per-CPU information – such as a pointer to the thread currently running on that particular CPU. The problem was that sometimes this pointer was null. So as a bug fix, I just added a test for the null pointer.
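
The fix, in a hedged sketch (the traversal uses the real FreeBSD pcpu list and pc_curthread field; the loop body is hypothetical):

struct pcpu *pc;

SLIST_FOREACH(pc, &cpuhead, pc_allcpu) {        /* walk the list of all CPUs */
        struct thread *cur = pc->pc_curthread;

        if (cur == NULL)        /* the added fix: skip a CPU with no curthread yet */
                continue;
        /* ... inspect cur's priority and virtual deadline ... */
}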

The reason it took me so long to find the source of this bug is that in the FreeBSD kernel there must be some memory mapped at virtual address zero, so dereferencing a null pointer does not immediately cause a panic. (It is probably also important to note that I was only reading from that memory. Had I tried to write there, I would probably have been killed immediately.) The kernel panicked only later, when the error propagated further into other kernel interfaces.

What I am working on now

Now I am working on per-process CPU accounting information, so that the percentage column in the top command output shows a reasonable value. This is done differently in the 4BSD scheduler and in the ULE scheduler. Currently it looks like I will follow ULE's approach.

FreeBSD FBFS live DVD image is available now

The FBFS live DVD

Important note !

The published live DVD image freezes when starting the GNOME desktop environment if you run it on real hardware. It works, however, if you run it in VMware Player. I do not have enough time right now to explore the issue, so if you really want to try the FBFS scheduler with an X desktop environment, you need to follow the tutorial in the second half of this page and compile FBFS into the kernel manually.

End of important note !

I have published on the internet a bootable live DVD image of FreeBSD with the FBFS scheduler, so now you can try it easily:-) The FreeBSD image on this DVD contains an X environment – a slightly stripped-down version of the GNOME desktop environment. The compressed image can be downloaded from here. Just download it, uncompress it and burn the ISO image.

I have created on it a user with unprotected root access. The username is fbfs and the password is also fbfs.

This DVD image contains a slightly lighter version of FreeBSD. I removed the locales and many GNOME applications to make it lighter and thus easier for me to publish and for you to download.

The downloaded ISO image is a live DVD, which means you can boot your computer from it and run the supplied operating system without installing anything on your disk. This makes it easy to try my new scheduler.

Manually compiling the FBFS

First, a quick warning. Here I give a quick guide on how to compile the FBFS scheduler into FreeBSD. Although I have tried to do my best, it is possible that the following procedure could make your system unstable or even unbootable. So use common sense and do not apply it to production systems:-) The scheduler is still under development.

I assume you are running the 8.2-RELEASE version of the FreeBSD operating system. It is the latest version available for download from the official FreeBSD web page. You can get it from here.

The first thing you need to do is download the FreeBSD sources for the RELENG_8_2 version. One possible way to do this is with the csup utility. Make sure you have enough free space in the /usr filesystem and download this supfile. Then edit the file and change its first line

*default host=cvsup.cz.FreeBSD.org

to specify a host near your geographic location. The list of available hosts is here. So after editing, the first line could look like this if you are in France:

*default host=cvsup.fr.FreeBSD.org

Now execute the following command as root:

# csup src-supfile

where src-supfile is the name of the downloaded supfile. This will download the required sources for you. Wait until the download finishes; it will take a while.

Now you need to download the FBFS patch to the FreeBSD kernel from here. Download it to /usr/src. Then apply the patch with the following commands as root:

# cd /usr/src
# patch -p1 < fbfs_midterm.patch

where fbfs_midterm.patch is the name of the downloaded patch file.

The next thing you need to do is configure your kernel to use the FBFS scheduler. Edit your kernel configuration file (for example /usr/src/sys/i386/conf/GENERIC) and locate this line:

options SCHED_ULE # ULE scheduler

and change it to this:

options SCHED_FBFS # FBFS scheduler

Now you need to compile and install the patched kernel. For this, you should read this carefully. To build and install the new kernel, issue these commands as root (replace GENERIC with the name of the kernel config file you use):

# cd /usr/src
# make buildkernel KERNCONF=GENERIC
# make installkernel KERNCONF=GENERIC

Now reboot and try the FBFS scheduler.

What is still missing

If you run my scheduler, you may notice that some functionality you are used to is missing. Currently, no per-process CPU accounting information is delivered to user space, so when you run the top command, all processes will show 0% CPU utilization. The current FBFS implementation also lacks the cpuset functionality. I will add these features next.

Feedback is welcome

Any feedback is welcome:-)

Status report

What I have done this week

When selecting an idle CPU that should wake up because a thread is waking up, I now consider the CPU group of the CPU where the waking thread last ran. CPU topology is handled in FreeBSD by CPU groups. The idea is simple – cores of a single CPU are in the same group, and the groups are organized hierarchically, because CPU caches are hierarchical. The CPUs in a single group usually share some level of cache. So what I do is basically this: I take the CPU where the thread last ran and check whether there is any idle CPU in its CPU group. If yes, I wake that idle CPU. If not, I check the same thing one level higher in the hierarchy, and so on, until I either find a suitable idle CPU or reach the top of the CPU group hierarchy. A similar concept is used in the ULE scheduler, and that is where I tried to figure out how this CPU topology API works in the FreeBSD kernel.
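
A minimal sketch of that upward walk, assuming the FreeBSD 8-era topology API (struct cpu_group with its cg_mask and cg_parent fields is real; the function name and the idle mask argument are hypothetical):

static int
fbfs_find_idle_cpu(struct cpu_group *cg, cpumask_t idle)
{
        while (cg != NULL) {
                cpumask_t mask = cg->cg_mask & idle;    /* idle CPUs in this group */

                if (mask != 0)
                        return (ffs(mask) - 1);         /* found one: its CPU id */
                cg = cg->cg_parent;                     /* widen: one level up */
        }
        return (-1);                                    /* no idle CPU anywhere */
}

The walk starts at the group of the CPU the thread last ran on, so the first hit is also the CPU most likely to still hold the thread's cache footprint.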

I have also taken a look at DTrace in FreeBSD. It is a wonderful tool that lets me analyze how much time the kernel spends in the various functions of my scheduler. I have started learning this tool and made some example scripts. I think I will eventually publish here an analysis of how much time the kernel spends in the functions of the FBFS scheduler and in the corresponding functions of the ULE scheduler.

The midterm evaluation of my GSoC work is due next week, so I am publishing a patch to the FreeBSD kernel that contains my work so far. I am currently working on a live CD that will contain my scheduler already compiled into the kernel; I will publish it here in a few days, and I will then also write a quick guide on how to apply my patch to the FreeBSD kernel. For now, I will just say that the patch applies to the FreeBSD kernel from the stable branch with tag RELENG_8_2. The patch is here. So if you are confident enough, you can try it on your own, or wait a few days until I finish work on the live CD – then you can try the CD or read my quick guide on applying the patch. Any feedback will be welcome:-)

What I will do next week

First I will finish work on the live CD with the latest version of the FBFS scheduler. Then I will run benchmarks similar to those I ran at the beginning of the project and publish the results here.

When threads wake up…

How I handle threads that wake up.

When a thread wakes up, I first put it back on the appropriate priority run queue. Then I check the CPU where the thread was last running. If this CPU is currently running a lower-priority thread, or a thread with the same priority but a later deadline, an asynchronous system trap (AST) is sent to that CPU. The next time this CPU tries to return from the kernel to user space, a context switch is executed, and the CPU switches to the best thread as determined by the BFS algorithm.

If the last CPU the waking thread was running on runs a higher-priority thread, I look for idle CPUs instead and send an AST to every idle CPU. This causes them to reschedule on their next return from the kernel, which will probably happen on the next timer interrupt.

If there are no idle CPUs, I go through all the CPUs present in the system and check whether the thread that is waking up should preempt the thread currently running on that CPU. If such a CPU is found, I send an AST to it.
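
Putting the three cases together, here is a hedged sketch of the wake-up path; all the fbfs_* helpers are hypothetical names, while td_lastcpu, pc_curthread and the pcpu list are real FreeBSD kernel interfaces:

static void
fbfs_thread_wakeup(struct thread *td)
{
        struct pcpu *pc;

        fbfs_runq_add(td);                      /* back on its priority run queue */

        pc = pcpu_find(td->td_lastcpu);         /* case 1: the CPU it last ran on */
        if (fbfs_should_preempt(td, pc->pc_curthread)) {
                fbfs_send_ast(pc);              /* reschedule on next kernel exit */
                return;
        }

        if (fbfs_ast_idle_cpus())               /* case 2: AST every idle CPU */
                return;

        SLIST_FOREACH(pc, &cpuhead, pc_allcpu) {        /* case 3: scan all CPUs */
                if (fbfs_should_preempt(td, pc->pc_curthread)) {
                        fbfs_send_ast(pc);
                        return;
                }
        }
}

fbfs_should_preempt() would return true when the running thread has a lower priority, or the same priority but a later virtual deadline, as described above (and it would have to tolerate a NULL curthread).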

The reason for all this is to improve the responsiveness of the FBFS scheduler. Although it is hard to measure exactly, the improvement can be noticed when my laptop is under a heavy workload and I try to do something in the GUI, at least when compared against previous versions of the FBFS scheduler. The ULE scheduler, however, still shows better responsiveness than the latest version of FBFS.

What I will do next week.

In the previous status report blog post, I mentioned how the BFS scheduler makes use of the CPU topology map when choosing an idle CPU on which a waking thread should run. A similar concept is also used in the ULE scheduler. So next, I will look at the ULE code and see how it uses the functions provided by the kernel that describe the CPU topology of the system. I will then use this information when choosing which idle CPU should run a thread that is just waking up.

Status report

What I have done this week

Threads now remember how much of their time slice remained from their previous runs. A thread is preempted as soon as the remaining part of its time slice expires. This is different from the previous implementation, where a thread was always assigned a full new time quantum whenever it was switched onto a CPU.

The virtual deadline mechanism now applies both to the time sharing and idle threads.

What I will do next week

When a thread wakes up, it is simply put on the global run queue. However, BFS first checks whether there is a suitable idle CPU, or a CPU that is running a lower-priority thread than the thread that just woke up. If such a CPU exists, an inter-processor interrupt is sent to it, which causes that CPU to reschedule.

If there are several idle CPUs, BFS does not choose one at random, but according to CPU cache locality. When BFS is initialized, it builds a topology map of the CPUs present in the system, and the most suitable idle CPU is chosen according to this map. I will also implement this part in the FBFS scheduler, but probably not next week.

Implementing virtual deadline into the FBFS

Intro

When the scheduler decides that the currently running thread has been on the CPU long enough, it selects a new thread to run. In BFS, for time-sharing threads, the decision about which thread should run next is based on a special number associated with every thread. BFS refers to this number as the virtual deadline. An O(n) lookup is performed to find the thread with the earliest virtual deadline.

Computing the virtual deadline for a thread

BFS computes a new virtual deadline for a thread only when the thread runs out of its time slice. For a thread with nice value -20, the new deadline is computed as follows:

new_deadline := now + rr_interval

where rr_interval is the time quantum a thread is allowed to run on the CPU before it is preempted, and now is the current time. When a virtual deadline is calculated for a task with a nice value other than -20, the rr_interval in the formula above is increased by 10% for every nice level above -20. So for a thread with the default nice value 0, which is 20 levels above -20, the virtual deadline is computed as follows:

new_deadline := now + rr_interval * 1.1^20

and that is effectively:

new_deadline := now + rr_interval * 6.727
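
As a quick check of the arithmetic, here is a small, self-contained C program that mirrors the formulas above (the 6 ms rr_interval used in the example is BFS's default, but treat the exact value as an assumption here; BFS itself uses an integer ratio table rather than pow()):

#include <stdio.h>
#include <math.h>

/* Mirror of the deadline formula above; illustrative only. */
static double
new_deadline(double now, double rr_interval_ms, int nice)
{
        int levels = nice + 20;         /* nice levels above -20 */

        /* every nice level scales the quantum by another 10% */
        return (now + rr_interval_ms * pow(1.1, levels));
}

int
main(void)
{
        printf("nice -20: %.3f ms\n", new_deadline(0.0, 6.0, -20)); /* 6.000 */
        printf("nice   0: %.3f ms\n", new_deadline(0.0, 6.0, 0));   /* 40.365 */
        return (0);
}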

An important thing to note about the virtual deadline calculation is that it does not take into account the number of threads in the time-sharing queue. It can therefore assign deadlines that are impossible to all meet if every thread uses up its whole time quantum. BFS does not guarantee that a given task will be scheduled by the time its virtual deadline passes; it uses the virtual deadline merely in the selection process, to choose the next thread to run.

BFS and interactivity

What I personally like most about the BFS design is how it handles and favors interactive processes. If you look at the way the original 4BSD scheduler calculates its CPU usage estimates and updates thread priorities based on those computations, I think you will also like the way BFS handles it:-)

BFS simply needs to do nothing. By interactive threads, I mean threads that sleep a lot. If a thread sleeps long enough, its virtual deadline will already have passed by the time the thread wakes up. Its deadline will be early compared to the other (non-interactive) threads, and therefore BFS will choose to run it sooner.

What I still need to implement

In the previous paragraphs I described how I understand the BFS virtual deadline mechanism. I have all this already implemented in the current version of the FBFS scheduler. There are, however, some other important points that are relevant to the virtual deadline mechanism and that still need to be implemented.

These are:

  • When BFS chooses the next thread to run and looks for the thread with the earliest virtual deadline, it stops the O(n) lookup as soon as it finds a thread whose deadline has already passed (see the sketch after this list).
  • When a thread in BFS does not use up its whole time quantum, it runs only with the remaining quantum the next time it is scheduled. My FBFS implementation, which is based on the 4BSD scheduler, assigns a thread a full time quantum every time it is scheduled.
  • In BFS, the virtual deadline mechanism applies not only to time-sharing threads but also to idle threads.
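
Here is a hedged sketch of the first point, the early-terminating O(n) scan; the node structure and function name are hypothetical:

struct fbfs_node {
        struct fbfs_node        *next;
        unsigned long           deadline;       /* virtual deadline, lower = earlier */
};

static struct fbfs_node *
fbfs_pick_next(struct fbfs_node *head, unsigned long now)
{
        struct fbfs_node *best = head;
        struct fbfs_node *n;

        for (n = head; n != NULL; n = n->next) {
                if (n->deadline <= now)         /* deadline already passed: stop early */
                        return (n);
                if (n->deadline < best->deadline)
                        best = n;
        }
        return (best);                          /* otherwise earliest deadline wins */
}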

I will try to address these points next week.
