Xen

From Noah.org
Revision as of 07:53, 18 June 2014 by Root


I run a few large Xen hosts. My five largest hosts can carry around 100 guests each. Each of these hosts has 24 cores, up to 384 GB of RAM, 10 TB of storage in a 20-disk RAID 10 array, and four gigabit network interfaces. I have stressed the systems with over 200 guests running on a single machine, but the performance of individual guests was too slow to be useful. After a few hours guests would randomly begin to show reliability problems (disk iowaits would cause services to crash, and virtual network interfaces would become corrupt and bring down the entire network layer). I found that it is difficult to keep more than 120 guests running reliably. In day-to-day operations I keep around 90 guests up at once on a single host.

It's difficult to find training, documentation, or even reports of others configuring such large hosts. A bigger Xen host just makes debugging even more tricky. I can't even fully utilize the RAM on my hosts that are set up with 384 GB. Since I am practically limited to running about 90 guests, and they average about 2 GB of RAM each, I can use only about half the RAM. I have been slowly redeploying the unusable RAM to other servers.

Having so much RAM and so many cores spread over such a large number of guests means there are lots of places for faults to hide. Faults build up slowly before they become visible problems. Guests can run for days with no trouble and then suddenly have problems with no previous warnings in dmesg or /var/log/. Stress testing likewise takes much longer. Even something as simple as memtest can take so long that it isn't practical. Also, artificial stress test loads never quite replicate the obscure, subtle bugs you get after running real loads for a few days or weeks. I have not found any stress testing tools specific to virtual machine environments. My own test tools usually amount to spinning up hundreds of guests at once and then running cpuburn, stress, or any of a number of other micro-benchmarks and stress tools.

It's harder to test networking, yet this is often where I see the most problems. The total network bandwidth through the host may actually be moderate, but the number of interrupts hitting a device driver in dom0 may be so great that it falls down. The irqbalance daemon can help. Again, it usually takes a few days for network problems to show up, which is not amenable to regression testing.

A similar problem happens with the virtual disk block driver. The dom0 disk throughput may be moderate, but lots of IO operations multiplied by a hundred guests can cause the virtual block driver in dom0 to exceed timeouts. I found that tuning the host to handle guest block IO is the most difficult problem to overcome. In this case I take a strategy opposite to the low latency, low overhead strategy used to handle the homogeneous workloads typically found on large enterprise servers. With a virtual machine host I favor predictable and even IO handling at the expense of overall performance. Lower throughput and higher latency are tolerable as long as any individual guest can be guaranteed some minimal level of bandwidth and a predictable latency.

Large enterprise servers are often tuned the opposite way. They commonly use a NOOP scheduler. The problem I found with NOOP is that it's quite easy for a single guest to monopolize the block IO, with a strong negative effect on performance for all guests, including dom0. Furthermore, there is nothing that can be done to throttle the IO of a guest that is behaving badly; the only remedy is to shut the bad guest down. For Xen, I found that a CFQ or Deadline queue seems to give more even IO handling. CFQ also gives you the option of using ionice to change IO priority per guest. Unfortunately, this isn't perfect. I found that it can take 10 or 20 seconds for the scheduler to kick in and throttle a guest after it starts to overload IO. Also, it is not easy (if even possible) to translate a scheduling priority into a specific bandwidth limit. I have done very little testing with the deadline queue, but the theory behind it sounds promising.

In an ideal world the kernel could test the available IO bandwidth of a block driver and then allow the administrator to provision guests up to a hard bandwidth limit. This is not totally realistic because IO resources are almost always overcommitted -- especially on a virtual machine host. If a host is running 100 guests and all guests decide to start streaming bulk data to a file then we have to accept that the system will fail. Promising a minimum of 1% to each of 100 guests (not counting the kernel IO overhead needed to handle even that 1%) would be a worthless guarantee of service.

tools

xm top
xm list
xm uptime
xm dmesg
xm info

Problem: xen guests slow to a crawl for no reason

You may have restricted your guest to using only a few cores. You may have also accidentally pinned it to specific cores. This eventually leads to situations where the guest does not have enough "wiggle room" to get unstuck.

You may have a guest config file under /etc/xen/guest.cfg with lines something like this:

cpus = '0-3'
vcpus = '4'

The counter-intuitive solution is to let the guest have access to all physical cores. This example assumes you are running a 24-core host:

cpus = '0-23'
vcpus = '24'

The important thing is the cpus setting. You may set the vcpus setting lower, but my guests seem to run better with access to all CPUs. I found that it rarely makes sense to second-guess the kernel. It typically does a good job.

a note about GRUB_CMDLINE_XEN names in /etc/default/grub

I have seen documentation spell Xen boot options with and without underscores. I am not sure whether the system accepts both, or whether one style is a newer convention. Beware.

xen version headaches

Xen can be very finicky to get running. Generally later versions are better. This may seem obvious, but the difference can be extreme. Later versions are much easier to get working. The downside is that many tools for Xen are quite brittle and are strongly dependent on a specific version of Xen. Anything that depends on scripts in /etc/xen/scripts/ is bound to break between different versions of Xen. Unfortunately, lots of tools seem to have this weakness. One of the more popular tools for working with Xen, xen-tools, is particularly guilty of tight version coupling. It is also itself a very buggy tool. Beware.

logging

If you are in dom0 you can see the Xen Hypervisor's boot log by running xm dmesg.

When something goes wrong you want to collect as much information as early in the boot process as possible.

loglvl=all guest_loglvl=all sync_console console_to_ring lapic=debug apic_verbosity=debug apic=debug iommu=off com1=115200,8n1 console=com1 
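
To make these options stick, they can go on the hypervisor line in the GRUB config. This is a sketch for /etc/default/grub on Debian/Ubuntu systems; I trimmed the serial console options, which only make sense if you actually have a serial console attached. Run update-grub afterward.

```
GRUB_CMDLINE_XEN="loglvl=all guest_loglvl=all console_to_ring"
```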

what is the hostname of the dom0 of a guest?

Often, while logged into a guest, it is useful to know which dom0 is hosting it. There are tools you can install to get information like this, but one easy way to make this information available is to pass it on the kernel boot command-line when you boot the guest. You can edit your /etc/xen/guest-foo.cfg file to automatically pass this information on the kernel boot command-line. Add the following somewhere near the top of your guest's .cfg file.

# Inside the guest, do something like this to get the vmhost name:
#     VMHOST=$(sed -e 's/^.*vmhost=\([^[:space:]]*\)$/\1/' /proc/cmdline)
import os
extra = "vmhost=" + os.popen('hostname -f').read().strip()

Problem: dom0 can't handle too much memory

It may seem odd that your host can have too much RAM, but it seems that huge amounts of RAM will confuse the Xen dom0. In my case I was working with a server with 384 GB of RAM. The problem is that your physical machine has more memory than dom0 can handle. The solution is to restrict the amount of memory the Xen dom0 can use. This is set in the GRUB boot menu.

Here is what you are likely to see. You try to boot your Xen host and it locks up during boot with a message like this:

FATAL: Error inserting dm_mod (/lib/modules/2.6.32-5-xen-amd64/kernel/drivers/md/dm-mod.ko): Cannot allocate memory
done.
Begin: Waiting for root file system ... done
Gave up waiting for root device.

You can restrict the amount of RAM for dom0 by editing the grub.cfg or by editing /etc/default/grub on Debian/Ubuntu systems. I also like to pin a few cores for dom0. That is, I like to reserve CPU only for dom0 use.

The grub.cfg should have a line similar to this:

    multiboot   /xen-4.0-amd64.gz placeholder

It should be modified to something like this:

    multiboot   /xen-4.0-amd64.gz placeholder dom0_mem=8192M,max:8192M dom0maxvcpus=4 dom0vcpuspin

See also: http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning and http://wiki.debian.org/Xen#Other_configuration_tweaks

The exact operations you need to update grub.cfg will vary from platform to platform. On modern Ubuntu systems you will edit /etc/default/grub then run update-grub. On an ancient Debian 6 system I did this:

dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen
sed -i -e '$aGRUB_CMDLINE_XEN="dom0maxvcpus=4 dom0vcpuspin dom0_mem=8192M,max:8192M"' /etc/default/grub
update-grub
sed -i -e 's/(enable-dom0-ballooning .*)/(enable-dom0-ballooning no)/' -e 's/(dom0-min-mem .*)/(dom0-min-mem 8192)/' /etc/xen/xend-config.sxp
reboot

Problem: dom0 can't free RAM to run guests

You might see an error like this while starting a guest:

Error: Not enough free memory and enable-dom0-ballooning is False, so I cannot release any more.  I need 8421376 KiB but only have 130924.

Ballooning causes trouble on machines with lots of RAM, yet turning it off causes dom0 to take all the RAM for itself, leaving nothing for the guests. This is another instance where Xen behaves badly on large systems.

The fix is simple: set the Xen boot parameters in GRUB to limit the amount of RAM dom0 is allowed to use. See the section titled #Problem: dom0 can't handle too much memory.

The most annoying part is that part of the fix must be done in /etc/xen/xend-config.sxp and part of it must be done in the GRUB config. It seems like these fundamental memory parameters should all be in one place.

Problem: scrubbing free RAM takes forever

Scrubbing free RAM at boot is a security strengthening step, but on a host with huge amounts of RAM it can make the boot take forever. If your host is for your sole use then this security step can probably be skipped, which will significantly increase the boot speed. Add no-bootscrub to GRUB_CMDLINE_XEN.

GRUB_CMDLINE_XEN="dom0_max_vcpus=4 dom0_mem=4G,max:4G no-bootscrub"

Error: Dom0 dmesg log shows 'page allocation failure' or 'Out of memory: kill process:' or 'invoked oom-killer:' messages

Yes, these are vague symptoms, but I found that setting vm.min_free_kbytes to a higher value seemed to help. This may be partly precipitated by turning off dom0 ballooning and setting a fixed amount of dedicated memory. Note that this can happen even if dom0 has free RAM and swap. If you have lots of guests, I think their IO demands (disk and/or network) cause the dom0 kernel to run out of wiggle room. Edit /etc/sysctl.conf and set the following option to reserve 128 MB for the kernel.

vm.min_free_kbytes = 131072

You can update this live with the following command.

sysctl vm.min_free_kbytes=131072

Problem: dom0 takes forever to shutdown (XENDOMAINS_SAVE)

The problem is that by default Xen attempts to save the running state of each guest before the host is allowed to shut down. If you have hundreds of guests this will take a very, very long time. If you have more RAM than free disk on your dom0 then not only will host shutdown take a long time, it will also fill the disk. I almost never need this feature; when I shut down the host I usually don't care about the guests' running state.

Edit /etc/default/xendomains and set XENDOMAINS_SAVE to be empty. This controls the feature that allows Xen to save the guest's running state when dom0 is shutdown. Also set XENDOMAINS_RESTORE=false.

#XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SAVE=""
XENDOMAINS_RESTORE=false

Problem: xend won't start

The host and dom0 seem to boot fine, but guests cannot start because xend is not running. I found that this happened when my dom0 ran out of disk space. For me the solution was, "don't run out of disk space".

Problem: guest networking is erratic, part 1

If too many guests share the same bridged interface then their networking may become slow and erratic. This can happen even if the total traffic over the dom0 physical interface is low, although the problem shows itself more frequently when traffic load is high. I am not sure of the cause. It may be due to congestive collapse, or the dom0 virtual networking drivers may have undocumented limitations, or the Linux virtual bridge driver may have undocumented limitations. I found no unusual syslog or dmesg messages, although I have not done extensive testing, so I may have missed something.

This erratic networking problem only seems to happen when over 100 guests are on the same bridge. Splitting the guests across different bridges, each with its own physical interface, eliminates this problem. (It is possible that creating many separate virtual bridges assigned to the same physical interface would also eliminate the problem, but I have not tested this.) I budget about 64 guests per bridge (this is purely a voodoo number with no testing to support it). Assigning guests to different bridges creates an extra task in provisioning and managing a dom0 host, but the task is not difficult. Each /etc/xen/guest.cfg file must be configured to associate it with one of the available bridges. I have not found a way to automatically spread the guests evenly over the available bridges. So far my only solution is a simple script that rewrites each .cfg file to reassign guests to bridges round-robin style. This requires the guests to be restarted.
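
The rebalancing amounts to something like the following sketch (the bridge names, the cfg layout, and the function name are my own illustration, not a standard tool). It rewrites the bridge= field of the vif line in each guest cfg, cycling through the available bridges:

```python
import re
from itertools import cycle

def rebalance(cfg_texts, bridges):
    """Reassign each guest cfg to a bridge, round-robin.

    cfg_texts is a list of guest .cfg file contents; bridges is a
    list of bridge names like ['xenbr0', 'xenbr1'].  Returns the
    rewritten cfg contents in the same order.
    """
    bridge_pool = cycle(bridges)
    rewritten = []
    for text in cfg_texts:
        bridge = next(bridge_pool)
        # Rewrite the bridge= field inside the vif line(s).
        rewritten.append(re.sub(r"bridge=[\w.-]+", "bridge=" + bridge, text))
    return rewritten
```

Write the results back to the .cfg files, then restart the guests so the new vif settings take effect.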

I saw this problem mainly with Xen version 4.0.1.

Problem: guest networking is erratic, part 2

I also found that erratic networking may occur when dom0 runs low on disk space. I have not tested this and I have only seen this problem a few times since it is unusual for dom0 to run low on disk. I noticed this after restarting dom0 with XENDOMAINS_SAVE turned on. The memory images of all the guests exceeded the dom0 disk capacity. After rebooting the dom0 I paid no attention to its boot messages, but soon I noticed that its guests seemed to have network troubles.

See also #Problem: dom0 takes forever to shutdown (XENDOMAINS_SAVE).

ERROR: dmesg says, xen_balloon: reserve_additional_memory: add_memory() failed: -17

You see endless messages in dmesg that look something like this:

[9185884.234147] System RAM resource [mem 0x188000000-0x18fffffff] cannot be added
[9185884.234154] xen_balloon: reserve_additional_memory: add_memory() failed: -17

These messages appear to be harmless. These errors are associated with Xen 4.2.0 and Xen 4.2.1.

The root cause appears to be that Xen sucks. There is a dirty hack that seems to fix this; however, I don't understand how or why it works. I don't normally advocate mysterious hacks to silence error messages (does this actually fix anything or just gag the system's screams for help?). Run the following (this is not persistent across reboots):

cat /sys/devices/system/xen_memory/xen_memory0/info/current_kb > /sys/devices/system/xen_memory/xen_memory0/target_kb

Error: physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.

If you see the following message in dmesg or /var/log/kern.log

Error: physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.

then you probably need to patch /etc/xen/scripts/vif-common.sh and edit the function frob_iptable() so that it looks like the function below. You need to add the --physdev-is-bridged option to iptables in two places.

frob_iptable()
{
  if [ "$command" == "online" ]
  then
    local c="-I"
  else
    local c="-D"
  fi

  iptables "$c" FORWARD -m physdev --physdev-is-bridged --physdev-in "$vif" "$@" -j ACCEPT \
    2>/dev/null &&
  iptables "$c" FORWARD -m state --state RELATED,ESTABLISHED -m physdev \
    --physdev-is-bridged --physdev-out "$vif" -j ACCEPT 2>/dev/null

  if [ "$command" == "online" -a $? -ne 0 ]
  then
    log err "iptables setup failed. This may affect guest networking."
  fi
}

install Xen on Ubuntu

aptitude -q -y install xen-hypervisor-4.1-amd64 xen-tools xen-utils-4.1 xenstore-utils  blktap-utils

generic /etc/default/grub settings

This is a good starting place for values for grub in /etc/default/grub. I show only the values that I typically change. After modifying this file you need to run update-grub.

GRUB_DEFAULT=3
GRUB_HIDDEN_TIMEOUT_QUIET=false
GRUB_TIMEOUT=10
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="apparmor=0"
GRUB_DISABLE_OS_PROBER=true
GRUB_CMDLINE_XEN="dom0_max_vcpus=4 dom0_vcpus_pin  dom0_mem=4G,max:4G bootscrub=false"
GRUB_CMDLINE_XEN_DEFAULT=""

resource limits and quotas

Under VMware this is called Storage I/O Control (SIOC), which allows a manager to set IOPS limits per guest. VirtualBox also has an IO bandwidth control feature (see VBoxManage bandwidthctl). KVM provides several types of IO throttling through the blkio subsystem of cgroups (two forms of policy limits are available); the block layer of QEMU also has a throttling option. Xen does not have a similar disk IO limit feature out of the box, but it can make use of one of the same mechanisms that KVM uses if the dom0 kernel has the option enabled. There is an extension to Xen called Xen Cloud Platform (XCP), which may provide IO bandwidth control (I have not used it).
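
As a sketch of the cgroup mechanism (I have not tested this under Xen; the group name, PID, and limit are placeholders, and 8:0 is the major:minor of sda): with CONFIG_BLK_DEV_THROTTLING enabled in the dom0 kernel, the blkio controller can cap bandwidth per device. This works for userspace processes such as tapdisk; kernel blkback threads may refuse to be attached to a cgroup.

```
mkdir /sys/fs/cgroup/blkio/guest111
# Cap reads from /dev/sda at 10 MB/s for this group.
echo "8:0 10485760" > /sys/fs/cgroup/blkio/guest111/blkio.throttle.read_bps_device
# Attach the guest's tapdisk PID (placeholder) to the group.
echo 2162 > /sys/fs/cgroup/blkio/guest111/tasks
```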

Check which scheduler is being used. The Credit Scheduler should be used.

xm dmesg | grep scheduler

Reserve one core for Dom0. All other DomU guests can pick from the rest. This is to ensure that Dom0 always has enough CPU to handle IO because all IO goes through the Dom0 virtual IO layer (disk and network).
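
In practice that means pinning dom0's vcpus with the GRUB options shown elsewhere on this page and keeping guests off those physical cores in each guest cfg. A sketch for a 24-core host, assuming the dom0_max_vcpus=4 dom0_vcpus_pin settings from the generic grub section (which pin dom0 to cores 0-3); note this is in tension with the "let guests see all cores" advice above -- here the trade is a little guest CPU for predictable dom0 IO handling:

```
# /etc/default/grub (then run update-grub)
GRUB_CMDLINE_XEN="dom0_max_vcpus=4 dom0_vcpus_pin dom0_mem=4G,max:4G"

# /etc/xen/guest.cfg -- keep guests off the dom0 cores
cpus = '4-23'
```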

Limiting block IO for guests

It's easy for a single guest to overload the block IO for all the other guests. The following shows a way to limit how much block IO priority a guest can take by setting the dom0 IO scheduler to cfq. Note that you may also want to test setting the dom0 IO scheduler to deadline, but that's a different path to explore.

Note that these instructions assume phy (physical) block devices for the guests, not tap:aio (tapdisk) devices. The method below should work similarly, but you will need to figure out the PID of the tapdisk driver associated with the given guest. Apparently mapping the Xen Guest ID to the tapdisk PID is not as easy.

Set the dom0 host IO scheduler to CFQ. This is usually not the default for a Xen host because CFQ is not as efficient for high traffic servers; however, in this case we actually want a scheduler that handles workloads more like a desktop system (gracefully handling lots of unrelated IO requests with a fair distribution of bandwidth). Note that the following only temporarily sets the scheduler. You must edit /etc/sysfs.conf or /etc/sysfs.d/60-local.conf to make this change persistent across reboots (add the line block/sda/queue/scheduler = cfq).

echo "cfq" > /sys/block/sda/queue/scheduler

Identify the Xen Guest ID for the guest to be limited. In this example, the ID is 111.

# xm list myguest.noah.org
Name                                        ID   Mem VCPUs      State   Time(s)
myguest.noah.org                            111  4096     8     r----- 2696745.5

Identify the PID of the blkback driver belonging to Xen Guest 111. In this case the PID is 2162.

# ps axo pid,nice,command | grep [b]lkback.111.xvd
 2162   0 [blkback.111.xvd]

Use ionice to adjust the IO priority of the blkback driver. Note that the first example sets the lowest priority (-n 7) in the Best-effort class (-c 2). The second example sets the class to Idle (-c 3), which is the lowest priority you can set and may make the domain almost unusable. You can use ionice -p ${PID} to check the current IO priority of a given process.

ionice -p 2162 -c 2 -n 7
ionice -p 2162 -c 3
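
Looking up blkback PIDs by hand gets old with many guests. A small sketch that automates the mapping by parsing ps output (the function name is my own; the ps field layout matches the example above):

```python
import re
import subprocess

def blkback_pids(guest_id, ps_output=None):
    """Return the PIDs of blkback threads for a given Xen guest ID.

    If ps_output is None, run ps; otherwise parse the given text
    (handy for testing without a Xen host).
    """
    if ps_output is None:
        ps_output = subprocess.check_output(
            ["ps", "axo", "pid,nice,command"]).decode()
    # Match lines like " 2162   0 [blkback.111.xvd]".
    pattern = re.compile(r"^\s*(\d+)\s.*\[blkback\.%d\.xvd" % guest_id)
    pids = []
    for line in ps_output.splitlines():
        m = pattern.match(line)
        if m:
            pids.append(int(m.group(1)))
    return pids
```

Each returned PID can then be fed to ionice, e.g. ionice -p ${PID} -c 2 -n 7.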