Bobcares

Soft Lockup on Xen Hypervisor – How to prevent Dom0 CPU starvation

by | Jun 26, 2021

Wondering how to fix a Soft Lockup on a Xen Hypervisor? We can help you.

As part of our Server Virtualization Technologies and Services, we assist our customers with several OnApp queries.

Today, let us discuss Soft Lockup on Xen Hypervisor.

 

Soft Lockup on Xen Hypervisor

Fresh hypervisors that run CentOS 6.x and XEN 4 can hang or kernel panic even without any VMs running.

The error may look like this:

kernel:BUG: soft lockup - CPU#16 stuck for 22s! [stress:6229] 

Message from syslogd@HV3-cloud at Aug 30 09:56:27 ... 
 kernel:BUG: soft lockup - CPU#16 stuck for 22s! [stress:6229]

The dmesg output will be similar to this:

Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81140a53>] exit_mmap+0xe3/0x160 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8104fde4>] mmput+0x64/0x140 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056d25>] exit_mm+0x105/0x130 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056fcd>] do_exit+0x16d/0x450 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113df2c>] ? handle_pte_fault+0x1ec/0x210 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81057305>] do_group_exit+0x55/0xd0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81067294>] get_signal_to_deliver+0x224/0x4d0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8101489b>] do_signal+0x5b/0x140 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8126f17d>] ? rb_insert_color+0x9d/0x160 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81083863>] ? finish_task_switch+0x53/0xe0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81576fe7>] ? __schedule+0x3f7/0x710 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff810149e5>] do_notify_resume+0x65/0x80 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8157862c>] retint_signal+0x48/0x8c 
 Aug 30 09:59:00 HV3-cloud kernel: Code: cc 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
 Aug 30 09:59:00 HV3-cloud kernel: Call Trace: 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81009e2d>] ? xen_force_evtchn_callback+0xd/0x10 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8100a632>] check_events+0x12/0x20 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8100a61f>] ? xen_restore_fl_direct_reloc+0x4/0x4 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8111dc06>] ? free_hot_cold_page+0x126/0x1b0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81005660>] ? xen_get_user_pgd+0x40/0x80 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8111dfe4>] free_hot_cold_page_list+0x54/0xa0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81121b18>] release_pages+0x1b8/0x220 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8114da64>] free_pages_and_swap_cache+0xb4/0xe0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81268da1>] ? cpumask_any_but+0x31/0x50 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81139bbc>] tlb_flush_mmu+0x6c/0x90 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113a0a4>] tlb_finish_mmu+0x14/0x40 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81140a53>] exit_mmap+0xe3/0x160 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8104fde4>] mmput+0x64/0x140 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056d25>] exit_mm+0x105/0x130 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056fcd>] do_exit+0x16d/0x450 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113df2c>] ? handle_pte_fault+0x1ec/0x210 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81057305>] do_group_exit+0x55/0xd0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81067294>] get_signal_to_deliver+0x224/0x4d0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8101489b>] do_signal+0x5b/0x140 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8126f17d>] ? rb_insert_color+0x9d/0x160 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81083863>] ? finish_task_switch+0x53/0xe0 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81576fe7>] ? __schedule+0x3f7/0x710 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff810149e5>] do_notify_resume+0x65/0x80 
 Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8157862c>] retint_signal+0x48/0x8c 
 Aug 30 09:59:02 HV3-cloud kernel: BUG: soft lockup - CPU#5 stuck for 22s! [stress:6233] 
 Aug 30 09:59:02 HV3-cloud kernel: Modules linked in: arptable_filter arp_tables ip6t_REJECT ip6table_mangle ipt_REJECT iptable_filter ip_tables bridge stp llc xen_pciback xen_gntalloc bonding nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_round_robin dm_multipath xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd ufs(O) coretemp hwmon crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 aes_generic microcode pcspkr sb_edac edac_core joydev i2c_i801 sg iTCO_wdt iTCO_vendor_support igb evdev ixgbe mdio ioatdma myri10ge dca ext4 mbcache jbd2 raid1 sd_mod crc_t10dif ahci libahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
 Aug 30 09:59:02 HV3-cloud kernel: CPU 5

 

Cause

When we install the hypervisor, we set a few parameters on Dom0 to share the CPU fairly between the VMs.

Though this works on previous XEN/CentOS versions, it seems to starve Dom0 on current XEN 4 / CentOS releases.

Generally, this is the parameter set by default:

xm sched-credit -d 0 -c 200

Here, -c is the cap. It optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles.

[root@ ~]# xm sched-credit
Name                                ID Weight  Cap
Domain-0                             0  65535  200

This cap seems to starve the Dom0 CPU on servers that scale down their CPU power.
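
As a quick check, we can remove the cap from Domain-0 at runtime and see whether the soft lockups stop. A cap of 0 means uncapped, so Dom0 may again use idle CPU cycles; this is only a temporary sketch, and the resolution below covers the recommended long-term values:

# xm sched-credit -d 0 -c 0
# xm sched-credit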

 

Resolution:

Our Support Techs recommend these steps to handle crashes on RHEL/CentOS 6.x hypervisors running XEN 4.2.x:

1. Initially, to check the Ratelimit and Tslice values of the CPU pool (only for CentOS 6/XEN 4), we run:

root@xen4hv1 ~# xl -f sched-credit

If the output shows different values, we set them like below:

root@xen4hv1 ~# xl -f sched-credit -s -t 5ms -r 100us

Or:

root@xen4hv1 ~# service xend stop

root@xen4hv1 ~# xl -f sched-credit -s -t 5ms -r 100us

root@xen4hv1 ~# service xend start
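
To keep these values across reboots, we can also pass them to the hypervisor on its boot line. Below is a sketch assuming a Xen 4.2 build that honours the sched_credit_tslice_ms and sched_ratelimit_us boot parameters; keep the existing xen.gz options in /boot/grub/grub.conf and only append the scheduler settings, matching the 5ms/100us values set above:

kernel /xen.gz dom0_mem=409600 sched_credit_tslice_ms=5 sched_ratelimit_us=100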

2. After that, we set the default Credit Scheduler Weight and CAP values for Domain-0:

# xm sched-credit -d Domain-0 -w <WEIGHT> -c <CAP>

Here,

WEIGHT=600 for small HVs or cpu_cores/2*100 for large HVs;

CAP=0 for small HVs with few VMs and low CPU overselling or cpu_cores/2*100 for large HVs with huge CPU overselling;

For example, for HV with 8 cores:

# xm sched-credit -d Domain-0 -w 600 -c 0

Otherwise, we can set the weight to 6000 by default:

# xm sched-credit -d Domain-0 -w 6000 -c 0
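
For a large HV, the cpu_cores/2*100 rule above can be calculated and applied in one step. A minimal sketch, assuming the xm toolstack is in use; the CORES and VALUE variable names are only for illustration:

CORES=$(grep -c '^processor' /proc/cpuinfo)   # number of logical CPU cores
VALUE=$(( CORES / 2 * 100 ))                  # cpu_cores/2*100, per the guideline above
xm sched-credit -d Domain-0 -w "$VALUE" -c "$VALUE"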

If the changes help, we change the CAP and Weight values in the /onapp/onapp-hv.conf file:

# vi /onapp/onapp-hv.conf
 XEN_DOM0_SCHEDULER_WEIGHT=<WEIGHT>
 XEN_DOM0_SCHEDULER_CAP=<CAP>
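
For instance, with the 8-core example above, the entries would read:

XEN_DOM0_SCHEDULER_WEIGHT=600
XEN_DOM0_SCHEDULER_CAP=0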

3. Finally, we try to assign a fixed number of vCPUs to Domain-0 in /etc/grub.conf (a symlink to /boot/grub/grub.conf), like below:

# cat /boot/grub/grub.conf | grep dom0
kernel /xen.gz dom0_mem=409600 dom0_max_vcpus=2

If the changes help, we change the maximum number of vCPUs value in the /onapp/onapp-hv.conf file:

# vi /onapp/onapp-hv.conf
XEN_DOM0_MAX_VCPUS=2

For the changes to take effect, we reboot the system.
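
After the reboot, we can confirm the new limits. For example, with the xm toolstack:

# xm vcpu-list Domain-0
# xm sched-credit

Domain-0 should now list only the pinned number of vCPUs, and the Weight and Cap columns should show the values set above.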

[Need help with the procedures? We can help you]

 

Conclusion

In short, we saw how our Support Techs prevent Dom0 CPU starvation and resolve the Soft Lockup error on a Xen Hypervisor.

PREVENT YOUR SERVER FROM CRASHING!

Never again lose customers to poor server speed! Let us help you.

Our server experts will monitor & maintain your server 24/7 so that it remains lightning fast and secure.

GET STARTED
