Wondering how to fix Soft Lockup on Xen Hypervisor? We can help you.
As part of our Server Virtualization Technologies and Services, we assist our customers with several OnApp queries.
Today, let us discuss Soft Lockup on Xen Hypervisor.
Soft Lockup on Xen Hypervisor
Fresh hypervisors that run CentOS 6.x and Xen 4 can hang or kernel panic even without any VMs running.
The error may look like this:
kernel:BUG: soft lockup - CPU#16 stuck for 22s! [stress:6229]
Message from syslogd@HV3-cloud at Aug 30 09:56:27 ...
kernel:BUG: soft lockup - CPU#16 stuck for 22s! [stress:6229]
The dmesg output will be similar to this:
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81140a53>] exit_mmap+0xe3/0x160
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8104fde4>] mmput+0x64/0x140
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056d25>] exit_mm+0x105/0x130
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056fcd>] do_exit+0x16d/0x450
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113df2c>] ? handle_pte_fault+0x1ec/0x210
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81057305>] do_group_exit+0x55/0xd0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81067294>] get_signal_to_deliver+0x224/0x4d0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8101489b>] do_signal+0x5b/0x140
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8126f17d>] ? rb_insert_color+0x9d/0x160
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81083863>] ? finish_task_switch+0x53/0xe0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81576fe7>] ? __schedule+0x3f7/0x710
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff810149e5>] do_notify_resume+0x65/0x80
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8157862c>] retint_signal+0x48/0x8c
Aug 30 09:59:00 HV3-cloud kernel: Code: cc 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Aug 30 09:59:00 HV3-cloud kernel: Call Trace:
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81009e2d>] ? xen_force_evtchn_callback+0xd/0x10
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8100a632>] check_events+0x12/0x20
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8100a61f>] ? xen_restore_fl_direct_reloc+0x4/0x4
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8111dc06>] ? free_hot_cold_page+0x126/0x1b0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81005660>] ? xen_get_user_pgd+0x40/0x80
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8111dfe4>] free_hot_cold_page_list+0x54/0xa0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81121b18>] release_pages+0x1b8/0x220
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8114da64>] free_pages_and_swap_cache+0xb4/0xe0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81268da1>] ? cpumask_any_but+0x31/0x50
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81139bbc>] tlb_flush_mmu+0x6c/0x90
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113a0a4>] tlb_finish_mmu+0x14/0x40
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81140a53>] exit_mmap+0xe3/0x160
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8104fde4>] mmput+0x64/0x140
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056d25>] exit_mm+0x105/0x130
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81056fcd>] do_exit+0x16d/0x450
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8113df2c>] ? handle_pte_fault+0x1ec/0x210
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81057305>] do_group_exit+0x55/0xd0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81067294>] get_signal_to_deliver+0x224/0x4d0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8101489b>] do_signal+0x5b/0x140
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8126f17d>] ? rb_insert_color+0x9d/0x160
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81083863>] ? finish_task_switch+0x53/0xe0
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff81576fe7>] ? __schedule+0x3f7/0x710
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff810149e5>] do_notify_resume+0x65/0x80
Aug 30 09:59:00 HV3-cloud kernel: [<ffffffff8157862c>] retint_signal+0x48/0x8c
Aug 30 09:59:02 HV3-cloud kernel: BUG: soft lockup - CPU#5 stuck for 22s! [stress:6233]
Aug 30 09:59:02 HV3-cloud kernel: Modules linked in: arptable_filter arp_tables ip6t_REJECT ip6table_mangle ipt_REJECT iptable_filter ip_tables bridge stp llc xen_pciback xen_gntalloc bonding nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_round_robin dm_multipath xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd ufs(O) coretemp hwmon crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 aes_generic microcode pcspkr sb_edac edac_core joydev i2c_i801 sg iTCO_wdt iTCO_vendor_support igb evdev ixgbe mdio ioatdma myri10ge dca ext4 mbcache jbd2 raid1 sd_mod crc_t10dif ahci libahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 30 09:59:02 HV3-cloud kernel: CPU 5
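Before changing anything, we first confirm the hypervisor really is on an affected CentOS 6.x / Xen 4 combination. A minimal check, assuming the standard CentOS and Xen tools are installed (the hostname in the logs above is only an example):
# cat /etc/redhat-release
# uname -r
# xl info | grep xen_version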
Cause
While installing the hypervisor, we set a few parameters on Dom0 to share the CPU fairly between the VMs.
Though this works on previous Xen/CentOS versions, it seems to starve Dom0 on current Xen 4 / CentOS releases.
Generally, this is the parameter set by default:
xm sched-credit -d 0 -c 200
Here, -c is the cap. It optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles.
[root@ ~]# xm sched-credit
Name                ID  Weight  Cap
Domain-0             0   65535  200
This seems to starve the Dom0 CPU on servers that scale down the CPU power.
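Since the starvation shows up on servers that scale down CPU power, it can also help to see whether frequency scaling is active on the hypervisor. This is only a diagnostic sketch, not part of the fix itself; it assumes the xenpm tool that ships with Xen 4 is available:
# xenpm get-cpufreq-para
To keep the CPUs at full speed while testing (the setting only lasts until the next reboot), we can run:
# xenpm set-scaling-governor performance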
Resolution:
Our Support Techs recommend these steps to handle RHEL/CentOS 6.x with XEN 4.2.x hypervisor crashes:
1. Initially, to check the ratelimit and tslice values for the CPU pool (only for CentOS 6/XEN 4), we run:
root@xen4hv1 ~# xl -f sched-credit
If it shows different values, we set them like below:
root@xen4hv1 ~# xl -f sched-credit -s -t 5ms -r 100us
Or:
root@xen4hv1 ~# service xend stop
root@xen4hv1 ~# xl -f sched-credit -s -t 5ms -r 100us
root@xen4hv1 ~# service xend start
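For reference, on XEN 4.2 the scheduler parameters appear in the header of the listing when we run xl sched-credit again; the figures below are only illustrative:
root@xen4hv1 ~# xl sched-credit
Cpupool Pool-0: tslice=5ms ratelimit=100us
Name                ID  Weight  Cap
Domain-0             0     256    0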
2. After that, we set the default Credit Scheduler Cap and Weight values for Domain-0:
# xm sched-credit -d Domain-0 -w <WEIGHT> -c <CAP>
Here,
WEIGHT=600 for small HVs or cpu_cores/2*100 for large HVs;
CAP=0 for small HVs with few VMs and low CPU overselling or cpu_cores/2*100 for large HVs with huge CPU overselling;
For example, for an HV with 8 cores:
# xm sched-credit -d Domain-0 -w 600 -c 0
Otherwise, we can set the weight to 6000 by default:
# xm sched-credit -d Domain-0 -w 6000 -c 0
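As a rough sketch of the large-HV formula above, the weight and cap can be derived from the CPU count the hypervisor reports (the CORES variable name is ours; nr_cpus from xl info counts the host's CPUs, not just Dom0's vCPUs):
# CORES=$(xl info | awk '$1 == "nr_cpus" {print $3}')
# xm sched-credit -d Domain-0 -w $(( CORES / 2 * 100 )) -c $(( CORES / 2 * 100 ))
With 32 CPUs, for example, this gives a weight and cap of 1600.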
If the changes help, we change the CAP and Weight values in the /onapp/onapp-hv.conf file:
# vi /onapp/onapp-hv.conf
XEN_DOM0_SCHEDULER_WEIGHT=<WEIGHT>
XEN_DOM0_SCHEDULER_CAP=<CAP>
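For instance, with the small-HV values from the 8-core example above, the file would carry:
XEN_DOM0_SCHEDULER_WEIGHT=600
XEN_DOM0_SCHEDULER_CAP=0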
3. Finally, we assign a fixed number of vCPUs to Domain-0 in /etc/grub.conf, like below:
# cat /boot/grub/grub.conf | grep dom0
kernel /xen.gz dom0_mem=409600 dom0_max_vcpus=2
# vi /onapp/onapp-hv.conf
XEN_DOM0_MAX_VCPUS=2
In order for the changes to take effect, we reboot the system.
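Once the hypervisor is back up, a quick way to confirm the new limits took effect (again using the standard Xen toolstack):
# xl vcpu-list Domain-0
# xl sched-credit
Domain-0 should list only the configured number of vCPUs and carry the Weight and Cap values set above.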
[Need help with the procedures? We can help you]
Conclusion
In short, we saw how our Support Techs fix Soft Lockup on Xen Hypervisor.