Top Support Issues and How to Solve Them
Sriram Rajendran, Escalation Engineer
Nov 6, 2008

Agenda
Support issues covered:
• VMFS volumes missing / lost and LUN re-signaturing
• ESX host not responding / disconnected in VirtualCenter
• Expanding the size of a VMDK with existing snapshots
• Virtualization performance misconceptions

Cannot See My VMFS Volumes
A common support issue: "My VMFS volumes have disappeared!" In fact, it is often the case that the volumes are seen as snapshots, which by default are not mounted.
Why are VMFS volumes seen as snapshots when they are not?
• ESX server A is presented with a LUN on ID 0. The same LUN is presented to ESX server B on ID 1.
• A VMFS-3 volume is created on LUN ID 0 from server A.
• The volume on server B will not be mounted when the SAN is rescanned. Server B will state that the volume is a snapshot because of the LUN ID mismatch.
• LUNs must be presented with the same LUN IDs to all ESX hosts.
• In the next release of ESX, the LUN ID is no longer compared if the target exports NAA type IDs.

How Does ESX Determine If a Volume Is a Snapshot?
• When a VMFS-3 volume is created, the SCSI Disk ID data from the LUN/storage array is stored in the volume's LVM header. This contains, along with other information, the LUN ID.
• When another ESX server finds a LUN with a VMFS-3 filesystem, the SCSI Disk ID information returned from the LUN/storage array is compared with the LVM header metadata.
• The VMkernel treats a volume as a snapshot if there is a mismatch in this information.

How Should You Handle Snapshots?
First of all, determine whether it is really a snapshot:
• If it is a mismatch of LUN IDs across different ESX hosts, fix the LUN ID through the array management software to ensure that the same LUN ID is presented to all hosts for a shared volume.
• Other reasons a volume might appear as a snapshot are changes in the way the LUN is presented to the ESX host:
  • HDS Host Mode setting
  • EMC Symmetrix SPC-2 director flag
  • Change from A/P firmware to A/A firmware on the array
If it is definitely a snapshot, you have two options: set EnableResignature, or disable DisallowSnapshotLUN.

LVM.EnableResignature
• Used when mounting the original and the snapshot VMFS volumes on the same ESX host.
• Set LVM.EnableResignature to 1 and issue a rescan of the SAN. This updates the LVM header with:
  • new SCSI Disk ID information
  • a new VMFS-3 UUID
  • a new label
• The label format will be snap-<generation number>-<label>, or snap-<generation number>-<uuid> if there is no label, e.g.
  • Before resignature: /vmfs/volumes/lun2
  • After resignature: /vmfs/volumes/snap-00000008-lun2
• Remember to set LVM.EnableResignature back to 0.

LVM.DisallowSnapshotLUN
• DisallowSnapshotLUN will not modify any part of the LVM header.
• To allow the mounting of snapshot LUNs, set EnableResignature to 0 (disable) and DisallowSnapshotLUN to 0 (disable).
• Do not use DisallowSnapshotLUN to present snapshots back to the same ESX server.
• LVM.EnableResignature overrides LVM.DisallowSnapshotLUN.

LVM.EnableResignature
[Diagram: Server A, with HBA 1 and HBA 2, connected through FC Switch 1 and FC Switch 2 to a storage array (SP A / SP B) presenting LUN 0 and its clone, LUN 1]
• LVM.EnableResignature will have to be used to make the volume located on the cloned LUN, LUN 1, visible to the same ESX server after a rescan.
• Two volumes with the same UUID must not be presented to the same ESX server; issues with data integrity will occur.
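For reference, a minimal service-console sketch of the resignature workflow, assuming ESX 3.x; the HBA name vmhba1 is a placeholder, and the same LVM options can also be changed in the VI Client under Configuration > Advanced Settings:
#esxcfg-advcfg -g /LVM/EnableResignature     (check the current value, normally 0)
#esxcfg-advcfg -s 1 /LVM/EnableResignature   (enable resignaturing)
#esxcfg-rescan vmhba1                        (rescan; the resignatured volume appears as snap-<generation number>-<label>)
#esxcfg-advcfg -s 0 /LVM/EnableResignature   (remember to set it back to 0)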
LVM.EnableResignature OR LVM.DisallowSnapshotLUN
[Diagram: a snapshot of LUN 0 on Server A's array, presented as LUN 1 to a different ESX server, Server B, through its own FC fabric]
• We can present the snapshot LUN, LUN 1, using DisallowSnapshotLUN = 0 on Server B as long as Server B cannot see LUN 0.
• If Server B can also see LUN 0, then we must use resignaturing, since we cannot present two LUNs with the same UUID to the same ESX server.

LVM.DisallowSnapshotLUN
Remote snapshot, e.g. SRDF
[Diagram: production LUN 0 on Storage A presented to Server A; the replicated LUN 1 on Storage B at the DR site presented to Server B through a separate FC fabric]
• Since there is not going to be a LUN with the same UUID at the remote site, one can allow snapshots (DisallowSnapshotLUN = 0).

ESX host not responding / disconnected in VirtualCenter
[Diagram: components involved in the communication between the ESX and VC servers -- VPXD on the VC Server talks to the VC agent (VPXA) on the ESX Server, which talks to the host agent (hostd)]

Back to the issue
• Customer complaint: the ESX server is seen in a Disconnected or Not Responding state in the VC server.
• If the ESX host is seen in the Disconnected state, then reconnecting the host will solve the issue.
• However, in most cases the ESX host is seen in the Not Responding state. What do we do in this case?

List of things we can do
• Verify that network connectivity exists from the VirtualCenter Server to the ESX Server. Use ping to check the connectivity.
• Verify that you can connect from the VirtualCenter Server to the ESX Server on port 902 (if the ESX Server was upgraded from version 2.x, verify that you can connect on port 905). Use telnet to connect to the specified port.
• Verify that the hostd agent is running.
• Verify that the VPXA agent is running.
• Check system resources.

Checking if Hostd is alive
Verify that the ESX Server management service, hostd, is still running on the ESX server. How to check whether hostd is still alive:
• Connect to the ESX server using SSH from another Windows/Linux box.
• Execute the command vmware-cmd -l.
• If the command succeeds, hostd is working fine.
• If the command fails, hostd is not working and has probably stopped. Restarting the mgmt-vmware service will get the ESX host back online in the VirtualCenter Server.
• Once this is done, execute the same command again to confirm that hostd is working.
If hostd fails to start, check the following log for hints to identify the issue: /var/log/vmware/hostd.log
Note: in most cases the obvious reasons we have seen are:
• The / (root) filesystem is full.
• There are some rogue VMs registered.
• Some of the presented LUNs are either corrupted or do not have a valid partition table.
If you are not able to proceed any further, file a ticket with VMware TechSupport.

Checking if VPXA is alive
To verify whether the VirtualCenter Agent Service (vmware-vpxa) is running, log in to your ESX Server as root, from an SSH session or directly from the console of the server:
[root@server]# ps -ef | grep vpxa
root 24663     1 0 15:44 ?     00:00:00 /bin/sh /opt/vmware/vpxa/bin/vmware-watchdog -s vpxa -u 30 -q 5 /opt/vmware/vpxa/sbin/vpxa
root 26639 24663 0 21:03 ?     00:00:00 /opt/vmware/vpxa/vpx/vpxa
root 26668 26396 0 21:23 pts/3 00:00:00 grep vpxa
The output appears similar to the following if vmware-vpxa is not running:
[root@server]# ps -ef | grep vpxa
root 26709 26396 0 21:24 pts/3 00:00:00 grep vpxa
Sometimes the VPXA process may become orphaned. Restarting the vmware-vpxa service helps.
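Taken together, a quick first-pass health check might look like this (the hostname esx01 is a placeholder; the ports and commands are the ones listed above):
From the VirtualCenter server:
  ping esx01
  telnet esx01 902              (use 905 instead if the host was upgraded from ESX 2.x)
On the ESX service console, as root, over SSH or at the console:
  #vmware-cmd -l                (succeeds only if hostd is up)
  #/etc/init.d/mgmt-vmware restart   (if the command above fails, restart hostd and retry)
  #ps -ef | grep vpxa           (a healthy host shows the vmware-watchdog and vpxa processes)
If vpxa is missing or orphaned, restart it as shown next.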
How to restart the service: /etc/init.d/vmware-vpxa restart
If the service fails to start, check the following log to identify the issue: /var/log/vmware/vpx/vpxa.log
If you are not able to proceed any further, file a ticket with VMware TechSupport.

Checking System Resources
Sometimes hostd or vpxa fails to start due to a lack of system resources. Symptoms and where to look:
• High CPU utilization on an ESX Server -- esxtop
• High memory utilization on an ESX Server -- /proc/vmware/mem
• Slow response when administering an ESX Server

Expanding the size of a VMDK with existing Snapshots
You CANNOT expand a VM's VMDK file while it still has snapshots, e.g.
#ls *
important.vmdk important-000001-delta.vmdk
#vmkfstools -X 20G important.vmdk
[Diagram: the disk chain important.vmdk + important-000001-delta.vmdk, before and after the resize]

Expanding VM With a Snapshot
If you do, you will now have a VM that won't boot.

Expanding VM With a Snapshot
The fix is to trick ESX into seeing the expanded VMDK as its original size again. In this example we have a test.vmdk that was expanded from 5 GB to 6 GB:
#vmkfstools -X 6G test.vmdk

Expanding VM With a Snapshot
If we check test.vmdk we see:
# Disk DescriptorFile
version=1
CID=3f24a1b3
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 12582912 VMFS "test-flat.vmdk"
# The Disk Data Base
#DDB
ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "783"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "buslogic"

Expanding VM With a Snapshot
Original - RW 10485760 VMFS "test-flat.vmdk"
New      - RW 12582912 VMFS "test-flat.vmdk"
If we have no backups, how do we get the original value? The snapshot descriptor still records it (the sector arithmetic after this topic shows why these numbers correspond to 5 GB and 6 GB):
#grep -i rw test-000001.vmdk
RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"

Expanding VM With a Snapshot
We change the RW value in test.vmdk back to the original:
# Disk DescriptorFile
version=1
CID=3f24a1b3
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 10485760 VMFS "test-flat.vmdk"
# The Disk Data Base
#DDB
ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "783"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "buslogic"

Expanding VM With a Snapshot
Commit the snapshot(s):
#vmware-cmd /pathtovmx/test.vmx removesnapshots
Grow the VMDK file (vmkfstools, not vmware-cmd, resizes virtual disks):
#vmkfstools -X 6G test.vmdk
If needed, add a snapshot again:
#vmware-cmd /pathtovmx/test.vmx createsnapshot <name> <description>
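As a sanity check for the descriptor edit above, note that the RW value is a count of 512-byte sectors, so the sizes can be verified with simple arithmetic on the service console (values taken from the example above):
#grep -i rw test-000001.vmdk
RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"
#echo $((10485760 * 512 / 1024 / 1024 / 1024))
5
#echo $((12582912 * 512 / 1024 / 1024 / 1024))
6
10485760 sectors is the original 5 GB size the snapshot chain expects, while 12582912 sectors is the accidental 6 GB size, which is why restoring RW 10485760 in test.vmdk lets the VM boot again.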
Virtualization Performance Myths
• CPU affinity
• Virtual SMP performance
• Ready time
• Transparent page sharing
• Memory over-commitment
• Memory ballooning
• NIC teaming
• Hyperthreading

CPU Affinity
Myth: Set CPU affinity to improve VM performance.
CPU affinity implications:
• CPU affinity restricts scheduling freedom. The VM will accrue ready time if the pinned CPU is not available for scheduling.
• On a NUMA system, setting CPU affinity disables NUMA scheduling. VM performance will suffer if memory is allocated on the remote node.
• On a hyperthreaded system, CPU affinity binds the VM to a logical CPU.
• ESX tries to balance interrupts. Setting CPU affinity to a physical CPU where interrupts occur frequently can impact performance.
Fact: Setting CPU affinity could impact performance.

Virtual SMP Performance
Myth: Virtual SMP improves performance of all CPU-bound applications.
SMP performance:
• A single-threaded application cannot use more than one CPU at a time.
• A single thread may ping-pong between the virtual CPUs, which incurs virtualization overhead; pinning the thread to one vCPU inside the guest helps in this case.
• Co-scheduling overhead: multiple idle physical CPUs may not be available when the VM wants to run (the VM may accumulate ready time).
Fact: Virtual SMP does not improve performance of single-threaded applications.

Ready Time (1 of 2)
Myth: Ready time should be zero when CPU usage is low.
[Diagram: VM scheduler states -- running (%used), waiting (%twait), and ready to run (%ready)]
When does a VM go to the "ready to run" state?
• The guest wants to run or needs to be woken up (to deliver an interrupt), but
• a CPU is unavailable for scheduling the VM.

Ready Time (2 of 2)
Factors affecting CPU availability:
• CPU over-commitment. Even idle VMs have to be scheduled periodically to deliver timer interrupts.
• NUMA constraints. NUMA node locality gives better performance.
• Burstiness of inter-related workloads. Tip: use host anti-affinity rules to place inter-related workloads on different hosts.
• Co-scheduling constraints.
• CPU affinity restrictions.
Fact: Ready time could exist even when CPU usage is low. (A short esxtop sketch for observing ready time follows the references at the end of this section.)

Transparent Page Sharing
Myth: Disable transparent page sharing to improve performance.
Transparent page sharing:
• The VMkernel scans memory for identical pages and collapses them into a single page.
• Copy-on-write is performed if a shared page is modified.
• OSes have static code pages that rarely change, and the number of shared pages becomes significant with more consolidation -- a huge win for memory over-commitment.
• The default scanning rate is low and incurs negligible overhead (<1%).
Fact: Transparent page sharing does not affect performance adversely, and it improves performance under memory over-commitment.

Memory Ballooning
Myth: Disable the balloon driver to increase VM performance.
How ballooning works:
• The balloon driver gives memory to / takes memory away from the guest under memory pressure.
• The rate of memory reclamation is determined by memory shares; in the default configuration, memory shares are proportional to VM memory size.
• Memory is reclaimed forcibly by swapping if the balloon driver is not installed.
Tip: To avoid swapping/ballooning, use a memory reservation.
Fact: Disabling the balloon driver severely affects VM performance under memory over-commitment.

Hyperthreading
Myth: Hyperthreading hurts ESX performance.
Hyperthreading support in ESX:
• Hyperthreading increases the number of CPUs available for scheduling.
• SMP VMs use logical CPUs from different physical CPUs whenever possible.
• The scheduler temporarily quarantines a VM from a logical CPU if it misbehaves (cache thrashing).
• Frequently misbehaving VMs can be selectively excluded from hyperthreading with the htsharing=none option.
Fact: Hyperthreading improves overall performance by reducing ready time.

Additional References
Best Practices Using VMware Virtual SMP -- http://www.vmware.com/pdf/vsmp_best_practices.pdf
Performance Tuning Best Practices for ESX Server 3 -- http://www.vmware.com/pdf/vi_performance_tuning.pdf
ESX Resource Management Guide -- http://www.vmware.com/pdf/esx_resource_mgmt.pdf
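As a practical complement to the ready-time discussion, here is a rough sketch for observing %USED and %RDY with esxtop, assuming an ESX 3.x service console; the output path is only an example:
To watch ready time interactively (press c for the CPU panel, then watch the %USED and %RDY columns per VM):
#esxtop
To capture the counters in batch mode for later analysis (60 samples, 5 seconds apart):
#esxtop -b -d 5 -n 60 > /tmp/esxtop-ready.csv
A VM that shows low %USED but steadily accumulating %RDY matches the pattern described above: it is ready to run but the scheduler cannot place it, whether because of over-commitment, co-scheduling constraints, or affinity restrictions.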