Explanation of How a CPU Limit and a CPU Reservation Can Slow Your VM (if you don’t do proper sizing and analysis)

In this post, I would like to share about the CPU limit and CPU reservation configuration in VMware vSphere ESXi virtualisation technology.

Actually, those features are great (the configuration is also available in vCloud Director, which simply calls the corresponding configuration in vCenter) – but only if you really understand how to use them properly. For example, if you would like to use a CPU reservation, make sure you are not running those VMs in a heavily contended/overcommitted environment. For a CPU limit: if you have an application that always consumes 100% of CPU no matter how much CPU you give the VM, then you can use the Limit configuration to cap that application’s CPU usage (although, for me, the Best Way is to ask your Developer to Fix the Application!).

Okay, let’s talk more about CPU Limit.

Duncan Epping and Frank Denneman (both highly respected VMware bloggers) once said: “Look at a vCPU limit as a restriction within a specific time frame. When a time frame consists of 2000 units and a limit has been applied of 300 units it will take a full pass, so 300 “active” + 1700 units of waiting before it is scheduled again.”

So, applying a limit on a vCPU will slow your VM down no matter what, even if there are no other VMs running on that four-socket, quad-core host. In the example above, the vCPU gets only 300 of every 2000 units, i.e. 15% of the time frame, regardless of how idle the host is.
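
If you want to see this effect on a live host, esxtop can show it. This is just a minimal sketch (run from the ESXi Shell or an SSH session); the snapshot file path is only an example:

    # Open esxtop; press 'c' for the CPU panel if it is not already displayed
    esxtop

    # For the VM's world group, watch these columns:
    #   %RDY   - time the vCPU was ready to run but could not be scheduled (contention)
    #   %MLMTD - time the vCPU was ready to run but was deliberately not scheduled,
    #            because running it would violate the configured CPU limit
    # A high %MLMTD means the limit itself, not contention, is slowing the VM down.

    # Alternatively, capture a single batch-mode snapshot for offline analysis:
    esxtop -b -n 1 > /tmp/esxtop-snapshot.csv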

Next, let’s talk more about CPU Reservation.

Josh Odgers (another virtualisation blogger) also explained that a CPU reservation “reserves” CPU resources measured in MHz, but that this has nothing to do with the CPU scheduler. So setting a reservation will help improve performance for the VM you set it on, but it will not “solve” CPU Ready issues caused by “oversized” VMs or by too high an overcommitment ratio of CPU resources.

Limits and Reservations are configured outside the guest OS, so your operating system (Windows/Linux/etc.) and your applications (Java/.NET/C/etc.) are not aware of them; an application will request resources based on the vCPUs allocated to the VM, not on the limit underneath.
You should minimise the use of Limits and Reservations, as they make day-to-day operations more complex.
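
If you want to double-check from the host side what limit and reservation a VM actually has (instead of trusting the documentation of whoever built it), here is a quick sketch using vim-cmd from the ESXi Shell; the VM ID 12 is just an example:

    # List the VMs registered on this host and note the Vmid of the VM in question
    vim-cmd vmsvc/getallvms

    # Dump the VM's configuration and look at the cpuAllocation section,
    # which shows the configured reservation, limit and shares
    vim-cmd vmsvc/get.config 12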

Conclusion:

It is better to rely on the default VMkernel scheduler, which already does a great job of taking fairness into account. If you really want to prioritise one VM over the others, use the CPU Shares configuration instead.

But, the most important thing is: “Please Bro…, Right Size Your VM!”

 

Kind Regards,
Doddi Priyambodo

 

Can Not Connect / Error Connecting to iSCSI SAN

Sorry to interrupt the tutorial series about cloud-native applications; this is just a quick troubleshooting note.

I found an issue today with my iSCSI connection to the datastore: all hosts were showing this error when trying to connect to the SAN. This happened because I play around with my lab a lot and had removed and re-added the NICs of my VMware Fusion setup and of my hosts.

The error message looks something like this:

Call "IscsiManager.QueryBoundVnics" for object "iscsiManager" on ESXi / vCenter failed.

The problem was solved with the following steps (a command-line sketch follows the list):

1. Disable the iSCSI software adapter (back up your IQN and settings first).
2. Navigate to /etc/vmware/vmkiscsid/ on the host and back up the files there.
3. Delete the contents of /etc/vmware/vmkiscsid/.
4. Reboot the host.
5. Create a new software iSCSI adapter and set its IQN to the one backed up earlier.
6. Add the iSCSI port bindings and targets.
7. DONE.
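
For reference, roughly the same steps can be done from the ESXi Shell. This is only a sketch: the adapter name (vmhba64), VMkernel port (vmk1), IQN and target address below are examples, so substitute your own values:

    # Note the current adapter name and IQN, then disable software iSCSI
    esxcli iscsi adapter list
    esxcli iscsi software set --enabled=false

    # Back up and then clear the persisted iSCSI configuration
    cp -r /etc/vmware/vmkiscsid /tmp/vmkiscsid.bak
    rm -f /etc/vmware/vmkiscsid/*

    # Reboot the host
    reboot

    # Re-enable software iSCSI and restore the original IQN
    esxcli iscsi software set --enabled=true
    esxcli iscsi adapter set --adapter=vmhba64 --name=iqn.1998-01.com.vmware:esxi01-example

    # Re-add the port binding and the dynamic target, then rescan
    esxcli iscsi networkportal add --adapter=vmhba64 --nic=vmk1
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba64 --address=192.168.1.50:3260
    esxcli storage core adapter rescan --adapter=vmhba64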

 

Kind Regards,
Doddi Priyambodo

STP may cause temporary loss of network connectivity when a failover or failback event occurs (1003804)

Symptoms

In a switched network environment which uses Spanning Tree Protocol (STP), you experience these symptoms:

  • An ESXi or ESX host temporarily loses network connectivity when a failover or failback event occurs.
  • Virtual machines temporarily lose network connectivity when a failover or failback event occurs.
  • A VMware High Availability (HA) isolation event occurs after one of the teamed NICs of the COS is unplugged and plugged in to a different port.

Resolution

STP is used to achieve a loop-free environment. Every time a port's link state comes up, an STP calculation occurs. As a result of the calculation, the switch ports are set to either a forwarding or a blocking state to prevent a traffic loop. STP topology convergence has four states:

  • Blocking
  • Listening
  • Learning
  • Forwarding

When STP convergence is initiated, it forces all of the physical switches in the STP domain to dump their forwarding tables and relearn the STP topology and all MAC addresses. This process can take between 30 and 50 seconds. During this time, no user data passes through the port. Some user applications can time out during this period. Connectivity is restored when the STP domain completes this convergence.

To prevent the 30-50 second loss of connectivity during STP convergence, perform one of these options:

  • To set STP to Portfast on all switch ports that are connected to network adapters on an ESXi/ESX host
    Portfast allows the ports to immediately be set back to the forwarding state and prevents the link state changes that occur on ESX/ESXi hosts from affecting the STP topology. Setting STP to Portfast prevents the 30-50 second loss of network connectivity.
    The command to set STP to Portfast depends on the model of the switch. As the command differs from model to model and vendor to vendor, contact your physical switch vendor for detailed information on how to configure it. (A sketch of how to identify the relevant host uplinks follows this list.)
    For example:
    To set STP to Portfast on a switch, run the below command based on the switch model:

    • CISCO-IOS
      spanning-tree portfast (for an access port)
      spanning-tree portfast trunk (for a trunk port)
    • NX-IOS
      spanning-tree port type edge (for an access port)
      spanning-tree port type edge trunk (for a trunk port)
    • To set STP to Portfast on a Dell switch, run the command:
      spanning-tree portfast
    • HP switches use a feature called admin-edge-port, which works the same way as Portfast or RSTP.
      To enable admin-edge-port, run the command:
      spanning-tree <port-list> admin-edge-port
  • To disable STP
    VMware does not typically recommend that you disable STP. However, to prevent this issue from occurring, it may be necessary to disable STP. Before you disable STP, contact your switch vendor.
    The command to disable STP depends on the switch. Contact your switch vendor for more detailed information.
    For example:
    To disable STP on a Nortel switch, run the command:
    config ethernet stg stp disable
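
Before touching the physical switches, it also helps to confirm which host uplinks and which teaming/failover policy are involved, so you know exactly which switch ports to reconfigure. A quick sketch from the ESXi Shell (vSwitch0 is an example name):

    # List the standard vSwitches and the vmnics attached to them
    esxcli network vswitch standard list

    # Show the link state and speed of the physical NICs
    esxcli network nic list

    # Show the active/standby uplinks and failover policy of a vSwitch
    esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0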

Taken from : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003804

Troubleshooting – Log File Locations for VMware vRealize Automation 6.x

Below is information about the log file locations for the VMware vRealize Automation 6.x suite (formerly known as VMware vCloud Automation Center).

Troubleshooting is done by reading and analysing the various logs produced by the system. These log files are spread across several servers, depending on the vRA architecture decision made at installation/deployment time: whether a distributed deployment or a simple deployment was used.

 

vRealize Automation Virtual Appliance Locations:

  • /var/log/vcac/catalina.out – tc Server runtime logs, vRealize Automation webapp logs
  • /var/log/vco/app-server/catalina.out – vRealize Automation’s built-in vRealize Orchestrator logs
  • /var/log/apache2/access_log – Apache access logs
  • /var/log/apache2/error_log – Apache GET/POST error logs
  • /var/log/apache2/ssl_request_log – Apache SSL troubleshooting logs

vRealize Automation Infrastructure as a Service (IaaS) Locations:

  • C:\Program Files (x86)\VMware\vCAC\Agents\agent_name\logs\file – Plug-in logs, for example: CPI61, nsx, VC50, VC51Agent, VC51TPM, vc51withTPM, VC55Agent, vc55u, VDIAgent
  • C:\Program Files (x86)\VMware\vCAC\Distributed Execution Manager\DEMOR\Logs\DEMOR_All – Distributed Execution Manager logs
  • C:\Program Files (x86)\VMware\vCAC\Distributed Execution Manager\DEMWR\Logs\DEMWR_All – Distributed Execution Worker logs
  • C:\Program Files (x86)\VMware\vCAC\Server\Logs – Manager Service logs
  • C:\Program Files (x86)\VMware\vCAC\Server\ConfigTool\Log\vCACConfiguration-date – Repository configuration logs
  • C:\Program Files (x86)\VMware\vCAC\Server\Model Manager Data\Logs\nothing_today – IIS access logs (usually empty, which is expected)
  • C:\Program Files (x86)\VMware\vCAC\Server\Model Manager Web\Logs\Repository – Repository logs
  • C:\Program Files (x86)\VMware\vCAC\Server\Website\Logs\Web_Admin_All – Web Admin logs
  • C:\inetpub\logs – IIS logs

Identity Virtual Appliance Locations:

  • /var/log/vmware/sso/catalina.out – Identity VA tc Server runtime logs
  • /var/log/vmware/sso/ssoAdminServer.log – SSO Admin Server logs (Note: not applicable to vRealize Automation)
  • /var/log/vmware/sso/vmware-identity-sts-perf.log – STS performance logs
  • /var/log/vmware/sso/vmware-identity-sts.log – STS logs
  • /var/log/vmware/sso/vmware-sts-idmd-perf.log – Identity service performance logs
  • /var/log/vmware/sso/vmware-sts-idmd.err – Identity service error logs
  • /var/log/vmware/sso/vmware-sts-idmd.log – Identity service logs
  • /var/log/vmware/vmafd/vmafdd.log – Identity VA logs
  • /var/log/vmware/vmdir/vdcsetupldu.log – Initial setup logs
  • /var/log/vmware/vmdir/vmafdvmdirclient.log – VMware SSO LDAP initial configuration logs
  • /var/log/vmware/vmkdc/vmkdcd.log – VMware SSO LDAP initial configuration logs

vRealize Application Services Location:

  • /home/darwin/tcserver/darwin/logs/catalina.out – Application Services tc Server runtime logs

VMware vRealize Business Standard Locations:

  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/catalina.out – vRealize Business Advanced and Enterprise tc Server runtime logs
  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/auditFile.log – REST API requests
  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/itfm-external-api.log – API logs
  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/itfm-reflib-update.log – vRealize Business Standard reference library changes
  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/itfm-vc-dc.log – Data collector logs
  • /usr/local/tcserver/vfabric-tc-server-standard/tcinstance1/logs/itfm.log – vRealize Business Advanced and Enterprise logs

vCenter Server Appliance (VCSA) 5.5.x Locations:

  • /var/log/vmware/vpx/vpxd.log – vCenter VPXD logs
  • /var/log/vmware/vpx/vpxd-alert.log – vCenter VPXD alert logs
  • /var/log/vmware/vpx/vws.log – Management Web Service logs
  • /var/log/vmware/vpx/vmware-vpxd.log – vCenter VPXD status change logs
  • /var/log/vmware/vpx/inventoryservice/ds.log – vCenter Inventory Service logs
  • /var/log/vmware/vsphere-client/logs/vsphere_client_virgo.log – vSphere Client logs
  • /var/log/vmware/vsphere-client/logs/virgo-server/log.log – vSphere Client logs
  • /var/log/vmware/vsphere-client/eventlogs/eventlog.log – vSphere Client event logs

vCenter SSO Locations:

  • /var/log/vmware/sso/catalina.out – SSO tc Server runtime logs
  • /var/log/vmware/sso/ssoAdminServer.log – SSO Admin Server logs (only in 5.5.x versions)
  • /var/log/vmware/sso/vmware-identity-sts-perf.log – STS performance logs
  • /var/log/vmware/sso/vmware-identity-sts.log – STS logs
  • /var/log/vmware/sso/vmware-sts-idmd-perf.log – Identity service performance logs
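
Most of the appliance logs above can be followed live over SSH while you reproduce the problem. A small sketch (the appliance hostname is an example):

    # Follow the vRealize Automation appliance web application log
    ssh root@vra-appliance.lab.local 'tail -f /var/log/vcac/catalina.out'

    # Or pull only the recent errors from the built-in vRealize Orchestrator log
    ssh root@vra-appliance.lab.local "grep -i error /var/log/vco/app-server/catalina.out | tail -n 50"
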
Kind Regards,
Doddi Priyambodo

Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x

Symptoms

Permanent Device Loss (PDL)

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Lost Communication.
  • All paths to the device are marked as Dead.
  • The /var/log/vmkernel.log file shows messages similar to:

    cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba3:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
    cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba4:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
    cpu2:853571)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.60a98000572d54724a34642d71325763".
    cpu2:853571)WARNING: ScsiDevice: 1223: Device :naa.60a98000572d54724a34642d71325763 has been removed or is permanently inaccessible.
    cpu3:2132)ScsiDeviceIO: 2288: Cmd(0x4124403c1fc0) 0x9e, CmdSN 0xec86 to dev "naa.60a98000572d54724a34642d71325763" failed H:0x8 D:0x0 P:0x0
    cpu3:2132)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
    cpu2:2127)ScsiDeviceIO: 2316: Cmd(0x4124403c1fc0) 0x25, CmdSN 0xecab to dev "naa.60a98000572d54724a34642d71325763" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.
    cpu2:854568)WARNING: ScsiDeviceIO: 7330: READ CAPACITY on device "naa.60a98000572d54724a34642d71325763" from Plugin "NMP" failed. I/O error
    cpu2:854568)ScsiDevice: 1238: Permanently inaccessible device :naa.60a98000572d54724a34642d71325763 has no more open connections. It is now safe to unmount datastores (if any) and delete the device.
    cpu3:854577)WARNING: NMP: nmpDeviceAttemptFailover:562:Retry world restore device "naa.60a98000572d54724a34642d71325763" - no more commands to retry

All-Paths-Down (APD)

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Dead or Error.
  • All paths to the device are marked as Dead.
  • You are unable to connect directly to the ESXi host using the vSphere Client.
  • The ESXi host shows as Disconnected in vCenter Server.
  • The /var/log/vmkernel.log file shows messages similar to:

    cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.60a98000572d54724a34642d71325763" due to Not found
    cpu1:2049)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
    cpu1:2049)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60a98000572d54724a34642d71325763" - issuing command 0x4124007ba7c0
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update...
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60a98000572d54724a34642d71325763" - issuing command 0x4124007ba7c0
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update...

  • A restart of the management agents may show these errors:

    Not all VMFS volumes were updated; the error encountered was 'No connection'.
    Errors:
    Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.
    Error while scanning interfaces, unable to continue. Error was Not all VMFS volumes were updated; the error encountered was 'No connection'.

  • You may also see that the device is no longer listed:

    cpu17:10107)WARNING: Vol3: 1717: Failed to refresh FS 4beb089b-68037158-2ecc-00215eda1af6 descriptor: Device is permanently unavailable
    cpu17:10107)ScsiDeviceIO: 2316: Cmd(0x412442939bc0) 0x28, CmdSN 0x367bb6 from world 10107 to dev "eui.00173800084f0005" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    cpu17:10107)Vol3: 1767: Error refreshing PB resMeta: Device is permanently unavailable
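
The “Rescan complete…” message above suggests checking which VMkernel worlds still use the dead paths. A minimal sketch of that check from the ESXi Shell (the device ID is taken from the log lines above):

    # List all devices and their status to identify the affected LUN
    esxcli storage core device list

    # List the VMkernel worlds that still have the device open
    esxcli storage core device world list --device=naa.60a98000572d54724a34642d71325763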

Purpose

This article discusses a Permanent Device Loss (PDL) and All-Paths-Down (APD) in ESXi 5.x, and provides information on dealing with each of these scenarios.

Resolution

In vSphere 4.x, an All-Paths-Down (APD) situation occurs when all paths to a device are down. As there is no indication whether this is a permanent or temporary device loss, the ESXi host keeps reattempting to establish connectivity. APD-style situations commonly occur when the LUN is incorrectly unpresented from the ESXi/ESX host. The ESXi/ESX host, still believing the device is available, retries all SCSI commands indefinitely. This has an impact on the management agents, as their commands are not responded to until the device is again accessible. This causes the ESXi/ESX host to become inaccessible/not-responding in vCenter Server.

In vSphere 5.x, a clear distinction has been made between a device that is permanently lost (PDL) and a transient issue where all paths are down (APD) for an unknown reason.

For example, in the VMkernel logs, if a SCSI sense code of H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported is logged by the storage device to the ESXi 5.x host, this indicates that the device is permanently inaccessible to the ESXi host, or is in a Permanent Device Loss (PDL) state. The ESXi host no longer attempts to re-establish connectivity or issue commands to the device.

Devices that suffer a non-recoverable hardware error are also recognized as being in a Permanent Device Loss (PDL) state.

This table outlines possible SCSI sense codes that determine if a device is in a PDL state:

  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 – LOGICAL UNIT NOT SUPPORTED
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0 – LOGICAL UNIT FAILED SELF-CONFIGURATION
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3 – LOGICAL UNIT FAILED SELF-TEST
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1 – LOGICAL UNIT FAILURE

For more information about SCSI sense codes in vSphere, see Interpreting SCSI sense codes (289902).
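
To check quickly whether a host has already logged one of these PDL sense codes, a grep of the VMkernel log is usually enough. A sketch (the log path matches the messages quoted earlier):

    # Look for the generic PDL message
    grep -i "permanently unavailable" /var/log/vmkernel.log

    # Or search for the specific sense data from the table above
    grep "0x5 0x25 0x0" /var/log/vmkernel.log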

Note: Some iSCSI arrays map LUN to target as a one-to-one relationship; that is, there is only ever a single LUN per target. In this case, the iSCSI array does not return the appropriate SCSI sense code, so PDL cannot be detected on these array types. However, in ESXi 5.1, enhancements have been made: the iSCSI initiator now attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects the host’s effort to access the storage, and depending on the response from the array, the host can now mark the device as PDL.

All-Paths-Down (APD)

If PDL SCSI sense codes are not returned from a device (when unable to contact the storage array, or with a storage array that does not return the supported PDL SCSI codes), then the device is in an All-Paths-Down (APD) state, and the ESXi host continues to send I/O requests until the host receives a response.

As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:

  • Userworld I/O (hostd management agent)
  • Virtual machine guest I/O

    Note: If an I/O request is issued from a guest, the guest operating system should time out and abort the I/O.

Due to the nature of an APD situation, there is no clean way to recover.

  • The APD situation needs to be resolved at the storage array/fabric layer to restore connectivity to the host.
  • All affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.

Note: Performing a vMotion migration of unaffected virtual machines is not possible, as the management agents may be affected by the APD condition, and the ESXi host may become unmanaged. As a result, a reboot of an affected ESXi host forces an outage to all non-affected virtual machines on that host.

Planned versus unplanned PDL

A planned PDL occurs when there is an intent to remove a device presented to the ESXi host. The datastore must first be unmounted, then the device detached before the storage device can be unpresented at the storage array. For more information on how to correctly unpresent a LUN in ESXi 5.x, see Unmounting a LUN or detaching a datastore/storage device from multiple ESXi 5.x hosts (2004605).

An unplanned PDL occurs when the storage device is unexpectedly unpresented from the storage array without the unmount and detach being executed on the ESXi host.

In ESXi 5.5, VMware provides a feature called Auto-remove for automatic removal of devices during an unplanned PDL. For more information, see PDL AutoRemove feature in vSphere 5.5 (2059622).
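
On an ESXi 5.5 host, the PDL AutoRemove behaviour is controlled by the Disk.AutoremoveOnPDL advanced setting; a quick sketch of checking and changing it from the ESXi Shell (the default value is 1, i.e. enabled):

    # Show the current value of the AutoRemove setting
    esxcli system settings advanced list --option=/Disk/AutoremoveOnPDL

    # Disable AutoRemove if your design requires it (set the value back to 1 to re-enable)
    esxcli system settings advanced set --option=/Disk/AutoremoveOnPDL --int-value=0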

To clean up an unplanned PDL (a command-line sketch follows these steps):

  1. All running virtual machines from the datastore must be powered off and unregistered from the vCenter Server.
  2. From the vSphere Client, go to the Configuration tab of the ESXi host, and click Storage.
  3. Right-click the datastore being removed, and click Unmount.

    The Confirm Datastore Unmount window displays. When the prerequisite criteria have been passed, the OK button appears.

    If you see this error when unmounting the LUN:

    Call datastore refresh for object <name_of_LUN> on vCenter server <name_of_vCenter> failed

    You may have a snapshot LUN presented. To resolve this issue, remove that snapshot LUN on the array side.

  4. Perform a rescan on all of the ESXi hosts that had visibility to the LUN.

    Note: If there are active references to the device or pending I/O, the ESXi host still lists the device after the rescan. Check for virtual machines, templates, ISO images, floppy images, and raw device mappings which may still have an active reference to the device or datastore.

  5. If the LUN is still being used and available again, go to each host, right-click the LUN, and click Mount.

    Note: One possible cause of an unplanned PDL is that the LUN ran out of space, causing it to become inaccessible.
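
The same cleanup can also be done from the ESXi Shell. This is a sketch only, using the device ID from the log messages earlier and an example datastore name (Datastore01):

    # Unmount the datastore (by label; --volume-uuid also works)
    esxcli storage filesystem unmount --volume-label=Datastore01

    # Detach the device so the host stops sending I/O to it
    esxcli storage core device set --state=off --device=naa.60a98000572d54724a34642d71325763

    # After the LUN has been unpresented on the array, rescan all adapters
    esxcli storage core adapter rescan --all

    # Confirm the device is now listed as detached
    esxcli storage core device detached list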
