A Collection of Useful Links / Booklets for VMware

 

Here is a collection of good links to read about VMware.

USEFUL URLS:

  •  VMware Products Feature Walkthrough:
    http://featurewalkthrough.vmware.com/
  •  VMware Knowledgebase:
    http://kb.vmware.com/selfservice/microsites/microsite.do
  •  VMware Documentation:
    https://www.vmware.com/support/pubs/
  •  Security Hardening Guide:
    https://www.vmware.com/security/hardening-guides
  •  VMware Compatibility Guide:
    http://www.vmware.com/resources/compatibility/search.php
  •  VMware Product Interoperability Matrixes:
    http://www.vmware.com/resources/compatibility/sim/interop_matrix.php
  •  Guest Operating System Installation Guide:
    http://partnerweb.vmware.com/GOSIG/home.html
  •  Technical White Papers:
    http://www.vmware.com/vmtn/resources/
  •  VMware Security Advisories:
    http://www.vmware.com/security/advisories/
  •  VMware Community:
    https://communities.vmware.com/community/vmtn/vmug/forums/asia_pacific
  •  VMware Blog:
    http://blogs.vmware.com/
  •  VMware Education:
    http://mylearn.vmware.com/mgrreg/index.cfm
  •  My Learn Portal for Education and Certification:
    https://mylearn.vmware.com/
  •  Hands-on Labs (HOL):
    http://hol.vmware.com
  •  My VMware Portal:
    https://my.vmware.com/web/vmware/login
  •  Technical Publication Glossary:
    https://www.vmware.com/pdf/master_glossary.pdf

 

USEFUL TECHNICAL WHITEPAPERS

  •  VMware Software-Defined Data Center
    https://www.vmware.com/resources/techresources/10471
  •  What’s New in VMware vSphere 6 – Performance
    https://www.vmware.com/resources/techresources/10485
  •  Performance Best Practices for VMware vSphere 6.0
    https://www.vmware.com/resources/techresources/10480
  •  vSphere Upgrade Center:
    https://www.vmware.com/products/vsphere/upgrade-center/overview
  •  vCenter 6.0 Deployment Guide
    https://www.vmware.com/files/pdf/techpaper/vmware-vcenter-server6-deployment-guide.pdf
  •  vCenter Server 6 Deployment Topologies and High Availability
    http://blogs.vmware.com/vsphere/2015/03/vcenter-server-6-topology-ha.html
  •  vCenter Single Sign-On and Platform Services Controller High Availability Compatibility Matrix (2112736)
    http://kb.vmware.com/kb/2112736
  •  vCenter Platform Services Controller FAQs
    http://kb.vmware.com/kb/2113115
  •  Configuring PSC 6.0 High Availability for vSphere 6.0 using vCenter Server 6.0 Appliance (2113315)
    http://kb.vmware.com/kb/2113315
  •  vCenter Server 6.0 Availability Guide
    http://www.vmware.com/files/pdf/techpaper/VMware-vCenter-Server6-Availability-Guide.pdf
  •  Security of the VMware vSphere Hypervisor
    http://www.vmware.com/files/pdf/techpaper/vmw-wp-secrty-vsphr-hyprvsr-uslet-101.pdf
  •  Microsoft SQL Server and VMware Virtual Infrastructure
    https://www.vmware.com/resources/techresources/10002
  •  Best Practices for Performance Tuning of Telco and NFV Workloads in vSphere
    https://www.vmware.com/resources/techresources/10479
  •  Using “esxtop” to Troubleshoot Performance Problems
    https://www.vmware.com/resources/techresources/436
  •  VMware Horizon View and All Flash Virtual SAN Reference Architecture
    https://www.vmware.com/resources/techresources/10484
  •  Virtualizing Microsoft Applications on VMware Virtual SAN
    https://www.vmware.com/resources/techresources/10478

 

Business Critical Application Virtualization Guides

  •  Microsoft SQL
    https://www.vmware.com/business-critical-apps/sql-virtualization/microsoft-support.html
  •  Microsoft Exchange
    https://www.vmware.com/business-critical-apps/exchange/index.html
  •  Microsoft SharePoint
    https://www.vmware.com/business-critical-apps/sharepoint-virtualization/index.html
  •  SAP
    https://www.vmware.com/business-critical-apps/sap-virtualization/index.html
  •  Oracle
    https://www.vmware.com/business-critical-apps/oracle-virtualization/resources.html
  •  Java
    https://www.vmware.com/business-critical-apps/enterprise-java-app/resources.html

 

VREALIZE OPERATIONS INSIGHT

  1.  Official enterprise management blogs
    http://blogs.vmware.com/management/
  2.  Official video
    https://www.youtube.com/channel/UCKON30YeSGIeqsueMYgEa9A
  3.  Useful resources
    http://www.vmware.com/products/vrealize-suite/resources.html
  4.  Solution Exchange
    https://solutionexchange.vmware.com/store/category_groups/cloud-management
  5.  Hands-on Lab for Management products
    http://labs.hol.vmware.com/HOL/catalogs/catalog/128
  6.  Technical blogs by VMware or customers
    o http://sflanders.net/ is the world's #1 blog for Log Insight. Steven is the Product Architect for Log Insight.
    o http://virtual10.com/ by Manny Sidhu, a virtualization architect working for a global bank.
    o http://vxpresss.blogspot.sg/ by Sunny Dua, VMware PSO Consultant and CTO Ambassador.
    o http://virtual-red-dot.info by Iwan Rahabok, VMware SE and CTO Ambassador.

 

Thanks. I hope this is useful.

 

Kind Regards,

Doddi Priyambodo.

VMware vSphere 6 Public Documentation

Here is the official documentation for vSphere 6. It can be downloaded directly from the VMware website.

Enjoy: (http://pubs.vmware.com/vsphere-60/index.jsp)

 

ESXi and vCenter Server 6.0 Product Guides
vSphere Installation and Setup [pdf | epub | mobi]
vSphere Upgrade [pdf | epub | mobi]
vSphere vCenter Server and Host Management [pdf | epub | mobi]
vSphere vCenter Server Appliance Configuration [pdf | epub | mobi]
vSphere Virtual Machine Administration [pdf | epub | mobi]
vSphere Host Profiles [pdf | epub | mobi]
vSphere Networking [pdf | epub | mobi]
vSphere Storage [pdf | epub | mobi]
vSphere Security [pdf | epub | mobi]
vSphere Resource Management [pdf | epub | mobi]
vSphere Availability [pdf | epub | mobi]
vSphere Monitoring and Performance [pdf | epub | mobi]
vSphere Administration with the vSphere Client [pdf | epub | mobi]
vSphere Troubleshooting [pdf | epub | mobi]

 

vSphere Update Manager 6.0 Product Guides
Installing and Administering VMware vSphere Update Manager [pdf | epub | mobi]
Reconfiguring VMware vSphere Update Manager [pdf | epub | mobi]
Sizing Estimator for vSphere Update Manager [xls]

VMware Virtual SAN 6.0 Product Guides
Administering VMware Virtual SAN [pdf | epub | mobi]

vSphere Data Protection 6.0 Product Guides
vSphere Data Protection Administration Guide 6.0 [pdf]

Other Resources for vSphere 6.0
vSphere Management Assistant Guide [pdf]

There are other documents related to vSphere and vCenter. You can google them and devour them all 🙂

Note:

– For installing vSphere 6 and an explanation of its components, there is a good blog series by Derek Seaman >>> (http://www.derekseaman.com/2015/02/vsphere-6-0-install-pt-1-introduction.html)

 

Thank you.

 

A Collection of Important Knowledge Base Articles for VMware vSphere 6.0

vSphere 6.0 was launched a few months ago, but many vSphere users are still running version 5.5. This collection of knowledge base articles is therefore useful for recapping which KBs matter for version 5.5.

With this in mind, we have created the following list of Knowledgebase articles that are brand new, or have been updated for vSphere 5.5. You'll notice lots of best-practice KBs here.

The first group contains absolute 'must-know' information; the second goes into more detail.

Also not to be missed:

Taken from: http://blogs.vmware.com/kb/2013/09/vsphere-5-5-is-here-kbs-you-need-to-know-about.html

High-Level Best Practice Configuration Checks for a VMware vSphere Production Environment

Below are several best-practice configurations to check in order to find out whether your current VMware environment is appropriate for production. This is only high-level guidance. The details need further explanation, which I hope to follow up on for several of the components below.

Component | Recommended Action Item
Compute | Configure firewall rules and ports according to best practices.
Compute | Configure VMware vSphere ESXi Shell and SSH access according to the customer's security and manageability requirements.
Datacenter | Use vCenter Server roles, groups, and permissions to provide appropriate access and authorization to the VMware virtual infrastructure. Avoid using Windows built-in groups (Administrators).
Datacenter | Set a Tasks and Events retention policy for the environment.
Datacenter | Size clusters with HA host-failure capacity in mind.
Datacenter | Set up redundancy for the management port (either a separate vmnic or a separate uplink) and, if appropriate, an alternate isolation response gateway address for more reliable HA isolation detection.
Datacenter | Maintain compatible and homogeneous (CPU and memory) hosts within a cluster to support the required functionality for vMotion, vSphere DRS, VMware vSphere Distributed Power Management (DPM), VMware vSphere HA, and vSphere FT.
Network | Verify that there is redundancy in networking paths and components to avoid single points of failure. For example, provide at least two paths to each network.
Network | Configure networking consistently across all hosts in a cluster.
Network | If jumbo frames are enabled, verify that jumbo frame support is enabled on all intermediate devices and that there is no MTU mismatch.
Network | Minimize differences in the number of active NICs across hosts within a cluster.
Network | Configure networks so that traffic is separated (physically, or logically using VLANs).
Network | Use DV Port Groups to apply policies to traffic flow types and to provide Rx bandwidth controls through Traffic Shaping.
Network | Use Load-Based Teaming (LBT) to balance virtual machine network traffic across multiple uplinks.
Network | Use Network I/O Control (NetIOC) to prioritize traffic on 10GbE network uplinks.
Network | Change the load balancing policy from the default (route based on originating virtual port ID) only if necessary.
Storage | Minimize differences in the datastores visible to hosts within the same cluster or vMotion scope.
Storage | Separate NFS and iSCSI storage traffic physically (for performance) and logically (for security).
Virtual Machines | Limit the use of snapshots, and keep any snapshots short-term.
Virtual Machines | Verify that VMware Tools is installed, running, and up to date in running virtual machines.
Virtual Machines | Verify that virtual machines meet the requirements for vSphere vMotion.
Compute | Avoid unnecessary changes to advanced parameter settings.
Datacenter | Enable bidirectional CHAP authentication for iSCSI traffic so that CHAP authentication secrets are unique.
Datacenter | Disconnect vSphere Clients from vCenter Server when they are no longer needed.
Datacenter | Maintain compatible virtual hardware versions for virtual machines to support vMotion.
Licensing | Verify that adequate licenses are available for vCenter Server instances.
Licensing | Verify that adequate CPU licenses are available for ESXi hosts.
Network | Distribute the vmnics for a port group across different PCI buses for greater redundancy.
Network | Change the port group security defaults for Forged Transmits, Promiscuous Mode, and MAC Address Changes to Reject unless the application requires the defaults.
Storage | Use shared storage for virtual machines instead of local storage.
Storage | Size datastores appropriately.
Storage | Allocate space on shared datastores for templates and media/ISOs separately from the datastores for virtual machines.
Storage | Use Storage I/O Control (SIOC) to prioritize high-importance virtual machine traffic.
Virtual Machines | As a security enhancement, disable certain unexposed features.
Virtual Machines | Limit sharing of console connections if there are security concerns.
Virtual Machines | Allocate only as much virtual hardware as each virtual machine requires. Disable any unused, unnecessary, or unauthorized virtual hardware devices.
Virtual Machines | Consider using the latest virtual hardware version to take advantage of additional capabilities.
Virtual Machines | Use the latest version of VMXNET that the guest operating system supports.
Virtual Machines | Use reservations and limits selectively, only on the virtual machines that need them. Don't set reservations too high or limits too low.
Virtual Machines | Select the guest operating system type in the virtual machine configuration that matches the installed guest operating system.
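Three of the items above (HA sizing, jumbo frame MTU consistency, and port group security defaults) reduce to simple checks once the settings have been collected. The sketch below is a minimal, hypothetical Python example: it assumes you have already exported the values into plain Python data yourself, and the function names and dictionary keys are illustrative, not a vSphere API. The HA calculation assumes identical hosts, where tolerating F host failures in an N-host cluster means keeping F/N of the capacity free.

```python
"""Hypothetical helpers for three items in the checklist above.

Inputs are plain Python values collected by you; the names and dictionary
keys are illustrative assumptions, not a vSphere API.
"""

# Port group security settings that the checklist recommends setting to Reject.
SECURITY_SETTINGS = ("promiscuous_mode", "mac_address_changes", "forged_transmits")


def ha_reserved_percent(hosts, host_failures_to_tolerate):
    """Share of cluster CPU/memory to keep free so that the surviving hosts can
    restart the VMs of failed hosts. Assumes identical hosts."""
    if hosts <= host_failures_to_tolerate:
        raise ValueError("the cluster needs more hosts than the failures it tolerates")
    return 100.0 * host_failures_to_tolerate / hosts


def jumbo_frame_gaps(vmkernel_mtu, path_mtus):
    """Return the devices on the path whose MTU is below the VMkernel port MTU;
    any of them would break end-to-end jumbo frames."""
    return sorted(name for name, mtu in path_mtus.items() if mtu < vmkernel_mtu)


def permissive_security_settings(policy):
    """Return the security settings not set to 'Reject' (missing keys count as not hardened)."""
    return [key for key in SECURITY_SETTINGS if policy.get(key) != "Reject"]


if __name__ == "__main__":
    print(ha_reserved_percent(hosts=4, host_failures_to_tolerate=1))  # 25.0
    print(jumbo_frame_gaps(9000, {"vSwitch1": 9000, "pswitch-a": 9216, "array-port-1": 1500}))  # ['array-port-1']
    print(permissive_security_settings({"promiscuous_mode": "Reject", "forged_transmits": "Accept"}))
    # ['mac_address_changes', 'forged_transmits']
```

The MTU check flags only devices with a smaller MTU than the VMkernel port, since physical switches are often configured with a larger MTU (for example 9216) without causing problems.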

Kind Regards,
Doddi Priyambodo

 

How Should Beginners Learn VMware? (VMware Tutorial Indonesia)

If you want to learn about VMware products, here are some public links you can open and use as references:

1. The official VMware website (http://www.vmware.com); a lot of public material is shared there.

2. The VMwareTV YouTube channel (https://www.youtube.com/user/vmwaretv); its videos are good references, and feel free to keep browsing the other channels there.

3. A video collection website (http://www.vmwarelearning.com), with a very good collection of videos.

4. VMware Feature Walkthrough (http://featurewalkthrough.vmware.com), step-by-step tutorials on VMware for newbies.

5. A virtual lab in the cloud! (http://labs.hol.vmware.com/), one word from me: "WOW!"

6. Subscribe to this blog regularly 🙂

7. Join a VMware class at one of the authorized training centers across Indonesia.

8. There are some internal resources (e.g., the Vault portal) that only VMware employees and VMware partners can access. Get to know them and ask; they may be able to share some public material with you.

 

Kind Regards,
Doddi Priyambodo

Cloning Microsoft SQL Server from a Template

There are several things we need to do to deploy a Microsoft SQL Server database from a standardized template.

The full list of items can be read in these blogs:

So, the options are:

  1. Run several PowerCLI scripts after the deployment (execute via SysPrep or vRealize Orchestrator)
  2. Use VMware Application Service and automate the deployment via vRealize Automation

 

Kind Regards,
Doddi Priyambodo

What Will SysAdmins Do in the Automated Cloud Future?

Nobody would dispute that system administrators have been integral to keeping IT environments running. But that hasn’t stopped people from wondering whether sysadmins will still have a role in a future world of highly automated clouds.

They will, and it will be just as critical. But that role will also be very different.

Today, sysadmins are all about the VM. They’re akin to workers on a manufacturer’s production line. Sometimes they’re at the beginning of the line, figuring out where to place the VM and what services to connect to it, and then handing it off to developers who will add applications inside of it. Sometimes they’re at the end of the process, deploying the VM. Many times they’re manning the station, ready to add memory to fix poor performance or move a VM when a server fails.

But increasingly, advanced analytics engines are able to identify infrastructure anomalies and recommend remediation steps, and automation tools can put many of them into action. So what does that mean for sysadmins?

Instead of focusing on discrete tasks and spending large amounts of time on daily firefighting, sysadmins will be more strategic, like pilots overseeing entire operations.

Even though airplanes are capable of getting from point A to point B on their own thanks to intelligent systems and automation, pilots still man the cockpit. Their expertise is required to oversee and, often times, adjust the incredibly intricate and interdependent systems that keep planes flying. Pilots are the ones entrusted with getting passengers to destinations safely.

So, too, with sysadmins. The cloud and the notion of software-defined data centers have added orders of magnitude more complexity to IT environments. Instead of an application running on one VM, it may run on dozens of VMs, each of which has storage, load balancing, database and other services attached to it. Instead of the VM being the container, the application becomes the container. And each container is a system with many interdependent parts and services.

The sysadmin’s new role is to optimize and manage those systems. But unlike the way sysadmins have been managing VMs, they can’t hand-hold each of these complex systems. They’d run out of hours in a day before even scratching the surface. Rather, now that products such as VMware vCO, vCAC and App Director are being combined into a single automation stack, deeply integrated with vCenter Operations Management, and working together with SRM, vSAN and NSX, software can automatically handle many of the daily tasks. When more complex, critical problems arise, they’ll be flagged for the sysadmins, who will pull from their broad knowledge and nuanced understanding of storage, networks, applications and more, to triage and resolve them.

The sysadmins' responsibilities won't end there. By working at this higher level, they will be able to influence those systems in ways that help businesses operate more efficiently, cost-effectively and competitively. And that's where their real value lies. Sysadmins of the future will be planners and problem solvers who leverage automated cloud environments and their advanced analytics capabilities. Like pilots, they'll ensure the IT systems that businesses rely on can take the enterprise where it needs to go.

By Leslie Muller

Business Continuity (BC) vs Disaster Recovery (DR) in VMware Site Recovery Manager (SRM) Design – (RPO, RTO, WRT, MTD)

Business Continuity vs Disaster Recovery

DR: – we hoped it would never happen, but it has…
       – get the business running again ASAP
       – it is a tactical and technical movement
BC: – C-level executives
       – who, what, where, and when is needed
       – not simply technical; the whole business needs to be considered

RPO, RTO, WRT, MTD (Recovery Point Objective, Recovery Time Objective, Work Recovery Time, Maximum Tolerable Downtime)

This is a simple explanation of RPO and RTO, as well as WRT and MTD, because few customers fully understand these terms. Yet we need to discuss these criteria when designing Disaster Recovery, especially if we want to implement VMware SRM (Site Recovery Manager).

 

Consider the following scenario.

Stage 1: Business as usual

At this stage all systems are running production and working correctly.

Stage 2: Disaster occurs

At a given point in time, a disaster occurs and systems need to be recovered. At this point the Recovery Point Objective (RPO) determines the maximum acceptable amount of data loss, measured in time. For example, the maximum tolerable data loss is 15 minutes.
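Because RPO is measured in time, checking it is simple arithmetic: the data lost is everything written between the last recoverable copy (replica, snapshot or backup) and the moment of the disaster. The Python sketch below illustrates that comparison using the 15-minute example from the text; the timestamps are hypothetical. With periodic replication the worst case is roughly one full replication interval, so in practice the interval should stay below the RPO.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point, disaster_time):
    """Time span of writes lost: everything after the last recoverable copy."""
    if disaster_time < last_recovery_point:
        raise ValueError("the disaster cannot happen before the last recovery point")
    return disaster_time - last_recovery_point


def meets_rpo(last_recovery_point, disaster_time, rpo):
    """True if the actual data loss stays within the Recovery Point Objective."""
    return data_loss_window(last_recovery_point, disaster_time) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=15)                 # the 15-minute example from the text
    last_copy = datetime(2015, 3, 2, 10, 0)     # hypothetical last replication time
    print(meets_rpo(last_copy, datetime(2015, 3, 2, 10, 12), rpo))  # True: 12 minutes lost
    print(meets_rpo(last_copy, datetime(2015, 3, 2, 10, 20), rpo))  # False: 20 minutes lost
```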

Stage 3: Recovery

At this stage the systems are recovered and back online but not ready for production yet. The Recovery Time Objective (RTO) determines the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restoring data from backup or fixing a failure. In most cases this part is carried out by system administrators, network administrators, storage administrators, etc.

Stage 4: Resume Production

At this stage all systems are recovered, the integrity of the systems and data is verified, and all critical systems can resume normal operations. The Work Recovery Time (WRT) determines the maximum tolerable amount of time needed to verify system and/or data integrity. This could include, for example, checking the databases and logs and making sure the applications or services are running and available. In most cases these tasks are performed by application administrators, database administrators, etc. When all systems affected by the disaster are verified and/or recovered, the environment is ready to resume production.

The sum of RTO and WRT is defined as the Maximum Tolerable Downtime (MTD), which is the total amount of time that a business process can be disrupted without causing unacceptable consequences. This value should be defined by the business management team or someone such as the CTO, CIO or IT manager.
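Since MTD = RTO + WRT and the business sets the MTD it can tolerate, a design check is whether the technical recovery time plus the verification time fit inside that limit. The Python sketch below shows this arithmetic; the 4-hour RTO, 2-hour WRT and 8-hour business MTD are hypothetical values for illustration only.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto, wrt):
    """MTD = RTO + WRT, as defined in the paragraph above."""
    return rto + wrt


def plan_fits_business_mtd(rto, wrt, business_mtd):
    """True if recovery (RTO) plus verification (WRT) fit in the downtime the business accepts."""
    return maximum_tolerable_downtime(rto, wrt) <= business_mtd


if __name__ == "__main__":
    rto, wrt = timedelta(hours=4), timedelta(hours=2)            # hypothetical targets
    print(maximum_tolerable_downtime(rto, wrt))                  # 6:00:00
    print(plan_fits_business_mtd(rto, wrt, timedelta(hours=8)))  # True
```

If the check fails, either RTO (faster recovery, for example with SRM) or WRT (faster verification) has to shrink, or the business has to accept a longer MTD.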

This is of course a simple example of a Business Continuity/Disaster Recovery plan and should be included in your Business Impact Analysis (BIA).

Referenced from: http://defaultreasoning.com/2013/12/10/rpo-rto-wrt-mtdwth/

Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x

Symptoms

Permanent Device Loss (PDL)

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Lost Communication.
  • All paths to the device are marked as Dead.
  • The /var/log/vmkernel.log file shows messages similar to:

    cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba3:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
    cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba4:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
    cpu2:853571)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.60a98000572d54724a34642d71325763".
    cpu2:853571)WARNING: ScsiDevice: 1223: Device :naa.60a98000572d54724a34642d71325763 has been removed or is permanently inaccessible.
    cpu3:2132)ScsiDeviceIO: 2288: Cmd(0x4124403c1fc0) 0x9e, CmdSN 0xec86 to dev "naa.60a98000572d54724a34642d71325763" failed H:0x8 D:0x0 P:0x0
    cpu3:2132)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
    cpu2:2127)ScsiDeviceIO: 2316: Cmd(0x4124403c1fc0) 0x25, CmdSN 0xecab to dev "naa.60a98000572d54724a34642d71325763" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.
    cpu2:854568)WARNING: ScsiDeviceIO: 7330: READ CAPACITY on device "naa.60a98000572d54724a34642d71325763" from Plugin "NMP" failed. I/O error
    cpu2:854568)ScsiDevice: 1238: Permanently inaccessible device :naa.60a98000572d54724a34642d71325763 has no more open connections. It is now safe to unmount datastores (if any) and delete the device.
    cpu3:854577)WARNING: NMP: nmpDeviceAttemptFailover:562:Retry world restore device "naa.60a98000572d54724a34642d71325763" - no more commands to retry

All-Paths-Down (APD)

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Dead or Error.
  • All paths to the device are marked as Dead.
  • You are unable to connect directly to the ESXi host using the vSphere Client.
  • The ESXi host shows as Disconnected in vCenter Server.
  • The /var/log/vmkernel.log file shows messages similar to:

    cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.60a98000572d54724a34642d71325763" due to Not found
    cpu1:2049)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
    cpu1:2049)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60a98000572d54724a34642d71325763" - issuing command 0x4124007ba7c0
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...
    cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update...
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60a98000572d54724a34642d71325763" - issuing command 0x4124007ba7c0
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...
    cpu0:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60a98000572d54724a34642d71325763": awaiting fast path state update...

  • A restart of the management agents may show these errors:

    Not all VMFS volumes were updated; the error encountered was 'No connection'.
    Errors:
    Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.
    Error while scanning interfaces, unable to continue. Error was Not all VMFS volumes were updated; the error encountered was 'No connection'.

  • You may also see that the device is no longer listed:

    cpu17:10107)WARNING: Vol3: 1717: Failed to refresh FS 4beb089b-68037158-2ecc-00215eda1af6 descriptor: Device is permanently unavailable
    cpu17:10107)ScsiDeviceIO: 2316: Cmd(0x412442939bc0) 0x28, CmdSN 0x367bb6 from world 10107 to dev "eui.00173800084f0005" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    cpu17:10107)Vol3: 1767: Error refreshing PB resMeta: Device is permanently unavailable

Purpose

This article discusses Permanent Device Loss (PDL) and All-Paths-Down (APD) conditions in ESXi 5.x, and provides information on dealing with each of these scenarios.

Resolution

In vSphere 4.x, an All-Paths-Down (APD) situation occurs when all paths to a device are down. As there is no indication whether this is a permanent or temporary device loss, the ESXi host keeps reattempting to establish connectivity. APD-style situations commonly occur when the LUN is incorrectly unpresented from the ESXi/ESX host. The ESXi/ESX host, still believing the device is available, retries all SCSI commands indefinitely. This has an impact on the management agents, as their commands are not responded to until the device is again accessible. This causes the ESXi/ESX host to become inaccessible/not-responding in vCenter Server.

In vSphere 5.x, a clear distinction has been made between a device that is permanently lost (PDL) and a transient issue where all paths are down (APD) for an unknown reason.

For example, in the VMkernel logs, if a SCSI sense code of H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported is logged by the storage device to the ESXi 5.x host, this indicates that the device is permanently inaccessible to the ESXi host, or is in a Permanent Device Loss (PDL) state. The ESXi host no longer attempts to re-establish connectivity or issue commands to the device.

Devices that suffer a non-recoverable hardware error are also recognized as being in a Permanent Device Loss (PDL) state.

This table outlines possible SCSI sense codes that determine if a device is in a PDL state:

SCSI sense code | Description
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 | LOGICAL UNIT NOT SUPPORTED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0 | LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3 | LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1 | LOGICAL UNIT FAILURE
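The table above can be applied mechanically to a log line: extract the host/device/plugin status and the three "Valid sense data" bytes (sense key, ASC, ASCQ) and look them up. The Python sketch below encodes only the four rows of this table, so it is deliberately narrow; it also ignores lines that report "Possible sense data" instead of "Valid sense data", such as the H:0x1 line in the PDL symptoms above.

```python
import re
from typing import Optional

# The four PDL sense codes from the table above: (sense key, ASC, ASCQ) -> description.
PDL_SENSE_CODES = {
    (0x5, 0x25, 0x0): "LOGICAL UNIT NOT SUPPORTED",
    (0x4, 0x4C, 0x0): "LOGICAL UNIT FAILED SELF-CONFIGURATION",
    (0x4, 0x3E, 0x3): "LOGICAL UNIT FAILED SELF-TEST",
    (0x4, 0x3E, 0x1): "LOGICAL UNIT FAILURE",
}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(rf"H:{HEX} D:{HEX} P:{HEX} Valid sense data: {HEX} {HEX} {HEX}")


def pdl_reason(log_line: str) -> Optional[str]:
    """Return the PDL description if the line carries one of the table's codes, else None."""
    match = STATUS_RE.search(log_line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(value, 16) for value in match.groups())
    # Every row in the table has H:0x0 D:0x2 (CHECK CONDITION) P:0x0.
    if (host, device, plugin) != (0x0, 0x2, 0x0):
        return None
    return PDL_SENSE_CODES.get((key, asc, ascq))


if __name__ == "__main__":
    line = ('Path "vmhba3:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is '
            "permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.")
    print(pdl_reason(line))  # LOGICAL UNIT NOT SUPPORTED
```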

For more information about SCSI sense codes in vSphere, see Interpreting SCSI sense codes (289902).

Note: Some iSCSI arrays map LUN-to-Target as a one-to-one relationship. That is, there is only ever a single LUN per Target. In this case, the iSCSI arrays do not return the appropriate SCSI sense code, so PDL on these array types cannot be detected. However, in ESXi 5.1, enhancements have been made and now the iSCSI initiator attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects the host's effort to access the storage. Depending on the response from the array, the host can now mark the device as PDL.

All-Paths-Down (APD)

If PDL SCSI sense codes are not returned from a device (when unable to contact the storage array, or with a storage array that does not return the supported PDL SCSI codes), then the device is in an All-Paths-Down (APD) state, and the ESXi host continues to send I/O requests until the host receives a response.

As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:

  • Userworld I/O (hostd management agent)
  • Virtual machine guest I/O

    Note: If an I/O request is issued from a guest, the guest operating system should time out and abort the I/O.

Due to the nature of an APD situation, there is no clean way to recover.

  • The APD situation needs to be resolved at the storage array/fabric layer to restore connectivity to the host.
  • All affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.

Note: Performing a vMotion migration of unaffected virtual machines is not possible, as the management agents may be affected by the APD condition, and the ESXi host may become unmanaged. As a result, a reboot of an affected ESXi host forces an outage to all non-affected virtual machines on that host.

Planned versus unplanned PDL

A planned PDL occurs when there is an intent to remove a device presented to the ESXi host. The datastore must first be unmounted, then the device detached before the storage device can be unpresented at the storage array. For more information on how to correctly unpresent a LUN in ESXi 5.x, see Unmounting a LUN or detaching a datastore/storage device from multiple ESXi 5.x hosts (2004605).
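The order of a planned removal matters: unmount, detach, and only then unpresent on the array. The Python sketch below only generates the esxcli commands for one host in that order; it does not run anything. The esxcli commands are the ESXi 5.x equivalents as I understand them from KB 2004605 (verify against the KB for your build), and the datastore label and device ID in the usage line are hypothetical placeholders.

```python
import shlex


def planned_lun_removal_commands(datastore_label, device_id):
    """Ordered commands for a planned PDL on one ESXi 5.x host.

    The commands are only generated, not executed. Repeat the unmount and
    detach on every host that sees the LUN before unpresenting it.
    """
    label = shlex.quote(datastore_label)
    device = shlex.quote(device_id)
    return [
        # 1. Unmount the VMFS datastore (move VMs, templates and ISOs off it first).
        f"esxcli storage filesystem unmount -l {label}",
        # 2. Detach the device so the host stops issuing I/O to it.
        f"esxcli storage core device set --state=off -d {device}",
        # 3. Only now unpresent the LUN on the array (array-specific, not an esxcli step).
        "# unpresent the LUN on the storage array",
        # 4. Rescan so the host notices the LUN is gone.
        "esxcli storage core adapter rescan --all",
        # 5. Optionally remove the detached device's configuration permanently.
        f"esxcli storage core device detached remove -d {device}",
    ]


if __name__ == "__main__":
    for command in planned_lun_removal_commands("LUN01", "naa.60a98000572d54724a34642d71325763"):
        print(command)
```

Generating rather than executing the commands keeps the array-side step visible as a manual checkpoint between the detach and the rescan.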

An unplanned PDL occurs when the storage device is unexpectedly unpresented from the storage array without the unmount and detach being executed on the ESXi host.

In ESXi 5.5, VMware provides a feature called Auto-remove for automatic removal of devices during an unplanned PDL. For more information, see PDL AutoRemove feature in vSphere 5.5 (2059622).

To clean up an unplanned PDL:

  1. All running virtual machines from the datastore must be powered off and unregistered from the vCenter Server.
  2. From the vSphere Client, go to the Configuration tab of the ESXi host, and click Storage.
  3. Right-click the datastore being removed, and click Unmount.

    The Confirm Datastore Unmount window displays. When the prerequisite criteria have been passed, the OK button appears.

    If you see this error when unmounting the LUN:

    Call datastore refresh for object <name_of_LUN> on vCenter server <name_of_vCenter> failed

    You may have a snapshot LUN presented. To resolve this issue, remove that snapshot LUN on the array side.

  4. Perform a rescan on all of the ESXi hosts that had visibility to the LUN.

    Note: If there are active references to the device or pending I/O, the ESXi host still lists the device after the rescan. Check for virtual machines, templates, ISO images, floppy images, and raw device mappings which may still have an active reference to the device or datastore.

  5. If the LUN is still being used and available again, go to each host, right-click the LUN, and click Mount.

    Note: One possible cause of an unplanned PDL is that the LUN ran out of space, causing it to become inaccessible.
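Step 4 warns that the device stays listed while virtual machines, templates, ISO images, floppy images or RDMs still reference it. If you export such an inventory yourself, finding the leftovers is a prefix match, because vSphere datastore paths have the form "[datastore] folder/file". The Python sketch below assumes a hypothetical inventory of dictionaries with kind, name and path keys. It only covers file-backed items: an RDM whose raw LUN is the removed device itself must be checked by device ID instead.

```python
def references_to_datastore(datastore, inventory):
    """Return 'kind: name' for every inventory item whose file lives on the datastore.

    vSphere datastore paths look like '[datastore] folder/file', so a prefix
    match on '[datastore]' is enough for file-backed items.
    """
    prefix = f"[{datastore}]"
    return [f"{item['kind']}: {item['name']}" for item in inventory if item["path"].startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical inventory export; build it from your own tooling.
    inventory = [
        {"kind": "vm", "name": "web01", "path": "[LUN01] web01/web01.vmx"},
        {"kind": "template", "name": "w2k12-tpl", "path": "[LUN02] w2k12-tpl/w2k12-tpl.vmtx"},
        {"kind": "iso", "name": "app01 CD-ROM", "path": "[LUN01] iso/sql2014.iso"},
        {"kind": "rdm", "name": "sql01 disk 2", "path": "[LUN02] sql01/sql01_1.vmdk"},
    ]
    print(references_to_datastore("LUN01", inventory))  # ['vm: web01', 'iso: app01 CD-ROM']
```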

See Also