This post is part of my VMware VCIX-NV Study Guide and covers some troubleshooting tips for common NSX connectivity issues.
Documentation
Index
- Troubleshoot virtual machine connectivity to Logical Switches
- Troubleshoot dynamic routing protocols
- Troubleshoot Virtual Private Networks (VPNs)
- Troubleshoot VXLAN, VTEP, and VNI configuration and connectivity
Troubleshoot virtual machine connectivity to Logical Switches
I’m not really sure what VMware means by this subject. Several things can fall under it, such as VXLAN connectivity problems that prevent a virtual machine from communicating across the logical switch. We’ll cover these in the other topics.
Troubleshoot dynamic routing protocols
The NSX Edge can use OSPF, BGP and IS-IS for dynamic routing with other network components (other Edges or physical devices). Below are some troubleshooting tips for dynamic routing:
Show active neighbors
vShield-edge-12-0> show ip ospf neighbor
Neigbhor ID    Priority    Address         Dead Time    State
1.1.1.1        128         192.168.99.1    36           Full/DR
vShield-edge-2-0> show ip bgp neighbors
vShield-edge-2-0> show isis neighbors
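If you collect the neighbor output from several Edges, a small script can point out adjacencies that never reached the Full state. The following is a minimal sketch and not an NSX tool: the function name `non_full_neighbors` is made up for this example, the column layout is assumed to match the output above, and the second neighbor (2.2.2.2) is a hypothetical row added to show a failing case.

```python
def non_full_neighbors(output: str) -> list[tuple[str, str]]:
    """Return (neighbor_id, state) pairs whose adjacency is not Full."""
    problems = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: neighbor ID, priority, address, dead time, state.
        # The header and prompt lines have no numeric second field, so skip them.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        neighbor_id, state = fields[0], fields[4]
        if not state.startswith("Full"):
            problems.append((neighbor_id, state))
    return problems


# First data row is taken from the output above; the second row is hypothetical.
sample = """\
Neigbhor ID    Priority    Address         Dead Time    State
1.1.1.1        128         192.168.99.1    36           Full/DR
2.2.2.2        128         192.168.99.2    38           ExStart/BDR
"""

print(non_full_neighbors(sample))
```

Keep in mind that on a broadcast segment two routers that are neither DR nor BDR normally stay in the 2-Way state with each other, so not every non-Full entry is a fault.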
Show installed dynamic routes
vShield-edge-2-0> sh ip route bgp
vShield-edge-2-0> sh ip route isis
vShield-edge-12-0> show ip route ospf
Codes: O - OSPF derived, i - IS-IS derived, B - BGP derived,
C - connected, S - static, L1 - IS-IS level-1, L2 - IS-IS level-2,
IA - OSPF inter area, E1 - OSPF external type 1, E2 - OSPF external type 2,
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
O E2 1.1.1.0/24 [110/0] via 192.168.99.1
O E2 2.2.2.0/24 [110/0] via 192.168.99.1
O E2 10.1.5.0/24 [110/0] via 192.168.99.1
O E2 192.168.1.0/24 [110/0] via 192.168.99.1
Show interfaces listening for neighbors
vShield-edge-2-0> show ip ospf interface
vNic_3 is activated
  Internet Address 192.168.99.1, Network Mask 255.255.255.0, Area 0.0.0.0
  Transmit Delay is 1 sec, Network Type BROADCAST, State DR, Priority 128
  Designated Router's Interface Address 192.168.99.1
  Backup Designated Router's Interface Address 0.0.0.0
  Timer intervals configured, Hello 10, Dead 40, Retransmit 5
vShield-edge-2-0> show isis interface
Show link state database for OSPF/ISIS
vShield-edge-2-0> show isis database
vShield-edge-2-0> show ip ospf database
Detect Authentication Failure
vShield-edge-12-0> show log reverse
2015-01-31T12:38:18+00:00 vShield-edge-12-0 routing[876]: [user.info] AUDIT 0x3e02-39 (0000): OSPF 1 Packet received with unexpected authentication type 1.
Detect OSPF Area Misconfiguration
vShield-edge-12-0> show log reverse
2015-01-31T12:40:58+00:00 vShield-edge-12-0 routing[876]: [user.emerg] EXCEPTION 0x3e01-110 (0000): OSPF 1 OSPF packet dropped because it was received on non-existent or inactive virtual or sham link
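When the log is long, it can help to filter it for the two messages shown above. This is a rough sketch that assumes you have saved the output of “show log reverse” as text; the `PATTERNS` table and the function name `classify_routing_log` are invented for this example, and the message fragments are copied from the log lines in this post, so other NSX versions may word them differently.

```python
# Message fragments copied from the log lines in this post, mapped to the
# problem they pointed to there. Other releases may use different wording.
PATTERNS = {
    "unexpected authentication type": "OSPF authentication mismatch",
    "non-existent or inactive virtual or sham link": "possible OSPF area mismatch",
}


def classify_routing_log(lines):
    """Return (label, line) pairs for every log line matching a known fragment."""
    findings = []
    for line in lines:
        for fragment, label in PATTERNS.items():
            if fragment in line:
                findings.append((label, line.strip()))
    return findings


log_lines = [
    "2015-01-31T12:38:18+00:00 vShield-edge-12-0 routing[876]: [user.info] AUDIT 0x3e02-39 (0000): OSPF 1 Packet received with unexpected authentication type 1.",
    "2015-01-31T12:40:58+00:00 vShield-edge-12-0 routing[876]: [user.emerg] EXCEPTION 0x3e01-110 (0000): OSPF 1 OSPF packet dropped because it was received on non-existent or inactive virtual or sham link",
]

for label, line in classify_routing_log(log_lines):
    print(label, "->", line[:25])
```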
Debugging
In addition to using show commands to determine the status of IS-IS, OSPF or BGP, you can also debug the protocols to get a lot more information about what the processes are doing in the background. To start the debugging process:
vShield-edge-12-0> debug ip ospf
vShield-edge-12-0> debug ip bgp
vShield-edge-12-0> debug isis
When enabled, the log will fill up with messages from the protocol. You should never leave this running continuously; always disable it when you’re done. To stop the debugging process:
vShield-edge-12-0> no debug ip ospf
vShield-edge-12-0> no debug ip bgp
vShield-edge-12-0> no debug isis
A sample of the debug messages OSPF logs when establishing a neighbor relationship:
2015-01-31T12:46:58+00:00 vShield-edge-12-0 routing[876]: [user.info] AUDIT 0x3e01-226 (0000): OSPF 1 i/f idx 0X00000004 rtr ID 1.1.1.1 IP addr 192.168.99.1 neighbor FSM has processed an input.
2015-01-31T12:46:58+00:00 vShield-edge-12-0 routing[876]: [user.info] AUDIT 0x3e01-200 (0000): OSPF 1 Database exchange with an adjacent OSPF neighbor has been completed.
2015-01-31T12:46:58+00:00 vShield-edge-12-0 routing[876]: [user.info] AUDIT 0x3e01-226 (0000): OSPF 1 i/f idx 0X00000004 rtr ID 1.1.1.1 IP addr 192.168.99.1 neighbor FSM has processed an input.
Troubleshoot Virtual Private Networks (VPNs)
VPNs can be tricky, especially between two vendors. So when you’re configuring them, you should know where to look if one doesn’t come up.
The NSX Edge keeps logs of the events, which are stored in /var/log/messages. The contents can be viewed through the command “show log”. You can either check that or check the central syslog facility, if you have one. The following log lines are taken from the output of “show log reverse”.
Phase 1 or 2 Policy Mismatch
When the VPN on the NSX Edge hangs in the “STATE_MAIN_I1” state, there’s something wrong with the Phase 1 or 2 negotiation. Look for the connection name (“s1-c1” in this example) and “NO_PROPOSAL_CHOSEN” in the logs:
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | got payload 0x800(ISAKMP_NEXT_N) needed: 0x0 opt: 0x0
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | ***parse ISAKMP Notification Payload:
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | next payload type: ISAKMP_NEXT_NONE
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | length: 96
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | DOI: ISAKMP_DOI_IPSEC
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | protocol ID: 0
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | SPI size: 0
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: | Notify Message Type: NO_PROPOSAL_CHOSEN
Jan 31 13:11:35 gw-vpn01 ipsec[6769]: "s1-c1" #1: ignoring informational payload, type NO_PROPOSAL_CHOSEN msgid=00000000
Pre-Shared Key Mismatch
When the PSK does not match, the log shows an “INVALID_ID_INFORMATION” notification after “Quick Mode” has been initiated for the information exchange.
Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #1: transition from state STATE_MAIN_I3 to state STATE_MAIN_I4
Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #1: STATE_MAIN_I4: ISAKMP SA established {auth=OAKLEY_PRESHARED_KEY cipher=oakley_3des_cbc_192 prf=oakley_sha group=modp1024}
Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #1: Dead Peer Detection (RFC 3706): enabled
Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #2: initiating Quick Mode PSK+ENCRYPT+TUNNEL+PFS+UP+SAREFTRACK {using isakmp#1 msgid:e8add10e proposal=3DES(3)_192-SHA1(2)_160 pfsgroup=OAKLEY_GROUP_MODP1024}
Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #1: ignoring informational payload, type INVALID_ID_INFORMATION msgid=00000000
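The two notify types above can be pulled out of a saved log per connection. The sketch below is only an illustration: the regular expression is written against the “ignoring informational payload” lines shown in this post, the `HINTS` mapping simply repeats the interpretation given in the two sections above, and the names `NOTIFY`, `HINTS` and `vpn_hints` are made up for the example.

```python
import re

# Matches lines such as:
#   "s1-c1" #1: ignoring informational payload, type NO_PROPOSAL_CHOSEN msgid=00000000
NOTIFY = re.compile(
    r'"(?P<conn>[^"]+)" #\d+: ignoring informational payload, type (?P<type>\w+)'
)

# Interpretation as described in this post; not an exhaustive list.
HINTS = {
    "NO_PROPOSAL_CHOSEN": "Phase 1 or 2 policy mismatch",
    "INVALID_ID_INFORMATION": "pre-shared key mismatch",
}


def vpn_hints(log_text: str) -> dict[str, list[tuple[str, str]]]:
    """Group (notify_type, hint) pairs by connection name."""
    hints: dict[str, list[tuple[str, str]]] = {}
    for match in NOTIFY.finditer(log_text):
        notify_type = match.group("type")
        hint = HINTS.get(notify_type, "unrecognised notify type")
        hints.setdefault(match.group("conn"), []).append((notify_type, hint))
    return hints


log_text = (
    'Jan 31 13:11:35 gw-vpn01 ipsec[6769]: "s1-c1" #1: ignoring informational '
    "payload, type NO_PROPOSAL_CHOSEN msgid=00000000\n"
    'Jan 31 13:15:00 gw-vpn01 ipsec[3855]: "s1-c1" #1: ignoring informational '
    "payload, type INVALID_ID_INFORMATION msgid=00000000\n"
)

for conn, found in vpn_hints(log_text).items():
    for notify_type, hint in found:
        print(conn, notify_type, "->", hint)
```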
Troubleshoot VXLAN, VTEP, and VNI configuration and connectivity
MTU Size
VXLAN requires you to set a larger MTU size. The recommended size is 1600. From the ESXi CLI you can check whether the VXLAN stack has issues and whether the correct MTU has been configured on the ESXi host uplinks by sending a ping through the VXLAN network stack with the don’t-fragment bit set (-d) and a large payload size (-s):
~ # ping ++netstack=vxlan -d -s 1572 -I vmk3 192.168.99.103
PING 192.168.99.103 (192.168.99.103): 1572 data bytes
1580 bytes from 192.168.99.103: icmp_seq=0 ttl=64 time=1.108 ms
1580 bytes from 192.168.99.103: icmp_seq=1 ttl=64 time=3.246 ms
If the remote ESXi host (192.168.99.103 here, a different host from the one running the test) does not respond, retry with a smaller packet size such as 1472. If it responds to the smaller ping, the larger MTU is not configured correctly.
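The payload sizes 1572 and 1472 are not arbitrary: the value passed to -s is the ICMP payload, so the IPv4 header (20 bytes, assuming no IP options) and the ICMP header (8 bytes) have to be subtracted from the MTU you want to test. A small sketch of that arithmetic, with a made-up helper name:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(ping_payload_for_mtu(1600))  # 1572, the size used for the VXLAN test
print(ping_payload_for_mtu(1500))  # 1472, the fallback size for a standard MTU
```

This also explains the “1580 bytes from” lines in the reply: 1572 bytes of payload plus the 8-byte ICMP header.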
If it does respond but VXLAN issues persist, zoom in on the communication between the controller and the ESXi hosts. First, get the ID of the logical switch you’re having issues with (through the GUI or CLI), then log in to a controller to see whether your ESXi hosts are connected to the controller for this logical switch:
nsx-controller # show control-cluster logical-switches connection-table 5003
Host-IP           Port     ID
192.168.99.103    43261    1
192.168.99.104    42155    2
If that looks okay, check whether the ESXi hosts have registered as VTEPs with the controller:
nsx-controller # show control-cluster logical-switches vtep-table 5003
VNI     IP                Segment         MAC                  Connection-ID
5003    192.168.99.103    192.168.99.0    00:50:56:63:18:db    1
5003    192.168.99.104    192.168.99.0    00:50:56:66:08:fe    2
If there are no VTEPs registered, there might be an issue with multicast on the network (if configured). If you’ve discovered that the ESXi hosts have registered as VTEPs, check whether any MAC addresses of virtual machines are registering with the controller for the logical switch:
nsx-controller # show control-cluster logical-switches mac-table 5003
VNI     MAC                  VTEP-IP           Connection-ID
5003    00:50:56:bc:21:ab    192.168.99.103    1
5003    00:50:56:ed:1a:bc    192.168.99.104    2
If there are no MAC addresses present, multicast (if configured) might be the culprit. If everything looks fine and you still don’t have connectivity, start checking firewalls. 😉
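The three controller checks can be combined into one comparison against the list of hosts you expect to see. This is a sketch under a few assumptions: the tables are pasted as text without the prompt line, the column names are the ones shown in the output above, and `column` and `missing_hosts` are names invented for this example rather than anything NSX provides.

```python
def column(table: str, name: str) -> set[str]:
    """Collect the values of one named column from whitespace-separated CLI output."""
    rows = [line.split() for line in table.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    idx = header.index(name)
    return {row[idx] for row in data if len(row) > idx}


def missing_hosts(expected, connection_table, vtep_table, mac_table):
    """Report which expected host IPs are absent from each controller table."""
    wanted = set(expected)
    return {
        "connection-table": sorted(wanted - column(connection_table, "Host-IP")),
        "vtep-table": sorted(wanted - column(vtep_table, "IP")),
        "mac-table": sorted(wanted - column(mac_table, "VTEP-IP")),
    }


connection_table = """\
Host-IP           Port     ID
192.168.99.103    43261    1
192.168.99.104    42155    2
"""
vtep_table = """\
VNI     IP                Segment         MAC                  Connection-ID
5003    192.168.99.103    192.168.99.0    00:50:56:63:18:db    1
5003    192.168.99.104    192.168.99.0    00:50:56:66:08:fe    2
"""
mac_table = """\
VNI     MAC                  VTEP-IP           Connection-ID
5003    00:50:56:bc:21:ab    192.168.99.103    1
"""

hosts = ["192.168.99.103", "192.168.99.104"]
print(missing_hosts(hosts, connection_table, vtep_table, mac_table))
```

A host missing from the mac-table is only a problem if it actually runs a powered-on virtual machine on that logical switch, so treat that last list as a hint and not as proof of a fault.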