HOWTO: Troubleshoot JGroups and Multicast IP Issues

Introduction

Most, if not all, of the information in this HOWTO is based on issues that occurred while running JBoss and JGroups in production over many years.  JGroups and multicast IP issues are not all that common, but when they do occur, enough time has usually passed since the previous incident that nobody remembers how it was resolved.

Before Beginning

Before you even get started with JGroups and multicast IP, talk to one of your network engineers to figure out a proper multicast address for your cluster.  On an IPv4 network, the multicast range runs from 224.0.0.0 to 239.255.255.255, but not every address in it is safe, so ask your network administrator which address to use for your multicast traffic.  Simply picking 224.0.0.0, for example, would be a terrible choice.  The Wikipedia page on multicast addresses highlights other addresses that would be bad choices for your JGroups cluster address.  In some cases, your network engineering team has spent considerable time designing the network for multicast traffic and may already have a block of addresses set aside for applications to use as group addresses.
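As a rough first filter before that conversation, the sketch below sorts an IPv4 address into a few broad buckets. It relies on two facts about the IPv4 multicast space: 224.0.0.0 through 224.0.0.255 is reserved for local network control traffic, and 239.0.0.0 through 239.255.255.255 is the administratively scoped block that sites normally allocate from. The function name and the category labels are made up for this example, and the result is no substitute for an address assigned by your network team.

```shell
# classify_mcast_addr ADDR
# Prints a rough category for a dotted-quad IPv4 address.
# The function name and the labels it prints are hypothetical.
classify_mcast_addr() {
    addr=$1
    # Reject anything that is not digits and dots before splitting.
    case $addr in
        ''|*[!0-9.]*) echo "invalid"; return 1 ;;
    esac
    # Split the address into octets on the dots.
    old_ifs=$IFS
    IFS=.
    set -- $addr
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then
        echo "invalid"; return 1
    fi
    for octet in "$@"; do
        case $octet in
            ''|????*) echo "invalid"; return 1 ;;
        esac
        if [ "$octet" -gt 255 ]; then
            echo "invalid"; return 1
        fi
    done
    if [ "$1" -lt 224 ] || [ "$1" -gt 239 ]; then
        echo "not-multicast"              # outside 224.0.0.0 - 239.255.255.255
    elif [ "$1" -eq 224 ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ]; then
        echo "local-network-control"      # 224.0.0.x: routing protocols etc., avoid
    elif [ "$1" -eq 239 ]; then
        echo "administratively-scoped"    # 239.x.x.x: the usual place for site-local groups
    else
        echo "other-multicast"            # valid multicast, but check for reserved blocks
    fi
}
```

An address that lands in the local-network-control bucket, such as 224.0.0.0, is exactly the kind of choice the paragraph above warns against.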

JGroups ships with two useful utilities for testing whether or not two nodes can communicate via multicast: "McastReceiverTest" and "McastSenderTest". The syntax is relatively straightforward. On one machine, copy the jgroups jar file to a directory you have write access to, then start the receiver by running the following from a console window:

export JAVA_HOME=<PATH_TO_YOUR_JAVA_HOME>
$JAVA_HOME/bin/java -cp jgroups-all.jar org.jgroups.tests.McastReceiverTest -mcast_addr <NETADMIN_PROVIDED_MULTICAST_IP_ADDR> -port <ANY_HIGH_PORT_NOT_IN_USE>

On the other machine, you'll execute the McastSenderTest using the same IP address and port number used on the receiver. Syntax for starting the sender is:

export JAVA_HOME=<PATH_TO_YOUR_JAVA_HOME>
$JAVA_HOME/bin/java -cp jgroups-all.jar org.jgroups.tests.McastSenderTest -mcast_addr <NETADMIN_PROVIDED_MULTICAST_IP_ADDR> -port <ANY_HIGH_PORT_NOT_IN_USE>

That's it. In the console window running your sender test, type anything and press Enter. What you typed should show up in the receiver's console window almost immediately, without noticeable delay. Repeat the sender test on every node that will be part of your cluster. If every node's output reaches the receiver, multicast IP is working fine. If nothing shows up in the receiver console, something is wrong; see the next section.
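Because the receiver and sender commands differ only in the test class, a small wrapper can reduce typos when you repeat the test across many nodes. The sketch below only prints the command line so that you can inspect it before running it. The function name mcast_test_cmd and the JGROUPS_JAR variable are inventions for this example, the address and port in the usage comments are placeholders, and it assumes the java binary lives under $JAVA_HOME/bin.

```shell
# mcast_test_cmd ROLE ADDR PORT
# Prints the java command line for the JGroups multicast receiver or sender test.
# mcast_test_cmd and JGROUPS_JAR are hypothetical names used only in this sketch.
mcast_test_cmd() {
    role=$1
    addr=$2
    port=$3
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    case $role in
        receiver) class=org.jgroups.tests.McastReceiverTest ;;
        sender)   class=org.jgroups.tests.McastSenderTest ;;
        *) echo "role must be 'receiver' or 'sender'" >&2; return 1 ;;
    esac
    # Only accept a numeric, unprivileged ("high") port.
    case $port in
        ''|*[!0-9]*|??????*) echo "port must be a number" >&2; return 1 ;;
    esac
    if [ "$port" -lt 1024 ] || [ "$port" -gt 65535 ]; then
        echo "port must be between 1024 and 65535" >&2
        return 1
    fi
    echo "$JAVA_HOME/bin/java -cp ${JGROUPS_JAR:-jgroups-all.jar} $class -mcast_addr $addr -port $port"
}

# Example (placeholder address and port; use the ones your network team gave you):
#   mcast_test_cmd receiver 239.1.2.3 45566    # on the first machine
#   mcast_test_cmd sender   239.1.2.3 45566    # on every other node
```

Printing rather than executing keeps the helper safe to experiment with; paste the printed line into the console once it looks right.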

Multicast IP not working using the JGroups tests

In every case I've come across where JBoss or JGroups clusters won't start or work, with the one exception noted in the next section, the cause has been related to network configuration. The hard part is convincing your network administrators that there is an actual problem, so a basic understanding of how your server is connected to the network can be very helpful in determining the cause.

Many servers in datacenters use redundant NICs configured in a bond (also called NIC teaming). Some popular configurations call for fail-on-fault or fault-tolerance mode. This means you have two NICs bonded as one: one is live, and the other is in standby and only becomes live if the primary NIC (or the switch port it connects to) fails. In a well-designed setup, your primary NIC is connected to a port on the primary switch and your standby NIC to a port on a standby switch.  This provides redundancy in the event of NIC failure, switch port failure, or switch failure. It is very important to understand that there is also a switch (or set of switches) connecting the primary and standby switches, and that multicast IP traffic needs to be able to traverse all of them.  One failure we experienced in the past had exactly this cause:  Server A was plugged into Switch A and Server B into Switch B, but Switch C, which connected Switch A to Switch B, was not allowing multicast traffic.
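When working through a scenario like this, it helps to know which NIC in the bond is currently live, since that tells you which switch the server's traffic is actually going through. On Linux hosts that use the kernel bonding driver, the status file under /proc/net/bonding contains a "Currently Active Slave" line. The sketch below extracts it. The function name is made up, and bond0 is only an assumed default; your bond may have a different name.

```shell
# bond_active_slave [STATUS_FILE]
# Prints the NIC that is currently live in an active-backup bond.
# The function name is hypothetical; bond0 is an assumed default bond name.
bond_active_slave() {
    bond_file=${1:-/proc/net/bonding/bond0}
    if [ ! -r "$bond_file" ]; then
        echo "cannot read $bond_file" >&2
        return 1
    fi
    # Keep only the interface name from the "Currently Active Slave: ethX" line.
    sed -n 's/^Currently Active Slave: *//p' "$bond_file"
}
```

Running it on each cluster node and noting which switch each active NIC is cabled to gives your network engineer a concrete path to check for multicast forwarding.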

Cisco switches commonly have IGMP snooping enabled, either by default (as on Nexus devices) or because an administrator turned it on. If you have servers in the same VLAN and multicast IP is not working (and you are running a Cisco-powered network, as most people do), then this Cisco article can help. The simplest fix is its solution 5, disabling IGMP snooping on that VLAN, but your network engineer needs to understand the impact before turning it off: without snooping, multicast traffic is flooded to every port in that VLAN as if it were broadcast traffic, and the volume can be very large. Whichever solution is used, it can be easily validated with the JGroups multicast test tools.
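Before escalating to the network team, it can be worth confirming that the host itself has joined the group, since a snooping switch only forwards group traffic to ports where it has seen a membership report. The sketch below assumes a Linux host with the iproute2 tools, where "ip -4 maddr show" lists joined groups on lines beginning with "inet". The function name is made up, and it reads that listing from standard input.

```shell
# group_joined GROUP_ADDR
# Reads "ip -4 maddr show" output on stdin and succeeds (exit 0) if GROUP_ADDR
# appears as a joined IPv4 group. The function name is hypothetical.
group_joined() {
    awk -v group="$1" '
        $1 == "inet" && $2 == group { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example, run while the receiver test is up (placeholder address):
#   ip -4 maddr show | group_joined 239.1.2.3 && echo "host has joined the group"
```

If the host has joined the group and the sender's messages still never arrive, that points away from the server and toward the switches in between.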

JGroups/Multicast was working, now it isn't

If servers have been moved to new VLANs, see the previous section.

I ran across this problem when my nodes were recently provided with new IP addresses as part of a datacenter move:

UDP.createSockets(): cannot list on any port in range 0-1

This is *not* a multicast IP problem.  The problem here is that the address bound to your NIC is not the same as the address listed for your hostname in /etc/hosts.  I've never seen this problem occur on a Windows box, only on Linux boxes.  Check the IP address bound to your NIC (or have your systems admin do it) and compare it with the entry for that server's hostname in /etc/hosts.  If you are getting this error on JBoss or JGroups startup, the two will *not* match; fix the /etc/hosts entry so that they do.  This is the only case that I've come across in the last three or four years where a JGroups-related exception was not caused by a network configuration problem.
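The comparison described above can be partly scripted. The sketch below looks up the address that a hosts file lists for a given hostname, so you can set it next to the addresses actually bound to the NICs. The function name is made up for this example, and the usage comments assume a Linux host with the iproute2 "ip" command.

```shell
# hosts_entry_ip HOSTNAME [HOSTS_FILE]
# Prints the first IP address that HOSTS_FILE (default /etc/hosts) lists for HOSTNAME.
# Prints nothing if the hostname has no entry. The function name is hypothetical.
hosts_entry_ip() {
    name=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }    # ignore comments
        {
            # Field 1 is the address; the remaining fields are names and aliases.
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$hosts_file"
}

# Example:
#   hosts_entry_ip "$(hostname)"    # address /etc/hosts gives for this host
#   ip -4 -o addr show              # addresses actually bound to the NICs
```

If the address printed by the first command does not appear anywhere in the output of the second, you have most likely found the mismatch behind the error.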


Creative Commons Attribution-ShareAlike 3.0 Unported