One of the problems firms have when they hire IT staff on the basis of their ‘connections’ (usually wealthy or influential relatives, or, if LinkedIn was involved, their drinking buddies) is that they often get dumb solutions implemented. Here’s one we came across recently and what we did to fix it permanently.
FX Traders on a busy trading floor use many programs simultaneously to display pricing, liquidity, market information, and other data necessary to trade effectively. They are provided with redundant workstations: Two high-powered PCs, with four monitors and two network connections each. So they sit with eight monitors in front of them and if one of their PCs should shut down or lose its network connection, they can continue working uninterrupted from the other.
Most traders had two third-party applications running constantly that, should certain conditions arise, would cause an IP multicast flood affecting the entire subnet they were on, effectively preventing the whole team from trading. The only known resolution was to identify which trader’s PC was causing the flood and have it rebooted.
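Identifying the offending PC meant working out which source address was generating the bulk of the multicast traffic. A minimal sketch of that triage step (the addresses and packet sample are invented for illustration; a real capture would come from a tool such as tcpdump):

```python
from collections import Counter

def top_multicast_talker(packets):
    """Given (src_ip, dst_ip) pairs from a capture, return the source
    sending the most multicast traffic (IPv4 multicast is 224.0.0.0/4,
    i.e. first octet 224-239)."""
    counts = Counter(
        src for src, dst in packets
        if 224 <= int(dst.split(".")[0]) <= 239
    )
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical capture sample: one host is clearly flooding.
sample = [("10.1.1.5", "239.1.1.1")] * 500 + [("10.1.1.7", "239.1.1.1")] * 3
print(top_multicast_talker(sample))  # → 10.1.1.5
```

The point is only that the diagnosis is mechanical once you have per-source counts; the painful part was doing it by hand, mid-outage, once a week.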
This occurred on average once per week and caused around 10–15 minutes of downtime for the affected trading pit. While that doesn’t sound like much, for a large investment house it can cost tens of thousands of dollars in missed trading opportunities, particularly when it occurs on a busy desk such as GBP/USD (Cable), during a busy market period such as the NFP release, or when stops are triggered.
It also led to frustrated traders and clients who expect more reliable IT systems.
Enter The IT Guy
The man who led the support team for the applications that caused this problem was not an IT guy, in spite of how he may have been addressed. He had a second-class honours degree in electrical engineering from a provincial university, had worked with computers once before, had an internet certificate with “agile” written on it, and, all-importantly, had an uncle with fifty million quid deposited in a wealth management account at the bank.
So we can see why they hired him. What we can’t figure out is why they listened to him, let alone put him in charge of a critical IT team.
The First Solution
His solution, which was clearly thought through to the best of his ability, was to “upgrade the traders’ PCs to have 1 Gb/s network cards”. This doesn’t even address the problem of the multicast flood but, as his thinking went, it might hide it, and it involved upgrading hardware to the latest cool in-thing. It was implemented. After the first two weeks there had been no repeat of the outages and he was congratulated.
After a month or so the multicast floods were seen again, only now they were filling the entire 1 Gb/s segment that his solution had required be created, and they were affecting more than a single trading team – they were affecting the entire floor. His “solution” had merely masked the problem, allowing it to fester undetected until it caused a worse outage than before, albeit less frequently.
The net effect of his changes was the same amount of downtime, occurring less frequently but affecting more users for longer periods, at the cost of purchasing and installing several hundred gigabit network cards (which connected to a 100 Mb/s backbone and only ran at 1 Gb/s on the local subnets anyway).
The Second Solution
The IT Guy then came up with a second solution, a political one: Traders should be told to run the conflicting applications on separate PCs. They have two at their desks after all.
This was rejected by the business on the grounds that such a policy invalidates the active-active fail-over redundancy that the dual-PC set-up is intended to provide.
The Final Solution
That is when we were called in.
Our solution involved rolling back the network redesign to give each trading pit the two subnets it was originally attached to. This simply required rolling back the routing configuration to its state prior to solution one, followed by a small reconfiguration of the local PC and application start-up files.
No new hardware, code, or cabling was required. It took a few hours to implement, was rolled-out over the weekend, and it solved the broadcast flood issue permanently.
Here’s what we did.
We noted that originally the two network cards in each PC were on separate subnets. Our analysis of the two applications involved in the flood revealed that one was set up to use NetBIOS, the other to use only TCP/IP. The NetBIOS application’s README files clearly state that it is unsafe to run it if there are any other multicast applications on the same subnet.
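The “separate subnets” condition is trivial to verify programmatically. A sketch using Python’s standard `ipaddress` module (the interface addresses are invented for illustration):

```python
import ipaddress

def same_subnet(if_a: str, if_b: str) -> bool:
    """True if two interfaces (given as 'address/prefix') share a subnet,
    meaning multicast from one can reach applications on the other."""
    return (ipaddress.ip_interface(if_a).network ==
            ipaddress.ip_interface(if_b).network)

# Hypothetical trader PC: NIC 1 and NIC 2 on different /24s.
print(same_subnet("10.1.1.5/24", "10.1.2.5/24"))  # → False (isolated)
print(same_subnet("10.1.1.5/24", "10.1.1.9/24"))  # → True  (conflict)
```

With the original two-subnet design restored, the same check confirms the two cards no longer share a broadcast domain.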
Thus the solution presents itself: Protocol segmentation.
We asked the desktop support team to disable NetBIOS over TCP/IP on one of the two network cards in each trader’s PC. The application deployment teams were asked to configure their applications to bind only to the appropriate network card. With these changes the two applications were on separate subnets and could no longer conflict.
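Pinning an application to one card, in socket terms, means binding to that card’s address rather than the wildcard address and letting the OS choose. A minimal sketch of the idea (the address and port are placeholders, and loopback stands in for a real NIC address):

```python
import socket

def open_on_nic(local_ip: str, port: int) -> socket.socket:
    """Open a UDP socket bound to a specific NIC's address, so the
    application sends and receives only via that interface's subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((local_ip, port))  # a specific address, not "0.0.0.0"
    return sock

# e.g. the TCP/IP-only application pinned to the second card:
sock = open_on_nic("127.0.0.1", 0)
print(sock.getsockname()[0])  # → 127.0.0.1
sock.close()
```

How each vendor exposed this binding in their start-up files varied, but the underlying change is the one above: replace a wildcard bind with an explicit per-interface one.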
Simple, small changes that could be implemented and tested in a day, and that isolated the two misbehaving applications from each other. All that was required was in-depth, rather than superficial, knowledge of how IT systems work, how they scale, and the willingness to lead disparate IT teams to work together on solving problems.
No budget was required and no hardware needed to be installed. Small, easily verifiable changes to managed configurations of existing systems were all that was needed. Since roll-out, no more problems have been reported.
That’s what you pay the extra for when you hire IT professionals: solutions that work permanently, as opposed to those that merely make it through sign-off then fail again shortly after your IT Guy has been paid.