Why are network engineers so bitter about managing virtualization?
Rivka Gewirtz Little
Published: 13 Apr 2012
Network engineers are tired of being viewed as plumbers, especially when it comes to managing virtualization. After all, the job of supporting virtualized traffic goes much deeper than providing an always-available pipe. Systems teams understand the complexity of a virtualized environment, but don’t always see the network admin’s role in managing the virtual network. The split results in ineffective troubleshooting strategies and in network architectures that don’t always serve a virtualized environment well.
Virtualization architect Bob Plankers recognized that problem among his own ranks at a large Midwestern university and set out to change things by opening up conversation, and management tools, between the two teams. The result? A new network architecture and an effective approach to managing virtualization.
Is there really a disconnect between networking and systems folks when it comes to managing virtualization?
Bob Plankers: Absolutely. Virtualization or systems people don’t include the network guys in what’s going on. In traditional data center models, when workloads stayed in one place and things were static, [the networking team was aware], but having the ability to do vMotion, moving VMs around in a data center without any notice, is kind of disturbing to them. I don’t want to liken network guys to plumbers, but if they’re maintaining pipes, and all of a sudden you’ve moved a bunch of water flow from one place to another, they’re wondering what’s going on. Some network guys don’t understand what is possible with virtualization or what their systems guys are doing.
But systems guys don’t understand why network guys care. They just look at the network as a pipe. They say, ‘Well, there’s one across my data center so I’ll put my ESX hosts there, and I’ll put [one] of my hosts here,’ and they have no concept of the infrastructure that’s required to connect the switches and how much bandwidth there is. They just see it as this always-on service, which I guess is a credit to network guys in general, but at the same time the two need to talk about what’s going on.
There has to be concern among systems guys about whether there’s enough capacity, though, right?
Plankers: Yes, there should be. And there are two types of capacity with virtualization. There’s outward facing, from a VM that generates traffic as a server on the network, and then there’s the vMotion and intra-cluster communications within a VMware cluster. The vMotion is really taxing on a network. You take a physical host with 256 gigs of RAM and you want to copy that 256 gigs of RAM to another host as quickly as possible. That drives quite a lot of traffic. Also, VMware has pretty specific limits on how much latency there can be between ESX hosts when you’re vMotioning things. You can’t have a router in between.
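To put rough numbers on that: copying M gigabytes over a link of B gigabits per second takes at least 8M/B seconds. The worked figures below assume decimal units and ideal line rate, and they ignore protocol overhead and the memory pages a running VM keeps dirtying (which must be re-sent), so a real migration takes longer than this lower bound.

```latex
\begin{aligned}
t_{\min} &= \frac{8M}{B} && (M \text{ in GB},\ B \text{ in Gb/s}) \\
t_{\min}(256\ \text{GB},\ 1\ \text{Gb/s}) &= \frac{8 \times 256}{1} = 2048\ \text{s} \approx 34\ \text{min} \\
t_{\min}(256\ \text{GB},\ 10\ \text{Gb/s}) &= \frac{8 \times 256}{10} = 204.8\ \text{s} \approx 3.4\ \text{min}
\end{aligned}
```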
The problem [becomes] whether the virtualization guys talked to the network guys when they were designing their stuff, or did they just plop it on the network? In a lot of cases [the environment] grew organically, so you had one or two virtualization hosts and you thought, ‘Hey, this is pretty cool. I just saved a bunch of money.’ So you added a third and then a fourth, but then you didn’t have any room, so they’re scattered all over the data center.
What does your own environment look like?
Plankers: We have all Dell servers: just rack mounts, not blades. And we’re also a Cisco shop for the network. I’ve got two VMware vSphere clusters. One has 10 machines in it and the other has eight, and together they serve as the physical hosts for about 500 virtual machines.
That’s a big environment. Do you have a communications problem with the networking team?
Plankers: Yes, but there was a Networking Tech Field Day (a conference of networking bloggers) last August where I was the only systems guy in the room with 11 other network guys. One of the Force10 guys was going on about how systems guys get up in the morning and do all this vMotion crap and he’s like, ‘I don’t know why they do it.’ So I raised my hand and said, ‘Would you like to know?’ It became very clear in that moment that network guys have no idea why systems guys do things, and they’re a little bitter about not being included sometimes. They’re bitter about being seen as plumbers.
I realized that I needed to start talking to my network guys. As a result, we’re starting a project right now where we’re [upgrading the] 1 gig connections to virtualization hosts. Hosts keep getting bigger: you’re trying to vMotion VMs off a host with 256 gigs of RAM, or 512 gigs of RAM. One of the things about virtualization is that it pays to have fewer, larger hosts rather than many smaller ones. But as the hosts get bigger, the vMotion process gets longer, and if you’re clearing a host off because it’s having a hardware failure, you need to go faster. So we’ve decided it would be better to put all of our gear in one rack column, in one spot in the data center. We will put in a 10 gig top-of-rack switch, and all the intra-cluster traffic can be limited to that switch, so you’re not taxing other parts of the network. We’re making changes that make networking happy and make me happy because I’m getting 10 gig connectivity. That’s a testament to what happens when you work with people.
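One reason link speed matters so much when evacuating a big host is that live migration typically copies memory while the VM keeps running, then re-sends the pages the VM wrote to in the meantime. The sketch below is a deliberately simplified model of that iterative pre-copy idea, not VMware’s actual vMotion algorithm; the function `precopy`, the 0.2 GB/s dirty rate and the 1 GB stop-and-copy threshold are hypothetical values chosen for illustration.

```python
from typing import NamedTuple


class Migration(NamedTuple):
    rounds: int
    seconds: float
    converged: bool


def precopy(memory_gb: float, link_gbps: float, dirty_gb_per_s: float,
            stop_gb: float = 1.0, max_rounds: int = 30) -> Migration:
    """Simplified iterative pre-copy model (an illustration, not vMotion itself).

    Round 1 sends all memory; each later round re-sends only the pages the
    VM dirtied while the previous round was on the wire. Once the dirty set
    falls below stop_gb, the VM is paused and the remainder is copied.
    """
    link_gb_per_s = link_gbps / 8.0  # gigabits/s -> gigabytes/s (decimal units)
    to_send = memory_gb
    elapsed = 0.0
    for rnd in range(1, max_rounds + 1):
        round_time = to_send / link_gb_per_s
        elapsed += round_time
        # Pages written during this round must be sent again (capped at total RAM).
        dirtied = min(memory_gb, dirty_gb_per_s * round_time)
        if dirtied <= stop_gb:
            # Final stop-and-copy of the small remaining dirty set.
            elapsed += dirtied / link_gb_per_s
            return Migration(rnd, elapsed, True)
        if dirtied >= to_send:
            # The dirty set is not shrinking: the link cannot keep up.
            return Migration(rnd, elapsed, False)
        to_send = dirtied
    return Migration(max_rounds, elapsed, False)


for gbps in (1, 10):
    print(gbps, "Gb/s:", precopy(memory_gb=256, link_gbps=gbps, dirty_gb_per_s=0.2))
```

With these assumed numbers, the 1 Gb/s case gives up after the first round, because the VM dirties memory (0.2 GB/s) faster than the link can move it (0.125 GB/s), while the 10 Gb/s case converges in four rounds at roughly four minutes.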
Will that mean the network team has any more control of traffic management inside the vSphere environment?
Plankers: Not really. They don’t manage any of the distributed switches or anything like that, but they will have access to [see] them. One of the other things that came out of [cross-team conversations] is that I’ve given [networking] access to see where the VM is and on which host. It turned out a few months ago we were having a problem where it became really clear that if they had access to that data they could help diagnose as opposed to watching us diagnose. They’ve got their own set of tools for monitoring and management and then I’ve got mine. They are still separate, but now I can see their router logs so it’s a much more unified effort.
You gave them access to your VMware tools?
Plankers: They’ve got access to the vCenter client, and they can look at the logs. I also showed them how to see the network configurations. They don’t have permission to change them because I would like them to talk to me about it—just like I don’t have permission to change things on their switches and routers.
Is there a possibility of moving to a joint, third-party management tool that shows what’s available in the physical and virtualized environments?
Plankers: Absolutely. Xangati has some [cross-platform] tools that are network oriented, and they are able to pick up data from a variety of sources including physical switch gear, so you can see your VM end-to-end. We’ve been looking at it, but for us it’s a budgetary thing.
Xangati is good, but in a lot of other cases tool vendors say they can manage virtualization when what they offer is a limited add-on compared to what you get natively from VMware products. Then you have to ask, ‘Is it better to have one tool that’s OK at everything or two tools that are really good at what they do?’
What about the Nexus 1000v, which gives network engineers more control of the virtualized environment?
Plankers: For us it’s an added cost; we don’t need the functionality of it, so we haven’t implemented it. In some places that was the way to appease the network guys, by basically giving them control of the virtual switches, but I guess each organization has its own style and way of dealing with this situation. Some of those who implemented it might have been better off trying to talk to each other first.
Application performance is something that network guys are often responsible for. How do they address this if they can’t get their hands on the virtual network?
Plankers: You can’t. How can you manage something you can’t see? If they are in charge of managing performance, they need the tools to see what’s going on or they’re not in charge of managing the performance.
Who is in charge of application performance management in your situation?
Plankers: With us it’s a tiered thing. We’ve got network guys, storage guys, and server or virtualization guys (me) sitting in the middle of all this stuff. Then we’ve got sys admins who are bridging the gap between me and the app people. And we’ve got app people as well. If an app is having a performance problem, there are a lot of people who need to be involved.
In our particular environment, that can get kind of tricky because the app guys point the finger at virtualization when a VM is slow, and I turn around and say the VM is slow because storage is slow, and maybe storage is slow because the network is slow.
For us, any performance tool that I implement needs to be shared with everyone, so the app admin, the storage admin and the network guy all need to see the data.
Traditionally, network engineers use VLANs to segment and secure traffic. In a virtualized environment that’s different. How do you address traffic segmenting and security in this environment?
Plankers: We use the VLAN capabilities in the virtual switches all the time. It’s either that or we have to put a ton of network interfaces in our hosts. For us, if the VLAN is enough segmentation to appease security people and enough for network guys on their uplinks and on their back trunks, it’s good enough for me as well. Then I just configure my virtual switches to use the VLAN capabilities.
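When a port group is given a VLAN ID, the virtual switch adds and strips the tag itself (VMware calls this virtual switch tagging), which is how a handful of physical uplinks can carry many segments. The sketch below shows only the frame-level mechanics of an IEEE 802.1Q tag: a 0x8100 TPID followed by a 16-bit field holding a 3-bit priority, a drop-eligible bit and a 12-bit VLAN ID. The helper names are hypothetical, and this is not how ESX implements tagging internally.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def vlan_tag(vid: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID, then PCP(3) | DEI(1) | VID(12)."""
    if not 1 <= vid <= 4094:  # 0 and 4095 are reserved
        raise ValueError("VLAN ID must be 1-4094")
    if not 0 <= pcp <= 7 or dei not in (0, 1):
        raise ValueError("bad priority or DEI bit")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def add_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert the tag right after the destination and source MACs (12 bytes)."""
    return frame[:12] + vlan_tag(vid, pcp) + frame[12:]


def vlan_of(frame: bytes) -> Optional[int]:
    """Return the VLAN ID of a tagged frame, or None if it is untagged."""
    if len(frame) >= 16:
        tpid, tci = struct.unpack("!HH", frame[12:16])
        if tpid == TPID_8021Q:
            return tci & 0x0FFF
    return None


def strip_tag(frame: bytes) -> bytes:
    """Remove the 802.1Q tag before handing the frame to an untagged port."""
    return frame[:12] + frame[16:] if vlan_of(frame) is not None else frame


# Example frame: broadcast destination, example source MAC, IPv4 EtherType.
dst, src = bytes.fromhex("ffffffffffff"), bytes.fromhex("005056000001")
untagged = dst + src + b"\x08\x00" + b"payload"
tagged = add_tag(untagged, vid=20)
print(vlan_of(untagged), vlan_of(tagged))
```

The tag adds only four bytes per frame, which is why trunking many VLANs over a few uplinks is usually preferred to adding a physical NIC per segment.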
Networking folks don’t love automation, especially without granular management. How do you address that?
Plankers: For systems guys, automation is ridiculous, and for network guys, their attitude seems to be that crap rolls downhill. If a systems guy is having a problem they will blame the network, and automation makes that worse.
Automatic provisioning of VMs can be kind of scary, but certain levels of automation can go a long way toward helping us and saving us time. There needs to be oversight so it’s not scary. If firewall rules are automatically changed, somebody like a security guy needs to go back and make sure it’s right. Automation doesn’t replace audit processes. In fact, it drives the need for more audits.
Do you use the firewalls that are built into VMware or will you look at third-party security?
Plankers: I am letting my network guys do the firewalling. They’ve got a really mature solution for firewalling any device on the network [using Cisco ASA firewalls]. I am not going to reinvent anything. But now that we are talking with one another, we can actually have conversations about things like that as replacement cycles come up. We might go toward some of the virtual firewall vShield stuff. Altor Networks makes a decent firewall. Some of that stuff is interesting because it can do firewalling at the VM level. It can say that VM ‘X’ absolutely can’t talk to VM ‘Y’ even if they’re sitting right next to each other on the same network segment, on the same VLAN. That’s cool for shared hosting, multi-tenant environments.
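The difference from VLAN-based segmentation is that a VM-level firewall evaluates policy per VM rather than per network segment. The toy first-match rule evaluator below illustrates that idea under stated assumptions: the `VM` and `Rule` types, the VM names and the default-allow policy are all hypothetical and do not reflect the vShield or Altor products or their APIs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VM:
    name: str
    vlan: int


@dataclass(frozen=True)
class Rule:
    src: str     # VM name, or "*" for any VM
    dst: str     # VM name, or "*" for any VM
    action: str  # "allow" or "deny"


def same_segment(a: VM, b: VM) -> bool:
    """All that VLAN-only segmentation can distinguish: same VLAN or not."""
    return a.vlan == b.vlan


def permitted(src: VM, dst: VM, rules: Iterable[Rule], default: str = "allow") -> bool:
    """First matching rule wins; rules match on VM identity, not on VLAN."""
    for rule in rules:
        if rule.src in (src.name, "*") and rule.dst in (dst.name, "*"):
            return rule.action == "allow"
    return default == "allow"


# Three VMs on the same VLAN; X and Y belong to different (hypothetical) tenants.
x, y, z = VM("vm-x", vlan=20), VM("vm-y", vlan=20), VM("vm-z", vlan=20)
rules = [Rule("vm-x", "vm-y", "deny"), Rule("vm-y", "vm-x", "deny")]
print(same_segment(x, y), permitted(x, y, rules), permitted(x, z, rules))
```

Here X and Y share a VLAN, so a VLAN boundary alone would let them talk, but the per-VM rule blocks the pair while leaving X and Z unaffected.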