One of the most interesting, useful, and, to me, overlooked features of virtualization is dynamic resource allocation and management. Balancing computing loads on the fly is a job that has consumed systems administrators for decades as they tried to squeeze every last cycle out of their compute hardware. The need to balance CPU, memory, disk I/O, and network I/O by creating the optimum mix of compute jobs in the right order, of the right duration, and at the right time, with respect not only to those resources but also to power and cooling, can create some interesting and difficult problems.
The advent of Distributed Power Management (DPM) and Distributed Resource Scheduling (DRS) greatly alleviated these problems. The systems manager can define policies for maximum CPU utilization, for example, under which VMs are moved off an overtaxed compute node in a virtualization farm to one that has more resources available. Conversely, DPM will consolidate loads onto fewer servers, allowing those that are not needed to be powered down to reduce power consumption.
These features are incredibly useful, but on their own they address only the single variable of CPU utilization. Another variable that can often force VMs to move is network I/O. No matter how much we have, bandwidth is a precious resource, and one whose demand can vary widely over time. To help control VM locality with this in mind, Oracle added network policies to DPM/DRS configurations. A threshold may be set for each network on an OVM server; when it is exceeded, VMs are migrated to other OVM servers in the pool.
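To make the threshold idea concrete, here is a minimal Python sketch of the kind of check such a policy implies. It is not Oracle VM's actual algorithm. The server names, network names, utilization figures, and the `servers_over_threshold` helper are all hypothetical, and it assumes utilization is already expressed as a percentage of each network's capacity.

```python
from dataclasses import dataclass


@dataclass
class NetworkSample:
    """One utilization reading for one network on one server (hypothetical model)."""
    server: str
    network: str
    utilization_pct: float  # measured load as a percentage of capacity


def servers_over_threshold(samples, thresholds_pct):
    """Map each server to the networks whose utilization exceeds the threshold.

    Networks with no configured threshold are ignored, mirroring the idea that
    a policy only applies to the networks you add to it.
    """
    over = {}
    for sample in samples:
        limit = thresholds_pct.get(sample.network)
        if limit is not None and sample.utilization_pct > limit:
            over.setdefault(sample.server, []).append(sample.network)
    return over


# Illustrative data only: names and numbers are made up.
samples = [
    NetworkSample("ovs1", "vm-public", 82.0),
    NetworkSample("ovs1", "storage", 40.0),
    NetworkSample("ovs2", "vm-public", 35.0),
]
thresholds = {"vm-public": 75.0, "storage": 60.0}

candidates = servers_over_threshold(samples, thresholds)
print(candidates)
```

In this made-up data only ovs1 exceeds a threshold, on the vm-public network, so it would be the server from which VMs are candidates for migration.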
It is key to note that these policies can be defined per server, or across the pool as a whole. Keep in mind that the theoretical capacity of a network is not always what it appears to be, nor is it always the actual capacity. When looking at capacity, remember to take things like TCP/IP overhead, padding, multicast traffic, link aggregation, and failover bonding into account. Under Oracle Linux, the theoretical network capacity for aggregated links is the sum of those links, while in the failover case, the theoretical maximum is the port speed of the Ethernet port being used.
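The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, the `active_port` argument, and the `overhead_pct` allowance are my own assumptions, and the overhead figure used in the example is a placeholder to replace with a measured value, not a recommendation.

```python
def theoretical_capacity_mbps(port_speeds_mbps, mode, active_port=0):
    """Theoretical capacity of a bond, following the rule described above.

    "aggregated": the sum of all member links.
    "failover":   the speed of the one port currently carrying traffic.
    """
    if not port_speeds_mbps:
        raise ValueError("at least one port speed is required")
    if mode == "aggregated":
        return sum(port_speeds_mbps)
    if mode == "failover":
        return port_speeds_mbps[active_port]
    raise ValueError(f"unknown bond mode: {mode!r}")


def threshold_mbps(capacity_mbps, threshold_pct, overhead_pct=0.0):
    """Convert a percentage threshold into Mb/s of usable bandwidth.

    overhead_pct is an assumed allowance for protocol overhead, padding,
    and similar losses; it is not something the policy itself defines.
    """
    usable = capacity_mbps * (1 - overhead_pct / 100)
    return usable * threshold_pct / 100


# Two 10 Gb/s ports, expressed in Mb/s.
ports = [10_000, 10_000]
aggregated = theoretical_capacity_mbps(ports, "aggregated")
failover = theoretical_capacity_mbps(ports, "failover")
print(aggregated, failover)
print(threshold_mbps(aggregated, threshold_pct=75, overhead_pct=5))
```

The same 75% threshold therefore corresponds to very different absolute traffic levels depending on how the ports are bonded, which is worth remembering when choosing the number.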
(Side note: When debugging network performance issues, don’t forget to look at duplex, MTU size, VLAN constraints, and QoS parameters that may be set in the switching fabric.)
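For the host side of that checklist, a small Python sketch like the following can read the MTU, negotiated speed, and duplex that Linux exposes under /sys/class/net. The interface name eth0 and the helper name are assumptions, and VLAN and QoS settings in the switching fabric still have to be checked on the switches themselves.

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Read basic link attributes for one interface from Linux sysfs.

    Returns a dict of strings; a value is None when the attribute is missing
    or unreadable (for example, speed and duplex on a link that is down).
    """
    base = Path(sysfs_root) / iface
    settings = {}
    for name in ("mtu", "speed", "duplex", "operstate"):
        try:
            settings[name] = (base / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


# "eth0" is a placeholder; substitute the interface backing your OVM network.
print(read_link_settings("eth0"))
```

A half-duplex link or an MTU that does not match the rest of the path is often a cheaper explanation for poor throughput than a genuinely saturated network.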
In order to turn on the network parameters of DPM/DRS in Oracle VM, navigate to the Servers and VMs tab, select the pool you want to work with, and then in the Perspective field, select Policies, and then edit the policies.
This will display the Configure Policy dialog, in which you will first set the policy type, CPU threshold, and the servers within the pool you want to apply this policy to.
Note that you don’t have to set a CPU policy. You can create a policy solely for network thresholds. Here, I have selected the servers, but not applied a CPU threshold.
The next screen will allow you to select the networks to which you wish to add the policy. Proceed to the final dialog. This final step of the configuration wizard will offer you the opportunity to set the threshold for each network individually.
Click Finish to apply.
This process gives us another variable with which to tune the behavior of our OVM systems, so that we can get the last bit of performance and balance the workload based on several variables rather than CPU alone.
As always, the proof is in the monitoring. Tools such as Enterprise Manager, Cacti, Nagios, and MRTG will help keep an eye on what is going on, but DPM/DRS will actually get the job done without that annoying middle-of-the-night phone call.
Full documentation can be found in the Oracle VM Manager User’s Guide for Oracle VM 3.4, in the DRS/DPM policies section. http://docs.oracle.com/cd/E64076_01/E64081/html/vmcon-svrpool-netpolicy.html
You can also get additional information about using DRS/DPM based on CPU on Jeff Savit's blog about DRS/DPM.