Title Developing a resilient power management portal for an enterprise infrastructure
Abstract The Scientific Computing Technology (SCT) group within the Science and Technology Facilities Council's (STFC) e-Science centre is responsible for managing large-scale cluster computing services for the UK's e-Infrastructure as well as for STFC's core Facilities. Such an infrastructure is demanding to manage, and SCT has invested considerable time and effort in finding suitable products to streamline System Administration duties where possible, and in developing in-house tools where solutions either do not exist or do not justify their cost. Alongside the requirement to provide power management functionality to an OS commissioning service, SCT saw the need to extend this functionality to every power outlet in the SCT infrastructure so that a consistent view could be maintained. To meet this challenge, a new service was developed that manages Power Distribution Units (PDUs), Intelligent Platform Management Interface (IPMI) controllers, and VMware Virtual Machines. With this service, authenticated System Administrators can perform fine-grained power control of servers and Virtual Machines using information collected from both Oracle and MySQL databases. The service employs a database caching technique and runs in a fault-tolerant environment, so that it remains available during a building power failure, the scenario in which it is most needed. The result is a consistent view of an enterprise infrastructure with fine-grained power management control, and, for the first time, SCT can perform fully automated Operating System deployment on virtually any host in its infrastructure.
