Rather than treat grid, cloud, and high-performance computing (HPC) as separate and distinct approaches, this month’s CN theme focuses on interoperability among these methodologies and on the issues that arise along the way.
Working in Harmony
The first article in our theme, “From Meta-Computing to Interoperable Infrastructures” by Stelios Sotiriadis and his colleagues, begins with a high-level examination of what each of these technologies is suited for:
- HPC features tight coupling between applications and the underlying homogeneous infrastructure. Focus is primarily on speed and performance in the normally client-owned environment.
- Grid computing has a lesser coupling between applications and the underlying infrastructure. The apps are less location-aware as virtual organizations allow heterogeneous and geographically dispersed nodes. Focus is on parallelism and distributed computing, and access to the shared infrastructure is restricted.
- Cloud computing includes virtually no coupling between the application and underlying infrastructure. Focus is on pay-per-use and dynamic resource provisioning for anytime-anywhere computation on the publicly accessible infrastructure. Lower service-level agreements (SLAs) are often acceptable.
Nothing new, right? Almost. The twist in recent years has been the ability to coordinate and integrate these seemingly different environments. Sotiriadis and colleagues introduce the concept of a meta-scheduler that can move workloads across all three environments. Their article looks at current research topics in the area and, more importantly, the gaps between them.
Meta-Scheduling to Assist with Integration
Meta-schedulers can be centralized or decentralized, but the end goal is to integrate the environments under one management layer. Thomas Rings and Jens Grabowski demonstrate this approach in our second article, “Pragmatic Integration of Cloud and Grid Computing Infrastructures.” As they discuss, a meta-scheduler must accommodate the following challenges:
- Heterogeneous resources,
- Management and scheduling across local and remote locations and environments,
- Dynamic environments where resources may come and go,
- Geographically dispersed locations,
- Load-balancing with multiple resource types and locations,
- Failures and rescheduling, and
- Security constraints.
Much research has been done in this area, but we’ve really only begun to scratch the surface.
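As a rough illustration of how a meta-scheduler might handle a few of the challenges listed above (heterogeneous resources, resources that come and go, load balancing, rescheduling after a failure, and placement constraints), here is a minimal Python sketch. Every name in it (`Resource`, `Job`, `schedule`, `reschedule_after_failure`, the `allowed_kinds` constraint) and the "most free cores wins" policy are hypothetical choices made for illustration; none of it is taken from the theme articles.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str                # "hpc", "grid", or "cloud" (heterogeneous types)
    location: str            # sites may be local or remote
    free_cores: int
    available: bool = True   # resources may come and go


@dataclass
class Job:
    name: str
    cores: int
    allowed_kinds: frozenset  # hypothetical security/placement constraint


def schedule(job, resources):
    """Place the job on the eligible resource with the most free cores."""
    eligible = [
        r for r in resources
        if r.available and r.kind in job.allowed_kinds and r.free_cores >= job.cores
    ]
    if not eligible:
        return None  # nothing can run the job right now
    target = max(eligible, key=lambda r: r.free_cores)
    target.free_cores -= job.cores
    return target


def reschedule_after_failure(job, failed, resources):
    """Mark the failed resource unavailable and try to place the job elsewhere."""
    failed.available = False
    return schedule(job, resources)


# Hypothetical pool spanning all three environments.
pool = [
    Resource("cluster-a", "hpc", "on-site", 64),
    Resource("vo-node-7", "grid", "remote", 32),
    Resource("iaas-1", "cloud", "remote", 128),
]
job = Job("render", 16, frozenset({"hpc", "grid"}))  # not allowed on public cloud
first = schedule(job, pool)
second = reschedule_after_failure(job, first, pool)
print(first.name, "->", second.name)
```

A real meta-scheduler would also have to deal with data locality, queue wait times, and authentication across sites; the sketch only shows where those decisions would plug in.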
Rings and Grabowski take a very pragmatic approach, focusing on actually integrating a grid environment within the Amazon Web Services (AWS) public cloud infrastructure. In what they call “grid-in-cloud-services,” the authors use the Uniform Interface to Computing Resources (Unicore) to instantiate a private infrastructure-as-a-service (IaaS) cloud within the Amazon public IaaS cloud. The Unicore gateway (or meta-scheduler) controls load balancing between the internal grid and the grid-in-cloud-services infrastructure. This presents a significant step toward interoperability even as it outlines and demonstrates remaining challenges in this field.
Assuming we solve the general architectural challenges of integrating an external cloud with an internal grid/HPC environment, the issue of pricing takes center stage, particularly as it pertains to SLAs, SLA management, and service selection. That is, once we figure out how to shift load dynamically, will we be able to create a spot market (one in which resources are purchased and delivered immediately) where users can choose service providers based on current SLA and price objectives? They could, for example, pay one vendor a lower price in the afternoon for a lower SLA, and pay another vendor a higher price in the morning to meet higher service-level needs.
The goal is to take agility to the next level so that we can shift and load balance across multiple cloud vendors based on SLA and price. With that in mind, our third article, “Automatic SLA Matching and Provider Selection in Grid and Cloud Computing Markets” by Christoph Redl and his colleagues, discusses the implementation of an SLA template that can be used for negotiating (SLA-matching) and legally signing contracts (real-time provider selection). Implemented via Web services standards such as WS-Agreement or WSLA, their proposed SLA templates contain the required data (SLA metrics, parameters, and service-level objectives, for instance) that a machine-learning algorithm can use to mediate and agree to contracts. The machine-learning algorithm employs a MAPE-style control loop that 1) Monitors the learning progress versus recommendations, 2) Analyzes new knowledge to be added to the database, 3) Plans training and revision, and 4) Executes the training.
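To make the idea of SLA matching followed by provider selection more concrete, the following Python sketch filters provider offers against a consumer's service-level objectives and then picks the cheapest match. It is a simplified stand-in, not the machine-learning approach that Redl and colleagues describe, and it does not use WS-Agreement or WSLA. The `Offer` and `Request` fields, the vendor names, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str
    price_per_hour: float
    availability: float   # promised fraction of uptime, e.g. 0.999
    response_ms: float    # promised maximum response time


@dataclass
class Request:
    min_availability: float
    max_response_ms: float
    max_price_per_hour: float


def matches(offer, request):
    """An offer matches when it meets every objective and fits the budget."""
    return (
        offer.availability >= request.min_availability
        and offer.response_ms <= request.max_response_ms
        and offer.price_per_hour <= request.max_price_per_hour
    )


def select_provider(offers, request):
    """Return the cheapest matching offer, or None if nothing qualifies."""
    candidates = [o for o in offers if matches(o, request)]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


# Hypothetical offers currently on the market.
offers = [
    Offer("vendor-a", 0.40, 0.999, 200.0),
    Offer("vendor-b", 0.25, 0.990, 500.0),
    Offer("vendor-c", 0.55, 0.9999, 100.0),
]
# Relaxed needs (lower SLA acceptable) versus strict needs (higher SLA required).
relaxed = Request(min_availability=0.99, max_response_ms=500.0, max_price_per_hour=0.50)
strict = Request(min_availability=0.999, max_response_ms=250.0, max_price_per_hour=0.50)

print(select_provider(offers, relaxed).provider)  # vendor-b
print(select_provider(offers, strict).provider)   # vendor-a
```

With the relaxed request the cheaper, lower-SLA vendor wins; tightening the objectives shifts the choice to a more expensive vendor, which mirrors the afternoon-versus-morning trade-off described earlier.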
Load Balancing and Game Theory
Our final theme article, Qin Zheng and Bharadwaj Veeravalli’s “On the Design of Mutually Aware Optimal Pricing and Load-Balancing Strategies for Grid Computing Systems,” presents strategies for addressing the issues discussed by Redl and colleagues. The authors delve into the game theory aspects of load balancing (a topic that particularly interests me because of its relevance to the research I am pursuing for my PhD). “Gaming” the underlying system, for example by overwhelming the infrastructure to prevent others from gaining access, can have adverse effects on other users’ SLAs. In contrast, resource reservation systems can’t be gamed because they carve out resources irrespective of the users’ needs or requirements. Other methods can also prevent, or at least reduce, the ability to game the system. Redl and colleagues examine how providers can use price differentiation to help prevent temporal starvation, in which a user’s request for service is delayed because the system has been gamed, as opposed to the current model in which systems seek to average (or cap) usage among concurrent consumers so that no single user can starve others. Competition in grid environments is less severe than in public clouds, where users have no vested interest in cooperating with each other: two cloud users might, for example, represent competing companies.
Offering various pricing models can also help providers maximize revenue and increase usage. The authors show that it’s easier for service providers (internal or external) to affect behavior by simply varying usage price rather than imposing restrictions or other such policies. This might seem obvious, but in a cloud environment with noncooperative users, setting the wrong price could drive revenue down as the cloud no longer represents the most attractive option. Ultimately, the goal here is to establish a market that embodies the Nash equilibrium point, that is, the point at which no single service provider can increase its profit by changing only its own price.
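A toy example may help show what a Nash equilibrium in prices looks like. The Python sketch below assumes two providers with linear demand, where each provider's demand falls with its own price and rises with its rival's, and lets each repeatedly pick its best response to the other's price. The demand model, the parameter values (`a`, `b`, `c`, `cost`), and the function names are assumptions chosen for illustration; they are not the model used by Zheng and Veeravalli. Under these assumptions the closed-form symmetric equilibrium price is (a + b·cost) / (2b − c) = 4.0.

```python
def demand(own_price, rival_price, a=10.0, b=2.0, c=1.0):
    """Assumed linear demand: falls with own price, rises with rival's price."""
    return a - b * own_price + c * rival_price


def profit(own_price, rival_price, cost=1.0):
    """Margin per unit times the units demanded."""
    return (own_price - cost) * demand(own_price, rival_price)


def best_response(rival_price, a=10.0, b=2.0, c=1.0, cost=1.0):
    """Price that maximizes profit for a fixed rival price.

    Obtained by setting the derivative of profit with respect to own price to zero.
    """
    return (a + c * rival_price + b * cost) / (2.0 * b)


def find_equilibrium(p1=1.0, p2=8.0, rounds=100):
    """Iterate simultaneous best responses from arbitrary starting prices."""
    for _ in range(rounds):
        p1, p2 = best_response(p2), best_response(p1)
    return p1, p2


p1, p2 = find_equilibrium()
print(round(p1, 6), round(p2, 6))

# At equilibrium, a unilateral price change in either direction lowers profit.
print(profit(4.0, 4.0), profit(4.5, 4.0), profit(3.5, 4.0))
```

The last line compares the profit at the equilibrium price with the profit after raising or cutting the price while the rival stays put; both deviations earn less, which is exactly the condition that defines the equilibrium.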
Unless a complete rip-out of a current environment becomes feasible (perhaps when old servers reach the end of their lives), efficient, seamless integration and interoperability among HPC, grid, and cloud environments will be an important challenge to the growth of cloud computing in fields that are already invested in HPC and grid technologies. From an application perspective, end users care primarily about their SLAs and the associated costs. The ability to bring the benefit of the cloud’s dynamic provisioning and adaptation to HPC and grid environments will allow users to make trade-off decisions in real time to meet service-level needs. The articles in this month’s CN theme explore some methods for achieving that interoperability.
Art Sedighi is a freelance consultant working in New York City with an emphasis on infrastructure design and implementation. He has an MS in computer science from Rensselaer University and an MS in biotechnology and bioinformatics from Johns Hopkins University. Sedighi is currently working toward a PhD in Applied Mathematics from the State University of New York at Stony Brook. See his blog and PhD status pages at: http://phd.artsedighi.com. Contact him at firstname.lastname@example.org.