Digitized Anything

Funded by: Internal Experiment

AIRR is an AI-based Resource Recommender for High-Performance Computing (HPC) applications in Cloud environments. AIRR was developed as a generalizable approach to the problem of finding the optimal Cloud instance type per HPC-application job, without the need for extensive modeling or preexistent data.

The system treats the challenge of resource allocation as a contextual multi-armed bandit problem. As users submit jobs to be executed in the cloud, AIRR will use these jobs as opportunities to gain insight into the relationship between application, job parameter values, and hardware, which will improve the quality of future recommendations.
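The bandit framing above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not AIRR's actual implementation: the class name `EpsilonGreedyRecommender`, the epsilon-greedy exploration rule, the per-instance linear cost model (standing in for a learned model), and the invented `simulated_cost` function that plays the role of a real job execution.

```python
import random


class EpsilonGreedyRecommender:
    """Toy contextual bandit: one linear cost model per instance type (arm)."""

    def __init__(self, instance_types, n_features, epsilon=0.1, lr=0.05, seed=0):
        self.instance_types = list(instance_types)
        self.epsilon = epsilon  # probability of exploring a random arm
        self.lr = lr            # learning rate for the per-arm models
        self.rng = random.Random(seed)
        # One weight vector per arm; the last entry is the bias term.
        self.weights = {t: [0.0] * (n_features + 1) for t in self.instance_types}

    def predict_cost(self, instance_type, context):
        w = self.weights[instance_type]
        # zip stops at the end of `context`, so the bias w[-1] is added separately.
        return w[-1] + sum(wi * xi for wi, xi in zip(w, context))

    def recommend(self, context):
        # Explore: occasionally try a random instance type.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.instance_types)
        # Exploit: pick the instance type with the lowest predicted cost.
        return min(self.instance_types, key=lambda t: self.predict_cost(t, context))

    def update(self, instance_type, context, observed_cost):
        # One gradient step on the squared error for the arm that was used.
        error = self.predict_cost(instance_type, context) - observed_cost
        w = self.weights[instance_type]
        for i, xi in enumerate(context):
            w[i] -= self.lr * error * xi
        w[-1] -= self.lr * error


def simulated_cost(instance_type, context):
    """Invented cost function standing in for a real job run (assumption)."""
    size = context[0]
    return 2.0 * size if instance_type == "small" else 0.5 + 0.5 * size


recommender = EpsilonGreedyRecommender(["small", "large"], n_features=1)
job_rng = random.Random(1)
for _ in range(500):
    context = [job_rng.random()]            # hypothetical normalized job parameter
    choice = recommender.recommend(context)
    recommender.update(choice, context, simulated_cost(choice, context))
```

Each submitted job serves both as a recommendation and as a training sample, so no separate benchmarking phase is needed in this sketch. Only the model of the instance type that actually ran is updated, which is what distinguishes the bandit setting from ordinary supervised learning.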

AIRR has been initially validated on a mix of four HPC applications and eight diverse Amazon EC2 instance types. The system was shown to converge towards an optimal solution after only a small number of job executions. It also continued to explore alternative options effectively, which further improved its recommendations over time.

AIRR was designed with small labs and individual researchers in mind, who want to run their own HPC applications multiple times, with different inputs. We aim to service these users by:

  • providing an application- and platform-agnostic resource recommender, which can learn the relationship between performance, hardware, and job-specific parameter values with continued usage.
  • keeping the necessary setup and maintenance effort as low as possible by learning the optimal instance type exclusively via real, user-submitted HPC jobs.

However, AIRR can, in principle, also be used by Cloud-service providers to optimize Cloud-resource usage across a rich mix of user applications.

  • Suboptimal choices of instance type for HPC applications lead to suboptimal application performance, unnecessary lease costs, or both.
  • HPC-application performance depends on a typically complex interaction between the application, the hardware, and job-specific parameters. There is often no single best choice overall.
  • Small labs and individual researchers typically lack the specialized computer-science knowledge necessary to make optimal hardware choices.
  • Researchers and labs often do not have the budgets for extensive testing or auxiliary data collection to train a resource-recommendation system.
  • Preexistent resource-recommendation systems tend to rely on extensive modeling or auxiliary data and are therefore not suitable for these users.

AIRR makes use of state-of-the-art AI technology to approximate the relationship between application, hardware, and job parameters. It combines this with time-tested algorithms to decide when it is time to exploit its knowledge and when to explore further, so as to improve future recommendations. This technology includes:

  • Reinforcement Learning (RL)
  • Contextual Multi-Armed Bandit algorithms
  • Deep Multilayer Perceptron (MLP) networks
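As a rough illustration of how a perceptron network could approximate the relationship between job parameters, instance type, and cost, the sketch below trains a very small network by stochastic gradient descent. It is a simplification under stated assumptions: a single hidden layer instead of a deep network, a one-hot encoding of the instance type, and the hypothetical names `encode` and `TinyMLP`. None of these details are taken from AIRR itself.

```python
import math
import random


def encode(job_params, instance_type, instance_types):
    """Concatenate job parameters with a one-hot encoding of the instance type."""
    one_hot = [1.0 if t == instance_type else 0.0 for t in instance_types]
    return list(job_params) + one_hot


class TinyMLP:
    """One-hidden-layer perceptron with tanh units and a linear output."""

    def __init__(self, n_inputs, n_hidden=8, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return self.b2 + sum(w * hi for w, hi in zip(self.w2, h))

    def train_step(self, x, target):
        """One gradient step on the loss 0.5 * (prediction - target) ** 2."""
        h = self._hidden(x)
        err = self.b2 + sum(w * hi for w, hi in zip(self.w2, h)) - target
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the pre-update output weight.
            grad_pre = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= self.lr * err * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= self.lr * grad_pre * xi
            self.b1[j] -= self.lr * grad_pre
        self.b2 -= self.lr * err
        return 0.5 * err * err
```

In a bandit setting, a network of this kind would be queried once per candidate instance type for a given job, and `train_step` would be called with the cost observed for the instance type that was actually used.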

The main envisioned user base for AIRR is small labs and individual users that run their own HPC applications regularly with different inputs. However, we believe that AIRR can also be of interest to Cloud hyperscalers. Because AIRR recommends optimal instance types for specific HPC jobs, it can help Cloud providers improve their infrastructure utilization.