Developing a Nomad Autoscaler for Harvester

Nomad x Harvester

Nomad orchestrates application deployment and management. As applications grow in size, managing resource consumption becomes crucial. The Nomad Autoscaler is a pluggable service that makes workload scaling more accessible, empowering users to create logic for scaling their infrastructure.

Developing a custom plugin is especially beneficial when catering to cloud environments or hypervisors that aren't supported by the HashiCorp community. This blog will guide you through creating a Nomad Autoscaler target plugin by implementing its exposed methods: SetConfig, Scale, and Status.
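Concretely, a target plugin satisfies the autoscaler SDK's target interface. A rough sketch of that contract is below (simplified from the nomad-autoscaler SDK; base plugin details such as PluginInfo are omitted, and the exact package layout may vary by SDK version):

```go
// Target is approximately the contract a target plugin fulfills.
type Target interface {
    // SetConfig receives the plugin's configuration from the agent.
    SetConfig(config map[string]string) error
    // Scale performs a scaling action against the infrastructure.
    Scale(action sdk.ScalingAction, config map[string]string) error
    // Status reports readiness and the current node count.
    Status(config map[string]string) (*sdk.TargetStatus, error)
}
```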

Defining the Plugin Struct

For our Nomad Autoscaler plugin, we'll define a struct to hold configuration and other state information for scaling on Harvester. The Plugin struct implements the sdk.Target interface to work as a Nomad autoscaling plugin. The Plugin struct should contain all the state needed to actually implement autoscaling, such as configuration, loggers, and API clients.

package main

import (
    hclog "github.com/hashicorp/go-hclog"
    "github.com/hashicorp/nomad/api"
    // harvester is the generated Harvester API client; its import path
    // depends on how the client was generated.
)

type HarvesterPlugin struct {
    config          map[string]string
    logger          hclog.Logger
    HarvesterClient *harvester.APIClient
    NomadClient     *api.Client
    // Additional Config
}

func NewPlugin(log hclog.Logger) *HarvesterPlugin {
    return &HarvesterPlugin{
        logger: log,
    }
}
Configuring the Plugin

The target plugin contains two different config parameters:

  • target: what to scale
  • policy: when to scale

Target Configuration

The target configuration holds the plugin-specific settings: authentication credentials, global options, etc.

target {
    driver = "harvester"
    config = {
       harvester_url = ""
       auth_token    = "eyabc123"
    }
}

Once instantiated, the Nomad autoscaler service will pass the target configuration options to the plugin's SetConfig method, which can then be used to set up the plugin fields. The configuration will also contain options as defined in the General Options documentation.

A sample setup might look something like this:

func (hp *HarvesterPlugin) SetConfig(config map[string]string) error {
    token := getEnvOrConfig("HARVESTER_TOKEN", config, configKeyAuthToken) // A function that returns, in order of priority, the environment var then the config value.
    url := getEnvOrConfig("HARVESTER_URL", config, configKeyHarvesterURL)  // configKeyHarvesterURL is a const defined elsewhere
    hp.HarvesterClient = harvester.NewAPIClient(&harvester.Configuration{
        DefaultHeader: map[string]string{"Authorization": "Bearer " + token},
        UserAgent:     "nomad-autoscaler",
        Debug:         false,
        Servers: harvester.ServerConfigurations{
            {URL: url, Description: "Harvester API Server"},
        },
    })

    apiConfig := &api.Config{
        Address:   config["nomad_address"],
        Region:    config["nomad_region"],
        Namespace: config["nomad_namespace"],
    }
    if token, ok := config["nomad_token"]; ok {
        apiConfig.Headers = map[string][]string{"X-Nomad-Token": []string{token}}
    }
    nomadClient, err := api.NewClient(apiConfig)
    if err != nil {
        return fmt.Errorf("failed to create Nomad client: %v", err)
    }
    hp.NomadClient = nomadClient
    // Any other additional config
    return nil
}

Policy Configuration

Cluster operators author scaling policies when interacting with the Autoscaler. The policy config is then passed as a parameter to the Scale method to dynamically allocate the necessary resources.

scaling "cluster_policy" {
  enabled = true
  min     = 1 # min number of VMs to scale
  max     = 2 # max number of VMs to scale

  policy {
    target "harvester" {
      dry-run         = "false"
      node_class      = "linux"
      node_group_name = "nomad"
      namespace       = "default"
      cpu_request     = "2"
      memory_request  = "4Gi"
    }
  }
}

With the configuration defined, Nomad passes the scaling config to the plugin's Scale method. The specifics of your hypervisor will determine how the plugin counts active nodes and carries out scale operations.

A typical scale-in flow: Scale called → calculate direction → scale in → drain nodes → delete node from infrastructure.
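The direction calculation itself is simple arithmetic on the current and desired counts. A sketch is below, shown as a free function for clarity (the plugin defines it as a method; SCALE_NONE is an assumed name for the "no change" value):

```go
// ScaleDirection mirrors the enum returned by calculateDirection.
type ScaleDirection int

const (
    SCALE_NONE ScaleDirection = iota
    SCALE_IN
    SCALE_OUT
)

// calculateDirection compares the current node count with the count the
// strategy requested, returning the absolute difference and the direction.
func calculateDirection(current, desired int64) (int64, ScaleDirection) {
    switch {
    case desired < current:
        return current - desired, SCALE_IN
    case desired > current:
        return desired - current, SCALE_OUT
    default:
        return 0, SCALE_NONE
    }
}
```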

func (hp *HarvesterPlugin) Scale(action sdk.ScalingAction, config map[string]string) error {
    // config parsing removed for simplicity
    ctx, cancel := context.WithTimeout(context.Background(), hp.scaleTimeout)
    defer cancel()

    total, _, remoteIDs, err := hp.countReady(ctx, nodeGroup, namespace)
    if err != nil {
        return fmt.Errorf("failed to count servers in harvester: %v", err)
    }

    diff, direction := hp.calculateDirection(total, action.Count)
    switch direction {
    // SCALE_IN is an enum value returned by calculateDirection
    case SCALE_IN:
        if err := hp.scaleIn(ctx, diff, remoteIDs, config); err != nil {
            return fmt.Errorf("failed to perform scale in: %v", err)
        }
    // SCALE_OUT is an enum value returned by calculateDirection
    case SCALE_OUT:
        if err := hp.scaleOut(ctx, diff, config); err != nil {
            return fmt.Errorf("failed to perform scale out: %v", err)
        }
    default:
        hp.logger.Debug("scaling not required", "node_group", nodeGroup, "current_count", total, "strategy_count", action.Count)
    }
    return nil
}
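The countReady helper is plugin-specific. One possible sketch derives the counts from Nomad's node list, filtered by node class; a production implementation would reconcile this against the Harvester API so that VMs without a registered Nomad client are also counted, and would return Harvester VM identifiers (scoped by namespace) rather than Nomad node IDs as remoteIDs:

```go
// countReady returns the total and ready node counts for a node group,
// plus the IDs the scale-in path uses to pick removal candidates.
// This sketch only consults Nomad; namespace would scope the Harvester
// VM lookup in a fuller implementation.
func (hp *HarvesterPlugin) countReady(ctx context.Context, nodeGroup, namespace string) (int64, int64, []string, error) {
    stubs, _, err := hp.NomadClient.Nodes().List((&api.QueryOptions{}).WithContext(ctx))
    if err != nil {
        return 0, 0, nil, err
    }
    var total, active int64
    var remoteIDs []string
    for _, stub := range stubs {
        if stub.NodeClass != nodeGroup {
            continue
        }
        total++
        if stub.Status == "ready" && !stub.Drain {
            active++
        }
        remoteIDs = append(remoteIDs, stub.ID)
    }
    return total, active, remoteIDs, nil
}
```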

Draining Nodes

During the scaleIn method, HashiCorp recommends that you first drain the node. Draining and purging nodes is critical to scale-down operations, providing the reliability that allows applications to shut down gracefully. After some time with the node offline, Nomad's garbage collector will remove it from the cluster.

func (hp *HarvesterPlugin) drainNode(ctx context.Context, nodeID string, timeout time.Duration) error {
    _, err := hp.NomadClient.Nodes().UpdateDrainOpts(
        nodeID,
        &api.DrainOptions{
            DrainSpec: &api.DrainSpec{
                Deadline:         timeout,
                IgnoreSystemJobs: true,
            },
            MarkEligible: false,
        },
        nil,
    )
    if err != nil {
        hp.logger.Warn(fmt.Sprintf("Failed to drain %v. Will continue to delete: %v", nodeID, err))
    } else {
        drainCtx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        if err := hp.waitForDrained(drainCtx, nodeID); err != nil {
            hp.logger.Warn(fmt.Sprintf("Failed to drain %v: %v", nodeID, err))
        }
    }
    return nil
}
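The waitForDrained helper can be implemented by polling the Nomad API until the node's drain strategy clears. A sketch using the official Nomad API client (a real implementation might watch node events instead of polling):

```go
// waitForDrained blocks until the node finishes draining or the context
// expires. Nomad clears DrainStrategy once the drain deadline is reached
// or all non-system allocations have stopped.
func (hp *HarvesterPlugin) waitForDrained(ctx context.Context, nodeID string) error {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        node, _, err := hp.NomadClient.Nodes().Info(nodeID, nil)
        if err != nil {
            return fmt.Errorf("failed to read node %s: %v", nodeID, err)
        }
        if node.DrainStrategy == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}
```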


Checking Plugin Status

The Status function reports the current status of your plugin, which helps with debugging and monitoring. The autoscaler determines the current running count using information returned by the plugin's Status method. The method returns an sdk.TargetStatus, which indicates whether the next Scale call can be performed and provides the current running count for the next strategy calculation.

func (hp *HarvesterPlugin) Status(config map[string]string) (*sdk.TargetStatus, error) {
    total, active, _, err := hp.countReady(context.Background(), nodeGroup, namespace)
    if err != nil {
        return nil, fmt.Errorf("failed to count Harvester servers: %v", err)
    }
    return &sdk.TargetStatus{
        Ready: active == total,
        Count: total,
        Meta:  make(map[string]string),
    }, nil
}

Developing a Nomad Autoscaler plugin involves implementing key functions like SetConfig, Scale, and Status. Configuring the plugin requires defining target and policy blocks, which dictate what to scale and under what conditions. Proper handling of node draining gives you control over scale-in, maintaining the reliability of your applications.

Writing an autoscaler plugin lets you tailor scaling on your hypervisor to the needs of your Nomad-managed infrastructure.

For more details and examples, check out the Nomad Autoscaler Plugin authoring guide, and the Nomad Autoscaling tools documentation.

Special thanks to Steve Kalt for helping review this post.