
Deploying AutoGluon Models with KServe

This guide explains how to deploy an AutoGluon TabularPredictor model with KServe using the autogluon model format and the kserve-autogluonserver runtime.

Prerequisites

Before you begin, make sure you have:

  • A Kubernetes cluster with KServe installed.
  • Access to a storage backend reachable by your cluster (for example, GCS, S3, or Azure Blob).
  • A model saved with TabularPredictor.save(path).
Model Artifacts Must Be a Directory

AutoGluon models must be stored as a directory generated by TabularPredictor.save(path), not as a single file artifact.
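Before uploading to your storage backend, a quick local sanity check can catch the common mistake of pointing `storageUri` at a single pickle file. The sketch below is a heuristic only; the exact files AutoGluon writes (e.g. `predictor.pkl`) vary by version, and the placeholder directory here stands in for one produced by `TabularPredictor.save(path)`.

```python
from pathlib import Path
import tempfile

def looks_like_autogluon_artifact(model_dir: str) -> bool:
    """Heuristic check before uploading to storageUri.

    AutoGluon saves a directory of files, not a single artifact; the
    exact contents vary by version, so this only verifies the path is
    a directory containing at least one pickle file somewhere inside.
    """
    p = Path(model_dir)
    return p.is_dir() and any(p.rglob("*.pkl"))

# Demo with a stand-in directory (a real one comes from TabularPredictor.save).
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "predictor.pkl").touch()  # placeholder file name, not guaranteed by AutoGluon
    print(looks_like_autogluon_artifact(d))                      # True: directory with a .pkl
    print(looks_like_autogluon_artifact(str(Path(d) / "predictor.pkl")))  # False: a single file
```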

Deploy the Model with REST Endpoint

Create an InferenceService with explicit runtime selection:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "autogluon-titanic"
spec:
  predictor:
    model:
      modelFormat:
        name: autogluon
      protocolVersion: v2
      runtime: kserve-autogluonserver
      storageUri: "gs://your-bucket/autogluon-model/"
      resources:
        requests:
          cpu: "100m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"

Apply the manifest:

kubectl apply -f autogluon.yaml
Runtime Availability

The kserve-autogluonserver runtime may not be installed by default in every release bundle. Verify that the ClusterServingRuntime exists in your cluster before deploying the InferenceService.

Run Inference

First, determine the ingress gateway host and port for your cluster, then export them as INGRESS_HOST and INGRESS_PORT.

REST v1 Example

Save the following request payload as autogluon-input-v1.json:

{
  "instances": [
    {
      "PassengerId": 1,
      "Pclass": 3,
      "Sex": "male"
    },
    {
      "PassengerId": 2,
      "Pclass": 1,
      "Sex": "female"
    }
  ]
}
SERVICE_HOSTNAME=$(kubectl get inferenceservice autogluon-titanic -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @./autogluon-input-v1.json \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/autogluon-titanic:predict
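The v1 request file can also be generated programmatically. A minimal sketch using only the standard library; the Titanic column names are illustrative, and in practice the keys must match the features the predictor was trained on:

```python
import json

# Rows as plain dicts; keys must match the model's training features
# (these Titanic columns are illustrative).
rows = [
    {"PassengerId": 1, "Pclass": 3, "Sex": "male"},
    {"PassengerId": 2, "Pclass": 1, "Sex": "female"},
]

# The v1 protocol wraps row-oriented records in an "instances" list.
payload = {"instances": rows}

with open("autogluon-input-v1.json", "w") as f:
    json.dump(payload, f, indent=2)
```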

REST v2 Example

For v2 requests, provide one input tensor per feature. Each tensor name must match the feature name expected by the model, and all features must have a consistent batch length.

Save the payload as autogluon-input-v2.json:

{
  "inputs": [
    { "name": "PassengerId", "shape": [2], "datatype": "INT64", "data": [1, 2] },
    { "name": "Pclass", "shape": [2], "datatype": "INT64", "data": [3, 1] },
    { "name": "Sex", "shape": [2], "datatype": "BYTES", "data": ["male", "female"] }
  ]
}
SERVICE_HOSTNAME=$(kubectl get inferenceservice autogluon-titanic -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @./autogluon-input-v2.json \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/autogluon-titanic/infer
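Because the v2 payload is columnar (one tensor per feature), it is easy to build from column data. A sketch with a minimal type-to-datatype mapping; extend the mapping for floats, booleans, and other types your schema needs:

```python
import json

def v2_datatype(values):
    """Minimal Python-type to KServe v2 datatype mapping (extend as needed)."""
    if all(isinstance(v, int) for v in values):
        return "INT64"
    if all(isinstance(v, float) for v in values):
        return "FP64"
    return "BYTES"  # strings travel as BYTES in the v2 protocol

# One column per feature; all columns must share the same batch length.
columns = {
    "PassengerId": [1, 2],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
}

inputs = [
    {"name": name, "shape": [len(vals)], "datatype": v2_datatype(vals), "data": vals}
    for name, vals in columns.items()
]

with open("autogluon-input-v2.json", "w") as f:
    json.dump({"inputs": inputs}, f, indent=2)
```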

Expected response:

{
  "model_name": "autogluon-titanic",
  "outputs": [
    { "name": "predictions", "datatype": "INT64", "shape": [2], "data": [1, 0] }
  ]
}
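To consume the response, index the output tensors by name and read their `data` fields. A sketch using the response shown above as a literal:

```python
# The v2 response shown above, as a Python dict.
response = {
    "model_name": "autogluon-titanic",
    "outputs": [
        {"name": "predictions", "datatype": "INT64", "shape": [2], "data": [1, 0]}
    ],
}

# Index output tensors by name, then read the prediction for each input row.
outputs = {t["name"]: t["data"] for t in response["outputs"]}
predictions = outputs["predictions"]
print(predictions)  # [1, 0]
```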

Prediction Probabilities

The AutoGluon runtime can return class probabilities instead of class labels when the PREDICT_PROBA=true environment variable is set in the runtime container configuration.

note

When probability output is enabled, the output schema differs from the class-prediction output: each row carries one probability per class rather than a single predicted label.
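For illustration, a sketch of recovering hard predictions from a probability matrix client-side. The shape and label ordering here are assumptions; the actual probability schema depends on the runtime version, so inspect a real response before relying on this:

```python
# Hypothetical probability output: one row per input, one column per class.
# Both the class label order and the values below are assumed for illustration.
class_labels = [0, 1]
proba = [[0.21, 0.79], [0.88, 0.12]]

# Take the most probable class per row to recover hard predictions.
predicted = [class_labels[row.index(max(row))] for row in proba]
print(predicted)  # [1, 0]
```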

Troubleshooting

  • Ensure storageUri points to a directory created by TabularPredictor.save(path).
  • For v2 requests, verify each feature is provided as a separate tensor with matching batch length.
  • If no runtime is selected automatically, set runtime: kserve-autogluonserver explicitly.
