AWS Databricks Plugin

Using Akeyless Secrets in AWS Databricks for DataOps and MLOps

Overview

Managing secrets across AWS services can be challenging, especially in a multi-cloud environment or when moving workloads between AWS and Azure.

With Akeyless Vault, you can manage secrets centrally and securely for Databricks workloads, including:

  • Delta Live Tables (DLT) and Non-DLT jobs
  • DataOps and MLOps use cases

Akeyless helps avoid secret sprawl across AWS Secrets Manager, Databricks secret scopes, and other platforms, giving you a cloud-agnostic and portable solution.

Supported Languages in Databricks

Databricks supports:

  • Python (natively supported by Akeyless SDK)
  • Scala & R (via spark.conf or dbutils)
  • SQL (generally does not require secrets)

Requirements

AWS Infrastructure

  • An IAM role for EC2 instances used by Databricks compute clusters
  • A cross-account IAM role allowing Databricks to manage AWS resources
  • Properly configured Instance Profile in Databricks to link IAM roles (see the sketch after this list)
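
On the AWS side, the instance profile that wraps the EC2 role can be created with a few IAM calls before registering it in the Databricks admin console. A minimal boto3 sketch, assuming hypothetical role and profile names (replace them with your own; the role itself must already exist with the EC2 trust policy shown later in this guide):

import boto3

iam = boto3.client("iam")

# Hypothetical names used for illustration only
role_name = "databricks-ec2-role"
profile_name = "databricks-instance-profile"

# Create the instance profile and attach the existing EC2 role to it
iam.create_instance_profile(InstanceProfileName=profile_name)
iam.add_role_to_instance_profile(
    InstanceProfileName=profile_name,
    RoleName=role_name,
)

# The resulting ARN is what you register as an Instance Profile in Databricks
arn = iam.get_instance_profile(InstanceProfileName=profile_name)["InstanceProfile"]["Arn"]
print(arn)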

Akeyless Configuration

  • An Akeyless Access ID
  • An AWS IAM Auth Method created in Akeyless
  • A secret stored in Akeyless (e.g., /devops/data_gov_api_key)
  • A Databricks workspace with outbound internet access to the Akeyless API, or access to an Akeyless Gateway

Architecture Overview

Databricks EC2 → AWS IAM Role → Akeyless IAM Auth Method → Secret Retrieval → Use in Notebook (Python, DLT)

Step-by-Step Guide: Using Akeyless in Databricks

Step 1: Install Required Packages

%pip install akeyless akeyless_cloud_id
%restart_python

Installs the Akeyless SDK and cloud identity helper, then restarts the Python kernel in Databricks.
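
On runtimes where the %restart_python magic is not available, the restart can be triggered from a Python cell with dbutils instead (a one-line alternative, assuming a recent Databricks Runtime):

# Restart the Python process so the freshly installed packages are picked up
dbutils.library.restartPython()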

Step 2: Authenticate with Akeyless Using AWS IAM

from akeyless_cloud_id import CloudId
import akeyless

# Set up Akeyless client
config = akeyless.Configuration(host="https://api.akeyless.io")
client = akeyless.ApiClient(config)
api = akeyless.V2Api(client)

# Generate cloud ID from IAM role
cloud_id = CloudId().generate()

# Replace with your actual Akeyless Access ID
access_id = "REPLACE_WITH_YOUR_ACCESS_ID"

# Define secret path
secret_path = "/devops/data_gov_api_key"

# Authenticate using AWS IAM
auth_request = akeyless.Auth(
    access_id=access_id,
    access_type="aws_iam",
    cloud_id=cloud_id
)
token = api.auth(auth_request).token

This uses the IAM role attached to the Databricks cluster's EC2 instances to obtain a short-lived session token from Akeyless, so no static credentials are stored in the notebook.

Step 3: Retrieve a Secret from Akeyless

# Get the API key from Akeyless Vault
secret_request = akeyless.GetSecretValue(names=[secret_path], token=token)
res = api.get_secret_value(secret_request)
API_KEY = res[secret_path]

This fetches your API key securely and stores it in the API_KEY variable.
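
If you retrieve several secrets, or reuse this pattern across notebooks and jobs, it can help to wrap Steps 2 and 3 into a single helper. A minimal sketch that only reuses the SDK calls shown above (the function name and module-level client are illustrative, not part of the Akeyless SDK):

from akeyless_cloud_id import CloudId
import akeyless

_config = akeyless.Configuration(host="https://api.akeyless.io")
_api = akeyless.V2Api(akeyless.ApiClient(_config))

def get_akeyless_secret(path: str, access_id: str) -> str:
    """Authenticate via AWS IAM and return the static secret stored at `path`."""
    cloud_id = CloudId().generate()
    token = _api.auth(akeyless.Auth(
        access_id=access_id,
        access_type="aws_iam",
        cloud_id=cloud_id,
    )).token
    res = _api.get_secret_value(akeyless.GetSecretValue(names=[path], token=token))
    return res[path]

# Example usage
API_KEY = get_akeyless_secret("/devops/data_gov_api_key", "REPLACE_WITH_YOUR_ACCESS_ID")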

Step 4: Use the Secret (API Call Example)

import requests

url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
params = {"api_key": API_KEY, "per_page": 5}
response = requests.get(url, params=params)

Uses the API key to fetch a public dataset from the New York State open health data portal as an example.

Step 5: Load API Data into a Spark Table

import pandas as pd
from io import StringIO

if response.status_code == 200:
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    df = spark.createDataFrame(pdf.fillna(""))

    table_name = "baby_names_by_non_dlt"
    df.write.mode("overwrite").saveAsTable(table_name)
    display(spark.sql(f"SELECT * FROM {table_name}"))
else:
    print(f"❌ API Request Failed: {response.status_code}")

DLT Version: Using Akeyless in Delta Live Tables

The main change from the non-DLT version is how the data is saved. Instead of writing the DataFrame with saveAsTable, you import dlt and return the DataFrame from a function decorated with @dlt.table:

import dlt

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="This table is created by a DLT pipeline."
)
def mytable():
    # pdf is the pandas DataFrame built from the API response in Step 5
    return spark.createDataFrame(pdf)

Uses DLT’s @dlt.table decorator to register the DataFrame as a managed DLT table.
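
Because a DLT pipeline runs the notebook source itself rather than an interactive session, it is often cleaner to keep the API call inside the table function. A minimal end-to-end sketch combining Steps 2 to 4 with the DLT definition (the function name and comment are illustrative; API_KEY is assumed to have been retrieved from Akeyless earlier in the same notebook):

import dlt
import pandas as pd
import requests
from io import StringIO

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="Table created by a DLT pipeline using an API key retrieved from Akeyless."
)
def baby_names_by_dlt_pipeline():
    # API_KEY comes from the Akeyless retrieval shown in Steps 2-3
    url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
    response = requests.get(url, params={"api_key": API_KEY, "per_page": 5})
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    return spark.createDataFrame(pdf.fillna(""))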

IAM Role Configuration (Summary)

You’ll need:

  • IAM Role for EC2: Trusted entity includes ec2.amazonaws.com and Databricks AWS accounts
  • IAM Role for cross-account: Allows Databricks to provision compute
  • External IDs: Use STORAGE_EXTERNAL_ID and DATABRICKS_WORKSPACE_ID in trust policies

Example trust policy (EC2 role):

{
  "Effect": "Allow",
  "Principal": {
    "Service": "ec2.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}

Example trust policy (cross-account):

{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::414351767826:root"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "<STORAGE_EXTERNAL_ID>"
    }
  }
}

Best Practices

  • Use Instance Profiles in Databricks to map to IAM roles
  • Use Akeyless short-lived tokens to avoid hardcoding secrets
  • For Scala or R notebooks, set the secret in spark.conf and read it back in the appropriate language

Example:

# Python cell
spark.conf.set("api.key", API_KEY)

// Scala cell
val apiKey = spark.conf.get("api.key")
println(apiKey)
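
Anything placed in spark.conf stays readable by other code running in the same Spark session, so once the Scala (or R) cell has read the key it may be worth clearing it again. An optional follow-up, assuming the same api.key name used above:

# Python cell: remove the key from the session config once it has been consumed
spark.conf.unset("api.key")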
