AWS Databricks Plugin

Using Akeyless Secrets in AWS Databricks for DataOps and MLOps

Overview

Managing secrets across AWS services can be challenging, especially in a multi-cloud environment or when moving workloads between AWS and Azure.

With Akeyless Vault, you can manage secrets centrally and securely for Databricks workloads, including:

  • Delta Live Tables (DLT) and Non-DLT jobs
  • DataOps and MLOps use cases

Akeyless helps avoid secret sprawl across AWS Secrets Manager, Databricks secret scopes, and other platforms, giving you a cloud-agnostic and portable solution.

Supported Languages in Databricks

Databricks supports:

  • Python (natively supported by Akeyless SDK)
  • Scala & R (via spark.conf or dbutils)
  • SQL (generally does not require secrets)

Requirements

AWS Infrastructure

  • An IAM role for EC2 instances used by Databricks compute clusters
  • A cross-account IAM role allowing Databricks to manage AWS resources
  • Properly configured Instance Profile in Databricks to link IAM roles (see the sketch after this list)
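
On the AWS side, the instance profile that wraps the EC2 role can be created with a few IAM calls before registering it in the Databricks admin console. A minimal boto3 sketch, assuming hypothetical role and profile names (replace them with your own; the role itself must already exist with the EC2 trust policy shown later in this guide):

import boto3

iam = boto3.client("iam")

# Hypothetical names used for illustration only
role_name = "databricks-ec2-role"
profile_name = "databricks-instance-profile"

# Create the instance profile and attach the existing EC2 role to it
iam.create_instance_profile(InstanceProfileName=profile_name)
iam.add_role_to_instance_profile(
    InstanceProfileName=profile_name,
    RoleName=role_name,
)

# The resulting ARN is what you register as an Instance Profile in Databricks
arn = iam.get_instance_profile(InstanceProfileName=profile_name)["InstanceProfile"]["Arn"]
print(arn)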

Akeyless Configuration

  • An Akeyless Access ID
  • An AWS IAM Auth Method created in Akeyless
  • A secret stored in Akeyless (e.g., /devops/data_gov_api_key)
  • A Databricks workspace with outbound internet access to the Akeyless API, or access to an Akeyless Gateway

Architecture Overview

Databricks EC2 → AWS IAM Role → Akeyless IAM Auth Method → Secret Retrieval → Use in Notebook (Python, DLT)

Step-by-Step Guide: Using Akeyless in Databricks

Step 1: Install Required Packages

%pip install akeyless akeyless_cloud_id
%restart_python

Installs the Akeyless SDK and cloud identity helper, then restarts the Python kernel in Databricks.
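
On runtimes where the %restart_python magic is not available, the restart can be triggered from a Python cell with dbutils instead (a one-line alternative, assuming a recent Databricks Runtime):

# Restart the Python process so the freshly installed packages are picked up
dbutils.library.restartPython()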

Step 2: Authenticate with Akeyless Using AWS IAM

from akeyless_cloud_id import CloudId
import akeyless

# Set up Akeyless client
config = akeyless.Configuration(host="https://api.akeyless.io")
client = akeyless.ApiClient(config)
api = akeyless.V2Api(client)

# Generate cloud ID from IAM role
cloud_id = CloudId().generate()

# Replace with your actual Akeyless Access ID
access_id = "REPLACE_WITH_YOUR_ACCESS_ID"

# Define secret path
secret_path = "/devops/data_gov_api_key"

# Authenticate using AWS IAM
auth_request = akeyless.Auth(
    access_id=access_id,
    access_type="aws_iam",
    cloud_id=cloud_id
)
token = api.auth(auth_request).token

This uses the IAM role attached to the Databricks cluster's EC2 instances to obtain a short-lived session token from Akeyless, so no static credentials are stored in the notebook.

Step 3: Retrieve a Secret from Akeyless

# Get the API key from Akeyless Vault
secret_request = akeyless.GetSecretValue(names=[secret_path], token=token)
res = api.get_secret_value(secret_request)
API_KEY = res[secret_path]

This fetches your API key securely and stores it in the API_KEY variable.
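
If you retrieve several secrets, or reuse this pattern across notebooks and jobs, it can help to wrap Steps 2 and 3 into a single helper. A minimal sketch that only reuses the SDK calls shown above (the function name and module-level client are illustrative, not part of the Akeyless SDK):

from akeyless_cloud_id import CloudId
import akeyless

_config = akeyless.Configuration(host="https://api.akeyless.io")
_api = akeyless.V2Api(akeyless.ApiClient(_config))

def get_akeyless_secret(path: str, access_id: str) -> str:
    """Authenticate via AWS IAM and return the static secret stored at `path`."""
    cloud_id = CloudId().generate()
    token = _api.auth(akeyless.Auth(
        access_id=access_id,
        access_type="aws_iam",
        cloud_id=cloud_id,
    )).token
    res = _api.get_secret_value(akeyless.GetSecretValue(names=[path], token=token))
    return res[path]

# Example usage
API_KEY = get_akeyless_secret("/devops/data_gov_api_key", "REPLACE_WITH_YOUR_ACCESS_ID")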

Step 4: Use the Secret (API Call Example)

import requests

url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
params = {"api_key": API_KEY, "per_page": 5}
response = requests.get(url, params=params)

Uses the API key to fetch a public dataset from the New York State open health data portal as an example.

Step 5: Load API Data into a Spark Table

import pandas as pd
from io import StringIO

if response.status_code == 200:
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    df = spark.createDataFrame(pdf.fillna(""))

    table_name = "baby_names_by_non_dlt"
    df.write.mode("overwrite").saveAsTable(table_name)
    display(spark.sql(f"SELECT * FROM {table_name}"))
else:
    print(f"❌ API Request Failed: {response.status_code}")

DLT Version: Using Akeyless in Delta Live Tables

The main change from the non-DLT version is how the data is saved. Instead of writing the DataFrame with saveAsTable, you import dlt and return the DataFrame from a function decorated with @dlt.table:

import dlt

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="This table is created by a DLT pipeline."
)
def mytable():
    # pdf is the pandas DataFrame built from the API response in Step 5
    return spark.createDataFrame(pdf)

Uses DLT’s @dlt.table decorator to register the DataFrame as a managed DLT table.
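
Because a DLT pipeline runs the notebook source itself rather than an interactive session, it is often cleaner to keep the API call inside the table function. A minimal end-to-end sketch combining Steps 2 to 4 with the DLT definition (the function name and comment are illustrative; API_KEY is assumed to have been retrieved from Akeyless earlier in the same notebook):

import dlt
import pandas as pd
import requests
from io import StringIO

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="Table created by a DLT pipeline using an API key retrieved from Akeyless."
)
def baby_names_by_dlt_pipeline():
    # API_KEY comes from the Akeyless retrieval shown in Steps 2-3
    url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
    response = requests.get(url, params={"api_key": API_KEY, "per_page": 5})
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    return spark.createDataFrame(pdf.fillna(""))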

IAM Role Configuration (Summary)

You’ll need:

  • IAM Role for EC2: Trusted entity includes ec2.amazonaws.com and Databricks AWS accounts
  • IAM Role for cross-account: Allows Databricks to provision compute
  • External IDs: Use STORAGE_EXTERNAL_ID and DATABRICKS_WORKSPACE_ID in trust policies

Example trust policy (EC2 role):

{
  "Effect": "Allow",
  "Principal": {
    "Service": "ec2.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}

Example trust policy (cross-account):

{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::414351767826:root"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "<STORAGE_EXTERNAL_ID>"
    }
  }
}

Best Practices

  • Use Instance Profiles in Databricks to map to IAM roles
  • Use Akeyless short-lived tokens to avoid hardcoding secrets
  • For Scala or R notebooks, set the secret in spark.conf and read it back in the appropriate language

Example:

# Python cell
spark.conf.set("api.key", API_KEY)

// Scala cell
val apiKey = spark.conf.get("api.key")
println(apiKey)
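
Anything placed in spark.conf stays readable by other code running in the same Spark session, so once the Scala (or R) cell has read the key it may be worth clearing it again. An optional follow-up, assuming the same api.key name used above:

# Python cell: remove the key from the session config once it has been consumed
spark.conf.unset("api.key")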
