AWS Databricks Plugin
Using Akeyless Secrets in AWS Databricks for DataOps and MLOps
Overview
Managing secrets across AWS services can be challenging, especially in a multi-cloud environment or when moving workloads between AWS and Azure.
With Akeyless Vault, you can manage secrets centrally and securely for Databricks workloads, including:
- Delta Live Tables (DLT) and Non-DLT jobs
- DataOps and MLOps use cases
Akeyless helps you avoid scattering secrets across AWS Secrets Manager, Databricks secret scopes, and other platforms, giving you a cloud-agnostic, portable solution.
Supported Languages in Databricks
Databricks supports:
- Python (natively supported by Akeyless SDK)
- Scala & R (via spark.conf or dbutils)
- SQL (generally does not require secrets)
Requirements
AWS Infrastructure
- An IAM role for EC2 instances used by Databricks compute clusters
- A cross-account IAM role allowing Databricks to manage AWS resources
- A properly configured Instance Profile in Databricks that attaches the EC2 IAM role to your clusters (a quick verification sketch follows this list)
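Before wiring anything to Akeyless, it can be worth confirming which IAM role a running cluster has actually assumed, since that role is what the Akeyless AWS IAM Auth Method will validate. The snippet below is a minimal sketch that assumes boto3 is available on the cluster (install it with %pip install boto3 if it is not).
import boto3
# Ask STS who this notebook is running as; with an instance profile attached,
# this returns the assumed-role identity of the cluster's IAM role.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"])  # AWS account ID
print(identity["Arn"])      # e.g. arn:aws:sts::<account>:assumed-role/<role-name>/<session>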
Akeyless Configuration
- An Akeyless Access ID
- An AWS IAM Auth Method created in Akeyless
- A secret stored in Akeyless (e.g., /devops/data_gov_api_key)
- A Databricks workspace with outbound internet access to the Akeyless SaaS API, or network access to a self-hosted Akeyless Gateway (see the sketch after this list)
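If the workspace cannot reach the public Akeyless API, the SDK can be pointed at a self-hosted Akeyless Gateway instead; only the host passed to akeyless.Configuration changes. The hostname, port, and path below are placeholders for your Gateway's REST API endpoint, which depends on how the Gateway is deployed.
import akeyless
# Placeholder endpoint: replace with the REST API URL of your own Gateway
gateway_host = "https://akeyless-gw.example.com:8080/v2"
config = akeyless.Configuration(host=gateway_host)
client = akeyless.ApiClient(config)
api = akeyless.V2Api(client)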
Architecture Overview
Databricks EC2 → AWS IAM Role → Akeyless IAM Auth Method → Secret Retrieval → Use in Notebook (Python, DLT)
Step-by-Step Guide: Using Akeyless in Databricks
Step 1: Install Required Packages
%pip install akeyless akeyless_cloud_id
%restart_python
Installs the Akeyless SDK and cloud identity helper, then restarts the Python kernel in Databricks.
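On runtimes where the %restart_python magic is not available, restarting the Python process through dbutils should achieve the same effect; a minimal sketch:
# Cell 1: install the packages
%pip install akeyless akeyless_cloud_id
# Cell 2: restart the Python process so the newly installed libraries can be imported
dbutils.library.restartPython()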
Step 2: Authenticate with Akeyless Using AWS IAM
from akeyless_cloud_id import CloudId
import akeyless
# Set up Akeyless client
config = akeyless.Configuration(host="https://api.akeyless.io")
client = akeyless.ApiClient(config)
api = akeyless.V2Api(client)
# Generate cloud ID from IAM role
cloud_id = CloudId().generate()
# Replace with your actual Akeyless Access ID
access_id = "REPLACE_WITH_YOUR_ACCESS_ID"
# Define secret path
secret_path = "/devops/data_gov_api_key"
# Authenticate using AWS IAM
auth_request = akeyless.Auth(
    access_id=access_id,
    access_type="aws_iam",
    cloud_id=cloud_id
)
token = api.auth(auth_request).token
This uses the IAM role attached to the Databricks EC2 instance to securely get a short-lived session token from Akeyless.
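If several notebooks or jobs need to authenticate, the two calls above can be wrapped in a small helper. The function name get_akeyless_token below is a hypothetical convenience, not part of the SDK; it reuses exactly the SDK calls shown in this step.
def get_akeyless_token(api: akeyless.V2Api, access_id: str) -> str:
    """Authenticate via the cluster's IAM role and return a short-lived Akeyless token."""
    cloud_id = CloudId().generate()
    auth_request = akeyless.Auth(
        access_id=access_id,
        access_type="aws_iam",
        cloud_id=cloud_id,
    )
    return api.auth(auth_request).token
# Usage
token = get_akeyless_token(api, access_id)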
Step 3: Retrieve a Secret from Akeyless
# Get the API key from Akeyless Vault
secret_request = akeyless.GetSecretValue(names=[secret_path], token=token)
res = api.get_secret_value(secret_request)
API_KEY = res[secret_path]
This fetches your API key securely and stores it in the API_KEY variable.
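Since GetSecretValue accepts a list of names, several related secrets can be fetched in one call; the response is a dictionary keyed by secret path. The second path below is a hypothetical example.
# Hypothetical second secret path, shown for illustration only
paths = ["/devops/data_gov_api_key", "/devops/db_password"]
res = api.get_secret_value(akeyless.GetSecretValue(names=paths, token=token))
API_KEY = res["/devops/data_gov_api_key"]
DB_PASSWORD = res["/devops/db_password"]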
Step 4: Use the Secret (API Call Example)
import requests
url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
params = {"api_key": API_KEY, "per_page": 5}
response = requests.get(url, params=params)
Uses the API key to fetch public healthcare data as an example.
Step 5: Load API Data into a Spark Table
import pandas as pd
from io import StringIO
if response.status_code == 200:
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    df = spark.createDataFrame(pdf.fillna(""))
    table_name = "baby_names_by_non_dlt"
    df.write.mode("overwrite").saveAsTable(table_name)
    display(spark.sql(f"SELECT * FROM {table_name}"))
else:
    print(f"❌ API Request Failed: {response.status_code}")
DLT Version: Using Akeyless in Delta Live Tables
The only change from the non-DLT version is how the data is saved:
import dlt

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="This table is created by a DLT pipeline."
)
def mytable():
    return spark.createDataFrame(pdf)
Uses DLT’s @dlt.table decorator to register the DataFrame as a managed DLT table.
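For reference, a complete pipeline notebook that combines the authentication, secret retrieval, and ingestion steps might look roughly like the sketch below. It is a minimal sketch under the same assumptions as the steps above (same access ID, secret path, and dataset URL); DLT invokes the decorated function itself when the pipeline runs.
import dlt
import requests
import pandas as pd
from io import StringIO
from akeyless_cloud_id import CloudId
import akeyless

ACCESS_ID = "REPLACE_WITH_YOUR_ACCESS_ID"
SECRET_PATH = "/devops/data_gov_api_key"
URL = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"

def fetch_api_key() -> str:
    # Authenticate with the cluster's IAM role and read the secret (Steps 2-3)
    api = akeyless.V2Api(akeyless.ApiClient(akeyless.Configuration(host="https://api.akeyless.io")))
    auth = akeyless.Auth(access_id=ACCESS_ID, access_type="aws_iam", cloud_id=CloudId().generate())
    token = api.auth(auth).token
    return api.get_secret_value(akeyless.GetSecretValue(names=[SECRET_PATH], token=token))[SECRET_PATH]

@dlt.table(
    name="default.baby_names_by_dlt_pipeline",
    comment="Baby names data fetched with an API key retrieved from Akeyless."
)
def baby_names_by_dlt_pipeline():
    response = requests.get(URL, params={"api_key": fetch_api_key(), "per_page": 5})
    response.raise_for_status()
    pdf = pd.read_csv(StringIO(response.text))
    pdf.columns = pdf.columns.str.replace(" ", "_")
    return spark.createDataFrame(pdf.fillna(""))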
IAM Role Configuration (Summary)
You’ll need:
- IAM Role for EC2: Trusted entity includes ec2.amazonaws.com and Databricks AWS accounts
- IAM Role for cross-account: Allows Databricks to provision compute
- External IDs: Use STORAGE_EXTERNAL_ID and DATABRICKS_WORKSPACE_ID in trust policies
Example trust policy (EC2 role):
{
  "Effect": "Allow",
  "Principal": {
    "Service": "ec2.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}
Example trust policy (cross-account):
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::414351767826:root"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "<STORAGE_EXTERNAL_ID>"
    }
  }
}
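If you script the role setup, the same trust statement can be wrapped in a standard policy document and applied with boto3. The role name below is a placeholder, and the external ID must be replaced with the value from your Databricks account console.
import json
import boto3

# Placeholder values - substitute your own role name and external ID
ROLE_NAME = "databricks-cross-account-role"
EXTERNAL_ID = "<STORAGE_EXTERNAL_ID>"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::414351767826:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

# Create the cross-account role with the trust policy above
boto3.client("iam").create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)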
Best Practices
- Use Instance Profiles in Databricks to map to IAM roles
- Use Akeyless short-lived tokens to avoid hardcoding secrets
- For Scala or R notebooks, set the secret in spark.conf and read it back in the appropriate language
Example:
# Python cell
spark.conf.set("api.key", API_KEY)
// Scala cell
val apiKey = spark.conf.get("api.key")
println(apiKey)