django-ms-fabric-livy

Run Spark code through an Apache Livy or Microsoft Fabric Livy endpoint from a Django application. Authentication relies on Django and django-azure-auth.

Requirements

Create an EntraID application and apply the following settings
- On the Manage/Authentication section
  - Add a redirect URL in the Web section: http://localhost:5000/azure_auth/callback (use another endpoint for other environments)
- On the Certificates & secrets section
  - Create a client secret and store the value in the .env file (CLIENT_SECRET)
- On the Token configuration section
  - Add Optional Claim
    - Token type: ID
    - Claim: upn
  - Add Group Claim (ID, Access, SAML)
    - Group ID
    - Enable: Emit groups as role claims
- On the API permissions section
  - Add permissions (and grant admin consent) for
    - Microsoft Graph (email, GroupMember.Read.All, profile, User.Read, User.ReadBasic.All)
    - Power BI Service (Code.AccessAzureDataExplorer.All, Code.AccessAzureDataLake.All, Code.AccessAzureKeyvault.All, Code.AccessFabric.All, Code.AccessStorage.All, Item.ReadWrite.All, Lakehouse.Execute.All, Workspace.ReadWrite.All)
Set the remaining values in the .env file
- DJANGO_SECRET = The random secret key generated by the Django framework when a project is created
- TENANT_ID = Your Azure tenant ID
- CLIENT_ID = Your EntraID application Client ID
- CLIENT_SECRET = Your EntraID application secret
- REDIRECT_URI = "http://localhost:5000/azure_auth/callback" for local testing. Use another endpoint for other environments
- LOGOUT_URI = "http://localhost:5000/logout" for local testing. Use another endpoint for other environments
- ROLES = '{"My_Admin_Entra_Group_ObjectID": "Administrators", "My_Editors_Entra_Group_ObjectID": "Editors", "My_Viewers_Entra_Group_ObjectID": "Viewers"}'. This maps the groups defined on the EntraID side to the groups you create in Django admin
- GRAPH_USER_ENDPOINT = "https://graph.microsoft.com/v1.0/me"
- GRAPH_MEMBER_ENDPOINT = "https://graph.microsoft.com/v1.0/me/memberOf"
- LIVY_BACKEND: Possible values are "apache" or "fabric"
- LIVY_BASE_ENDPOINT = "https://api.fabric.microsoft.com/v1/workspaces/MyWorkSpaceID/lakehouses/MyLakeHouseID/livyapi/versions/2023-12-01". Replace MyWorkSpaceID and MyLakeHouseID with the right values. You can also use an Apache Livy endpoint, for example http://localhost:8998 for local tests
- LIVY_REQUESTS_TIMEOUT: The timeout in seconds for the Livy REST API requests
- LIVY_SESSION_NAME_PREFIX: A prefix to use for session names. Example: MyApp-. A datetime will be appended to this prefix name
- LIVY_SPARK_CONF: Optional custom Spark configuration. For Microsoft Fabric only, an environment can be selected with the Spark configuration '{"spark.fabric.environmentDetails" : "{\"id\": \"My_EnvironmentID\"}"}'. You can get the environment ID from your Fabric workspace using the REST API: https://learn.microsoft.com/en-us/rest/api/fabric/environment/items/list-environments?tabs=HTTP. If no environment ID is specified, the session uses the workspace's default environment on the default pool. For a faster startup, sessions can use the Starter Pool, a medium-sized, prehydrated live pool that is automatically created for each workspace. More information on Starter Pools: https://learn.microsoft.com/en-us/fabric/data-engineering/configure-starter-pools
- LIVY_SPARK_DEPENDENCIES: Optional. A comma-separated list of absolute paths to the Python packages to use in the Spark session. For example: "abfss://…path-to…/Files/packages/mypackage-0.1.0-py3-none-any.whl"
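To make the ROLES mapping more concrete, here is a minimal sketch of how such a value could be parsed and applied to the group IDs found in a user's token. The function names load_role_mapping and django_groups_for are hypothetical and are not taken from this project or from django-azure-auth; the group IDs are the placeholders used above.

```python
import json


def load_role_mapping(raw_roles):
    """Parse a ROLES value into {entra_group_object_id: django_group_name}."""
    mapping = json.loads(raw_roles)
    if not isinstance(mapping, dict):
        raise ValueError("ROLES must be a JSON object")
    return {str(key): str(value) for key, value in mapping.items()}


def django_groups_for(entra_group_ids, role_mapping):
    """Return the sorted Django group names matching the given Entra group IDs."""
    return sorted({role_mapping[g] for g in entra_group_ids if g in role_mapping})


# Example value in the same shape as the ROLES entry of the .env file.
example_roles = (
    '{"My_Admin_Entra_Group_ObjectID": "Administrators", '
    '"My_Editors_Entra_Group_ObjectID": "Editors", '
    '"My_Viewers_Entra_Group_ObjectID": "Viewers"}'
)
role_mapping = load_role_mapping(example_roles)

# Group IDs without an entry in ROLES are ignored.
user_groups = django_groups_for(
    ["My_Editors_Entra_Group_ObjectID", "Some_Unmapped_ObjectID"], role_mapping
)
print(user_groups)  # ['Editors']
```

The Django group names on the right-hand side of the mapping must match the groups created in Django admin exactly, since the lookup here is a plain string comparison.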
Create groups on Django admin
- Temporarily disable (comment out) AUTHENTICATION_BACKENDS = ("azure_auth.backends.AzureBackend",) in the settings.py file
- Apply the migrations with python manage.py migrate, then create an admin account with python manage.py createsuperuser (the migrations must run first so that the user table exists)
- Start the Django myapp with python manage.py runserver localhost:5000
- Log in with the admin account at http://localhost:5000/admin
- Create groups: Administrators, Editors, Viewers
- Assign privileges to the groups
- Re-enable AUTHENTICATION_BACKENDS = ("azure_auth.backends.AzureBackend",) in the settings.py file, and restart the Django myapp
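If editing settings.py by hand twice feels error-prone, one possible alternative is to switch the backend from an environment variable. This is only a sketch: the USE_AZURE_AUTH variable and the select_auth_backends helper are assumptions introduced for this example and do not exist in the project.

```python
import os


def select_auth_backends(env):
    """Pick the authentication backends from a mapping of environment values.

    USE_AZURE_AUTH is a hypothetical switch: "1" (the default) keeps the Azure
    backend, any other value falls back to Django's local user database.
    """
    if env.get("USE_AZURE_AUTH", "1") == "1":
        return ("azure_auth.backends.AzureBackend",)
    # Django's built-in backend, needed to log in with the createsuperuser account.
    return ("django.contrib.auth.backends.ModelBackend",)


# In settings.py this would replace the hard-coded tuple.
AUTHENTICATION_BACKENDS = select_auth_backends(os.environ)
```

With this in place, setting USE_AZURE_AUTH=0 before starting the server would allow the local admin login, and removing it would restore the EntraID login after the groups are created.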

Setup

Create a virtual environment and run:

pip install -r requirements.txt

How to

The project was already created with the command below; do not run it again

django-admin startproject myapp

Start the Django myapp

cd myapp
python manage.py migrate
python manage.py runserver localhost:5000

Important

You need to manage the Fabric token expiration as well as the Livy session timeout (ttl, see the Apache Livy reference below)
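One way to watch for token expiration is to read the exp claim of the access token and refresh it shortly before that time. The sketch below assumes the token is a JWT carrying an exp claim; it decodes the payload without verifying the signature, so it is suitable only for scheduling a refresh, never for trusting the token. The helper names and the 300-second margin are assumptions for this example. Relying on the expiry returned with the token response is generally the more robust option.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the exp claim (seconds since the epoch) of a JWT, unverified."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def needs_refresh(token, margin_seconds=300, now=None):
    """True when the token expires within margin_seconds (or already has)."""
    now = time.time() if now is None else now
    return jwt_expiry(token) - now <= margin_seconds


def make_unsigned_example_token(exp):
    """Build a fake, unsigned token for demonstration purposes only."""
    body = base64.urlsafe_b64encode(json.dumps({"exp": exp}).encode("utf-8"))
    return "header." + body.decode("ascii").rstrip("=") + ".signature"


example_token = make_unsigned_example_token(exp=1_700_000_600)
# 600 seconds left at this point in time, which is above the 300-second margin.
print(needs_refresh(example_token, now=1_700_000_000))
```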
If using Apache Livy 0.8, consider running a few java_import statements before running any Spark code. See: https://github.com/mounirbs/spark-livy/blob/main/python/livy/init_java_gateway.py#L11
Neither ttl nor idleTimeout seems to work properly in Fabric or Apache Livy. For Apache Livy, the binaries from https://livy.apache.org/download/ were used. The binaries may not reflect the code on the Apache Livy master branch: https://livy.incubator.apache.org/docs/latest/rest-api.html. Without these parameters, the session never times out.
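Since the server-side timeouts cannot be relied on, a client-side idle check is one possible workaround. The sketch below is an assumption about how this could be done and is not part of the project: the IdleTracker class is hypothetical, and the caller is expected to delete the Livy session (a DELETE request on the session resource) once the tracker reports expiry.

```python
import time


class IdleTracker:
    """Track the last time a session was used and report when it has been idle too long."""

    def __init__(self, idle_timeout_seconds, clock=time.monotonic):
        self._timeout = idle_timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_used = clock()

    def touch(self):
        """Record activity, for example after each statement is submitted."""
        self._last_used = self._clock()

    def is_expired(self):
        """True once the idle time reaches the configured timeout."""
        return self._clock() - self._last_used >= self._timeout


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
tracker = IdleTracker(idle_timeout_seconds=600, clock=lambda: fake_now[0])
fake_now[0] = 601.0
print(tracker.is_expired())  # True: 601 seconds without activity
```

A periodic task could call is_expired() for each open session and clean up those that have gone idle.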
The code is not production-ready because it does not yet handle all the required exceptions. This is only a proof of concept!
The “Magic” of Starter Pools: pre-warmed (pre-hydrated) pools are Microsoft’s way of providing a near-instant Spark experience. When a workspace is created, Fabric sets up a Spark cluster pool (medium size by default) that is “pre-warmed” with a default set of common libraries and incurs no cost until a session starts. When you request a session from a starter pool without a custom environment, Fabric can quickly allocate a ready-to-go cluster. This minimizes the setup time, leading to session starts of about 5-10 seconds.
Installing packages when using Livy/Fabric:
- Option 1 (Fabric only - Recommended for Production): Attaching an existing Fabric Environment (with the packages already installed) by specifying the environment ID in the Spark configuration of the LIVY_SPARK_CONF variable when starting your Livy session:
```
{"spark.fabric.environmentDetails" : "{\"id\": \"My_EnvironmentID\"}"}
```
- Option 2 (Fabric only - Recommended for Production): Using a default Fabric Environment (workspace-wide default) for the default pool (Starter or Custom). An administrator sets a particular Fabric Environment as the default in the workspace settings. Any new Spark session then uses this default environment automatically.
- Option 3 (Apache Livy/Fabric - Recommended for Production, but could be slow): Passing the packages when starting the session through the LIVY_SPARK_DEPENDENCIES environment variable (it maps to the Livy pyFiles setting), as a comma-separated list of absolute paths to the Python packages to use in the Spark session. While not as heavy as full environment provisioning, startup may take slightly longer because the files need to be retrieved and staged on the cluster nodes.
```
{
    "kind": "pyspark",
    "name": "MySessionWithCustomPyFiles",
    "pyFiles": [
      "abfss://<your_lakehouse_filesystem_name>@onelake.dfs.fabric.microsoft.com/<your_lakehouse_name>/Files/my_module.py",
      "abfss://<your_lakehouse_filesystem_name>@onelake.dfs.fabric.microsoft.com/<your_lakehouse_name>/Files/my_package.whl"
    ]
  }
```
- Option 4 (Apache Livy/Fabric - Not recommended for Production): Installing packages from within a started session. Suitable for interactive exploration and development where the package is only needed on the driver or for very specific, non-distributed tasks:
```
import subprocess

# If mypackage is on PyPI
print(subprocess.check_output(["pip", "install", "mypackage"]).decode("utf-8"))

# If mypackage is a wheel file stored on Fabric, mount the folder that contains it
from notebookutils import mssparkutils
mssparkutils.fs.mount("abfss://...adfs_path../Files/packages/", "/packages")

# The mounted folder is exposed under an absolute local path (/synfs/notebook/xxx-yyy-zzz/packages)
print(subprocess.check_output(["pip", "install", "/synfs/notebook/xxx-yyy-zzz/packages/mypackage-0.1.0-py3-none-any.whl"]).decode("utf-8"))

# Check installed packages
print(subprocess.check_output(["pip", "list"]).decode("utf-8"))
```
Performance Note for options 1, 2, and 3: Sessions using a custom environment do not benefit from the “fast startup” experience of the Starter Pool, because Fabric needs to provision the specific environment on the cluster nodes before your code begins executing. In return, the packages are available cluster-wide: they are installed on both the driver and the executor nodes and are ready for distributed compute.
Distributed Compute Note for option 4: Session-scoped packages (inline installation) work only in interactive notebooks, where you can install packages directly within your notebook session using inline commands like %pip install and where the packages are available on both the driver and the executors. With Fabric Livy or a Spark job definition, however, inline installation is disabled, and installing packages with subprocess.check_output(["pip", "install", ...]) only makes them available on the driver node and NOT on the executors. Your code will therefore fail if it uses functions or classes from such a package in distributed Spark operations (e.g., UDFs, custom transformations) that run on the executors.
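As a rough illustration of how options 1 and 3 end up in the session-creation request, the sketch below assembles a request body from the LIVY_SESSION_NAME_PREFIX, LIVY_SPARK_CONF and LIVY_SPARK_DEPENDENCIES values. The function name build_session_payload, the timestamp format, and passing the settings as a plain dictionary are assumptions made for this example; the project's own code may differ.

```python
import json
from datetime import datetime, timezone


def build_session_payload(env, now=None):
    """Assemble a Livy session-creation body from .env style settings."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "kind": "pyspark",
        # Assumed timestamp format; the project only states that a datetime is appended.
        "name": env.get("LIVY_SESSION_NAME_PREFIX", "") + now.strftime("%Y%m%d-%H%M%S"),
    }

    # Option 1: optional Spark configuration, stored as a JSON object string.
    spark_conf = env.get("LIVY_SPARK_CONF", "").strip()
    if spark_conf:
        payload["conf"] = json.loads(spark_conf)

    # Option 3: comma-separated absolute paths become the pyFiles list.
    raw_deps = env.get("LIVY_SPARK_DEPENDENCIES", "")
    dependencies = [path.strip() for path in raw_deps.split(",") if path.strip()]
    if dependencies:
        payload["pyFiles"] = dependencies

    return payload


example_env = {
    "LIVY_SESSION_NAME_PREFIX": "MyApp-",
    "LIVY_SPARK_CONF": r'{"spark.fabric.environmentDetails" : "{\"id\": \"My_EnvironmentID\"}"}',
    "LIVY_SPARK_DEPENDENCIES": (
        "abfss://<your_lakehouse_filesystem_name>@onelake.dfs.fabric.microsoft.com/"
        "<your_lakehouse_name>/Files/my_package.whl"
    ),
}
example_payload = build_session_payload(example_env, now=datetime(2024, 1, 2, 3, 4, 5))
print(json.dumps(example_payload, indent=2))
```

The resulting dictionary would then be sent as the JSON body of a POST request to the sessions resource under LIVY_BASE_ENDPOINT, with a bearer token when the backend is Fabric.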

Reference

https://learn.microsoft.com/en-us/fabric/data-engineering/get-started-api-livy-session
https://github.com/apache/incubator-livy/blob/master/docs/rest-api.md
https://livy.incubator.apache.org/docs/latest/rest-api.html (not up to date, idleTimeout is not there)
