Databricks cluster access mode#
What is cluster access mode#
This is just a copy from the Azure Databricks official doc. Please ignore the Notes column in the table below: the AWS Databricks official doc has less info on access modes (there is no Notes column). Update 2023-12-18: I just realized that the Azure doc has been updated and is now the same as the AWS doc, without the Notes column. I think this is because init scripts, libraries, credential passthrough, etc. are supported now.
| Access Mode | Visible to user | UC Support | Supported Languages | Notes |
|---|---|---|---|---|
| Single User | Always | Yes | Python, SQL, Scala, R | Can be assigned to and used by a single user only. Dynamic views are not supported. Credential passthrough is not supported. |
| Shared | Always (Premium plan required) | Yes | Python, SQL | Init scripts, third-party libraries, and JARs are not supported. Credential passthrough is not supported. |
| No Isolation Shared | Hidden (Enforce User Isolation required) | No | Python, SQL, Scala, R | Admin console configuration required to enforce user isolation. |
| Custom | Hidden (For all new clusters) | No | Python, SQL, Scala, R | This option is shown only if you have existing clusters without a specified access mode. |
`Single User` mode is easy to understand: the cluster is reserved for a single user, and no other user can use it.

`Custom` mode is often seen on job clusters, for example a cluster created by a job running in a cluster pool, because when creating a cluster pool there is no option for access mode.
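If you are not sure which mode a given cluster (for example a job cluster coming from a pool) ended up with, the Clusters API exposes it. Below is a minimal sketch, assuming the workspace URL and a personal access token are available as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`, and a hypothetical cluster id; as far as I know the API field is `data_security_mode`, with values like `SINGLE_USER`, `USER_ISOLATION` (Shared) and `NONE` (No Isolation Shared).

```python
# Minimal sketch: read the access mode of an existing cluster via the Clusters API.
# Assumptions: DATABRICKS_HOST (e.g. https://adb-xxxx.azuredatabricks.net) and
# DATABRICKS_TOKEN are set in the environment; the cluster_id below is hypothetical.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": "0918-220215-abcdefgh"},  # hypothetical cluster id
    timeout=30,
)
resp.raise_for_status()
cluster = resp.json()

# data_security_mode is how the API names the access mode:
# SINGLE_USER, USER_ISOLATION (Shared), NONE (No Isolation Shared), ...
print(cluster.get("data_security_mode"))
```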
This post will talk about the `Shared` and `No Isolation Shared` access modes.

Please make sure to test on your own runtime: all the examples below were run on a cluster with Databricks Runtime 10.4 LTS (Scala 2.12, Spark 3.2.1). Much newer Databricks runtimes are available now, so the results might differ.
Shared access mode#

Running the same command `python -m site` from two different users, I got two different results.
- In a notebook from `user1`, the mapped user is `spark-6166cfd7-9154-4017-b0ff-89`:
%%sh
whoami
echo ======
which python
echo ======
python -m site
# outputs:
spark-6166cfd7-9154-4017-b0ff-89
======
/databricks/python3/bin/python
======
sys.path = [
'/home/spark-6166cfd7-9154-4017-b0ff-89',
'/databricks/spark/python',
'/databricks/spark/python/lib/py4j-0.10.9.1-src.zip',
'/databricks/jars/spark--driver--driver-spark_3.2_2.12_deploy.jar',
'/WSFS_NOTEBOOK_DIR',
'/databricks/python_shell',
'/usr/lib/python38.zip',
'/usr/lib/python3.8',
'/usr/lib/python3.8/lib-dynload',
'/databricks/python3/lib/python3.8/site-packages',
'/usr/local/lib/python3.8/dist-packages',
'/usr/lib/python3/dist-packages',
]
USER_BASE: '/home/spark-6166cfd7-9154-4017-b0ff-89/.local' (exists)
USER_SITE: '/home/spark-6166cfd7-9154-4017-b0ff-89/.local/lib/python3.8/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
- In a notebook from `user2`, the mapped user is `spark-5a9eefa7-49d3-4176-9805-1e`:
%%sh
whoami
echo ======
which python
echo ======
python -m site
# outputs:
spark-5a9eefa7-49d3-4176-9805-1e
======
/databricks/python3/bin/python
======
sys.path = [
'/home/spark-5a9eefa7-49d3-4176-9805-1e',
'/databricks/spark/python',
'/databricks/spark/python/lib/py4j-0.10.9.1-src.zip',
'/databricks/jars/spark--driver--driver-spark_3.2_2.12_deploy.jar',
'/WSFS_NOTEBOOK_DIR',
'/databricks/python_shell',
'/usr/lib/python38.zip',
'/usr/lib/python3.8',
'/usr/lib/python3.8/lib-dynload',
'/databricks/python3/lib/python3.8/site-packages',
'/usr/local/lib/python3.8/dist-packages',
'/usr/lib/python3/dist-packages',
]
USER_BASE: '/home/spark-5a9eefa7-49d3-4176-9805-1e/.local' (exists)
USER_SITE: '/home/spark-5a9eefa7-49d3-4176-9805-1e/.local/lib/python3.8/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
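The same information can also be collected from Python directly instead of a `%%sh` cell; here is a minimal sketch using only the standard library (nothing Databricks-specific):

```python
# Print the identity and Python environment of the current notebook session.
# Running this from two different users on a Shared cluster should show two
# different mapped users and home directories, like the %%sh cells above.
import getpass
import site
import sys

print("user:", getpass.getuser())
print("python:", sys.executable)
print("user base:", site.getuserbase())
print("user site:", site.getusersitepackages())
print("user site enabled:", site.ENABLE_USER_SITE)
```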
- Pip installing a third-party Python module will fail.

The example below demonstrates the statement "Init scripts, third-party libraries, and JARs are not supported" from the table above.
%%sh
pip install requests==2.26.0
# The same error occurs with `python -m pip install requests==2.26.0 --user`,
# except that the first line "Defaulting to user installation because normal site-packages is not writeable" is absent.
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://[REDACTED]:****@[REDACTED]/_packaging/[REDACTED]/pypi/simple/
Collecting requests==2.26.0
Downloading https://[REDACTED]/_packaging/daa86ee5-06b8-417b-bc88-e64e3e2eef29/pypi/download/requests/2.26/requests-2.26.0-py2.py3-none-any.whl (62 kB)
Requirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.8/site-packages (from requests==2.26.0) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.8/site-packages (from requests==2.26.0) (1.25.11)
Requirement already satisfied: idna<4,>=2.5 in /databricks/python3/lib/python3.8/site-packages (from requests==2.26.0) (2.10)
Collecting charset-normalizer~=2.0.0
Downloading https://[REDACTED]/_packaging/daa86ee5-06b8-417b-bc88-e64e3e2eef29/pypi/download/charset-normalizer/2.0.12/charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
ERROR: Will not install to the user site because it will lack sys.path precedence to requests in /databricks/python3/lib/python3.8/site-packages
WARNING: You are using pip version 21.0.1; however, version 22.2.2 is available.
You should consider upgrading via the '/databricks/python3/bin/python -m pip install --upgrade pip' command.
CalledProcessError: Command 'b'pip install requests==2.26.0\n'' returned non-zero exit status 1.
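The two messages in the output explain the failure: the default site-packages under `/databricks/python3` is not writable for the mapped `spark-...` user, and pip then refuses the `--user` fallback because the user site directory would come after the read-only site-packages on `sys.path`, so the newly installed version could never take precedence. A small diagnostic sketch of that situation (standard library only; the way it picks the first `site-packages` entry is just an assumption for illustration):

```python
# Show why pip can neither write to the default site-packages nor fall back
# to a --user install on this Shared cluster.
import os
import site
import sys

site_packages = next(p for p in sys.path if p.endswith("site-packages"))
user_site = site.getusersitepackages()

print("default site-packages:", site_packages,
      "writable:", os.access(site_packages, os.W_OK))
print("user site:", user_site)

# pip's complaint: the user site comes after the read-only site-packages on
# sys.path (or is not on it at all), so a package installed there never wins.
if user_site in sys.path:
    print("user site has precedence:",
          sys.path.index(user_site) < sys.path.index(site_packages))
else:
    print("user site is not on sys.path at all")
```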
No Isolation Shared access mode#

Warning

Update 2023-02-01: I retested the `No Isolation Shared` access mode today, and it seems something has changed on the Databricks side. Here is the new behavior:

- The user is still `root`, but the Python binary is no longer the system one; an isolated venv is used instead, and `pip install` goes into that venv too.
- For the same user, the venv path changes each time we re-attach to the cluster, so previous `pip install`s are discarded.
%%sh
whoami
echo ======
which python
echo ======
python -m site
# outputs:
root
======
/local_disk0/.ephemeral_nfs/envs/pythonEnv-76eac499-b8f2-451c-ac6a-88f9a68fcae7/bin/python
======
sys.path = [
'/databricks/driver',
'/databricks/spark/python',
'/databricks/spark/python/lib/py4j-0.10.9.5-src.zip',
'/databricks/jars/spark--driver--driver-spark_3.3_2.12_deploy.jar',
'/WSFS_NOTEBOOK_DIR',
'/databricks/jars/spark--maven-trees--ml--11.x--graphframes--org.graphframes--graphframes_2.12--org.graphframes__graphframes_2.12__0.8.2-db1-spark3.2.jar',
'/databricks/python_shell',
'/usr/lib/python39.zip',
'/usr/lib/python3.9',
'/usr/lib/python3.9/lib-dynload',
'/local_disk0/.ephemeral_nfs/envs/pythonEnv-76eac499-b8f2-451c-ac6a-88f9a68fcae7/lib/python3.9/site-packages',
'/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages',
'/databricks/python/lib/python3.9/site-packages',
'/usr/local/lib/python3.9/dist-packages',
'/usr/lib/python3/dist-packages',
'/databricks/.python_edge_libs',
]
USER_BASE: '/root/.local' (exists)
USER_SITE: '/root/.local/lib/python3.9/site-packages' (doesn't exist)
ENABLE_USER_SITE: False
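A quick way to confirm from Python that the interpreter now lives in a virtualenv, and to see which one, is to compare `sys.prefix` with `sys.base_prefix`; a small sketch:

```python
# Check whether the notebook's Python runs inside a virtualenv and where it lives.
# On the retested No Isolation Shared cluster, sys.prefix pointed at a
# /local_disk0/.ephemeral_nfs/envs/pythonEnv-... directory and changed after
# every re-attach, which is why earlier pip installs disappear.
import sys

in_venv = sys.prefix != sys.base_prefix
print("running in a venv:", in_venv)
print("venv prefix:", sys.prefix)
print("base interpreter prefix:", sys.base_prefix)
```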
Below is the test result from 2022-09-20:

In contrast to `Shared` mode, in `No Isolation Shared` mode I got the same results from two different users running the same commands. We can see that all users are logged in as the `root` account.
%%sh
whoami
echo ======
which python
echo ======
python -m site
# outputs:
root
======
/databricks/python3/bin/python
======
sys.path = [
'/databricks/driver',
'/databricks/spark/python',
'/databricks/spark/python/lib/py4j-0.10.9-src.zip',
'/databricks/jars/spark--driver--driver-spark_3.1_2.12_deploy.jar',
'/WSFS_NOTEBOOK_DIR',
'/databricks/python_shell',
'/usr/lib/python38.zip',
'/usr/lib/python3.8',
'/usr/lib/python3.8/lib-dynload',
'/databricks/python3/lib/python3.8/site-packages',
'/usr/local/lib/python3.8/dist-packages',
'/usr/lib/python3/dist-packages',
]
USER_BASE: '/root/.local' (exists)
USER_SITE: '/root/.local/lib/python3.8/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
- Pip installing a third-party Python module will succeed.
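Under this 2022 behavior (before the ephemeral venv change described in the warning above), every user shares the `root` account and the same Python environment, so a package installed by one user is immediately importable by another user attached to the same cluster. A hedged sketch; the package name and version are only examples:

```python
# Notebook of user A: install a package into the shared environment.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install", "requests==2.26.0"],
    check=True,
)

# Notebook of user B, attached to the same cluster: no install needed,
# the package is already there because the environment is shared.
import requests
print(requests.__version__)
```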
Conclusion#
- `Shared` access mode maps different users to different user spaces; their environments are isolated, but they cannot install any additional packages or modules.
- `No Isolation Shared` access mode maps all users to the `root` account; everything is shared, and anyone can install anything, but the changes apply to all users. After a cluster restart, all the additional installations are purged, so one project per cluster may be a reasonable choice.
- Another good option is to use a non-interactive job cluster with a cluster pool: the cluster pool is shared, but any user can install anything (this can be limited by cluster policy), and the installation is isolated at the job level. Even if two jobs are created by the same user, they will use different environments (VMs get the Databricks runtime container re-deployed in the cluster pool after each job run). A rough sketch of this option is given after this list.
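As an illustration of that last option, here is a hedged sketch of a Jobs API 2.1 payload (as a Python dict) that runs a notebook task on a new job cluster drawn from a cluster pool, with a job-scoped library; all ids, names, paths and versions are hypothetical:

```python
# Hypothetical Jobs API 2.1 payload: a job cluster created from a pool,
# with a library installed only for this job's cluster.
job_settings = {
    "name": "my-isolated-job",                           # hypothetical job name
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {
                "notebook_path": "/Repos/project/main",  # hypothetical path
            },
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "instance_pool_id": "pool-1234567890",   # hypothetical pool id
                "num_workers": 2,
            },
            # Installed on the job cluster only; other jobs sharing the pool
            # get fresh VMs with their own libraries.
            "libraries": [{"pypi": {"package": "requests==2.26.0"}}],
        }
    ],
}

# It could then be submitted with something like:
# requests.post(f"{host}/api/2.1/jobs/create", headers=..., json=job_settings)
```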