python

February 1, 2025
in python, spark, database
2 min read

PySpark database connectors

General

Use spark.jars to add local ODBC/JDBC drivers to PySpark, and use spark.jars.packages to add remote ODBC/JDBC drivers. In the latter case, PySpark downloads the packages from the Maven repository.
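As a minimal sketch of the two settings, the snippet below builds the configuration entries and, in a separate function, passes them to a SparkSession builder. The helper names, the jar path, and the Maven coordinate are illustrative assumptions, not taken from the post. Check Maven Central for the driver version you actually need.

```python
from typing import Dict, Sequence


def build_driver_conf(
    local_jars: Sequence[str] = (),
    maven_packages: Sequence[str] = (),
) -> Dict[str, str]:
    """Build Spark conf entries for driver jars.

    Both settings expect a comma-separated list:
    spark.jars takes jar paths, spark.jars.packages takes
    Maven coordinates in the form groupId:artifactId:version.
    """
    conf: Dict[str, str] = {}
    if local_jars:
        conf["spark.jars"] = ",".join(local_jars)
    if maven_packages:
        conf["spark.jars.packages"] = ",".join(maven_packages)
    return conf


def create_session(conf: Dict[str, str]):
    """Create a SparkSession with the given conf (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName("db-connectors-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = build_driver_conf(
    local_jars=["/opt/drivers/example-jdbc.jar"],  # hypothetical local path
    maven_packages=["org.postgresql:postgresql:42.7.3"],  # example coordinate
)
print(conf)
```

Calling create_session(conf) would then start a session with both drivers on the classpath, assuming pyspark is installed and the jar path exists.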

For spark-shell: https://docs.snowflake.com/en/user-guide/spark-connector-install#installing-additional-packages-if-needed

February 1, 2025
in python, linter
17 min read

Python is a dynamically typed language, meaning variable types don't require explicit declaration. However, as projects grow in complexity, type annotations become increasingly valuable for code maintainability and clarity.

Type hints (PEP 484) have been a major focus of recent Python releases, and I was particularly intrigued when I heard about Guido van Rossum's work on MyPy at Dropbox, where the team needed robust tooling to migrate their codebase from Python 2 to Python 3.

Today, type hints are essential for modern Python development. They significantly enhance IDE capabilities and AI-powered development tools by providing better code completion, static analysis, and error detection. This mirrors the evolution we've seen with TypeScript's adoption over traditional JavaScript: explicit typing leads to more reliable and maintainable code.
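To make the benefit concrete, here is a small illustrative example (the User class and display_name function are made up for this sketch). With the annotations in place, an editor can complete the attributes of user, and a checker such as mypy can reject a wrong call before the code runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    email: Optional[str] = None


def display_name(user: User) -> str:
    """Return the name, with the email appended when one is set."""
    if user.email is None:
        return user.name
    return f"{user.name} <{user.email}>"


print(display_name(User("Ada", "ada@example.com")))

# A static type checker would flag the call below, because an int is
# passed where a User is expected. At runtime it would only fail later,
# with an AttributeError inside the function.
# display_name(42)
```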

Typed Python vs data science projects

Type hints are not very popular in data science projects, for a number of reasons that we won't discuss here.

November 22, 2024
in python, azure, databricks, spark, pandas
7 min read

Databricks Connect

Databricks Connect allows you to connect your favorite IDE (PyCharm, VSCode, etc.) and other custom applications to Databricks compute and run Spark (or non-Spark) code.

This post is not a comprehensive guide on Databricks Connect; rather, it consists of side notes from the Azure Databricks docs. Most of the notes also apply to Databricks on AWS and GCP.

November 14, 2024
in python, async, azure, auth, certificate
1 min read

Generating Azure OAuth2 Access Token By Python

There are two modern ways to generate an Azure OAuth2 access token using Python: one is by using the MSAL library, and the other is by using the Azure Identity library, which is based on the former.

There are also other ways to get the token, such as using the requests or aiohttp libraries to send a POST request to the Azure OAuth2 token endpoint, but this is not recommended. MSAL and Azure Identity are the official libraries provided by Microsoft, so they are more secure and easier to use. For example, they handle token caching, token refreshing, and token expiration automatically. Furthermore, some of the credential types are difficult to implement with raw requests or aiohttp because they require too much code.
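As a rough sketch of both approaches for the client-secret (service principal) flow, the functions below wrap the corresponding library calls. The function names and the way the authority URL is built are choices made for this example. The libraries are imported inside the functions, so the snippet loads even when only one of them is installed.

```python
from typing import List

AUTHORITY_HOST = "https://login.microsoftonline.com"


def authority_url(tenant_id: str) -> str:
    """Return the Microsoft Entra ID authority URL for a tenant."""
    return f"{AUTHORITY_HOST}/{tenant_id}"


def token_with_msal(
    tenant_id: str, client_id: str, client_secret: str, scopes: List[str]
) -> str:
    """Get an access token with MSAL (pip install msal)."""
    import msal

    app = msal.ConfidentialClientApplication(
        client_id,
        authority=authority_url(tenant_id),
        client_credential=client_secret,
    )
    # MSAL returns a dict: "access_token" on success, error fields otherwise.
    result = app.acquire_token_for_client(scopes=scopes)
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "token request failed"))
    return result["access_token"]


def token_with_azure_identity(
    tenant_id: str, client_id: str, client_secret: str, scope: str
) -> str:
    """Get an access token with Azure Identity (pip install azure-identity)."""
    from azure.identity import ClientSecretCredential

    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
    # get_token returns an AccessToken; the bearer string is in .token.
    return credential.get_token(scope).token
```

A typical scope for this flow ends in /.default, for example "https://graph.microsoft.com/.default". Other credential types (certificates, managed identity, interactive login) use different classes and are not covered by this sketch.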

November 10, 2024
in python
2 min read

Python adding version info to docstrings

While reading PySpark's source code, I found a nice way to add version information to docstrings: a @since() decorator. Here is an example:

October 19, 2024
in python, multithreading
3 min read

Python thread safe operations

Quick example from the official Python documentation about thread safety in Python:

Thread safe operations
L.append(x)
L1.extend(L2)
x = L[i]
x = L.pop()
L1[i:j] = L2
L.sort()
x = y
x.field = y
D[x] = y
D1.update(D2)
D.keys()

Not thread safe operations
i = i+1
L.append(L[-1])
L[i] = L[j]
D[x] = D[x] + 1

It's important to understand that, because of the Global Interpreter Lock (GIL), Python can only switch threads between bytecode instructions. How often it considers switching can be adjusted using sys.setswitchinterval(). Since a switch never happens inside a single bytecode instruction, an operation carried out by one instruction is atomic (thread-safe), whereas the read-modify-write operations in the second list span several instructions and can be interrupted midway. For a deeper dive into this topic, you can read this discussion on atomic and thread-safe operations in Python.
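For operations of the D[x] = D[x] + 1 kind, the usual remedy is to guard the read-modify-write with a lock. The following is a minimal sketch; the counter dictionary, the thread count, and the iteration count are arbitrary choices for the example.

```python
import threading

counter = {"hits": 0}
lock = threading.Lock()


def record_hits(n: int) -> None:
    """Increment the shared counter n times, one locked update at a time."""
    for _ in range(n):
        # Read, add, and write back happen as one unit while the lock is held.
        with lock:
            counter["hits"] = counter["hits"] + 1


threads = [threading.Thread(target=record_hits, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["hits"])  # 40000
```

Without the lock, two threads could read the same value and both write back value + 1, losing an update.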

September 11, 2024
in python
1 min read

Generating .env file

During local testing, we often need to set environment variables. One way to do this is to create a .env file in the root directory of the project. This file contains key-value pairs of environment variables. For example, a .env file might look like this:

ENV=dev
SECRET=xxx

Below is a quick bash script that generates a .env file from a list of Azure Key Vault secrets. The same logic can be applied to other secret managers.
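The same logic can also be sketched in Python. This is an illustration, not the bash script from the post: fetch_secret is a hypothetical callable standing in for whatever client reads a secret from your secret manager, and converting names such as db-password to DB_PASSWORD is an assumed naming convention.

```python
from pathlib import Path
from typing import Callable, Iterable


def render_env(secret_names: Iterable[str], fetch_secret: Callable[[str], str]) -> str:
    """Return .env content with one KEY=value line per secret."""
    lines = []
    for name in secret_names:
        # Assumed convention: dashes become underscores, keys are upper case.
        key = name.replace("-", "_").upper()
        lines.append(f"{key}={fetch_secret(name)}")
    return "\n".join(lines) + "\n"


def write_env_file(
    path: str, secret_names: Iterable[str], fetch_secret: Callable[[str], str]
) -> None:
    """Write the rendered content to the given .env path."""
    Path(path).write_text(render_env(secret_names, fetch_secret), encoding="utf-8")


# Stand-in for a real secret manager, used only to demonstrate the output.
fake_vault = {"env": "dev", "secret": "xxx"}
print(render_env(["env", "secret"], fake_vault.__getitem__), end="")
```

Values containing spaces, quotes, or newlines would need quoting or escaping, which this sketch does not handle.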

August 24, 2024
in python, sqlalchemy
1 min read

Generating ERD from sqlalchemy

This post tests some popular Python tools (sqlalchemy_data_model_visualizer, sqlalchemy_schemadisplay, and eralchemy2) for generating an ERD (entity-relationship diagram) from SQLAlchemy models.

The test code can be found in this GitHub repo.

June 12, 2024
in python, debug
1 min read

Profiling Python code

Name	Scope	Web framework middleware	VSCode extension
scalene	cpu, gpu, memory, duration	partially	yes
cProfile (Python native, function level only and cli only)	duration	no	no
VizTracer	duration	unknown	yes
profyle (based on VizTracer)	duration	yes	no
pyinstrument	duration	yes	no
py-spy	duration	no	no
yappi (cli only)	duration	unknown	no
austin	duration	unknown	yes
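As a point of reference for the "duration" scope, the sketch below collects function-level timings with the standard-library cProfile and pstats modules and returns the report as text. The slow_sum workload and the profile_call helper are made up for the example.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately simple workload to have something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args) -> str:
    """Run func under cProfile and return the top entries as a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_call(slow_sum, 100_000)
print(report)
```

The report lists call counts and per-function times only. Line-level or memory detail needs one of the other tools in the table.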

Interesting reading:

June 4, 2024
in python, async, databricks
3 min read

Running asyncio task in Databricks

The standard way to run an asyncio task is as simple as asyncio.run(main()). In Databricks, however, it is not that simple. With the same command, you will get the following error:

import asyncio
async def main():
    await asyncio.sleep(1)
asyncio.run(main())

RuntimeError: asyncio.run() cannot be called from a running event loop

Indeed, in Databricks, we are already inside a running event loop:

import asyncio
asyncio.get_running_loop()

<_UnixSelectorEventLoop running=True closed=False debug=False>
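One possible workaround, sketched here with the standard library only, is to detect the running loop and, when there is one, run the coroutine on a fresh event loop in a separate thread. The run_coroutine helper is a name chosen for this example and is not necessarily the approach the post takes. In notebook environments built on IPython, a top-level await main() is another option.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_coroutine(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion, whether or not a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: the standard entry point works.
        return asyncio.run(coro)

    # A loop is already running here, so use a new loop in a worker thread.
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # blocks the caller until the coroutine finishes

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def main() -> str:
    await asyncio.sleep(0.1)
    return "done"


result = run_coroutine(main())
print(result)
```

Because join() blocks the calling thread, the already running loop makes no progress while the coroutine executes. That is acceptable for a notebook cell, but not inside a long-lived async application.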