Process and port usage in Linux
Use `lsof`, `fuser`, `ss`, `pgrep`, `pstree`, `ps`, `htop`, etc. to find process and port usage in Linux.
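For example, to find which process is listening on a given port (8080 below is just a placeholder), any of these commands works; the `|| true` guards only keep the snippet from failing when nothing is bound to the port:

```shell
# Show the process listening on TCP port 8080 (ss ships with iproute2)
ss -tlnp | grep ':8080' || true

# Same with lsof (may need sudo to see other users' processes)
lsof -i :8080 || true

# fuser prints the PID(s) bound to the port
fuser 8080/tcp || true
```

`ss` is the modern replacement for `netstat` and is the most likely to be preinstalled.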
Since Python 3.9, most of the generic types in the `typing` module (e.g. `typing.List`, `typing.Dict`) are deprecated; the built-in types (`list`, `dict`) and the `collections.abc` module are recommended instead. Some types, like `typing.Any`, `typing.Generic`, `typing.TypeVar`, etc., are still not deprecated.
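For instance, annotations that used to require `typing.List` and `typing.Dict` can now use the built-in types directly (the function below is just for illustration):

```python
from collections.abc import Iterable  # replaces typing.Iterable since 3.9


def count_words(lines: Iterable[str]) -> dict[str, int]:
    """Count word occurrences; dict[str, int] replaces typing.Dict[str, int]."""
    counts: dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words(["spark spark sql"]))  # {'spark': 2, 'sql': 1}
```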
Sometimes we need to download online videos, such as a Teams recording. Here's a tip from StackOverflow using ffmpeg. Be sure to check the comments for solutions to errors like "Error opening input: Invalid data found when processing input.". Another solution is to use kylon/Sharedown, which is much faster.
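The core of the tip is a single ffmpeg invocation; the manifest URL below is a placeholder that you would replace with the real `.m3u8` link copied from your browser's developer tools:

```shell
# Hypothetical manifest URL -- grab the real one from the browser's network tab
MANIFEST_URL="https://example.com/recording/manifest.m3u8"

# -codec copy remuxes the streams without re-encoding, so it is fast and lossless
ffmpeg -i "$MANIFEST_URL" -codec copy recording.mp4
```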
Databricks Connect allows you to connect your favorite IDE (PyCharm, VSCode, etc.) and other custom applications to Databricks compute and run Spark (or non-Spark) code.
This post is not a comprehensive guide on Databricks Connect; rather, it consists of side notes from the Azure Databricks docs. Most of the notes also apply to Databricks on AWS and GCP.
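As a minimal sketch (assuming `databricks-connect` is installed and a host, token, and cluster are configured via `~/.databrickscfg` or `DATABRICKS_*` environment variables), connecting looks like this:

```python
from databricks.connect import DatabricksSession

# Picks up host/token/cluster_id from the DEFAULT profile or DATABRICKS_* env vars
spark = DatabricksSession.builder.getOrCreate()

# "samples.nyctaxi.trips" is a sample table; replace with one in your workspace
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```

The returned `spark` object is a Spark Connect session, so most regular PySpark code runs unchanged.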
There are two modern ways to generate an Azure OAuth2 access token using Python: one is by using the MSAL library, and the other is by using the Azure Identity library, which is based on the former.
There are also other ways to get the token, like using the `requests` or `aiohttp` libraries to send a POST request to the Azure OAuth2 token endpoint, but this is not recommended. The MSAL and Azure Identity libraries are the official libraries provided by Microsoft, so they are more secure and easier to use. For example, they handle token caching, token refreshing, and token expiration automatically. Furthermore, some of the credential types are difficult (requiring too much code) to implement with raw `requests` or `aiohttp`.
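A minimal sketch with the Azure Identity library (the scope shown is for Azure Resource Manager; swap in the one for your target resource):

```python
from azure.identity import DefaultAzureCredential

# Tries environment variables, managed identity, Azure CLI login, etc., in order
credential = DefaultAzureCredential()

# The ".default" suffix requests all statically granted permissions for the resource
token = credential.get_token("https://management.azure.com/.default")
print(token.expires_on)  # Unix timestamp when the cached token expires
```

`DefaultAzureCredential` caches and refreshes the token transparently, which is exactly the boilerplate a raw POST request would force you to write yourself.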
While reading PySpark's source code, I found a nice way it adds version information to docstrings via a `@since()` decorator. Here is an example:
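A simplified sketch of the idea (not PySpark's exact implementation, which also preserves docstring indentation): the decorator appends a Sphinx `.. versionadded::` directive to the wrapped function's docstring:

```python
def since(version):
    """Return a decorator that appends a versionadded note to the docstring."""
    def deco(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + f"\n\n.. versionadded:: {version}"
        return f
    return deco


@since("1.3")
def explode(col):
    """Return a new row for each element in the given array or map."""


print(explode.__doc__)  # original docstring plus ".. versionadded:: 1.3"
```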
Quick example from the official Python documentation about thread safety in Python:
| Thread safe operations | Operations that aren't thread safe |
|---|---|
| `L.append(x)` | `i = i + 1` |
| `L1.extend(L2)` | `L.append(L[-1])` |
| `x = L[i]` | `L[i] = L[j]` |
| `x = L.pop()` | `D[x] = D[x] + 1` |
| `L1[i:j] = L2` | |
| `L.sort()` | |
| `x = y` | |
| `x.field = y` | |
| `D[x] = y` | |
| `D1.update(D2)` | |
| `D.keys()` | |
It's important to understand that Python, due to its Global Interpreter Lock (GIL), can only switch between threads between bytecode instructions. The frequency of these switches can be adjusted using sys.setswitchinterval(). This ensures that within a single bytecode instruction, Python will not switch threads, making the operation atomic (thread-safe). For a deeper dive into this topic, you can read this discussion on atomic and thread-safe operations in Python.
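A quick way to see this in practice: `list.append` executes as a single atomic operation, so concurrent appends never lose an element, even when thread switches are forced to happen very frequently:

```python
import sys
import threading

sys.setswitchinterval(0.0001)  # force very frequent thread switches

results = []

def worker():
    for _ in range(10_000):
        results.append(1)  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 40000 -- no appends are lost
```

A read-modify-write like `n += 1` on a shared counter is the classic counterexample: it spans several bytecode instructions, so a switch can land in the middle and lose an update.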
During local testing, we often need to set environment variables. One way to do this is to create a `.env` file in the root directory of the project. This file contains key-value pairs of environment variables. For example, a `.env` file might look like this:
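For example (the names and values below are dummies):

```shell
# .env -- one KEY=VALUE per line; never commit real secrets to git
DATABASE_URL=postgresql://localhost:5432/mydb
STORAGE_ACCOUNT_KEY=dummy-key-for-local-testing
DEBUG=true
```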
Below is a quick bash script to generate a `.env` file from a list of Azure Key Vault secrets; the same logic can be applied to other secret managers.
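A sketch of the idea (the vault name is hypothetical; it requires the Azure CLI and an active `az login` session):

```shell
#!/usr/bin/env bash
set -euo pipefail

VAULT_NAME="my-keyvault"  # hypothetical vault name

: > .env  # truncate/create the .env file
for name in $(az keyvault secret list --vault-name "$VAULT_NAME" --query "[].name" -o tsv); do
    value=$(az keyvault secret show --vault-name "$VAULT_NAME" --name "$name" --query "value" -o tsv)
    # Key Vault secret names use dashes; env vars conventionally use UPPER_SNAKE_CASE
    key=$(echo "$name" | tr '[:lower:]-' '[:upper:]_')
    echo "${key}=${value}" >> .env
done
```

A secret named `db-password` would land in the file as `DB_PASSWORD=...`.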
This post tests some popular Python tools (`sqlalchemy_data_model_visualizer`, `sqlalchemy_schemadisplay`, and `eralchemy2`) to generate an ERD (Entity-Relationship Diagram) from SQLAlchemy models.
The test code can be found in this GitHub repo.
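Of the three, eralchemy2 has the simplest API; a sketch with two illustrative models (it needs Graphviz installed to render):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from eralchemy2 import render_er


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "user"
    id: Mapped[int] = mapped_column(primary_key=True)


class Post(Base):
    __tablename__ = "post"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("user.id"))


# Renders the diagram from the models' metadata; output format follows the extension
render_er(Base.metadata, "erd.png")
```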
The `ForEach` activity in Azure Data Factory has some important limitations. One of them: when working in `batch` (parallel) mode, it is advisable to embed only Execute Pipeline activities inside it.