azure

Databricks Connect

Databricks Connect allows you to connect your favorite IDE (PyCharm, VSCode, etc.) and other custom applications to Databricks compute and run Spark (or non-Spark) code.

This post is not a comprehensive guide on Databricks Connect; rather, it consists of side notes from the Azure Databricks docs. Most of the notes also apply to Databricks on AWS and GCP.

Generating Azure OAuth2 Access Token By Python

There are two modern ways to generate an Azure OAuth2 access token using Python: one is by using the MSAL library, and the other is by using the Azure Identity library, which is based on the former.

There are also other ways to get the token, such as sending a POST request to the Azure OAuth2 token endpoint with the requests or aiohttp libraries, but this is not recommended. MSAL and Azure Identity are the official libraries provided by Microsoft, so they are more secure and easier to use. For example, they handle token caching, token refreshing, and token expiration automatically. Furthermore, some credential types are difficult to implement (they require too much code) with raw requests or aiohttp.
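To make the trade-off concrete, here is a minimal sketch of what the raw approach has to take care of for the simplest credential type (client credentials): building the token request and caching the token until it is about to expire. It is only an illustration of the work MSAL and Azure Identity do for you, not a recommended implementation. The names `build_token_request` and `TokenCache`, the 5-minute refresh margin, and the injected `fetch_token` callable are all assumptions made for this example; nothing is actually sent over the network.

```python
import time
from urllib.parse import urlencode


def build_token_request(tenant_id, client_id, client_secret, scope):
    """Return (url, form_body) for an OAuth2 client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })
    return url, body


class TokenCache:
    """Hypothetical in-memory cache: reuse a token until shortly before it expires."""

    def __init__(self, fetch_token, skew_seconds=300, clock=time.time):
        # fetch_token is any callable returning a dict with
        # "access_token" and "expires_in" (seconds), e.g. the parsed
        # JSON response of the POST request built above.
        self._fetch_token = fetch_token
        self._skew = skew_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh when there is no token yet, or when the current one
        # is within skew_seconds of its expiry time.
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            self._expires_at = self._clock() + int(payload["expires_in"])
        return self._token
```

Even this small sketch ignores error handling, retries, and every other credential type, which is a good reason to prefer the official libraries.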

Azure Messaging Service

This post is based on the official Azure documentation (Asynchronous messaging options, Compare Azure messaging services, Enterprise integration using message broker and events, Azure Well-Architected Framework) and summarizes the differences between, and use cases for, the Azure messaging services, including Service Bus, Event Grid, and Event Hubs. The official documentation is very good and comprehensive; this post is a quick reminder for my personal reference.

Getting all users from MS Graph API in few seconds

MS Graph API's endpoint for retrieving users, GET /users, can return all users of the tenant. The default limit is 100 users per page, and the maximum limit is 999 users per page. If there are more users than fit on one page, the response contains an @odata.nextLink field, which is a URL to the next page of users. For a big company with a large number of users (50,000, 100,000, or even more), retrieving all users can be time-consuming.

Because MS Graph API provides generous throttling limits, we can look for a way to parallelize the queries. This post explores sharding as a strategy to retrieve all users in a matter of seconds. The idea is to divide the users based on the first character of the userPrincipalName field. For instance, shard 1 would encompass users whose userPrincipalName starts with a, shard 2 would handle users starting with b, and so forth.
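The sharding idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the post's actual implementation: the shard set (a-z plus 0-9), the worker count, and the helper names are invented for the example, and `fetch_page` is a hypothetical callable that performs an authenticated GET and returns the parsed JSON page. User principal names starting with any other character would need extra shards.

```python
import string
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

GRAPH_USERS_URL = "https://graph.microsoft.com/v1.0/users"


def build_shard_urls(prefixes=string.ascii_lowercase + string.digits, page_size=999):
    """Build one first-page URL per shard, keyed on the first character of userPrincipalName."""
    urls = []
    for prefix in prefixes:
        filter_expr = quote(f"startswith(userPrincipalName,'{prefix}')")
        urls.append(f"{GRAPH_USERS_URL}?$filter={filter_expr}&$top={page_size}")
    return urls


def fetch_shard(first_url, fetch_page):
    """Follow @odata.nextLink until the shard is exhausted."""
    users, url = [], first_url
    while url:
        page = fetch_page(url)
        users.extend(page.get("value", []))
        url = page.get("@odata.nextLink")
    return users


def fetch_all_users(fetch_page, max_workers=16):
    """Fetch every shard in parallel and flatten the results in shard order."""
    urls = build_shard_urls()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        shards = pool.map(lambda url: fetch_shard(url, fetch_page), urls)
        return [user for shard in shards for user in shard]
```

Pages within a shard are still fetched sequentially, because each @odata.nextLink is only known after the previous page arrives; the speed-up comes from running the shards side by side.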

Some nice cicd bash common scripts

During CI/CD, we often have large log output, so it is nice to have some common scripts that format the log output and make it easy to find the information we need.

Recently, when working with Sonar, I found that they have some scripts for such output formatting.

Hashing files

During CI/CD processes, and particularly during CI, we frequently hash dependency files to create cache keys (referred to as the key input in the GitHub Actions actions/cache action and the key parameter in the Azure Pipelines Cache@2 task). However, the default hash functions come with certain limitations, such as the one described in this comment. To address this, we can use the following pure Bash shell command to manually generate the hash value.
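The Bash command itself is not part of this excerpt. As a rough illustration of the same idea in Python, the sketch below computes one deterministic digest over a set of dependency files. It is not the post's command: the function name `hash_files`, the choice of SHA-256, and the way paths and contents are combined are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def hash_files(root, patterns):
    """Return one SHA-256 hex digest covering every file matching the glob patterns."""
    root = Path(root)
    # Deduplicate and sort so the digest does not depend on glob order.
    files = sorted({p for pattern in patterns for p in root.glob(pattern) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        # Include the relative path so renaming a file changes the key.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

For example, `hash_files(".", ["**/requirements*.txt"])` would give a value that could be embedded in a cache key and that changes whenever one of the matched files is added, renamed, or edited.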

Github Actions: copdips/get-azure-keyvault-secrets-action

Recently, I began a new project that requires migrating some processes from Azure Pipelines to GitHub Actions. One of the tasks involves retrieving secrets from Azure Key Vault.

In Azure Pipelines, we have an official task called AzureKeyVault@2 designed for this purpose. However, its official counterpart in GitHub Actions, Azure/get-keyvault-secrets@v1, has been deprecated. The recommended alternative is Azure CLI. While Azure CLI is a suitable option, it operates in a bash shell without multithreading. If numerous secrets need to be fetched, this can be time-consuming.
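To show why concurrency helps here, the sketch below fetches several secrets in parallel with a thread pool. It is an illustration only and not how copdips/get-azure-keyvault-secrets-action is implemented: the helper names and the worker count are assumptions, and the default fetcher simply shells out to `az keyvault secret show`, which requires an already logged-in Azure CLI.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def az_get_secret(vault_name, secret_name):
    """Fetch one secret value by calling the Azure CLI (assumes `az login` was done)."""
    result = subprocess.run(
        ["az", "keyvault", "secret", "show",
         "--vault-name", vault_name, "--name", secret_name,
         "--query", "value", "--output", "tsv"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.rstrip("\n")


def get_secrets(vault_name, secret_names, get_secret=az_get_secret, max_workers=8):
    """Fetch all secrets concurrently and return a {name: value} dict."""
    names = list(secret_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call mostly waits on the network, so threads overlap well.
        values = pool.map(lambda name: get_secret(vault_name, name), names)
        return dict(zip(names, values))
```

The `get_secret` parameter is injectable so that the same function can be exercised with a stub, or swapped for an SDK-based fetcher, without touching the threading logic.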