Databricks Connect allows you to connect your favorite IDE (PyCharm, VS Code, etc.) and other custom applications to Databricks compute and run Spark (or non-Spark) code.
This post is not a comprehensive guide to Databricks Connect; rather, it consists of side notes from the Azure Databricks docs. Most of the notes also apply to Databricks on AWS and GCP. Continue reading
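To give a flavor of what this looks like, here is a minimal sketch using the databricks-connect Python package; the host, token, and cluster ID below are placeholders, not real values:

```python
from databricks.connect import DatabricksSession

# Connect to a remote Databricks cluster. The host, token and cluster_id
# shown here are placeholders -- in practice they usually come from a
# ~/.databrickscfg profile or environment variables.
spark = DatabricksSession.builder.remote(
    host="https://adb-1234567890123456.7.azuredatabricks.net",
    token="dapi...",
    cluster_id="0123-456789-abcdefgh",
).getOrCreate()

# From here on, ordinary PySpark code runs on Databricks compute.
spark.range(5).show()
```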
There are also other ways to get the token, such as using the requests or aiohttp library to send a POST request to the Azure OAuth2 token endpoint, but this is not recommended. Since MSAL and Azure Identity are the official libraries provided by Microsoft, they are more secure and easier to use; for example, they handle token caching, token refreshing, and token expiration automatically. Furthermore, some of the credential types would take too much code to implement with raw requests or aiohttp. Continue reading
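To make the comparison concrete, here is a minimal sketch of the client-credentials flow with both official libraries; the tenant ID, client ID, secret, and scope are placeholders:

```python
import msal
from azure.identity import ClientSecretCredential

TENANT_ID = "<tenant-id>"          # placeholders -- substitute your own
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"
SCOPE = "https://graph.microsoft.com/.default"

# Azure Identity: caching and refreshing happen behind get_token().
credential = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
access_token = credential.get_token(SCOPE).token

# MSAL: acquire_token_for_client() also caches and refreshes for you.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=[SCOPE])
access_token = result["access_token"]
```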
The ForEach activity in Azure Data Factory has some important limitations. One of them is that, when working in batch (parallel) mode, it is advisable to embed only Execute Pipeline activities inside it. Continue reading
MS Graph API's endpoint for retrieving users, GET /users, can return all users of the tenant. The default limit is 100 users per page, and the maximum limit is 999 users per page. If there are more than 999 users, the response will contain an @odata.nextLink field, which is the URL of the next page of users. For a big company with a large number of users (50,000, 100,000, or even more), retrieving all of them page by page can be time-consuming.
While MS Graph API provides generous throttling limits, we should find a way to parallelize the queries. This post explores sharding as a strategy to retrieve all users in a matter of seconds. The idea is to divide users based on the first character of the userPrincipalName field. For instance, shard 1 would encompass users whose userPrincipalName starts with a, shard 2 would handle users starting with b, and so forth. Continue reading
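Here is one possible implementation of the idea, a minimal sketch using requests and a thread pool (the full post may implement it differently); the token and shard alphabet are assumptions:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

GRAPH = "https://graph.microsoft.com/v1.0/users"
TOKEN = "<access-token>"  # placeholder: an app-only Graph token

def fetch_shard(first_char: str) -> list[dict]:
    """Fetch every user whose userPrincipalName starts with first_char,
    following @odata.nextLink until the shard is exhausted."""
    users = []
    url = f"{GRAPH}?$filter=startswith(userPrincipalName,'{first_char}')&$top=999"
    headers = {"Authorization": f"Bearer {TOKEN}"}
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        users.extend(body["value"])
        url = body.get("@odata.nextLink")  # None on the last page
    return users

# One shard per leading character; extend the alphabet if your tenant
# has UPNs starting with other characters.
shards = "abcdefghijklmnopqrstuvwxyz0123456789"
with ThreadPoolExecutor(max_workers=16) as pool:
    all_users = [u for shard in pool.map(fetch_shard, shards) for u in shard]
```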
During CI/CD, we often have large log output; it might be nice to have some common scripts to format it, so that we can easily find the information we need.
Recently, while working with Sonar, I found that they have some scripts for exactly this kind of output formatting. Continue reading
During CI/CD processes, and particularly during CI, we frequently hash dependency files to create cache keys (referred to as the key input in the GitHub Actions actions/cache action and the key parameter in the Azure Pipelines Cache@2 task). However, the default hash functions come with certain limitations, such as the one described in this comment. To address this, we can use the following pure Bash shell command to manually generate the hash value. Continue reading
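The actual command is in the full post; purely as an illustration of the underlying idea (hash all dependency files in a deterministic order), a rough Python equivalent might look like this, with hypothetical file patterns:

```python
import hashlib
from pathlib import Path

def cache_key(*patterns: str) -> str:
    """Hash all files matching the given glob patterns in a stable
    (sorted) order, so the same inputs always yield the same key."""
    digest = hashlib.sha256()
    for pattern in patterns:
        for path in sorted(Path(".").glob(pattern)):
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Hypothetical dependency files; use whatever your project pins against.
print(cache_key("requirements*.txt", "poetry.lock"))
```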
Recently, I began a new project that requires migrating some processes from Azure Pipelines to GitHub Actions. One of the tasks involves retrieving secrets from Azure Key Vault.
In Azure Pipelines, we have an official task called AzureKeyVault@2 designed for this purpose. However, its official counterpart in GitHub Actions, Azure/get-keyvault-secrets@v1, has been deprecated. The recommended alternative is the Azure CLI. While the Azure CLI is a suitable option, it runs in a Bash shell without multithreading; if numerous secrets need to be fetched, this can be time-consuming. Continue reading
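A minimal sketch of the multithreaded approach using the official azure-keyvault-secrets package (the vault URL and secret names are placeholders, and the post's own solution may differ):

```python
from concurrent.futures import ThreadPoolExecutor
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-vault>.vault.azure.net"          # placeholder
SECRET_NAMES = ["db-password", "api-key", "storage-key"]    # placeholders

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# Each get_secret() call is an independent HTTP request, so a thread
# pool fetches them concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=8) as pool:
    secrets = dict(zip(
        SECRET_NAMES,
        pool.map(lambda name: client.get_secret(name).value, SECRET_NAMES),
    ))
```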