pandas

November 22, 2024
in python, azure, databricks, spark, pandas
7 min read

Databricks Connect

Databricks Connect allows you to connect your favorite IDE (PyCharm, VSCode, etc.) and other custom applications to Databricks compute and run Spark (or non-Spark) code.

This post is not a comprehensive guide on Databricks Connect; rather, it consists of side notes from the Azure Databricks docs. Most of the notes also apply to Databricks on AWS and GCP.

February 26, 2024
in api, python, async, azure, pandas
3 min read

Getting all users from MS Graph API in a few seconds

The MS Graph API endpoint for retrieving users, GET /users, can return all users of the tenant. The default limit is 100 users per page, and the maximum limit is 999 users per page. If there are more users than fit on one page, the response contains an @odata.nextLink field, which is a URL to the next page of users. For a big company with a large number of users (50,000, 100,000, or even more), retrieving all users page by page can be time-consuming.

While the MS Graph API provides generous throttling limits, following one page link after another is sequential, so we should find a way to parallelize the queries. This post explores sharding as a strategy to retrieve all users in a matter of seconds. The idea is to divide users based on the first character of the userPrincipalName field. For instance, shard 1 would encompass users whose userPrincipalName starts with a, shard 2 would handle users starting with b, and so forth.
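
A minimal sketch of the sharding idea, not the post's actual implementation. It assumes each shard is expressed as a `startswith(userPrincipalName,'<char>')` filter on GET /users and that the caller supplies `get_json`, a hypothetical coroutine that performs an authenticated GET and returns the decoded JSON body. The function names and the default prefix set (lowercase letters and digits) are illustrative assumptions; a real tenant may have names starting with other characters.

```python
import asyncio
import string
from urllib.parse import quote

GRAPH_USERS_URL = "https://graph.microsoft.com/v1.0/users"
# Assumption: every userPrincipalName starts with a letter or a digit.
DEFAULT_PREFIXES = string.ascii_lowercase + string.digits


def shard_urls(prefixes=DEFAULT_PREFIXES, page_size=999):
    """Build the first-page URL of each shard, one shard per leading character."""
    urls = []
    for prefix in prefixes:
        flt = quote(f"startswith(userPrincipalName,'{prefix}')", safe="(),'")
        urls.append(f"{GRAPH_USERS_URL}?$top={page_size}&$filter={flt}")
    return urls


async def fetch_shard(first_url, get_json):
    """Collect one shard by following @odata.nextLink until it is absent."""
    users, url = [], first_url
    while url:
        page = await get_json(url)  # hypothetical caller-supplied coroutine
        users.extend(page.get("value", []))
        url = page.get("@odata.nextLink")
    return users


async def fetch_all_users(get_json, prefixes=DEFAULT_PREFIXES):
    """Run all shards concurrently and flatten the results in shard order."""
    shards = await asyncio.gather(
        *(fetch_shard(url, get_json) for url in shard_urls(prefixes))
    )
    return [user for shard in shards for user in shard]
```

Pages within a shard are still fetched one after another, since each next link is only known once the previous page arrives; the speed-up comes from running the shards side by side.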

July 13, 2019
in python, pandas
5 min read

Filtering In Pandas Dataframe

A Pandas dataframe is like a small database: we can load some data into it and do in-memory filtering without any external SQL. This post is essentially a summary of this Stack Overflow thread.
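
As a small illustration of in-memory filtering, here is a sketch using three common pandas idioms. The sample data and column names are made up for the example and are not taken from the post or the thread.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
    "age": [34, 27, 45, 31],
})

# Boolean mask: combine conditions with & or |, each wrapped in parentheses.
oslo_over_30 = df[(df["city"] == "Oslo") & (df["age"] > 30)]

# isin: keep rows whose value is in a given list (like SQL's IN).
in_cities = df[df["city"].isin(["Rome", "Lima"])]

# query: a similar kind of filter written as an expression string.
under_35 = df.query("age < 35 and city != 'Rome'")
```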