Pandas dtypes: To category or not to category?
Starting out programming in Python is easy. Python is designed with user-friendliness in mind. You can quickly write programs without the need to think about data types or memory management. But once you scale your code to production-like settings with bigger workloads and complexity, you need to step up your Python game.
When it comes to data analysis, Python's pandas library offers a lot of optimization potential out of the box, but relying on default settings is risky. One case in point is dtypes, the data types of your columns (like float or integer). For the most part, pandas uses NumPy arrays and dtypes for the individual columns of a DataFrame and extends them in a few places.
The dtype describes how the bytes in the fixed-size block of memory corresponding to an underlying array item should be interpreted. It describes things like the type of data, the size of the data or the byte order of the data.
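To make this concrete, here is a minimal sketch using plain NumPy. The array contents and variable names are illustrative only. It shows the properties a dtype carries and how the same block of bytes reads differently under another dtype.

```python
import numpy as np

# Three 32-bit integers stored in one contiguous block of memory.
arr = np.array([1, 2, 3], dtype=np.int32)

print(arr.dtype)            # int32
print(arr.dtype.kind)       # 'i' (signed integer)
print(arr.dtype.itemsize)   # 4 bytes per item
print(arr.dtype.byteorder)  # '=' (native byte order)
print(arr.nbytes)           # 12 bytes in total (3 items * 4 bytes)

# The same 12 bytes, reinterpreted item by item under other dtypes.
raw_bytes = arr.view(np.uint8)    # 12 single-byte items
as_floats = arr.view(np.float32)  # 3 items, same bytes read as floats
print(raw_bytes.size, as_floats.dtype)
```

No data is copied by view(); only the interpretation of the underlying bytes changes, which is exactly what the dtype controls.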
Pandas' default dtypes are not always memory efficient. This is particularly true for strings. The official pandas documentation recommends casting text columns with relatively few unique values (“low-cardinality” data) to the Categorical type. Now you might ask yourself: how low is low enough to profit from recasting?
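As a small illustration of what such a cast looks like, the sketch below builds a low-cardinality text column and compares its memory footprint before and after astype("category"). The column name and city values are made up for the example, and exact byte counts depend on your pandas version.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct values, 3000 rows.
df = pd.DataFrame({"city": ["Berlin", "Hamburg", "Munich"] * 1000})

# deep=True also counts the memory held by the string objects themselves.
default_bytes = df["city"].memory_usage(index=False, deep=True)

as_category = df["city"].astype("category")
category_bytes = as_category.memory_usage(index=False, deep=True)

print(df["city"].dtype)            # the default dtype (object in pandas 1.4.2)
print(as_category.cat.categories)  # the 3 distinct values, stored once
print(as_category.cat.codes.dtype) # small integer codes, one per row
print(default_bytes, category_bytes)
```

A categorical column stores each distinct string once and keeps one small integer code per row, which is where the saving for repeated values comes from.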
The answer is analytical in nature, but 20 minutes of coding can spare you 5 minutes of thinking, so I wrote a quick simulation on memory reduction through recasting (in pandas 1.4.2):
Two parameters drive our investigation: how many distinct keys we have overall, and how often each key repeats. Here is the memory gain from using category in percent:
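A simulation along these lines could look like the following minimal sketch. It is not the original experiment: the helper memory_gain_percent, the random 32-character hex keys, and the chosen grid of key counts and repeat counts are all assumptions made for illustration. The resulting percentages depend on string length and pandas version, so they will not necessarily match the figures quoted in this article.

```python
import uuid

import pandas as pd


def memory_gain_percent(n_keys: int, n_repeats: int) -> float:
    """Memory saved (in percent) by casting a string column to category.

    Hypothetical helper: builds n_keys random strings, repeats the whole
    set n_repeats times, and compares deep memory usage before and after.
    """
    keys = [uuid.uuid4().hex for _ in range(n_keys)]  # 32-char hex strings
    series = pd.Series(keys * n_repeats, dtype=object)
    before = series.memory_usage(index=False, deep=True)
    after = series.astype("category").memory_usage(index=False, deep=True)
    return 100.0 * (before - after) / before


# Assumed parameter grid: number of distinct keys x occurrences per key.
key_counts = [10, 100, 1000]
repeat_counts = [1, 2, 10]

# Rows: number of distinct keys. Columns: occurrences per key.
results = pd.DataFrame(
    {r: {k: memory_gain_percent(k, r) for k in key_counts} for r in repeat_counts}
)
results.index.name = "n_keys"
results.columns.name = "n_repeats"
print(results.round(1))
```

A negative value in the table means the categorical version uses more memory than the default dtype, which can happen when few or no keys repeat.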
If all keys are unique, feel free to stick with the default dtype, but as soon as you expect your strings to repeat even once, start using the categorical dtype: on average you will save 25% of memory. Once your strings show up 10 times or more, the memory gain is up to 80%.