Pandas dtypes: To category or not to category?

Photo by Fredy Jacob on Unsplash

Starting out programming in Python is easy. Python is designed with user-friendliness in mind. You can quickly write programs without the need to think about data types or memory management. But once you scale your code to production-like settings with bigger workloads and complexity, you need to step up your Python game.

When it comes to data analysis, Python's pandas library offers a lot of optimization potential out of the box, but relying on default settings is risky. A case in point is dtypes, the data types of your columns (like float or integer). For the most part, pandas uses NumPy arrays and dtypes for the individual columns of a DataFrame and extends them in a few places.

The dtype describes how the bytes in the fixed-size block of memory corresponding to an underlying array item should be interpreted. It describes things like the type of data, the size of the data or the byte order of the data.
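To make this concrete, here is a small sketch that inspects a NumPy dtype and then reinterprets the same block of memory under a different dtype. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Three 32-bit integers: each item occupies a fixed 4-byte slot.
values = np.array([1, 2, 3], dtype=np.int32)

# The dtype carries the kind of data, the item size and the byte order.
print(values.dtype.name, values.dtype.itemsize, values.dtype.byteorder)

# Same memory, different interpretation: 12 raw bytes instead of 3 integers.
raw_bytes = values.view(np.uint8)
print(raw_bytes.shape)

# A pandas column built from the array keeps the NumPy dtype.
column = pd.Series(values)
print(column.dtype)
```

The data itself never changes here; only the dtype decides how the bytes are read.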

Pandas' default dtypes are not always memory efficient. This is particularly true for strings. The official pandas documentation recommends casting text columns with relatively few unique values (“low-cardinality” data) to the Categorical type. Now you might ask yourself: how low is low enough to profit from recasting?
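As a quick illustration, the following sketch compares the memory footprint of a low-cardinality text column before and after casting. The column contents are made up, and the object dtype is set explicitly on the assumption that it matches the default string storage of the pandas version used in this article.

```python
import pandas as pd

# 3 distinct strings repeated 1000 times each: low-cardinality text data.
colors = pd.Series(["red", "green", "blue"] * 1000, dtype=object)

# deep=True also counts the Python string objects, not just the pointers.
bytes_as_object = colors.memory_usage(index=False, deep=True)
bytes_as_category = colors.astype("category").memory_usage(index=False, deep=True)

print(f"object:   {bytes_as_object} bytes")
print(f"category: {bytes_as_category} bytes")
```

A categorical column stores each distinct string once and keeps small integer codes per row, which is where the saving comes from.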

The answer is analytical in nature, but 20 minutes of coding can spare you 5 minutes of thinking, so I wrote a quick simulation on memory reduction through recasting (in pandas 1.4.2):
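A minimal sketch of such a simulation could look like this. It is not the original code: the function name `memory_gain_percent`, the synthetic `key_...` strings and the chosen parameter grid are all assumptions made for illustration.

```python
import pandas as pd


def memory_gain_percent(n_keys, reps):
    """Memory saved (in percent) by casting an object column to category.

    n_keys: number of distinct strings (hypothetical parameter name)
    reps:   how many times each distinct string appears in the column
    """
    # Synthetic, guaranteed-unique keys of equal length (an assumption).
    keys = [f"key_{i:08d}" for i in range(n_keys)]
    column = pd.Series(keys * reps, dtype=object)

    bytes_as_object = column.memory_usage(index=False, deep=True)
    bytes_as_category = column.astype("category").memory_usage(index=False, deep=True)
    return 100 * (1 - bytes_as_category / bytes_as_object)


key_counts = [10, 100, 1000, 10000]
rep_counts = [1, 2, 5, 10]

# Rows: number of distinct strings; columns: appearances per string.
results = pd.DataFrame(
    {reps: [memory_gain_percent(n, reps) for n in key_counts] for reps in rep_counts},
    index=key_counts,
)
print(results.round(1))
```

The exact percentages depend on string length and pandas version, so the numbers from this sketch need not match the figures reported below.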

Two parameters drive our investigation: how many distinct keys we have, and how often each of them repeats. Here is the memory gain from using category, in percent:

Simulation results by number of strings and reps

If all keys are unique, feel free to stick with the default dtype. But once you expect your strings to repeat at least once, start using the categorical dtype: on average you will save 25% of memory. Once your strings show up 10 times or more, the memory gain is up to 80%.
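This rule of thumb can be turned into a small helper. The sketch below is one possible reading of it, not a pandas feature: the function name `categorize_repeated_strings` and the threshold `min_avg_reps=2.0` (each value appearing twice on average) are assumptions.

```python
import pandas as pd


def categorize_repeated_strings(df, min_avg_reps=2.0):
    """Cast object columns to category when values repeat often enough.

    min_avg_reps is a hypothetical threshold: rows per distinct value.
    """
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        n_unique = out[col].nunique(dropna=True)
        # Skip empty or all-missing columns, then apply the threshold.
        if n_unique and len(out) / n_unique >= min_avg_reps:
            out[col] = out[col].astype("category")
    return out


df = pd.DataFrame(
    {
        "city": pd.Series(["Berlin", "Paris"] * 50, dtype=object),            # repeats
        "user_id": pd.Series([f"u{i}" for i in range(100)], dtype=object),    # all unique
    }
)
converted = categorize_repeated_strings(df)
print(converted.dtypes)
```

Only the repeating column is recast; the all-unique column keeps its original dtype.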




Data Science, Trail-Running, Philosophy and Art

Christoph Hoffmann
