I recently migrated a collection of pipelines from Pandas to Polars to meet the growing need for faster data processing and better memory efficiency. As the datasets grew, Pandas began to show its limits in both runtime performance and peak memory usage. Polars offered a compelling alternative: its Rust-based engine delivers considerable speed improvements, better memory management, and advanced features such as lazy evaluation and parallel processing.
In this blog, I’ll delve into some of the challenges and discoveries I encountered during the migration from Pandas to Polars. These insights highlight the differences between the two libraries and offer practical tips for anyone considering a similar migration.
Data type conversions
When migrating from Pandas to Polars, it’s crucial to understand that Polars enforces stricter data type rules than Pandas. In Pandas, columns can contain mixed types, such as strings and numbers, within the same column without much issue. Polars, however, requires that each column maintains a consistent data type: if you attempt to create a column with mixed types, Polars will either convert all values to a single type (typically String, called Utf8 in older releases) or raise an error, depending on the situation.
For example, if you have a column in Pandas containing both integers and strings, Polars might convert the entire column to String during the migration, which can lead to unexpected behavior in downstream operations that expect numeric data. This strictness helps Polars optimize performance and memory usage but requires careful handling when working with data that might have been more loosely typed in Pandas.
To avoid issues, you should explicitly define the data types for your columns when converting your data to Polars, or use Polars’ type casting methods to ensure that your data is in the correct format. Being proactive about these data type conversions will help prevent errors and ensure that your data processing pipelines behave as expected in Polars.
>>> import polars as pl
>>> from datetime import date
>>>
>>>
>>> pl.DataFrame(
... [
... {"date": date(2024, 8, 1), "col1": 1, "col2": 1},
... {"date": date(2024, 8, 2), "col1": 2.0, "col2": 3.0},
... ],
... schema=[("date", pl.Date), ("col1", pl.Int64), ("col2", pl.Int64)],
... )
shape: (2, 3)
┌────────────┬──────┬──────┐
│ date ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i64 │
╞════════════╪══════╪══════╡
│ 2024-08-01 ┆ 1 ┆ 1 │
│ 2024-08-02 ┆ 2 ┆ 3 │
└────────────┴──────┴──────┘
If you are unable to define the schema during the initialization of a Polars dataframe, an alternative is to use a library such as pandera to validate whether the given dataframe matches the expected schema.
import pandera.polars as pa
import polars as pl

# Raises a SchemaError if `df` does not match the expected schema
schema = pa.DataFrameSchema({"date": pa.Column(pl.Date)})
schema.validate(df)
Mixed-type lists are hard
In line with the stricter data types mentioned above, another notable change involves columns containing semi-structured data. In Pandas, a column with data type object can seamlessly hold lists, dictionaries, etc., allowing for flexible and dynamic data structures within the dataframe. In Polars, however, such a column of dictionaries is converted to a Struct data type. Similar to dataframes, the fields in these structs are also cast to a data type. If the schema is not given explicitly, Polars does a pretty good job of inferring these data types itself.
>>> import polars as pl
>>>
>>> df = pl.DataFrame(
... [
... {"dictionary": {"a": 1, "b": 2}},
... {"dictionary": {"a": 3, "b": 4}},
... ],
... orient='row'
... )
>>> df
shape: (2, 1)
┌────────────┐
│ dictionary │
│ --- │
│ struct[2] │
╞════════════╡
│ {1,2} │
│ {3,4} │
└────────────┘
>>> df.schema
Schema([('dictionary', Struct({'a': Int64, 'b': Int64}))])
However, inference will fail if Polars cannot determine a single data type for every value, e.g. when a column contains lists that mix data types, such as strings and integers.
>>> import polars as pl
>>>
>>> df = pl.DataFrame(
... data=[
... [
... {"dictionary": {"a": 1, "b": 2}, "lists": ["a", 1]},
... ]
... ],
... orient="row",
... )
Traceback (most recent call last):
TypeError: argument 'data': unexpected value while building Series of type String; found value of type Int64: 1
This behavior underscores the importance of understanding data type conversions. When migrating an existing Pandas application, the example above can be quite painful: you may have to refactor the dataframes in your application more than you had hoped in order to resolve the issue.
The testing module is not imported by default
One important aspect to note is that the Polars testing module is not imported by default. Unlike Pandas, where testing utilities are available within the primary namespace, Polars requires explicit importation of its testing functionalities. This means that when writing tests for your dataframes, you need to remember to import polars.testing separately. This approach keeps the core namespace clean and lightweight but requires users to be aware of the need for these additional imports during testing phases.
As a result, the pytest files use the approach in the code snippet below, which keeps pl as the Polars namespace while also importing the testing module.
import polars as pl
import polars.testing  # makes pl.testing available

pl.testing.assert_frame_equal(expected, result)
Row order is not guaranteed
Another critical difference between Polars and Pandas is the handling of dataframe row ordering. In Polars, the ordering of the result dataframe is not guaranteed by default. This is a significant change from Pandas, where the order of rows is typically preserved through various operations unless explicitly altered.
Polars optimizes for performance and may reorder data as part of its internal processing. Therefore, if the order of the rows is important for your application, you should explicitly sort the dataframe after performing your operations to ensure consistency. This behavior emphasizes the importance of understanding and controlling the data flow within Polars to achieve the desired outcomes.
Indexing differences
In Pandas, an index acts as a unique identifier for rows, allowing for easy access to rows by label, and the ability to perform operations like reindexing, setting multi-level indices, and joining dataframes based on their indices. This index can be either a single column or a multi-level index (MultiIndex), and it plays a crucial role in many Pandas workflows.
However, in Polars, there is no concept of an index. Dataframes in Polars are purely columnar, meaning each column is treated as a series of data without any inherent row identifier. This design choice in Polars is intentional, as it aligns with Polars’ focus on performance and memory efficiency, avoiding the overhead associated with maintaining and manipulating an index.
Since there is no index in Polars, you’ll need to handle row-based operations differently. For example, you might use the row number directly or create a specific column to act as a unique identifier for rows if such functionality is required.
Conclusion
The learnings above are just a few of the many discoveries made during the migration process. As always, learning a new technology takes time, and the best way to learn is by doing. A successful transition from Pandas to Polars pays off, though, as you’ll experience a significant boost in performance. Who doesn’t like that? 🚀