Tolerances#

When comparing data frames, floating point and temporal values may differ slightly due to precision or rounding. diffly supports configurable tolerances to handle these cases.

We continue with the supermarket data pipeline scenario from the quickstart:

import polars as pl

pl.Config.set_fmt_float("full")

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)

Default behavior#

By default, diffly uses abs_tol=1e-08 and rel_tol=1e-05 for floating point comparisons, matching Python’s math.isclose defaults. Temporal types (dates, datetimes) are compared exactly (abs_tol_temporal=0). This means tiny floating point rounding differences are ignored automatically, but even a one-second timestamp difference will be flagged.
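Because the defaults match math.isclose, you can preview how a pair of float values will be treated using only the standard library (a minimal sketch; math.isclose treats two values as equal when abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)):

```python
import math

# Defaults matching math.isclose (and therefore diffly's defaults).
ABS_TOL = 1e-08
REL_TOL = 1e-05

# A tiny rounding difference is treated as equal...
print(math.isclose(3.0, 3.0000000001, rel_tol=REL_TOL, abs_tol=ABS_TOL))  # True

# ...but a genuine difference is not.
print(math.isclose(10.3, 20.6, rel_tol=REL_TOL, abs_tol=ABS_TOL))  # False
```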

In our scenario, the total column has some values that differ only at the 10th decimal place due to how the totals were calculated in each system:

df_previous.join(
    df_current,
    on="transaction_id",
    suffix="_current",
).filter(
    pl.col("total") != pl.col("total_current")
).select("transaction_id", "total", "total_current")
shape: (6, 3)
┌────────────────┬───────┬───────────────┐
│ transaction_id ┆ total ┆ total_current │
│ ---            ┆ ---   ┆ ---           │
│ str            ┆ f64   ┆ f64           │
╞════════════════╪═══════╪═══════════════╡
│ "TXN-001"      ┆ 3     ┆ 3.0000000001  │
│ "TXN-003"      ┆ 3.2   ┆ 3.1999999999  │
│ "TXN-006"      ┆ 3     ┆ 5.8           │
│ "TXN-007"      ┆ 10.3  ┆ 20.6          │
│ "TXN-008"      ┆ 0.75  ┆ 1.3           │
│ "TXN-009"      ┆ 2.4   ┆ 2.4000000001  │
└────────────────┴───────┴───────────────┘

With default tolerances, the tiny precision differences are correctly ignored — only real differences remain:

from diffly import compare_frames

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")
comparison.joined_unequal("total", select="subset")
shape: (3, 3)
┌────────────────┬────────────┬─────────────┐
│ transaction_id ┆ total_left ┆ total_right │
│ ---            ┆ ---        ┆ ---         │
│ str            ┆ f64        ┆ f64         │
╞════════════════╪════════════╪═════════════╡
│ "TXN-006"      ┆ 3          ┆ 5.8         │
│ "TXN-007"      ┆ 10.3       ┆ 20.6        │
│ "TXN-008"      ┆ 0.75       ┆ 1.3         │
└────────────────┴────────────┴─────────────┘

Strict comparison#

To catch all differences including floating point precision issues, set both tolerances to zero:

comparison_strict = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol=0,
    rel_tol=0,
)
comparison_strict.fraction_same()["total"]
0.4
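The 0.4 follows directly from the data: assuming the ten transactions of the quickstart dataset, six total values differ under exact comparison, leaving four matches. A plain-Python sketch of that arithmetic (the pairs are the differing values shown above; the row count of ten is an assumption carried over from the quickstart):

```python
# The six (left, right) pairs that differ under exact comparison.
differing_pairs = [
    (3.0, 3.0000000001),
    (3.2, 3.1999999999),
    (3.0, 5.8),
    (10.3, 20.6),
    (0.75, 1.3),
    (2.4, 2.4000000001),
]
n_rows = 10  # total transactions in the example dataset

# Strict comparison flags all six, so 4 of 10 rows match.
n_equal = n_rows - sum(left != right for left, right in differing_pairs)
fraction_same = n_equal / n_rows
print(fraction_same)  # 0.4
```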

Now the precision differences are also counted as mismatches:

comparison_strict.joined_unequal("total", select="subset")
shape: (6, 3)
┌────────────────┬────────────┬─────────────┐
│ transaction_id ┆ total_left ┆ total_right │
│ ---            ┆ ---        ┆ ---         │
│ str            ┆ f64        ┆ f64         │
╞════════════════╪════════════╪═════════════╡
│ "TXN-001"      ┆ 3          ┆ 3.0000000001│
│ "TXN-003"      ┆ 3.2        ┆ 3.1999999999│
│ "TXN-006"      ┆ 3          ┆ 5.8         │
│ "TXN-007"      ┆ 10.3       ┆ 20.6        │
│ "TXN-008"      ┆ 0.75       ┆ 1.3         │
│ "TXN-009"      ┆ 2.4        ┆ 2.4000000001│
└────────────────┴────────────┴─────────────┘

Temporal tolerances#

Similarly, temporal columns can have small differences between systems — for example, due to clock drift on a register. In our data, the three transactions from store S2 have timestamps that differ by 2–3 seconds between loads:

comparison.joined_unequal("timestamp", select="subset")
shape: (3, 3)
┌────────────────┬─────────────────────┬─────────────────────┐
│ transaction_id ┆ timestamp_left      ┆ timestamp_right     │
│ ---            ┆ ---                 ┆ ---                 │
│ str            ┆ datetime[μs]        ┆ datetime[μs]        │
╞════════════════╪═════════════════════╪═════════════════════╡
│ "TXN-006"      ┆ 2024-03-01 11:20:00 ┆ 2024-03-01 11:20:02 │
│ "TXN-007"      ┆ 2024-03-01 11:45:00 ┆ 2024-03-01 11:45:03 │
│ "TXN-008"      ┆ 2024-03-01 12:10:00 ┆ 2024-03-01 12:10:02 │
└────────────────┴─────────────────────┴─────────────────────┘

Use abs_tol_temporal to allow small temporal differences. With a 2-second tolerance, two of the above gaps are ignored, increasing the match rate from 70% to 90%:

import datetime as dt

comparison_temporal = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol_temporal=dt.timedelta(seconds=2),
)
comparison_temporal.fraction_same()["timestamp"]
0.9
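The rule behind abs_tol_temporal can be sketched with the standard library: a pair of timestamps counts as equal when their absolute difference is within the tolerance (an illustrative sketch using the three S2 timestamps above; the exact comparison diffly performs internally may differ in detail):

```python
import datetime as dt

tolerance = dt.timedelta(seconds=2)

# The three (left, right) timestamp pairs from store S2.
pairs = [
    (dt.datetime(2024, 3, 1, 11, 20, 0), dt.datetime(2024, 3, 1, 11, 20, 2)),  # 2 s gap
    (dt.datetime(2024, 3, 1, 11, 45, 0), dt.datetime(2024, 3, 1, 11, 45, 3)),  # 3 s gap
    (dt.datetime(2024, 3, 1, 12, 10, 0), dt.datetime(2024, 3, 1, 12, 10, 2)),  # 2 s gap
]

# Only the 3-second gap exceeds the 2-second tolerance.
flagged = [abs(left - right) > tolerance for left, right in pairs]
print(flagged)  # [False, True, False]
```

With one of ten rows still flagged, the match rate for the timestamp column lands at 0.9, as shown above.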

Per-column tolerances#

Tolerances can be specified per column by passing a mapping:

from collections import defaultdict

comparison = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol=defaultdict(lambda: 0, {"total": 0.01}),
    rel_tol=defaultdict(lambda: 1e-05, {"total": 0}),
)

Note

When using a plain dict, a value must be provided for every common column except the primary key. A defaultdict (as above) avoids this by supplying a fallback automatically.
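The defaultdict fallback can be seen in isolation with the standard library (a minimal sketch; "timestamp" here merely stands in for any common column not listed explicitly):

```python
from collections import defaultdict

# Explicit tolerance for "total"; every other column falls back to 0.
abs_tol = defaultdict(lambda: 0, {"total": 0.01})

print(abs_tol["total"])      # 0.01 (explicit entry)
print(abs_tol["timestamp"])  # 0    (fallback supplied by the factory)
```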

Note

Tolerances are not supported for List and Array columns. These are always compared for exact equality.