# Tolerances
When comparing data frames, floating point and temporal values may differ slightly due to precision or rounding. diffly supports configurable tolerances to handle these cases.
We continue with the supermarket data pipeline scenario from the quickstart:
```python
import polars as pl

pl.Config.set_fmt_float("full")

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)
```
## Default behavior
By default, diffly uses abs_tol=1e-08 and rel_tol=1e-05 for floating-point comparisons, matching NumPy's isclose defaults (Python's math.isclose defaults differ: rel_tol=1e-09, abs_tol=0.0). Temporal types (dates, datetimes) are compared exactly (abs_tol_temporal=0). As a result, tiny floating-point rounding noise is ignored automatically, but even a one-second timestamp difference is flagged.
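To make the defaults concrete, the same check can be reproduced with Python's math.isclose by passing the tolerances explicitly (diffly's internal formula may differ in detail, but the behavior on values like these is the same):

```python
import math

# With the defaults quoted above, sub-epsilon rounding noise passes
# while real differences do not; with zero tolerances, nothing passes
# unless it is bit-for-bit identical.
print(math.isclose(3.0, 3.0000000001, rel_tol=1e-05, abs_tol=1e-08))  # True
print(math.isclose(3.0, 5.8, rel_tol=1e-05, abs_tol=1e-08))           # False
print(math.isclose(3.0, 3.0000000001, rel_tol=0, abs_tol=0))          # False
```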
In our scenario, the total column has some values that differ only at the 10th decimal place due to how the totals were calculated in each system:
```python
df_previous.join(
    df_current,
    on="transaction_id",
    suffix="_current",
).filter(
    pl.col("total") != pl.col("total_current")
).select("transaction_id", "total", "total_current")
```
| transaction_id | total | total_current |
|---|---|---|
| str | f64 | f64 |
| "TXN-001" | 3 | 3.0000000001 |
| "TXN-003" | 3.2 | 3.1999999999 |
| "TXN-006" | 3 | 5.8 |
| "TXN-007" | 10.3 | 20.6 |
| "TXN-008" | 0.75 | 1.3 |
| "TXN-009" | 2.4 | 2.4000000001 |
With default tolerances, the tiny precision differences are correctly ignored — only real differences remain:
```python
from diffly import compare_frames

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")
comparison.joined_unequal("total", select="subset")
```
| transaction_id | total_left | total_right |
|---|---|---|
| str | f64 | f64 |
| "TXN-006" | 3 | 5.8 |
| "TXN-007" | 10.3 | 20.6 |
| "TXN-008" | 0.75 | 1.3 |
## Strict comparison
To catch every difference, including floating-point precision noise, set both tolerances to zero:
```python
comparison_strict = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol=0,
    rel_tol=0,
)
comparison_strict.fraction_same()["total"]  # 0.4
```
Now the precision differences are also counted as mismatches:
```python
comparison_strict.joined_unequal("total", select="subset")
```
| transaction_id | total_left | total_right |
|---|---|---|
| str | f64 | f64 |
| "TXN-001" | 3 | 3.0000000001 |
| "TXN-003" | 3.2 | 3.1999999999 |
| "TXN-006" | 3 | 5.8 |
| "TXN-007" | 10.3 | 20.6 |
| "TXN-008" | 0.75 | 1.3 |
| "TXN-009" | 2.4 | 2.4000000001 |
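The 0.4 reported by fraction_same is consistent with the table: six transactions differ under strict tolerances. Assuming the quickstart files contain ten transactions (which the 70%/90% timestamp figures also imply), a quick sanity check:

```python
# IDs taken from the strict-comparison table above; n_rows is an
# assumption based on the ten-transaction quickstart dataset.
mismatched = {"TXN-001", "TXN-003", "TXN-006", "TXN-007", "TXN-008", "TXN-009"}
n_rows = 10
print((n_rows - len(mismatched)) / n_rows)  # 0.4
```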
## Temporal tolerances
Similarly, temporal columns can have small differences between systems — for example, due to clock drift on a register. In our data, the three transactions from store S2 have timestamps that differ by 2–3 seconds between loads:
```python
comparison.joined_unequal("timestamp", select="subset")
```
| transaction_id | timestamp_left | timestamp_right |
|---|---|---|
| str | datetime[μs] | datetime[μs] |
| "TXN-006" | 2024-03-01 11:20:00 | 2024-03-01 11:20:02 |
| "TXN-007" | 2024-03-01 11:45:00 | 2024-03-01 11:45:03 |
| "TXN-008" | 2024-03-01 12:10:00 | 2024-03-01 12:10:02 |
Use abs_tol_temporal to allow small temporal differences. With a 2-second tolerance, two of the above gaps are ignored, increasing the match rate from 70% to 90%:
```python
import datetime as dt

comparison_temporal = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol_temporal=dt.timedelta(seconds=2),
)
comparison_temporal.fraction_same()["timestamp"]  # 0.9
```
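The effect on the three gaps above can be sketched directly with timedelta arithmetic (an illustration of how an absolute temporal tolerance behaves, not diffly's internals): the two 2-second gaps fall within the tolerance, while the 3-second gap remains a mismatch.

```python
import datetime as dt

# Timestamp gaps from the table above, checked against the 2-second tolerance.
tol = dt.timedelta(seconds=2)
gaps = {
    "TXN-006": dt.timedelta(seconds=2),
    "TXN-007": dt.timedelta(seconds=3),
    "TXN-008": dt.timedelta(seconds=2),
}
print({txn: gap <= tol for txn, gap in gaps.items()})
# {'TXN-006': True, 'TXN-007': False, 'TXN-008': True}
```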
## Per-column tolerances
Tolerances can be specified per column by passing a mapping:
```python
from collections import defaultdict

comparison = compare_frames(
    df_previous,
    df_current,
    primary_key="transaction_id",
    abs_tol=defaultdict(lambda: 0, {"total": 0.01}),
    rel_tol=defaultdict(lambda: 1e-05, {"total": 0}),
)
```
**Note:** When using a plain dict, a value must be provided for every common column except the primary key. A defaultdict (as above) avoids this by supplying a fallback automatically.
**Note:** Tolerances are not supported for List and Array columns. These are always compared for exact equality.
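If you prefer a plain dict, one option is to build it programmatically over the shared non-key columns. A minimal sketch, assuming hypothetical column names from the quickstart schema (substitute your frames' actual columns):

```python
# Hypothetical helper: build a plain-dict tolerance map covering every
# shared column except the primary key, as the note above requires.
key = "transaction_id"
common_columns = ["store", "timestamp", "total"]  # assumed schema
abs_tol = {col: 0.01 if col == "total" else 0 for col in common_columns}
print(abs_tol)  # {'store': 0, 'timestamp': 0, 'total': 0.01}
```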