Investigating Differences#

The summary provides a high-level overview of differences between two data frames. But to understand why they differ — and whether the differences point to a systematic issue — you need to drill deeper. This guide walks through diffly’s investigation API.

We continue with the supermarket data pipeline scenario from the quickstart:

import polars as pl

from diffly import compare_frames

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")

The summary showed us that unit_price, discount, timestamp, and total match at only 70%, while loyalty_card_id matches at 90%. But are the pricing mismatches happening in the same rows, or different ones? If they're independent, we might have separate issues. If they're correlated, we likely have one root cause. Let's find out.

Investigating column mismatches#

joined_unequal() returns rows where values differ in the specified columns. Let’s start by looking at unit_price:

comparison.joined_unequal("unit_price", select="subset")
shape: (3, 3)
┌────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---             ┆ ---              │
│ str            ┆ f64             ┆ f64              │
╞════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.75            ┆ 1.5              │
└────────────────┴─────────────────┴──────────────────┘

Three rows have different unit prices: TXN-006, TXN-007, and TXN-008. Now let’s check discount:

comparison.joined_unequal("discount", select="subset")
shape: (3, 3)
┌────────────────┬───────────────┬────────────────┐
│ transaction_id ┆ discount_left ┆ discount_right │
│ ---            ┆ ---           ┆ ---            │
│ str            ┆ f64           ┆ f64            │
╞════════════════╪═══════════════╪════════════════╡
│ "TXN-006"      ┆ 0.0           ┆ 0.2            │
│ "TXN-007"      ┆ 0.5           ┆ 1.0            │
│ "TXN-008"      ┆ 0.0           ┆ 0.2            │
└────────────────┴───────────────┴────────────────┘

The exact same three rows. We can confirm this directly by passing all three columns at once — this returns rows where any of the specified columns differ:

comparison.joined_unequal("unit_price", "discount", "total", select="subset")
shape: (3, 7)
┌────────────────┬────────────┬─────────────┬───────────────┬────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ total_left ┆ total_right ┆ discount_left ┆ discount_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---        ┆ ---         ┆ ---           ┆ ---            ┆ ---             ┆ ---              │
│ str            ┆ f64        ┆ f64         ┆ f64           ┆ f64            ┆ f64             ┆ f64              │
╞════════════════╪════════════╪═════════════╪═══════════════╪════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 3.0        ┆ 5.8         ┆ 0.0           ┆ 0.2            ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 10.3       ┆ 20.6        ┆ 0.5           ┆ 1.0            ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.75       ┆ 1.3         ┆ 0.0           ┆ 0.2            ┆ 0.75            ┆ 1.5              │
└────────────────┴────────────┴─────────────┴───────────────┴────────────────┴─────────────────┴──────────────────┘

Still exactly three rows, confirming that the unit_price, discount, and total mismatches are fully correlated. These aren't three independent problems; they're one.

Now let’s use the select parameter to pull in additional context. By passing a list, we can include columns beyond those being checked for mismatches:

comparison.joined_unequal("unit_price", select=["store_id", "register_id"])
shape: (3, 7)
┌────────────────┬───────────────┬────────────────┬──────────────────┬───────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ store_id_left ┆ store_id_right ┆ register_id_left ┆ register_id_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---           ┆ ---            ┆ ---              ┆ ---               ┆ ---             ┆ ---              │
│ str            ┆ str           ┆ str            ┆ str              ┆ str               ┆ f64             ┆ f64              │
╞════════════════╪═══════════════╪════════════════╪══════════════════╪═══════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 0.75            ┆ 1.5              │
└────────────────┴───────────────┴────────────────┴──────────────────┴───────────────────┴─────────────────┴──────────────────┘

This reveals that all three affected rows share store_id = "S2" and register_id = "R4". The data load for store S2’s register R4 appears to have applied incorrect pricing — the unit prices in the new load are exactly double the originals, and the totals were recomputed from those wrong prices.

Seeing common change patterns#

change_counts() shows the most frequent value transitions for a column, confirming systematic patterns:

comparison.change_counts("unit_price")
shape: (2, 3)
┌──────┬───────┬───────┐
│ left ┆ right ┆ count │
│ ---  ┆ ---   ┆ ---   │
│ f64  ┆ f64   ┆ u32   │
╞══════╪═══════╪═══════╡
│ 0.75 ┆ 1.5   ┆ 2     │
│ 5.4  ┆ 10.8  ┆ 1     │
└──────┴───────┴───────┘

The most common change (0.75 → 1.5) appears in two transactions, and the remaining change (5.4 → 10.8) follows the same doubling pattern — a clear sign of a systematic issue rather than random data corruption. With a larger dataset, you might see a single dominant change pattern (e.g., 2.99 → 5.98 appearing 500 times), making the doubling pattern immediately obvious.

Checking the loyalty card mismatch#

We’ve explained the pricing discrepancies, but what about the loyalty_card_id mismatch? Let’s check if it’s related:

comparison.joined_unequal("loyalty_card_id", select="subset")
shape: (1, 3)
┌────────────────┬──────────────────────┬───────────────────────┐
│ transaction_id ┆ loyalty_card_id_left ┆ loyalty_card_id_right │
│ ---            ┆ ---                  ┆ ---                   │
│ str            ┆ str                  ┆ str                   │
╞════════════════╪══════════════════════╪═══════════════════════╡
│ "TXN-004"      ┆ "LC-1004"            ┆ "LC-9999"             │
└────────────────┴──────────────────────┴───────────────────────┘

This affects a completely different row (TXN-004) — it’s unrelated to the pricing issue. The card ID changed from LC-1004 to LC-9999, but the reason isn’t clear from the data alone. It could be a customer updating their loyalty card, a manual data correction, or something else entirely. Not every discrepancy has an obvious explanation, and that’s fine — the important thing is that we’ve confirmed it’s a separate issue from the systematic pricing problem.

Investigating extra or missing rows#

Removed entries#

left_only() returns rows that were in the previous load but not the new one:

comparison.left_only()
shape: (2, 10)
┌────────────────┬─────────────────────┬──────────┬─────────────┬─────────┬──────────┬────────────┬──────────┬───────┬─────────────────┐
│ transaction_id ┆ timestamp           ┆ store_id ┆ register_id ┆ product ┆ quantity ┆ unit_price ┆ discount ┆ total ┆ loyalty_card_id │
│ ---            ┆ ---                 ┆ ---      ┆ ---         ┆ ---     ┆ ---      ┆ ---        ┆ ---      ┆ ---   ┆ ---             │
│ str            ┆ datetime[μs]        ┆ str      ┆ str         ┆ str     ┆ i64      ┆ f64        ┆ f64      ┆ f64   ┆ str             │
╞════════════════╪═════════════════════╪══════════╪═════════════╪═════════╪══════════╪════════════╪══════════╪═══════╪═════════════════╡
│ "TXN-011"      ┆ 2024-03-01 08:00:00 ┆ "S2"     ┆ "R3"        ┆ "Soap"  ┆ 1        ┆ 2.5        ┆ 0.0      ┆ 2.5   ┆ "LC-2004"       │
│ "TXN-012"      ┆ 2024-03-01 08:20:00 ┆ "S2"     ┆ "R3"        ┆ "Pasta" ┆ 2        ┆ 1.8        ┆ 0.0      ┆ 3.6   ┆ "LC-2005"       │
└────────────────┴─────────────────────┴──────────┴─────────────┴─────────┴──────────┴────────────┴──────────┴───────┴─────────────────┘

Both removed entries are from store S2 — the same store that has price differences. These early-morning transactions from register R3 were apparently identified as incorrect and removed in the new load.

New entries#

right_only() returns rows that exist only in the new load:

comparison.right_only()
shape: (2, 10)
┌────────────────┬─────────────────────┬──────────┬─────────────┬───────────┬──────────┬────────────┬──────────┬───────┬─────────────────┐
│ transaction_id ┆ timestamp           ┆ store_id ┆ register_id ┆ product   ┆ quantity ┆ unit_price ┆ discount ┆ total ┆ loyalty_card_id │
│ ---            ┆ ---                 ┆ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---        ┆ ---      ┆ ---   ┆ ---             │
│ str            ┆ datetime[μs]        ┆ str      ┆ str         ┆ str       ┆ i64      ┆ f64        ┆ f64      ┆ f64   ┆ str             │
╞════════════════╪═════════════════════╪══════════╪═════════════╪═══════════╪══════════╪════════════╪══════════╪═══════╪═════════════════╡
│ "TXN-013"      ┆ 2024-03-01 14:00:00 ┆ "S1"     ┆ "R1"        ┆ "Coffee"  ┆ 1        ┆ 6.5        ┆ 0.0      ┆ 6.5   ┆ "LC-1008"       │
│ "TXN-014"      ┆ 2024-03-01 14:15:00 ┆ "S1"     ┆ "R2"        ┆ "Bananas" ┆ 5        ┆ 1.2        ┆ 0.0      ┆ 6.0   ┆ null            │
└────────────────┴─────────────────────┴──────────┴─────────────┴───────────┴──────────┴────────────┴──────────┴───────┴─────────────────┘

As expected, the new entries are the afternoon’s transactions. We can validate our expectation that these are the most recent entries — their timestamps should be ≥ all timestamps in the previous load:

new_entries = comparison.right_only()
min_new = new_entries.select(pl.col("timestamp").min()).item()
max_existing = df_previous.select(pl.col("timestamp").max()).item()
assert min_new >= max_existing

This passes: the earliest new entry (2024-03-01 14:00) is after the latest existing entry (2024-03-01 13:30), confirming that the new load appended transactions chronologically as expected.

Working with large DataFrames#

All methods shown above accept a lazy=True parameter to return a LazyFrame instead of a DataFrame. This is useful when investigating differences between large DataFrames that don’t fit into memory:

# Returns a LazyFrame — no computation happens yet
mismatched_lazy = comparison.joined_unequal("unit_price", lazy=True)

# Only computes when you explicitly collect
mismatched_lazy.filter(pl.col("store_id_left") == "S2").collect()
shape: (3, 19)
┌────────────────┬───────────────┬────────────────┬──────────────────────┬───────────────────────┬──────────────┬───────────────┬───────────────┬────────────────┬──────────────────┬───────────────────┬───────────────┬────────────────┬─────────────────────┬─────────────────────┬────────────┬─────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ discount_left ┆ discount_right ┆ loyalty_card_id_left ┆ loyalty_card_id_right ┆ product_left ┆ product_right ┆ quantity_left ┆ quantity_right ┆ register_id_left ┆ register_id_right ┆ store_id_left ┆ store_id_right ┆ timestamp_left      ┆ timestamp_right     ┆ total_left ┆ total_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---           ┆ ---            ┆ ---                  ┆ ---                   ┆ ---          ┆ ---           ┆ ---           ┆ ---            ┆ ---              ┆ ---               ┆ ---           ┆ ---            ┆ ---                 ┆ ---                 ┆ ---        ┆ ---         ┆ ---             ┆ ---              │
│ str            ┆ f64           ┆ f64            ┆ str                  ┆ str                   ┆ str          ┆ str           ┆ i64           ┆ i64            ┆ str              ┆ str               ┆ str           ┆ str            ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64        ┆ f64         ┆ f64             ┆ f64              │
╞════════════════╪═══════════════╪════════════════╪══════════════════════╪═══════════════════════╪══════════════╪═══════════════╪═══════════════╪════════════════╪══════════════════╪═══════════════════╪═══════════════╪════════════════╪═════════════════════╪═════════════════════╪════════════╪═════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 0.0           ┆ 0.2            ┆ "LC-2001"            ┆ "LC-2001"             ┆ "Apples"     ┆ "Apples"      ┆ 4             ┆ 4              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 11:20:00 ┆ 2024-03-01 11:20:02 ┆ 3.0        ┆ 5.8         ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 0.5           ┆ 1.0            ┆ "LC-2002"            ┆ "LC-2002"             ┆ "Chicken"    ┆ "Chicken"     ┆ 2             ┆ 2              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 11:45:00 ┆ 2024-03-01 11:45:03 ┆ 10.3       ┆ 20.6        ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.0           ┆ 0.2            ┆ null                 ┆ null                  ┆ "Rice"       ┆ "Rice"        ┆ 1             ┆ 1              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 12:10:00 ┆ 2024-03-01 12:10:02 ┆ 0.75       ┆ 1.3         ┆ 0.75            ┆ 1.5              │
└────────────────┴───────────────┴────────────────┴──────────────────────┴───────────────────────┴──────────────┴───────────────┴───────────────┴────────────────┴──────────────────┴───────────────────┴───────────────┴────────────────┴─────────────────────┴─────────────────────┴────────────┴─────────────┴─────────────────┴──────────────────┘

See also

The DataFrameComparison API reference for the full list of available methods.