Investigating Differences#

The summary provides a high-level overview of differences between two data frames. But to understand why they differ — and whether the differences point to a systematic issue — you need to drill deeper. This guide walks through diffly’s investigation API.

We continue with the supermarket data pipeline scenario from the quickstart:

import polars as pl

from diffly import compare_frames

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")

The summary showed us that unit_price, discount, timestamp, and total match at only 70%, while loyalty_card_id matches at 90%. But are the pricing mismatches happening in the same rows, or different ones? If they're independent, we might have separate issues. If they're correlated, we likely have one root cause. Let's find out.

Investigating column mismatches#

joined_unequal() returns rows where values differ in the specified columns. Let’s start by looking at unit_price:

comparison.joined_unequal("unit_price", select="subset")
shape: (3, 3)
┌────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---             ┆ ---              │
│ str            ┆ f64             ┆ f64              │
╞════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.75            ┆ 1.5              │
└────────────────┴─────────────────┴──────────────────┘

Three rows have different unit prices: TXN-006, TXN-007, and TXN-008. Now let’s check discount:

comparison.joined_unequal("discount", select="subset")
shape: (3, 3)
┌────────────────┬───────────────┬────────────────┐
│ transaction_id ┆ discount_left ┆ discount_right │
│ ---            ┆ ---           ┆ ---            │
│ str            ┆ f64           ┆ f64            │
╞════════════════╪═══════════════╪════════════════╡
│ "TXN-006"      ┆ 0.0           ┆ 0.2            │
│ "TXN-007"      ┆ 0.5           ┆ 1.0            │
│ "TXN-008"      ┆ 0.0           ┆ 0.2            │
└────────────────┴───────────────┴────────────────┘

The exact same three rows. We can confirm this directly by passing all three columns at once — this returns rows where any of the specified columns differ:

comparison.joined_unequal("unit_price", "discount", "total", select="subset")
shape: (3, 7)
┌────────────────┬────────────┬─────────────┬───────────────┬────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ total_left ┆ total_right ┆ discount_left ┆ discount_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---        ┆ ---         ┆ ---           ┆ ---            ┆ ---             ┆ ---              │
│ str            ┆ f64        ┆ f64         ┆ f64           ┆ f64            ┆ f64             ┆ f64              │
╞════════════════╪════════════╪═════════════╪═══════════════╪════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 3.0        ┆ 5.8         ┆ 0.0           ┆ 0.2            ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 10.3       ┆ 20.6        ┆ 0.5           ┆ 1.0            ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.75       ┆ 1.3         ┆ 0.0           ┆ 0.2            ┆ 0.75            ┆ 1.5              │
└────────────────┴────────────┴─────────────┴───────────────┴────────────────┴─────────────────┴──────────────────┘

Still exactly three rows, confirming that the unit_price, discount, and total mismatches are fully correlated. These aren't three independent problems; they're one.

Now let’s use the select parameter to pull in additional context. By passing a list, we can include columns beyond those being checked for mismatches:

comparison.joined_unequal("unit_price", select=["store_id", "register_id"])
shape: (3, 7)
┌────────────────┬───────────────┬────────────────┬──────────────────┬───────────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ store_id_left ┆ store_id_right ┆ register_id_left ┆ register_id_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---           ┆ ---            ┆ ---              ┆ ---               ┆ ---             ┆ ---              │
│ str            ┆ str           ┆ str            ┆ str              ┆ str               ┆ f64             ┆ f64              │
╞════════════════╪═══════════════╪════════════════╪══════════════════╪═══════════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ "S2"          ┆ "S2"           ┆ "R4"             ┆ "R4"              ┆ 0.75            ┆ 1.5              │
└────────────────┴───────────────┴────────────────┴──────────────────┴───────────────────┴─────────────────┴──────────────────┘

This reveals that all three affected rows share store_id = "S2" and register_id = "R4". The data load for store S2’s register R4 appears to have applied incorrect pricing — the unit prices in the new load are exactly double the originals, and the totals were recomputed from those wrong prices.

Seeing common change patterns#

change_counts() shows the most frequent value transitions for a column, confirming systematic patterns:

comparison.change_counts("unit_price")
shape: (2, 3)
┌──────┬───────┬───────┐
│ left ┆ right ┆ count │
│ ---  ┆ ---   ┆ ---   │
│ f64  ┆ f64   ┆ u32   │
╞══════╪═══════╪═══════╡
│ 0.75 ┆ 1.5   ┆ 2     │
│ 5.4  ┆ 10.8  ┆ 1     │
└──────┴───────┴───────┘

The most common change (0.75 → 1.5) appears in two transactions, and the remaining change (5.4 → 10.8) follows the same doubling pattern — a clear sign of a systematic issue rather than random data corruption. With a larger dataset, you might see a single dominant change pattern (e.g., 2.99 → 5.98 appearing 500 times), making the doubling pattern immediately obvious.

Checking the loyalty card mismatch#

We’ve explained the pricing discrepancies, but what about the loyalty_card_id mismatch? Let’s check if it’s related:

comparison.joined_unequal("loyalty_card_id", select="subset")
shape: (1, 3)
┌────────────────┬──────────────────────┬───────────────────────┐
│ transaction_id ┆ loyalty_card_id_left ┆ loyalty_card_id_right │
│ ---            ┆ ---                  ┆ ---                   │
│ str            ┆ str                  ┆ str                   │
╞════════════════╪══════════════════════╪═══════════════════════╡
│ "TXN-004"      ┆ "LC-1004"            ┆ "LC-9999"             │
└────────────────┴──────────────────────┴───────────────────────┘

This affects a completely different row (TXN-004) — it’s unrelated to the pricing issue. The card ID changed from LC-1004 to LC-9999, but the reason isn’t clear from the data alone. It could be a customer updating their loyalty card, a manual data correction, or something else entirely. Not every discrepancy has an obvious explanation, and that’s fine — the important thing is that we’ve confirmed it’s a separate issue from the systematic pricing problem.

Investigating extra or missing rows#

Removed entries#

left_only() returns rows that were in the previous load but not the new one:

comparison.left_only()
shape: (2, 10)
┌────────────────┬─────────────────────┬──────────┬─────────────┬─────────┬──────────┬────────────┬──────────┬───────┬─────────────────┐
│ transaction_id ┆ timestamp           ┆ store_id ┆ register_id ┆ product ┆ quantity ┆ unit_price ┆ discount ┆ total ┆ loyalty_card_id │
│ ---            ┆ ---                 ┆ ---      ┆ ---         ┆ ---     ┆ ---      ┆ ---        ┆ ---      ┆ ---   ┆ ---             │
│ str            ┆ datetime[μs]        ┆ str      ┆ str         ┆ str     ┆ i64      ┆ f64        ┆ f64      ┆ f64   ┆ str             │
╞════════════════╪═════════════════════╪══════════╪═════════════╪═════════╪══════════╪════════════╪══════════╪═══════╪═════════════════╡
│ "TXN-011"      ┆ 2024-03-01 08:00:00 ┆ "S2"     ┆ "R3"        ┆ "Soap"  ┆ 1        ┆ 2.5        ┆ 0.0      ┆ 2.5   ┆ "LC-2004"       │
│ "TXN-012"      ┆ 2024-03-01 08:20:00 ┆ "S2"     ┆ "R3"        ┆ "Pasta" ┆ 2        ┆ 1.8        ┆ 0.0      ┆ 3.6   ┆ "LC-2005"       │
└────────────────┴─────────────────────┴──────────┴─────────────┴─────────┴──────────┴────────────┴──────────┴───────┴─────────────────┘

Both removed entries are from store S2 — the same store that has price differences. These early-morning transactions from register R3 were apparently identified as incorrect and removed in the new load.

New entries#

right_only() returns rows that exist only in the new load:

comparison.right_only()
shape: (2, 10)
┌────────────────┬─────────────────────┬──────────┬─────────────┬───────────┬──────────┬────────────┬──────────┬───────┬─────────────────┐
│ transaction_id ┆ timestamp           ┆ store_id ┆ register_id ┆ product   ┆ quantity ┆ unit_price ┆ discount ┆ total ┆ loyalty_card_id │
│ ---            ┆ ---                 ┆ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---        ┆ ---      ┆ ---   ┆ ---             │
│ str            ┆ datetime[μs]        ┆ str      ┆ str         ┆ str       ┆ i64      ┆ f64        ┆ f64      ┆ f64   ┆ str             │
╞════════════════╪═════════════════════╪══════════╪═════════════╪═══════════╪══════════╪════════════╪══════════╪═══════╪═════════════════╡
│ "TXN-013"      ┆ 2024-03-01 14:00:00 ┆ "S1"     ┆ "R1"        ┆ "Coffee"  ┆ 1        ┆ 6.5        ┆ 0.0      ┆ 6.5   ┆ "LC-1008"       │
│ "TXN-014"      ┆ 2024-03-01 14:15:00 ┆ "S1"     ┆ "R2"        ┆ "Bananas" ┆ 5        ┆ 1.2        ┆ 0.0      ┆ 6.0   ┆ null            │
└────────────────┴─────────────────────┴──────────┴─────────────┴───────────┴──────────┴────────────┴──────────┴───────┴─────────────────┘

As expected, the new entries are the afternoon’s transactions. We can validate our expectation that these are the most recent entries — their timestamps should be ≥ all timestamps in the previous load:

new_entries = comparison.right_only()
min_new = new_entries.select(pl.col("timestamp").min()).item()
max_existing = df_previous.select(pl.col("timestamp").max()).item()
assert min_new >= max_existing

This passes: the earliest new entry (2024-03-01 14:00) is after the latest existing entry (2024-03-01 13:30), confirming that the new load appended transactions chronologically as expected.

Working with large DataFrames#

All methods shown above accept a lazy=True parameter to return a LazyFrame instead of a DataFrame. This is useful when investigating differences between large DataFrames that don’t fit into memory:

# Returns a LazyFrame — no computation happens yet
mismatched_lazy = comparison.joined_unequal("unit_price", lazy=True)

# Only computes when you explicitly collect
mismatched_lazy.filter(pl.col("store_id_left") == "S2").collect()
shape: (3, 19)
┌────────────────┬───────────────┬────────────────┬──────────────────────┬───────────────────────┬──────────────┬───────────────┬───────────────┬────────────────┬──────────────────┬───────────────────┬───────────────┬────────────────┬─────────────────────┬─────────────────────┬────────────┬─────────────┬─────────────────┬──────────────────┐
│ transaction_id ┆ discount_left ┆ discount_right ┆ loyalty_card_id_left ┆ loyalty_card_id_right ┆ product_left ┆ product_right ┆ quantity_left ┆ quantity_right ┆ register_id_left ┆ register_id_right ┆ store_id_left ┆ store_id_right ┆ timestamp_left      ┆ timestamp_right     ┆ total_left ┆ total_right ┆ unit_price_left ┆ unit_price_right │
│ ---            ┆ ---           ┆ ---            ┆ ---                  ┆ ---                   ┆ ---          ┆ ---           ┆ ---           ┆ ---            ┆ ---              ┆ ---               ┆ ---           ┆ ---            ┆ ---                 ┆ ---                 ┆ ---        ┆ ---         ┆ ---             ┆ ---              │
│ str            ┆ f64           ┆ f64            ┆ str                  ┆ str                   ┆ str          ┆ str           ┆ i64           ┆ i64            ┆ str              ┆ str               ┆ str           ┆ str            ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64        ┆ f64         ┆ f64             ┆ f64              │
╞════════════════╪═══════════════╪════════════════╪══════════════════════╪═══════════════════════╪══════════════╪═══════════════╪═══════════════╪════════════════╪══════════════════╪═══════════════════╪═══════════════╪════════════════╪═════════════════════╪═════════════════════╪════════════╪═════════════╪═════════════════╪══════════════════╡
│ "TXN-006"      ┆ 0.0           ┆ 0.2            ┆ "LC-2001"            ┆ "LC-2001"             ┆ "Apples"     ┆ "Apples"      ┆ 4             ┆ 4              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 11:20:00 ┆ 2024-03-01 11:20:02 ┆ 3.0        ┆ 5.8         ┆ 0.75            ┆ 1.5              │
│ "TXN-007"      ┆ 0.5           ┆ 1.0            ┆ "LC-2002"            ┆ "LC-2002"             ┆ "Chicken"    ┆ "Chicken"     ┆ 2             ┆ 2              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 11:45:00 ┆ 2024-03-01 11:45:03 ┆ 10.3       ┆ 20.6        ┆ 5.4             ┆ 10.8             │
│ "TXN-008"      ┆ 0.0           ┆ 0.2            ┆ null                 ┆ null                  ┆ "Rice"       ┆ "Rice"        ┆ 1             ┆ 1              ┆ "R4"             ┆ "R4"              ┆ "S2"          ┆ "S2"           ┆ 2024-03-01 12:10:00 ┆ 2024-03-01 12:10:02 ┆ 0.75       ┆ 1.3         ┆ 0.75            ┆ 1.5              │
└────────────────┴───────────────┴────────────────┴──────────────────────┴───────────────────────┴──────────────┴───────────────┴───────────────┴────────────────┴──────────────────┴───────────────────┴───────────────┴────────────────┴─────────────────────┴─────────────────────┴────────────┴─────────────┴─────────────────┴──────────────────┘

See also

The DataFrameComparison API reference for the full list of available methods.