Investigating Differences#
The summary provides a high-level overview of differences between two data frames. But to understand why they differ — and whether the differences point to a systematic issue — you need to drill deeper. This guide walks through diffly’s investigation API.
We continue with the supermarket data pipeline scenario from the quickstart:
import polars as pl
from diffly import compare_frames

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")
The summary showed us that unit_price, discount, timestamp, and total each match in only 70% of rows, while loyalty_card_id matches in 90%. But are the pricing mismatches happening in the same rows or in different ones? If they're independent, we might have separate issues. If they're correlated, we likely have one root cause. Let's find out.
Investigating column mismatches#
joined_unequal() returns rows where values differ in the specified columns. Let’s start by looking at unit_price:
comparison.joined_unequal("unit_price", select="subset")
| transaction_id | unit_price_left | unit_price_right |
|---|---|---|
| str | f64 | f64 |
| "TXN-006" | 0.75 | 1.5 |
| "TXN-007" | 5.4 | 10.8 |
| "TXN-008" | 0.75 | 1.5 |
Three rows have different unit prices: TXN-006, TXN-007, and TXN-008. Now let’s check discount:
comparison.joined_unequal("discount", select="subset")
| transaction_id | discount_left | discount_right |
|---|---|---|
| str | f64 | f64 |
| "TXN-006" | 0.0 | 0.2 |
| "TXN-007" | 0.5 | 1.0 |
| "TXN-008" | 0.0 | 0.2 |
The exact same three rows. We can confirm this directly by passing all three columns at once — this returns rows where any of the specified columns differ:
comparison.joined_unequal("unit_price", "discount", "total", select="subset")
| transaction_id | total_left | total_right | discount_left | discount_right | unit_price_left | unit_price_right |
|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 |
| "TXN-006" | 3.0 | 5.8 | 0.0 | 0.2 | 0.75 | 1.5 |
| "TXN-007" | 10.3 | 20.6 | 0.5 | 1.0 | 5.4 | 10.8 |
| "TXN-008" | 0.75 | 1.3 | 0.0 | 0.2 | 0.75 | 1.5 |
Still exactly three rows, confirming that the unit_price, discount, and total mismatches are fully correlated. These aren't three independent problems; they share a single root cause.
Now let’s use the select parameter to pull in additional context. By passing a list, we can include columns beyond those being checked for mismatches:
comparison.joined_unequal("unit_price", select=["store_id", "register_id"])
| transaction_id | store_id_left | store_id_right | register_id_left | register_id_right | unit_price_left | unit_price_right |
|---|---|---|---|---|---|---|
| str | str | str | str | str | f64 | f64 |
| "TXN-006" | "S2" | "S2" | "R4" | "R4" | 0.75 | 1.5 |
| "TXN-007" | "S2" | "S2" | "R4" | "R4" | 5.4 | 10.8 |
| "TXN-008" | "S2" | "S2" | "R4" | "R4" | 0.75 | 1.5 |
This reveals that all three affected rows share store_id = "S2" and register_id = "R4". The data load for store S2’s register R4 appears to have applied incorrect pricing — the unit prices in the new load are exactly double the originals, and the totals were recomputed from those wrong prices.
Seeing common change patterns#
change_counts() shows the most frequent value transitions for a column, confirming systematic patterns:
comparison.change_counts("unit_price")
| left | right | count |
|---|---|---|
| f64 | f64 | u32 |
| 0.75 | 1.5 | 2 |
| 5.4 | 10.8 | 1 |
The most common change (0.75 → 1.5) appears in two transactions, and the remaining change (5.4 → 10.8) follows the same doubling pattern — a clear sign of a systematic issue rather than random data corruption. With a larger dataset, you might see a single dominant change pattern (e.g., 2.99 → 5.98 appearing 500 times), making the doubling pattern immediately obvious.
Checking the loyalty card mismatch#
We’ve explained the pricing discrepancies, but what about the loyalty_card_id mismatch? Let’s check if it’s related:
comparison.joined_unequal("loyalty_card_id", select="subset")
| transaction_id | loyalty_card_id_left | loyalty_card_id_right |
|---|---|---|
| str | str | str |
| "TXN-004" | "LC-1004" | "LC-9999" |
This affects a completely different row (TXN-004) — it’s unrelated to the pricing issue. The card ID changed from LC-1004 to LC-9999, but the reason isn’t clear from the data alone. It could be a customer updating their loyalty card, a manual data correction, or something else entirely. Not every discrepancy has an obvious explanation, and that’s fine — the important thing is that we’ve confirmed it’s a separate issue from the systematic pricing problem.
Investigating extra or missing rows#
Removed entries#
left_only() returns rows that were in the previous load but not the new one:
comparison.left_only()
| transaction_id | timestamp | store_id | register_id | product | quantity | unit_price | discount | total | loyalty_card_id |
|---|---|---|---|---|---|---|---|---|---|
| str | datetime[μs] | str | str | str | i64 | f64 | f64 | f64 | str |
| "TXN-011" | 2024-03-01 08:00:00 | "S2" | "R3" | "Soap" | 1 | 2.5 | 0.0 | 2.5 | "LC-2004" |
| "TXN-012" | 2024-03-01 08:20:00 | "S2" | "R3" | "Pasta" | 2 | 1.8 | 0.0 | 3.6 | "LC-2005" |
Both removed entries are from store S2 — the same store that has price differences. These early-morning transactions from register R3 were apparently identified as incorrect and removed in the new load.
New entries#
right_only() returns rows that exist only in the new load:
comparison.right_only()
| transaction_id | timestamp | store_id | register_id | product | quantity | unit_price | discount | total | loyalty_card_id |
|---|---|---|---|---|---|---|---|---|---|
| str | datetime[μs] | str | str | str | i64 | f64 | f64 | f64 | str |
| "TXN-013" | 2024-03-01 14:00:00 | "S1" | "R1" | "Coffee" | 1 | 6.5 | 0.0 | 6.5 | "LC-1008" |
| "TXN-014" | 2024-03-01 14:15:00 | "S1" | "R2" | "Bananas" | 5 | 1.2 | 0.0 | 6.0 | null |
As expected, the new entries are the afternoon’s transactions. We can validate our expectation that these are the most recent entries — their timestamps should be ≥ all timestamps in the previous load:
new_entries = comparison.right_only()
min_new = new_entries.select(pl.col("timestamp").min()).item()
max_existing = df_previous.select(pl.col("timestamp").max()).item()
assert min_new >= max_existing
This passes: the earliest new entry (2024-03-01 14:00) is after the latest existing entry (2024-03-01 13:30), confirming that the new load appended transactions chronologically as expected.
Working with large DataFrames#
All methods shown above accept a lazy=True parameter to return a LazyFrame instead of a DataFrame. This is useful when investigating differences between large DataFrames that don’t fit into memory:
# Returns a LazyFrame — no computation happens yet
mismatched_lazy = comparison.joined_unequal("unit_price", lazy=True)
# Only computes when you explicitly collect
mismatched_lazy.filter(pl.col("store_id_left") == "S2").collect()
| transaction_id | discount_left | discount_right | loyalty_card_id_left | loyalty_card_id_right | product_left | product_right | quantity_left | quantity_right | register_id_left | register_id_right | store_id_left | store_id_right | timestamp_left | timestamp_right | total_left | total_right | unit_price_left | unit_price_right |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | str | str | str | str | i64 | i64 | str | str | str | str | datetime[μs] | datetime[μs] | f64 | f64 | f64 | f64 |
| "TXN-006" | 0.0 | 0.2 | "LC-2001" | "LC-2001" | "Apples" | "Apples" | 4 | 4 | "R4" | "R4" | "S2" | "S2" | 2024-03-01 11:20:00 | 2024-03-01 11:20:02 | 3.0 | 5.8 | 0.75 | 1.5 |
| "TXN-007" | 0.5 | 1.0 | "LC-2002" | "LC-2002" | "Chicken" | "Chicken" | 2 | 2 | "R4" | "R4" | "S2" | "S2" | 2024-03-01 11:45:00 | 2024-03-01 11:45:03 | 10.3 | 20.6 | 5.4 | 10.8 |
| "TXN-008" | 0.0 | 0.2 | null | null | "Rice" | "Rice" | 1 | 1 | "R4" | "R4" | "S2" | "S2" | 2024-03-01 12:10:00 | 2024-03-01 12:10:02 | 0.75 | 1.3 | 0.75 | 1.5 |
See also
The DataFrameComparison API reference for the full list of available methods.