Quickstart
Imagine a supermarket’s data pipeline delivers updated sales data on a regular schedule. Each load is a snapshot of recent transactions. To validate a new load, we compare it against the previous one.
Let’s load two data frames representing consecutive data loads:
import polars as pl
df_previous = pl.read_csv("features/data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("features/data/current_load.csv", try_parse_dates=True)
Each transaction record contains:
transaction_id: Unique identifier for the transaction (primary key)
timestamp: When the transaction occurred
store_id, register_id: Location identifiers
product: Item purchased
quantity, unit_price, discount, total: Pricing details
loyalty_card_id: Customer’s loyalty card (if any)
Here’s a preview of the previous load:
df_previous.head()
| transaction_id | timestamp | store_id | register_id | product | quantity | unit_price | discount | total | loyalty_card_id |
|---|---|---|---|---|---|---|---|---|---|
| str | datetime[μs] | str | str | str | i64 | f64 | f64 | f64 | str |
| "TXN-001" | 2024-03-01 09:01:00 | "S1" | "R1" | "Milk" | 2 | 1.5 | 0.0 | 3.0 | "LC-1001" |
| "TXN-002" | 2024-03-01 09:15:00 | "S1" | "R1" | "Bread" | 1 | 2.8 | 0.0 | 2.8 | null |
| "TXN-003" | 2024-03-01 10:02:00 | "S1" | "R2" | "Eggs" | 1 | 3.2 | 0.0 | 3.2 | "LC-1003" |
| "TXN-004" | 2024-03-01 10:30:00 | "S1" | "R2" | "Butter" | 3 | 1.9 | 0.1 | 5.6 | "LC-9999" |
| "TXN-005" | 2024-03-01 11:00:00 | "S1" | "R1" | "Cheese" | 1 | 4.5 | 0.0 | 4.5 | null |
Basic comparison
To get started with diffly, use the compare_frames() function to compare two polars data frames:
from diffly import compare_frames
comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")
This creates a DataFrameComparison object that can be used to explore the differences between the two data frames.
Checking equality
The simplest check is whether the two data frames are equal:
if comparison.equal():
    print("Data frames are identical!")
else:
    print("Data frames have differences")
Data frames have differences
Generating a summary
To understand what changed, generate a detailed summary:
comparison.summary()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Left count Right count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ product │ 100.00% │
│ quantity │ 100.00% │
│ register_id │ 100.00% │
│ store_id │ 100.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴─────────┘
The summary shows:
Schemas match — both data frames have the same columns and types.
Row counts: 2 left-only rows, 2 right-only rows, and 10 joined rows (6 equal, 4 unequal).
Column match rates: unit_price, discount, timestamp, and total at 70%, loyalty_card_id at 90%, and the remaining columns at 100%.
Investigating differences
Beyond the summary, diffly provides methods to retrieve the actual rows that differ. For example, to see rows where unit_price changed:
comparison.joined_unequal("unit_price", select=["store_id"])
| transaction_id | store_id_left | store_id_right | unit_price_left | unit_price_right |
|---|---|---|---|---|
| str | str | str | f64 | f64 |
| "TXN-006" | "S2" | "S2" | 1.5 | 0.75 |
| "TXN-007" | "S2" | "S2" | 10.8 | 5.4 |
| "TXN-008" | "S2" | "S2" | 1.5 | 0.75 |
All three affected rows come from store S2 — suggesting a systematic issue with that store’s data rather than random corruption.
Working without a primary key
If your data frames don’t have a natural primary key, you can still perform comparisons:
comparison_no_pk = compare_frames(df_previous, df_current)
if comparison_no_pk.equal():
    print("Data frames are identical")
else:
    print("Data frames have differences")
Data frames have differences
Without a primary key, diffly can only check:
Schema equality (column names and types)
Overall equality (including row order)
Row-level comparisons and detailed summaries require a primary key to join the data frames.
Note
diffly accepts both polars.DataFrame and polars.LazyFrame inputs. Internally, it uses lazy evaluation for efficiency, computing only what is needed for the comparison you request.
Outlook
This concludes the quickstart guide. For more information, explore the Features section or see the API Reference for all available comparison methods and configuration options.