Quickstart

Imagine a supermarket’s data pipeline delivers updated sales data on a regular schedule. Each load is a snapshot of recent transactions. To validate a new load, we compare it against the previous one.

Let’s load two data frames representing consecutive data loads:

import polars as pl

df_previous = pl.read_csv("features/data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("features/data/current_load.csv", try_parse_dates=True)

Each transaction record contains:

transaction_id: Unique identifier for the transaction (primary key)
timestamp: When the transaction occurred
store_id, register_id: Location identifiers
product: Item purchased
quantity, unit_price, discount, total: Pricing details
loyalty_card_id: Customer’s loyalty card (if any)

Here’s a preview of the previous load:

df_previous.head()

shape: (5, 10)

transaction_id	timestamp	store_id	register_id	product	quantity	unit_price	discount	total	loyalty_card_id
str	datetime[μs]	str	str	str	i64	f64	f64	f64	str
"TXN-001"	2024-03-01 09:01:00	"S1"	"R1"	"Milk"	2	1.5	0.0	3.0	"LC-1001"
"TXN-002"	2024-03-01 09:15:00	"S1"	"R1"	"Bread"	1	2.8	0.0	2.8	null
"TXN-003"	2024-03-01 10:02:00	"S1"	"R2"	"Eggs"	1	3.2	0.0	3.2	"LC-1003"
"TXN-004"	2024-03-01 10:30:00	"S1"	"R2"	"Butter"	3	1.9	0.1	5.6	"LC-9999"
"TXN-005"	2024-03-01 11:00:00	"S1"	"R1"	"Cheese"	1	4.5	0.0	4.5	null

Basic comparison

To get started with diffly, use the compare_frames() function to compare two polars data frames:

from diffly import compare_frames

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")

This creates a DataFrameComparison object that can be used to explore the differences between the two data frames.

Checking equality

The simplest check is whether the two data frames are equal:

if comparison.equal():
    print("Data frames are identical!")
else:
    print("Data frames have differences")

Data frames have differences

Generating a summary

To understand what changed, generate a detailed summary:

comparison.summary()

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘

The summary shows:

Schemas match — both data frames have the same columns and types.
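Row counts: each data frame has 12 rows. Of these, 2 appear only in the left frame, 2 appear only in the right frame, and 10 are joined on transaction_id (6 equal, 4 unequal).
Column match rates, measured over the 10 joined rows: unit_price, discount, timestamp, and total at 70%, loyalty_card_id at 90%, and the remaining columns at 100%.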

Investigating differences

Beyond the summary, diffly provides methods to retrieve the actual rows that differ. For example, to see the rows where unit_price changed, with store_id included from both sides for context:

comparison.joined_unequal("unit_price", select=["store_id"])

shape: (3, 5)

transaction_id	store_id_left	store_id_right	unit_price_left	unit_price_right
str	str	str	f64	f64
"TXN-006"	"S2"	"S2"	1.5	0.75
"TXN-007"	"S2"	"S2"	10.8	5.4
"TXN-008"	"S2"	"S2"	1.5	0.75

All three affected rows come from store S2 — suggesting a systematic issue with that store’s data rather than random corruption.

Working without a primary key

If your data frames don’t have a natural primary key, you can still perform comparisons:

comparison_no_pk = compare_frames(df_previous, df_current)

if comparison_no_pk.equal():
    print("Data frames are identical")
else:
    print("Data frames have differences")

Data frames have differences

Without a primary key, diffly can only check:

Schema equality (column names and types)
Overall equality (including row order)

Row-level comparisons and detailed summaries require a primary key to join the data frames.

Note

diffly works seamlessly with both polars.DataFrame and polars.LazyFrame. Internally, it uses lazy evaluation for efficiency, only computing what’s necessary when you request specific comparisons.

Outlook

This concludes the quickstart guide. For more information, explore the Features section or see the API Reference for all available comparison methods and configuration options.