Quickstart

Imagine a supermarket’s data pipeline delivers updated sales data on a regular schedule. Each load is a snapshot of recent transactions. To validate a new load, we compare it against the previous one.

Let’s load two data frames representing consecutive data loads:

import polars as pl

df_previous = pl.read_csv("features/data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("features/data/current_load.csv", try_parse_dates=True)

Each transaction record contains:

  • transaction_id: Unique identifier for the transaction (primary key)

  • timestamp: When the transaction occurred

  • store_id, register_id: Location identifiers

  • product: Item purchased

  • quantity, unit_price, discount, total: Pricing details

  • loyalty_card_id: Customer’s loyalty card (if any)

Here’s a preview of the previous load:

df_previous.head()
shape: (5, 10)
transaction_id  timestamp            store_id  register_id  product   quantity  unit_price  discount  total  loyalty_card_id
str             datetime[μs]         str       str          str       i64       f64         f64       f64    str
"TXN-001"       2024-03-01 09:01:00  "S1"      "R1"         "Milk"    2         1.5         0.0       3.0    "LC-1001"
"TXN-002"       2024-03-01 09:15:00  "S1"      "R1"         "Bread"   1         2.8         0.0       2.8    null
"TXN-003"       2024-03-01 10:02:00  "S1"      "R2"         "Eggs"    1         3.2         0.0       3.2    "LC-1003"
"TXN-004"       2024-03-01 10:30:00  "S1"      "R2"         "Butter"  3         1.9         0.1       5.6    "LC-9999"
"TXN-005"       2024-03-01 11:00:00  "S1"      "R1"         "Cheese"  1         4.5         0.0       4.5    null

Basic comparison

To get started with diffly, use the compare_frames() function to compare two polars data frames:

from diffly import compare_frames

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")

This creates a DataFrameComparison object that can be used to explore the differences between the two data frames.

Checking equality

The simplest check is whether the two data frames are equal:

if comparison.equal():
    print("Data frames are identical!")
else:
    print("Data frames have differences")
Data frames have differences
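In an automated pipeline, the same check can be turned into a hard failure instead of a printed message. The following is a minimal sketch, not part of diffly: `LoadValidationError` and `require_equal` are hypothetical names introduced here, and the function only assumes that the object passed in exposes the `equal()` method shown above.

```python
class LoadValidationError(Exception):
    """Hypothetical error type for a failed load comparison."""


def require_equal(comparison, label="load"):
    """Raise if the comparison reports any difference.

    `comparison` is assumed to be any object with an `equal()` method that
    returns a bool, such as the result of compare_frames().
    """
    if not comparison.equal():
        raise LoadValidationError(f"{label}: data frames have differences")


# Usage with the comparison built earlier (would raise for the example data):
# require_equal(comparison, label="sales")
```

This is only useful where any difference at all should stop the run; for loads that are expected to change, the summary and row-level methods are the better tools.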

Generating a summary

To understand what changed, generate a detailed summary:

comparison.summary()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘

The summary shows:

  • Schemas match — both data frames have the same columns and types.

  • Row counts: 2 left-only rows, 2 right-only rows, and 10 joined rows (6 equal, 4 unequal).

  • Column match rates: unit_price, discount, timestamp, and total at 70%, loyalty_card_id at 90%, and the remaining columns at 100%.
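To make the row categories and match rates concrete, here is a conceptual sketch in plain Python. It is not how diffly is implemented; `summarize` and the toy data are hypothetical and only illustrate the definitions. The percentages in the summary are consistent with this reading: 2 left-only rows out of 12 left rows is 16.67%, and 6 equal rows out of 10 joined rows is 60%.

```python
# Conceptual sketch in plain Python; not how diffly is implemented.

def summarize(left, right):
    """Partition rows by primary key and compute per-column match rates.

    `left` and `right` map a primary-key value to a dict of the other columns.
    """
    left_only = left.keys() - right.keys()
    right_only = right.keys() - left.keys()
    joined = left.keys() & right.keys()
    equal = {key for key in joined if left[key] == right[key]}

    # Match rate of a column: share of joined rows whose values agree.
    columns = sorted({col for key in joined for col in left[key]})
    match_rates = {
        col: sum(left[key][col] == right[key][col] for key in joined) / len(joined)
        for col in columns
    }
    return {
        "left_only": len(left_only),
        "right_only": len(right_only),
        "joined": len(joined),
        "equal": len(equal),
        "unequal": len(joined) - len(equal),
        "match_rates": match_rates,
    }


# Toy data (hypothetical), keyed by a primary key.
left = {
    "A": {"product": "Milk", "unit_price": 1.5},
    "B": {"product": "Bread", "unit_price": 2.8},
    "C": {"product": "Eggs", "unit_price": 3.2},
}
right = {
    "B": {"product": "Bread", "unit_price": 2.8},
    "C": {"product": "Eggs", "unit_price": 1.6},
    "D": {"product": "Butter", "unit_price": 1.9},
}

result = summarize(left, right)
print(result)
```

Here "A" is left-only, "D" is right-only, and of the two joined rows only "B" is equal, so `unit_price` matches in half of the joined rows while `product` matches in all of them.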

Investigating differences

Beyond the summary, diffly provides methods to retrieve the actual rows that differ. For example, to see rows where unit_price changed:

comparison.joined_unequal("unit_price", select=["store_id"])
shape: (3, 5)
transaction_id  store_id_left  store_id_right  unit_price_left  unit_price_right
str             str            str             f64              f64
"TXN-006"       "S2"           "S2"            1.5              0.75
"TXN-007"       "S2"           "S2"            10.8             5.4
"TXN-008"       "S2"           "S2"            1.5              0.75

All three affected rows come from store S2 — suggesting a systematic issue with that store’s data rather than random corruption.
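One way to probe whether the change is systematic is to look at the ratio between the new and old values. The sketch below uses plain Python with the three rows copied by hand from the output above; it is an illustration, not a diffly feature.

```python
import math

# (transaction_id, unit_price_left, unit_price_right), copied from the output above.
changed = [
    ("TXN-006", 1.5, 0.75),
    ("TXN-007", 10.8, 5.4),
    ("TXN-008", 1.5, 0.75),
]

# Ratio of the new price to the old price for each changed row.
ratios = {txn: right / left for txn, left, right in changed}

# True if every changed price is (approximately) half of its previous value.
all_halved = all(math.isclose(ratio, 0.5) for ratio in ratios.values())

print(ratios)
print(all_halved)
```

In this data every changed price is half of its previous value, which fits the reading of a single systematic cause at store S2.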

Working without a primary key

If your data frames don’t have a natural primary key, you can still compare them by omitting the primary_key argument:

comparison_no_pk = compare_frames(df_previous, df_current)

if comparison_no_pk.equal():
    print("Data frames are identical")
else:
    print("Data frames have differences")
Data frames have differences

Without a primary key, diffly can only check:

  • Schema equality (column names and types)

  • Overall equality (including row order)

Row-level comparisons and detailed summaries require a primary key to join the data frames.

Note

diffly accepts both polars.DataFrame and polars.LazyFrame inputs. Internally, it uses lazy evaluation and computes only what a requested comparison needs.

Outlook

This concludes the quickstart guide. For more information, explore the Features section or see the API Reference for all available comparison methods and configuration options.