Customizing the Summary#
The summary() method provides several options to control what information is displayed.
Scenario#
We continue with the supermarket data pipeline scenario from the quickstart. To validate a new load, we compare it against the previous one and check that:
New transactions have been added (
right_onlyis non-empty).Old or incorrect entries have been removed (
left_onlyis non-empty).The new entries are the most recent (their timestamps are ≥ everything else).
Existing transactions that appear in both loads still match.
Let’s set up two data frames representing consecutive data loads:
import polars as pl
df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)
Here’s a preview of the previous load:
df_previous.head()
| transaction_id | timestamp | store_id | register_id | product | quantity | unit_price | discount | total | loyalty_card_id |
|---|---|---|---|---|---|---|---|---|---|
| str | datetime[μs] | str | str | str | i64 | f64 | f64 | f64 | str |
| "TXN-001" | 2024-03-01 09:01:00 | "S1" | "R1" | "Milk" | 2 | 1.5 | 0.0 | 3.0 | "LC-1001" |
| "TXN-002" | 2024-03-01 09:15:00 | "S1" | "R1" | "Bread" | 1 | 2.8 | 0.0 | 2.8 | null |
| "TXN-003" | 2024-03-01 10:02:00 | "S1" | "R2" | "Eggs" | 1 | 3.2 | 0.0 | 3.2 | "LC-1003" |
| "TXN-004" | 2024-03-01 10:30:00 | "S1" | "R2" | "Butter" | 3 | 1.9 | 0.1 | 5.6 | "LC-1004" |
| "TXN-005" | 2024-03-01 11:00:00 | "S1" | "R1" | "Cheese" | 1 | 4.5 | 0.0 | 4.5 | null |
Now let’s compare the two loads:
from diffly import compare_frames
comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")
Default summary#
The first step in any investigation is looking at the summary. It tells us:
Schemas match — both data frames have the same columns and types.
Rows: 2 left-only, 2 right-only, 10 joined (6 equal, 4 unequal).
Column match rates:
unit_price,discount,timestamp, andtotalat 70%,loyalty_card_idat 90%. Other columns likeproduct,quantity,store_id, andregister_idare at 100%.
print(comparison.summary())
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Left count Right count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ product │ 100.00% │
│ quantity │ 100.00% │
│ register_id │ 100.00% │
│ store_id │ 100.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴─────────┘
Display options#
Custom names#
Use left_name and right_name to give meaningful labels to each DataFrame. For our scenario, we’re comparing data loads:
print(comparison.summary(left_name="Previous Load", right_name="Current Load"))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Previous Load count Current Load count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 Previous Load only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 Current Load only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ product │ 100.00% │
│ quantity │ 100.00% │
│ register_id │ 100.00% │
│ store_id │ 100.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴─────────┘
Showing column value changes#
Use top_k_column_changes to display specific value changes for columns with mismatches:
print(comparison.summary(top_k_column_changes=3))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Left count Right count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┬─────────────────────────────────────────────────┐
│ discount │ 70.00% │ 0.0 -> 0.2 (2x) │
│ │ │ 0.5 -> 1.0 (1x) │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ loyalty_card_id │ 90.00% │ "LC-1004" -> "LC-9999" (1x) │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ product │ 100.00% │ │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ quantity │ 100.00% │ │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ register_id │ 100.00% │ │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ store_id │ 100.00% │ │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ timestamp │ 70.00% │ 2024-03-01 12:10:00 -> 2024-03-01 12:10:02 (1x) │
│ │ │ 2024-03-01 11:45:00 -> 2024-03-01 11:45:03 (1x) │
│ │ │ 2024-03-01 11:20:00 -> 2024-03-01 11:20:02 (1x) │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ total │ 70.00% │ 10.3 -> 20.6 (1x) │
│ │ │ 3.0 -> 5.8 (1x) │
│ │ │ 0.75 -> 1.3 (1x) │
├─────────────────┼─────────┼─────────────────────────────────────────────────┤
│ unit_price │ 70.00% │ 0.75 -> 1.5 (2x) │
│ │ │ 5.4 -> 10.8 (1x) │
└─────────────────┴─────────┴─────────────────────────────────────────────────┘
Showing sample rows#
Use sample_k_rows_only to display sample primary keys for rows that exist only in one DataFrame:
print(comparison.summary(sample_k_rows_only=5))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Left count Right count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ product │ 100.00% │
│ quantity │ 100.00% │
│ register_id │ 100.00% │
│ store_id │ 100.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴─────────┘
Rows left only
▔▔▔▔▔▔▔▔▔▔▔▔▔▔
┏━━━━━━━━━━━━━━━━┓
┃ transaction_id ┃
┡━━━━━━━━━━━━━━━━┩
│ TXN-011 │
│ TXN-012 │
└────────────────┘
Rows right only
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
┏━━━━━━━━━━━━━━━━┓
┃ transaction_id ┃
┡━━━━━━━━━━━━━━━━┩
│ TXN-013 │
│ TXN-014 │
└────────────────┘
Hiding perfect matches#
For wide data frames, set show_perfect_column_matches=False to hide columns with 100% match rate:
print(comparison.summary(show_perfect_column_matches=False))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Diffly Summary ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Primary key: transaction_id
Schemas
▔▔▔▔▔▔▔
Schemas match exactly (column count: 10).
Rows
▔▔▔▔
Left count Right count
12 (no change) 12
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴────────┘
Slim mode#
Slim mode generates a minimal summary showing only sections that contain differences:
print(comparison.summary(slim=True))
Rows
▔▔▔▔
┏━┯━┯━┯━┯━┓
┃-│-│-│-│-┃ 2 left only (16.67%)
┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃ 6 equal (60.00%) │
┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴ 10 joined
┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃ 4 unequal (40.00%) │
┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
┃+│+│+│+│+┃ 2 right only (16.67%)
┗━┷━┷━┷━┷━┛
Columns
▔▔▔▔▔▔▔
┌─────────────────┬─────────┐
│ discount │ 70.00% │
│ loyalty_card_id │ 90.00% │
│ product │ 100.00% │
│ quantity │ 100.00% │
│ register_id │ 100.00% │
│ store_id │ 100.00% │
│ timestamp │ 70.00% │
│ total │ 70.00% │
│ unit_price │ 70.00% │
└─────────────────┴─────────┘