Customizing the Summary#

The summary() method provides several options to control what information is displayed.

Scenario#

We continue with the supermarket data pipeline scenario from the quickstart. To validate a new load, we compare it against the previous one and check that:

  • New transactions have been added (right_only is non-empty).

  • Old or incorrect entries have been removed (left_only is non-empty).

  • The new entries are the most recent (their timestamps are ≥ everything else).

  • Existing transactions that appear in both loads still match.

Let’s set up two data frames representing consecutive data loads:

import polars as pl

df_previous = pl.read_csv("data/previous_load.csv", try_parse_dates=True)
df_current = pl.read_csv("data/current_load.csv", try_parse_dates=True)

Here’s a preview of the previous load:

df_previous.head()
shape: (5, 10)
transaction_idtimestampstore_idregister_idproductquantityunit_pricediscounttotalloyalty_card_id
strdatetime[μs]strstrstri64f64f64f64str
"TXN-001"2024-03-01 09:01:00"S1""R1""Milk"21.50.03.0"LC-1001"
"TXN-002"2024-03-01 09:15:00"S1""R1""Bread"12.80.02.8null
"TXN-003"2024-03-01 10:02:00"S1""R2""Eggs"13.20.03.2"LC-1003"
"TXN-004"2024-03-01 10:30:00"S1""R2""Butter"31.90.15.6"LC-1004"
"TXN-005"2024-03-01 11:00:00"S1""R1""Cheese"14.50.04.5null

Now let’s compare the two loads:

from diffly import compare_frames

comparison = compare_frames(df_previous, df_current, primary_key="transaction_id")

Default summary#

The first step in any investigation is looking at the summary. It tells us:

  • Schemas match — both data frames have the same columns and types.

  • Rows: 2 left-only, 2 right-only, 10 joined (6 equal, 4 unequal).

  • Column match rates: unit_price, discount, timestamp, and total at 70%, loyalty_card_id at 90%. Other columns like product, quantity, store_id, and register_id are at 100%.

print(comparison.summary())
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘

Display options#

Custom names#

Use left_name and right_name to give meaningful labels to each DataFrame. For our scenario, we’re comparing data loads:

print(comparison.summary(left_name="Previous Load", right_name="Current Load"))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Previous Load count             Current Load count
           12          (no change)         12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  Previous Load only  (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal               (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal             (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  Current Load only   (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘

Showing column value changes#

Use top_k_column_changes to display specific value changes for columns with mismatches:

print(comparison.summary(top_k_column_changes=3))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┬─────────────────────────────────────────────────┐
   │ discount        │  70.00% │                                 0.0 -> 0.2 (2x) │
   │                 │         │                                 0.5 -> 1.0 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ loyalty_card_id │  90.00% │                     "LC-1004" -> "LC-9999" (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ product         │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ quantity        │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ register_id     │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ store_id        │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ timestamp       │  70.00% │ 2024-03-01 12:10:00 -> 2024-03-01 12:10:02 (1x) │
   │                 │         │ 2024-03-01 11:45:00 -> 2024-03-01 11:45:03 (1x) │
   │                 │         │ 2024-03-01 11:20:00 -> 2024-03-01 11:20:02 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ total           │  70.00% │                               10.3 -> 20.6 (1x) │
   │                 │         │                                 3.0 -> 5.8 (1x) │
   │                 │         │                                0.75 -> 1.3 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ unit_price      │  70.00% │                                0.75 -> 1.5 (2x) │
   │                 │         │                                5.4 -> 10.8 (1x) │
   └─────────────────┴─────────┴─────────────────────────────────────────────────┘

Hidden columns#

Use hidden_columns to prevent sensitive data from being displayed in the summary. For example, loyalty card IDs should not appear in logs or reports shared across teams:

print(comparison.summary(top_k_column_changes=3, hidden_columns=["loyalty_card_id"]))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┬─────────────────────────────────────────────────┐
   │ discount        │  70.00% │                                 0.0 -> 0.2 (2x) │
   │                 │         │                                 0.5 -> 1.0 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ loyalty_card_id │  90.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ product         │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ quantity        │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ register_id     │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ store_id        │ 100.00% │                                                 │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ timestamp       │  70.00% │ 2024-03-01 12:10:00 -> 2024-03-01 12:10:02 (1x) │
   │                 │         │ 2024-03-01 11:45:00 -> 2024-03-01 11:45:03 (1x) │
   │                 │         │ 2024-03-01 11:20:00 -> 2024-03-01 11:20:02 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ total           │  70.00% │                               10.3 -> 20.6 (1x) │
   │                 │         │                                 3.0 -> 5.8 (1x) │
   │                 │         │                                0.75 -> 1.3 (1x) │
   ├─────────────────┼─────────┼─────────────────────────────────────────────────┤
   │ unit_price      │  70.00% │                                0.75 -> 1.5 (2x) │
   │                 │         │                                5.4 -> 10.8 (1x) │
   └─────────────────┴─────────┴─────────────────────────────────────────────────┘

Showing sample rows#

Use sample_k_rows_only to display sample primary keys for rows that exist only in one DataFrame:

print(comparison.summary(sample_k_rows_only=5))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘

 Rows left only
 ▔▔▔▔▔▔▔▔▔▔▔▔▔▔
   ┏━━━━━━━━━━━━━━━━┓
   ┃ transaction_id ┃
   ┡━━━━━━━━━━━━━━━━┩
   │ TXN-011        │
   │ TXN-012        │
   └────────────────┘

 Rows right only
 ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
   ┏━━━━━━━━━━━━━━━━┓
   ┃ transaction_id ┃
   ┡━━━━━━━━━━━━━━━━┩
   │ TXN-013        │
   │ TXN-014        │
   └────────────────┘

Hiding perfect matches#

For wide data frames, set show_perfect_column_matches=False to hide columns with 100% match rate:

print(comparison.summary(show_perfect_column_matches=False))
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                     Diffly Summary                                     ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: transaction_id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 10).

 Rows
 ▔▔▔▔
   Left count             Right count
       12     (no change)     12

   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬────────┐
   │ discount        │ 70.00% │
   │ loyalty_card_id │ 90.00% │
   │ timestamp       │ 70.00% │
   │ total           │ 70.00% │
   │ unit_price      │ 70.00% │
   └─────────────────┴────────┘

Slim mode#

Slim mode generates a minimal summary showing only sections that contain differences:

print(comparison.summary(slim=True))
 Rows
 ▔▔▔▔
   ┏━┯━┯━┯━┯━┓
   ┃-│-│-│-│-┃                2  left only   (16.67%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  6  equal       (60.00%)  │
   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  10  joined
   ┃ │ │ │ │ ┃┃ │ │ │ │ ┃  4  unequal     (40.00%)  │
   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
                 ┃+│+│+│+│+┃  2  right only  (16.67%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌─────────────────┬─────────┐
   │ discount        │  70.00% │
   │ loyalty_card_id │  90.00% │
   │ product         │ 100.00% │
   │ quantity        │ 100.00% │
   │ register_id     │ 100.00% │
   │ store_id        │ 100.00% │
   │ timestamp       │  70.00% │
   │ total           │  70.00% │
   │ unit_price      │  70.00% │
   └─────────────────┴─────────┘