Data Comparison

diffly.compare_frames(
left: DataFrame | LazyFrame,
right: DataFrame | LazyFrame,
/,
*,
primary_key: str | Sequence[str] | None = None,
abs_tol: float | Mapping[str, float] = 1e-08,
rel_tol: float | Mapping[str, float] = 1e-05,
abs_tol_temporal: timedelta | Mapping[str, timedelta] = datetime.timedelta(0),
) -> DataFrameComparison

Compare two polars data frames.

Parameters:
  • left – The first data frame in the comparison.

  • right – The second data frame in the comparison.

  • primary_key – Primary key columns to use for joining the data frames. If not provided, comparisons based on joins will raise an error.

  • abs_tol – Absolute tolerance for comparing floating point types. If a Mapping is provided, it should map from column name to absolute tolerance for every column in the data frame (except the primary key).

  • rel_tol – Relative tolerance for comparing floating point types. If a Mapping is provided, it should map from column name to relative tolerance for every column in the data frame (except the primary key).

  • abs_tol_temporal – Absolute tolerance for comparing temporal types. If a Mapping is provided, it should map from column name to absolute temporal tolerance for every column in the data frame (except the primary key).
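As an illustration of what an absolute temporal tolerance means, here is a minimal standard-library sketch. It assumes two temporal values match when their absolute difference does not exceed the tolerance; the source does not spell out the exact rule (for instance, whether the bound is inclusive), and `temporal_match`, `created_at`, and `updated_at` are hypothetical names used only for this example.

```python
from datetime import datetime, timedelta


def temporal_match(a: datetime, b: datetime, tol: timedelta = timedelta(0)) -> bool:
    """Hypothetical helper: treat a and b as equal if they differ by at most tol."""
    return abs(a - b) <= tol


t0 = datetime(2024, 1, 1, 12, 0, 0)
t1 = t0 + timedelta(milliseconds=500)

# With the documented default of timedelta(0), only identical values match.
print(temporal_match(t0, t1))
# A one-second tolerance absorbs the half-second difference.
print(temporal_match(t0, t1, tol=timedelta(seconds=1)))

# Mapping form: one tolerance per non-key column (column names are made up).
tolerances = {"created_at": timedelta(seconds=1), "updated_at": timedelta(0)}
print(temporal_match(t0, t1, tol=tolerances["created_at"]))
print(temporal_match(t0, t1, tol=tolerances["updated_at"]))
```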

Returns:

A data frame comparison object that can be used to explore the differences of the provided data frames.

Note

The implementation of floating point equivalence mirrors the implementation of math.isclose().
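Because the note refers to math.isclose(), the effect of the default tolerances can be explored with the standard library alone. The sketch below calls math.isclose() directly with the defaults from the signature above; `floats_match` is a hypothetical helper for this example, not part of diffly.

```python
import math

# Defaults taken from the compare_frames signature.
ABS_TOL = 1e-08
REL_TOL = 1e-05


def floats_match(a: float, b: float) -> bool:
    """Hypothetical helper: delegate to math.isclose with the documented defaults."""
    return math.isclose(a, b, rel_tol=REL_TOL, abs_tol=ABS_TOL)


# Difference of about 1e-6 on a value near 1: inside the relative tolerance.
print(floats_match(1.0, 1.000001))
# Difference of about 1e-3: outside both tolerances.
print(floats_match(1.0, 1.001))
# Near zero the relative bound shrinks, so only the absolute tolerance helps.
print(floats_match(0.0, 5e-9))
# 1e-6 exceeds the absolute tolerance and the relative bound is tiny.
print(floats_match(0.0, 1e-6))
```

math.isclose() accepts a pair when abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol), which is why the absolute tolerance matters mostly for values close to zero.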

class diffly.comparison.DataFrameComparison(
left: LazyFrame,
right: LazyFrame,
left_schema: Schema,
right_schema: Schema,
primary_key: list[str] | None,
_other_common_columns: list[str],
abs_tol_by_column: dict[str, float],
rel_tol_by_column: dict[str, float],
abs_tol_temporal_by_column: dict[str, timedelta],
)

Object representing a comparison between two polars data frames.

Note

Do not initialize this object directly. Instead, use compare_frames().

See also

Schema Comparison — inspect column names and data types via schemas.

DataFrameComparison.equal(*[, check_dtypes])

Whether the data frames are equal, independent of row and column order.

DataFrameComparison.equal_num_rows()

Whether the left and right data frames have the same number of rows.

DataFrameComparison.joined(...)

The rows of both data frames that can be joined, regardless of whether column values match in columns which are not used for joining.

DataFrameComparison.joined_equal(...)

The rows of both data frames that can be joined and have matching values in all columns in subset.

DataFrameComparison.joined_unequal(...)

The rows of both data frames that can be joined and have at least one mismatching value across any column in subset.

DataFrameComparison.num_rows_left()

Number of rows in the left data frame.

DataFrameComparison.num_rows_right()

Number of rows in the right data frame.

DataFrameComparison.num_rows_joined()

The number of rows that can be joined, regardless of whether column values match in columns which are not used for joining.

DataFrameComparison.num_rows_joined_equal(*subset)

The number of rows that can be joined and have matching values in all columns in subset.

DataFrameComparison.num_rows_joined_unequal(*subset)

The number of rows of both data frames that can be joined and have at least one mismatching value across any column in subset.

DataFrameComparison.left_only(...)

The rows in the left data frame which cannot be joined with a row in the right data frame.

DataFrameComparison.right_only(...)

The rows in the right data frame which cannot be joined with a row in the left data frame.

DataFrameComparison.num_rows_left_only()

The number of rows in the left data frame which cannot be joined with a row in the right data frame.

DataFrameComparison.num_rows_right_only()

The number of rows in the right data frame which cannot be joined with a row in the left data frame.
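The row categories described above (joined, joined equal, joined unequal, left only, right only) can be pictured with a small pure-Python sketch. This is a conceptual illustration only, not diffly's implementation: it assumes a single unique key column, uses exact equality instead of tolerances, and `partition_by_key` and the sample data are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def partition_by_key(left: list[Row], right: list[Row], key: str) -> dict[str, list[Any]]:
    """Hypothetical helper: classify key values the way the methods above classify rows."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    joined = left_by_key.keys() & right_by_key.keys()
    return {
        "joined": sorted(joined),
        "joined_equal": sorted(k for k in joined if left_by_key[k] == right_by_key[k]),
        "joined_unequal": sorted(k for k in joined if left_by_key[k] != right_by_key[k]),
        "left_only": sorted(left_by_key.keys() - right_by_key.keys()),
        "right_only": sorted(right_by_key.keys() - left_by_key.keys()),
    }


left = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}, {"id": 3, "price": 30.0}]
right = [{"id": 2, "price": 20.0}, {"id": 3, "price": 31.0}, {"id": 4, "price": 40.0}]

parts = partition_by_key(left, right, key="id")
for name, keys in parts.items():
    print(f"{name}: {keys}")
```

In this sketch every joined key lands in exactly one of the equal and unequal groups, so their counts add up to the joined count.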

DataFrameComparison.fraction_same(...)

Compute the fraction of matching values.
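The summary above is brief, so the following sketch shows one plausible reading: for a single column, the share of joined rows whose left and right values match. That reading, the exact-equality check, and the name `fraction_matching` are assumptions made for this example; the method's actual parameters and return type are not shown in this section.

```python
from collections.abc import Sequence
from typing import Any


def fraction_matching(left_values: Sequence[Any], right_values: Sequence[Any]) -> float:
    """Hypothetical helper: share of positions where both sequences hold equal values."""
    if len(left_values) != len(right_values):
        raise ValueError("both sequences must describe the same joined rows")
    if not left_values:
        raise ValueError("no joined rows to compare")
    matches = sum(l == r for l, r in zip(left_values, right_values))
    return matches / len(left_values)


# Values of one column for four joined rows, aligned by primary key.
left_prices = [10, 20, 30, 40]
right_prices = [10, 21, 30, 41]

print(fraction_matching(left_prices, right_prices))
```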

DataFrameComparison.change_counts(...)

Get the changes of a column, sorted in descending order of frequency.
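To make "changes sorted in descending order of frequency" concrete, the sketch below counts (left value, right value) pairs that differ for one column and lists the most frequent first. It is an assumption-laden illustration using collections.Counter; `count_changes` and the sample values are hypothetical, and the output shape of the real method may differ.

```python
from collections import Counter
from collections.abc import Sequence
from typing import Any


def count_changes(
    left_values: Sequence[Any], right_values: Sequence[Any]
) -> list[tuple[tuple[Any, Any], int]]:
    """Hypothetical helper: count differing (left, right) pairs, most frequent first."""
    changes = Counter(
        (l, r) for l, r in zip(left_values, right_values) if l != r
    )
    # most_common() returns entries ordered by descending count.
    return changes.most_common()


# Values of one column for five joined rows, aligned by primary key.
left_status = ["a", "b", "a", "c", "a"]
right_status = ["x", "b", "x", "d", "a"]

for (old, new), count in count_changes(left_status, right_status):
    print(f"{old!r} -> {new!r}: {count}")
```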