pysmatch.matching

pysmatch.matching.perform_exhaustive_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, show_progress: bool = False) → DataFrame

Perform exhaustive matching while prioritizing unused controls first.

Controls are pre-sorted by propensity score once, then candidate windows are found by binary search for each treated row. Within each window, selection is ordered by: 1. whether the control has been used before, 2. current usage count, 3. absolute score distance.
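The selection order described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the names `candidate_window` and `exhaustive_match_sketch` are hypothetical, rows are reduced to `(record_id, score)` tuples, and the window is assumed to include scores exactly at the threshold boundary.

```python
from bisect import bisect_left, bisect_right


def candidate_window(sorted_scores, score, threshold):
    """Return (lo, hi) such that sorted_scores[lo:hi] lie within threshold of score."""
    lo = bisect_left(sorted_scores, score - threshold)
    hi = bisect_right(sorted_scores, score + threshold)
    return lo, hi


def exhaustive_match_sketch(treated, controls, threshold=0.001, nmatches=1):
    """Hypothetical helper: treated/controls are lists of (record_id, score)."""
    controls = sorted(controls, key=lambda c: c[1])  # pre-sort controls once
    scores = [s for _, s in controls]
    usage = [0] * len(controls)  # how often each control has been selected
    matches = {}
    for tid, tscore in treated:
        lo, hi = candidate_window(scores, tscore, threshold)
        # Priority: unused controls first, then lowest usage count,
        # then smallest absolute score distance.
        ranked = sorted(
            range(lo, hi),
            key=lambda i: (usage[i] > 0, usage[i], abs(scores[i] - tscore)),
        )
        chosen = ranked[:nmatches]
        for i in chosen:
            usage[i] += 1
        matches[tid] = [controls[i][0] for i in chosen]
    return matches


treated = [("t1", 0.5000), ("t2", 0.5004)]
controls = [("c1", 0.5001), ("c2", 0.5008), ("c3", 0.7000)]
result = exhaustive_match_sketch(treated, controls, threshold=0.001)
```

In this toy input, `c1` is the closest control to both treated rows, but once `t1` has taken it, `t2` is assigned the unused `c2` even though `c1` is nearer.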

pysmatch.matching.perform_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) → DataFrame

Perform nearest-neighbor matching using propensity scores.

For each treated sample, this function searches for control samples whose score lies within threshold of the treated score and selects up to nmatches of them. Selection can be deterministic (method="min") or random (method="random").

Parameters:
  • data (pd.DataFrame) – DataFrame containing both test and control groups. It must include the yvar column and a ‘scores’ column with propensity scores.

  • yvar (str) – The name of the binary column indicating group membership (0 or 1).

  • threshold (float, optional) – The radius within which to search for neighbors based on propensity score difference. Defaults to 0.001.

  • nmatches (int, optional) – The maximum number of control matches to find for each test unit within the specified radius/threshold. Defaults to 1.

  • method (str, optional) – Match selection method. Use "min" (or the backward-compatible alias "nearest") for smallest score differences, or "random" for random sampling. Defaults to "min".

  • replacement (bool, optional) – Whether a control unit can be matched to more than one test unit. Defaults to False.

Returns:

Matched treated/control rows with match_id and record_id.

Returns an empty DataFrame if no matches are found.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the ‘scores’ column is not found in the input data.

  • ValueError – If an invalid method parameter is provided (not ‘min’, its alias ‘nearest’, or ‘random’).
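As a rough illustration of how threshold, nmatches, method, and replacement interact, the following standard-library sketch mimics the documented behaviour on plain `(record_id, score)` tuples. The function `match_sketch` and its `seed` argument are hypothetical, the comparison is assumed to be inclusive (distance ≤ threshold), and the real function operates on a DataFrame and returns match_id and record_id columns rather than pairs.

```python
import random


def match_sketch(treated, controls, threshold=0.001, nmatches=1,
                 method="min", replacement=False, seed=0):
    """Hypothetical caliper matcher returning (treated_id, control_id) pairs."""
    if method == "nearest":  # backward-compatible alias, per the parameter docs
        method = "min"
    if method not in ("min", "random"):
        raise ValueError(f"Invalid method: {method!r}")
    rng = random.Random(seed)  # seed is an assumption, for reproducibility only
    used = set()
    pairs = []
    for tid, tscore in treated:
        # Candidates within the threshold; skip used controls unless replacement.
        cands = [
            (abs(cscore - tscore), cid)
            for cid, cscore in controls
            if abs(cscore - tscore) <= threshold
            and (replacement or cid not in used)
        ]
        if method == "min":
            chosen = sorted(cands)[:nmatches]  # smallest score differences
        else:
            chosen = rng.sample(cands, min(nmatches, len(cands)))
        for _, cid in chosen:
            used.add(cid)
            pairs.append((tid, cid))
    return pairs


treated = [("t1", 0.3000), ("t2", 0.3005)]
controls = [("c1", 0.3002), ("c2", 0.3009), ("c3", 0.9000)]
without_replacement = match_sketch(treated, controls)
with_replacement = match_sketch(treated, controls, replacement=True)
```

Without replacement, `t2` falls back to `c2` because `c1` is already taken; with replacement, both treated rows are paired with `c1`.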

pysmatch.matching.prop_retained(original_data: DataFrame, matched_data: DataFrame, yvar: str) → float

Calculates the proportion of the minority group retained after matching.

Compares the number of unique minority group members in the matched dataset to the number in the original dataset.

Parameters:
  • original_data (pd.DataFrame) – The dataset before matching.

  • matched_data (pd.DataFrame) – The dataset after matching. Should contain a ‘record_id’ column; if ‘record_id’ is missing, the index is used instead.

  • yvar (str) – The name of the binary treatment/control indicator column.

Returns:

The proportion (0.0 to 1.0) of the original minority group present in the matched dataset. Returns 0.0 if the original minority group was empty.

Return type:

float
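The retention calculation can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: `prop_retained_sketch` is a hypothetical name, rows are `(record_id, group)` tuples instead of DataFrame rows, and a tie in group sizes is resolved in favour of group 1.

```python
def prop_retained_sketch(original, matched):
    """Hypothetical helper: rows are (record_id, group) with group in {0, 1}."""
    n1 = sum(1 for _, g in original if g == 1)
    n0 = sum(1 for _, g in original if g == 0)
    minority = 1 if n1 <= n0 else 0  # tie-breaking here is an assumption
    original_ids = {rid for rid, g in original if g == minority}
    if not original_ids:
        return 0.0  # empty minority group
    # Count unique minority members, so repeated rows are not double-counted.
    matched_ids = {rid for rid, g in matched if g == minority}
    return len(matched_ids & original_ids) / len(original_ids)


original = [("t1", 1), ("t2", 1), ("t3", 1), ("t4", 1)] + [
    (f"c{i}", 0) for i in range(6)
]
# t1 appears twice, as it would when nmatches > 1 yields two matched rows.
matched = [("t1", 1), ("c0", 0), ("t1", 1), ("c1", 0),
           ("t2", 1), ("c2", 0), ("t3", 1), ("c3", 0)]
retained = prop_retained_sketch(original, matched)
```

Three of the four original minority members appear in the matched rows, so `retained` is 0.75 here.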

pysmatch.matching.tune_threshold(data: DataFrame, yvar: str, method: str = 'min', nmatches: int = 1, rng: ndarray | None = None) → tuple

Evaluates matching retention across a range of threshold values.

Performs matching using perform_match for each threshold in the specified range (rng) and calculates the proportion of the original minority group that is retained in the matched dataset. Useful for choosing a threshold.

Parameters:
  • data (pd.DataFrame) – The input DataFrame containing scores and yvar.

  • yvar (str) – The name of the binary treatment/control indicator column.

  • method (str, optional) – The matching method (‘min’ or ‘random’) to use for each evaluation. Defaults to ‘min’.

  • nmatches (int, optional) – The number of matches to seek for each test unit. Defaults to 1.

  • rng (Optional[np.ndarray], optional) – A NumPy array of threshold values to test. Defaults to None, in which case np.arange(0, 0.001, 0.0001) is used.

Returns:

A tuple containing:
  • thresholds (np.ndarray): The array of threshold values tested.

  • retained (list): A list of proportions (float) of the minority group retained for each corresponding threshold.

Return type:

Tuple[np.ndarray, list]
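The sweep can be pictured with the standard-library sketch below. It is a simplified stand-in rather than the library's implementation: `tune_threshold_sketch` is a hypothetical name, the treated rows are assumed to be the minority group, matching is a greedy nearest-control search without replacement in place of perform_match, and thresholds are passed as a plain list instead of a NumPy array.

```python
def tune_threshold_sketch(treated, controls, thresholds):
    """Hypothetical helper: treated/controls are lists of (record_id, score)."""
    retained = []
    for thr in thresholds:
        available = {cid for cid, _ in controls}  # reset for every threshold
        kept = 0
        for _, tscore in treated:
            cands = [
                (abs(cscore - tscore), cid)
                for cid, cscore in controls
                if cid in available and abs(cscore - tscore) <= thr
            ]
            if cands:
                available.discard(min(cands)[1])  # consume the nearest control
                kept += 1
        retained.append(kept / len(treated) if treated else 0.0)
    return thresholds, retained


treated = [("t1", 0.2000), ("t2", 0.4000), ("t3", 0.6000)]
controls = [("c1", 0.2000), ("c2", 0.4003), ("c3", 0.6008), ("c4", 0.9000)]
thresholds, retained = tune_threshold_sketch(
    treated, controls, [0.0, 0.0005, 0.001]
)
```

On this toy data, retention rises from one third to two thirds to all treated rows as the threshold widens, which is the kind of curve one would inspect when choosing a threshold.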