pysmatch.matching

pysmatch.matching.perform_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) → DataFrame

Performs nearest neighbor matching based on propensity scores within a radius.

Finds suitable match(es) from the control group for each record in the test (treatment) group based on propensity scores (‘scores’ column). It uses sklearn.neighbors.NearestNeighbors with radius=threshold to find potential neighbors and then applies selection logic based on the method parameter (‘min’ or ‘random’) to choose up to nmatches.

Parameters:
  • data (pd.DataFrame) – DataFrame containing both test and control groups, must include the yvar column and a ‘scores’ column with propensity scores.

  • yvar (str) – The name of the binary column indicating group membership (0 or 1).

  • threshold (float, optional) – The radius within which to search for neighbors based on propensity score difference. Defaults to 0.001.

  • nmatches (int, optional) – The maximum number of control matches to find for each test unit within the specified radius/threshold. Defaults to 1.

  • method (str, optional) – The method for selecting matches among neighbors found within the radius. Options:

    ‘min’: Selects up to nmatches neighbors with the smallest score difference.

    ‘random’: Selects up to nmatches neighbors randomly from those within the radius.

    Defaults to ‘min’.

  • replacement (bool, optional) – Whether control units can be matched multiple times (used more than once as a match). Defaults to False.

Returns:

A DataFrame containing the matched test and control units.

Includes original columns plus ‘match_id’ (linking matched pairs/groups) and ‘record_id’ (preserving the original index of the unit). Returns an empty DataFrame if no matches are found.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the ‘scores’ column is not found in the input data.

  • ValueError – If an invalid method parameter is provided (not ‘min’ or ‘random’).
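
The selection logic described above can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not the library's implementation (which, per the description, uses sklearn.neighbors.NearestNeighbors). The helper name radius_match, the seed argument, the inclusive radius comparison, and the greedy left-to-right processing order are all hypothetical choices made for the example.

```python
import random


def radius_match(test_scores, control_scores, threshold=0.001, nmatches=1,
                 method="min", replacement=False, seed=None):
    """Hypothetical helper: map each test index to its matched control indices."""
    if method not in ("min", "random"):
        raise ValueError(f"method must be 'min' or 'random', got {method!r}")
    rng = random.Random(seed)
    used = set()      # controls already consumed (only relevant without replacement)
    matches = {}
    for t_idx, t_score in enumerate(test_scores):
        # Collect (score difference, control index) for controls inside the radius.
        candidates = [
            (abs(t_score - c_score), c_idx)
            for c_idx, c_score in enumerate(control_scores)
            if abs(t_score - c_score) <= threshold   # assumption: inclusive radius
            and (replacement or c_idx not in used)
        ]
        if not candidates:
            continue  # this test unit stays unmatched
        if method == "min":
            chosen = sorted(candidates)[:nmatches]
        else:
            chosen = rng.sample(candidates, min(nmatches, len(candidates)))
        picked = [c_idx for _, c_idx in chosen]
        if not replacement:
            used.update(picked)
        matches[t_idx] = picked
    return matches


test_scores = [0.30, 0.52, 0.90]
control_scores = [0.32, 0.29, 0.50, 0.70]
matches = radius_match(test_scores, control_scores, threshold=0.03)
print(matches)  # {0: [1], 1: [2]}
```

The third test unit (score 0.90) has no control within 0.03, so it is absent from the result, which mirrors how an overly tight threshold drops treated records from the matched data.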

pysmatch.matching.prop_retained(original_data: DataFrame, matched_data: DataFrame, yvar: str) → float

Calculates the proportion of the minority group retained after matching.

Compares the number of unique minority group members in the matched dataset to the number in the original dataset.

Parameters:
  • original_data (pd.DataFrame) – The dataset before matching.

  • matched_data (pd.DataFrame) – The dataset after matching. Should contain a ‘record_id’ column; if it is missing, the index is used instead.

  • yvar (str) – The name of the binary treatment/control indicator column.

Returns:

The proportion (0.0 to 1.0) of the original minority group present in the matched dataset. Returns 0.0 if the original minority group was empty.

Return type:

float
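
The retention calculation can be illustrated with a small standalone sketch. It assumes the minority group is the label with fewer rows in the original data (ties resolved toward label 1) and represents the data as plain dictionaries and lists rather than DataFrames; the name proportion_retained and both assumptions are hypothetical and are not taken from the library.

```python
def proportion_retained(original_labels, matched_record_ids):
    """Hypothetical helper.

    original_labels: {record_id: 0 or 1} for the data before matching.
    matched_record_ids: record ids present in the matched data (may repeat).
    """
    n_ones = sum(1 for label in original_labels.values() if label == 1)
    n_zeros = len(original_labels) - n_ones
    # Assumption: the minority group is the label with fewer rows.
    minority_label = 1 if n_ones <= n_zeros else 0
    minority_ids = {
        rid for rid, label in original_labels.items() if label == minority_label
    }
    if not minority_ids:
        return 0.0
    # Count unique minority members only, so repeated records are not double counted.
    retained = minority_ids & set(matched_record_ids)
    return len(retained) / len(minority_ids)


# Records 0-3 are treated (label 1); records 4-9 are controls (label 0).
original_labels = {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}
matched_record_ids = [0, 4, 1, 5, 1, 6]  # record 1 appears twice
proportion = proportion_retained(original_labels, matched_record_ids)
print(proportion)  # 0.5
```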

pysmatch.matching.tune_threshold(data: DataFrame, yvar: str, method: str = 'min', nmatches: int = 1, rng: ndarray | None = None) → tuple

Evaluates matching retention across a range of threshold values.

Performs matching using perform_match for each threshold in the specified range (rng) and calculates the proportion of the original minority group that is retained in the matched dataset. Useful for choosing a threshold.

Parameters:
  • data (pd.DataFrame) – The input DataFrame containing scores and yvar.

  • yvar (str) – The name of the binary treatment/control indicator column.

  • method (str, optional) – The matching method (‘min’ or ‘random’) to use for each evaluation. Defaults to ‘min’.

  • nmatches (int, optional) – The number of matches to seek for each test unit. Defaults to 1.

  • rng (Optional[np.ndarray], optional) – A NumPy array of threshold values to test. Defaults to None, in which case np.arange(0, 0.001, 0.0001) is used.

Returns:

A tuple containing:
  • thresholds (np.ndarray): The array of threshold values tested.

  • retained (list): A list of proportions (float) of the minority group retained for each corresponding threshold.

Return type:

Tuple[np.ndarray, list]
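
The threshold sweep can be sketched without any dependencies. The example assumes the test group is the minority group, uses greedy 1:1 matching without replacement, and builds the default grid as a plain list equivalent to np.arange(0, 0.001, 0.0001); the helper names retained_at and sweep_thresholds are hypothetical and do not reproduce the library's internals.

```python
def retained_at(test_scores, control_scores, threshold):
    """Hypothetical helper: share of test units matched 1:1 within the threshold."""
    used = set()
    matched = 0
    for t_score in test_scores:
        best = None  # (score difference, control index) of the closest free control
        for c_idx, c_score in enumerate(control_scores):
            diff = abs(t_score - c_score)
            if c_idx in used or diff > threshold:
                continue
            if best is None or diff < best[0]:
                best = (diff, c_idx)
        if best is not None:
            used.add(best[1])
            matched += 1
    return matched / len(test_scores) if test_scores else 0.0


def sweep_thresholds(test_scores, control_scores, thresholds):
    """Return the thresholds and the retention observed at each one."""
    retained = [retained_at(test_scores, control_scores, t) for t in thresholds]
    return thresholds, retained


# Ten values from 0 to 0.0009 in steps of 0.0001.
thresholds = [i * 0.0001 for i in range(10)]
test_scores = [0.2000, 0.4000, 0.6000, 0.8000]
control_scores = [0.2000, 0.40025, 0.60065, 0.9000, 0.9500]

thresholds, retained = sweep_thresholds(test_scores, control_scores, thresholds)
for t, r in zip(thresholds, retained):
    print(f"{t:.4f} -> {r:.2f}")
```

Retention can only stay flat or rise as the radius widens in this example, so a common way to read such a sweep is to pick the smallest threshold at which retention reaches an acceptable level.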