pysmatch.matching

pysmatch.matching.perform_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) → DataFrame

Performs nearest neighbor matching based on propensity scores within a radius.

Finds suitable match(es) from the control group for each record in the test (treatment) group based on propensity scores (‘scores’ column). It uses sklearn.neighbors.NearestNeighbors with radius=threshold to find potential neighbors and then applies selection logic based on the method parameter (‘min’ or ‘random’) to choose up to nmatches.

Parameters:

data (pd.DataFrame) – DataFrame containing both test and control groups, must include the yvar column and a ‘scores’ column with propensity scores.
yvar (str) – The name of the binary column indicating group membership (0 or 1).
threshold (float, optional) – The radius within which to search for neighbors based on propensity score difference. Defaults to 0.001.
nmatches (int, optional) – The maximum number of control matches to find for each test unit within the specified radius/threshold. Defaults to 1.
method (str, optional) – The method for selecting matches among the neighbors found within the radius. Defaults to ‘min’. Options:
    ‘min’: Selects the nmatches neighbors with the smallest score difference.
    ‘random’: Selects nmatches neighbors randomly from those within the radius.
replacement (bool, optional) – Whether a control unit can be used as a match for more than one test unit. Defaults to False.

Returns:

A DataFrame containing the matched test and control units. Includes the original columns plus ‘match_id’ (linking matched pairs/groups) and ‘record_id’ (preserving the original index of the unit). Returns an empty DataFrame if no matches are found.

Return type:

pd.DataFrame

Raises:

ValueError – If the ‘scores’ column is not found in the input data.
ValueError – If an invalid method parameter is provided (not ‘min’ or ‘random’).
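
The radius search and the ‘min’/‘random’ selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names radius_candidates and select are hypothetical, and the sketch ignores the replacement bookkeeping and the ‘match_id’/‘record_id’ output columns. It uses only sklearn.neighbors.NearestNeighbors and NumPy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def radius_candidates(test_scores, control_scores, threshold=0.001):
    """For each test score, find control indices within `threshold`.

    Hypothetical helper: results are sorted by distance, nearest first.
    """
    control = np.asarray(control_scores, dtype=float).reshape(-1, 1)
    test = np.asarray(test_scores, dtype=float).reshape(-1, 1)
    nn = NearestNeighbors(radius=threshold).fit(control)
    # sort_results=True orders each neighbor list by increasing distance.
    dist, idx = nn.radius_neighbors(test, sort_results=True)
    return dist, idx


def select(idx_row, nmatches=1, method="min", rng=None):
    """Choose up to `nmatches` candidates from one sorted neighbor list."""
    if method not in ("min", "random"):
        raise ValueError(f"method must be 'min' or 'random', got {method!r}")
    if len(idx_row) == 0:
        return idx_row  # no control unit within the radius
    if method == "min":
        return idx_row[:nmatches]  # already sorted nearest first
    if rng is None:
        rng = np.random.default_rng()
    k = min(nmatches, len(idx_row))
    return rng.choice(idx_row, size=k, replace=False)


# Example: two test units, three control units (made-up scores).
dist, idx = radius_candidates([0.1002, 0.5], [0.1000, 0.1006, 0.3000])
closest = select(idx[0], nmatches=1, method="min")
```

A test unit with no control unit inside the radius simply gets an empty candidate list, which is one way unmatched test units can drop out of the result.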

pysmatch.matching.prop_retained(original_data: DataFrame, matched_data: DataFrame, yvar: str) → float

Calculates the proportion of the minority group retained after matching.

Compares the number of unique minority group members in the matched dataset to the number in the original dataset.

Parameters:

original_data (pd.DataFrame) – The dataset before matching.
matched_data (pd.DataFrame) – The dataset after matching. Should contain a ‘record_id’ column; if it is missing, the index is used to identify units.
yvar (str) – The name of the binary treatment/control indicator column.

Returns:

The proportion (0.0 to 1.0) of the original minority group present in the matched dataset. Returns 0.0 if the original minority group was empty.

Return type:

float
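
The retention calculation described above can be approximated as follows. This is a sketch, not the library's code: the function name prop_retained_sketch is hypothetical, and it assumes the "minority group" is the less frequent value of yvar in the original data, which the documentation does not spell out.

```python
import pandas as pd


def prop_retained_sketch(original_data, matched_data, yvar):
    """Share of the original minority group that appears in matched_data."""
    counts = original_data[yvar].value_counts()
    if counts.empty:
        return 0.0  # no minority group to retain
    minority = counts.idxmin()  # assumption: minority = less frequent yvar value
    n_original = int(counts.min())

    matched_minority = matched_data[matched_data[yvar] == minority]
    # Prefer 'record_id'; fall back to the index when it is missing.
    if "record_id" in matched_minority.columns:
        n_matched = matched_minority["record_id"].nunique()
    else:
        n_matched = matched_minority.index.nunique()
    return n_matched / n_original
```

Counting unique identifiers rather than rows matters when nmatches is greater than 1 or replacement is enabled, since the same unit can then appear in several rows of the matched data.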

pysmatch.matching.tune_threshold(data: DataFrame, yvar: str, method: str = 'min', nmatches: int = 1, rng: ndarray | None = None) → tuple

Evaluates matching retention across a range of threshold values.

Performs matching using perform_match for each threshold in the specified range (rng) and calculates the proportion of the original minority group that is retained in the matched dataset. Useful for choosing a threshold.

Parameters:

data (pd.DataFrame) – The input DataFrame containing scores and yvar.
yvar (str) – The name of the binary treatment/control indicator column.
method (str, optional) – The matching method (‘min’ or ‘random’) to use for each evaluation. Defaults to ‘min’.
nmatches (int, optional) – The number of matches to seek for each test unit. Defaults to 1.
rng (Optional[np.ndarray], optional) – A NumPy array of threshold values to test. Defaults to None, in which case np.arange(0, 0.001, 0.0001) is used.

Returns:

A tuple containing:

thresholds (np.ndarray): The array of threshold values tested.
retained (list): A list of proportions (float) of the minority group retained for each corresponding threshold.

Return type:

Tuple[np.ndarray, list]
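
The threshold sweep can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: toy_match is a simplified greedy 1:1 matcher standing in for perform_match, tune_threshold_sketch and its match_fn parameter are hypothetical, and the minority group is taken to be the less frequent yvar value. The input is assumed to be non-empty and to have a ‘scores’ column.

```python
import numpy as np
import pandas as pd


def toy_match(data, yvar, threshold):
    """Hypothetical stand-in for perform_match: greedy 1:1, no replacement."""
    test = data[data[yvar] == 1]
    control = data[data[yvar] == 0]
    used, rows = [], []
    for rec, score in test["scores"].items():
        diffs = (control["scores"] - score).abs()
        diffs = diffs[~diffs.index.isin(used)]  # skip controls already matched
        diffs = diffs[diffs <= threshold]       # keep controls within the radius
        if diffs.empty:
            continue
        best = diffs.idxmin()
        used.append(best)
        rows.extend([rec, best])
    return data.loc[rows]


def tune_threshold_sketch(data, yvar, match_fn, rng=None):
    """Match once per threshold and record the minority share retained."""
    if rng is None:
        rng = np.arange(0, 0.001, 0.0001)
    minority = data[yvar].value_counts().idxmin()
    n_minority = int((data[yvar] == minority).sum())
    retained = []
    for t in rng:
        matched = match_fn(data, yvar, t)
        kept = matched.loc[matched[yvar] == minority].index.nunique()
        retained.append(kept / n_minority)
    return rng, retained


# Example with made-up scores: two treated units, four controls.
data = pd.DataFrame({
    "treated": [1, 1, 0, 0, 0, 0],
    "scores": [0.2000, 0.5000, 0.2003, 0.5009, 0.8000, 0.9000],
})
thresholds, retained = tune_threshold_sketch(
    data, "treated", toy_match, rng=np.array([0.0001, 0.0005, 0.001])
)
```

Plotting retained against thresholds shows where widening the radius stops recovering additional minority units, which is the trade-off the function is meant to expose.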