pysmatch.matching
- pysmatch.matching.perform_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) -> DataFrame
Performs nearest neighbor matching based on propensity scores within a radius.
Finds suitable match(es) from the control group for each record in the test (treatment) group based on propensity scores (‘scores’ column). It uses sklearn.neighbors.NearestNeighbors with radius=threshold to find potential neighbors and then applies selection logic based on the method parameter (‘min’ or ‘random’) to choose up to nmatches.
- Parameters:
data (pd.DataFrame) – DataFrame containing both test and control groups, must include the yvar column and a ‘scores’ column with propensity scores.
yvar (str) – The name of the binary column indicating group membership (0 or 1).
threshold (float, optional) – The radius within which to search for neighbors based on propensity score difference. Defaults to 0.001.
nmatches (int, optional) – The maximum number of control matches to find for each test unit within the specified radius/threshold. Defaults to 1.
method (str, optional) – The method for selecting matches among neighbors found within the radius. Options:
- ‘min’: Selects the nmatches neighbors with the smallest score difference.
- ‘random’: Selects nmatches neighbors randomly from those within the radius.
Defaults to ‘min’.
replacement (bool, optional) – Whether control units can be matched multiple times (used more than once as a match). Defaults to False.
- Returns:
A DataFrame containing the matched test and control units. Includes the original columns plus ‘match_id’ (linking matched pairs/groups) and ‘record_id’ (preserving the original index of the unit). Returns an empty DataFrame if no matches are found.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the ‘scores’ column is not found in the input data.
ValueError – If an invalid method parameter is provided (not ‘min’ or ‘random’).
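The selection logic described above can be illustrated with a small dependency-free sketch. This is not pysmatch's implementation: the function name `radius_match`, the dict-based inputs, the `seed` argument, and the greedy processing of test units in input order are all assumptions made for illustration. The library itself works on a DataFrame and uses sklearn.neighbors.NearestNeighbors for the radius search.

```python
import random


def radius_match(test_scores, control_scores, threshold=0.001, nmatches=1,
                 method="min", replacement=False, seed=0):
    """Illustrative radius matching on propensity scores (hypothetical helper).

    test_scores / control_scores map a record id to its propensity score.
    Returns a dict mapping each matched test id to a list of control ids.
    """
    if method not in ("min", "random"):
        raise ValueError("method must be 'min' or 'random'")
    rng = random.Random(seed)
    used = set()      # controls already taken (only relevant without replacement)
    matches = {}
    for test_id, t_score in test_scores.items():
        # Candidates: controls whose score difference is within the radius.
        candidates = [
            (abs(c_score - t_score), c_id)
            for c_id, c_score in control_scores.items()
            if abs(c_score - t_score) <= threshold
            and (replacement or c_id not in used)
        ]
        if not candidates:
            continue  # this test unit stays unmatched
        if method == "min":
            # Keep the nmatches candidates with the smallest score difference.
            chosen = sorted(candidates)[:nmatches]
        else:
            # Draw up to nmatches candidates at random from within the radius.
            chosen = rng.sample(candidates, min(nmatches, len(candidates)))
        ids = [c_id for _, c_id in chosen]
        matches[test_id] = ids
        if not replacement:
            used.update(ids)
    return matches


test = {"t1": 0.3000, "t2": 0.3004, "t3": 0.9000}
control = {"c1": 0.3002, "c2": 0.3009, "c3": 0.5000}

without_replacement = radius_match(test, control)
with_replacement = radius_match(test, control, replacement=True)
```

In this toy data, `t3` has no control within 0.001 and is dropped. Without replacement, `t2` has to settle for `c2` because `c1` was already taken by `t1`; with replacement, both test units share `c1`.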
- pysmatch.matching.prop_retained(original_data: DataFrame, matched_data: DataFrame, yvar: str) -> float
Calculates the proportion of the minority group retained after matching.
Compares the number of unique minority group members in the matched dataset to the number in the original dataset.
- Parameters:
original_data (pd.DataFrame) – The dataset before matching.
matched_data (pd.DataFrame) – The dataset after matching. Should contain a ‘record_id’ column; if ‘record_id’ is missing, the index is used instead.
yvar (str) – The name of the binary treatment/control indicator column.
- Returns:
The proportion (0.0 to 1.0) of the original minority group present in the matched dataset. Returns 0.0 if the original minority group was empty.
- Return type:
float
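The calculation can be sketched on plain Python records, so it runs without pandas. This is an illustration, not the library's code: the name `retained_fraction` is hypothetical, rows are dicts rather than DataFrame rows, every row is assumed to carry ‘record_id’, and the minority group is assumed to be the less frequent value of yvar.

```python
from collections import Counter


def retained_fraction(original, matched, yvar):
    """Share of the original minority group that appears in the matched records."""
    counts = Counter(row[yvar] for row in original)
    if not counts:
        return 0.0  # empty input: nothing to retain
    # Assumption: the minority group is the less frequent value of yvar.
    minority = min(counts, key=counts.get)
    # Count each minority unit once, even if it appears in several matches.
    kept = {row["record_id"] for row in matched if row[yvar] == minority}
    return len(kept) / counts[minority]


# 4 treated units (ids 0-3) and 8 controls (ids 4-11) before matching.
original = [{"record_id": i, "treated": 1 if i < 4 else 0} for i in range(12)]
# After matching, treated unit 3 is gone and control 5 was used twice.
matched = [
    {"record_id": 0, "treated": 1},
    {"record_id": 1, "treated": 1},
    {"record_id": 2, "treated": 1},
    {"record_id": 5, "treated": 0},
    {"record_id": 6, "treated": 0},
    {"record_id": 5, "treated": 0},
]

fraction = retained_fraction(original, matched, "treated")
```

Three of the four treated units survive matching here, so `fraction` is 0.75. Counting unique ids matters when units can appear more than once in the matched data, for example when matching with replacement.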
- pysmatch.matching.tune_threshold(data: DataFrame, yvar: str, method: str = 'min', nmatches: int = 1, rng: ndarray | None = None) -> tuple
Evaluates matching retention across a range of threshold values.
Performs matching using perform_match for each threshold in the specified range (rng) and calculates the proportion of the original minority group that is retained in the matched dataset. Useful for choosing a threshold.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing scores and yvar.
yvar (str) – The name of the binary treatment/control indicator column.
method (str, optional) – The matching method (‘min’ or ‘random’) to use for each evaluation. Defaults to ‘min’.
nmatches (int, optional) – The number of matches to seek for each test unit. Defaults to 1.
rng (Optional[np.ndarray], optional) – A NumPy array of threshold values to test. Defaults to None, in which case np.arange(0, 0.001, 0.0001) is used.
- Returns:
A tuple containing:
- thresholds (np.ndarray): The array of threshold values tested.
- retained (list): A list of proportions (float) of the minority group retained for each corresponding threshold.
- Return type:
Tuple[np.ndarray, list]
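The threshold sweep can be sketched as a loop that matches at each radius and records the retained share. The following is a simplified stand-in, not the library's code: `retention_by_threshold` is a hypothetical name, scores are plain lists, the test group is assumed to be the minority group, and matching is a greedy 1:1 nearest match without replacement.

```python
def retention_by_threshold(test_scores, control_scores, thresholds):
    """Return (thresholds, retained): the share of test units matched per radius."""
    retained = []
    for threshold in thresholds:
        used = set()   # indices of controls already matched at this threshold
        matched = 0
        for t_score in test_scores:
            # Unused controls within the radius, closest first.
            candidates = sorted(
                (abs(c_score - t_score), i)
                for i, c_score in enumerate(control_scores)
                if i not in used and abs(c_score - t_score) <= threshold
            )
            if candidates:
                used.add(candidates[0][1])
                matched += 1
        retained.append(matched / len(test_scores) if test_scores else 0.0)
    return list(thresholds), retained


test = [0.2000, 0.5000, 0.8000, 0.9000]
control = [0.20005, 0.5004, 0.8020, 0.1000, 0.6500]

thresholds, retained = retention_by_threshold(
    test, control, [0.0, 0.0001, 0.0005, 0.001]
)
```

In this sketch, retention rises from 0.0 to 0.5 as the radius widens. A threshold of 0 only matches identical scores, and the last two test units never find a control within 0.001. A curve like this shows the trade-off involved in choosing a threshold: a tighter radius gives closer matches but keeps fewer units.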