pysmatch.matching
- pysmatch.matching.perform_exhaustive_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, show_progress: bool = False) DataFrame[source]
Perform exhaustive matching while prioritizing unused controls first.
Controls are pre-sorted by propensity score once, then candidate windows are found by binary search for each treated row. Within each window, candidates are ranked by (1) whether the control has been used before, (2) its current usage count, and (3) absolute score distance.
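The ordering described above can be sketched in plain Python. This is an illustration only, not pysmatch's implementation: the function name exhaustive_match_sketch, the list-based inputs, and the dictionary return value are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def exhaustive_match_sketch(treated_scores, control_scores, threshold=0.001, nmatches=1):
    """Hypothetical helper: map each treated index to a list of control indices."""
    # Sort control indices by score once, so each window is a contiguous slice.
    order = sorted(range(len(control_scores)), key=lambda i: control_scores[i])
    sorted_scores = [control_scores[i] for i in order]
    usage = [0] * len(control_scores)  # times each control has been selected

    matches = {}
    for t, score in enumerate(treated_scores):
        # Binary search for the window [score - threshold, score + threshold].
        lo = bisect_left(sorted_scores, score - threshold)
        hi = bisect_right(sorted_scores, score + threshold)
        window = order[lo:hi]
        # Rank: unused controls first, then lowest usage, then smallest distance.
        ranked = sorted(
            window,
            key=lambda c: (usage[c] > 0, usage[c], abs(control_scores[c] - score)),
        )
        chosen = ranked[:nmatches]
        for c in chosen:
            usage[c] += 1
        matches[t] = chosen
    return matches


result = exhaustive_match_sketch([0.50, 0.50], [0.50, 0.52, 0.90], threshold=0.05)
```

In this toy input both treated rows share the same score. The second row is steered to the still-unused control even though the first control is closer, which is the effect of ranking prior use ahead of distance.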
- pysmatch.matching.perform_match(data: DataFrame, yvar: str, threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) DataFrame[source]
Perform nearest-neighbor matching using propensity scores.
For each treated sample, this function searches control samples within threshold score distance and selects up to nmatches controls. Selection can be deterministic (method="min") or random.
- Parameters:
data (pd.DataFrame) – DataFrame containing both test and control groups, must include the yvar column and a ‘scores’ column with propensity scores.
yvar (str) – The name of the binary column indicating group membership (0 or 1).
threshold (float, optional) – The radius within which to search for neighbors based on propensity score difference. Defaults to 0.001.
nmatches (int, optional) – The maximum number of control matches to find for each test unit within the specified radius/threshold. Defaults to 1.
method (str, optional) – Match selection method. Use "min" (or the backward-compatible alias "nearest") for smallest score differences, or "random" for random sampling. Defaults to "min".
replacement (bool, optional) – Whether control units can be matched multiple times (used more than once as a match). Defaults to False.
- Returns:
Matched treated/control rows with match_id and record_id. Returns an empty DataFrame if no matches are found.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the ‘scores’ column is not found in the input data.
ValueError – If an invalid method parameter is provided (not ‘min’, its alias ‘nearest’, or ‘random’).
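As a rough illustration of the behavior documented for perform_match, the following standard-library sketch does caliper matching on plain dictionaries. It is not the library's code: the name match_sketch, the dict inputs, the seed argument, and the (match_id, record_id, group) tuple layout are assumptions chosen to mirror the documented parameters and output columns.

```python
import random


def match_sketch(treated, controls, threshold=0.001, nmatches=1,
                 method="min", replacement=False, seed=None):
    """Hypothetical helper. treated/controls map record_id -> propensity score."""
    if method not in ("min", "nearest", "random"):
        raise ValueError(f"invalid method: {method!r}")
    rng = random.Random(seed)
    used = set()   # controls already consumed (only relevant without replacement)
    rows = []      # (match_id, record_id, group) with group 1 = treated, 0 = control
    match_id = 0
    for t_id, t_score in treated.items():
        # Controls inside the caliper that are still available.
        candidates = [
            c for c, s in controls.items()
            if abs(s - t_score) <= threshold and (replacement or c not in used)
        ]
        if not candidates:
            continue  # this treated record stays unmatched
        if method == "random":
            rng.shuffle(candidates)
        else:
            candidates.sort(key=lambda c: abs(controls[c] - t_score))
        chosen = candidates[:nmatches]
        used.update(chosen)
        rows.append((match_id, t_id, 1))
        rows.extend((match_id, c, 0) for c in chosen)
        match_id += 1
    return rows


treated = {"t1": 0.30, "t2": 0.31}
controls = {"c1": 0.30, "c2": 0.35}
without_repl = match_sketch(treated, controls, threshold=0.02)
with_repl = match_sketch(treated, controls, threshold=0.02, replacement=True)
```

Without replacement, t2 goes unmatched because its only in-range control was already taken by t1. With replacement=True, the same control is reused.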
- pysmatch.matching.prop_retained(original_data: DataFrame, matched_data: DataFrame, yvar: str) float[source]
Calculates the proportion of the minority group retained after matching.
Compares the number of unique minority group members in the matched dataset to the number in the original dataset.
- Parameters:
original_data (pd.DataFrame) – The dataset before matching.
matched_data (pd.DataFrame) – The dataset after matching. Should contain ‘record_id’ or rely on index if ‘record_id’ is missing.
yvar (str) – The name of the binary treatment/control indicator column.
- Returns:
The proportion (0.0 to 1.0) of the original minority group present in the matched dataset. Returns 0.0 if the original minority group was empty.
- Return type:
float
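The retention calculation can be sketched without pandas. The example assumes records are (record_id, group) pairs and breaks a tie in group size in favor of group 1. The name prop_retained_sketch, the input layout, and the tie-breaking rule are illustrative assumptions, not the library's behavior.

```python
def prop_retained_sketch(original, matched):
    """Hypothetical helper. Both arguments are iterables of (record_id, group) pairs."""
    # Unique record ids per group in the original data.
    orig_ids = {g: {rid for rid, grp in original if grp == g} for g in (0, 1)}
    # Assumption: on a tie, treat group 1 as the minority.
    minority = 1 if len(orig_ids[1]) <= len(orig_ids[0]) else 0
    if not orig_ids[minority]:
        return 0.0  # empty minority group
    # Unique minority ids that survive matching (duplicates count once).
    matched_ids = {rid for rid, grp in matched if grp == minority}
    return len(matched_ids & orig_ids[minority]) / len(orig_ids[minority])


original = [(1, 1), (2, 1), (3, 0), (4, 0), (5, 0)]
matched = [(1, 1), (3, 0), (1, 1)]
share = prop_retained_sketch(original, matched)
```

Record 1 appears twice in the matched data but is counted once, so one of the two original minority members is retained.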
- pysmatch.matching.tune_threshold(data: DataFrame, yvar: str, method: str = 'min', nmatches: int = 1, rng: ndarray | None = None) tuple[source]
Evaluates matching retention across a range of threshold values.
Performs matching using perform_match for each threshold in the specified range (rng) and calculates the proportion of the original minority group that is retained in the matched dataset. Useful for choosing a threshold.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing scores and yvar.
yvar (str) – The name of the binary treatment/control indicator column.
method (str, optional) – The matching method (‘min’ or ‘random’) to use for each evaluation. Defaults to ‘min’.
nmatches (int, optional) – The number of matches to seek for each test unit. Defaults to 1.
rng (Optional[np.ndarray], optional) – A NumPy array of threshold values to test. If None, defaults to np.arange(0, 0.001, 0.0001). Defaults to None.
- Returns:
A tuple containing:
thresholds (np.ndarray): The array of threshold values tested.
retained (list): A list of proportions (float) of the minority group retained for each corresponding threshold.
- Return type:
Tuple[np.ndarray, list]
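A minimal sketch of the threshold sweep, using a simple greedy one-to-one matcher in place of perform_match. It assumes the treated records are the minority group. The name retention_curve_sketch and the list inputs are hypothetical, and the default grid is a plain-Python stand-in for the documented np.arange(0, 0.001, 0.0001).

```python
def retention_curve_sketch(treated, controls, thresholds=None):
    """Hypothetical helper: share of treated scores matched at each threshold."""
    if thresholds is None:
        # Ten values from 0 up to (not including) 0.001, like the documented default.
        thresholds = [i * 0.0001 for i in range(10)]
    retained = []
    for thr in thresholds:
        used = set()  # controls already matched at this threshold
        kept = 0
        for t_score in treated:
            best = None  # (distance, control index) of the closest free control
            for idx, c_score in enumerate(controls):
                dist = abs(c_score - t_score)
                if idx not in used and dist <= thr and (best is None or dist < best[0]):
                    best = (dist, idx)
            if best is not None:
                used.add(best[1])
                kept += 1
        retained.append(kept / len(treated) if treated else 0.0)
    return thresholds, retained


grid, curve = retention_curve_sketch([0.2, 0.4], [0.2, 0.45, 0.9], thresholds=[0.0, 0.1])
```

Plotting the retained proportions against the thresholds shows how retention changes as the caliper widens, which is the trade-off this function is meant to expose.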