pysmatch.Matcher
- class pysmatch.Matcher.Matcher(test: DataFrame, control: DataFrame, yvar: str, formula: str | None = None, exclude: List[str] | None = None, exhaustive_matching_default: bool = False)
Bases: object
A class to perform propensity score matching (PSM).
This class encapsulates the entire PSM workflow, including propensity score estimation, matching, and balance assessment.
- data
The input DataFrame containing treatment, outcome, covariates, and record_id.
- Type:
pd.DataFrame
- treatment_col
The name of the treatment column.
- Type:
str
- yvar
Alias of treatment_col, kept for compatibility with other modules.
- Type:
str
- test
The processed test/treated group DataFrame.
- Type:
pd.DataFrame
- control
The processed control group DataFrame.
- Type:
pd.DataFrame
- exclude
A list of columns to exclude from calculations.
- Type:
list
- scores
Propensity scores estimated for each observation, stored in data['scores'].
- Type:
pd.Series
- matched_data
DataFrame containing the matched pairs/groups.
- Type:
pd.DataFrame
- models
List of fitted propensity score model objects.
- Type:
List[Any]
- model_accuracy
List of accuracies for the fitted models.
- Type:
List[float]
- exhaustive_matching_default
Default behavior for exhaustive matching for the instance.
- Type:
bool
- assign_weight_vector() → None
Assigns inverse frequency weights to records in the matched dataset. Each weight is 1/count, where count is the number of times an original record (identified by record_id) appears. The weights are added as a 'weight' column.
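The weighting rule above can be illustrated without the library. The following is a minimal standalone sketch, not pysmatch's implementation: the row layout, the `matched_rows` data, and the helper name `inverse_frequency_weights` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical matched rows (illustrative data, not produced by pysmatch).
# record_id 7 appears twice, e.g. a control reused by two test units.
matched_rows = [
    {"record_id": 1, "match_id": 0},
    {"record_id": 7, "match_id": 0},
    {"record_id": 2, "match_id": 1},
    {"record_id": 7, "match_id": 1},
]


def inverse_frequency_weights(rows):
    """Return copies of the rows with weight = 1 / (occurrences of record_id)."""
    counts = Counter(row["record_id"] for row in rows)
    return [{**row, "weight": 1.0 / counts[row["record_id"]]} for row in rows]


weighted = inverse_frequency_weights(matched_rows)
# Records that appear once get weight 1.0; record 7 gets 0.5 on each row.
```

With these weights, a record that was matched several times contributes the same total weight as a record matched once.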
- compare_categorical(return_table: bool = False, plot_result: bool = True)
Compares categorical variables between groups before and after matching. Delegates to visualization.compare_categorical. The visualization module is expected to use self.yvar (which is self.treatment_col).
- compare_continuous(save: bool = False, return_table: bool = False, plot_result: bool = True)
Compares continuous variables between groups before and after matching. Delegates to visualization.compare_continuous. The visualization module is expected to use self.yvar (which is self.treatment_col).
- fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100) → Dict[str, Any]
Fits a single propensity score model.
Internal helper method that calls pysmatch.modeling.fit_model.
- Parameters:
index (int) – An identifier for the model.
X (pd.DataFrame) – The feature matrix (covariates).
y (pd.Series) – The target variable (treatment indicator).
model_type (str) – The type of model to fit (e.g., 'linear', 'rf', 'gb').
balance (bool) – Whether the fitting process should aim to balance covariates.
max_iter (int, optional) – Maximum iterations for iterative solvers. Defaults to 100.
- Returns:
A dictionary containing the fitted model and its accuracy.
- Return type:
Dict[str, Any]
- fit_scores(balance: bool = True, nmodels: int | None = None, n_jobs: int = 1, model_type: str = 'linear', max_iter: int = 100, use_optuna: bool = False, n_trials: int = 10) → None
Fits propensity score model(s) to estimate scores.
- Parameters:
balance (bool, optional) – If True, attempts to create balanced models. Defaults to True.
nmodels (Optional[int], optional) – Number of models for ensemble. Auto-estimated if None and balance=True.
n_jobs (int, optional) – Number of parallel jobs for ensemble fitting. Defaults to 1.
model_type (str, optional) – Type of model: 'linear', 'rf', 'gb', 'knn', or 'tree' (CatBoost). Defaults to 'linear'.
max_iter (int, optional) – Max iterations for solver. Defaults to 100.
use_optuna (bool, optional) – If True, use Optuna for hyperparameter tuning. Defaults to False.
n_trials (int, optional) – Number of Optuna trials if use_optuna is True. Defaults to 10.
- match(threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False, exhaustive_matching: bool | None = None) → None
Performs matching based on estimated propensity scores.
- Parameters:
threshold (float, optional) – Threshold for score difference. Defaults to 0.001.
nmatches (int, optional) – Number of controls to match to each test unit. Defaults to 1.
method (str, optional) – The matching algorithm to use when exhaustive_matching is False. Passed to pysmatch.matching.perform_match. Defaults to 'min'.
replacement (bool, optional) – Whether controls can be matched to multiple test units when exhaustive_matching is False. Passed to pysmatch.matching.perform_match. Defaults to False.
exhaustive_matching (Optional[bool], optional) – If True, attempts to use a wider range of controls by prioritizing unused or less-used controls. If None, uses the instance’s default self.exhaustive_matching_default. Defaults to None.
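To make the roles of threshold, nmatches, and replacement concrete, here is a small standalone sketch of greedy nearest-score matching within a threshold. It is an illustration under stated assumptions, not the algorithm behind pysmatch.matching.perform_match: the function name `match_within_threshold`, the dict-based inputs, and the greedy processing order are all hypothetical.

```python
def match_within_threshold(test_scores, control_scores, threshold=0.001,
                           nmatches=1, replacement=False):
    """Greedy nearest-score matching (illustrative sketch only).

    test_scores and control_scores map record ids to propensity scores.
    Returns {test_id: [control_id, ...]}; test units with no control
    within `threshold` are left out of the result.
    """
    available = dict(control_scores)
    matches = {}
    for test_id, score in test_scores.items():
        # Controls whose score difference is within the threshold, closest first.
        candidates = sorted(
            (abs(score - c_score), c_id)
            for c_id, c_score in available.items()
            if abs(score - c_score) <= threshold
        )
        chosen = [c_id for _, c_id in candidates[:nmatches]]
        if not chosen:
            continue
        matches[test_id] = chosen
        if not replacement:
            # Without replacement, a used control cannot be matched again.
            for c_id in chosen:
                del available[c_id]
    return matches


# Hypothetical scores for three test units and three controls.
test_scores = {"t1": 0.5000, "t2": 0.5004, "t3": 0.9000}
control_scores = {"c1": 0.5002, "c2": 0.5009, "c3": 0.1000}

without_repl = match_within_threshold(test_scores, control_scores)
with_repl = match_within_threshold(test_scores, control_scores, replacement=True)
```

In this toy data, t3 has no control within 0.001 and is dropped in both runs. Without replacement, t2 falls back to c2 because c1 is already taken; with replacement, both t1 and t2 are paired with c1.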
- plot_matched_scores() → None
Plots the distribution of propensity scores after matching. Visualizes score overlap in self.matched_data.
- plot_scores() → None
Plots the distribution of propensity scores before matching. Visualizes score overlap between test (treated) and control groups.
- predict_scores() → None
Predicts propensity scores using the fitted model(s). Scores are stored in self.data['scores'].
- prep_prop_test(data: DataFrame, var: str) → list | None
Prepares a contingency table for the Chi-Square test.
- Parameters:
data (pd.DataFrame) – The DataFrame (original or matched) to use.
var (str) – The categorical variable name.
- Returns:
List-of-lists for scipy.stats.chi2_contingency. None if error.
- Return type:
Optional[list]
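As an illustration of the list-of-lists shape described above, the sketch below builds a group-by-level count table in plain Python. The helper name `contingency_table`, the `treated` column, and the row orientation (one inner list per group) are assumptions for this example; the layout pysmatch actually produces may differ.

```python
from collections import Counter


def contingency_table(rows, var, group_col="treated"):
    """Count rows per (group, level) and return one inner list per group."""
    counts = Counter((row[group_col], row[var]) for row in rows)
    levels = sorted({row[var] for row in rows})
    groups = sorted({row[group_col] for row in rows})
    # Counter returns 0 for (group, level) pairs that never occur.
    return [[counts[(group, level)] for level in levels] for group in groups]


# Hypothetical records: three treated (1) and three control (0) units.
rows = [
    {"treated": 1, "region": "north"},
    {"treated": 1, "region": "south"},
    {"treated": 1, "region": "north"},
    {"treated": 0, "region": "south"},
    {"treated": 0, "region": "south"},
    {"treated": 0, "region": "north"},
]

table = contingency_table(rows, "region")
# table == [[1, 2], [2, 1]]: rows are groups 0 and 1, columns are north and south.
```

A nested list of this shape can be passed as the observed table to scipy.stats.chi2_contingency.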
- prop_test(col: str) → Dict[str, Any] | None
Performs Chi-Square tests for a categorical variable before and after matching.
- Parameters:
col (str) – Name of the categorical column to test.
- Returns:
Dict with 'var', 'before' p-value, 'after' p-value. None if error.
- Return type:
Optional[Dict[str, Any]]
- record_frequency() → DataFrame
Calculates the frequency of each original control record in the matched dataset when exhaustive matching is used, or the frequency of match_ids otherwise.
- Returns:
DataFrame with frequencies.
- Return type:
pd.DataFrame
- tune_threshold(method: str, nmatches: int = 1, rng: ndarray | None = None) → None
Evaluates matching retention across a range of threshold values.
- Parameters:
method (str) – Matching method ('min', 'nn', 'radius') for evaluation.
nmatches (int, optional) – Number of matches for 'nn'/'min'. Defaults to 1.
rng (Optional[np.ndarray], optional) – Threshold values to test. Defaults to np.arange(0, 0.001, 0.0001).
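The idea of checking retention over a grid of thresholds can be sketched in plain Python. This example assumes that "retention" means the share of test units with at least one control score within the threshold; that definition, the function name `retention_by_threshold`, and the sample scores are assumptions for illustration, not a description of pysmatch's internals.

```python
def retention_by_threshold(test_scores, control_scores, thresholds):
    """Return (threshold, share of test scores with a control within it) pairs."""
    results = []
    for threshold in thresholds:
        retained = sum(
            1 for t in test_scores
            if any(abs(t - c) <= threshold for c in control_scores)
        )
        results.append((threshold, retained / len(test_scores)))
    return results


# Ten thresholds from 0 to 0.0009, mirroring np.arange(0, 0.001, 0.0001).
thresholds = [i * 0.0001 for i in range(10)]

# Hypothetical propensity scores.
test_scores = [0.30, 0.50, 0.70, 0.90]
control_scores = [0.30, 0.50025, 0.70075, 0.10]

results = retention_by_threshold(test_scores, control_scores, thresholds)
rates = [rate for _, rate in results]
```

In this toy data, retention rises from 0.25 at a threshold of 0 to 0.75 at 0.0008, which shows the trade-off the method is meant to expose: a tighter threshold gives closer matches but keeps fewer test units.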