pysmatch.Matcher

class pysmatch.Matcher.Matcher(test: DataFrame, control: DataFrame, yvar: str, formula: str | None = None, exclude: List[str] | None = None)[source]

Bases: object

A class to perform propensity score matching (PSM).

This class encapsulates the entire PSM workflow, including propensity score estimation, matching, and balance assessment.

data

The input DataFrame containing treatment, outcome, and covariates.

Type:

pd.DataFrame

treatment

The name of the treatment column.

Type:

str

outcome

The name of the outcome column.

Type:

str

covariates

A list of covariate column names.

Type:

list

exclude

A list of columns to exclude from calculations (often includes treatment and outcome).

Type:

list

scores

Propensity scores estimated for each observation.

Type:

pd.Series

matched_data

DataFrame containing the matched pairs/groups.

Type:

pd.DataFrame

model_fit

The fitted propensity score model object.

Type:

object

balance_stats

Statistics assessing covariate balance before and after matching.

Type:

pd.DataFrame

n_matches

The number of matches to find for each treated unit (used in some matching methods).

Type:

int

method

The matching method used (e.g., “nearest”, “optimal”, “radius”).

Type:

str

assign_weight_vector() → None[source]

Assigns inverse frequency weights to records in the matched dataset.

Calculates weights as 1 / count, where count is the number of times an original record (identified by record_id) appears in the matched dataset. This is often used in analyses after matching with replacement to account for controls matched multiple times. The weights are added as a ‘weight’ column to self.matched_data.

Requires match() to have been run and matched_data to contain ‘record_id’.
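
The weighting logic can be sketched with plain pandas. The `record_id` and `weight` column names follow the description above; the data itself is illustrative:

```python
import pandas as pd

# Illustrative matched dataset: control record 101 was matched twice
# (matching with replacement), so it appears in two matched pairs.
matched = pd.DataFrame({
    "record_id": [1, 101, 2, 101, 3, 102],
    "match_id":  [0, 0, 1, 1, 2, 2],
})

# Inverse-frequency weight: 1 / (number of times the record appears).
counts = matched["record_id"].map(matched["record_id"].value_counts())
matched["weight"] = 1.0 / counts

# Record 101 appears twice, so each of its rows gets weight 0.5.
```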

compare_categorical(return_table: bool = False, plot_result: bool = True)[source]

Compares categorical variables between groups before and after matching.

Delegates the comparison logic and plotting to visualization.compare_categorical. Typically calculates and displays differences in proportions or performs Chi-Square tests for all categorical covariates found in self.xvars.

Parameters:
  • return_table (bool, optional) – If True, returns the comparison results as a DataFrame. Defaults to False.

  • plot_result (bool, optional) – If True, generates and displays plots summarizing the balance for categorical variables. Defaults to True.

Returns:

If return_table is True, returns a DataFrame containing the comparison statistics (e.g., p-values before/after matching). Otherwise, returns None.

Return type:

Optional[pd.DataFrame]

compare_continuous(save: bool = False, return_table: bool = False, plot_result: bool = True)[source]

Compares continuous variables between groups before and after matching.

Delegates the comparison logic and plotting to visualization.compare_continuous. Typically calculates and displays standardized mean differences (SMD) or performs t-tests for all continuous covariates found in self.xvars.

Parameters:
  • save (bool, optional) – Whether to save any generated plots (functionality depends on the implementation in visualization.compare_continuous). Defaults to False.

  • return_table (bool, optional) – If True, returns the comparison results as a DataFrame. Defaults to False.

  • plot_result (bool, optional) – If True, generates and displays plots (e.g., Love plot) summarizing the balance. Defaults to True.

Returns:

If return_table is True, returns a DataFrame containing the comparison statistics (e.g., SMD before/after matching). Otherwise, returns None.

Return type:

Optional[pd.DataFrame]
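
As a sketch of the kind of statistic such a comparison reports, the standardized mean difference (SMD) for a single covariate can be computed directly with NumPy (the data here is illustrative):

```python
import numpy as np

def smd(treated: np.ndarray, control: np.ndarray) -> float:
    """Standardized mean difference with a pooled standard deviation."""
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(0)
age_treated = rng.normal(52, 8, 200)   # treated group skews older
age_control = rng.normal(48, 8, 300)

# |SMD| > 0.1 is a common rule of thumb for meaningful imbalance.
print(round(smd(age_treated, age_control), 2))
```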

fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100) → Dict[str, Any][source]

Fits a single propensity score model.

Internal helper method that calls pysmatch.modeling.fit_model. This is typically used within fit_scores, especially when fitting multiple models for balancing or ensembling.

Parameters:
  • index (int) – An identifier for the model (e.g., its index in an ensemble).

  • X (pd.DataFrame) – The feature matrix (covariates).

  • y (pd.Series) – The target variable (treatment indicator).

  • model_type (str) – The type of model to fit (e.g., ‘linear’, ‘rf’, ‘gb’).

  • balance (bool) – Whether the fitting process should aim to balance covariates (e.g., by undersampling the majority class or using class weights).

  • max_iter (int, optional) – Maximum iterations for iterative solvers (like logistic regression). Defaults to 100.

Returns:

A dictionary containing the fitted model object under the key ‘model’ and its accuracy under the key ‘accuracy’.

Return type:

Dict[str, Any]

fit_scores(balance: bool = True, nmodels: int | None = None, n_jobs: int = 1, model_type: str = 'linear', max_iter: int = 100, use_optuna: bool = False, n_trials: int = 10) → None[source]

Fits propensity score model(s) to estimate scores.

Supports single model fitting, ensemble fitting for balance (by undersampling the majority class across multiple models), or hyperparameter tuning using Optuna.

Parameters:
  • balance (bool, optional) – If True, attempts to create balanced models. If nmodels is greater than 1, this typically involves fitting multiple models on undersampled majority data. If nmodels is 1, it might involve using class weights or other balancing techniques within the single model fit. Defaults to True.

  • nmodels (Optional[int], optional) – The number of models to fit in an ensemble. If None and balance is True, it’s estimated based on the majority/minority class ratio. If None and balance is False, it defaults to 1. Ignored if use_optuna is True. Defaults to None.

  • n_jobs (int, optional) – The number of parallel jobs to run when fitting multiple models (nmodels > 1). Uses ThreadPool. Defaults to 1.

  • model_type (str, optional) – The type of classification model to use for propensity score estimation (e.g., ‘linear’ for Logistic Regression, ‘rf’ for Random Forest, ‘gb’ for Gradient Boosting). Passed to fit_model. Defaults to ‘linear’.

  • max_iter (int, optional) – Maximum iterations for the solver in iterative models like Logistic Regression. Passed to fit_model. Defaults to 100.

  • use_optuna (bool, optional) – If True, uses Optuna for hyperparameter tuning instead of fitting nmodels. nmodels is ignored. Defaults to False.

  • n_trials (int, optional) – The number of trials for Optuna optimization if use_optuna is True. Defaults to 10.

Returns:

Models and accuracies are stored in self.models and self.model_accuracy.

Propensity scores are calculated and stored later via predict_scores().

Return type:

None
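
The balanced-ensemble idea described above can be sketched with plain pandas: each ensemble member sees every minority-class row plus an equally sized random draw from the majority class. The column names and data are illustrative, not the library's internals:

```python
import pandas as pd

df = pd.DataFrame({
    "treated": [1] * 20 + [0] * 100,   # 20 treated vs. 100 controls
    "x": range(120),
})

minority = df[df["treated"] == 1]
majority = df[df["treated"] == 0]

# One balanced training set per ensemble member; each member would then
# fit its own propensity model, and the scores would later be averaged.
nmodels = 5
subsamples = [
    pd.concat([minority, majority.sample(n=len(minority), random_state=i)])
    for i in range(nmodels)
]
```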

match(threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False) → None[source]

Performs matching based on estimated propensity scores.

Parameters:
  • threshold (float, optional) – The maximum allowable propensity-score distance (caliper) between a treated unit and a control unit for them to be matched. Defaults to 0.001.

  • nmatches (int, optional) – The number of control units to match to each treated unit. Defaults to 1.

  • method (str, optional) – The matching algorithm to use (e.g., “min”, “nn”, “radius”). Defaults to “min”.

  • replacement (bool, optional) – Whether control units can be matched multiple times (matching with replacement). Defaults to False.

Returns:

None. The matched treated and control units are stored in self.matched_data.

Return type:

None

Raises:
  • RuntimeError – If propensity scores have not been estimated yet.

  • ValueError – If an invalid matching method is specified.
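
A minimal sketch of score-based greedy matching under a caliper, without replacement. This illustrates the idea only and is not pysmatch's internal algorithm:

```python
import numpy as np

def greedy_match(treated_scores, control_scores, threshold=0.001):
    """Greedily pair each treated unit with its closest unused control."""
    pairs = []
    used = set()
    for i, t in enumerate(treated_scores):
        dists = np.abs(np.asarray(control_scores, dtype=float) - t)
        dists[list(used)] = np.inf          # controls without replacement
        j = int(np.argmin(dists))
        if dists[j] <= threshold:           # enforce the caliper
            pairs.append((i, j))
            used.add(j)
    return pairs

pairs = greedy_match([0.30, 0.50], [0.2995, 0.31, 0.5008], threshold=0.001)
```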

plot_matched_scores() → None[source]

Plots the distribution of propensity scores after matching.

Visualizes the score overlap between test and control groups specifically within the self.matched_data. Requires match() to have been run.

plot_scores() → None[source]

Plots the distribution of propensity scores before matching.

Visualizes the overlap of scores between the test (treated) and control groups in the original (unmatched) data. Requires scores to be calculated first.

predict_scores() → None[source]

Predicts propensity scores using the fitted model(s).

If multiple models were fitted (ensemble), the scores are averaged across models. The predicted scores are added to the self.data DataFrame as a ‘scores’ column.

Returns:

Scores are stored in self.data[‘scores’].

Return type:

None

Raises:

RuntimeError – If fit_scores() has not been called successfully yet (no models exist).
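
When several models have been fitted, averaging their predicted probabilities is straightforward; a sketch with stand-in per-model score arrays:

```python
import numpy as np
import pandas as pd

# Stand-in predicted probabilities from three ensemble members,
# one column per observation.
per_model_scores = np.array([
    [0.20, 0.80, 0.55],
    [0.30, 0.70, 0.45],
    [0.25, 0.75, 0.50],
])

data = pd.DataFrame({"treated": [0, 1, 1]})
# Average across models, as predict_scores() does for an ensemble.
data["scores"] = per_model_scores.mean(axis=0)
```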

prep_prop_test(data: DataFrame, var: str) → list[source]

Prepares a contingency table for the Chi-Square test.

Creates a cross-tabulation of the specified variable (var) against the treatment variable (self.yvar) from the given DataFrame. Handles potential missing categories by ensuring both treatment groups (0 and 1) are present as columns, filled with 0 counts if necessary.

Parameters:
  • data (pd.DataFrame) – The DataFrame (either original or matched) to use.

  • var (str) – The categorical variable name.

Returns:

A list-of-lists representation of the contingency table, suitable for scipy.stats.chi2_contingency. Returns None if the input data is empty or the variable is missing.

Return type:

Optional[list]
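
The table preparation can be sketched with pd.crosstab; reindexing the columns guarantees both treatment groups are present even when one lacks a category. Column and variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "treated": [0, 0, 0, 1, 1],
    "region":  ["north", "south", "north", "north", "north"],
})

# Cross-tabulate the variable against treatment; fill any missing
# treatment column with zeros so the table is always two columns wide.
table = (
    pd.crosstab(df["region"], df["treated"])
    .reindex(columns=[0, 1], fill_value=0)
)
contingency = table.values.tolist()  # suitable for scipy.stats.chi2_contingency
```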

prop_test(col: str) → Dict[str, Any] | None[source]

Performs Chi-Square tests for a categorical variable before and after matching.

Compares the distribution of a categorical variable (col) between the test and control groups in both the original (self.data) and matched (self.matched_data) datasets using the Chi-Square test of independence.

Parameters:

col (str) – The name of the categorical column to test. The method checks if the column is likely categorical (not continuous) and not in self.exclude.

Returns:

A dictionary containing the variable name (‘var’), the p-value from the Chi-Square test before matching (‘before’), and the p-value after matching (‘after’). Returns None if the variable is continuous, excluded, or if tests fail.

Return type:

Optional[Dict[str, Any]]

record_frequency() → DataFrame[source]

Calculates the frequency of each original record in the matched dataset.

Useful when matching with replacement, as control units might appear multiple times. Requires match() to have been run successfully. Original records are identified by the ‘record_id’ column of the matched data, the same identifier used by assign_weight_vector().

Returns:

A DataFrame with columns ‘record_id’ and ‘n_records’ (the frequency count), or an empty DataFrame if matching hasn’t been done.

Return type:

pd.DataFrame
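
The frequency count itself is a short pandas expression over the ‘record_id’ column; a sketch with illustrative matched data:

```python
import pandas as pd

matched = pd.DataFrame({"record_id": [1, 101, 2, 101, 3, 101]})

# One row per original record with its appearance count, mirroring the
# ‘record_id’ / ‘n_records’ columns described above.
freq = (
    matched["record_id"]
    .value_counts()
    .rename_axis("record_id")
    .reset_index(name="n_records")
)
```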

tune_threshold(method: str, nmatches: int = 1, rng: ndarray | None = None) → None[source]

Evaluates matching retention across a range of threshold values.

Performs matching repeatedly for different threshold values and plots the proportion of the minority group retained at each threshold. This helps in selecting an appropriate threshold/caliper value.

Parameters:
  • method (str) – The matching method to use (e.g., ‘min’, ‘nn’, ‘radius’) for each threshold evaluation. Passed to matching.tune_threshold.

  • nmatches (int, optional) – The number of matches to seek (relevant for ‘nn’/’min’). Defaults to 1.

  • rng (Optional[np.ndarray], optional) – A NumPy array specifying the sequence of threshold values to test. If None, a default range (0 to 0.001 by 0.0001) is used. Defaults to None.
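
The retention curve can be sketched directly: for each threshold, count the fraction of treated (minority) units that have at least one control within that distance. This simplifies the actual matching step and uses illustrative scores:

```python
import numpy as np

# Default-style threshold range: 0 to 0.001 in steps of 0.0001.
rng_vals = np.arange(0.0, 0.0011, 0.0001)

treated_scores = np.array([0.30, 0.50, 0.70])
control_scores = np.array([0.3002, 0.5009, 0.9])

# Proportion of treated units with any control inside each threshold.
retention = [
    float(np.mean([
        np.min(np.abs(control_scores - t)) <= thr for t in treated_scores
    ]))
    for thr in rng_vals
]
# Retention can only grow as the threshold loosens; plotting it against
# rng_vals helps pick a caliper that keeps enough of the minority group.
```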