pysmatch.Matcher

class pysmatch.Matcher.Matcher(test: DataFrame, control: DataFrame, yvar: str, formula: str | None = None, exclude: List[str] | None = None, exhaustive_matching_default: bool = False)

Bases: object

A class to perform propensity score matching (PSM).

This class encapsulates the entire PSM workflow, including propensity score estimation, matching, and balance assessment.

data

The input DataFrame containing treatment, outcome, covariates, and record_id.

Type:

pd.DataFrame

treatment_col

The name of the treatment column.

Type:

str

yvar

The name of the treatment column. This is the same value as treatment_col, kept for compatibility with other modules.

Type:

str

test

The processed test/treated group DataFrame.

Type:

pd.DataFrame

control

The processed control group DataFrame.

Type:

pd.DataFrame

exclude

A list of columns to exclude from calculations.

Type:

list

scores

Propensity scores estimated for each observation, stored in data['scores'].

Type:

pd.Series

matched_data

DataFrame containing the matched pairs/groups.

Type:

pd.DataFrame

models

List of fitted propensity score model objects.

Type:

List[Any]

model_accuracy

List of accuracies for the fitted models.

Type:

List[float]

exhaustive_matching_default

Default behavior for exhaustive matching for the instance.

Type:

bool

assign_weight_vector() -> None

Assigns inverse frequency weights to records in the matched dataset. Each weight is 1/count, where count is the number of times the original record (identified by record_id) appears in the matched dataset. The weights are added as a ‘weight’ column.
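The weighting rule can be sketched in plain Python. This is an illustration of the 1/count idea only, not pysmatch's code: the row dictionaries, the match_id values, and the helper name inverse_frequency_weights are hypothetical.

```python
from collections import Counter

# Hypothetical matched rows: each dict stands for one row of a matched dataset.
matched_rows = [
    {"record_id": 1, "match_id": 0},
    {"record_id": 7, "match_id": 0},
    {"record_id": 2, "match_id": 1},
    {"record_id": 7, "match_id": 1},  # control 7 is reused in a second match
]

def inverse_frequency_weights(rows):
    """Return copies of the rows with weight = 1 / (occurrences of record_id)."""
    counts = Counter(row["record_id"] for row in rows)
    return [{**row, "weight": 1 / counts[row["record_id"]]} for row in rows]

weighted = inverse_frequency_weights(matched_rows)
```

A record that appears twice receives a weight of 0.5 in each of its rows, so its total contribution to a weighted analysis stays at one record.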

compare_categorical(return_table: bool = False, plot_result: bool = True)

Compares categorical variables between groups before and after matching. Delegates to visualization.compare_categorical. The visualization module is expected to use self.yvar (which is self.treatment_col).

compare_continuous(save: bool = False, return_table: bool = False, plot_result: bool = True)

Compares continuous variables between groups before and after matching. Delegates to visualization.compare_continuous. The visualization module is expected to use self.yvar (which is self.treatment_col).

fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100) -> Dict[str, Any]

Fits a single propensity score model.

Internal helper method that calls pysmatch.modeling.fit_model.

Parameters:
  • index (int) – An identifier for the model.

  • X (pd.DataFrame) – The feature matrix (covariates).

  • y (pd.Series) – The target variable (treatment indicator).

  • model_type (str) – The type of model to fit (e.g., ‘linear’, ‘rf’, ‘gb’).

  • balance (bool) – Whether the fitting process should aim to balance covariates.

  • max_iter (int, optional) – Maximum iterations for iterative solvers. Defaults to 100.

Returns:

A dictionary containing the fitted model and its accuracy.

Return type:

Dict[str, Any]

fit_scores(balance: bool = True, nmodels: int | None = None, n_jobs: int = 1, model_type: str = 'linear', max_iter: int = 100, use_optuna: bool = False, n_trials: int = 10) -> None

Fits propensity score model(s) to estimate scores.

Parameters:
  • balance (bool, optional) – If True, attempts to create balanced models. Defaults to True.

  • nmodels (Optional[int], optional) – Number of models for ensemble. Auto-estimated if None and balance=True.

  • n_jobs (int, optional) – Number of parallel jobs for ensemble fitting. Defaults to 1.

  • model_type (str, optional) – Type of model: ‘linear’, ‘rf’, ‘gb’, ‘knn’, or ‘tree’ (CatBoost). Defaults to ‘linear’.

  • max_iter (int, optional) – Max iterations for solver. Defaults to 100.

  • use_optuna (bool, optional) – If True, use Optuna for hyperparameter tuning. Defaults to False.

  • n_trials (int, optional) – Number of Optuna trials if use_optuna is True. Defaults to 10.
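One common way to obtain balanced models in an ensemble is to train each model on a sample with equal numbers of test and control records. The sketch below illustrates that idea under the assumption that balance=True behaves similarly; the helper balanced_samples, its seed argument, and the record ids are hypothetical and are not part of pysmatch.

```python
import random

def balanced_samples(test_ids, control_ids, nmodels, seed=0):
    """Draw `nmodels` training samples with equal numbers of test and
    control ids by undersampling the larger group without replacement."""
    rng = random.Random(seed)
    size = min(len(test_ids), len(control_ids))
    samples = []
    for _ in range(nmodels):
        samples.append({
            "test": rng.sample(test_ids, size),
            "control": rng.sample(control_ids, size),
        })
    return samples

test_ids = list(range(5))            # 5 hypothetical treated records
control_ids = list(range(100, 150))  # 50 hypothetical control records
samples = balanced_samples(test_ids, control_ids, nmodels=3)
```

Each of the three samples holds five test ids and five control ids, so a model fitted on any one of them sees both classes equally often.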

match(threshold: float = 0.001, nmatches: int = 1, method: str = 'min', replacement: bool = False, exhaustive_matching: bool | None = None) -> None

Performs matching based on estimated propensity scores.

Parameters:
  • threshold (float, optional) – Threshold for score difference. Defaults to 0.001.

  • nmatches (int, optional) – Number of controls to match to each test unit. Defaults to 1.

  • method (str, optional) – The matching algorithm to use when exhaustive_matching is False. Passed to pysmatch.matching.perform_match. Defaults to ‘min’.

  • replacement (bool, optional) – Whether controls can be matched to multiple test units when exhaustive_matching is False. Passed to pysmatch.matching.perform_match. Defaults to False.

  • exhaustive_matching (Optional[bool], optional) – If True, attempts to use a wider range of controls by prioritizing unused or less-used controls. If None, uses the instance’s default self.exhaustive_matching_default. Defaults to None.
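The interaction of threshold, nmatches, and replacement can be illustrated with a small sketch. It is not pysmatch's implementation: greedy_match is a hypothetical helper, and the rule it applies (greedy, closest score first, within the threshold) is an assumption chosen for clarity.

```python
def greedy_match(test_scores, control_scores, threshold=0.001, nmatches=1,
                 replacement=False):
    """Match each test unit to its closest controls within `threshold`.

    test_scores and control_scores map a record id to a propensity score.
    Returns {test_id: [control_id, ...]}; test units with no control in
    range are left out of the result.
    """
    available = dict(control_scores)
    matches = {}
    for test_id, score in test_scores.items():
        # Candidate controls within the threshold, closest first.
        in_range = sorted(
            (abs(score - c_score), c_id)
            for c_id, c_score in available.items()
            if abs(score - c_score) <= threshold
        )
        chosen = [c_id for _, c_id in in_range[:nmatches]]
        if not chosen:
            continue
        matches[test_id] = chosen
        if not replacement:
            # Without replacement, a used control cannot be matched again.
            for c_id in chosen:
                del available[c_id]
    return matches

test_scores = {"t1": 0.30, "t2": 0.31, "t3": 0.90}
control_scores = {"c1": 0.30, "c2": 0.32, "c3": 0.50}
pairs = greedy_match(test_scores, control_scores, threshold=0.02)
```

In this toy data, t3 has no control within 0.02 of its score and is dropped, which is the kind of loss that a wider threshold or matching with replacement trades against match quality.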

plot_matched_scores() -> None

Plots the distribution of propensity scores after matching. Visualizes score overlap in self.matched_data.

plot_scores() -> None

Plots the distribution of propensity scores before matching. Visualizes score overlap between test (treated) and control groups.

predict_scores() -> None

Predicts propensity scores using the fitted model(s). Scores are stored in self.data['scores'].
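When several models are fitted, one plausible way to combine them is to average their predicted probabilities. The sketch below assumes that approach, which this page does not confirm; make_logistic and ensemble_scores are hypothetical helpers, and the coefficients are made up.

```python
import math

def make_logistic(intercept, slope):
    """Return a toy one-covariate logistic model as a callable."""
    return lambda x: 1 / (1 + math.exp(-(intercept + slope * x)))

def ensemble_scores(models, covariate_values):
    """Average the predicted probability of treatment across all models."""
    return [sum(m(x) for m in models) / len(models) for x in covariate_values]

# Two hypothetical fitted models with made-up coefficients.
models = [make_logistic(-1.0, 0.5), make_logistic(1.0, -0.5)]
scores = ensemble_scores(models, [2.0, 0.0])
```

Because each model returns a probability, the averaged score also stays between 0 and 1.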

prep_prop_test(data: DataFrame, var: str) -> list | None

Prepares a contingency table for the Chi-Square test.

Parameters:
  • data (pd.DataFrame) – The DataFrame (original or matched) to use.

  • var (str) – The categorical variable name.

Returns:

A list of lists suitable for scipy.stats.chi2_contingency, or None if an error occurs.

Return type:

Optional[list]
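A list-of-lists contingency table can be built as in the sketch below. The row and column orientation, the group_col name ‘treated’, and the helper contingency_table are assumptions made for illustration; pysmatch may lay the table out differently.

```python
from collections import Counter

def contingency_table(rows, var, group_col="treated"):
    """Count rows per (category, group) and return one inner list per category.

    Each inner list is [count in control group, count in test group].
    Returns None when the variable or group column is missing from any row.
    """
    if any(var not in row or group_col not in row for row in rows):
        return None
    counts = Counter((row[var], row[group_col]) for row in rows)
    categories = sorted({row[var] for row in rows})
    return [[counts[(cat, 0)], counts[(cat, 1)]] for cat in categories]

# Hypothetical records: 'treated' is 1 for the test group, 0 for control.
rows = [
    {"region": "north", "treated": 1},
    {"region": "north", "treated": 0},
    {"region": "south", "treated": 0},
    {"region": "south", "treated": 0},
    {"region": "north", "treated": 1},
]
table = contingency_table(rows, "region")
```

The result has one row per category (here ‘north’, then ‘south’) and one column per group, which is the shape a chi-square test of independence expects.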

prop_test(col: str) -> Dict[str, Any] | None

Performs Chi-Square tests for a categorical variable before and after matching.

Parameters:

col (str) – Name of the categorical column to test.

Returns:

A dict with the keys ‘var’, ‘before’ (p-value before matching), and ‘after’ (p-value after matching), or None if an error occurs.

Return type:

Optional[Dict[str, Any]]
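The before/after comparison can be sketched with a hand-written Pearson chi-square test for a 2x2 table. This is an illustration only: chi_square_2x2 is a hypothetical helper, the counts are made up, and no continuity correction is applied, so the p-values can differ from those of scipy.stats.chi2_contingency, which applies Yates' correction to 2x2 tables by default.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table
    (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

before = [[30, 10], [20, 40]]  # hypothetical counts before matching
after = [[20, 20], [21, 19]]   # hypothetical counts after matching
result = {
    "var": "region",
    "before": chi_square_2x2(before)[1],
    "after": chi_square_2x2(after)[1],
}
```

A small ‘before’ p-value together with a large ‘after’ p-value suggests that the groups differed on the variable before matching and that matching reduced the difference.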

record_frequency() -> DataFrame

Calculates the frequency of each original control record in the matched dataset when exhaustive matching is used, or the frequency of match_ids otherwise.

Returns:

DataFrame with frequencies.

Return type:

pd.DataFrame

tune_threshold(method: str, nmatches: int = 1, rng: ndarray | None = None) -> None

Evaluates matching retention across a range of threshold values.

Parameters:
  • method (str) – Matching method (‘min’, ‘nn’, ‘radius’) for evaluation.

  • nmatches (int, optional) – Number of matches for ‘nn’/‘min’. Defaults to 1.

  • rng (Optional[np.ndarray], optional) – Threshold values to test. Defaults to np.arange(0, 0.001, 0.0001).
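Retention across thresholds can be sketched as the share of test units that still have at least one control within each threshold. This definition of retention is an assumption, retention_by_threshold is a hypothetical helper, and the thresholds are far wider than the documented default grid so that the toy numbers stay readable.

```python
def retention_by_threshold(test_scores, control_scores, thresholds):
    """For each threshold, return (threshold, share of test units that have
    at least one control whose score lies within that threshold)."""
    results = []
    for threshold in thresholds:
        retained = sum(
            1 for t in test_scores
            if any(abs(t - c) <= threshold for c in control_scores)
        )
        results.append((threshold, retained / len(test_scores)))
    return results

# Hypothetical propensity scores for four test and four control units.
test_scores = [0.20, 0.40, 0.60, 0.80]
control_scores = [0.20, 0.41, 0.65, 0.95]
curve = retention_by_threshold(test_scores, control_scores, [0.0, 0.02, 0.1, 0.2])
```

Retention rises as the threshold widens, so the resulting curve helps pick the smallest threshold that keeps an acceptable share of the test group.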