pysmatch.modeling

pysmatch.modeling.fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100, random_state: int = 42) → Dict[str, Any]

Fits a single propensity score model with preprocessing.

This function trains a specified classification model (Logistic Regression,

CatBoost, or KNN) to predict the binary target variable y based on covariates X.

It includes steps for:

  1. Splitting data into a training set and a (currently unused) test set.

  2. Optionally balancing the training data using RandomOverSampler.

  3. Preprocessing features using StandardScaler for numerical and OneHotEncoder

for categorical variables via a ColumnTransformer.

  4. Training the specified model within a scikit-learn Pipeline that combines

preprocessing and classification.

  5. Calculating the accuracy of the fitted model on the (potentially resampled)

training data.
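
The steps above can be sketched with scikit-learn alone. This is a minimal illustration, not the pysmatch implementation: the function name `fit_sketch`, the 70/30 split, the column-type detection, and the toy data are all assumptions, and a small NumPy-based oversampler stands in for RandomOverSampler so the example needs no extra dependency. Only the 'linear' (Logistic Regression) case is shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def fit_sketch(X, y, balance=True, max_iter=100, seed=42):
    """Hypothetical helper mirroring the documented steps (not pysmatch code)."""
    # 1. Split off a test set; it is not used afterwards, as the text notes.
    #    The 30% test size is an assumption.
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # 2. Optionally oversample the minority class with replacement.
    #    Stand-in for imblearn's RandomOverSampler.
    if balance:
        counts = y_train.value_counts()
        n_extra = int(counts.max() - counts.min())
        if n_extra > 0:
            minority_idx = y_train.index[y_train == counts.idxmin()].to_numpy()
            rng = np.random.default_rng(seed)
            extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
            X_train = pd.concat([X_train, X_train.loc[extra_idx]])
            y_train = pd.concat([y_train, y_train.loc[extra_idx]])

    # 3. Scale numeric columns and one-hot encode the remaining columns.
    num_cols = X_train.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X_train.columns if c not in num_cols]
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    # 4. Combine preprocessing and the classifier in one Pipeline and fit it.
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=max_iter, random_state=seed)),
    ])
    pipeline.fit(X_train, y_train)

    # 5. Accuracy on the (possibly resampled) training data.
    accuracy = pipeline.score(X_train, y_train)
    return {"model": pipeline, "accuracy": accuracy}


# Toy data: one numeric and one categorical covariate (made up for the demo).
demo_rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": demo_rng.normal(40, 10, n),
    "region": demo_rng.choice(["north", "south", "east"], n),
})
y = pd.Series((X["age"] + demo_rng.normal(0, 5, n) > 45).astype(int))

result = fit_sketch(X, y)
```

Because the accuracy in step 5 is measured on the same rows the model was fitted on, it tends to be optimistic and is best read as a sanity check rather than an estimate of out-of-sample performance.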

Args:

index (int): An identifier for this model instance (e.g., used for setting

random states in cross-validation or ensembles).

X (pd.DataFrame): DataFrame containing the covariate features.

y (pd.Series): Series containing the binary target variable (0 or 1).

model_type (str): The type of classifier to use. Options: ‘linear’

(Logistic Regression), ‘tree’ (CatBoostClassifier),

‘knn’ (KNeighborsClassifier).

balance (bool): If True, applies RandomOverSampler to the training data

to balance the classes before fitting the model.

max_iter (int, optional): Maximum number of iterations for solvers in

iterative models (like Logistic Regression).

Defaults to 100.

random_state (int, optional): Seed for reproducibility in data splitting,

resampling (if balance=True), and model initialization

(where applicable, like LogisticRegression, CatBoost).

Defaults to 42. Passing None leaves randomness unfixed.

Note: Uses `index` for train_test_split and sampler seed.

Returns:

Dict[str, Any]: A dictionary containing:

  • ‘model’: The fitted scikit-learn Pipeline object (preprocessor + classifier).

  • ‘accuracy’: The accuracy score of the fitted model on the (potentially

resampled) training data.

Raises:

ValueError: If an invalid model_type is provided (not ‘linear’, ‘tree’, or ‘knn’).

ImportError: If required libraries (sklearn, imblearn, catboost) are not installed.
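
Since the returned 'model' is a fitted scikit-learn Pipeline, propensity scores can be read from its `predict_proba` output. The sketch below builds a stand-in dictionary with the same two keys instead of calling `fit_model`, so the data, the pipeline steps, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy covariates and binary treatment indicator (made up for the demo).
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 100),
    "income": rng.normal(50, 15, 100),
})
y = pd.Series((X["age"] > 40).astype(int))

# Stand-in for the dictionary described under Returns (not produced by pysmatch).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)
result = {"model": pipeline, "accuracy": pipeline.score(X, y)}

# The propensity score is the predicted probability of class 1.
model = result["model"]
treated_col = list(model.classes_).index(1)  # column of predict_proba for y == 1
scores = model.predict_proba(X)[:, treated_col]
```

Looking up the column through `classes_` avoids assuming that class 1 is always the second column.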

pysmatch.modeling.optuna_tuner(X: DataFrame, y: Series, model_type: str, n_trials: int = 10, balance: bool = True, random_state: int = 42) → Dict[str, Any]

Performs hyperparameter optimization using Optuna for a specified model type.

This function uses Optuna to search for the best hyperparameters for a given

model_type (‘linear’, ‘tree’, ‘knn’). It involves:

  1. Splitting data into training and validation sets.

  2. Optionally balancing the training set using RandomOverSampler.

  3. Defining an Optuna objective function that:

  • Takes an Optuna trial object.

  • Defines hyperparameter search spaces using trial.suggest_*.

  • Creates a preprocessing pipeline (StandardScaler + OneHotEncoder).

  • Initializes the specified model with suggested hyperparameters.

  • Combines preprocessing and model into a scikit-learn Pipeline.

  • Fits the pipeline on the (potentially resampled) training data.

  • Evaluates the pipeline on the validation set and returns the accuracy score.

  4. Running the Optuna study for n_trials to maximize the objective (accuracy).

  5. Retrieving the best hyperparameters found by the study.

  6. Training a final pipeline on the (potentially resampled) training data using the

best hyperparameters found.

  7. Returning the final fitted pipeline, its validation accuracy during the study,

and the best parameters.

Args:

X (pd.DataFrame): DataFrame containing the covariate features.

y (pd.Series): Series containing the binary target variable (0 or 1).

model_type (str): The type of classifier to tune. Options: ‘linear’

(Logistic Regression), ‘tree’ (CatBoostClassifier),

‘knn’ (KNeighborsClassifier).

n_trials (int, optional): The number of optimization trials Optuna should run.

Defaults to 10.

balance (bool, optional): If True, applies RandomOverSampler to the training

data within each Optuna trial before fitting the model.

Defaults to True.

random_state (int, optional): Seed for reproducibility in data splitting,

resampling, and model initialization (where applicable).

Defaults to 42.

Returns:

Dict[str, Any]: A dictionary containing:

  • ‘model’: The final scikit-learn Pipeline object fitted with the best

hyperparameters found by Optuna on the training data.

  • ‘accuracy’: The best accuracy score achieved on the validation set during

the Optuna study.

  • ‘best_params’: A dictionary containing the best hyperparameters found.

Raises:

ValueError: If an invalid model_type is provided for tuning.

ImportError: If required libraries (optuna, sklearn, imblearn, catboost) are not installed.