pysmatch.modeling

pysmatch.modeling.fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100, random_state: int = 42) → Dict[str, Any]

Fits a single propensity score model with preprocessing.

This function trains a specified classification model (Logistic Regression, CatBoost, or KNN) to predict the binary target variable y based on covariates X. It includes steps for:

1. Splitting data into training and a (currently unused) test set.
2. Optionally balancing the training data using RandomOverSampler.
3. Preprocessing features using StandardScaler for numerical and OneHotEncoder for categorical variables via a ColumnTransformer.
4. Training the specified model within a scikit-learn Pipeline that combines preprocessing and classification.
5. Calculating the accuracy of the fitted model on the (potentially resampled) training data.
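
The steps listed for fit_model can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not pysmatch's actual implementation: the function name sketch_fit_model, the 30% test split, the toy column names, and the step names inside the Pipeline are all hypothetical. RandomOverSampler is replaced here by a plain pandas resample-with-replacement so the block has no imbalanced-learn dependency, and a single seed is used throughout for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def sketch_fit_model(X, y, balance=True, max_iter=100, random_state=42):
    """Hypothetical re-creation of the documented steps (not pysmatch code)."""
    # 1. Hold out a test split; it is not used afterwards, as the docs note.
    #    The 30% size is an assumption.
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y
    )

    # 2. Optional balancing: duplicate random minority rows until the classes
    #    are equal in size (a stand-in for imblearn's RandomOverSampler).
    if balance:
        counts = y_train.value_counts()
        deficit = int(counts.max() - counts.min())
        if deficit > 0:
            minority_rows = y_train[y_train == counts.idxmin()]
            extra = minority_rows.sample(
                n=deficit, replace=True, random_state=random_state
            ).index
            X_train = pd.concat([X_train, X_train.loc[extra]])
            y_train = pd.concat([y_train, y_train.loc[extra]])

    # 3. Scale numeric columns and one-hot encode everything else.
    num_cols = X_train.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X_train.columns if c not in num_cols]
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    # 4. Preprocessing and classifier combined in one Pipeline
    #    (only the 'linear' model type is sketched here).
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=max_iter,
                                          random_state=random_state)),
    ])
    pipeline.fit(X_train, y_train)

    # 5. Accuracy on the (possibly resampled) training data.
    return {"model": pipeline, "accuracy": pipeline.score(X_train, y_train)}


# Toy data with hypothetical covariates and an imbalanced treatment flag.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "income": rng.normal(40, 8, n),
    "region": rng.choice(["north", "south"], n),
})
y = pd.Series((rng.random(n) < 0.25).astype(int), name="treated")

result = sketch_fit_model(X, y)
propensity = result["model"].predict_proba(X)[:, 1]  # estimated P(y = 1 | X)
```

Because the accuracy is measured on the same data the model was fitted on, it is best read as a sanity check rather than an estimate of out-of-sample performance.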

Parameters:

index (int) – An identifier for this model instance (e.g., used for setting random states in cross-validation or ensembles).
X (pd.DataFrame) – DataFrame containing the covariate features.
y (pd.Series) – Series containing the binary target variable (0 or 1).
model_type (str) – The type of classifier to use. Options: ‘linear’ (Logistic Regression), ‘tree’ (CatBoostClassifier), ‘knn’ (KNeighborsClassifier).
balance (bool) – If True, applies RandomOverSampler to the training data to balance the classes before fitting the model.
max_iter (int, optional) – Maximum number of iterations for solvers in iterative models (like Logistic Regression). Defaults to 100.
random_state (int, optional) – Seed for reproducibility in model initialization where applicable (e.g., LogisticRegression, CatBoost). Defaults to 42, as shown in the signature. If None is passed, randomness is not fixed. Note: train_test_split and the sampler (if balance=True) are seeded with `index`, not with random_state.

Returns:

A dictionary containing:

‘model’: The fitted scikit-learn Pipeline object (preprocessor + classifier).
‘accuracy’: The accuracy score of the fitted model on the (potentially resampled) training data.

Return type:

Dict[str, Any]

Raises:

ValueError – If an invalid model_type is provided (not ‘linear’, ‘tree’, or ‘knn’).
ImportError – If required libraries (sklearn, imblearn, catboost) are not installed.

pysmatch.modeling.optuna_tuner(X: DataFrame, y: Series, model_type: str, n_trials: int = 10, balance: bool = True, random_state: int = 42) → Dict[str, Any]

Performs hyperparameter optimization using Optuna for a specified model type.

This function uses Optuna to search for the best hyperparameters for a given model_type (‘linear’, ‘tree’, ‘knn’). It involves:

1. Splitting data into training and validation sets.
2. Optionally balancing the training set using RandomOverSampler.
3. Defining an Optuna objective function that:
   - Takes an Optuna trial object.
   - Defines hyperparameter search spaces using trial.suggest_*.
   - Creates a preprocessing pipeline (StandardScaler + OneHotEncoder).
   - Initializes the specified model with suggested hyperparameters.
   - Combines preprocessing and model into a scikit-learn Pipeline.
   - Fits the pipeline on the (potentially resampled) training data.
   - Evaluates the pipeline on the validation set and returns the accuracy score.
4. Running the Optuna study for n_trials to maximize the objective (accuracy).
5. Retrieving the best hyperparameters found by the study.
6. Training a final pipeline on the (potentially resampled) training data using the best hyperparameters found.
7. Returning the final fitted pipeline, its validation accuracy during the study, and the best parameters.