pysmatch.modeling

pysmatch.modeling.fit_model(index: int, X: DataFrame, y: Series, model_type: str, balance: bool, max_iter: int = 100, random_state: int = 42) Dict[str, Any][source]

Fits a single propensity score model with preprocessing.

This function trains a specified classification model (Logistic Regression, CatBoost, or KNN) to predict the binary target variable y based on covariates X. It includes steps for: 1. Splitting data into training and a (currently unused) test set. 2. Optionally balancing the training data using RandomOverSampler. 3. Preprocessing features using StandardScaler for numerical and OneHotEncoder

for categorical variables via a ColumnTransformer.

  1. Training the specified model within a scikit-learn Pipeline that combines preprocessing and classification.

  2. Calculating the accuracy of the fitted model on the (potentially resampled) training data.

Parameters:
  • index (int) – An identifier for this model instance (e.g., used for setting random states in cross-validation or ensembles).

  • X (pd.DataFrame) – DataFrame containing the covariate features.

  • y (pd.Series) – Series containing the binary target variable (0 or 1).

  • model_type (str) – The type of classifier to use. Options: ‘linear’ (Logistic Regression), ‘tree’ (CatBoostClassifier), ‘knn’ (KNeighborsClassifier).

  • balance (bool) – If True, applies RandomOverSampler to the training data to balance the classes before fitting the model.

  • max_iter (int, optional) – Maximum number of iterations for solvers in iterative models (like Logistic Regression). Defaults to 100.

  • random_state (Optional[int], optional) – Seed for reproducibility in data splitting, resampling (if balance=True), and model initialization (where applicable, like LogisticRegression, CatBoost). Defaults to None. If None, randomness is not fixed. Note: Uses `index` for train_test_split and sampler seed.

Returns:

A dictionary containing:
  • ’model’: The fitted scikit-learn Pipeline object (preprocessor + classifier).

  • ’accuracy’: The accuracy score of the fitted model on the (potentially

    resampled) training data.

Return type:

Dict[str, Any]

Raises:
  • ValueError – If an invalid model_type is provided (not ‘linear’, ‘tree’, or ‘knn’).

  • ImportError – If required libraries (sklearn, imblearn, catboost) are not installed.

pysmatch.modeling.optuna_tuner(X: DataFrame, y: Series, model_type: str, n_trials: int = 10, balance: bool = True, random_state: int = 42) Dict[str, Any][source]

Performs hyperparameter optimization using Optuna for a specified model type.

This function uses Optuna to search for the best hyperparameters for a given model_type (‘linear’, ‘tree’, ‘knn’). It involves: 1. Splitting data into training and validation sets. 2. Optionally balancing the training set using RandomOverSampler. 3. Defining an Optuna objective function that:

  • Takes an Optuna trial object.

  • Defines hyperparameter search spaces using trial.suggest_*.

  • Creates a preprocessing pipeline (StandardScaler + OneHotEncoder).

  • Initializes the specified model with suggested hyperparameters.

  • Combines preprocessing and model into a scikit-learn Pipeline.

  • Fits the pipeline on the (potentially resampled) training data.

  • Evaluates the pipeline on the validation set and returns the accuracy score.

  1. Running the Optuna study for n_trials to maximize the objective (accuracy).

  2. Retrieving the best hyperparameters found by the study.

  3. Training a final pipeline on the (potentially resampled) training data using the best hyperparameters found.

  4. Returning the final fitted pipeline, its validation accuracy during the study, and the best parameters.

Parameters:
  • X (pd.DataFrame) – DataFrame containing the covariate features.

  • y (pd.Series) – Series containing the binary target variable (0 or 1).

  • model_type (str) – The type of classifier to tune. Options: ‘linear’ (Logistic Regression), ‘tree’ (CatBoostClassifier), ‘knn’ (KNeighborsClassifier).

  • n_trials (int, optional) – The number of optimization trials Optuna should run. Defaults to 10.

  • balance (bool, optional) – If True, applies RandomOverSampler to the training data within each Optuna trial before fitting the model. Defaults to True.

  • random_state (int, optional) – Seed for reproducibility in data splitting, resampling, and model initialization (where applicable). Defaults to 42.

Returns:

A dictionary containing:
  • ’model’: The final scikit-learn Pipeline object fitted with the best

    hyperparameters found by Optuna on the training data.

  • ’accuracy’: The best accuracy score achieved on the validation set during

    the Optuna study.

  • ’best_params’: A dictionary containing the best hyperparameters found.

Return type:

Dict[str, Any]

Raises:
  • ValueError – If an invalid model_type is provided for tuning.

  • ImportError – If required libraries (optuna, sklearn, imblearn, catboost) are not installed.