pysmatch.utils
- pysmatch.utils.chi2_distance(t: ndarray, c: ndarray, bins: int | str = 'auto') → float
Computes the Chi-square distance between the distributions of two samples.
This function calculates a measure of distance between the histograms of two samples (t and c). It first creates histograms with common bins, then computes the Chi-square statistic based on the frequencies in each bin. A small epsilon is added to the denominator to avoid division by zero.
- Parameters:
t (np.ndarray) – Array containing data for the first sample.
c (np.ndarray) – Array containing data for the second sample.
bins (Union[int, str], optional) – The number of bins or the binning strategy to use for np.histogram. Defaults to ‘auto’.
- Returns:
The calculated Chi-square distance. Returns 0.0 if inputs are empty or identical after binning.
- Return type:
float
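The description above can be sketched as follows. This is a minimal illustration, not the pysmatch implementation: the function name `chi2_distance_sketch`, the `eps` parameter, the normalisation to relative frequencies, and the symmetric form `0.5 * sum((p - q)^2 / (p + q + eps))` are all assumptions, since the reference text does not state the exact formula.

```python
import numpy as np


def chi2_distance_sketch(t, c, bins="auto", eps=1e-10):
    """Illustrative chi-square distance between two samples (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    c = np.asarray(c, dtype=float)
    if t.size == 0 or c.size == 0:
        return 0.0
    # Compute bin edges from the pooled data so both histograms share them.
    edges = np.histogram_bin_edges(np.concatenate([t, c]), bins=bins)
    t_counts, _ = np.histogram(t, bins=edges)
    c_counts, _ = np.histogram(c, bins=edges)
    # Assumption: compare relative frequencies rather than raw counts.
    p = t_counts / t_counts.sum()
    q = c_counts / c_counts.sum()
    # Assumption: symmetric chi-square form; eps keeps empty bins from dividing by zero.
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + eps)))
```

Under these assumptions, two identical samples give a distance of 0.0 and two samples with no shared bins give a distance close to 1.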
- pysmatch.utils.drop_static_cols(df: DataFrame, yvar: str, cols: List[str] | None = None) → DataFrame
Drops columns from a DataFrame that contain only a single unique value (static columns).
It identifies columns with only one unique value among the specified cols (excluding yvar) and removes them in place from the DataFrame. Prints the names of dropped columns.
- Parameters:
df (pd.DataFrame) – The input DataFrame to modify.
yvar (str) – The name of the target variable column, which should not be dropped even if static.
cols (Optional[List[str]], optional) – A list of column names to check for static values. If None, checks all columns in the DataFrame (excluding yvar). Defaults to None.
- Returns:
The input DataFrame df modified in place (static columns removed). It returns the same DataFrame object that was passed in.
- Return type:
pd.DataFrame
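A minimal sketch of the behavior described above, written with plain pandas calls. The name `drop_static_cols_sketch` and the wording of the printed message are assumptions; the actual pysmatch code may differ in both.

```python
import pandas as pd


def drop_static_cols_sketch(df, yvar, cols=None):
    """Drop single-valued columns in place (hypothetical helper)."""
    candidates = list(df.columns) if cols is None else list(cols)
    # A column is static when it holds exactly one unique value; yvar is always kept.
    static = [col for col in candidates
              if col != yvar and df[col].nunique() == 1]
    if static:
        df.drop(columns=static, inplace=True)
        print("Dropped static columns:", static)  # message text is an assumption
    return df


frame = pd.DataFrame({"y": [1, 1, 1], "a": [5, 5, 5], "b": [1, 2, 3]})
result = drop_static_cols_sketch(frame, yvar="y")
```

Because the drop happens in place, `result` and `frame` refer to the same object, and only the `y` and `b` columns remain.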
- pysmatch.utils.grouped_permutation_test(f, t: ndarray, c: ndarray, n_samples: int = 1000) → tuple
Performs a permutation test for a given statistic function f.
Evaluates the significance of an observed test statistic calculated by function f applied to samples t and c. It does this by:
1. Calculating the observed statistic truth = f(t, c).
2. Repeatedly (n_samples times):
   a. Combining t and c.
   b. Shuffling the combined data.
   c. Splitting the shuffled data back into two samples of the original sizes.
   d. Calculating the statistic f on the permuted samples.
   e. Counting how often the permuted statistic is greater than or equal to truth.
3. Calculating the p-value as the proportion of permuted statistics that met the condition in 2e.
- Parameters:
f (Callable[[np.ndarray, np.ndarray], float]) – A function that takes two numpy arrays (representing two groups) and returns a single float value (the test statistic).
t (np.ndarray) – Array containing data for the first group.
c (np.ndarray) – Array containing data for the second group.
n_samples (int, optional) – The number of permutation iterations to perform. Defaults to 1000.
- Returns:
A tuple containing:
- p_value (float): The estimated p-value from the permutation test.
- truth (float): The observed test statistic calculated on the original samples.
Returns (1.0, np.nan) or similar if inputs are invalid.
- Return type:
tuple
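The permutation procedure described for this function can be sketched in a few lines of NumPy. This is an illustration only: the name `permutation_test_sketch` and the `seed` parameter are assumptions added for reproducibility, and input validation is omitted.

```python
import numpy as np


def permutation_test_sketch(f, t, c, n_samples=1000, seed=None):
    """Illustrative permutation test returning (p_value, truth)."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra parameter
    t = np.asarray(t)
    c = np.asarray(c)
    truth = f(t, c)                    # observed statistic
    pooled = np.concatenate([t, c])    # combine both groups
    hits = 0
    for _ in range(n_samples):
        shuffled = rng.permutation(pooled)
        # Split back into two groups of the original sizes.
        perm_t, perm_c = shuffled[:t.size], shuffled[t.size:]
        if f(perm_t, perm_c) >= truth:
            hits += 1
    return hits / n_samples, truth


def abs_mean_diff(a, b):
    """Example statistic: absolute difference in group means."""
    return abs(a.mean() - b.mean())
```

With a statistic such as `abs_mean_diff`, two well-separated groups should yield a p-value near 0, while two identical constant groups yield exactly 1.0 because every permuted statistic ties with the observed one.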
- pysmatch.utils.is_continuous(colname: str, df: DataFrame) → bool
Checks if a specified column in a DataFrame has a numeric data type.
Uses pandas.api.types.is_numeric_dtype to determine if the column contains numerical data (integers or floats).
- Parameters:
colname (str) – The name of the column to check.
df (pd.DataFrame) – The DataFrame containing the column.
- Returns:
True if the column exists in the DataFrame and its dtype is numeric, False otherwise (column doesn’t exist or dtype is non-numeric).
- Return type:
bool
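A short sketch of the check described above, assuming it amounts to a membership test followed by `pandas.api.types.is_numeric_dtype`. The name `is_continuous_sketch` is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_continuous_sketch(colname, df):
    """Return True only if the column exists and has a numeric dtype."""
    return colname in df.columns and bool(is_numeric_dtype(df[colname]))


people = pd.DataFrame({"age": [31, 45], "name": ["Ann", "Bo"]})
checks = {col: is_continuous_sketch(col, people) for col in ["age", "name", "missing"]}
```

Here `checks` maps `age` to True and both `name` and `missing` to False. Note that a numeric dtype does not guarantee a continuous variable in the statistical sense: an integer-coded category would also pass this test.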
- pysmatch.utils.ks_boot(tr: ndarray, co: ndarray, nboots: int = 1000) → float
Performs a bootstrap Kolmogorov-Smirnov (KS) test to estimate the p-value.
This function estimates the p-value for the two-sample KS test by comparing the observed KS statistic between the two input samples (tr and co) against a distribution of KS statistics obtained from bootstrap samples drawn under the null hypothesis (that both samples come from the same distribution).
- Parameters:
tr (np.ndarray) – Array containing data for the first sample (e.g., treatment group).
co (np.ndarray) – Array containing data for the second sample (e.g., control group).
nboots (int, optional) – The number of bootstrap iterations to perform. Defaults to 1000.
- Returns:
The estimated p-value, calculated as the proportion of bootstrap KS statistics that are greater than or equal to the observed KS statistic.
- Return type:
float
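One way to realise the bootstrap described above is sketched below. It rests on several assumptions: the null distribution is approximated by resampling with replacement from the pooled data, the KS statistic is computed by the hand-written helper `ks_statistic`, and the names `ks_boot_sketch` and `seed` are hypothetical. The resampling scheme pysmatch actually uses is not stated in this reference.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def ks_boot_sketch(tr, co, nboots=1000, seed=None):
    """Illustrative bootstrap p-value for the two-sample KS test."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra parameter
    tr = np.asarray(tr, dtype=float)
    co = np.asarray(co, dtype=float)
    observed = ks_statistic(tr, co)
    pooled = np.concatenate([tr, co])
    hits = 0
    for _ in range(nboots):
        # Assumption: draw from the pooled sample to mimic the null hypothesis.
        resample = rng.choice(pooled, size=pooled.size, replace=True)
        boot_stat = ks_statistic(resample[:tr.size], resample[tr.size:])
        if boot_stat >= observed:
            hits += 1
    return hits / nboots
```

In this sketch a small returned value suggests the two samples are unlikely to share a distribution, and a value near 1 suggests no detectable difference.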
- pysmatch.utils.progress(i: int, n: int, prestr: str = '') → None
Displays a simple progress indicator in the console.
Prints a progress string like “[prefix]: i/n” to standard output, overwriting the previous line using the carriage return character `\r`.
- Parameters:
i (int) – The current step or item number (should be 1-based or adjusted).
n (int) – The total number of steps or items.
prestr (str, optional) – A prefix string to display before the progress count (e.g., “Processing file”). Defaults to ‘’.
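The carriage-return technique can be sketched as follows. The name `progress_sketch` is hypothetical, and the exact output format is an assumption based on the “[prefix]: i/n” pattern described above.

```python
import sys


def progress_sketch(i, n, prestr=""):
    """Rewrite the current console line with a 'prefix: i/n' counter."""
    # '\r' moves the cursor back to the start of the line without advancing to a
    # new one, so each call overwrites the text left by the previous call.
    sys.stdout.write(f"\r{prestr}: {i}/{n}")
    sys.stdout.flush()  # flush so the update appears immediately
```

A typical call pattern would be `progress_sketch(step, total, "Processing file")` inside a loop with a 1-based `step`, followed by a plain `print()` after the loop to move the cursor to a fresh line.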
- pysmatch.utils.std_diff(a: ndarray, b: ndarray) → tuple
Calculates standardized differences (median and mean) between two arrays.
Computes the difference in medians and means between arrays a and b, standardized by the pooled standard deviation of the combined data.
- Parameters:
a (np.ndarray) – Array containing data for the first group.
b (np.ndarray) – Array containing data for the second group.
- Returns:
A tuple containing:
- med_diff (float): The standardized median difference (median(a) - median(b)) / std(combined).
- mean_diff (float): The standardized mean difference (mean(a) - mean(b)) / std(combined).
Returns (0.0, 0.0) if the combined standard deviation is zero or if inputs are empty.
- Return type:
Tuple[float, float]
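The two formulas above can be sketched directly in NumPy. The name `std_diff_sketch` is hypothetical, and using the population standard deviation (NumPy's default `ddof=0`) for the combined data is an assumption, since the reference does not say which estimator is used.

```python
import numpy as np


def std_diff_sketch(a, b):
    """Illustrative standardized median and mean differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.size == 0 or b.size == 0:
        return 0.0, 0.0
    # Assumption: std of the concatenated data with ddof=0.
    sd = np.std(np.concatenate([a, b]))
    if sd == 0:
        return 0.0, 0.0
    med_diff = (np.median(a) - np.median(b)) / sd
    mean_diff = (np.mean(a) - np.mean(b)) / sd
    return float(med_diff), float(mean_diff)
```

For example, with `a = [0, 0]` and `b = [2, 2]` the combined standard deviation is 1, so both differences come out as -2.0 under these assumptions.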