端到端流程
数据准备
优先选择处理前可观测协变量,避免引入处理后的信息泄漏。
import pandas as pd
fields = [
"loan_amnt",
"funded_amnt",
"funded_amnt_inv",
"term",
"int_rate",
"installment",
"grade",
"sub_grade",
"loan_status",
]
raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()
初始化 Matcher
from pysmatch.Matcher import Matcher
matcher = Matcher(
test=test,
control=control,
yvar="is_default",
exclude=["loan_status"],
)
print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)
拟合倾向得分模型
fit_scores 支持三类模型:
linear(逻辑回归)knntree(CatBoost,需要安装pysmatch[tree])
matcher.fit_scores(
balance=True,
balance_strategy="over", # "over" 或 "under"
nmodels=10,
model_type="linear",
max_iter=200,
n_jobs=2,
)
print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))
Optuna 调参路径(训练单个最优模型):
# matcher.fit_scores(
# balance=True,
# model_type="tree",
# use_optuna=True,
# n_trials=20,
# )
预测并查看得分分布
matcher.predict_scores()
matcher.plot_scores()
预测后,matcher.data 会新增 scores 列。
调整阈值
import numpy as np
matcher.tune_threshold(
method="min",
nmatches=1,
rng=np.arange(0.0001, 0.0051, 0.0005),
)
在保留率与匹配质量之间选择合适阈值。
执行匹配
标准匹配:
matcher.match(
method="min",
nmatches=1,
threshold=0.001,
replacement=False,
exhaustive_matching=False,
)
matcher.plot_matched_scores()
详尽匹配:
matcher.match(
threshold=0.001,
nmatches=1,
exhaustive_matching=True,
)
检查匹配结果与权重
print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())