Introduction

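拍拍贷 (PPDai) is an internet finance company. In 2016 it held the "Magic Mirror Cup" (魔镜杯) risk algorithm competition and released real historical data for the first time. The goal was to find efficient, accurate prediction algorithms that give the company's investors a basis for their decisions and promote healthy, efficient internet finance.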

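Data

To protect user privacy, all data used in this project has been desensitized.
The data consists of three tables: Master, Log_Info, and Userupdate_Info.
Each row is one sample (one successfully funded loan), and each sample has more than 200 fields of various kinds.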
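1. Master

Main field	Description
Idx	Unique key of each loan; it matches the Idx in the other two files
UserInfo	Borrower profile fields
WeblogInfo	Web behavior fields
Education	Education and enrollment fields
ThirdParty_Info_PeriodN	Third-party data fields for time period N
SocialNetwork	Social network fields
ListingInfo	Date the loan was funded
target	Default label (1 = default, 0 = normal); the test set does not contain the target field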

train_master.info()
》
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Columns: 228 entries, Idx to ListingInfo
dtypes: float64(38), int64(170), object(20)
memory usage: 52.2+ MB

2. Log_Info
Borrower login records.

Main field	Description
Listinginfo1	Date the loan was funded
LogInfo1	Operation code
LogInfo2	Operation category
LogInfo3	Login date
Idx	Unique key of each loan

train_loginfo.info()
》
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580551 entries, 0 to 580550
Data columns (total 5 columns):
Idx           580551 non-null int64
Listinginfo1     580551 non-null object
LogInfo1        580551 non-null int64
LogInfo2        580551 non-null int64
LogInfo3        580551 non-null object
dtypes: int64(3), object(2)
memory usage: 22.1+ MB

3. Userupdate_Info
Borrower profile-update records.

Main field	Description
ListingInfo1	Date the loan was funded
UserupdateInfo1	Field that was modified
UserupdateInfo2	Modification date
Idx	Unique key of each loan

train_userinfo.info()
》
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372463 entries, 0 to 372462
Data columns (total 4 columns):
Idx                372463 non-null int64
ListingInfo1          372463 non-null object
UserupdateInfo1        372463 non-null object
UserupdateInfo2        372463 non-null object
dtypes: int64(1), object(3)
memory usage: 11.4+ MB

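Competition rules

Teams build a prediction model on the training set and use it to score the test set (the higher the score, the more likely the loan is to default).
Evaluation metric:
Definition: the competition uses AUC to judge model quality. AUC is the area under the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate (vertical axis) against the False Positive Rate (horizontal axis).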
$$AUC=\frac{\sum_{i} S_{i}}{M \times N}$$
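Here M is the number of positive samples, N is the number of negative samples, and M*N is the number of positive-negative pairs. $S_{i}$ is the score of the i-th positive-negative pair, defined as: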
$$S_{i}=\begin{cases}
1 & score_{i-p} > score_{i-n} \\
0.5 & score_{i-p} = score_{i-n} \\
0 & score_{i-p} < score_{i-n}
\end{cases}$$
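where
$score_{i-p}$ is the score the model gives to the positive sample of the pair,
$score_{i-n}$ is the score the model gives to the negative sample of the pair.
AUC lies in [0, 1]; higher is better.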
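The pairwise definition above can be checked by hand on a tiny example. The following is only an illustrative sketch: the function name pairwise_auc and the toy labels and scores are assumptions made for this example, not part of the competition code.

```python
def pairwise_auc(labels, scores):
    """AUC as the mean pair score S_i over all (positive, negative) pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # M positive samples
    neg = [s for y, s in zip(labels, scores) if y == 0]  # N negative samples
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0   # positive sample scored above the negative one
            elif p == n:
                total += 0.5   # tie
            # p < n contributes 0
    return total / (len(pos) * len(neg))


# Toy data (made up): 2 defaults (label 1) and 3 normal loans (label 0).
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.4, 0.2, 0.7]
auc = pairwise_auc(labels, scores)
print(auc)
```

Of the 2 x 3 = 6 pairs, four are ranked correctly, one is a tie, and one is ranked wrongly, so the sum of $S_{i}$ is 4.5. The double loop costs M*N comparisons, so it is meant for understanding the formula rather than for scoring 30000 samples; sklearn's roc_auc_score computes the same quantity.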

Data cleaning

Handling missing values

Dropping missing data

The Master table has relatively severe missing data.
Count the missing values per field:

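import pandas as pd

null_sum = train_master.isnull().sum()
null_sum = null_sum[null_sum != 0]
null_sum_df = pd.DataFrame(null_sum, columns=['num'])
# Divide by the actual row count (30000 here) instead of hard-coding it
null_sum_df['ratio'] = null_sum_df['num'] / len(train_master)
# Sort the fields by missing ratio, highest first
null_sum_df.sort_values(by='ratio', ascending=False, inplace=True)
null_sum_df[:10]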

	num	ratio
WeblogInfo_3	29030	0.967667
WeblogInfo_1	29030	0.967667
UserInfo_11	18909	0.630300
UserInfo_13	18909	0.630300
UserInfo_12	18909	0.630300
WeblogInfo_20	8050	0.268333
WeblogInfo_21	3074	0.102467
WeblogInfo_19	2963	0.098767
WeblogInfo_2	1658	0.055267
WeblogInfo_4	1651	0.055033

Drop the columns with severe missingness:

train_master.drop(['WeblogInfo_3', 'WeblogInfo_1', 'UserInfo_11', 'UserInfo_13', 'UserInfo_12', 'WeblogInfo_20'], axis=1, inplace=True)

Drop the rows with many missing values (5 or more missing fields):

record_nan=train_master.isnull().sum(axis=1).sort_values(ascending=False)
drop_record_index=[i for i in record_nan.loc[(record_nan>=5)].index]
print("Before train_master shape {}".format(train_master.shape))
train_master.drop(drop_record_index, inplace=True)
print("After train_master shape {}".format(train_master.shape))

Before train_master shape (30000, 222)
After train_master shape (29189, 222)

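Imputing missing values

Missing values should be filled according to what each field means; where no field description is available, fill according to the data distribution.
When few values are missing, fill with the most frequent value; when many are missing, fill with the mean.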

print('Before all nan num: {}'.format(train_master.isnull().sum().sum()))

# Missing locations get their own placeholder category (string literal kept from the original)
train_master.loc[train_master['UserInfo_2'].isnull(), 'UserInfo_2'] = '位置地点'
train_master.loc[train_master['UserInfo_4'].isnull(), 'UserInfo_4'] = '位置地点'

def fill_nan(f, method):
    """Fill the missing values of column f with its mode ('most') or its mean."""
    if method == "most":
        # value_counts() sorts by frequency, so index[0] is the most frequent value
        common_value = train_master[f].value_counts().index[0]
    else:
        common_value = train_master[f].mean()
    train_master.loc[train_master[f].isnull(), f] = common_value

# The method for each column was chosen empirically by inspecting train_master[f].value_counts()
fill_nan('UserInfo_1', 'most')
fill_nan('UserInfo_3', 'most')
fill_nan('WeblogInfo_2', 'most')
fill_nan('WeblogInfo_4', 'mean')
fill_nan('WeblogInfo_5', 'mean')
fill_nan('WeblogInfo_6', 'mean')
fill_nan('WeblogInfo_19', 'most')
fill_nan('WeblogInfo_21', 'most')

print('After all nan num: {}'.format(train_master.isnull().sum().sum()))

Before all nan num: 9808
After all nan num: 0

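Handling outliers

To handle outliers, the numerical features are first singled out through the feature classification described later in this post.
Outliers are then found by plotting.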

import seaborn as sbn
sbn.set(style='whitegrid')

import matplotlib.pyplot as plt
%matplotlib inline

melt = pd.melt(train_master, id_vars=['target'], value_vars=[f for f in numerical_features])
# FacetGrid draws one panel per variable, plotting its values against target
g = sbn.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sbn.stripplot, 'target', 'value', jitter=True, palette='muted')

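Each numerical feature is inspected in its plot, and the samples with outlying values are dropped.

29189 lines before drop
28074 lines after drop

After removing the numerical outliers, 28074 samples remain.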
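The post does not show the code that drops the outliers. A minimal sketch of one way to do it, assuming an upper bound per feature has been read off the plots by eye: the toy DataFrame, the column names feat_a and feat_b, and the bounds are all made up for illustration and are not the author's values.

```python
import pandas as pd

# Toy stand-in for train_master; column names and values are made up.
toy = pd.DataFrame({
    "target": [0, 0, 1, 0, 1, 0],
    "feat_a": [1.0, 2.0, 1.5, 250.0, 2.2, 1.8],
    "feat_b": [10, 12, 11, 9, 13, 900],
})

# Hypothetical upper bounds, as one might read them off the strip plots.
upper_bounds = {"feat_a": 100.0, "feat_b": 500}

print("{} lines before drop".format(toy.shape[0]))
keep = pd.Series(True, index=toy.index)
for col, bound in upper_bounds.items():
    # A row survives only if every listed feature is within its bound
    keep &= toy[col] <= bound
cleaned = toy[keep]
print("{} lines after drop".format(cleaned.shape[0]))
```

Building one boolean mask and filtering once keeps the row counts easy to report, in the same before/after style as the output above.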

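Feature classification

Features can be divided into binary, continuous, and categorical features.

Binary features
Mainly 0/1 features, i.e. the feature takes only two values: 0 or 1.
Continuous features
Features that take real-number values with no fixed number of distinct values, e.g. a distance, which ranges from 0 to positive infinity.
Categorical features
Features with a fixed number of possible values, e.g. the day of the week, which has only 7 possible values: Monday, Tuesday, ..., Sunday.

This part is mainly about feature conversion. Two kinds of conversion are used here:

Non-binary feature to binary feature
Check the value distribution of every feature; if the most frequent value accounts for more than 50% of a feature's non-null values, the feature can be converted to a binary feature.
Continuous feature to categorical feature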
When a continuous feature takes only a few distinct values, each value can be treated as its own class, i.e. the continuous feature is converted to a categorical feature.

# ---------------------- select binary features: start ----------------------
ratio_threshold = 0.5
binarized_features = []
binarized_features_most_freq_value = []

for f in train_master.columns:
    if f in ['target']:
        continue
    not_null_sum = train_master[f].notnull().sum()
    counts = train_master[f].value_counts()  # sorted by frequency, descending
    most_count = counts.iloc[0]
    most_value = counts.index[0]
    ratio = most_count / not_null_sum

    if ratio > ratio_threshold:
        binarized_features.append(f)
        binarized_features_most_freq_value.append(most_value)
# ---------------------- select binary features: end ------------------------

# ---------------------- select continuous features: start ------------------
numerical_features = [f for f in train_master.select_dtypes(exclude=['object']).columns
                      if f not in ['Idx', 'target'] and f not in binarized_features]
# ---------------------- select continuous features: end --------------------

# ---------------------- select categorical features: start -----------------
categorical_features = [f for f in train_master.select_dtypes(include=['object']).columns
                        if f not in ['Idx', 'target'] and f not in binarized_features]
# ---------------------- select categorical features: end -------------------

Data transformation

Feature conversion

# Convert the selected features to binary features:
# 0 = the most frequent value, 1 = any other value
for f, most_value in zip(binarized_features, binarized_features_most_freq_value):
    train_master['b_' + f] = 1
    train_master.loc[train_master[f] == most_value, 'b_' + f] = 0
    train_master.drop([f], axis=1, inplace=True)

# Convert continuous features with few distinct values to categorical features
import numpy as np

feature_unique_count = []
for f in numerical_features:
    # Note: np.count_nonzero ignores a unique value equal to 0, so this count
    # can be one less than train_master[f].nunique()
    feature_unique_count.append((np.count_nonzero(train_master[f].unique()), f))

for c, f in feature_unique_count:
    if c <= 10:
        print('{} moved from numerical to categorical'.format(f))
        numerical_features.remove(f)
        categorical_features.append(f)

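Log transformation

Many models fit better when the features are (approximately) normally distributed.
After the outliers are removed, the numerical features are still visibly concentrated on the left (right-skewed), so a log transformation can be used to even out their distributions.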
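The post does not include the transformation code itself. As a hedged illustration of the idea, the sketch below applies np.log1p to a made-up right-skewed column and compares the skewness before and after. The column name amount and its values are assumptions, and whether the author used log or log1p is not stated.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed feature: most values are small, one is very large.
toy = pd.DataFrame({"amount": [0.0, 1.0, 9.0, 99.0, 999.0]})

skew_before = toy["amount"].skew()
# log1p(x) = log(1 + x) is defined at 0; it assumes the feature is non-negative.
toy["log_amount"] = np.log1p(toy["amount"])
skew_after = toy["log_amount"].skew()

print("skew before: {:.3f}, after: {:.3f}".format(skew_before, skew_after))
```

On the real data the same call could be applied to the columns in numerical_features, provided they are non-negative; features with negative values would need a shift or a different transformation.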

Parsing dates

Parse the listing date and add the parsed time fields to the Master table:

import arrow
import numpy as np
import pandas as pd

def parse_date(date_str, str_format='YYYY/MM/DD'):
    d = arrow.get(date_str, str_format)
    # Stage of the month: days 1-10 -> 1, 11-20 -> 2, 21-30 -> 3 (day 31 -> 4)
    month_stage = (d.day - 1) // 10 + 1
    # int_timestamp assumes arrow >= 1.0; older versions expose d.timestamp as a property
    return (d.int_timestamp, d.year, d.month, d.day, d.week, d.isoweekday(), month_stage)

def parse_ListingInfo(date):
    d = parse_date(date, 'YYYY/M/D')
    return pd.Series(d,
                     index=['ListingInfo_timestamp', 'ListingInfo_year', 'ListingInfo_month',
                            'ListingInfo_day', 'ListingInfo_week', 'ListingInfo_isoweekday',
                            'ListingInfo_month_stage'],
                     dtype=np.int32)

ListingInfo_parsed = train_master_['ListingInfo'].apply(parse_ListingInfo)
print('before train_master_ shape {}'.format(train_master_.shape))
train_master_ = train_master_.merge(ListingInfo_parsed, how='left', left_index=True, right_index=True)
print('after train_master_ shape {}'.format(train_master_.shape))

After parsing, the Master table has 7 more columns:

before train_master_ shape (28074, 223)
after train_master_ shape (28074, 230)

Transforming the login records

Login records:

	Idx	Listinginfo1	LogInfo1	LogInfo2	LogInfo3
0	10001	2014-03-05	107	6	2014-02-20
1	10001	2014-03-05	107	6	2014-02-23
2	10001	2014-03-05	107	6	2014-02-24
3	10001	2014-03-05	107	6	2014-02-25
4	10001	2014-03-05	107	6	2014-02-27

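For each loan, compute the number of login records, the timestamps of successful transactions (borrowing, repayment, etc.), the intervals between logins, the latest login timestamp, and the counts of identical operations. This gives the transformed login features: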

	loginfo_num	loginfo_LogInfo1_unique_num	XXX	loginfo_LogInfo12_unique_num
0	3	26	XXX	8
1	5	11	XXX	4
2	8	125	XXX	13
3	12	199	XXX	11
4	16	15	XXX	7
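The aggregation code is not shown in the post. The sketch below shows how per-loan statistics of this kind could be computed with groupby. It uses the five Log_Info rows shown earlier plus two made-up rows for a second loan. The names loginfo_num and loginfo_LogInfo1_unique_num come from the table above; loginfo_active_days and loginfo_last_gap_days are hypothetical names chosen for this example.

```python
import pandas as pd

# Five rows from the Log_Info sample, plus two made-up rows for a second loan.
log = pd.DataFrame({
    "Idx": [10001] * 5 + [10002] * 2,
    "Listinginfo1": ["2014-03-05"] * 5 + ["2014-03-10"] * 2,
    "LogInfo1": [107, 107, 107, 107, 107, 4, 107],
    "LogInfo2": [6, 6, 6, 6, 6, 1, 6],
    "LogInfo3": ["2014-02-20", "2014-02-23", "2014-02-24", "2014-02-25",
                 "2014-02-27", "2014-03-01", "2014-03-08"],
})
log["Listinginfo1"] = pd.to_datetime(log["Listinginfo1"])
log["LogInfo3"] = pd.to_datetime(log["LogInfo3"])
# Days between each login and the date the loan was funded
log["days_before_listing"] = (log["Listinginfo1"] - log["LogInfo3"]).dt.days

login_features = log.groupby("Idx").agg(
    loginfo_num=("LogInfo3", "size"),                      # number of login records
    loginfo_LogInfo1_unique_num=("LogInfo1", "nunique"),   # distinct operation codes
    loginfo_active_days=("LogInfo3", "nunique"),           # hypothetical: distinct login days
    loginfo_last_gap_days=("days_before_listing", "min"),  # hypothetical: last login to funding
).reset_index()
print(login_features)
```

The result has one row per Idx, so it can be merged into the Master table on Idx.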

Transforming the profile-update records

Update records:

	Idx	ListingInfo1	UserupdateInfo1	UserupdateInfo2
0	10001	2014/03/05	_EducationId	2014/02/20
1	10001	2014/03/05	_HasBuyCar	2014/02/20
2	10001	2014/03/05	_LastUpdateDate	2014/02/20
3	10001	2014/03/05	_MarriageStatusId	2014/02/20
4	10001	2014/03/05	_MobilePhone	2014/02/20

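For each loan, compute the number of updates, the number of distinct fields changed, the number of distinct update times, the interval in days, the latest update timestamp, and separate counts per UserupdateInfo1 value. This gives the transformed update features: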

	Idx	userinfo_num	XXX
0	3	13	XXX
1	5	13	XXX
2	8	14	XXX
3	12	14	XXX
4	16	13	XXX
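As with the login records, the post gives only the result. A possible sketch of the per-loan counts, using the five Userupdate_Info rows shown earlier plus two made-up rows for a second loan: userinfo_num appears in the table above, while userinfo_unique_num, userinfo_time_num, and the per-field column names are assumptions made for this example.

```python
import pandas as pd

# Five rows from the Userupdate_Info sample, plus two made-up rows for a second loan.
upd = pd.DataFrame({
    "Idx": [10001] * 5 + [10002] * 2,
    "ListingInfo1": ["2014/03/05"] * 5 + ["2014/03/10"] * 2,
    "UserupdateInfo1": ["_EducationId", "_HasBuyCar", "_LastUpdateDate",
                        "_MarriageStatusId", "_MobilePhone",
                        "_MobilePhone", "_MobilePhone"],
    "UserupdateInfo2": ["2014/02/20"] * 5 + ["2014/03/01", "2014/03/09"],
})

base = upd.groupby("Idx").agg(
    userinfo_num=("UserupdateInfo1", "size"),            # number of update records
    userinfo_unique_num=("UserupdateInfo1", "nunique"),  # hypothetical: distinct fields changed
    userinfo_time_num=("UserupdateInfo2", "nunique"),    # hypothetical: distinct update dates
)
# One count column per modified field, e.g. userinfo_MobilePhone
per_field = pd.crosstab(upd["Idx"], upd["UserupdateInfo1"]).add_prefix("userinfo")
update_features = base.join(per_field).reset_index()
print(update_features)
```

pd.crosstab fills fields a loan never touched with 0, which is what "separate counts per UserupdateInfo1 value" needs.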

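Encoding categorical features

pandas' get_dummies() function:
Parameter:
columns: the names of the DataFrame columns to encode. If columns is None, all columns with object or category dtype are converted.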

drop_columns= ['Idx', 'ListingInfo', 'UserInfo_20', 'UserInfo_19','UserInfo_8','UserInfo_7','UserInfo_4','UserInfo_2','ListingInfo_timestamp', 'loginfo_last_day_timestamp', 
'userinfo_last_day_timestamp']
train_master_ = train_master_.drop(drop_columns, axis=1)

dummy_columns = categorical_features.copy()
dummy_columns.extend(['ListingInfo_year', 'ListingInfo_month', 'ListingInfo_day', 'ListingInfo_week', 
                      'ListingInfo_isoweekday', 'ListingInfo_month_stage'])
finally_dummy_columns = []

for c in dummy_columns:
    if c not in drop_columns:
        finally_dummy_columns.append(c)

print('before get_dummies train_master_ shape {}'.format(train_master_.shape))
train_master_ = pd.get_dummies(train_master_, columns=finally_dummy_columns)
print('after get_dummies train_master_ shape {}'.format(train_master_.shape))
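A small toy example may make the effect of the columns argument clearer. ListingInfo_month_stage is a column name from this post; the amount column and all values are made up. Because the stage is stored as an integer, it is only encoded when it is listed explicitly in columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "ListingInfo_month_stage": [1, 2, 3, 1],  # integer-coded category
    "amount": [100.0, 250.0, 80.0, 120.0],    # made-up numeric column, left untouched
})

# Only the listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["ListingInfo_month_stage"])
print(encoded.columns.tolist())
print(encoded.shape)
```

Each distinct value becomes its own indicator column, which is why the column count of train_master_ grows after get_dummies.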

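The above covers most of the data preparation before modeling. It also includes merging the three tables, dropping irrelevant fields, and adding an at_home field, which captures hometown information that may be related to default.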

Modeling

Standardization

Build the training features and labels, and standardize the features:

from sklearn.preprocessing import StandardScaler

X_train = train_master_.drop(['target'], axis=1)
X_train = StandardScaler().fit_transform(X_train)
y_train = train_master_['target']
print(X_train.shape, y_train.shape)

Shapes of the training features and labels:

(28074, 443) (28074,)

Cross-validation

StratifiedKFold cross-validation is used so that every fold keeps the same class proportions as the full training set.

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3, shuffle=True)  # stratified sampling
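To see what stratification does, the sketch below splits made-up, imbalanced labels into 3 folds and counts the positives in each test fold. The labels and the random_state value are assumptions for illustration; the original call leaves random_state unset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 90 normal loans (0) and 9 defaults (1).
y = np.array([0] * 90 + [1] * 9)
X = np.zeros((len(y), 1))  # the feature values do not affect the split

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(positives_per_fold)
```

Every test fold receives the same share of the rare class, which matters here because defaults are a small minority of the loans.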

Evaluating the classifiers

AUC, accuracy, and recall:

from sklearn.model_selection import cross_val_score

def estimate(estimator, name='estimator'):
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()

    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))

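Algorithms

The following classifiers are evaluated:

Linear classifier
LogisticRegression
Generalized linear classifier
RidgeClassifier
Ensemble classifiers
RandomForestClassifier, AdaBoostClassifier, XGBClassifier
Support vector machine classifiers
SVC, LinearSVC

Based on this evaluation, the three best-performing models (LogisticRegression, XGBClassifier, and AdaBoostClassifier) are selected and finally combined by voting.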

from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

estimators = []
estimators.append(('LogisticRegression', LogisticRegression()))
estimators.append(('XGBClassifier', XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic')))
estimators.append(('AdaBoostClassifier', AdaBoostClassifier()))

# Soft voting averages the predicted class probabilities of the three models
voting = VotingClassifier(estimators=estimators, voting='soft')
estimate(voting, 'voting')

Final result:

voting: auc:0.800663, recall:0.001285, accuracy:0.944433
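The gap between a reasonable AUC and a near-zero recall is worth a note. AUC depends only on how the scores rank defaults against normal loans, while recall and accuracy are computed from hard predictions at the default 0.5 cut. One plausible reading, not verified on the real data, is that the model ranks loans fairly well but almost never outputs a default probability above 0.5. The toy scores below are made up to show that effect.

```python
# Made-up scores: defaults are ranked above normal loans, but every score is below 0.5.
labels = [0] * 18 + [1] * 2
scores = [0.05] * 18 + [0.30, 0.40]


def recall_and_accuracy(labels, scores, threshold):
    """Recall and accuracy of the hard predictions score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    true_pos = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return true_pos / sum(labels), correct / len(labels)


recall_default, acc_default = recall_and_accuracy(labels, scores, 0.5)
recall_lowered, acc_lowered = recall_and_accuracy(labels, scores, 0.2)
print(recall_default, acc_default)
print(recall_lowered, acc_lowered)
```

At the 0.5 cut no default is flagged, yet accuracy stays high because most loans are normal. Lowering the cut recovers the defaults without changing the ranking, which is why the competition scores the ranking (AUC) rather than hard labels.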