We use the publicly available seismic bumps dataset to train a classification model that predicts hazardous seismic activity. The data consists of 18 attributes, both categorical and numerical, in addition to a binary decision attribute that denotes the occurrence of high-energy seismic bumps. The decision classes are highly imbalanced, with positive instances accounting for less than 7% of the total.
We begin by loading the data and assigning column names.
import arff
import pandas as pd

def load_data(path, colnames):
    """Load an ARFF file and return a dataframe with named columns."""
    r = arff.load(open(path, 'rb'))
    df = pd.DataFrame(r['data'])
    df.columns = colnames
    return df

colnames = ['seismic', 'seismoacoustic', 'shift', 'genergy', 'gpuls',
            'gdenergy', 'gdpuls', 'ghazard', 'nbumps', 'nbumps2',
            'nbumps3', 'nbumps4', 'nbumps5', 'nbumps6', 'nbumps7',
            'nbumps89', 'energy', 'maxenergy', 'outcome']

df = load_data('/Users/nargiss/Downloads/seismic-bumps.arff', colnames)
df.head()
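The class imbalance noted above is easy to confirm before any preprocessing; a quick check on the raw decision attribute:

# Share of each decision class; positives should account for less than 7%
print df['outcome'].value_counts() / float(len(df))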
We then encode the categorical attributes, transforming their labels into integer values between 0 and n_classes-1.
from sklearn import preprocessing

def preprocess_features(df, cols):
    """Transform categorical features into integer codes."""
    le = preprocessing.LabelEncoder()
    for clmn in cols:
        df[clmn] = le.fit_transform(df[clmn])
    return df

cat_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard', 'outcome']
df = preprocess_features(df, cat_cols)

# Check that all categorical columns have more than one category
for clmn in cat_cols:
    print df[clmn].unique()
Next we plot the count and real-valued attributes, together with the decision attribute, in a pairplot and visually explore whether any attributes appear to have no impact on the decision attribute. We notice that 'nbumps6', 'nbumps7' and 'nbumps89' each contain a single unique value and therefore contribute nothing to the classification.
import seaborn as sns

real_cols = [clmn for clmn in colnames if clmn not in cat_cols] + ['outcome']
%matplotlib inline
g = sns.pairplot(df[real_cols])
for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)

# Drop columns that contain only one unique value
df[['nbumps6', 'nbumps7', 'nbumps89']].sum()
df = df.drop(['nbumps6', 'nbumps7', 'nbumps89'], axis=1)
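The same check can also be done programmatically rather than visually; a minimal sketch (applied to the dataframe before the drop above, it would flag the same three columns):

# Columns with a single unique value carry no information for the classifier
constant_cols = [clmn for clmn in df.columns if df[clmn].nunique() == 1]
print constant_cols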
Since we have a high number of features as well as highly imbalanced classes, we will use an SVM with the RBF kernel for this classification problem (a short illustration of the kernel follows the list below). Before we fit the data, we perform the following steps:
- Scale all features to values between -1 and 1
- Split the data into training and testing sets
- Perform a grid search with cross-validation to find good tuning parameters
- Define scoring functions
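As a quick illustration of the kernel we are about to tune: the RBF kernel scores the similarity of two points as K(x, y) = exp(-gamma * ||x - y||^2), so gamma controls how local the decision boundary is, while C controls the misclassification penalty. The helper and sample points below are purely illustrative:

import numpy as np

def rbf_kernel_value(x, y, gamma):
    """Similarity of two points under the RBF kernel."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([0.2, -0.5])
y = np.array([0.1, 0.3])
for gamma in [2**-10, 1.0, 2**4]:
    # Small gamma -> similarities near 1 (smooth, global boundary);
    # large gamma -> similarity decays quickly (local boundary)
    print gamma, rbf_kernel_value(x, y, gamma)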
def scale_features(df):
    """Scale all features to values between -1 and 1."""
    colnames = df.columns
    scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True)
    df = pd.DataFrame(scaler.fit_transform(df))
    df.columns = colnames
    return df
# We scale all features to [-1, 1] to prevent large features from dominating
# the model; note this also maps the binary outcome labels from {0, 1} to {-1, 1}
df = scale_features(df)
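A quick sanity check that every column now lies within [-1, 1]:

print df.min().min(), df.max().max()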
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation

def split_data(df, features):
    """Split the data into stratified training and testing sets."""
    X = df[features]
    Y = df['outcome']
    return cross_validation.train_test_split(X, Y, stratify=Y)

def train_model(x_train, y_train, scoring='recall'):
    """Fit an RBF-kernel SVM, tuning C and gamma by cross-validated grid search."""
    svc = SVC(class_weight='balanced', probability=False)
    # Search a coarse power-of-two grid for good parameters
    tuning_parameters = [{'kernel': ['rbf'],
                          'gamma': [2**x for x in xrange(-10, 5)],
                          'C': [2**x for x in xrange(-10, 5)]}]
    clf = GridSearchCV(estimator=svc, param_grid=tuning_parameters,
                       scoring=scoring, cv=10)
    clf.fit(x_train, y_train)
    return clf
from sklearn.metrics import accuracy_score, confusion_matrix

def sensitivity(mat):
    """Use the confusion matrix to calculate sensitivity (true positive rate)."""
    tp = mat[1][1]
    fn = mat[1][0]
    try:
        s = (1.0 * tp) / (tp + fn)
    except ZeroDivisionError:
        s = None
    return s

def specificity(mat):
    """Use the confusion matrix to calculate specificity (true negative rate)."""
    tn = mat[0][0]
    fp = mat[0][1]
    try:
        sp = (1.0 * tn) / (tn + fp)
    except ZeroDivisionError:
        sp = None
    return sp
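As a quick sanity check of these helpers, consider a small hand-built confusion matrix (the numbers are chosen only for illustration):

# Rows are actual classes (negative, positive), columns are predictions
toy_mat = [[50, 10],
           [5, 35]]
print sensitivity(toy_mat)  # 35 / (35 + 5) = 0.875
print specificity(toy_mat)  # 50 / (50 + 10) = 0.8333...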
Since our goal is to predict hazardous seismic activity, it is very important to detect a high number of true positives and to minimize false negatives. Consequently, our most important metric of success is sensitivity. We perform cross-validation with various scoring functions, aiming to maximize sensitivity while keeping accuracy and specificity reasonably high. We find that when optimizing for sensitivity (recall) alone, sensitivity reaches 100% but accuracy falls below 10%. Optimizing for the area under the ROC curve, which maximizes true positives while keeping false positives low, achieves an accuracy of 81% with a sensitivity of 67%.
features = [clmn for clmn in df.columns if clmn != 'outcome']
x_train, x_test, y_train, y_test = split_data(df, features)

clf = train_model(x_train, y_train, scoring='recall')
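# The grid search keeps the winning parameter combination; printing it helps
# verify that the best values are not pinned at the edges of the search grid
print "Best parameters: {0}".format(clf.best_params_)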
y_pred = clf.predict(x_test)
print "Recall-optimized grid search scores"
print "Accuracy    : {0}".format(accuracy_score(y_test, y_pred))
print "Sensitivity : {0}".format(sensitivity(confusion_matrix(y_test, y_pred)))
print "Specificity : {0}".format(specificity(confusion_matrix(y_test, y_pred)))
clf = train_model(x_train, y_train, scoring='roc_auc')
y_pred = clf.predict(x_test)
print ""
print "Area under ROC-optimized grid search scores"
print "Accuracy    : {0}".format(accuracy_score(y_test, y_pred))
print "Sensitivity : {0}".format(sensitivity(confusion_matrix(y_test, y_pred)))
print "Specificity : {0}".format(specificity(confusion_matrix(y_test, y_pred)))