Imbalanced-class Classification with Seismic Bumps Data


seismic_data

We use the publicly-available seismic bumps dataset to train a classification model and predict hazardous seismic activity. The data consists of 18 attributes consisting of both categorical and non-categorical data, in addition to the binary decision attribute that denotes the occurence of high energy seismic bumps. The decision classes are highly imbalanced, with positive instances accounting for less than 7% of the total instances.

We begin by loading the data and assigning column names.

In [5]:
import arff
import pandas as pd

def load_data(colnames):
    """Loads data from a file. Returns a dataframe"""
    
    r = arff.load(open('/Users/nargiss/Downloads/seismic-bumps.arff', 'rb'))
    df = pd.DataFrame(r['data'])
    df.columns = colnames
    
    return df
In [6]:
colnames = ['seismic',  
'seismoacoustic', 
'shift', 
'genergy', 
'gpuls', 
'gdenergy', 
'gdpuls', 
'ghazard',
'nbumps',
'nbumps2',
'nbumps3',
'nbumps4',
'nbumps5',
'nbumps6',
'nbumps7',
'nbumps89',
'energy',
'maxenergy',
'outcome']

df = load_data(colnames)
In [7]:
df.head()
Out[7]:
seismic seismoacoustic shift genergy gpuls gdenergy gdpuls ghazard nbumps nbumps2 nbumps3 nbumps4 nbumps5 nbumps6 nbumps7 nbumps89 energy maxenergy outcome
0 a a N 15180.0 48.0 -72.0 -72.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 a a N 14720.0 33.0 -70.0 -79.0 a 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2000.0 2000.0 0
2 a a N 8050.0 30.0 -81.0 -78.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
3 a a N 28820.0 171.0 -23.0 40.0 a 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3000.0 3000.0 0
4 a a N 12640.0 57.0 -63.0 -52.0 a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0

We then transform the categorical data from labels to values between 0 and n_classes-1

In [8]:
from sklearn import preprocessing

def preprocess_features(df, cols):
    """transform categorical features"""
    le = preprocessing.LabelEncoder()
    for clmn in cols:
        df[clmn] = le.fit_transform(df[clmn])
    
    return df

cat_cols = ['seismic', 'seismoacoustic', 'shift', 'ghazard', 'outcome']
df = preprocess_features(df, cat_cols)
In [9]:
#check that all categorical columns have more than one category
for clmn in cat_cols:
    print df[clmn].unique()
[0 1]
[0 1 2]
[0 1]
[0 1 2]
[0 1]

Next we plot the count and real-valued data as well as the decision attribute in a pairplot, and visually explore whether there are attributes that appear to have no impact on the decision attribute. We notice that 'nbumps6', 'nbumps7' and 'nbumps89' consist of one unique value and therefore have no impact on the classification.

In [10]:
import seaborn as sns

real_cols = [clmn for clmn in colnames if clmn not in cat_cols] + ['outcome']

%matplotlib inline

g = sns.pairplot(df[real_cols])
    
for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)

    
    
In [11]:
# drop columns that contain only one unique value 
df[['nbumps6', 'nbumps7', 'nbumps89']].sum()
df = df.drop(['nbumps6', 'nbumps7', 'nbumps89'], 1)

Since we have a high number of features as well as highly imbalanced classes, we will use SVM with the RBF kernel for this classification problem. Before we fit the data, we perform the following steps:

  1. Scale all features between -1 and 1
  2. Split the data into training and testing sets
  3. Peform Grid search and cross-validation to find proper tuning parameters
  4. Define scoring functions
In [12]:
def scale_features(df): 
    """Scale all features to values between -1 and 1"""
    colnames = df.columns
    MinMaxScaler = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True)
    df = pd.DataFrame(MinMaxScaler.fit_transform(df))
    df.columns = colnames

    return df

#We scale all features between -1 and 1 to prevent large features from dominating the model
df = scale_features(df)
In [13]:
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation

def split_data(df, features):
    X = df[features]
    Y = df['outcome'] 
    
    return cross_validation.train_test_split(X, Y, stratify=Y)

def train_model(x_train, y_train, scoring='recall'):
    svc = SVC(class_weight='balanced', probability=False)

    #Use cross-validation to find good parameters
    tuning_parameters = [{'kernel': ['rbf'], 'gamma': [2**x for x in xrange(-10, 5)],
                         'C': [2**x for x in xrange(-10, 5)]}]

    clf = GridSearchCV(estimator=svc, param_grid=tuning_parameters, scoring=scoring, cv=10)
    clf.fit(x_train, y_train)
    
    return clf
In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix 

def sensitivity(mat):
    """Use the confusion matrix to calculate sensitivity"""
    tp = mat[1][1]
    fn = mat[1][0]
    try:
        s = (1.0*tp) / (tp + fn)
    except:
        s = None
        
    return s

def specificity(mat):
    """Use the confusion matrix to calculate specificity"""
    tn = mat[0][0]
    fp = mat[0][1]
    try:
        sp = (1.0*tn) / (tn + fp)
    except:
        sp = None
    
    return sp

Since our goal is to predict hazardous seismic activity, it is very important to detect a high number of true positives and minimize false negatives. Consequently, our most important metric of success is sensitivity. We perform cross-validation with various scoring functions with the goal of maximizing sensitivity while keeping accuracy and specificity reasonably high. We find when optimizing for sensitivity (recall) alone, that accuracy is below 10% while sensitivity is 100%. Optimizing for the area under the ROC curve, which maximizes true positives while keeping false positives low, achieves an accuracy of 81% with a sensitivity of 67%.

In [15]:
features = [clmn for clmn in df.columns if clmn not in 'outcome']
x_train, x_test, y_train, y_test = split_data(df, features)

clf = train_model(x_train, y_train, scoring='recall')
y_pred = clf.predict(x_test)

print "Recall-optimized grid search scores"
print("Accuracy : {0}").format(accuracy_score(y_test, y_pred))
print("Sensitivity : {0}").format(sensitivity(confusion_matrix(y_test, y_pred)))
print("Specificity: {0}").format(specificity(confusion_matrix(y_test, y_pred)))

clf = train_model(x_train, y_train, scoring='roc_auc')
y_pred = clf.predict(x_test)

print ""
print "Area under ROC-optimized grid search scores"
print("Accuracy : {0}").format(accuracy_score(y_test, y_pred))
print("Sensitivity : {0}").format(sensitivity(confusion_matrix(y_test, y_pred)))
print("Specificity: {0}").format(specificity(confusion_matrix(y_test, y_pred)))
Recall-optimized grid search scores
Accuracy : 0.0650154798762
Sensitivity : 1.0
Specificity: 0.0

Area under ROC-optimized grid search scores
Accuracy : 0.812693498452
Sensitivity : 0.666666666667
Specificity: 0.822847682119
• • •