Understanding the data

import pandas as pd

# Load the credit card transactions dataset
df = pd.read_csv('creditcard.csv')
df.head()
Time   Amount   Class   V1          …   V28
1.0    149.62   0       -1.359807   …   -0.021053
  • NOTE: there are 28 anonymised 'V' features: V1, …, V28
  • For privacy reasons, most columns do not have a descriptive name
  • Time: not enough information about this feature, so it will be removed
  • Amount: the monetary amount of the transaction
  • Class:
    • 0: No fraud
    • 1: Fraud

Exploratory Data Analysis (EDA)

import matplotlib.pyplot as plt

counts = df['Class'].value_counts()
plt.bar(x=[0, 1], height=[counts[0], counts[1]])
plt.title('Class Distributions (0: No Fraud, 1: Fraud)', fontsize=14)

[Figure: bar chart of the class distribution (0: No Fraud, 1: Fraud)]

Class
0    284315
1       492
import seaborn as sns

amount_val = df['Amount'].values

fig, ax = plt.subplots()

# distplot is deprecated in recent seaborn; histplot with a KDE is the modern equivalent
sns.histplot(amount_val, color='r', kde=True, ax=ax)
ax.set(title='Distribution of Transaction Amount')
ax.set_xlim([min(amount_val), max(amount_val)])
plt.show()

[Figure: Distribution of Transaction Amount]

  • Transactions are mainly of small amounts (see the summary statistics sketch below)
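One quick way to check this is to look at the summary statistics and upper quantiles of Amount (a minimal sketch):

# Summary statistics and upper quantiles of the transaction amount
print(df['Amount'].describe())
print(df['Amount'].quantile([0.5, 0.9, 0.99]))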

Metric Selection

  • very imbalanced class distribution; the positive class is the minority
  • the positive class is more important than the negative class: fraud needs to be detected with high confidence
    • e.g. 90% precision
  • False negatives (classifying a fraud sample as not fraud) are very costly and need to be minimised
    • need as few as possible
    • trade-off: recall should be as high as possible to keep false negatives low
      • but precision needs to stay above a threshold to be useful
  • Precision, recall, F1 score and false negatives will be used to compare models via cross-validation (how they relate is sketched below)
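These metrics can all be read off the confusion matrix; the snippet below is a minimal, illustrative sketch on toy labels (not the project's data) showing how precision, recall, F1 and the false-negative count relate:

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels purely for illustration (1 = fraud, 0 = no fraud)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision = tp / (tp + fp); recall = tp / (tp + fn)
# high recall directly means few false negatives (fn)
print('false negatives:', fn)
print('precision:', precision_score(y_true, y_pred))
print('recall:', recall_score(y_true, y_pred))
print('f1:', f1_score(y_true, y_pred))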

Pre-processing

  • The dataset was scaled using RobustScaler, which is more robust to outliers
  • To better visualise the data and conduct multivariate analysis, a random under-sampling technique will be applied; it under-samples the majority class, creating a more balanced dataset
from sklearn.preprocessing import RobustScaler
from imblearn.under_sampling import RandomUnderSampler

rob_scaler = RobustScaler()

X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']

X_scaled = rob_scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Under-sample the majority class on the scaled features
rus = RandomUnderSampler(random_state=42)
X, y = rus.fit_resample(X_scaled_df, y)

sns.countplot(x=y)

[Figure: count plot of the classes after random under-sampling]

balanced_df = X.join(y)

corr = balanced_df.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size': 20}).set_title('Balanced (Under-sampled) Correlation Matrix', fontsize=14)

[Figure: correlation matrix heatmap of the under-sampled (balanced) dataset]

  • V4 and V11 are highly positively correlated with Class
  • V14 and V10 are highly negatively correlated with Class
  • no features are highly correlated with each other
    • if there were highly correlated features, they should be removed (one way to check is sketched below)
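One way to check for highly correlated feature pairs is to scan the upper triangle of the correlation matrix against a threshold; the 0.9 cut-off below is an arbitrary choice for illustration:

import numpy as np

corr_abs = balanced_df.drop(columns='Class').corr().abs()

# Upper triangle only, so each feature pair is considered once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = corr_abs.where(mask).stack()

# Pairs whose absolute correlation exceeds the (arbitrary) 0.9 threshold
print(pairs[pairs > 0.9])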

Model selection

  • The data needs to be split into training and test sets, with transformations applied inside a pipeline
    • to prevent leakage when estimating model performance via cross-validation
    • stratified random sampling was used, as the dataset is very imbalanced
  • The original, unbalanced dataset was reloaded
from sklearn.model_selection import train_test_split

X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

Baseline score, without resampling or specialised imbalanced algorithms

  • First, four models will be used:
    1. naive model: a dummy classifier that predicts uniformly at random
    2. logistic regression
    3. random forest
    4. neural network
  • Aim: get a baseline score without algorithms specialised for imbalanced learning
    • i.e. naive algorithms, without resampling etc.
  • The first three models were implemented using the scikit-learn library
    • stratified 10-fold cross-validation was used to calculate F1, precision and recall values
  • The neural network model was implemented using the TensorFlow library

scikit-learn (models 1, 2 and 3)

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

pipe1 = Pipeline([
    ('scaler', RobustScaler()),
    ('model', DummyClassifier(strategy='uniform'))
])

pipe2 = Pipeline([
    ('scaler', RobustScaler()),
    ('model', LogisticRegression())
])

pipe3 = Pipeline([
    ('scaler', RobustScaler()),
    ('model', RandomForestClassifier())
])

from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10)
scores = ['f1', 'precision', 'recall']

naive_cv_results = cross_validate(pipe1, X=X_train, y=y_train, cv=cv, scoring=scores, n_jobs=-1)
logistic_cv_results = cross_validate(pipe2, X=X_train, y=y_train, cv=cv, scoring=scores, n_jobs=-1)
randomForest_cv_results = cross_validate(pipe3, X=X_train, y=y_train, cv=cv, scoring=scores, n_jobs=-1)
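The mean scores reported below can be pulled out of the cross_validate results, for example with a small helper like this (mean_scores is an illustrative name, not part of the original code):

def mean_scores(cv_results):
    # Average the per-fold test scores returned by cross_validate
    return pd.DataFrame(cv_results).filter(like='test_').mean()

print(mean_scores(naive_cv_results))
print(mean_scores(logistic_cv_results))
print(mean_scores(randomForest_cv_results))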

The mean f1, precision and recall scores

model_1: naive algo

test_f1          0.00347783
test_precision   0.00174494
test_recall      0.503979

model_2: logistic regression

test_f1          0.685409
test_precision   0.852982
test_recall      0.577477

model_3: random forest

test_f1          0.840989
test_precision   0.948857
test_recall      0.756156

  • clearly the naive algo performs the worst
  • mean scores from cross-validation do not show the whole picture
    • learning curves must be visualised to diagnose any problems in the learning stage

The precision and recall learning curves

model_2: logistic regression

[Figure: precision and recall learning curves for logistic regression]

  • logistic regression seems to be the best, but its recall score is low

model_3: random forest

[Figure: precision and recall learning curves for random forest]

  • random forest is overfitting
    • indicated by the large gap between the training and test scores (a sketch for generating these learning curves follows)
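Learning curves like the ones above can be generated with scikit-learn's learning_curve; the sketch below plots training vs cross-validation recall for the logistic regression pipeline (the scoring choice and train sizes are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Training vs cross-validation recall as the training set grows
train_sizes, train_scores, val_scores = learning_curve(
    pipe2, X_train, y_train,
    cv=cv, scoring='recall',
    train_sizes=np.linspace(0.1, 1.0, 5),
    n_jobs=-1)

plt.plot(train_sizes, train_scores.mean(axis=1), label='training recall')
plt.plot(train_sizes, val_scores.mean(axis=1), label='cross-validation recall')
plt.xlabel('training set size')
plt.ylabel('recall')
plt.legend()
plt.show()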

Tensorflow neural network

  • The neural network will be built with Keras, the high-level API of TensorFlow

import tensorflow as tf
from tensorflow import keras

METRICS = [
  keras.metrics.Precision(name='precision'),
  keras.metrics.FalseNegatives(name='fn'),
  keras.metrics.Recall(name='recall'),
  keras.metrics.AUC(name='prc', curve='PR'),  # PR-AUC; the early-stopping callback below monitors 'val_prc'
]

def make_model(metrics=METRICS, output_bias=None):
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)
  model = keras.Sequential([
      keras.layers.Dense(
          16, activation='relu',
          input_shape=(X_train.shape[-1],)),  # one input per feature column
      keras.layers.Dropout(0.5),
      keras.layers.Dense(1, activation='sigmoid',
                         bias_initializer=output_bias),
  ])

  model.compile(
      optimizer=keras.optimizers.Adam(learning_rate=1e-3),
      loss=keras.losses.BinaryCrossentropy(),
      metrics=metrics)
  
  return model 

EPOCHS = 100
BATCH_SIZE = 2048

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_prc', 
    verbose=1,
    patience=10,
    mode='max',
    restore_best_weights=True)

model = make_model()
model.summary()
Layer type   Output Shape   Param #
Dense        (None, 16)     480
Dropout      (None, 16)     0
Dense        (None, 1)      17
  • Total params: 497
  • Trainable params: 497
  • Non-trainable params: 0
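The training call itself is not shown; a sketch consistent with the pieces defined above might look like the following (scaling with RobustScaler and the 20% validation split are assumptions, not the original setup):

from sklearn.preprocessing import RobustScaler

# Scale the features the same way the scikit-learn pipelines do
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

history = model.fit(
    X_train_scaled, y_train.values,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[early_stopping],
    validation_split=0.2)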

[Figure: neural network training results]

Concluding the first stage

  • logistic regression and random forest are shortlisted
    • the complexity of the algorithms will be increased in the next stage

Specialised algorithms

  • SMOTE and cost-sensitive learning were implemented on the shortlisted models

  • Logic of using cross-validation for imbalanced learning:

    • up/down-sampling must be done inside the cross-validation loop
    • imblearn.pipeline extends the scikit-learn pipeline to support this
    1. Up/down-sample only the data in the training folds
    2. Fit the model on the up/down-sampled training data
    3. Score the model on the (non-up/down-sampled) validation fold

recall/precision trade-off

  • SMOTE and balanced class weights increase recall at the cost of precision
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE

pipeline1 = imbpipeline([
    ('smote', SMOTE(random_state=11)),
    ('scaler', RobustScaler()),
    ('classifier', LogisticRegression(random_state=11, penalty='l2', n_jobs=-1, max_iter=1000))
])
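Cross-validating this pipeline with the same splitter and scorers as before keeps the SMOTE resampling inside each training fold. The second pipeline below is an illustrative cost-sensitive variant (balanced class weights instead of SMOTE); names like pipeline2 and the exact parameters are assumptions:

from sklearn.model_selection import cross_validate

# SMOTE + logistic regression; resampling happens only on each training fold
smote_cv_results = cross_validate(pipeline1, X=X_train, y=y_train, cv=cv, scoring=scores, n_jobs=-1)

# Cost-sensitive alternative: no resampling, just re-weight the classes
pipeline2 = imbpipeline([
    ('scaler', RobustScaler()),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=11))
])
weighted_cv_results = cross_validate(pipeline2, X=X_train, y=y_train, cv=cv, scoring=scores, n_jobs=-1)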

[Figure: results for SMOTE + logistic regression]

  • a precision of ~0.03 is not acceptable