Problem Description

The task was a supervised learning task: the aim was to build a machine learning model (a classifier) to predict whether a network intrusion was happening, based on IoT network traffic features. The dataset used was the AWID2 dataset, an academic dataset used by security researchers to develop and test the latest methodologies for preventing intrusion attacks. Please note that each member of the team was assigned a specific task in the machine learning development lifecycle, defined by the following stages:

  1. Pre-processing
  2. Feature selection
  3. Exploring and selecting ML algorithms
  4. Refining algorithms
  5. Evaluating model and analysing results

I was personally tasked with step 4: refining algorithms.

The notebooks, along with the project brief, are available at https://github.com/asahin01/Network

Refining Algorithms

As a team, we decided that accuracy would be the main metric to gauge success, since the class distributions were balanced.

The team and I decided that logistic regression and AdaBoost would be shortlisted for evaluation on the test set, to quantify model performance on unseen data. This was because these two models performed best during 10-fold cross-validation on the training set, as seen below:

10-fold CV for model selection

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

model = LogisticRegression()
results = cross_val_score(model, X_train_ready, Y_train, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
# Accuracy: 94.518% (9.900%)

clf = AdaBoostClassifier(n_estimators=100)
results = cross_val_score(clf, X_train_ready, Y_train, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
# Accuracy: 98.518% (4.3%)

Pipelines

The pipelines used for model evaluation were decided during stage 3. A shortlist of data transformations was identified during feature selection, performance estimates of the whole algorithm (model + feature selection + transformations) were calculated, and the best pre-processing stack was determined to be one which performed equally well with both shortlisted models. The two pipelines are shown below:

from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer

pipe = Pipeline([
    ('zero variance', VarianceThreshold()),
    ('scale 0_1', MinMaxScaler()),
    ('top20 features', SelectKBest(chi2, k=20))])

pipe2 = Pipeline([
    ('zero variance', VarianceThreshold()),
    ('norm 1', Normalizer()),
    ('top20 features', SelectKBest(chi2, k=20))])
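To estimate performance of the whole algorithm, each pipeline can be extended with a model step and scored by cross-validation, so the transforms are refit on each training fold. A minimal sketch, using synthetic stand-in data rather than the real AWID2 features (the `clip=True` option is an assumption added here so held-out folds stay non-negative for chi2):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the AWID2 features; abs() keeps values non-negative
# because chi2 feature scoring requires non-negative inputs.
X, Y = make_classification(n_samples=500, n_features=40, random_state=7)
X = np.abs(X)

# Pre-processing stack plus a final model step, evaluated as one unit.
full_pipe = Pipeline([
    ('zero variance', VarianceThreshold()),
    ('scale 0_1', MinMaxScaler(clip=True)),   # clip keeps test folds in [0, 1]
    ('top20 features', SelectKBest(chi2, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(full_pipe, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
```

Wrapping the transforms and model in one Pipeline avoids leaking information from the validation folds into the scaler and feature selector.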

Hyperparameter Search - Grid Search

I conducted hyperparameter tuning using the grid search algorithm; each hyperparameter combination was evaluated using 10-fold cross-validation.

For AdaBoost, the hyperparameters tuned were the number of estimators (N) and the learning rate (L). There is a trade-off between N and L: the optimal number of estimators increases as L decreases. The number of estimators is the number of models trained iteratively; the values searched were 10, 50, 100 and 150. The learning rate controls the contribution each model makes to the weights of the algorithm; the values 0.001, 0.01 and 0.1 were searched, as past studies suggest values below 0.1 are optimal. The lowest L values were expected to achieve the greatest cross-validation score as a result of reduced overfitting; however, the opposite was true: the highest L achieved the best CV score.
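The AdaBoost search described above can be sketched with scikit-learn's GridSearchCV, here on synthetic stand-in data rather than the real AWID2 features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the training data.
X, Y = make_classification(n_samples=300, n_features=20, random_state=7)

# The N and L values named in the text above.
param_grid = {
    'n_estimators': [10, 50, 100, 150],
    'learning_rate': [0.001, 0.01, 0.1],
}

# Every combination is scored by 10-fold cross-validation.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(AdaBoostClassifier(random_state=7), param_grid, cv=kfold)
grid.fit(X, Y)
print(grid.best_params_, "CV accuracy: %.3f" % grid.best_score_)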

For logistic regression, the parameter tuned was 'C', the inverse of regularisation strength. The lower the 'C', the greater the regularisation strength; this reduces the variance of the model and should therefore reduce overfitting. However, the highest C values resulted in the greatest CV score.
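The C search follows the same pattern. A minimal sketch on synthetic data; the C values below are illustrative assumptions, since the text does not list the exact grid that was searched:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the training data.
X, Y = make_classification(n_samples=300, n_features=20, random_state=7)

# Illustrative grid: smaller C means stronger regularisation.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=kfold)
grid.fit(X, Y)
print("best C:", grid.best_params_['C'],
      "CV accuracy: %.3f" % grid.best_score_)
```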

Learnt Skills and Future Work

This project gave me experience of working as part of a data science team, seeing a project through from start to finish.

I came to understand the iterative nature of a machine learning project, and that each individual had to work with other members to achieve the best results for their task.

Learning curves should have been used to evaluate whether AdaBoost was overfitting to the training data, as its CV score was high (98%). Also, nested 5x2 cross-validation should have been used for the hyperparameter search instead of re-running cross-validation on the same folds.
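Both suggested checks can be sketched in a few lines, again on synthetic stand-in data. The learning curve compares training and validation scores as the training set grows (a persistent gap suggests overfitting), and nested 5x2 CV runs the grid search inside each outer fold so the performance estimate is not biased by tuning on the same folds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedKFold,
                                     cross_val_score, learning_curve)

# Synthetic stand-in for the training data.
X, Y = make_classification(n_samples=400, n_features=20, random_state=7)

# Learning curve: a train/CV score gap that does not shrink flags overfitting.
sizes, train_scores, cv_scores = learning_curve(
    AdaBoostClassifier(random_state=7), X, Y, cv=5,
    train_sizes=[0.2, 0.5, 1.0])
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)

# Nested 5x2 CV: the inner grid search tunes hyperparameters on each outer
# training split; the 5 repeats of 2-fold outer CV score the tuned model.
inner = GridSearchCV(AdaBoostClassifier(random_state=7),
                     {'learning_rate': [0.01, 0.1, 1.0]}, cv=2)
outer = RepeatedKFold(n_splits=2, n_repeats=5, random_state=7)
nested = cross_val_score(inner, X, Y, cv=outer)
print("nested 5x2 CV accuracy: %.3f (+/- %.3f)" % (nested.mean(), nested.std()))
```

The learning-rate grid inside the nested search is an illustrative assumption, not the grid used in the project.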