Land Classification

There are multiple satellites that capture data about the intensity of light reflected from the Earth at different frequencies, at a very fine geographic resolution. This information can be used to classify land into different categories: built-up, barren, green, or water. The training data contains measurements for each land parcel labelled with one of these four classes.
Column Description
- Numeric columns X1 to X6 and I1 to I6 describe characteristics of the land parcel
- clusterID is a categorical column that groups similar types of land together
- target is the categorical output column that has to be predicted for the test dataset
- 1 = Green Land
- 2 = Water
- 3 = Barren Land
- 4 = Built-up
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('land_train.csv') # loading the dataset
df.head()
X1 | X2 | X3 | X4 | X5 | X6 | I1 | I2 | I3 | I4 | I5 | I6 | clusterID | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 207 | 373 | 267 | 1653 | 886 | 408 | 0.721875 | -1.023962 | 2.750628 | 0.530316 | 0.208889 | 0.302087 | 6 | 1 |
1 | 194 | 369 | 241 | 1539 | 827 | 364 | 0.729213 | -1.030143 | 2.668501 | 0.546537 | 0.203306 | 0.300930 | 6 | 1 |
2 | 214 | 385 | 264 | 1812 | 850 | 381 | 0.745665 | -1.107047 | 3.000315 | 0.546156 | 0.181395 | 0.361382 | 6 | 1 |
3 | 212 | 388 | 293 | 1882 | 912 | 402 | 0.730575 | -1.077747 | 3.006150 | 0.530083 | 0.156835 | 0.347172 | 6 | 1 |
4 | 249 | 411 | 332 | 1773 | 1048 | 504 | 0.684561 | -0.941562 | 2.713079 | 0.494370 | 0.205742 | 0.257001 | 6 | 1 |
dfs = df.sample(frac=0.1).reset_index(drop=True) # shuffles the data and takes a fraction of it
dfs.shape
(48800, 14)
Column-level preprocessing
Scatter Plot
# scatter plot of a part of data
pd.plotting.scatter_matrix(dfs[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'I1', 'I2', 'I3', 'I4', 'I5', 'I6']], figsize=(17,10));
Correlation Plot
import seaborn as sns
sns.set(style="white")
# finding correlation
corr = df[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'I1', 'I2', 'I3', 'I4', 'I5', 'I6']].corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0.5, square=True, linewidths=.5, cbar_kws={"shrink": .5});
From the scatter plot and correlation plot, we can conclude that:
- The X features are almost linearly correlated with each other, and they show similar correlations with the I features. Therefore, they can be merged into one column (their average).
- I5 is almost constant and shows little variance (see the histograms on the scatter-matrix diagonal), so it can be dropped.
- I1 and I4 are highly correlated, so they can be merged (averaged).
- I2 is strongly inversely correlated with I1, so it can be dropped.
Remaining columns: X (the average of X1–X6), I14, I3, I6.
Note: highly correlated columns are dropped because they add no extra information for the model and can hamper performance.
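As a sanity check, the strongly correlated pairs can also be listed programmatically from the corr matrix computed above (a small sketch; the 0.9 cut-off is an illustrative choice, not part of the original analysis):
# list feature pairs whose absolute correlation exceeds an (illustrative) 0.9 threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)) # keep the upper triangle only
pairs = upper.stack() # (feature_a, feature_b) -> correlation
print(pairs[pairs.abs() > 0.9].sort_values(ascending=False))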
Selecting features
# preprocessing columns
df['X'] = df[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']].mean(axis=1) # merging all X features in one
df = df.drop(['X1', 'X2', 'X3', 'X4', 'X5', 'X6'], axis=1) # dropping the rest
df['I14'] = df[['I1', 'I4']].mean(axis=1) # averaging I1 and I4
df = df.drop(['I1', 'I4', 'I2', 'I5'], axis=1) # dropping the rest
df.head()
I3 | I6 | clusterID | target | X | I14 | |
---|---|---|---|---|---|---|
0 | 2.750628 | 0.302087 | 6 | 1 | 632.333333 | 0.626095 |
1 | 2.668501 | 0.300930 | 6 | 1 | 589.000000 | 0.637875 |
2 | 3.000315 | 0.361382 | 6 | 1 | 651.000000 | 0.645910 |
3 | 3.006150 | 0.347172 | 6 | 1 | 681.500000 | 0.630329 |
4 | 2.713079 | 0.257001 | 6 | 1 | 719.500000 | 0.589465 |
df.to_csv('land_train_pruned.csv', index=False) # saving the data
Row-level preprocessing
df.describe()
I3 | I6 | clusterID | target | X | I14 | |
---|---|---|---|---|---|---|
count | 487998.000000 | 487998.000000 | 487998.000000 | 487998.000000 | 487998.000000 | 487998.000000 |
mean | 1.602914 | 0.098514 | 4.180263 | 1.062301 | 2567.438750 | 0.295633 |
std | 1.106819 | 0.147183 | 1.645535 | 0.350805 | 1840.983848 | 0.250844 |
min | -0.177197 | -0.297521 | 1.000000 | 1.000000 | -83.666667 | -0.848529 |
25% | 0.694444 | 0.016150 | 3.000000 | 1.000000 | 1015.000000 | 0.067379 |
50% | 1.267495 | 0.056333 | 3.000000 | 1.000000 | 1839.000000 | 0.188264 |
75% | 2.430122 | 0.161845 | 6.000000 | 1.000000 | 4015.666667 | 0.515738 |
max | 5.663032 | 0.662566 | 8.000000 | 4.000000 | 10682.333333 | 5.755439 |
df.target.value_counts().plot(kind='bar');
Samples per class:
1: 472987 (97%)
2: 0
3: 14630
4: 381
This clearly shows the imbalance between the classes. With such data, even a model that predicts class 1 every time achieves 97% accuracy.
Therefore, the classes need to be balanced; the approach used is described below.
- Class 4 has very few samples (381), so it is removed.
- Class 1 is divided randomly into 10 parts (about 47,300 samples each).
- Class 3 is replicated three times (3 × 14,630 = 43,890 samples).
- Train 10 separate classification models, each on one part of Class 1 together with the same replicated Class 3 samples.
- Take a majority vote over the models' predictions at test time.
Normalizing features
from sklearn import preprocessing
x = df[['I3', 'I6', 'I14', 'X']].values # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_n = pd.DataFrame(x_scaled, columns=['I3', 'I6', 'I14', 'X'])
df_norm = df[['clusterID', 'target']].join(df_n)
df_norm.head()
clusterID | target | I3 | I6 | I14 | X | |
---|---|---|---|---|---|---|
0 | 6 | 1 | 0.501320 | 0.624535 | 0.223294 | 0.066506 |
1 | 6 | 1 | 0.487258 | 0.623330 | 0.225077 | 0.062481 |
2 | 6 | 1 | 0.544073 | 0.686296 | 0.226294 | 0.068240 |
3 | 6 | 1 | 0.545072 | 0.671495 | 0.223935 | 0.071073 |
4 | 6 | 1 | 0.494891 | 0.577575 | 0.217747 | 0.074602 |
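Note: the same fitted min_max_scaler is reused later to transform the test data. If the test set is preprocessed in a separate session, the fitted scaler can be saved and reloaded, for example with joblib (a small sketch; the filename is an arbitrary choice):
import joblib
joblib.dump(min_max_scaler, 'min_max_scaler.joblib') # persist the fitted scaler
# min_max_scaler = joblib.load('min_max_scaler.joblib') # reload it before transforming the test set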
Splitting data into 10 parts
# separating classes
df_1 = df_norm[df_norm.target==1]
df_1 = df_1.sample(frac=1).reset_index(drop=True) # shuffling rows
df_3 = df_norm[df_norm.target==3]
df_1.shape[0], df_3.shape[0]
(472987, 14630)
df_1_split = np.split(df_1, np.arange(47300, 472987, 47300),axis=0) # splits dataframe (of Target=1) in 10
[i.shape for i in df_1_split]
[(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47300, 6),
(47287, 6)]
df_3_dup = pd.concat([df_3]*3, ignore_index=True) # duplicate data (of Target=3) 3 times
df_3_dup.shape
(43890, 6)
for i in range(len(df_1_split)):
    df_1_split[i] = pd.concat([df_1_split[i], df_3_dup], ignore_index=True) # merge Target 1 and 3
[i.shape for i in df_1_split]
[(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91190, 6),
(91177, 6)]
df_1_split[9].target.value_counts().plot(kind='bar');
# saving all parts
for i in range(len(df_1_split)):
df_1_split[i].to_csv(f'land_train_split_{i+1}.csv', index=False)
Preprocessing Test Data
df_test = pd.read_csv('land_test.csv')
df_test.shape
(2000000, 13)
df_test.head()
X1 | X2 | X3 | X4 | X5 | X6 | I1 | I2 | I3 | I4 | I5 | I6 | clusterID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 338 | 554 | 698 | 1605 | 1752 | 1310 | 0.393834 | -0.350045 | 1.565423 | 0.311659 | 0.304781 | -0.043789 | 6 |
1 | 667 | 976 | 1187 | 1834 | 1958 | 1653 | 0.214167 | -0.181467 | 1.050679 | 0.196439 | 0.164085 | -0.032700 | 4 |
2 | 249 | 420 | 402 | 1635 | 1318 | 736 | 0.605302 | -0.712650 | 2.268984 | 0.441984 | 0.293497 | 0.107348 | 6 |
3 | 111 | 348 | 279 | 1842 | 743 | 328 | 0.736917 | -1.162062 | 3.074176 | 0.551699 | 0.080725 | 0.425145 | 6 |
4 | 349 | 559 | 642 | 1534 | 1544 | 989 | 0.409926 | -0.406678 | 1.607795 | 0.323984 | 0.212753 | -0.003249 | 6 |
# preprocessing columns
df_test['X'] = df_test[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']].mean(axis=1) # merging all X features in one
df_test = df_test.drop(['X1', 'X2', 'X3', 'X4', 'X5', 'X6'], axis=1) # dropping the rest
df_test['I14'] = df_test[['I1', 'I4']].mean(axis=1) # averaging I1 and I4
df_test = df_test.drop(['I1', 'I4', 'I2', 'I5'], axis=1) # dropping the rest
x_test = df_test[['I3', 'I6', 'I14', 'X']].values # returns a numpy array
x_test_scaled = min_max_scaler.transform(x_test) # applying scaler
df_test_n = pd.DataFrame(x_test_scaled, columns=['I3', 'I6', 'I14', 'X'])
df_test_norm = df_test[['clusterID']].join(df_test_n)
df_test_norm.shape
(2000000, 5)
df_test_norm.head()
clusterID | I3 | I6 | I14 | X | |
---|---|---|---|---|---|
0 | 6 | 0.298382 | 0.264280 | 0.181902 | 0.104635 |
1 | 4 | 0.210245 | 0.275830 | 0.159576 | 0.135875 |
2 | 6 | 0.418850 | 0.421701 | 0.207780 | 0.081460 |
3 | 6 | 0.556720 | 0.752709 | 0.226051 | 0.064292 |
4 | 6 | 0.305637 | 0.306505 | 0.184054 | 0.094727 |
df_test_norm.to_csv('land_test_preprocessed.csv') # saving the preprocessed test data (the index column is written here and dropped on reload)
Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.base import clone # to create a fresh, unfitted copy of an estimator
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
# different classifiers
num_models = [KNeighborsClassifier(n_neighbors=5), svm.SVC(), LogisticRegression(), GaussianNB(),
              AdaBoostClassifier(), BernoulliNB(), MLPClassifier(),
              RandomForestClassifier(), DecisionTreeClassifier(criterion='entropy')]
# training 9 separate models, one per balanced data split
# a majority vote over the 9 predictions is taken at test time
# the 10th data split is held out as the test set
models = []
for i in range(9):
    df_train = pd.read_csv('land_train_split_{}.csv'.format(i+1))
    df_train = df_train.sample(frac=1).reset_index(drop=True) # shuffling rows
    x = df_train.drop('target', axis=1).values
    y = df_train.target.values
    model = clone(num_models[7]) # fresh RandomForestClassifier per split; reusing one instance would keep only the last fit
    models.append(model.fit(x, y))
    print(f'Model {i+1} trained.', end='\r')
Model 9 trained.
# loading the held-out 10th split to use as the test set
df_test = pd.read_csv('land_train_split_10.csv')
df_test = df_test.sample(frac=1).reset_index(drop=True)
x_test = df_test.drop('target', axis=1).values
y_test = df_test.target.values
# making predictions
predictions = []
for i in range(9):
predictions.append(list(models[i].predict(x_test)))
predictions = np.array(predictions)
# getting the majority class
pred_maj = []
for i in range(predictions.shape[1]):
(values, counts) = np.unique(predictions[:,i], return_counts=True)
pred_maj.append(values[np.argmax(counts)])
pred_maj = np.array(pred_maj)
print('Accuracy: {:.2f}%'.format(np.mean(pred_maj==y_test)*100))
# binary metrics (class 1, Green Land, is the positive label by default)
precision, recall, f1 = precision_score(y_test, pred_maj), recall_score(y_test, pred_maj), f1_score(y_test, pred_maj)
print('Precision: {:.2f}\nRecall: {:.2f}\nF1 score: {:.2f}'.format(precision, recall, f1))
cm = confusion_matrix(y_test, pred_maj, labels=[1,3])
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['1', '3'], yticklabels=['1', '3'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion matrix');
Accuracy: 97.58%
Precision: 0.98
Recall: 0.97
F1 score: 0.98
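For a per-class breakdown of these metrics, sklearn's classification_report can be applied to the same held-out predictions (a short sketch reusing y_test and pred_maj from above):
from sklearn.metrics import classification_report
# precision, recall and F1 for each class (1 = Green Land, 3 = Barren Land)
print(classification_report(y_test, pred_maj, labels=[1, 3], digits=3))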
Results
Classifier | Accuracy | F1 Score |
---|---|---|
Decision Tree (Entropy) | 99.11% | 0.99 |
Decision Tree (Gini) | 99.15% | 0.99 |
Random Forest | 99.23% | 0.99 |
KNN | 98.53% | 0.99 |
Logistic Regression | 91.38% | 0.92 |
Gaussian Naive Bayes | 89.83% | 0.90 |
Bernoulli Naive Bayes | 51.86% | 0.68 |
AdaBoost Classifier | 95.87% | 0.90 |
ANN | 89.83% | 0.90 |
All | 97.58% | 0.98 |
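A simplified way to produce such a comparison is to train each candidate classifier on one balanced split and score it on the held-out split (a sketch only; it does not reproduce the exact 9-model voting procedure, so the numbers may differ slightly from the table above):
from sklearn.metrics import accuracy_score
# train each candidate classifier on split 1 and score it on the held-out split 10
df_tr = pd.read_csv('land_train_split_1.csv')
df_te = pd.read_csv('land_train_split_10.csv')
x_tr, y_tr = df_tr.drop('target', axis=1).values, df_tr.target.values
x_te, y_te = df_te.drop('target', axis=1).values, df_te.target.values
for clf in num_models:
    clf = clone(clf).fit(x_tr, y_tr) # fresh copy of each estimator; note SVC on ~90k rows can be slow
    pred = clf.predict(x_te)
    print('{}: accuracy={:.2%}, F1={:.2f}'.format(type(clf).__name__,
                                                  accuracy_score(y_te, pred),
                                                  f1_score(y_te, pred)))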
Getting predictions for test set
df_test_set = pd.read_csv("land_test_preprocessed.csv")
df_test_set.drop('Unnamed: 0', inplace=True, axis=1)
df_test_set.head()
clusterID | I3 | I6 | I14 | X | |
---|---|---|---|---|---|
0 | 6 | 0.298382 | 0.264280 | 0.181902 | 0.104635 |
1 | 4 | 0.210245 | 0.275830 | 0.159576 | 0.135875 |
2 | 6 | 0.418850 | 0.421701 | 0.207780 | 0.081460 |
3 | 6 | 0.556720 | 0.752709 | 0.226051 | 0.064292 |
4 | 6 | 0.305637 | 0.306505 | 0.184054 | 0.094727 |
# predictions
predictions = []
for i in range(9):
predictions.append(list(models[i].predict(df_test_set.values)))
predictions = np.array(predictions)
# getting the majority class
pred_maj = []
for i in range(predictions.shape[1]):
(values, counts) = np.unique(predictions[:,i], return_counts=True)
pred_maj.append(values[np.argmax(counts)])
pred_maj = np.array(pred_maj)
pred_maj.shape
(2000000,)
df_sub = pd.DataFrame(pred_maj, columns=['target'])
df_sub.head()
target | |
---|---|
0 | 3 |
1 | 3 |
2 | 1 |
3 | 1 |
4 | 3 |
df_sub.to_csv('submission.csv', index=False)
Analysis
dfa = pd.read_csv('land_train_split_1.csv')
dfa.groupby(dfa.target).describe().T
target | 1 | 3 | |
---|---|---|---|
I14 | count | 47300.000000 | 43890.000000 |
mean | 0.172823 | 0.178143 | |
std | 0.038544 | 0.010856 | |
min | 0.126043 | 0.128314 | |
25% | 0.138245 | 0.172255 | |
50% | 0.153272 | 0.177356 | |
75% | 0.208025 | 0.183939 | |
max | 0.313051 | 0.236896 | |
I3 | count | 47300.000000 | 43890.000000 |
mean | 0.303832 | 0.300403 | |
std | 0.192337 | 0.065502 | |
min | 0.005789 | 0.024574 | |
25% | 0.145690 | 0.265956 | |
50% | 0.234960 | 0.289333 | |
75% | 0.455870 | 0.326778 | |
max | 0.908952 | 0.874047 | |
I6 | count | 47300.000000 | 43890.000000 |
mean | 0.418548 | 0.235797 | |
std | 0.151682 | 0.076489 | |
min | 0.077088 | 0.029555 | |
25% | 0.330767 | 0.190215 | |
50% | 0.371948 | 0.218487 | |
75% | 0.483560 | 0.258432 | |
max | 0.947382 | 0.818775 | |
X | count | 47300.000000 | 43890.000000 |
mean | 0.251571 | 0.131445 | |
std | 0.172705 | 0.032215 | |
min | 0.015047 | 0.061382 | |
25% | 0.101817 | 0.112948 | |
50% | 0.202729 | 0.126339 | |
75% | 0.388372 | 0.142888 | |
max | 0.992523 | 0.519661 | |
clusterID | count | 47300.000000 | 43890.000000 |
mean | 4.127294 | 5.833698 | |
std | 1.634507 | 0.906823 | |
min | 1.000000 | 1.000000 | |
25% | 3.000000 | 6.000000 | |
50% | 3.000000 | 6.000000 | |
75% | 6.000000 | 6.000000 | |
max | 8.000000 | 8.000000 |
estimator = models[0].estimators_[5]
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot',
feature_names = ['clusterID', 'I3', 'I6', 'I14', 'X'],
class_names = ['1', '3'],
rounded = True, proportion = False,
precision = 2, filled = True)
# Convert to png using system command (requires Graphviz)
import pydot
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('somefile.png')
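If Graphviz is not installed, sklearn's own plot_tree can render the same estimator directly with matplotlib (a small sketch; limiting the plotted depth keeps the large tree readable):
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(estimator, feature_names=['clusterID', 'I3', 'I6', 'I14', 'X'],
          class_names=['1', '3'], filled=True, max_depth=3, fontsize=8) # only the top levels
plt.show()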
# summing the impurity-based (Gini) feature importances across the 9 models
a = np.sum([m.feature_importances_ for m in models], axis=0)
a
array([2.10152634, 0.58177282, 3.40019577, 2.06218839, 0.85431669])
The forests are too large to visualize in full, so instead the impurity-based (Gini) feature importances are summed across the nine models; the totals are shown below.
clusterID: 2.10
I3: 0.58
I6: 3.40
I14: 2.06
X: 0.85
Clearly, I6 is the most informative feature for separating the classes.
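For a quick visual summary, the summed importances can be normalized and plotted as a bar chart (a small sketch reusing a and the feature order from the tree export above):
# normalized feature importances, averaged over the 9 models
feat_names = ['clusterID', 'I3', 'I6', 'I14', 'X']
pd.Series(a / a.sum(), index=feat_names).sort_values().plot(kind='barh')
plt.xlabel('Mean feature importance')
plt.title('Feature importances (averaged over models)');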