The objective of this research article is to present a novel feature selection strategy for improving classification performance over high-dimensional data sets. The curse of dimensionality is the most serious downside of microarray data, as such data contain a very large number of genes (features), which undermines computational stability. In microarray data analytics, identifying the most relevant features requires full attention. Most researchers apply a two-stage strategy for gene expression data analysis: in the first stage, feature selection or feature extraction is employed as a preprocessing step to pinpoint the most prominent features; in the second stage, classification is applied using the selected subset of features.
Method
In this research we followed the same two-stage strategy, but introduced a distributed feature selection (dfs) strategy using Symmetrical Uncertainty (SU) and a Multi Layer Perceptron (MLP) by distributing the features across multiple clusters. Each cluster is equipped with a finite number of features. An MLP is employed over each cluster, and the dominant cluster is nominated based on the highest accuracy and lowest Root Mean Square (RMS) error rate.
Result
Classification accuracy with Ridor, Simple Cart (SC), KNN, and SVM is measured using the dominant cluster's features. The performance of this cluster is compared with traditional filter-based ranking techniques such as Information Gain (IG), Gain Ratio Attribute Evaluator (GRAE), and Chi-Squared Attribute Evaluator (Chi). The proposed method recorded approximately a 57% success rate and an 18% competitive (draw) rate against the traditional methods after applying it over seven well-known high-dimensional datasets and one lower-dimensional dataset.
Conclusion
The proposed methodology is applied over very high dimensional microarray datasets. Using this method, memory consumption is reduced and classification performance can be improved.
1. Introduction

Feature Selection (FS) is routinely used as a preprocessing step for high-dimensional data. Any technique for high-dimensional data must deal with the curse of dimensionality: the few hundred to several thousand variables of a microarray dataset lead to very high dimensionality, and the performance of an analysis technique degrades as the number of features rises.
FS is the process of removing insignificant and superfluous features; it attempts to find a candidate set of features that best describes the data.
Irrelevant and superfluous features bias the learning algorithm's correctness: they provide no surplus knowledge for prediction and may confuse the algorithm during the learning and classification phases. Hence, it is obligatory to drop insignificant and redundant features; because of the high dimensionality, the classification task becomes more difficult.
As not all the genes in a gene expression microarray dataset can influence the classification model, it is necessary to derive the key genes by which the classification task becomes easier and more accurate. A possible choice for this issue is applying an FS technique over the high-dimensional dataset. As a result of FS methods, computing time is saved and prediction performance can be accelerated. FS is also called attribute selection or variable selection in statistics. Because of the high dimensionality and low instance count of microarray data, a strong computational approach is required to handle the difficulties of data analysis. It is widely accepted that in the majority of microarray data only a few genes play a critical role in the classification task, and the remaining genes are considered irrelevant. So, FS is an important preprocessing concept before proceeding to the classification task.
To address this subject, three different types of FS approaches exist in the literature, namely Filter, Wrapper, and Embedded.
The filter mode is used to establish the subset of input features that has significant predictive capacity; by nominating the right features, we can potentially upgrade the efficiency of classification. It uses distinct statistical tests to select the subset of features with the greatest predictive power: first decide on a statistical measure to apply, then compute a score for every feature. The features are ordered by score, and the features with the top scores are considered in learning model generation while the others are ignored; thereby memory consumption is pruned and computation time is reduced. Information Gain (IG), Gain Ratio Attribute Evaluator (GRAE), Chi-Squared Attribute Evaluator (Chi), Symmetric Uncertainty (SU), and ReliefF (REL) are some of the popular filter-based methods.
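As an illustration only, the following minimal sketch shows this generic score-rank-select workflow in Python. Scikit-learn's mutual_info_classif stands in for any of the measures above, and the cutoff k is a placeholder the user must choose; neither appears in the original experiments, which were run in WEKA.

```python
# Minimal sketch of the filter workflow: score each feature with a statistical
# measure, order by score, keep the top-k. The scorer here is a stand-in; any
# of IG, GRAE, Chi, SU, or ReliefF could take its place.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, k):
    """Return the column indices of the k highest-scoring features."""
    scores = mutual_info_classif(X, y)   # one relevance score per feature
    ranked = np.argsort(scores)[::-1]    # highest score first
    return ranked[:k]

# Usage: X is an (instances x genes) matrix, y the class labels.
# top = filter_select(X, y, k=50); X_reduced = X[:, top]
```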
In this paper, we introduce a distributed feature selection (dfs) strategy using SU, Correlation-based Feature Subset Selection (CFS), and an MLP by distributing features across multiple clusters. As a result of this distribution process, each cluster is loaded with a small number of features. To identify the strongest among the formed clusters, an MLP is employed on each cluster; based on the highest accuracy and lowest RMS error rate, the strong cluster is documented. This strong cluster of features is compared with existing filter-based ranking methods such as IG, GRAE, and Chi.
The idea of this method is inspired by the project allocation strategy applied in academics. Students in their final year generally need to undergo a group academic project, and the project coordinator needs to form the groups (each group consisting of 'N' students). There are a few possibilities and limitations in forming the groups. Students can voluntarily form their own group within the defined group size; this is similar to selecting features randomly. If all toppers are in the same group, their project result can be better than that of the other groups; this is similar to selecting the top-ranked features from the list. If all poor students are in one group, their project outcome can be poor. Alternatively, if top, average, and poor students are mixed in a proper structure, every group can be balanced, and average and poor students can learn useful knowledge from the toppers; this mirrors the ensemble approach widely accepted in classification. Here, the performance of a group (cluster) depends on the members (features) of that group. In this work, we present a structure that forms balanced groups to provide adequate knowledge; the proposed structure for forming the groups is explained later. For this research, the rank of each feature is computed using SU, and the minimum number of features to be included in each cluster is defined by the result of CFS.
To test the strength of the proposed method, seven well-known microarray datasets with features in the range 2000–12,582 and one lower-dimensional dataset are used. After deriving the features by the existing methods and the proposed method, well-known classification techniques such as Ridor (rule based), SC (tree based), KNN (lazy), and SVM (Support Vector Machine) are employed, and their respective performance is analyzed.
In the next portion of this article, some of the existing literature on microarray datasets is presented. The distribution of feature selection among the clusters is presented in the methodology section. The dataset description and experimental setup are given in Section 4, and the results and appropriate discussion are produced in Section 5. The article concludes with possible suggestions.
2. Literature
The main objective of the current work is to establish a DFS framework to reduce or select the best subset of features from a high-dimensional dataset. Some existing review reports are accessible to guide this feature selection. As discussed in the introduction, there are three approaches to feature selection (Filter, Wrapper, Embedded); we considered the filter-based approach. As a filter method designates a rank to each feature based on that feature's statistical score, we chose one of the filter methods, namely Symmetric Uncertainty (SU). A short description of SU is given in the methodology section.
In addition to SU, a few more filter-based methods (IG, GR, Chi, REL) are available for selecting features and designating a rank to each feature. Based on the rank assigned, and depending on the number of features to be selected for classification, the top-ranked features can be selected for analysis. SU is applied in FAST (Fast clustering bAsed feature Selection algoriThm) and improved FAST clustering; both are based on graph theory concepts. In FAST, Prim's algorithm is employed to select the subset, whereas in improved FAST, Kruskal's minimum spanning tree is employed to derive the features. Information Gain based clustering frameworks have been proposed to find the best features of the kidney and voting datasets.
A new feature selector using a minimum variance method based on information gain was implemented and tested on nine datasets; it performed better than the traditional feature selection algorithms.
An ensemble-based multi-filter feature selection framework was proposed for DDoS detection in cloud computing by combining the features extracted by IG, GR, Chi, and REL. The authors considered the features from the ensembled set whose scores were greater than a defined threshold value. This method reduced the number of features from 41 to 13.
Some literature exists that was conducted over various microarray datasets to address the high-dimensionality issue using various FS approaches. A distributed FS mechanism was applied over eight microarray datasets by partitioning each dataset vertically. From each partition, strong features were selected by applying various FS methods such as IG, CFS, consistency-based filters, and ReliefF. The researchers employed C4.5, Naive Bayes (NB), KNN, and SVM classifiers for testing their proposed method.
KNN was applied on five microarray datasets using the MapReduce and Hadoop framework by executing over multiple cluster nodes; depending on the size of the feature set, the researchers used 2 and 18 cluster nodes.
The CFS method was applied over a lymphoblastic leukemia dataset. With this approach, the researchers drew 16 top-quality features; after employing KNN, MLP, and SVM, a 92.5% average accuracy was recorded.
Random Forest (RF), KNN, MLP, and SVM were employed over the colon and leukemia datasets after preprocessing with Partial Least Squares (PLS), Quadratic Programming FS (QPFS), Max Relevance (Max Rel), and minimum redundancy maximum relevance (mRMR) methods. The final experimental results indicated that mRMR recorded the best performance of all.
For classifying multiple disease states associated with cancer based on gene expression profiles, a novel method was proposed by the researchers using SVM and Random Forest (RF).
As the colon dataset suffers from a class imbalance issue, SMOTE was applied to balance the instance rate. After applying this method, ensembling approaches were employed with J48, RF, and REPTree. The researchers also considered the IG ranking based filter approach to select the prominent features.
Distance-based FS was applied over the colon and leukemia datasets, and SVM was employed over the selected features. As per the researchers' analysis, B/SVM selected 6.36% of the features with a 9.5% misclassification rate over the colon dataset, and 9.54% of the features with a 3.08% misclassification rate over the leukemia dataset.
Artificial Neural Networks (ANN), SVM, Bayesian Networks, and Decision Trees were applied over cancer datasets in a mini-review report published by the authors.
3. Methodology

The proposed method is majorly based on three important components: 1. Symmetric Uncertainty (SU), 2. Correlation-based Feature Subset Selection (CFS), 3. Multi Layer Perceptron (MLP).
3.1 SU
Symmetrical Uncertainty is a statistical measure that records the SU score of a feature and then designates a rank to each feature: a high SU score receives a top rank and a low score a lower rank. For feature selection, the top-ranked features can be selected depending on the requirement and the type of problem to which it is applied. It is defined as below.
SU(F1, F2) = 2 * IG(F1, F2) / (H(F1) + H(F2))

where IG(F1, F2) is the Information Gain between F1 and F2, H(F1) is the entropy of F1, and H(F2) is the entropy of F2.
The SU score lies in the range [0, 1]. An SU score of 1 shows that one feature can entirely predict the other, while 0 shows that the two features are uncorrelated. In the introduced framework, any feature whose SU score is 0 is ignored, as it cannot influence the learning model.
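A minimal sketch of this computation for two discrete variables is given below; it assumes the continuous expression values have already been discretised, a step the paper does not spell out.

```python
# Sketch of Symmetrical Uncertainty for discrete sequences, following
# SU = 2*IG / (H(F1) + H(F2)) with IG = H(F1) + H(F2) - H(F1, F2).
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(f1, f2):
    h1, h2 = entropy(f1), entropy(f2)
    ig = h1 + h2 - entropy(list(zip(f1, f2)))   # information gain
    return 0.0 if h1 + h2 == 0 else 2.0 * ig / (h1 + h2)
```

Ranking then amounts to computing SU(feature, class) for every feature and sorting in descending order; features scoring 0 are dropped, as noted above.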
3.2 CFS
CFS evaluates the strength of a subset of attributes by taking into account the individual predictive ability of each attribute along with the degree of redundancy between the attributes. Subsets of features that are strongly correlated with the class while having low intercorrelation are preferred. For drawing the subset, CFS employs a search technique; in this paper, the greedy stepwise search technique is used. Instead of selecting the subset of features itself, we take the number of features drawn by this method to decide the minimum number of features to be included in each cluster.
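The sketch below illustrates the subset merit and the greedy stepwise search, reusing symmetrical_uncertainty from the previous sketch as the correlation measure (a common but assumed choice); in this work the result is used only to obtain N, the size of the returned subset.

```python
# CFS merit of a subset S with k features (Hall's formulation):
#   merit(S) = k*avg(feature-class corr) / sqrt(k + k*(k-1)*avg(feature-feature corr))
import numpy as np

def cfs_merit(subset, X, y):
    k = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([symmetrical_uncertainty(X[:, a], X[:, b])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_greedy_stepwise(X, y):
    """Grow the subset greedily while merit improves; N = len of the result."""
    remaining, subset, best = list(range(X.shape[1])), [], 0.0
    while remaining:
        merit, j = max((cfs_merit(subset + [j], X, y), j) for j in remaining)
        if merit <= best:
            break                        # no candidate improves the merit
        subset.append(j); remaining.remove(j); best = merit
    return subset                        # N = len(subset)
```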
3.3 MLP
The MLP is a favored classifier applied in many domains for highly accurate results, and it is based on neurons and perceptrons. As the current methodology dissects the initial feature space into a set of clusters (groups), an MLP is first employed on each cluster formed by the proposed approach to identify the prominent cluster, and its accuracy and RMS error rate are recorded.
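A hedged sketch of this screening step follows: train an MLP on each cluster's columns and keep the cluster with the highest accuracy, breaking ties by the lower RMS error. The layer size and iteration cap are assumptions of the sketch; the paper runs WEKA's MLP with default settings.

```python
# Nominate the dominant cluster by MLP accuracy, then by lower RMS error.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def dominant_cluster(clusters, X_tr, y_tr, X_te, y_te):
    """clusters: list of feature-index lists; labels assumed coded 0..k-1."""
    best = None                                   # (acc, rms, cluster id)
    for cid, cols in enumerate(clusters):
        mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                            random_state=0).fit(X_tr[:, cols], y_tr)
        acc = accuracy_score(y_te, mlp.predict(X_te[:, cols]))
        proba = mlp.predict_proba(X_te[:, cols])
        onehot = np.eye(proba.shape[1])[y_te]     # 0/1 target matrix
        rms = float(np.sqrt(np.mean((proba - onehot) ** 2)))
        if best is None or (acc, -rms) > (best[0], -best[1]):
            best = (acc, rms, cid)
    return best
```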
3.4 Proposed algorithm

The proposed methodology follows the flowchart given in Chart 1.
Input: dataset D
N: Minimum number of features to be in each cluster
TF: Total number of features (SU score > 0)
ListTF: List of features whose SU score > 0
Output: Clusters of features
1. Apply SU on D and define TF; store all features in ListTF in descending order of rank (rank 1 has the highest priority).
2. Apply CFS on D and define N.
3. Define C = TF/N (integer division), the number of clusters.
4. Store the first (next) 'C' features from ListTF in the left-to-right direction, such that the first feature is placed in the first cluster, the second feature in the second cluster, and so on. Get the next 'C' features from ListTF for the next iteration.
5. Store the next 'C' features from ListTF in the right-to-left direction, such that the first feature is placed in the last cluster, the second feature in the second-last cluster, and so on. Get the next 'C' features from ListTF for the next iteration.
6. Repeat steps 4 and 5 until all features are organized.
7. Group all vertically first-order features into the first cluster, second-order features into the second cluster, and so on.
8. If all clusters are structured with an equal number of features, stop; otherwise remove the last feature from the cluster that has an extra feature (as it cannot influence the learning model).
9. Apply MLP on each cluster and nominate the dominant cluster based on the highest accuracy and minimum RMS error rate.
3.5 Example
Assume T = 12 (total # features)
TF = 10 (# features whose SU value is greater than 0)
N = 3 (# features in each cluster)
C = 10/3 = 3 (# clusters)
ListTF = {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}
As per the proposed algorithm, the features in each cluster are formed as in Table 1 below.
Table 1. Cluster of features.

1st Order (C1) | 2nd Order (C2) | 3rd Order (C3) | Direction
f1             | f2             | f3             | Left to Right
f6             | f5             | f4             | Right to Left
f7             | f8             | f9             | Left to Right
               |                | f10            | Right to Left
Cluster 1 : f1, f6, f7.
Cluster 2: f2, f5, f8.
Cluster 3: f3, f4, f9 (Note: f10 is discarded as per step 8).
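As a sanity check, steps 3–8 translate directly into the short sketch below (an illustrative transcription, not the authors' code); it reproduces Table 1, with ListTF assumed already sorted by descending SU rank and N supplied by CFS.

```python
# Serpentine distribution of ranked features into C = TF // N clusters,
# alternating left-to-right and right-to-left rounds, then trimming any
# extra trailing feature so all clusters end up the same size (step 8).
def distribute(list_tf, n):
    c = len(list_tf) // n                 # step 3: number of clusters
    clusters = [[] for _ in range(c)]
    order, i = range(c), 0                # start left-to-right
    while i < len(list_tf):
        for slot, feat in zip(order, list_tf[i:i + c]):   # steps 4-5
            clusters[slot].append(feat)
        order, i = order[::-1], i + c     # step 6: flip direction, next chunk
    size = min(len(cl) for cl in clusters)
    return [cl[:size] for cl in clusters] # step 8: drop the extra feature

print(distribute([f"f{i}" for i in range(1, 11)], 3))
# [['f1', 'f6', 'f7'], ['f2', 'f5', 'f8'], ['f3', 'f4', 'f9']]  (f10 dropped)
```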
4. Dataset description and experimental setup

This portion describes the datasets selected for testing the proposed distribution method, as well as the filter and ranker approaches used for comparing the proposed method. To test the accuracy of the proposed and existing methods, four classifiers based on different conceptual theories are selected. The complete experiment is carried out using the WEKA tool with all default settings. The system configuration is an Intel® Xeon(R) CPU E31220 @ 3.10 GHz × 4 processor with 8 GB RAM running the 64-bit Ubuntu 16.04 operating system. The summary of the datasets is given in Table 2 below, which lists the total number of features (F), instances (I), and classes (C).
Datasets are divided in a ⅔ : ⅓ ratio for training and testing, respectively. As per the Table 2 statistics, there are seven high-dimensional datasets with a minimum of 2000 and a maximum of 12,582 features. To check the proposed method's strength, a lower-dimensional dataset with 166 features is also considered.
After applying the proposed DFS method over all selected datasets, features are distributed across multiple clusters. To find the strong clusters, an MLP is employed over all formed clusters and the respective accuracy and RMS error rate are recorded. Table 3 shows the number of clusters formed (NC), the best cluster id (BCID), the accuracy of MLP and the RMS error rate on it, and the number of features in the best cluster (SF).
Table 3. MLP performance over clusters formed by DFS.
In this work, three well-known feature ranking methods are used for comparison with the proposed study. Information Gain is a popular univariate practice for assessing features: it assesses features according to their information worth and appraises a single attribute at a time. It produces an ordered ranking of all the attributes, and a threshold is then needed to choose a certain number of them in the recorded order. The Gain Ratio is a non-symmetrical measure established to compensate for the bias of Information Gain.
In this work, four well-known classifiers are applied over the selected features. K-Nearest Neighbor (KNN) is a classification algorithm that is an example of a lazy learner. Support Vector Machine (SVM) is another classification strategy that constructs a hyperplane in a high-dimensional space. SC is a tree-based classification strategy, and Ridor is a rule-based classification method.
5. Results and discussion
After applying the proposed method, SF features are selected as per Table 3. To test the strength of the proposed approach, three different filter-based ranking FS methods are used with four different classifiers. From the traditional methods, the top 'N' features are derived, where 'N' equals the number of features (SF) in the strong cluster. For example, the strong cluster of the colon dataset has 27 features in it, so the top 27 features derived by the traditional methods are chosen for evaluating the performance of the proposed method. The performance evaluation of the traditional and proposed methods is produced in Table 4 below.
Table 4. Result analysis of traditional and proposed DFS method.
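For concreteness, the comparison protocol underlying Table 4 can be sketched as below. The classifier objects are rough scikit-learn stand-ins for the WEKA implementations actually used (Ridor has no direct scikit-learn counterpart, and DecisionTreeClassifier only approximates Simple Cart), so this is illustrative only.

```python
# Score every feature set (DFS strong cluster plus top-SF lists from IG/GR/Chi)
# with the same classifiers, so win/draw/loss rates can be tallied per pairing.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for SC
from sklearn.metrics import accuracy_score

def compare(feature_sets, X_tr, y_tr, X_te, y_te):
    """feature_sets: e.g. {'DFS': [...], 'IG': [...], 'GR': [...], 'Chi': [...]},
    each a list of column indices of equal length SF."""
    classifiers = {"KNN": KNeighborsClassifier(),
                   "SVM": SVC(),
                   "Tree": DecisionTreeClassifier()}
    results = {}
    for fs_name, cols in feature_sets.items():
        for clf_name, clf in classifiers.items():
            clf.fit(X_tr[:, cols], y_tr)
            results[(fs_name, clf_name)] = accuracy_score(
                y_te, clf.predict(X_te[:, cols]))
    return results
```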
After a detailed analysis of the obtained results, the following noticeable points are gathered. The proposed method recorded a 75% success rate over IG and GR, and 50% over Chi, with colon. Over leukemia, DFS recorded a 75% winning rate over the traditional IG, GR, and Chi. DFS recorded a 75% success rate compared with IG and GR, and 100% success over Chi, with the leukemia_3c dataset. With the leukemia_4c dataset, a 50% success rate was obtained compared with GR and Chi, and 25% compared with IG. In the case of lymphoma, DFS registered a 25% winning rate compared with IG and Chi, and 50% with GR. Using SRBCT, the proposed method performed a little poorly; its success rate is only 25%. Using the MLL dataset, the success rate of dfs is 75% compared with IG and Chi, and 25% with GR. Dfs performed 100% better compared with GR, and 50% with IG and Chi, over the Musk dataset.
The average winning, draw, and loss rates of the proposed method after applying the various classifiers are given in Table 5 below. The success rate of Ridor is 70.83%, SC is 45.83%, KNN is 45.83%, and SVM is 62.50%. The average success rate of the proposed method compared with the traditional methods after applying the four classifiers is 56.25%, and the draw rate of dfs is 15.18%.
Table 5. Win/Draw/Loss rate of proposed method.

     | Ridor  | SC     | KNN    | SVM    | Average
Win  | 70.83% | 45.83% | 45.83% | 62.50% | 56.25%
Draw |  4.17% | 19.05% |  8.33% | 29.17% | 15.18%
Loss | 25.00% | 37.50% | 45.83% |  8.33% | 29.17%
A graphical representation of all the result analyses is given in Table 6 below.
6. Conclusion

In this research attempt, we have proposed a distributed feature selection (dfs) approach that can be applied to complex high-dimensional datasets to increase classification performance. As a high-dimensional dataset contains a large number of redundant and irrelevant features and only the prominent features help in predicting an unknown instance, a novel method is needed for selecting the best features from the existing space. We used Symmetrical Uncertainty, Correlation-based Feature Subset Selection, and Multi Layer Perceptron approaches in the proposed method. Using the proposed approach, features are distributed among the various clusters without any repetition. Out of the many clusters, only one strong cluster, which has a limited number of features, is selected based on the result of the MLP. The proposed approach is compared with three well-known filter-based ranking feature selection methods using four classification algorithms. The proposed method recorded approximately a 57% success rate and an 18% competitive rate against the traditional methods after applying it over seven well-known high-dimensional datasets and one lower-dimensional dataset. Finally, analyzing each cluster with an MLP in sequential order is time consuming; if a large number of clusters are formed, they should be analyzed in parallel using a parallel programming framework such as Hadoop, which is our future study, so that this could be considered a generalized approach for distributed feature selection.
Acknowledgements
The authors would like to thank the Department of Computer Engineering, SRES Sanjivani College of Engineering, Kopargaon, Maharashtra, India for providing the necessary support for carrying out the research work. The authors would also like to thank the Research & Development team of K L University, India for their continuous support in carrying out the research work.