Research Article | Volume 7, Issue 2, P171-176, June 2019


Distributed feature selection (DFS) strategy for microarray gene expression data to improve the classification performance

      Abstract

      Objective

The objective of this research article is to present a novel feature selection strategy for improving classification performance over high-dimensional data sets. The curse of dimensionality is the most serious downside of microarray data, as it contains a large number of genes (features), which undermines computational stability. In microarray data analytics, identifying the most relevant features requires full attention. Most researchers apply a two-stage strategy for gene expression data analysis: in the first stage, feature selection or feature extraction is employed as a preprocessing step to pinpoint the more prominent features; in the second stage, classification is applied using the selected subset of features.

      Method

In this research we follow the same strategy, but introduce a distributed feature selection (DFS) strategy using Symmetrical Uncertainty (SU) and a Multi-Layer Perceptron (MLP) by distributing the features across multiple clusters. Each cluster is equipped with a finite number of features. An MLP is employed over each cluster, and the dominant cluster is nominated based on the highest accuracy and lowest root mean square (RMS) error rate.

      Result

Classification accuracies with Ridor, Simple Cart (SC), KNN, and SVM are measured by considering the dominant cluster's features. The performance of this cluster is compared with traditional filter-based ranking techniques such as Information Gain (IG), Gain Ratio Attribute Evaluator (GRAE), and Chi-Squared Attribute Evaluator (Chi). The proposed method recorded approximately a 57% success rate and an 18% competitive rate against the traditional methods after being applied over seven well-known high-dimensional datasets and one lower-dimensional dataset.

      Conclusion

The proposed methodology was applied over very high-dimensional microarray datasets. Using this method, memory consumption is reduced and classification performance can be improved.


      1. Introduction

Feature Selection (FS) is routinely used as a preprocessing step for high-dimensional data. Any technique for high-dimensional data must deal with the curse of dimensionality. The few hundreds to thousands of variables in a microarray dataset lead to very high dimensionality, and the performance of an analysis technique degrades as the number of features rises.
      • Bolón-Canedo V.
      • Sánchez-Maroño N.
      • Alonso-Betanzos A.
      Feature selection for high-dimensional data.
FS is a process of removing insignificant and superfluous features; it attempts to find a candidate set of features that best describes the data.
      • Peralta D.
      • del Río S.
      • Ramírez-Gallego S.
      • Triguero I.
      • Benitez J.M.
      • Herrera F.
      Evolutionary feature selection for big data classification: a mapreduce approach.
Irrelevant and superfluous features bias the learning algorithm's accuracy, as they provide no surplus knowledge for prediction and may confuse the algorithm during the learning and classification phases. Hence, it is obligatory to drop insignificant and redundant features. Because of the high dimensionality, the classification task becomes more difficult.
As not all the genes in a gene expression microarray dataset influence the classification model, there is a pressing need to derive the key genes by which the classification task can become easier and more accurate. A possible choice for this issue is applying an FS technique over the high-dimensional dataset. As a result of FS methods, computing time is saved and prediction performance can be accelerated. In statistics, FS is also called attribute selection or variable selection. Because of the high dimensionality and low instance rate of microarray data, a strong computational approach is required to handle the difficulties of data analysis. It is widely accepted that in the majority of microarray data only a few genes play a critical role in the classification task, and the remaining genes are considered irrelevant. FS is therefore an important preprocessing concept before proceeding to the classification task.
To address this subject, three different types of FS approaches exist in the literature, namely Filter, Wrapper, and Embedded.
      • Panthong R.
      • Srivihok A.
      Wrapper feature subset selection for dimension reduction based on ensemble learning algorithm.
The filter mode is used to establish the subdivision of input features that has significant predictive capacity. By nominating the right features, we can potentially upgrade the efficiency of classification. It uses distinct statistical tests to select the subdivision of features with the greatest predictive power: decide on a statistical measure to apply, then compute the score for every feature. The features are then ordered by score, and the features with the top scores are considered in learning-model generation while the others are ignored. Thereby, memory consumption can be pruned and computation time can be reduced. Information Gain (IG), Gain Ratio Attribute Evaluator (GRAE), Chi-Squared Attribute Evaluator (Chi), Symmetric Uncertainty (SU), and ReliefF (REL) are some of the popular filter-based methods.
In this paper, we introduce a distributed feature selection (DFS) strategy using SU, Correlation-based Feature Subset Selection (CFS), and an MLP by distributing the features across multiple clusters. As a result of this distribution process, each cluster is loaded with a small number of features. To identify the strong cluster among those formed, an MLP is employed on each cluster; based on the highest accuracy and lowest RMS error rate, the strong cluster is documented. This strong cluster of features is compared with existing filter-based ranking methods such as IG, GRAE, and Chi.
The idea of this method is inspired by the project allocation strategy applied in academics. Generally, students in their final year need to undertake a group academic project, and the project coordinator needs to form the groups (each consisting of 'N' students). There are a few possibilities and limitations in forming the groups. Students can voluntarily form their own group as per the defined group size, which is similar to selecting features randomly. If all toppers are in the same group, their project result can be better than that of the other groups; this is similar to selecting the top-ranked features from the list. If all poor students are in one group, their project outcome can be poor. Alternatively, if top, average, and poor students are mixed in a proper structure, every group can be balanced, and average and poor students can learn some useful knowledge from the toppers; this mirrors the ensemble approach widely accepted in classification. Here, the performance of a group (cluster) depends on the members (features) of that group. In this work, we try to present a structure that can form balanced groups providing adequate knowledge; the proposed structure for forming the groups is explained later. For this research, the rank of each feature is computed using SU, and the minimum number of features to be included in each cluster is defined by the result of CFS.
For testing the strength of the proposed method, seven well-known microarray datasets with features in the range 2000-12,582 and one lower-dimensional dataset are used. After deriving the features by the existing methods and the proposed method, well-known classification techniques such as Ridor (rule-based), SC (tree-based), KNN (lazy), and SVM (Support Vector Machine) are employed, and their respective performance is analyzed.
In the next portion of this article, some of the existing literature on microarray datasets is presented. The distribution of features among the clusters is presented in the methodology section. The dataset description and experimental setup are given in Section 4, and the results and discussion are produced in Section 5. The article concludes with possible suggestions.

      2. Literature

The main objective of the current work is to establish a DFS framework to select the best subset of features from a high-dimensional data set. Some existing review reports are available to guide this feature selection. As discussed in the introduction, there are three approaches to feature selection (Filter, Wrapper, Embedded); we consider the filter-based approach. As a filter method designates a rank to each feature based on its statistical score, we chose one of the filter methods, namely Symmetric Uncertainty (SU). A brief description of SU is given in the methodology section.
In addition to SU, a few more filter-based modes (IG, GR, CHI, REL) are available for selecting features and designating a rank to each feature. Based on the rank assigned, and depending on the number of features to be selected for classification, the top-ranked features can be selected for analysis. SU is applied in the FAST (Fast clustering bAsed feature Selection algoriThm) clustering technique
      • Song Q.
      • Ni J.
      • Wang G.
      A fast clustering-based feature subset selection algorithm for high-dimensional data.
and in improved FAST clustering techniques for dimensionality reduction.
      • Malji P.
      • Sakhare S.
      January. Significance of entropy correlation coefficient over symmetric uncertainty on FAST clustering feature selection algorithm.
Both are based on graph-theoretic concepts. In FAST, Prim's algorithm is employed to select the subset, whereas in improved FAST, Kruskal's minimum spanning tree algorithm is employed to derive the features. Information Gain-based clustering frameworks have been proposed to find the best features of kidney and voting datasets.
      • Potharaju S.P.
      • Sreedevi M.
      A novel cluster of feature selection method based on information gain.
A new feature selector using a minimum variance method based on information gain was implemented and tested on 9 datasets; it performed better than the traditional feature selection algorithms.
      • Venkataraman S.
      • Sivakumar S.
      • Selvaraj R.
      A novel clustering based feature subset selection framework for effective data classification.
An ensemble-based multi-filter feature selection framework was proposed for DDoS detection in cloud computing by combining the features extracted by IG, GR, CHI, and REL. The authors considered the features scoring above a defined threshold among the ensembled features.
      • Osanaiye O.
      • Cai H.
      • Choo K.K.R.
      • Dehghantanha A.
      • Xu Z.
      • Dlodlo M.
      Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing.
      This method reduced the number of features from 41 to 13.
Some literature exists that addresses the high-dimensionality issue over various microarray datasets using various FS approaches. A distributed FS mechanism was applied over 8 microarray datasets by partitioning each dataset vertically. From each partition, strong features were selected by applying various FS methods such as IG, CFS, consistency-based filters, and ReliefF. The researchers employed C4.5, Naive Bayes (NB), KNN, and SVM classifiers to test their proposed method.
      • Bolón-Canedo Verónica
      • Sánchez-Maroño Noelia
      • Alonso-Betanzos Amparo
      Distributed feature selection: an application to microarray data classification.
KNN was applied on 5 microarray datasets using a MapReduce and Hadoop framework executed over multiple cluster nodes. Depending on the number of features, the researchers used 2 and 18 cluster nodes.
      • Kumar Mukesh
      • Kumar Rath Nitish
      • Swain Amitav
      • Kumar Rath Santanu
      Feature selection and classification of microarray data using MapReduce based ANOVA and K-Nearest neighbor.
The CFS method was applied over a lymphoblastic leukemia dataset. With this approach, the researchers drew 16 top-quality features. After employing KNN, MLP, and SVM, a 92.5% average accuracy was recorded.
      • Singhal Vanika
      • Singh Preety
      Correlation based feature selection for diagnosis of acute lymphoblastic leukemia.
Random Forest (RF), KNN, MLP, and SVM were employed over the colon and leukemia datasets after preprocessing using Partial Least Squares (PLS), Quadratic Programming FS (QPFS), Max Relevance (Max Rel), and minimum redundancy maximum relevance (mRMR) methods. The final experimental results indicated that mRMR recorded the best performance of all.
      • Sun Jing
      • Passi Kalpdrum
      • Kumar Jain Chakresh
      Improved microarray data analysis using feature selection methods with machine learning methods.
The leukemia dataset was analyzed with SVM classifiers together with ReliefF and CFS; the experimental analysis recorded above 90% accuracy.
      • Yang Sitan
      • Naiman Daniel Q.
      Multiclass cancer classification based on gene expression comparison.
For classifying multiple disease states associated with cancer based on gene expression profiles, a novel method was proposed by the researchers using SVM and Random Forest (RF).
      • TAşçi A.
      • İnce T.
      • GüZELış C.
      A comparison of feature selection algorithms for cancer classification through gene expression data: leukemia case.
As the colon dataset suffers from a class imbalance issue, SMOTE was applied to balance the instance rate. After applying this method, ensembling approaches were employed with J48, RF, and REPTree. The researchers also considered the IG ranking-based filter approach to select the prominent features.
      • Al-Bahrani Reda
      • Agrawal Ankit
      • Choudhary Alok
      Colon cancer survival prediction using ensemble data mining on SEER data.
SMOTE (Synthetic Minority Over-sampling Technique) and ensembling methods have been used over a kidney disease dataset by a few researchers.
      • Potharaju Sai Prasad
      • Sreedevi M.
      Ensembled rule based classification algorithms for predicting imbalanced kidney disease data.
      ,
      • Potharaju Sai Prasad
      • Sreedevi M.
      An improved prediction of kidney disease using SMOTE.
Distance-based FS was applied over the colon and leukemia datasets, and SVM was employed over the selected features. As per the researchers' analysis, B/SVM selected 6.36% of the features with a 9.5% misclassification rate over the colon dataset, and 9.54% of the features with a 3.08% misclassification rate over the leukemia dataset.
      • Wenyan Z.
      • Xuewen L.
      • Jingjing W.
      Feature selection for cancer classification using microarray gene expression data.
Artificial Neural Networks (ANN), SVM, Bayesian Networks, and Decision Trees applied over cancer datasets are covered in a mini-review report published by the authors.
      • Kourou Konstantina
      • Exarchos Themis P.
      • Exarchos Konstantinos P.
      • Karamouzis Michalis V.
      • Fotiadis Dimitrios I.
      Machine learning applications in cancer prognosis and prediction.

      3. Methodology

The proposed method is majorly based on three important components: 1. Symmetric Uncertainty (SU), 2. Correlation-based Feature Subset Selection (CFS), 3. Multi-Layer Perceptron (MLP).

      3.1 SU

Symmetrical Uncertainty is a statistical measure that records the SU score of a feature and then designates a rank to each feature: a high SU score earns a top rank and a low score a low rank. For feature selection, the top-ranked features can be selected depending on the requirement and the type of problem to which it is applied. It is defined as below.
SU(F1, F2) = 2 * IG / (H(F1) + H(F2))

where IG is the Information Gain between F1 and F2, H(F1) is the entropy of F1, and H(F2) is the entropy of F2.
The SU score lies in the range [0, 1]. An SU score of 1 shows that one feature can entirely predict the other, while 0 shows the two features are uncorrelated. For the introduced framework, a feature with an SU score of 0 is ignored, as it cannot influence the learning model.
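As a concrete illustration, the following minimal Python sketch computes SU for a pair of discrete variables (a feature and the class, or two features). It assumes the inputs have already been discretized, as WEKA's ranking filters do internally; all names here are illustrative and not part of the original study's code.

    import math
    from collections import Counter

    def entropy(values):
        # Shannon entropy H(X) of a discrete sequence, in bits.
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def information_gain(f1, f2):
        # IG(F1; F2) = H(F1) - H(F1 | F2); mutual information, symmetric in its arguments.
        n = len(f2)
        h_cond = 0.0
        for v in set(f2):
            subset = [a for a, b in zip(f1, f2) if b == v]
            h_cond += (len(subset) / n) * entropy(subset)
        return entropy(f1) - h_cond

    def symmetric_uncertainty(f1, f2):
        # SU = 2 * IG / (H(F1) + H(F2)); taken as 0 when both entropies vanish.
        denom = entropy(f1) + entropy(f2)
        return 2.0 * information_gain(f1, f2) / denom if denom > 0 else 0.0

For ranking, symmetric_uncertainty(feature, class_labels) would be computed per feature and the features then sorted by the resulting score.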

      3.2 CFS

It evaluates the strength of a subset of attributes by taking into account the individual predictive ability of each attribute along with the degree of redundancy between them. Subsets of features that are strongly correlated with the class while having low intercorrelation are preferred. To draw the subset, CFS employs a search technique; in this paper, the greedy stepwise search is used. Instead of selecting that subset of features directly, we take the number of features drawn by this method to decide the minimum number of features to be included in each cluster.
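For reference, greedy stepwise search in CFS maximizes the subset merit defined by Hall (1999); the sketch below states that merit, with mean_r_cf and mean_r_ff standing for the mean feature-class and mean feature-feature correlations of a candidate subset (illustrative names, not from the original study):

    import math

    def cfs_merit(mean_r_cf, mean_r_ff, k):
        # Merit(S) = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for a subset of k features:
        # rewards high correlation with the class and low intercorrelation among features.
        return (k * mean_r_cf) / math.sqrt(k + k * (k - 1) * mean_r_ff)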

      3.3 MLP

The MLP is one of the favored classifiers applied in many domains for highly accurate results, and is based on neurons and perceptrons. As the current methodology dissects the initial feature space into a set of clusters (groups), an MLP is first employed on each cluster formed by the proposed approach to identify the prominent cluster, recording its accuracy and RMS error rate.
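The per-cluster screening can be pictured with the following scikit-learn stand-in for WEKA's MultilayerPerceptron (an assumption for illustration only; the original experiments used WEKA with default settings). The RMS error here is taken over class-probability estimates, which is assumed to mirror how WEKA reports root mean squared error for classifiers.

    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import label_binarize

    def evaluate_cluster(X_train, y_train, X_test, y_test):
        # Train an MLP on one cluster's feature columns and score it.
        clf = MLPClassifier(max_iter=500, random_state=0).fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        proba = clf.predict_proba(X_test)              # class-probability estimates
        onehot = label_binarize(y_test, classes=clf.classes_)
        if onehot.shape[1] == 1:                       # binary case: expand to two columns
            onehot = np.hstack([1 - onehot, onehot])
        rms = float(np.sqrt(np.mean((proba - onehot) ** 2)))
        return acc, rms

The cluster with the highest accuracy, with ties broken by the lowest RMS error, would then be nominated as the dominant cluster.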
The proposed methodology follows the flowchart given in Chart 1.
Chart 1. Distribution of features among clusters.

      3.4 DFS algorithm

Input: D, C, N, TF, ListTF
D: balanced data set
C: number of clusters
N: minimum number of features to be in each cluster
TF: total number of features (SU score > 0)
ListTF: list of features whose SU score > 0
Output: clusters of features
1. Apply SU on D and define TF; store all features in ListTF in descending order of rank (rank 1 has the highest priority).
2. Apply CFS on D and define N.
3. Define C as C = floor(TF / N).
4. Store the first (next) 'C' features from ListTF in the left-to-right direction, so that the first feature is placed in the first cluster, the second feature in the second cluster, and so on; then take the next 'C' features from ListTF for the next iteration.
5. Store the next 'C' features from ListTF in the right-to-left direction, so that the first feature is placed in the last cluster, the second feature in the second-to-last cluster, and so on; then take the next 'C' features from ListTF for the next iteration.
6. Repeat steps 4 and 5 until all features are organized.
7. Group all vertically first-order features into the first cluster, second-order features into the second cluster, and so on.
8. If all clusters are structured with an equal number of features, stop; otherwise remove the last feature from any cluster that has an extra feature, as it cannot influence the learning model. (A code sketch of steps 3-8 follows this list.)
9. Apply the MLP on each cluster and nominate the dominant cluster based on the highest accuracy and minimum RMS error rate.
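Below is a minimal Python sketch of steps 3-8, as referenced in step 8. It assumes ranked_features is already sorted by descending SU score (step 1) and n_per_cluster is the N supplied by CFS (step 2); the floor division in step 3 follows the worked example in Section 3.5. Names are illustrative only.

    def distribute_features(ranked_features, n_per_cluster):
        tf = len(ranked_features)
        c = tf // n_per_cluster                        # step 3: C = floor(TF / N)
        clusters = [[] for _ in range(c)]
        for round_no, start in enumerate(range(0, tf, c)):
            chunk = ranked_features[start:start + c]   # next 'C' ranked features
            for i, feat in enumerate(chunk):
                # even rounds fill clusters left-to-right (step 4),
                # odd rounds right-to-left (step 5)
                target = i if round_no % 2 == 0 else c - 1 - i
                clusters[target].append(feat)
        size = min(len(cl) for cl in clusters)         # step 8: drop trailing extras
        return [cl[:size] for cl in clusters]          # so all clusters are equal-sized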

      3.5 Example

Assume T = 12 (total number of features),
TF = 10 (number of features whose SU value is greater than 0),
N = 3 (number of features in each cluster),
C = floor(10/3) = 3 (number of clusters),
ListTF = {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}.
As per the proposed algorithm, the features in each cluster are formed as shown in Table 1 below.
Table 1. Cluster of features.

1st Order (C1) | 2nd Order (C2) | 3rd Order (C3) | Direction
f1 | f2 | f3 | Left to Right
f6 | f5 | f4 | Right to Left
f7 | f8 | f9 | Left to Right
- | - | f10 | Right to Left
Cluster 1: f1, f6, f7.
Cluster 2: f2, f5, f8.
Cluster 3: f3, f4, f9 (note: f10 is discarded as per step 8).
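Under the same assumptions, the distribute_features sketch from Section 3.4 reproduces this example:

    clusters = distribute_features(
        ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"], 3)
    # clusters == [["f1", "f6", "f7"], ["f2", "f5", "f8"], ["f3", "f4", "f9"]]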

      4. Experiment

In this section, a description of the datasets selected for testing the proposed distribution method is given, along with the filter methods and ranker approaches used for comparison. For testing the accuracy of the proposed and existing methods, 4 classifiers based on different conceptual theories are selected. The complete experiment is carried out using the WEKA tool with all default settings. The system configuration is an Intel Xeon CPU E31220 @ 3.10 GHz × 4 processor with 8 GB RAM and the 64-bit Ubuntu 16.04 operating system. The summary of the datasets is given in Table 2 below, which lists the total number of features (F), instances (I), and classes (C).
Table 2. Dataset description (source: http://csse.szu.edu.cn/staff/zhuzx/Datasets.html).

Dataset | #F | #I | #C
COLON | 2000 | 62 | 2
LEUKEMIA | 7129 | 72 | 2
LEUKEMIA_3C | 7129 | 72 | 3
LEUKEMIA_4C | 7129 | 72 | 4
LYMPHOMA | 4026 | 66 | 3
SRBT | 2308 | 83 | 4
MLL | 12,582 | 72 | 3
MUSK | 166 | 476 | 2
The datasets are divided in a 2/3 and 1/3 ratio for training and testing, respectively. As per the Table 2 statistics, there are 7 high-dimensional datasets with a minimum of 2000 and a maximum of 12,582 features. To check the strength of the proposed method, a lower-dimensional dataset with 166 features is also considered.
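For illustration, the 2/3 train and 1/3 test partition can be reproduced with scikit-learn as below; X and y stand for one dataset's feature matrix and class labels, and the stratified, seeded split is an assumption, since the original paper does not state how the partition was drawn.

    from sklearn.model_selection import train_test_split

    # Hypothetical X (features) and y (labels) for one dataset from Table 2.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)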
After applying our proposed DFS method over all the selected datasets, the features are distributed across multiple clusters. To identify the strong clusters, the MLP is employed over all the formed clusters and the respective accuracy and RMS error rate are recorded. Table 3 shows the number of clusters formed (NC), the best cluster id (BCID), the accuracy of the MLP and its RMS error rate, and the number of features in the best cluster (SF).
Table 3. MLP performance over clusters formed by DFS.

Dataset | #NC | BCID | Accuracy | RMS | SF
COLON | 5 | 3 | 83.87 | 0.3753 | 27
LEUKEMIA | 12 | 11 | 98.61 | 0.1246 | 85
LEUKEMIA_3C | 8 | 5 | 98.61 | 0.0778 | 114
LEUKEMIA_4C | 6 | 5 | 97 | 0.1183 | 133
LYMPHOMA | 12 | 6 | 100 | 0.0321 | 82
SRBT | 5 | 2 | 100 | 0.0257 | 133
MLL | 36 | 16 | 100 | 0.0441 | 45
MUSK | 4 | 3 | 89.91 | 0.2908 | 36
In this work, 3 well-known feature ranking methods are used for comparison with the proposed study. Information Gain is one of the popular univariate practices for assessing features: it assesses features by their information worth, appraising a single attribute at a time. It produces an ordered ranking of all the attributes, and a threshold is then needed to choose a certain number of them in the recorded order. The Gain Ratio is a non-symmetrical measure established to compensate for the bias of Information Gain.
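In terms of the Section 3.1 helpers, the Gain Ratio simply normalizes IG by the entropy of the feature itself, which is how it compensates for IG's bias toward many-valued attributes. The sketch below follows the usual textbook definition; WEKA's exact implementation details are not reproduced here.

    def gain_ratio(feature, cls):
        # GR = IG(class; feature) / H(feature); taken as 0 when the feature has zero entropy.
        h_f = entropy(feature)
        return information_gain(cls, feature) / h_f if h_f > 0 else 0.0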
In this work, 4 well-known classifiers are applied over the selected features: K-Nearest Neighbor (KNN), a classification algorithm that is an example of a lazy learner; Support Vector Machine (SVM), a classification strategy that constructs a hyperplane in a high-dimensional space; SC, a tree-based classification strategy; and Ridor, a rule-based classification method.

      5. Results and discussion

After applying the proposed method, SF features are selected as per Table 3. To test the strength of the proposed approach, 3 different filter-based ranking FS methods are used with 4 different classifiers. From the traditional methods, the top 'N' features are derived, where 'N' equals the number of features (SF) in the strong cluster. For example, the strong cluster of the colon dataset has 27 features, so the top 27 features derived by the traditional methods are chosen for evaluating the performance of the proposed method. The performance evaluation of the traditional and proposed methods is produced in Table 4 below.
Table 4. Result analysis of traditional and proposed DFS methods (classification accuracy, %).

Dataset | Method | Ridor | SC | KNN | SVM
Colon | IG | 70.96 | 79.03 | 80.64 | 88.7
Colon | GR | 72.58 | 82.25 | 80.64 | 85.48
Colon | CHI | 72.58 | 79.03 | 85.48 | 87.09
Colon | DFS* | 79.03 | 82.25 | 82.25 | 87.09
Leukemia | IG | 91.66 | 86.11 | 98.61 | 97.22
Leukemia | GR | 90.27 | 84.72 | 95.82 | 97.22
Leukemia | CHI | 91.66 | 86.11 | 95.83 | 97.22
Leukemia | DFS* | 93.05 | 94.44 | 95.83 | 98.61
Leukemia_3C | IG | 87.5 | 84.72 | 94.44 | 97.22
Leukemia_3C | GR | 84.72 | 84.72 | 97.22 | 94.44
Leukemia_3C | CHI | 88.88 | 86.11 | 93.05 | 95.83
Leukemia_3C | DFS* | 90.27 | 87.5 | 95.83 | 97.22
Leukemia_4C | IG | 88.88 | 84.72 | 97.22 | 95.83
Leukemia_4C | GR | 91.66 | 88.88 | 93.05 | 91.66
Leukemia_4C | CHI | 90.27 | 87.5 | 93.05 | 91.66
Leukemia_4C | DFS* | 81.94 | 83.33 | 95.83 | 97.22
Lymphoma | IG | 96.96 | 96.96 | 96.96 | 98.48
Lymphoma | GR | 100 | 100 | 96.96 | 100
Lymphoma | CHI | 93.93 | 98.98 | 95.45 | 98.48
Lymphoma | DFS* | 98.48 | 96.96 | 98.48 | 98.48
MLL | IG | 84.72 | 86.11 | 95.83 | 97.22
MLL | GR | 93.05 | 90.27 | 94.44 | 95.83
MLL | CHI | 84.72 | 86.11 | 95.83 | 98.61
MLL | DFS* | 87.5 | 88.88 | 93.05 | 100
SRBT | IG | 81.92 | 86.74 | 100 | 100
SRBT | GR | 81.92 | 81.92 | 100 | 100
SRBT | CHI | 81.92 | 85.54 | 100 | 100
SRBT | DFS* | 85.54 | 81.92 | 97.55 | 100
MUSK | IG | 74.15 | 81.51 | 84.03 | 78.15
MUSK | GR | 71.42 | 74.57 | 82.98 | 71.42
MUSK | CHI | 73.1 | 79.2 | 82.98 | 78.78
MUSK | DFS* | 73.1 | 74.78 | 84.66 | 78.99

* Proposed method.
After detailed analysis of the obtained results, the following noticeable points are gathered. The proposed method recorded a 75% success rate over IG and GR, and 50% over Chi, with colon. Over leukemia, DFS recorded a 75% winning rate over the traditional IG, GR, and Chi. With the leukemia_3c dataset, DFS recorded a 75% success rate compared with IG and GR, and 100% success over Chi. With the leukemia_4c dataset, a 50% success rate is obtained compared with GR and Chi, and 25% compared with IG. In the case of lymphoma, DFS registered a 25% winning rate compared with IG and Chi, and 50% with GR. Using SRBT, the proposed method performed a little poorly; its success rate is only 25%. Using the MLL dataset, the success rate of DFS is 75% compared with IG and Chi, and 25% with GR. Over the Musk dataset, DFS performed 100% better compared with GR, and its performance is 50% with IG and Chi.
The average winning, draw, and loss rates of the proposed method after applying the various classifiers are given in Table 5 below. The success rate of Ridor is 70.83%, SC is 45.83%, KNN is 45.83%, and SVM is 62.50%. The average success rate of the proposed method compared with the traditional methods after applying the 4 classifiers is 56.25%, and the draw rate of DFS is 15.18%.
Table 5. Win/Draw/Loss rate of the proposed method.

Rate | Ridor | SC | KNN | SVM | Average
Win | 70.83% | 45.83% | 45.83% | 62.50% | 56.25%
Draw | 4.17% | 19.05% | 8.33% | 29.17% | 15.18%
Loss | 25.00% | 37.50% | 45.83% | 8.33% | 29.17%
A graphical representation of the complete result analysis is given in Table 6.
Table 6. Graphical representation of result analysis.

      6. Conclusion

In this research attempt, we have proposed a distributed feature selection (DFS) approach that can be applied to complex high-dimensional datasets to increase classification performance. As a high-dimensional dataset contains a large number of redundant and irrelevant features, and only the prominent features help in predicting an unknown instance, a novel method is needed for selecting the best features from the existing space. We used Symmetrical Uncertainty, Correlation-based Feature Subset Selection, and Multi-Layer Perceptron approaches for the proposed method. Using the proposed approach, features are distributed among the various clusters without any repetition. Out of the many clusters, only one strong cluster, which has a limited number of features, is selected based on the result of the MLP. The proposed approach is compared with three well-known filter-based ranking feature selection methods and 4 different classification algorithms. The proposed method recorded approximately a 57% success rate and an 18% competitive rate against the traditional methods after being applied over 7 well-known high-dimensional datasets and one lower-dimensional dataset. Finally, it is noted that analyzing each cluster with the MLP in sequential order is time-consuming; if a large number of clusters are formed, they should be analyzed in parallel using a parallel programming framework such as Hadoop, which is our future study, so that the method can be considered a generalized approach for distributed feature selection.

      Acknowledgements

The authors would like to thank the Department of Computer Engineering, SRES Sanjivani College of Engineering, Kopargaon, Maharashtra, India for providing the necessary support for carrying out the research work. The authors would also like to thank the Research & Development team of K L University, India for their continuous support.

      References

Bolón-Canedo V., Sánchez-Maroño N., Alonso-Betanzos A. Feature selection for high-dimensional data. Prog Artif Intell. 2016; 5: 65-75.
Peralta D., del Río S., Ramírez-Gallego S., Triguero I., Benitez J.M., Herrera F. Evolutionary feature selection for big data classification: a mapreduce approach. Math Prob Eng. 2015; 2015.
Panthong R., Srivihok A. Wrapper feature subset selection for dimension reduction based on ensemble learning algorithm. Proc Comp Sci. 2015; 72: 162-169.
Song Q., Ni J., Wang G. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans Knowl Data Eng. 2013; 25: 1-14.
Malji P., Sakhare S. Significance of entropy correlation coefficient over symmetric uncertainty on FAST clustering feature selection algorithm. 2017 11th International Conference on Intelligent Systems and Control (ISCO). IEEE, 2017: 457-463.
Potharaju S.P., Sreedevi M. A novel cluster of feature selection method based on information gain. Int J Control Theory Appl. 2017; 10: 10-16.
Venkataraman S., Sivakumar S., Selvaraj R. A novel clustering based feature subset selection framework for effective data classification. Indian J Sci Technol. 2016; 9.
Osanaiye O., Cai H., Choo K.K.R., Dehghantanha A., Xu Z., Dlodlo M. Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing. EURASIP J Wirel Commun Networking. 2016; 2016: 130.
Bolón-Canedo V., Sánchez-Maroño N., Alonso-Betanzos A. Distributed feature selection: an application to microarray data classification. Appl Soft Comput. 2015; 30: 136-150.
Kumar M., Rath N.K., Swain A., Rath S.K. Feature selection and classification of microarray data using MapReduce based ANOVA and K-Nearest neighbor. Proc Comp Sci. 2015; 54: 301-310.
Singhal V., Singh P. Correlation based feature selection for diagnosis of acute lymphoblastic leukemia. Proceedings of the Third International Symposium on Women in Computing and Informatics. ACM, 2015: 5-9.
Sun J., Passi K., Jain C.K. Improved microarray data analysis using feature selection methods with machine learning methods. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2016: 1527-1534.
Yang S., Naiman D.Q. Multiclass cancer classification based on gene expression comparison. Stat Appl Genet Mol Biol. 2014; 13: 477-496.
Taşçi A., İnce T., Güzeliş C. A comparison of feature selection algorithms for cancer classification through gene expression data: leukemia case. 2017 10th International Conference on Electrical and Electronics Engineering (ELECO), Bursa, Turkey. 2017: 1352-1354.
Al-Bahrani R., Agrawal A., Choudhary A. Colon cancer survival prediction using ensemble data mining on SEER data. 2013 IEEE International Conference on Big Data. IEEE, 2013: 9-16.
Potharaju S.P., Sreedevi M. Ensembled rule based classification algorithms for predicting imbalanced kidney disease data. J Eng Sci Technol Rev. 2016; 9: 201-207.
Potharaju S.P., Sreedevi M. An improved prediction of kidney disease using SMOTE. Indian J Sci Technol. 2016; 9.
Wenyan Z., Xuewen L., Jingjing W. Feature selection for cancer classification using microarray gene expression data. Biostat Biometrics Open Acc J. 2017; 1: 555557. https://doi.org/10.19080/BBOAJ.2017.01.555557.
Kourou K., Exarchos T.P., Exarchos K.P., Karamouzis M.V., Fotiadis D.I. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015; 13: 8-17.