A Microarray Data Pre-processing Method for Cancer Classification

Tay Xin Hui - Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
Shahreen Kasim - Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
Mohd Farhan Md Fudzee - Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
Zubaile Abdullah - Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
Rohayanti Hassan - Universiti Teknologi Malaysia, 83100, Johor, Malaysia
Aldo Erianda - Politeknik Negeri Padang, Sumatera Barat, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.6.4.1523


The development of microarray technology has driven substantial research progress in many fields. With machine learning techniques and statistical methods, it is now possible to organize, analyze, and interpret large amounts of biological data to uncover significant patterns of interest. Exploiting microarray data nevertheless remains a major challenge for many researchers. Raw gene expression data are often affected by missing values, noise, incompleteness, and inconsistency. Hence, the data must be processed before they are used for cancer classification. Data pre-processing is a necessary step for extracting the biological significance of microarray gene expression data, obtaining valuable information for further analysis, and addressing important hypotheses. This study presents a detailed description of a data pre-processing method for cancer classification. The proposed method consists of three phases: data cleaning, transformation, and filtering. It was implemented using a combination of the GenePattern software tool and RStudio. To demonstrate its feasibility for cancer classification, the method was applied to six gene expression datasets: lung, stomach, liver, kidney, thyroid, and breast cancer. A comparison illustrates the differences between the datasets before and after data pre-processing.
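As a rough illustration of the three phases named above (cleaning, transformation, filtering), the following is a minimal Python sketch and not the authors' GenePattern/RStudio implementation. The function names, the mean-imputation rule, the log2 transform with a floor, the variance filter, and every threshold value are assumptions chosen only to make the example concrete.

```python
import math
from statistics import mean, pvariance

# Hypothetical helpers: names, rules, and thresholds are illustrative assumptions.

def clean(expr, max_missing_frac=0.5):
    """Phase 1 (cleaning): drop genes with too many missing values,
    then fill the remaining gaps with the gene's mean expression."""
    cleaned = {}
    for gene, values in expr.items():
        observed = [v for v in values if v is not None]
        missing_frac = (len(values) - len(observed)) / len(values)
        if not observed or missing_frac > max_missing_frac:
            continue  # too sparse to impute sensibly
        gene_mean = mean(observed)
        cleaned[gene] = [gene_mean if v is None else v for v in values]
    return cleaned

def transform(expr, floor=1.0):
    """Phase 2 (transformation): raise values below `floor` to `floor`,
    then take log2 to compress the dynamic range."""
    return {
        gene: [math.log2(max(v, floor)) for v in values]
        for gene, values in expr.items()
    }

def filter_genes(expr, min_variance=0.1):
    """Phase 3 (filtering): keep genes whose expression varies across samples."""
    return {
        gene: values
        for gene, values in expr.items()
        if pvariance(values) >= min_variance
    }

def preprocess(expr):
    """Run the three phases in order."""
    return filter_genes(transform(clean(expr)))

# Toy input: gene -> expression per sample, None marks a missing value.
expr = {
    "GENE_A": [8.0, 16.0, None, 32.0],   # one gap, imputed with the gene mean
    "GENE_B": [4.0, 4.0, 4.0, 4.0],      # no variation, removed by the filter
    "GENE_C": [None, None, None, 2.0],   # 75% missing, removed during cleaning
}
result = preprocess(expr)
print(sorted(result))
```

On this toy input only `GENE_A` survives, which mirrors the kind of before/after comparison the abstract describes, although real datasets would need thresholds tuned to the platform and study design.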


Data pre-processing; microarray data; gene expression data; GenePattern.



