Potential biomarker detection for liver cancer stem cell by machine learning approach
Abstract
Objectives: In this study, we aimed to identify putative biomarkers for the identification and characterization of cancer stem cells (CSCs) in liver cancer.
Methods: We applied a supervised machine learning method, XGBoost, to gene expression data from 13 GEO data series to classify samples as CSCs or non-CSCs.
Results: Across the 376 samples (129 CSC and 247 non-CSC samples), XGBoost classified the data with high performance. XGBoost feature importance scores and SHAP (SHapley Additive exPlanations) values were used to interpret the results and to assess the importance of individual genes. We confirmed that the expression levels of a 10-gene set (PTGER3, AURKB, C15orf40, IDI2, OR8D1, NACA2, SERPINB6, L1CAM, SMC1A, and RASGRF1) were predictive. These 10 genes detected CSCs robustly, with accuracy, sensitivity, and specificity of 97 %, 100 %, and 95 %, respectively.
Conclusions: We suggest that the 10-gene set may be used as a biomarker set for detecting and characterizing CSCs from gene expression data.
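The accuracy, sensitivity, and specificity reported in the Results follow from the counts of a binary confusion matrix, with CSC as the positive class. The sketch below shows how the three metrics are computed. The class name `ConfusionCounts`, the function `classification_metrics`, and the example counts are assumptions introduced here for illustration: the counts are invented and chosen only so that they reproduce the reported percentages, and they are not the study's confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with CSC as the positive class."""
    tp: int  # CSC samples predicted as CSC
    fn: int  # CSC samples predicted as non-CSC
    tn: int  # non-CSC samples predicted as non-CSC
    fp: int  # non-CSC samples predicted as CSC


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity, and specificity as fractions in [0, 1]."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        # Share of all samples that were classified correctly.
        "accuracy": (c.tp + c.tn) / total,
        # Share of true CSC samples that were detected (recall).
        "sensitivity": c.tp / (c.tp + c.fn),
        # Share of true non-CSC samples that were correctly rejected.
        "specificity": c.tn / (c.tn + c.fp),
    }


# Hypothetical counts, NOT taken from the study.
example = ConfusionCounts(tp=40, fn=0, tn=57, fp=3)
metrics = classification_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

A sensitivity of 100 % means no CSC sample was missed in the evaluated set, while a specificity of 95 % means a small share of non-CSC samples were flagged as CSCs.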