Next Generation Sequencing (NGS) technologies have revolutionized genomic research over the last decades by enabling high-throughput sequencing of genetic material, with RNA sequencing (RNAseq) as a prominent example. A significant challenge is to find innovative methods for further exploiting these large-scale datasets. The approach described in this paper uses the results of RNAseq analysis to identify disease-related biomarkers and to build a predictive model of disease outcome.
Chronic Lymphocytic Leukemia (CLL) was used as an example in the implementation of this approach. The proposed methodology comprises four steps: (1) analysis of raw RNAseq data, (2) construction of a gene correlation network, (3) identification of modules and hub genes in this network, which constitute the features for the classification algorithm, and (4) deployment of an efficient predictive model, using state-of-the-art machine learning techniques and associating the indicators with the clinical information.
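Steps (2) and (3) can be illustrated with a minimal sketch. The abstract does not name the network method, so the code below assumes a weighted co-expression network in which the adjacency between two genes is the absolute Pearson correlation raised to a soft-threshold power, and it ranks genes by total connectivity to pick hubs. The function names, the `beta` and `top_k` parameters, the gene names, and the synthetic expression values are all hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def hub_genes(expr, beta=6, top_k=3):
    """Rank genes by weighted connectivity and return the top_k as hubs.

    expr maps a gene name to its expression values across samples.
    beta is an assumed soft-threshold power, not a value from the paper.
    """
    genes = list(expr)
    connectivity = {g: 0.0 for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            weight = abs(pearson(expr[gi], expr[gj])) ** beta
            connectivity[gi] += weight
            connectivity[gj] += weight
    ranked = sorted(genes, key=lambda g: connectivity[g], reverse=True)
    return ranked[:top_k], connectivity


# Synthetic data: five genes share one latent signal (a co-expressed module),
# five further genes are independent noise.
rng = random.Random(0)
n_samples = 40
latent = [rng.gauss(0, 1) for _ in range(n_samples)]
module_genes = [f"module_gene_{i}" for i in range(5)]
expr = {g: [v + rng.gauss(0, 0.1) for v in latent] for g in module_genes}
for i in range(5):
    expr[f"noise_gene_{i}"] = [rng.gauss(0, 1) for _ in range(n_samples)]

hubs, connectivity = hub_genes(expr, beta=6, top_k=3)
print(hubs)
```

Raising the correlation to a power suppresses weak, likely spurious correlations while keeping the network weighted, so genes inside a tightly co-expressed module accumulate much higher connectivity than unrelated genes.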
In total, 25 hub genes were finally selected as features and used as input to the classifiers. The models were then validated with very satisfactory results: the best-performing model achieved 95% cross-validation accuracy and 93.75% external validation accuracy.
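How a cross-validation accuracy of this kind is computed can be sketched as follows. The abstract does not state which classifiers or how many folds were used, so this example swaps in a simple nearest-centroid classifier and 5-fold cross-validation purely for illustration. The data are synthetic, with 25 features only to mirror the number of hub genes, and the class labels `"progressive"` and `"stable"` are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Compute the per-class mean vector of the training rows."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def cross_val_accuracy(X, y, k=5, seed=0):
    """Shuffle once, split into k folds, and score each held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(centroids, X[i]) == y[i] for i in fold)
    return correct / len(X)


# Synthetic cohort: 20 samples per class, 25 features, class means 1.5 apart.
rng = random.Random(1)
X, y = [], []
for label, shift in (("stable", 0.0), ("progressive", 1.5)):
    for _ in range(20):
        X.append([rng.gauss(shift, 1.0) for _ in range(25)])
        y.append(label)

accuracy = cross_val_accuracy(X, y, k=5)
print(f"cross-validation accuracy: {accuracy:.2%}")
```

External validation follows the same scoring logic, except that the model is fitted once on the full training cohort and evaluated on a separate set of samples that played no part in feature selection or training.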
In conclusion, this exploratory data-driven approach attempts to make use of big genomic data by summarizing them in a form that is more understandable and easier for other techniques, such as machine learning, to use. The method extracts a gene set that can predict disease progression. The validation results of the proposed data-driven predictive models are very promising and constitute a significant contribution to medical research and personalized medicine.

Copyright © 2020 Elsevier Ltd. All rights reserved.
