Alzheimer’s disease (AD) is a complex and heterogeneous disorder that progressively affects neuronal cells and is the most prevalent of the neurodegenerative diseases. Next-generation sequencing (NGS) techniques are widely used to develop high-throughput screening methods for identifying biomarkers and variants, which aid early diagnosis and treatment.
The primary purpose of this study is to develop a classification model using machine learning for predicting the deleterious effect of variants with respect to AD.
We constructed a dataset of 20,401 deleterious and 37,452 control variants from the Genome-Wide Association Study (GWAS) and Genotype-Tissue Expression (GTEx) portals, respectively. Recursive feature elimination with cross-validation (RFECV), followed by forward feature selection, was used to select the important features, and a random forest classifier was used to distinguish deleterious from neutral variants.
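The selection-and-classification workflow described above can be illustrated with a minimal sketch in scikit-learn. This is not the VEPAD implementation. The feature matrix is synthetic (`make_classification`), and the feature count, tree counts, 10% hold-out split and random seeds are all assumptions chosen so that the example runs quickly. The forward feature selection step is omitted for brevity, so the sketch covers only RFECV followed by a random forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for the variant feature matrix (NOT the study's data).
# Class weights loosely mimic a control-heavy dataset; this is an assumption.
X, y = make_classification(
    n_samples=400,
    n_features=12,
    n_informative=4,
    n_redundant=2,
    weights=[0.65, 0.35],
    random_state=0,
)

# Hold out 10% of the variants as an independent test set (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Step 1: recursive feature elimination with cross-validation (RFECV),
# ranking features by random forest importance and dropping 2 per round.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=2,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 2: random forest classifier on the retained features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validated accuracy on the training portion. Because the
# features were selected on the same data, this estimate is optimistic.
cv_accuracy = cross_val_score(
    clf, X_train_sel, y_train, cv=cv, scoring="accuracy"
).mean()

# Accuracy on the held-out test set.
clf.fit(X_train_sel, y_train)
test_accuracy = clf.score(X_test_sel, y_test)

print(f"features kept: {selector.n_features_} of {X.shape[1]}")
print(f"10-fold CV accuracy: {cv_accuracy:.4f}")
print(f"test accuracy: {test_accuracy:.4f}")
```

Reporting the cross-validated and held-out accuracies separately mirrors the two figures usually quoted for such models; a gap between them is expected when feature selection and cross-validation share the same training data.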
Our method achieved an accuracy of 81.21% on 10-fold cross-validation and 70.63% on a test set of 5,785 variants. On the same test set, CADD and FATHMM achieved accuracies in the range of 54% to 62%.
Our model is freely available as the Variant Effect Predictor for Alzheimer’s Disease (VEPAD) at http://web.iitm.ac.in/bioinfo2/vepad/. VEPAD can be used to predict the effect of new variants associated with AD.
Copyright © 2020 Elsevier Ltd. All rights reserved.