The application of machine learning (ML) techniques to digitized images of biopsied cells for breast cancer diagnosis is an active area of research. We hypothesized that reducing noise in the data would increase classification accuracy. To test this hypothesis, we first compared several classification techniques in their ability to discriminate between malignant and benign breast tumors using the Wisconsin Breast Cancer Data Set, and subsequently evaluated the effect of noise-reduction techniques on model accuracy. We applied two noise-reduction techniques based on Principal Component Analysis – dimensionality reduction and outlier removal – to a comprehensive list of ML algorithms with different learning paradigms, including Decision Trees (fine, medium, coarse), dimensionality reduction techniques (Linear Discriminant Analysis, Quadratic Discriminant Analysis, Partial Least Squares-Discriminant Analysis), Logistic Regression, Bayesian techniques (Gaussian Naive Bayes, Kernel Naive Bayes), Support Vector Machines (Linear, Quadratic, Cubic, Gaussian), instance-based techniques (fine, medium, coarse, cosine, cubic, and weighted K-Nearest Neighbors), and Artificial Neural Networks. Results showed that noise removal through dimensionality reduction is most effective when using a cross-validated number of principal components, and that accuracies surpassing 99% across all ML models were obtained when both noise-reduction techniques were applied sequentially. Even though such high accuracy has been demonstrated in a few instances for specific algorithms, this study is the first published report demonstrating that a single technique can be applied to a wide range of ML models to achieve high accuracies. We show that dimensionality reduction and outlier analysis can be used as effective approaches to improve discrimination accuracies.
In addition, dimensionality reduction through a cross-validated number of principal components can provide an effective framework for reducing noise in the data prior to applying an ML algorithm.
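The two noise-reduction steps described above can be illustrated with a minimal Python sketch using scikit-learn. It is not the authors' implementation, and several choices in it are assumptions made only for illustration: the data are the diagnostic variant of the Wisconsin data bundled with scikit-learn (`load_breast_cancer`), which may differ from the variant used in the study; Logistic Regression stands in for the full list of classifiers; 5-fold stratified cross-validation selects the number of principal components; and outliers are flagged with Hotelling's T² and the Q residual against an empirical 95% quantile, a common PCA-based rule that the abstract does not specify. The helper names `select_n_components` and `pca_outlier_mask` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def select_n_components(X, y, candidates=range(1, 16), seed=0):
    """Pick the number of principal components by cross-validated accuracy.

    Assumption: Logistic Regression is only a stand-in classifier here.
    """
    pipe = Pipeline([
        ("scale", StandardScaler()),          # PCA is scale-sensitive
        ("pca", PCA()),                       # n_components set by the grid
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        pipe,
        {"pca__n_components": list(candidates)},
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X, y)
    return search


def pca_outlier_mask(X, n_components, quantile=0.95):
    """Return a boolean mask that is True for samples to keep.

    Assumption: a sample is an outlier if its Hotelling T^2 (distance inside
    the PCA subspace) or its Q residual (squared reconstruction error)
    exceeds the empirical `quantile` of that statistic.
    """
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Xs)
    scores = pca.transform(Xs)
    t2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)
    residual = Xs - pca.inverse_transform(scores)
    q = np.sum(residual ** 2, axis=1)
    return (t2 <= np.quantile(t2, quantile)) & (q <= np.quantile(q, quantile))


def run_sketch():
    """Apply both steps in sequence: choose k, drop outliers, re-evaluate."""
    X, y = load_breast_cancer(return_X_y=True)
    before = select_n_components(X, y)
    k = before.best_params_["pca__n_components"]
    keep = pca_outlier_mask(X, k)
    after = select_n_components(X[keep], y[keep], candidates=[k])
    return k, before.best_score_, after.best_score_, int((~keep).sum())
```

Calling `run_sketch()` returns the selected number of components, the cross-validated accuracy before and after outlier removal, and the number of samples removed. Because the outlier mask is computed on the full data set before cross-validation, the second accuracy is optimistic; a stricter protocol would fit the outlier rule inside each training fold.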
Copyright © 2021 Elsevier Ltd. All rights reserved.