Class Prediction of High-Dimensional Data with Class Imbalance: Breast Cancer Gene Expression Data

Authors

  • Gideon Nyatuga Nyakundi Department of Mathematics, University of Nairobi, Nairobi, Kenya
  • John Ndiritu Department of Mathematics, University of Nairobi, Nairobi, Kenya
  • Joseph Mwaniki Ivivi Department of Mathematics, University of Nairobi, Nairobi, Kenya
  • Timothy Kamanu Department of Mathematics, University of Nairobi, Nairobi, Kenya

DOI:

https://doi.org/10.31695/IJASRE.2024.11.3

Keywords:

Breast Cancer, Gene Expression Profiling, High-Dimensional Data, Class Imbalance, Machine Learning, Cancer Prediction Models

Abstract

Breast cancer remains a leading cause of mortality among women worldwide, with early and accurate diagnosis being critical for effective treatment. Gene expression profiling has emerged as a powerful tool for understanding the molecular mechanisms of cancer and for developing predictive models. However, the high dimensionality and class imbalance inherent in gene expression data pose significant challenges for developing robust predictive models.

This study aims to develop and evaluate a predictive model for classifying breast cancer subtypes using high-dimensional gene expression data, addressing the challenges of class imbalance. The objective is to improve the accuracy and reliability of breast cancer subtype prediction to facilitate better diagnostic and treatment strategies.

A dataset of breast cancer gene expression profiles was used, comprising expression levels for a large number of genes across multiple samples. To address class imbalance, resampling techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) and Random Under-sampling were employed. The machine learning algorithms evaluated were Support Vector Machines (SVM), Random Forests, and Neural Networks. The algorithms were trained and evaluated using cross-validation to identify the most effective model, and their performance was assessed using accuracy, precision, recall, and F1-score. The results indicate that SMOTE in combination with SVM provided the most balanced and accurate predictions, with an F1-score significantly higher than that of models trained without resampling. The Random Forest algorithm also showed promising results, particularly in handling the high dimensionality of the data.
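The abstract does not describe the implementation, so the following is only a minimal sketch, in plain Python, of the two ideas named above: the interpolation step behind SMOTE, and the precision, recall, and F1-score computed for one class. The function names, the toy data, and the default of k = 5 neighbours are illustrative assumptions and are not taken from the study; in practice an established library implementation would normally be used instead.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (illustrative sketch).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority samples, ordered by Euclidean distance to base.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): three minority samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
synthetic = smote(minority, n_new=4, k=2)

# Toy labels (hypothetical): 1 marks the minority class.
scores = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

When resampling is combined with cross-validation, a common precaution is to oversample only the training portion of each fold, so that synthetic points derived from held-out samples do not inflate the reported scores. Whether the study followed this procedure is not stated in the abstract.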

The study demonstrates that addressing class imbalance through advanced resampling techniques can significantly enhance the predictive accuracy of models trained on high-dimensional gene expression data. The findings underscore the potential of machine learning models, particularly SVMs and Random Forests, in improving breast cancer subtype classification.

How to Cite

Nyakundi, G. N., Ndiritu, J., Mwaniki Ivivi, J., & Kamanu, T. (2024). Class Prediction of High-Dimensional Data with Class Imbalance: Breast Cancer Gene Expression Data. International Journal of Advances in Scientific Research and Engineering (IJASRE), ISSN:2454-8006, DOI: 10.31695/IJASRE, 10(11), 28–46. https://doi.org/10.31695/IJASRE.2024.11.3