An Evaluation of CNN and Vision Transformer Models for Mars Surface Image Classification  
Author Kehan Gao

Co-Authors Sarah Tasneem; Taghi M. Khoshgoftaar
Abstract This paper examines the effectiveness of three Convolutional Neural Network (CNN) architectures (InceptionNet, DenseNet, and EfficientNet) and a Vision Transformer (ViT) model in classifying Mars surface images, with a particular focus on their performance under varying degrees of class imbalance. Using NASA’s HiRISE imagery, we evaluate model robustness under two scenarios: one with severe imbalance across six terrain categories and another with moderate imbalance across four. Whereas CNNs capture spatial hierarchies through convolution, ViTs treat images as sequences of patches and leverage self-attention, offering a contrasting approach to handling imbalanced planetary datasets. Experimental results demonstrate that all models perform well under moderate imbalance; however, classification performance declines clearly as imbalance becomes more severe. Among the models, the ViT consistently demonstrates greater robustness than the CNN architectures, achieving higher F1-scores and accuracy, particularly in the severely imbalanced setting. These findings highlight the potential of transformer-based models in addressing the challenges posed by imbalanced datasets in image classification tasks.
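To illustrate why the abstract reports F1-scores alongside accuracy under class imbalance, the following minimal pure-Python sketch shows how a degenerate classifier that always predicts the majority class can score high accuracy while its macro F1 collapses. The six class counts here are hypothetical, chosen only to mimic a severely imbalanced split; they are not HiRISE statistics.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (pure-Python sketch)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical severely imbalanced 6-class split (90/4/2/2/1/1):
# a classifier that always predicts class 0 is right 90% of the time.
y_true = [0] * 90 + [1] * 4 + [2] * 2 + [3] * 2 + [4] * 1 + [5] * 1
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                       # 0.9
print(round(macro_f1(y_true, y_pred, range(6)), 3))   # 0.158
```

Macro-averaging weights every terrain class equally, so the five ignored minority classes drag the score down, which is why F1 is the more informative metric in the severely imbalanced scenario.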

 

Keywords Convolutional Neural Networks (CNNs), InceptionNet, DenseNet, EfficientNet, Vision Transformer (ViT), Image Classification, Class Imbalance
Article #: RQD2025-167

Proceedings of the 30th ISSAT International Conference on Reliability & Quality in Design
August 6-8, 2025