An Evaluation of CNN and Vision Transformer Models for Mars Surface Image Classification  
Author Kehan Gao

Co-Authors Sarah Tasneem; Taghi M. Khoshgoftaar
Abstract This paper examines the effectiveness of three Convolutional Neural Network (CNN) architectures (InceptionNet, DenseNet, and EfficientNet) and a Vision Transformer (ViT) model in classifying Mars surface images, with a particular focus on their performance under varying degrees of class imbalance. Using NASA’s HiRISE imagery, we evaluate model robustness under two scenarios: one with severe imbalance across six terrain categories and another with moderate imbalance across four. Whereas CNNs capture spatial hierarchies through convolution, ViTs treat images as sequences of patches and leverage self-attention, offering a contrasting approach to handling imbalanced planetary datasets. Experimental results demonstrate that all models perform well under moderate imbalance; however, classification performance declines clearly as imbalance becomes more severe. Among the models, the ViT consistently demonstrates greater robustness than the CNN architectures, achieving higher F1-scores and accuracy, particularly in the severely imbalanced setting. These findings highlight the potential of transformer-based models in addressing the challenges posed by imbalanced datasets in image classification tasks.
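To illustrate why the abstract reports F1-scores alongside accuracy under class imbalance, the following minimal pure-Python sketch shows how a degenerate classifier that always predicts the majority class can score high accuracy while its macro F1 collapses. The six class counts here are hypothetical, chosen only to mimic a severely imbalanced split; they are not HiRISE statistics.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (pure-Python sketch)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical severely imbalanced 6-class split (90/4/2/2/1/1):
# a classifier that always predicts class 0 is right 90% of the time.
y_true = [0] * 90 + [1] * 4 + [2] * 2 + [3] * 2 + [4] * 1 + [5] * 1
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                       # 0.9
print(round(macro_f1(y_true, y_pred, range(6)), 3))   # 0.158
```

Macro-averaging weights every terrain class equally, so the five ignored minority classes drag the score down, which is why F1 is the more informative metric in the severely imbalanced scenario.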

 

Keywords Convolutional Neural Networks (CNNs), InceptionNet, DenseNet, EfficientNet, Vision Transformer (ViT), Image Classification, Class Imbalance
Article #: RQD2025-167

Proceedings of the 30th ISSAT International Conference on Reliability & Quality in Design
August 6-8, 2025