ViTFuser is an attention-based deep learning architecture designed to improve decision-making in autonomous vehicles. It addresses limitations in prior state-of-the-art models (like TransFuser) that suffer from loss of global context during feature fusion. ViTFuser introduces multi-stage Vision Transformers (ViTs) and a Feature Pyramid Network (FPN) to better understand the scene and reduce traffic violations.
ViTFuser outperforms comparable existing models on CARLA simulation benchmarks, achieving higher Driving Scores (DS) and fewer traffic infractions.
ViTFuser was evaluated on the CARLA 0.9.10 simulator, using data recorded at 2 FPS across 3,500 driving routes in varying towns and weather conditions.
The goal is to predict waypoints accurately while minimizing infractions during navigation.
The model consists of two main components:
These tasks improve interpretability and overall navigation performance.
| Model | Parameters |
|---|---|
| TransFuser | 168 million |
| Swin Transformer | 688 million |
| ViTFuser | 55 million |
ViTFuser has ~67% fewer parameters than TransFuser and ~92% fewer than Swin Transformer, making it much more efficient while maintaining strong performance.
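The reduction figures follow directly from the parameter counts in the table. The short Python sketch below recomputes them; the helper name `percent_reduction` and the `PARAMS_M` dictionary are illustrative and not part of the ViTFuser code base.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Return how much smaller `new` is than `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline * 100.0


# Parameter counts in millions, copied from the table above.
PARAMS_M = {"TransFuser": 168, "Swin Transformer": 688, "ViTFuser": 55}

# Compare ViTFuser against every other model in the table.
reductions = {
    name: percent_reduction(count, PARAMS_M["ViTFuser"])
    for name, count in PARAMS_M.items()
    if name != "ViTFuser"
}

for name, value in reductions.items():
    print(f"ViTFuser vs {name}: {value:.1f}% fewer parameters")
# ViTFuser vs TransFuser: 67.3% fewer parameters
# ViTFuser vs Swin Transformer: 92.0% fewer parameters
```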
ViTFuser was evaluated on the Longest6 and Town05 benchmarks in CARLA.
| Model | DS | RC | IS |
|---|---|---|---|
| TransFuser | 43.48 | 77.72 | 0.60 |
| ViTFuser | 51.96 | 80.70 | 0.65 |
| ViTFuser + FPN | 55.15 | 81.43 | 0.69 |
| Model | DS (Short) | RC (Short) | DS (Long) | RC (Long) |
|---|---|---|---|---|
| TransFuser | 87.48 | 92.78 | 67.85 | 91.57 |
| ViTFuser | 90.04 | 94.79 | 74.82 | 92.40 |
| ViTFuser + FPN | 91.07 | 94.67 | 74.95 | 94.90 |
DS - Driving Score, RC - Route Completion, IS - Infraction Score
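As a rough guide to how these three metrics relate, the CARLA Leaderboard computes a per-route driving score as route completion multiplied by the infraction score, then averages over routes. The sketch below illustrates that aggregation in Python. The `RouteResult` type, the function names, and the three route values are hypothetical and are not taken from the ViTFuser evaluation.

```python
from statistics import mean
from typing import Iterable, NamedTuple


class RouteResult(NamedTuple):
    route_completion: float  # percent of the route completed, 0..100
    infraction_score: float  # multiplicative penalty, 0..1 (1.0 = no infractions)


def driving_score(route: RouteResult) -> float:
    """Per-route driving score: completion scaled by the infraction penalty."""
    return route.route_completion * route.infraction_score


def summarize(routes: Iterable[RouteResult]) -> dict:
    """Average each metric over all routes."""
    routes = list(routes)
    return {
        "DS": mean(driving_score(r) for r in routes),
        "RC": mean(r.route_completion for r in routes),
        "IS": mean(r.infraction_score for r in routes),
    }


# Hypothetical results for three routes, made up for illustration only.
routes = [
    RouteResult(100.0, 1.0),  # finished cleanly
    RouteResult(100.0, 0.5),  # finished, but with infractions
    RouteResult(40.0, 0.6),   # stopped early, with infractions
]

summary = summarize(routes)
print({name: round(value, 2) for name, value in summary.items()})
```

Because DS is averaged per route, it generally differs from the product of the averaged RC and IS. In this toy example the mean DS is 58.0, while mean RC times mean IS is 56.0, which is why the DS column in the tables cannot be reproduced by multiplying the RC and IS columns.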