ViTFuser: Advancements in Global Context Understanding for Autonomous Vehicles

Overview

ViTFuser is an attention-based deep learning architecture designed to improve decision-making in autonomous vehicles. It addresses a limitation of prior state-of-the-art models such as TransFuser, which lose global context during feature fusion. ViTFuser introduces multi-stage Vision Transformers (ViTs) and a Feature Pyramid Network (FPN) to better understand the scene and reduce traffic violations.

ViTFuser outperforms comparable existing models on CARLA simulation benchmarks, achieving higher Driving Scores (DS) and fewer traffic infractions.

Dataset & Task

ViTFuser was evaluated on the CARLA 0.9.10 simulator, using data recorded at 2 FPS across 3,500 driving routes in varying towns and weather conditions.

Input sensors:
- RGB images from 3 front-facing cameras (stitched into a wide-view image)
- LiDAR point cloud converted to a 2D Bird’s-Eye-View (BEV) grid

Output: 4 future waypoints for the ego vehicle

The goal is to predict waypoints accurately while minimizing infractions during navigation.
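To make the LiDAR input concrete, here is a minimal sketch of binning a point cloud into a 2D BEV grid. The grid extent (32 m ahead, 16 m to each side), the 256 x 256 resolution, the count clipping, and the function name `lidar_to_bev` are illustrative assumptions, not values taken from ViTFuser.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 32.0), y_range=(-16.0, 16.0), grid=(256, 256)):
    """Bin LiDAR points of shape (N, 3) holding x, y, z into a top-down count grid."""
    points = np.asarray(points, dtype=np.float32)
    # Count points per cell; height (z) is ignored in this simple sketch.
    # Points outside the given range are dropped by histogram2d.
    bev, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=grid, range=(x_range, y_range),
    )
    # Clip counts so very dense cells do not dominate, then scale to [0, 1].
    max_count = 5.0  # assumed cap, chosen only for illustration
    return np.clip(bev, 0.0, max_count) / max_count

rng = np.random.default_rng(0)
cloud = rng.uniform([-5, -20, -2], [40, 20, 2], size=(1000, 3))  # synthetic points
bev = lidar_to_bev(cloud)
print(bev.shape)  # (256, 256)
```

The resulting grid can be treated like a single-channel image, which is what lets an image-style encoder process the LiDAR branch.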

Model Architecture

The model consists of two main components:

1. Perception Module (Encoder)

- Processes RGB and LiDAR inputs using CNNs + multi-resolution Vision Transformers (ViT)
- RGB and LiDAR branches run in parallel, each extracting hierarchical features
- Cross-modal attention enables global context fusion
- Uses Feature Pyramid Network (FPN) for multi-scale feature extraction, boosting object detection accuracy
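The idea of attention-based fusion across modalities can be sketched as follows. This is a simplified illustration, not ViTFuser's actual layer: it assumes a single attention head, random weights, and fusion by concatenating RGB and LiDAR tokens into one sequence so that every token can attend to both modalities. The name `fuse_tokens` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(rgb_tokens, lidar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the joint RGB + LiDAR token sequence."""
    tokens = np.concatenate([rgb_tokens, lidar_tokens], axis=0)  # (N_rgb + N_lidar, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # each token scores every token
    attn = softmax(scores, axis=-1)          # each row sums to 1
    fused = tokens + attn @ v                # residual connection
    n_rgb = rgb_tokens.shape[0]
    return fused[:n_rgb], fused[n_rgb:], attn

rng = np.random.default_rng(0)
d = 16
rgb = rng.normal(size=(8, d))    # e.g. 8 image patch tokens
lidar = rng.normal(size=(6, d))  # e.g. 6 BEV patch tokens
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
rgb_out, lidar_out, attn = fuse_tokens(rgb, lidar, w_q, w_k, w_v)
print(rgb_out.shape, lidar_out.shape, attn.shape)  # (8, 16) (6, 16) (14, 14)
```

Because the attention matrix spans all 14 tokens, an image token can draw on any BEV token and vice versa, which is what "global context" refers to here.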

2. Decision Module (Decoder)

- Predicts vehicle waypoints using a GRU-based network
- Handles auxiliary tasks such as:
  - Semantic segmentation
  - HD map generation
  - Depth estimation
  - Object detection (bounding boxes)

These tasks improve interpretability and overall navigation performance.
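A GRU-based waypoint decoder can be sketched as an autoregressive loop that predicts one offset per step. The sketch below is an assumption-laden illustration, not ViTFuser's decoder: the hidden size, the random weights, the use of the previous waypoint as the only GRU input, and the names `GRUCell` and `decode_waypoints` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Standard GRU cell with reset (r), update (z), and candidate (n) gates."""

    def __init__(self, input_size, hidden_size, rng):
        s = hidden_size ** -0.5
        self.w = rng.uniform(-s, s, size=(3, hidden_size, input_size))   # input weights
        self.u = rng.uniform(-s, s, size=(3, hidden_size, hidden_size))  # hidden weights
        self.b = np.zeros((3, hidden_size))

    def __call__(self, x, h):
        r = sigmoid(self.w[0] @ x + self.u[0] @ h + self.b[0])
        z = sigmoid(self.w[1] @ x + self.u[1] @ h + self.b[1])
        n = np.tanh(self.w[2] @ x + r * (self.u[2] @ h) + self.b[2])
        return (1.0 - z) * n + z * h  # blend the candidate with the old state

def decode_waypoints(features, cell, w_out, num_waypoints=4):
    """Unroll the GRU, adding a predicted (dx, dy) offset at each step."""
    h = features
    wp = np.zeros(2)  # the ego vehicle sits at the origin of its own frame
    waypoints = []
    for _ in range(num_waypoints):
        h = cell(wp, h)
        wp = wp + w_out @ h  # new array each step, so earlier entries are not mutated
        waypoints.append(wp)
    return np.stack(waypoints)

rng = np.random.default_rng(0)
hidden = 64
cell = GRUCell(input_size=2, hidden_size=hidden, rng=rng)
w_out = rng.uniform(-0.1, 0.1, size=(2, hidden))
features = rng.normal(size=hidden)  # stand-in for the fused encoder output
waypoints = decode_waypoints(features, cell, w_out)
print(waypoints.shape)  # (4, 2)
```

Predicting offsets rather than absolute positions keeps each step's output small and makes the four waypoints form a connected path from the ego position.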

Parameters Comparison

Model	Parameters
TransFuser	168 million
Swin Transformer	688 million
ViTFuser	55 million

ViTFuser reduces parameters by ~67% compared to TransFuser and ~92% compared to Swin Transformer, making it much more efficient while maintaining strong performance.
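The reduction percentages can be checked directly from the parameter counts in the table above. The helper name `reduction_percent` is only for illustration.

```python
# Parameter counts in millions, copied from the comparison table.
params_millions = {"TransFuser": 168, "Swin Transformer": 688, "ViTFuser": 55}

def reduction_percent(baseline, model):
    """Percentage by which `model` is smaller than `baseline`."""
    return 100.0 * (1.0 - model / baseline)

for name in ("TransFuser", "Swin Transformer"):
    pct = reduction_percent(params_millions[name], params_millions["ViTFuser"])
    print(f"ViTFuser vs {name}: {pct:.1f}% fewer parameters")
# ViTFuser vs TransFuser: 67.3% fewer parameters
# ViTFuser vs Swin Transformer: 92.0% fewer parameters
```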

Results

ViTFuser was benchmarked on the Longest6 and Town05 benchmarks in CARLA.

Longest6 Benchmark

Model	DS	RC	IS
TransFuser	43.48	77.72	0.60
ViTFuser	51.96	80.70	0.65
ViTFuser + FPN	55.15	81.43	0.69

Town05 Benchmark

Model	DS (Short)	RC (Short)	DS (Long)	RC (Long)
TransFuser	87.48	92.78	67.85	91.57
ViTFuser	90.04	94.79	74.82	92.40
ViTFuser + FPN	91.07	94.67	74.95	94.90

DS - Driving Score, RC - Route Completion, IS - Infraction Score. Higher is better for all three.
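How the three metrics relate can be sketched as follows. The sketch assumes the CARLA Leaderboard 1.0 definition: a route's infraction score is the product of a penalty coefficient per infraction, and its driving score is route completion times that product. The penalty values are those documented for the leaderboard. Whether the benchmarks above apply exactly this set is an assumption, and the three routes are made-up examples.

```python
# Penalty coefficients as documented for CARLA Leaderboard 1.0 (assumed to apply here).
PENALTY = {
    "collision_pedestrian": 0.50,
    "collision_vehicle": 0.60,
    "collision_static": 0.65,
    "red_light": 0.70,
    "stop_sign": 0.80,
}

def infraction_score(infractions):
    """Multiply one penalty per infraction; 1.0 means a clean route."""
    score = 1.0
    for kind, count in infractions.items():
        score *= PENALTY[kind] ** count
    return score

def driving_score(route_completion, infractions):
    """Per-route driving score: completion percentage scaled by the infraction score."""
    return route_completion * infraction_score(infractions)

# Hypothetical routes: (route completion in %, infraction counts).
routes = [
    (100.0, {}),
    (80.0, {"red_light": 1}),
    (50.0, {"collision_vehicle": 1, "stop_sign": 1}),
]
scores = [driving_score(rc, inf) for rc, inf in routes]
mean_ds = sum(scores) / len(scores)
print(f"{mean_ds:.2f}")  # 60.00
```

Under this definition the benchmark DS is an average of per-route products, so it generally differs from the product of the averaged RC and IS columns. That is consistent with the tables: for TransFuser on Longest6, 77.72 x 0.60 is about 46.6, while the reported DS is 43.48.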
