
ViTFuser: Advancements in Global Context Understanding for Autonomous Vehicles

Overview

ViTFuser is an attention-based deep learning architecture designed to improve decision-making in autonomous vehicles. It addresses a limitation of prior state-of-the-art models such as TransFuser, which lose global context during feature fusion. ViTFuser adds multi-stage Vision Transformers (ViTs) and a Feature Pyramid Network (FPN) to capture the full scene and reduce traffic violations.

ViTFuser outperforms TransFuser on the CARLA Longest6 and Town05 benchmarks, achieving higher Driving Scores (DS) and fewer traffic infractions.

[Figure: High-level overview]


Dataset & Task

ViTFuser was evaluated in the CARLA 0.9.10 simulator, using data recorded at 2 FPS across 3,500 driving routes that span different towns and weather conditions.

  • Input sensors:
    • RGB images from 3 front-facing cameras (stitched into a wide-view image)
    • LiDAR point cloud converted to 2D Bird’s-Eye-View (BEV) grid
  • Output: 4 future waypoints for the ego vehicle
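
The LiDAR-to-BEV conversion listed above can be pictured as a 2D histogram: each point falls into a grid cell according to its ground-plane position, and the cell stores how many points landed there. The following is a minimal sketch in plain Python. The function name, the grid extent (16 m to each side, 32 m ahead), and the 0.5 m cell size are assumptions chosen for illustration, not values taken from the project.

```python
def lidar_to_bev(points, x_range=(-16.0, 16.0), y_range=(0.0, 32.0), cell_size=0.5):
    """Rasterize (x, y, z) LiDAR points into a 2D count grid (rows = y, cols = x).

    The ranges and cell size are illustrative assumptions.
    """
    n_cols = int((x_range[1] - x_range[0]) / cell_size)
    n_rows = int((y_range[1] - y_range[0]) / cell_size)
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        # Skip points that fall outside the assumed field of view.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_range[0]) / cell_size), n_cols - 1)
        row = min(int((y - y_range[0]) / cell_size), n_rows - 1)
        grid[row][col] += 1
    return grid


# Example: two nearby points share a cell, one point is out of range.
example_points = [(0.0, 0.0, 0.0), (0.1, 0.2, 1.0), (100.0, 5.0, 0.0)]
bev = lidar_to_bev(example_points)
```

The resulting grid can be treated like a single-channel image, which is what lets the LiDAR branch reuse image-style feature extractors.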

[Figure: Input modality]

The goal is to predict waypoints accurately while minimizing infractions during navigation.


Model Architecture

The model consists of two main components:

1. Perception Module (Encoder)

  • Processes RGB and LiDAR inputs using CNNs and multi-resolution Vision Transformers (ViTs)
  • RGB and LiDAR branches run in parallel, each extracting hierarchical features
  • Cross-modal attention enables global context fusion
  • Uses Feature Pyramid Network (FPN) for multi-scale feature extraction, boosting object detection accuracy
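
To make the cross-modal attention step concrete, here is a minimal sketch of scaled dot-product attention in which tokens from one modality query tokens from the other. The project description does not specify the attention layout, so this sketch assumes image tokens act as queries and LiDAR tokens as keys and values, and it leaves out learned projections, multiple heads, and positional embeddings. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends over all key/value tokens of the other modality."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


# Assumed toy inputs: 1 image token (query), 2 LiDAR tokens (keys/values).
image_tokens = [[1.0, 0.0]]
lidar_keys = [[1.0, 0.0], [1.0, 0.0]]
lidar_values = [[2.0], [4.0]]
fused_tokens = cross_attention(image_tokens, lidar_keys, lidar_values)
```

Because every query sees every token of the other modality, a feature at one image location can be influenced by LiDAR evidence anywhere in the scene, which is the sense in which attention provides global context.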

[Figure: Encoder]

2. Decision Module (Decoder)

  • Predicts vehicle waypoints using a GRU-based network
  • Handles auxiliary tasks such as:
    • Semantic segmentation
    • HD map generation
    • Depth estimation
    • Object detection (bounding boxes)

These tasks improve interpretability and overall navigation performance.
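
The waypoint prediction step can be sketched as a small GRU rolled out for four steps. This is an illustration only: the class name, the hidden size, the untrained random weights, the omission of bias terms, and the choice to predict per-step offsets that are accumulated into waypoints are all assumptions, not details taken from the project.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyGRUDecoder:
    """Hypothetical GRU waypoint decoder with untrained random weights (no biases)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        in_size = 2  # the previous waypoint (x, y)
        # One weight matrix per gate, applied to the concatenated [input, hidden].
        self.w_update = rand_matrix(hidden_size, in_size + hidden_size, rng)
        self.w_reset = rand_matrix(hidden_size, in_size + hidden_size, rng)
        self.w_cand = rand_matrix(hidden_size, in_size + hidden_size, rng)
        self.w_out = rand_matrix(2, hidden_size, rng)  # hidden state -> (dx, dy)

    def step(self, x, h):
        z = [sigmoid(v) for v in matvec(self.w_update, x + h)]  # update gate
        r = [sigmoid(v) for v in matvec(self.w_reset, x + h)]   # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in matvec(self.w_cand, x + gated)]
        # Blend the old state and the candidate state with the update gate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def predict(self, features, num_waypoints=4):
        h = list(features)     # fused scene features initialize the hidden state
        waypoint = [0.0, 0.0]  # start at the ego vehicle's position
        waypoints = []
        for _ in range(num_waypoints):
            h = self.step(waypoint, h)
            dx, dy = matvec(self.w_out, h)
            waypoint = [waypoint[0] + dx, waypoint[1] + dy]
            waypoints.append(tuple(waypoint))
        return waypoints


decoder = TinyGRUDecoder(hidden_size=8)
predicted = decoder.predict([0.5] * 8)  # returns 4 (x, y) tuples
```

Feeding each predicted waypoint back in as the next input makes the rollout autoregressive, so later waypoints depend on earlier ones.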


Parameters Comparison

Model | Parameters
TransFuser | 168 million
Swin Transformer | 688 million
ViTFuser | 55 million

ViTFuser has ~67% fewer parameters than TransFuser and ~92% fewer than Swin Transformer, making it much more efficient while maintaining strong performance.
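
The reduction percentages follow directly from the parameter counts in the table. A short check in Python (the helper name is hypothetical):

```python
# Parameter counts in millions, as listed in the table above.
params_millions = {"TransFuser": 168, "Swin Transformer": 688, "ViTFuser": 55}


def reduction_percent(baseline, model):
    """Percentage by which `model` is smaller than `baseline`."""
    return 100.0 * (baseline - model) / baseline


for name in ("TransFuser", "Swin Transformer"):
    pct = reduction_percent(params_millions[name], params_millions["ViTFuser"])
    print(f"ViTFuser vs {name}: {pct:.1f}% fewer parameters")
# prints:
# ViTFuser vs TransFuser: 67.3% fewer parameters
# ViTFuser vs Swin Transformer: 92.0% fewer parameters
```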


Results

ViTFuser was evaluated on the Longest6 and Town05 benchmarks in CARLA.

Longest6 Benchmark

Model | DS | RC | IS
TransFuser | 43.48 | 77.72 | 0.60
ViTFuser | 51.96 | 80.70 | 0.65
ViTFuser + FPN | 55.15 | 81.43 | 0.69

Town05 Benchmark

Model | DS (Short) | RC (Short) | DS (Long) | RC (Long)
TransFuser | 87.48 | 92.78 | 67.85 | 91.57
ViTFuser | 90.04 | 94.79 | 74.82 | 92.40
ViTFuser + FPN | 91.07 | 94.67 | 74.95 | 94.90

DS: Driving Score, RC: Route Completion, IS: Infraction Score. Higher is better for all three.
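
In the CARLA leaderboard metrics, the driving score is computed per route as route completion multiplied by the infraction score, and then averaged over routes. That is why a DS value in the tables is not simply the product of the reported RC and IS averages. The sketch below illustrates this with made-up per-route values; the numbers and function name are hypothetical and are not results from this project.

```python
def driving_score(routes):
    """Mean over routes of route completion (%) times infraction score (0..1)."""
    return sum(rc * infraction for rc, infraction in routes) / len(routes)


# Hypothetical per-route (route completion %, infraction score) pairs.
routes = [(100.0, 1.0), (100.0, 0.5), (40.0, 0.3)]

ds = driving_score(routes)                                # average of per-route products
mean_rc = sum(rc for rc, _ in routes) / len(routes)       # reported RC
mean_is = sum(score for _, score in routes) / len(routes) # reported IS

print(f"DS = {ds:.1f}, RC x IS = {mean_rc * mean_is:.1f}")
# prints: DS = 54.0, RC x IS = 48.0
```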