Elastic-Depth Pretraining (EDP)

🧪 Overview

Elastic-Depth Pretraining (EDP) is a framework that integrates adaptive depth allocation directly into auto-regressive transformer pretraining to address the inefficiency of uniform computational depth for all tokens.

🔬 Methodology

EDP allocates transformer depth per token dynamically, using a second-order residual signal (the "acceleration" of a token's hidden state across layers) to decide when further computation is unlikely to help. Easy tokens exit after a few layers; hard tokens traverse the full stack.
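The routing rule above can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the layer structure, threshold, and all names (`elastic_depth_forward`, `threshold`, `depth_used`) are assumptions made for the sketch. Each token tracks the magnitude of its per-layer update; once the change in that magnitude between consecutive layers (a second-order, acceleration-like signal) falls below a threshold, the token is frozen and skips the remaining layers.

```python
import numpy as np

def elastic_depth_forward(x, layers, threshold=1e-3):
    """Toy sketch of acceleration-gated elastic depth.

    x:       (seq_len, d_model) array of token states
    layers:  list of callables, each mapping a (d_model,) token
             state to an updated (d_model,) state
    A token exits early once the change in its update magnitude
    between consecutive layers drops below `threshold`.
    """
    seq_len = x.shape[0]
    active = np.ones(seq_len, dtype=bool)       # tokens still being refined
    depth_used = np.zeros(seq_len, dtype=int)   # layers applied per token
    prev_delta = None
    for layer in layers:
        if not active.any():
            break                               # everything has exited
        y = np.stack([layer(t) for t in x])
        delta = np.linalg.norm(y - x, axis=-1)  # first-order residual change
        if prev_delta is not None:
            accel = np.abs(delta - prev_delta)  # second-order "acceleration"
            active &= accel > threshold         # freeze converged tokens
        x = np.where(active[:, None], y, x)     # frozen tokens keep their state
        depth_used += active
        prev_delta = delta
    return x, depth_used
```

A small contractive residual map makes the early-exit behavior visible: tokens whose updates settle quickly accumulate fewer layers in `depth_used` than the layer count of the stack.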

📊 Results

EDP reports a 42% reduction in pretraining compute while maintaining perplexity comparable to a fixed-depth baseline.