Next events
Wednesday, February 4, 2026, 17:00 CET
Click here to register!
Improving Capabilities of Efficient Foundation Models beyond Language
In this talk, we discuss recent alternatives to transformers based on linear RNNs and linear attention. While these new models (Mamba, Hyena, DeltaNet, RWKV) offer improved throughput and extremely long-context processing (e.g., the entire human genome), they often suffer from poor trainability and compute/expressivity tradeoffs. After providing an introduction to these architectures, we present central results on universality and complexity from the literature (e.g., https://arxiv.org/abs/2402.19047). Building on these insights, and motivated by the need to reason beyond next-token prediction, we discuss new architectures, including Fixed-Point Mamba (spotlight at NeurIPS this year; https://arxiv.org/abs/2503.10799), that can automatically enhance expressivity through an adaptive compute mechanism.
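To give a flavor of the architectures the talk covers: the key idea behind linear attention (and its relatives) is that causal attention without a softmax can be evaluated as a recurrence with a fixed-size matrix state, giving linear time in sequence length. The sketch below is illustrative only; it shows the simplest unnormalized, ungated variant, not the specific update rules of Mamba, DeltaNet, or RWKV, which add gating, decay, or delta-rule corrections on top of this template.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """RNN-style evaluation: state S_t = S_{t-1} + v_t k_t^T, output y_t = S_t q_t.
    The state has fixed size (d_v x d_k), independent of sequence length."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))          # matrix-valued hidden state
    Y = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])  # rank-1 state update
        Y[t] = S @ Q[t]               # readout for token t
    return Y

def linear_attention_parallel(Q, K, V):
    """Equivalent attention-style form: causal mask, no softmax.
    Mathematically identical to the recurrence above."""
    A = np.tril(Q @ K.T)              # lower-triangular = causal scores
    return A @ V
```

The two forms compute the same outputs: the parallel form is used for training (full sequence at once), while the recurrent form enables O(1)-per-token inference over arbitrarily long contexts, which is the throughput and long-context advantage mentioned in the abstract.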
Speaker: Antonio Orvieto (ELLIS Institute Tübingen - MPI)
Antonio studied Control Engineering in Italy and Switzerland. He holds a PhD in Computer Science from ETH Zürich and has spent time at Google DeepMind (UK), Meta (US), MILA (CA), INRIA (FR), and HILTI (LI). He is currently a Hector Endowed Fellow and Principal Investigator (PI) at the ELLIS Institute Tübingen and an Independent Group Leader at the MPI for Intelligent Systems, where he leads the Deep Models and Optimization group. He received the ETH medal for an outstanding doctoral thesis and the Schmidt Sciences AI2050 Early Career Fellowship. In his research, Antonio strives to improve the efficiency of deep learning technologies by pioneering new architectures and training techniques grounded in theory. His work spans two main areas: understanding the intricacies of large-scale optimization dynamics, and designing innovative architectures and powerful optimizers capable of handling complex data.
