Nous Open-Sources Lighthouse Attention: 17x Speedup at 512K on a Single B200

AIMPACT News, May 16 (UTC+8): According to Beating Monitoring, Nous Research has open-sourced Lighthouse Attention, an attention mechanism for long-context pretraining. On a single B200 GPU at a sequence length of 512K, it runs approximately 17 times faster than conventional attention, and it delivers a 1.4x to 1.7x end-to-end training speedup at a length of 98K.

Conventional attention computes pairwise relationships between all tokens, so its cost grows quadratically with sequence length. Lighthouse Attention instead works coarse to fine: it quickly scans compressed summaries of the text at several levels, scores them, and selects the key segments to assemble a much shorter sequence, which is then passed directly to the existing efficient FlashAttention operator. Because the selection logic is fully decoupled from the attention kernel, developers do not have to write low-level kernel code or add extra training objectives.

Earlier acceleration schemes built on similar ideas often had a side effect: a model that gets used to skipping over text tends to lose its ability to read token by token. To avoid this, the team runs most of training in the accelerated mode and switches back to conventional full attention only briefly at the end so the model can adapt. In tests with a 530-million-parameter model trained on 50 billion tokens, this recipe substantially shortened training time while matching or even surpassing a baseline trained entirely with conventional attention. (Source: BlockBeats)
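The coarse-to-fine idea described above can be illustrated with a small NumPy sketch. This is not Nous Research's implementation: it uses a single pooling level instead of several, ignores causal masking, scores blocks with a mean-pooled query, and lets a plain softmax attention stand in for FlashAttention. All function names and the `block_size` and `top_k` parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Plain softmax attention; a stand-in for a fused kernel
    # such as FlashAttention (assumption: single head, no mask).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def select_blocks(q, k, block_size, top_k):
    # Coarse pass: mean-pool keys into one summary per block,
    # score each summary against the mean query, keep the best blocks.
    n, d = k.shape
    n_blocks = n // block_size
    summaries = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = summaries @ q.mean(axis=0)
    # Sort the kept block ids so the gathered tokens stay in original order.
    return np.sort(np.argsort(block_scores)[-top_k:])

def coarse_to_fine_attention(q, k, v, block_size=16, top_k=4):
    # Fine pass: gather the tokens of the selected blocks into a short
    # sequence and run ordinary dense attention on it.
    keep = select_blocks(q, k, block_size, top_k)
    token_idx = (keep[:, None] * block_size + np.arange(block_size)).ravel()
    return dense_attention(q, k[token_idx], v[token_idx]), token_idx

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))
k = rng.standard_normal((256, 32))
v = rng.standard_normal((256, 32))
out, idx = coarse_to_fine_attention(q, k, v)
print(out.shape, idx.shape)
```

In this toy setup the attention step sees 64 of the 256 keys, so the score matrix is four times smaller. Setting `top_k` to the total number of blocks makes the function equivalent to dense attention over every token, which is a convenient sanity check that the selection step is cleanly separated from the attention step.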
Comment
ACalmnessWithAHintOfPomelo
· 10m ago
Concatenating multi-level summaries into a short text and then handing it off to FlashAttention is a very clever engineering trick.
GateUser-8ca669fd
· 30m ago
The long-context race has entered the engineering optimization stage, which is more interesting than stacking parameters.
TidalShell
· 33m ago
It's a bit surprising that the traditional baseline has been surpassed; I thought acceleration would always come at the expense of quality.
GateUser-318a7dc8
· 38m ago
If it can be verified at 530 million parameters, small teams can keep up too.
GateUser-d6fb8ff1
· 43m ago
Release the code and let's test how many K of context my 4090 can handle.
Glass-HeartMarketMaker
· 43m ago
Dropping the extra training objective is critical; otherwise, even with the code open-sourced, no one would be able to train it properly.
OrderbookOtter
· 44m ago
Lighthouse is a good name: first light up the key points, then look closely.
TokenTinkerTao
· 44m ago
512K on a single B200 card: the cost for individuals to run long-document RAG should come down in the future.