AIMPACT News, May 16 (UTC+8), according to Beating Monitoring, Nous Research open-sourced the long-context pretraining mechanism Lighthouse Attention. When processing 512K length text on a single B200 GPU, this scheme is approximately 17 times faster than traditional mechanisms, and achieves a 1.4 to 1.7 times speedup in end-to-end training at 98K length. Traditional attention mechanisms require calculating pairwise relationships between all words, and as the text lengthens, the computational cost increases quadratically. Lighthouse Attention adopts a coarse-to-fine approach. It quickly scans the compressed summaries of the text at different levels, scores and selects core segments to form short texts, which are then directly processed by the existing efficient operator FlashAttention. Because the filtering logic is completely separated from the core, developers are directly freed from the hassle of writing low-level code and do not need to add extra training objectives. Past acceleration schemes using similar ideas often had side effects; models accustomed to jumping over text tend to lose their original word-by-word reading ability. To avoid this trap, the development team first runs most of the process in acceleration mode, only briefly switching back to traditional full attention calculation at the end of training for adaptation. In practical tests with a 530 million parameter model trained on 50 billion tokens, this training method not only significantly shortened the training time but also achieved performance that fully matched or even surpassed the baseline trained entirely with traditional methods. (Source: BlockBeats)

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

6 Likes

Reward
6
8
2
Share

Comment

Add a comment

ACalmnessWithAHintOfPomelo

· 10m ago

Multi-level summary concatenation of short texts then dropping FlashAttention, this engineering trick is very clever.

View OriginalReply0

GateUser-8ca669fd

· 30m ago

Long-context competitions have entered the engineering optimization stage, which is more interesting than parameter stacking.

View OriginalReply0

TidalShell

· 33m ago

It's a bit surprising that the traditional baseline has been surpassed; I thought acceleration would always come at the expense of quality.

View OriginalReply0

GateUser-318a7dc8

· 38m ago

With 5.3 billion parameters, verification is possible, and small teams can also keep up.

View OriginalReply0

GateUser-d6fb8ff1

· 43m ago

Let's release the code and test how much K my 4090 can handle.

View OriginalReply0

Glass-HeartMarketMaker

· 43m ago

Removing the additional training target is too critical; otherwise, even if it's open source, no one will be able to train it properly.

View OriginalReply0

OrderbookOtter

· 44m ago

Lighthouse is a good name, first illuminate the key points and then look closely

View OriginalReply0

TokenTinkerTao

· 44m ago

B200 single card 512K, in the future individual runs of long documents RAG costs will decrease.

View OriginalReply0

Trending Topics
View More
#
WinGoldBarsWithGrowthPoints
1.25M Popularity
#
WTICrudeFallsBelow90Dollars
1.57M Popularity
#
StockTradingChallengeUpTo17000U
209.49K Popularity
#
USIranNegotiationGame
9.36M Popularity
#
TradeCFDWinGold
3.22M Popularity

Pinned

Sitemap

Nous开源Lighthouse Attention：单B200跑512K提速17倍

Trending Topics

WinGoldBarsWithGrowthPoints

WTICrudeFallsBelow90Dollars

StockTradingChallengeUpTo17000U

USIranNegotiationGame

TradeCFDWinGold

Pinned