Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Since pre-trained models such as GPT and BERT became popular, one obvious trend is that models keep getting bigger, because a larger model combined with more thorough pre-training usually climbs the leaderboards more effectively. Some example model sizes:
- The largest VGG has about 130 million parameters.
- The largest GPT-2 has 1.5 billion parameters.
- The largest T5 has 11 billion parameters.
tags:
paper explore
Abstract
For the case of neural network weight matrices:
- We propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline.
- We show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. (and dropping momentum)
- We propose scaling the parameter updates based on the scale of the parameters themselves.
A Brief Review of Optimizers
SGD (Stochastic Gradient Descent)
SGD is the most basic gradient descent method: compute the gradient of the parameters and update the parameters along the negative gradient direction.
\(W \leftarrow W - \eta \frac{\partial L}{\partial W}\)
, where $W$ is the weight parameter, $L$ is the loss function, and $\eta$ is the learning rate.
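A minimal NumPy sketch of this update rule (the function and variable names are mine, not from any library):

```python
import numpy as np

def sgd_step(w: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Vanilla SGD: move against the gradient, scaled by the learning rate."""
    return w - lr * grad
```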
Momentum
This optimizer simulates the notion of momentum: along dimensions where the gradient keeps pointing in the same direction the updates speed up, and when the direction changes the updates slow down.
\(V \leftarrow \beta V - \eta \frac{\partial L}{\partial W}, \quad W \leftarrow W + V\)
, where $\beta$ is the momentum coefficient.
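A sketch of the momentum update in the common formulation above (my own code and names):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity; consistent gradient directions speed the
    update up, while sign flips partially cancel and slow it down."""
    v = beta * v - lr * grad   # velocity update
    w = w + v                  # parameter update
    return w, v
```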
AdaGrad
The learning rate $\eta$ in the previous optimizers is a fixed value, whereas AdaGrad adjusts $\eta$ according to the accumulated gradients.
\(n \leftarrow n + \left(\frac{\partial L}{\partial W}\right)^2, \quad W \leftarrow W - \frac{\eta}{\sqrt{n} + \epsilon} \frac{\partial L}{\partial W}\)
, where $\epsilon$ is a very small constant that prevents division by zero.
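A sketch of the AdaGrad update described above (my own code and names):

```python
import numpy as np

def adagrad_step(w, n, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients in n and shrink the effective
    per-parameter learning rate as n grows."""
    n = n + grad ** 2
    w = w - lr * grad / (np.sqrt(n) + eps)
    return w, n
```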
RMSProp
In the later stages of AdaGrad, when the gradients are large, $n$ becomes large and constrains the learning rate; however, the accumulated sum of squared gradients in the denominator keeps growing, which drives the effective step size toward 0 and effectively ends training. To prevent this, RMSProp was developed later: it adds a parameter $\alpha$ to the learning-rate adjustment so that new and old gradients can be balanced.
\(n \leftarrow \alpha n + (1 - \alpha) \left(\frac{\partial L}{\partial W}\right)^2, \quad W \leftarrow W - \frac{\eta}{\sqrt{n} + \epsilon} \frac{\partial L}{\partial W}\)
, where $\alpha$ is the decay parameter.
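A sketch of the RMSProp update described above (my own code and names):

```python
import numpy as np

def rmsprop_step(w, n, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """RMSProp: replace AdaGrad's ever-growing sum with an exponential moving
    average, so the effective learning rate no longer decays to zero."""
    n = alpha * n + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(n) + eps)
    return w, n
```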
Adam
Adam keeps Momentum's use of past gradients to adjust the update direction and RMSProp's use of past squared gradients to adjust the learning rate, and on top of that applies bias correction, so the learning rate at each step stays within a definite range and the parameter updates are more stable (via exponential moving averages).
Most of the computation and memory obviously goes to the gradients; beyond that, the main memory consumers are $m$ and $v$: we have to maintain these two sets of variables to track exponential moving averages of the first two moments of the gradient (i.e., $m$ and $v$), which are then used to compute the parameter update. Each of these two buffers is as large as the trainable parameters themselves, so for models with many parameters the two cached variables also consume a significant amount of GPU memory.
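A sketch of a single Adam step, mainly to make the memory point concrete: `m` and `v` are two extra buffers with the same shape as the parameters (code and names are mine):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first-moment EMA (m) for the direction, second-moment EMA (v) for
    the per-parameter learning rate, plus bias correction.
    Note: m and v each have the same shape as w, which is exactly the extra
    memory cost Adafactor tries to remove."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```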
Adafactor
- No Momentum:
CV models often need "SGD + momentum" to reach their best results; adaptive-learning-rate optimizers usually cannot train them to the best performance. For NLP models the situation is somewhat the opposite: adaptive learning rates matter more, and it is rare to hear of an NLP model tuned with pure SGD.
- Factored Second Moment Estimation
Low-rank approximation of matrices:
\(AB \approx C\), where $A$ is $m \times 1$, $B$ is $1 \times n$, and $C$ is $m \times n$.
- Minimizing the Frobenius norm is known to give the optimal projection onto the space of rank-k matrices (via truncated SVD), but this is not the cost function used here. (x)
- Another popular cost function in the literature is the generalized Kullback-Leibler divergence, also known as the I-divergence; this is the one Adafactor adopts, since its rank-1 minimizer has a closed form built only from the row and column sums (see the sketch after this list). (o)
- Increasing Decay Parameter: In Adam, the second-moment decay rate $\beta_2$ is a constant. The proposed alternative is a decay rate that increases over time, \(\hat{\beta}_{2t} = 1 - t^{-c}\), with $0 < c < 1$ ($c = 0.8$ is the Adafactor default).
- Update Clipping: Reddi et al. (2018) discuss non-convergence issues when using a fast decay of the second-moment estimator (low $\beta_2$). We observe the root-mean-square over all parameters $x$ of the unscaled update \(U_t = G_t / \sqrt{\hat{V}_t}\), i.e. \(\mathrm{RMS}(U_t) = \sqrt{\mathrm{Mean}_{x \in X}\big(g_{xt}^2 / \hat{v}_{xt}\big)}\), and define the clipped unscaled update $\hat{U}_{t}$ as \(\hat{U}_t = \frac{U_t}{\max\left(1, \mathrm{RMS}(U_t)/d\right)}\).
- Relative Step Size: the parameter updates are multiplied by the scale of the parameters themselves, \(\alpha_t = \max(\epsilon_2, \mathrm{RMS}(X_{t-1}))\,\rho_t\), where $\rho_t$ is the relative step size schedule. (These pieces are combined in the full-step sketch under the Adafactor Algorithm heading below.)
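As a concrete illustration of the factored second-moment idea, here is a minimal NumPy sketch (my own code and variable names, not the paper's) of the rank-1 reconstruction from row and column sums:

```python
import numpy as np

def factored_second_moment(v_sq: np.ndarray):
    """Given the (conceptual) full m x n matrix of squared-gradient statistics,
    keep only its row sums r (m,) and column sums c (n,); the rank-1
    reconstruction outer(r, c) / r.sum() is the I-divergence minimizer used by
    Adafactor. Memory for the accumulator drops from m*n to m+n values."""
    r = v_sq.sum(axis=1)              # per-row sums, shape (m,)
    c = v_sq.sum(axis=0)              # per-column sums, shape (n,)
    v_hat = np.outer(r, c) / r.sum()  # rank-1 estimate of v_sq
    return r, c, v_hat
```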
Adafactor Algorithm
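A simplified single-step sketch of how the pieces above fit together, based on my reading of the paper's algorithm for matrix-shaped parameters (function and variable names are mine; the defaults $d = 1$, $c = 0.8$, $\epsilon_1 = 10^{-30}$, $\epsilon_2 = 10^{-3}$, and $\rho_t = \min(10^{-2}, 1/\sqrt{t})$ follow the paper's stated values; momentum and weight decay are omitted):

```python
import numpy as np

def adafactor_step(x, grad, r, c, t, d=1.0, c_decay=0.8, eps1=1e-30, eps2=1e-3):
    """One simplified Adafactor update for a matrix parameter x (m x n).
    r, c are the factored second-moment accumulators (row / column vectors).
    A sketch of the paper's algorithm, not a drop-in implementation."""
    # increasing decay rate: beta2_t = 1 - t^(-c), c = 0.8 by default
    beta2_t = 1.0 - t ** (-c_decay)

    # factored second-moment update: keep only row and column statistics
    g_sq = grad ** 2 + eps1
    r = beta2_t * r + (1 - beta2_t) * g_sq.sum(axis=1)
    c = beta2_t * c + (1 - beta2_t) * g_sq.sum(axis=0)
    v_hat = np.outer(r, c) / r.sum()          # rank-1 estimate of the second moment

    # unscaled update and update clipping
    u = grad / np.sqrt(v_hat)
    rms_u = np.sqrt(np.mean(u ** 2))
    u = u / max(1.0, rms_u / d)               # clip when RMS(U) exceeds threshold d

    # relative step size: scale by the RMS of the parameters themselves
    rho_t = min(1e-2, 1.0 / np.sqrt(t))       # relative step size schedule
    alpha_t = max(eps2, np.sqrt(np.mean(x ** 2))) * rho_t

    x = x - alpha_t * u                       # no first-moment (momentum) buffer
    return x, r, c
```

For example, starting from `r = np.zeros(m)` and `c = np.zeros(n)` and calling `adafactor_step` once per training step with `t = 1, 2, ...` applies the increasing decay rate, factored second moment, update clipping, and relative step size in sequence.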
Result
BLEU: bilingual evaluation understudy. Visual results of various optimizers: https://github.com/jettify/pytorch-optimizer
ref :
- **Adafactor**: https://arxiv.org/pdf/1804.04235.pdf
- **Adam**: https://arxiv.org/pdf/1412.6980.pdf
- https://kexue.fm/archives/7302
- https://juejin.cn/post/6844903600943005710