2021-01-10 / Last Modified 2021-06-28
Neural Networks
This article tries to collect what I know about the common neuron activation functions I have come across. In the formulas below, z is always the weighted input, the value of a neuron before activation:
$$ \begin{align}
z & = w^{T}a + b \\
& = \sum_{j}w_{j}a_{j} + b \end{align} $$
Here a is the activation of the previous layer's neurons, serving as the input to this layer (or simply the network's input), and w is the corresponding weight. All the quantities in the formula above are vectors. The activation function is denoted by \( \sigma \).
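As a minimal sketch (my own numpy example; the concrete numbers are made up for illustration), the weighted input can be computed like this:

```python
import numpy as np

# activations a from the previous layer, weights w into one neuron, and its bias b
a = np.array([0.5, -1.2, 0.3])
w = np.array([0.4, 0.1, -0.7])
b = 0.2

z = w @ a + b              # z = w^T a + b
z_sum = np.sum(w * a) + b  # the same value written as a sum over j
print(z, z_sum)            # both give the same value, about 0.07
```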
Also called the logistic function, the sigmoid is the neuron activation function that textbooks usually start from.
The sigmoid function is:
$$ \sigma (z) = \text{sigmoid}(z) = \frac{1}{1+e^{-z}} $$
Differentiating with respect to z gives:
$$ \sigma'(z) = \sigma(z) \cdot (1-\sigma(z)) $$
In theory, a neural network using sigmoid activations can learn functions of any kind; in practice, however, sigmoid has some shortcomings. For example, its derivative peaks at only 0.25, which easily leads to vanishing gradients, and the network does not learn quickly. That is why researchers went looking for other activation functions, hoping to solve the unstable-gradient problem, speed up learning, and improve the network's generalization.
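A minimal numpy sketch (my own, following the formulas above) that implements sigmoid and its derivative and confirms that the derivative peaks at 0.25:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # sigma'(z) = sigma(z) * (1 - sigma(z))

z = np.linspace(-10, 10, 1001)
print(sigmoid_prime(z).max())      # 0.25, reached at z = 0
```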
Also called the hyperbolic tangent function, tanh is defined as:
$$ \sigma(z) = \text{tanh}(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}} $$
Differentiating tanh with respect to z gives:
$$ \sigma'(z) = 1 - \sigma(z)^2 $$
The tanh activation function has a curve similar to the sigmoid's; what differs is the derivative (the maximum of tanh's derivative is 1) and the output range: sigmoid outputs values between 0 and 1, while tanh outputs values between -1 and 1, so data normalization is handled differently when using tanh.
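A similar sketch for tanh (my own example, using numpy's built-in np.tanh), checking the output range and the maximum of the derivative:

```python
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2           # sigma'(z) = 1 - tanh(z)^2

z = np.linspace(-10, 10, 1001)
print(np.tanh(z).min(), np.tanh(z).max())  # close to -1 and 1, the output range of tanh
print(tanh_prime(z).max())                 # 1.0, the maximum of the derivative, at z = 0
```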
Here is a passage on choosing between sigmoid and tanh:
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* (*See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010)..) Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights \(w^{l+1}_{jk}\) input to the jth neuron in the (l+1)th layer. The rules for backpropagation tell us that the associated gradient will be \(a^l_{k}\delta^{l+1}_j\). Because the activations are positive the sign of this gradient will be the same as the sign of \(\delta^{l+1}_j\). What this means is that if \(\delta^{l+1}_j\) is positive then all the weights \(w^{l+1}_{jk}\) will decrease during gradient descent, while if \(\delta^{l+1}_j\) is negative then all the weights \(w^{l+1}_{jk}\) will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as tanh, which allows both positive and negative activations. Indeed, because tanh is symmetric about zero, tanh(−z)=−tanh(z), we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
This argument is not backed by a solid theoretical proof; I also plan to run my own experiments comparing tanh and sigmoid.
One more identity worth adding:
$$ \text{sigmoid}(z) = \frac{1+\text{tanh}(\frac{z}{2})}{2} $$
In other words, sigmoid is just a rescaled tanh.
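A quick numerical check of this identity (my own sketch):

```python
import numpy as np

z = np.linspace(-10, 10, 1001)
sigmoid = 1.0 / (1.0 + np.exp(-z))
rescaled_tanh = (1.0 + np.tanh(z / 2.0)) / 2.0
print(np.allclose(sigmoid, rescaled_tanh))  # True
```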
The Rectified Linear Unit (ReLU) is defined as:
$$ \sigma(z) = \text{relu}(z) = \text{max}(0, z) $$
Its derivative:
$$ \sigma'(z) = \begin{cases}
1, & \text{if } z > 0 \\
0, & \text{if } z \le 0 \end{cases} $$
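A minimal numpy sketch of ReLU and its derivative (my own example values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)   # 1 where z > 0, 0 where z <= 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_prime(z))  # [0. 0. 0. 1. 1.]
```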
ReLU is quite different from sigmoid and tanh, because it can "die".
When a ReLU neuron's weighted input is less than or equal to 0, its output is 0 and the gradients of its w and b are 0, so it no longer learns (no more updates to w and b); within the network it has stopped learning. The material I've read says it stops learning "forever", which I have some doubts about: w and b receive no updates because the weighted input is less than or equal to 0, making the ReLU derivative 0; but the neurons in the layer before this ReLU are still updating their own outputs during training, and those outputs are fed into the dead ReLU. Could there be some moment when this ReLU suddenly comes back to life (with weighted input \(z = w^{T}a + b > 0\))?
The layer feeding this ReLU is also made of ReLUs, so its inputs are always greater than or equal to 0. If the output is 0, presumably many of the weights are negative; w and b receive no updates, and the signs of the inputs will not change either, so the neuron may never come back to life...
Practice has shown that ReLU performs well in MLPs and CNNs: both computation and learning are faster, and it does not matter that a portion of the neurons die; this even matches the sparse firing of neurons observed in biological neuroscience!
I also plotted an extra "lu(z)" curve (I don't know whether this name exists): it simply sets a equal to z, with a derivative that is constantly 1.
A detail worth noting: the maximum of the sigmoid's derivative is 0.25, which is actually quite small, and this is one factor that causes vanishing gradients.
On the concept of neuron saturation: when a sigmoid's output is near 0 or 1, or a tanh's output is near -1 or 1, the derivative is very small, close to 0, and the neuron is said to be saturated. In that state, unless the corresponding learning rate is fairly large, learning becomes very slow and ineffective. ReLU does not saturate; instead, when its input is less than or equal to 0, the gradient vanishes completely and learning stops altogether.
Softmax neurons are usually used in the last layer of a network. In classification problems they output a probability distribution, computed as:
$$a_j^L=\frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$
Since the softmax output is a probability distribution, every output value necessarily lies between 0 and 1!
Of course the output does not have to be a probability distribution; for classification we usually just take the class with the largest output. Still, when tanh is used in the hidden layers, its outputs lie between -1 and 1; pairing it with softmax in the output layer converts them into outputs between 0 and 1, which makes computing the cost easier. Pairing the output layer with sigmoid achieves the same effect. (Different output-layer neurons need to be paired with different cost functions to eliminate the learning slowdown in the output layer.)
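A minimal softmax sketch in numpy (my own; it subtracts the maximum before exponentiating, a common trick for numerical stability that the formula above leaves out):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max does not change the result,
    return e / np.sum(e)        # it only avoids overflow in exp

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)             # roughly [0.659 0.242 0.099], every value between 0 and 1
print(a.sum())       # 1.0 up to floating point, a probability distribution
print(np.argmax(a))  # 0, the class we would pick
```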
Now let's derive the derivative of softmax with respect to z. In the calculations below the superscript L is omitted, since we know everything refers to the output layer. First, the case \(i = j\):
$$\begin{align}
\frac{\partial a_j}{\partial z_j} &= \frac{e^{z_j} (\sum_k e^{z_k}) - {e^{z_j}}^2}{(\sum_k e^{z_k})^2} \\
&= \frac{e^{z_j}}{\sum_k e^{z_k}} - \left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right)^2 \\
&= a_j - (a_j)^2 \\
&= a_j(1-a_j)
\end{align}$$
When \(i \neq j\):
$$\begin{align}
\frac{\partial a_j}{\partial z_i} &= \frac{- e^{z_j} e^{z_i}}{(\sum_k e^{z_k})^2} \\
&= - a_i a_j
\end{align}$$
Pay special attention: the \(a_k\) together form a probability distribution, which means that each \(a_i\) is not an independent variable; when one changes, all the others change with it! Be careful when working through such derivations.
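To double-check the derivation above, here is a small numpy sketch (my own verification) comparing the analytic Jacobian, built from \(a_j(1-a_j)\) on the diagonal and \(-a_i a_j\) off the diagonal, with a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)

# analytic Jacobian: a_j*(1 - a_j) on the diagonal, -a_i*a_j off the diagonal
jacobian = np.diag(a) - np.outer(a, a)

# numerical Jacobian via central differences, column i holds da/dz_i
eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jacobian, numeric, atol=1e-6))  # True
```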
-- EOF --
Permalink: https://www.pynote.net/archives/3109
One comment on "Neuron Activation Functions":
tanh saturates more deeply, so it is even slower to climb out of a stalled learning state...