
# Neuron Activation Functions

\begin{align} z & = w^{T}a + b \\ & = \sum_{j}w_{j}a_{j} + b \end{align}

Here a is the vector of activations from the previous layer, serving as the input to this layer (or simply the raw input for the first layer), and w is the corresponding weight vector. In the formula above, w and a are vectors, while z and b are scalars for a single neuron. The activation function is denoted by $$\sigma$$.
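As a minimal NumPy sketch (the values of a, w, and b below are arbitrary example numbers), the weighted input of a single neuron can be computed like this:

```python
import numpy as np

# Weighted input of one neuron: z = w·a + b
a = np.array([0.5, 0.1, 0.9])   # activations from the previous layer (example values)
w = np.array([0.2, -0.4, 0.7])  # this neuron's weights (example values)
b = 0.1                         # bias (example value)

z = np.dot(w, a) + b            # z = sum_j w_j * a_j + b
print(z)                        # the scalar fed into the activation sigma(z)
```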

## sigmoid

The sigmoid function is defined as:

$$\sigma (z) = \text{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

$$\sigma'(z) = \sigma(z) \cdot (1-\sigma(z))$$
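A minimal NumPy sketch of the sigmoid and its derivative (the names sigmoid and sigmoid_prime are just illustrative):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))        # values lie in (0, 1)
print(sigmoid_prime(z))  # maximum is 0.25, reached at z = 0
```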

## tanh (pronounced "tanch")

$$\sigma(z) = \text{tanh}(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}}$$

The derivative of tanh with respect to z is:

$$\sigma'(z) = 1 - \sigma(z)^2$$

The tanh activation has a curve similar to the sigmoid's, but its derivative differs: tanh's derivative has a maximum of 1 (versus 0.25 for the sigmoid). The outputs differ as well: the sigmoid's output lies in (0, 1), while tanh's output lies in (-1, 1), so data normalization is handled somewhat differently when using tanh.
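A matching NumPy sketch for tanh (using np.tanh; tanh_prime is an illustrative name), showing the (-1, 1) output range and the derivative's maximum of 1:

```python
import numpy as np

def tanh_prime(z):
    # tanh'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 2.0])
print(np.tanh(z))     # values lie in (-1, 1)
print(tanh_prime(z))  # maximum is 1, reached at z = 0
```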

Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* (*See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).) Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights $$w^{l+1}_{jk}$$ input to the jth neuron in the (l+1)th layer. The rules for backpropagation tell us that the associated gradient will be $$a^l_{k}\delta^{l+1}_j$$. Because the activations are positive the sign of this gradient will be the same as the sign of $$\delta^{l+1}_j$$. What this means is that if $$\delta^{l+1}_j$$ is positive then all the weights $$w^{l+1}_{jk}$$ will decrease during gradient descent, while if $$\delta^{l+1}_j$$ is negative then all the weights $$w^{l+1}_{jk}$$ will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as tanh, which allows both positive and negative activations. Indeed, because tanh is symmetric about zero, tanh(−z)=−tanh(z), we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.

How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.

The two activation functions are in fact closely related; the sigmoid is just a rescaled, shifted tanh:

$$\text{sigmoid}(z) = \frac{1+\text{tanh}(\frac{z}{2})}{2}$$
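The identity is easy to check numerically; a quick NumPy sketch:

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 11)
lhs = 1.0 / (1.0 + np.exp(-z))        # sigmoid(z)
rhs = (1.0 + np.tanh(z / 2.0)) / 2.0  # (1 + tanh(z/2)) / 2
print(np.allclose(lhs, rhs))          # True: sigmoid is a rescaled, shifted tanh
```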

## ReLU

Rectified Linear Unit, defined as:

$$\sigma(z) = \text{relu}(z) = \text{max}(0, z)$$

$$\sigma'(z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{if } z \le 0 \end{cases}$$

Compared with sigmoid and tanh, ReLU behaves quite differently, because ReLU neurons can "die".

If the inputs to a ReLU neuron come from a previous ReLU layer, they are all greater than or equal to 0. When the neuron's output is 0 (imagine many of its weights being negative, so that z ≤ 0), the gradient through it is 0, so w and b receive no updates; and since the sign of the inputs never changes, the neuron may never come back to life...
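A minimal NumPy sketch of the dying-ReLU scenario described above (relu and relu_prime are illustrative names; the inputs and weights are made-up values chosen so that z ≤ 0):

```python
import numpy as np

def relu(z):
    # relu(z) = max(0, z)
    return np.maximum(0.0, z)

def relu_prime(z):
    # 1 for z > 0, 0 for z <= 0
    return (z > 0).astype(float)

# Inputs coming from a previous ReLU layer are all >= 0.
a = np.array([0.3, 0.0, 1.2])
w = np.array([-0.5, -0.8, -0.2])  # mostly negative weights
b = -0.1
z = np.dot(w, a) + b              # z <= 0 here
print(relu(z), relu_prime(z))     # output 0 and gradient 0: w and b get no update
```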

## softmax

Softmax neurons are commonly used in the last layer of a network. In classification problems they output a probability distribution, computed as:

$$a_j^L=\frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$

Since the softmax outputs form a probability distribution, each output necessarily lies between 0 and 1, and together they sum to 1.

Differentiating a_j with respect to its own weighted input z_j gives:

\begin{align} \frac{\partial a_j}{\partial z_j} &= \frac{e^{z_j} (\sum_k e^{z_k}) - {e^{z_j}}^2}{(\sum_k e^{z_k})^2} \\ &= \frac{e^{z_j}}{\sum_k e^{z_k}} - \left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right)^2 \\ &= a_j - (a_j)^2 \\ &= a_j(1-a_j) \end{align}

and differentiating with respect to a different weighted input z_i (for i ≠ j) gives:

\begin{align} \frac{\partial a_j}{\partial z_i} &= \frac{- e^{z_j} e^{z_i}}{(\sum_k e^{z_k})^2} \\ &= - a_i a_j \end{align}
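A minimal NumPy sketch of softmax and the two derivative formulas above (softmax_jacobian is an illustrative name; subtracting max(z) before exponentiating is the usual numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    # a_j = e^{z_j} / sum_k e^{z_k}, shifted by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[j, i] = da_j/dz_i = a_j*(1 - a_j) if i == j, else -a_i*a_j
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 2.0, 0.5])
a = softmax(z)
print(a, a.sum())           # entries in (0, 1), summing to 1
print(softmax_jacobian(z))  # diagonal: a_j(1 - a_j); off-diagonal: -a_i a_j
```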

-- EOF --

### Comments

1 comment on "Neuron Activation Functions"

• 麦新杰

tanh saturates more deeply, so it climbs out of a stuck learning state more slowly...
