# Introduction to LSTM

*[Translation] Understanding LSTMs; the accompanying diagrams may be especially helpful.*

This article is based mainly on Understanding LSTM Networks (colah's blog). It includes a translation together with some of my own modest understanding.

*Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural turing machines.”
arXiv preprint arXiv:1410.5401 (2014).*


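- Understanding LSTM Networks
- Recurrent Neural Networks
- The Problem of Long-Term Dependencies
- LSTM Networks
- The Core Idea Behind LSTMs
- Step-by-Step LSTM Walk Through
- Variants on Long Short Term Memory
- Conclusion
- Acknowledgments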

## Why did LSTMs come about?

*Bourlard, Hervé, and Yves Kamp. “Auto-association by multilayer
perceptrons and singular value decomposition.” Biological cybernetics
59.4-5 (1988): 291-294.*


## Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it's attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this; it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn't the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models, such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015), also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

## References

- Understanding LSTM Networks, colah's blog

**Convolutional neural networks (CNN or deep convolutional neural
networks, DCNN)** are quite different from most other networks. They are
primarily used for image processing but can also be used for other types
of input such as audio. A typical use case for CNNs is where you feed
the network images and the network classifies the data, e.g. it
outputs “cat” if you give it a cat picture and “dog” when you give it a
dog picture. CNNs tend to start with an input “scanner” which is not
intended to parse all the training data at once. For example, to input
an image of 200 x 200 pixels, you wouldn’t want a layer with 40 000
nodes. Rather, you create a scanning input layer of say 20 x 20 which
you feed the first 20 x 20 pixels of the image (usually starting in the
upper left corner). Once you have passed that input (and possibly used it for
training) you feed it the next 20 x 20 pixels: you move the scanner one
pixel to the right. Note that one wouldn’t move the input 20 pixels (or
whatever scanner width) over, you’re not dissecting the image into
blocks of 20 x 20, but rather you’re crawling over it. This input data
is then fed through convolutional layers instead of normal layers, where
not all nodes are connected to all nodes. Each node only concerns itself
with close neighbouring cells (how close depends on the implementation,
but usually not more than a few). These convolutional layers also tend
to shrink as they become deeper, mostly by easily divisible factors of
the input (so 20 would probably go to a layer of 10 followed by a layer
of 5). Powers of two are very commonly used here, as they can be divided
cleanly and completely by definition: 32, 16, 8, 4, 2, 1. Besides these
convolutional layers, they also often feature pooling layers. Pooling is
a way to filter out details: a commonly found pooling technique is max
pooling, where we take, say, 2 x 2 pixels and pass on the pixel with the
largest amount of red. To apply CNNs to audio, you basically feed the
input audio waves and inch over the length of the clip, segment by
segment. Real world implementations of CNNs often glue an FFNN to the
end to further process the data, which allows for highly non-linear
abstractions. These networks are called DCNNs but the names and
abbreviations between these two are often used interchangeably.
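The scanning and pooling ideas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `window_positions` and `max_pool_2x2` are hypothetical, the image is treated as a single channel (so "most red" becomes "largest value"), and no learned convolution weights are involved.

```python
def window_positions(image_size, window_size, stride=1):
    """Number of positions a square window can take along one axis."""
    return (image_size - window_size) // stride + 1


def max_pool_2x2(grid):
    """Downsample a 2D list by keeping the maximum of each 2 x 2 block."""
    pooled = []
    for r in range(0, len(grid) - 1, 2):
        row = []
        for c in range(0, len(grid[r]) - 1, 2):
            block = (grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled


# A 20 x 20 scanner crawling one pixel at a time over a 200 x 200 image.
per_axis = window_positions(200, 20, stride=1)   # 181 positions per axis
total_positions = per_axis * per_axis

# Max pooling halves each side of a small single-channel example.
example = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(example)   # [[4, 2], [2, 8]]
```

Moving the scanner by the full window width instead (`stride=20`) would give only 10 positions per axis, which is the block-wise dissection the paragraph says is not what happens.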

## Recurrent Neural Networks

Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren't all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:
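As a rough sketch of the unrolled loop, the same cell can be applied once per input, with each application handing its output to the next. Everything concrete here is an assumption made for illustration: the cell $A$ is taken to be a scalar tanh unit, and the names `rnn_step`, `run_unrolled`, `w_x`, `w_h`, and `b` are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One application of the cell A: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_unrolled(xs, h0=0.0, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same cell to every input, passing h along the chain."""
    h = h0
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)   # same weights at every step
        hs.append(h)
    return hs


# Only the first input is non-zero, yet later outputs still depend on it
# because h carries information forward from step to step.
outputs = run_unrolled([1.0, 0.0, 0.0])
```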

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural architecture of neural network to use for such data.

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It's these LSTMs that this essay will explore.

#### Variants on Long Short Term Memory

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

It should be noted that while most of the abbreviations used are generally accepted, not all of them are. RNNs sometimes refer to recursive neural networks, but most of the time they refer to recurrent neural networks. That’s not the end of it though, in many places you’ll find RNN used as placeholder for any recurrent architecture, including LSTMs, GRUs and even the bidirectional variants. AEs suffer from a similar problem from time to time, where VAEs and DAEs and the like are called simply AEs. Many abbreviations also vary in the amount of “N”s to add at the end, because you could call it a convolutional neural network but also simply a convolutional network (resulting in CNN or CN).

*This article is translated from Christopher Olah's blog post “Understanding LSTM Networks.”*

A few key questions:

## The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the *sky*,” we don't need any further context; it's pretty obvious the next word is going to be *sky*. In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent *French*.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It's entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don't have this problem!
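The text does not spell out what makes learning hard for standard RNNs. One commonly cited reason is the vanishing-gradient effect: a learning signal that is multiplied by a similar factor at every time step shrinks (or grows) exponentially with the gap. The toy sketch below only illustrates that arithmetic; the function name `signal_scale` and the factor 0.9 are assumptions, and it is not a reproduction of the cited analyses.

```python
def signal_scale(per_step_factor, gap):
    """Magnitude of a signal multiplied by the same factor once per time step."""
    return abs(per_step_factor) ** gap


# With a per-step factor slightly below 1, a short gap keeps most of the
# signal, while a long gap leaves almost nothing to learn from.
short_gap = signal_scale(0.9, 5)
long_gap = signal_scale(0.9, 100)
```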

## What do LSTMs do?

# Understanding LSTM Networks

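## What is an LSTM?

The following definition is taken from Baidu Baike.

LSTM (Long Short-Term Memory) is a kind of recurrent neural network over time, suited to processing and predicting important events that are separated by relatively long intervals and delays in a time series.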


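### Applications

LSTM-based systems can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, controlling chatbots, predicting disease, click-through rates and stock prices, composing music, and more.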

*Kulkarni, Tejas D., et al. “Deep convolutional inverse graphics
network.” Advances in Neural Information Processing Systems. 2015.*


## LSTM Networks

Long Short Term Memory networks, usually just called “LSTMs,” are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work.[1] They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

Don't worry about the details of what's going on. We'll walk through the LSTM diagram step by step later. For now, let's just try to get comfortable with the notation we'll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

## How do LSTMs work?

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997).

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

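So a standard RNN model has a chain of neural network modules, and each module can be very simple, for example containing only a single tanh layer (figure not available).

The module can also be considerably more complex, as it is in an LSTM (figure not available).

Next we walk through each part of the LSTM diagram. Before doing so, we first need to understand what the shapes and symbols in the diagram mean.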

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

## Acknowledgments

I'm grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I'm very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I'm also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei and Jacob Steinhardt. I'm especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

1. In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.

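### Current status

In 2015, Google substantially improved speech recognition on Android phones and other devices using LSTMs trained with CTC, a method published by Jürgen Schmidhuber's lab in 2006. Baidu also adopted CTC. Apple's iPhone uses LSTMs in QuickType and Siri. Microsoft uses LSTMs not only for speech recognition but also for generating virtual conversational avatars, writing program code, and more. Amazon Alexa talks with you at home through a bidirectional LSTM, and Google uses LSTMs even more widely: they generate image captions, automatically reply to emails, are part of the new smart assistant Allo, and have markedly improved the quality of Google Translate (since 2016). At present, a large share of the computing resources in Google's data centers is devoted to LSTM tasks.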

*Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.”
Machine learning 20.3 (1995): 273-297.*


## Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.”

Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
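Written as an equation, the forget gate described above can be sketched as follows. The weight matrix $W_f$ and bias $b_f$ are names assumed here for illustration (the text does not name them), $\sigma$ is the sigmoid function, and $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors:

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$

Each entry of $f_t$ lies between $0$ and $1$ and lines up with one entry of the cell state $C_{t-1}$.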

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.
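The two parts of this step can be sketched as a pair of equations. As before, the parameter names $W_i$, $b_i$, $W_C$, and $b_C$ are assumptions introduced for illustration, and $[h_{t-1}, x_t]$ is the concatenation of the previous output and the current input:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
$$

Here $i_t$ is the output of the input gate layer and $\tilde{C}_t$ is the vector of candidate values, with entries between $-1$ and $1$.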

It's now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.
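The update just described, $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$, is an elementwise operation. A minimal sketch with hand-picked gate values follows; the function name `update_cell_state` and the numbers are purely illustrative, and in a real LSTM the gate values would come from the sigmoid and tanh layers.

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """Elementwise C_t = f_t * C_{t-1} + i_t * C~_t for equal-length lists."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]


c_prev = [2.0, -1.0]
f_t = [1.0, 0.0]        # keep the first entry, forget the second
i_t = [0.0, 1.0]        # write nothing to the first, fully write the second
c_tilde = [0.5, 0.25]   # candidate values

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)   # [2.0, 0.25]
```

The first entry passes through unchanged, while the second is replaced by its candidate value, mirroring the "drop the old subject, add the new one" example.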

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
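Putting the four steps together, one LSTM time step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the parameter names (`W_f`, `b_f`, `W_i`, `b_i`, `W_C`, `b_C`, `W_o`, `b_o`) and the helpers `affine` and `lstm_step` are hypothetical, and weight matrices are plain lists of rows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, inputs):
    """Compute W @ inputs + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: forget gate, input gate, cell update, output gate."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # filtered cell state
    return h_t, c_t


# Tiny example: 2 hidden units, 1 input feature, all parameters zero,
# so every gate outputs sigmoid(0) = 0.5 and every candidate is tanh(0) = 0.
hidden, n_in = 2, 1
zeros = [[0.0] * (hidden + n_in) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_C", "b_o")})

h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero parameters the cell state is simply halved, which makes the example easy to check by hand; trained weights would of course give input-dependent gate values.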

#### Step-by-Step LSTM Walk Through

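The first step is to choose which information in the cell state to discard. This is done by a sigmoid layer called the “forget gate layer.” Given the inputs $h_{t-1}$ and $x_t$, the sigmoid layer outputs a value between 0 and 1 for each value in the cell state $C_{t-1}$. An output of 1 means that piece of the cell state is kept entirely, and an output of 0 means it is discarded entirely. For example, suppose we use a language model to predict the next word from all the preceding context. In such a problem, the cell state might include the gender of the current subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

[Figure: the forget gate layer (image not available)]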

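Next we decide what new information to store in the cell state. This step has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Then a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we combine the two to update the cell state.

In our language model example, we want to add the gender of the new subject to the cell state, replacing the gender of the old subject that we are discarding.

[Figure: the input gate layer and tanh layer (image not available)]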

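Now it is time to update the previous cell state $C_{t-1}$ into the new cell state $C_t$. The earlier steps already decided what to do; we only need to carry it out.

We multiply the old state $C_{t-1}$ by $f_t$, forgetting the information we decided to discard earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.

For the language model, this is where we actually drop the old subject's gender and add the new subject's information.

[Figure: updating the cell state (image not available)]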

聊起底，我们须求调控大家要出口的内容。 那些输出将依靠大家的cell state，但将是贰个过滤版本。 首先，大家运营八个sigmoid layer，它决定大家要出口的cell state的怎么部分。 然后，将cell state 通过tanh（将值推到-1和1里头卡塔尔国并将其乘以sigmoid gate的输出，以便大家只输出决定输出的风姿洒脱部分。

对于语言模型示例，由于它无独有偶见到了二个主体，由此它恐怕需求输出与动词相关的音信，防止接下来会生出什么样。
举例，它或然会输出主体是单数照旧复数，以便我们驾驭要是接下去是如何，应该将动词的方式组合到一起。这一个片段是由此sigmoid
layer完成cell state的过滤，依照过滤版本的cell
state改善输出h_{t}.

上述进度如下图所示：

模型输出

**Sparse autoencoders (SAE)** are in a way the opposite of AEs. Instead
of teaching a network to represent a bunch of information in less
“space” or nodes, we try to encode information in more space. So instead
of the network converging in the middle and then expanding back to the
input size, we blow up the middle. These types of networks can be used
to extract many small features from a dataset. If one were to train a
SAE the same way as an AE, you would in almost all cases end up with a
pretty useless identity network (as in what comes in is what comes out,
without any transformation or decomposition). To prevent this, instead
of feeding back the input, we feed back the input plus a sparsity
driver. This sparsity driver can take the form of a threshold filter,
where only a certain error is passed back and trained, the other error
will be “irrelevant” for that pass and set to zero. In a way this
resembles spiking neural networks, where not all neurons fire all the
time (and points are scored for biological plausibility).

## Variants on Long Short Term Memory

What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it's worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.
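One way to write peephole connections is to add the cell state to the inputs of each gate. The sketch below assumes the same illustrative parameter names as a standard LSTM ($W_f$, $b_f$, $W_i$, $b_i$, $W_o$, $b_o$, none of which are named in the text), with $[\cdot,\cdot,\cdot]$ denoting concatenation:

$$
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\end{aligned}
$$

In this form the forget and input gates see the old cell state $C_{t-1}$, while the output gate sees the freshly updated $C_t$.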

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.
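With coupled gates, the input gate is no longer separate: it can be taken as $1 - f_t$, giving $C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$. A small sketch with hand-picked values follows; the function name `coupled_update` and the numbers are illustrative assumptions.

```python
def coupled_update(c_prev, f_t, c_tilde):
    """Elementwise C_t = f_t * C_{t-1} + (1 - f_t) * C~_t."""
    return [f * c + (1.0 - f) * g for f, c, g in zip(f_t, c_prev, c_tilde)]


# Where f_t is 1 the old value is kept and nothing new is written;
# where f_t is 0 the old value is dropped and the candidate takes its place.
c_t = coupled_update([2.0, -1.0], [1.0, 0.0], [0.5, 0.25])   # [2.0, 0.25]
```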

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
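A common way to write the GRU is sketched below. The symbols $z_t$ (update gate), $r_t$ (a reset gate, one of the "other changes"), $\tilde{h}_t$ (candidate state), and the weight names $W_z$, $W_r$, $W$ are assumptions introduced for illustration, and bias terms are omitted for brevity:

$$
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
$$

The last line shows the merged state: a single vector $h_t$ plays the role of both cell state and hidden state, and the single gate $z_t$ decides how much of the old state to replace.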

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

#### The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM can remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

## The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
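The gate mechanism itself (a sigmoid followed by a pointwise multiplication) is small enough to sketch directly. The function name `gate` and the pre-activation values below are illustrative assumptions; in an LSTM the pre-activations would come from a learned layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(signal, gate_logits):
    """Scale each component of `signal` by a sigmoid-activated gate value."""
    return [sigmoid(g) * s for s, g in zip(signal, gate_logits)]


signal = [3.0, 3.0, 3.0]
# Very negative logit: almost nothing passes. Zero: half passes.
# Very positive logit: almost everything passes.
gated = gate(signal, [-20.0, 0.0, 20.0])
```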

### The Problem of Long-Term Dependencies[1]

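RNN models can `connect previous information to the present task,such as using previous video frames might inform the understanding of the present frame.`

How do RNNs achieve this? It depends on the situation.

Sometimes we only need to look at recent information to perform the present task. For example, consider a language model that tries to predict the next word from the previous ones. If we are trying to predict the last word of “the clouds are in the sky”, we don't need any further context; it is obvious that the next word will be sky. In this case, the RNN needs only the past n pieces of information for the current task, and n is small: `the gap between the relevant information and the place that it’s needed is small`

But there are also cases that need much more context. Suppose we try to predict the last word of a long sentence: `Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.”` The recent information, `I speak fluent French`, suggests that the next word is probably the name of a language, but to narrow it down to a specific language we need the context of France. Training the RNN then requires the past n pieces of information, with n sufficiently large: `the gap between the relevant information and the point where it is needed to become very large`

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs can handle “long-term dependencies.” In practice, however, RNNs cannot learn such problems: when the amount n of past information required is too large, RNNs are no longer suitable. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

**LSTM models can handle the problem of “long-term dependencies.”**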


### RNN

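Ordinary neural networks do not take the persistent influence of data into account. Usually, data fed in earlier affects data fed in later. To address the inability of traditional neural networks to capture or use the fact that `previous event affect the later ones`, the RNN was proposed, which adds loops to the network (figure not available).

An RNN is essentially a chain of several ordinary neural networks, each passing information to the next (figure not available).

"LSTMs", a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.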

- What is it?
- Why?
- What does it do?
- How does it work?

**Neural Turing machines (NTM)** can be understood as an abstraction of
LSTMs and an attempt to un-black-box neural networks (and give us some
insight in what is going on in there). Instead of coding a memory cell
directly into a neuron, the memory is separated. It’s an attempt to
combine the efficiency and permanency of regular digital storage and the
efficiency and expressive power of neural networks. The idea is to have
a content-addressable memory bank and a neural network that can read and
write from it. The “Turing” in Neural Turing Machines comes from them
being Turing complete: the ability to read and write and change state
based on what it reads means it can represent anything a Universal
Turing Machine can represent.

#### Outlook

Future directions for LSTMs:

- Attention: Xu, et al. (2015)
- Grid LSTMs: Kalchbrenner, et al. (2015)
- RNNs in generative models: Gregor, et al. (2015), Chung, et al. (2015), Bayer & Osendorfer (2015)

**Deconvolutional networks (DN)**, also called inverse graphics networks
(IGNs), are reversed convolutional neural networks. Imagine feeding a
network the word “cat” and training it to produce cat-like pictures, by
comparing what it generates to real pictures of cats. DNNs can be
combined with FFNNs just like regular CNNs, but this is about the point
where the line is drawn with coming up with new abbreviations. They may
be referenced as deep deconvolutional neural networks, but you could
argue that when you stick FFNNs to the back and the front of DNNs that
you have yet another architecture which deserves a new name. Note that
in most applications one wouldn’t actually feed text-like input to the
network, more likely a binary classification input vector. Think <0,
1> being cat, <1, 0> being dog and <1, 1> being cat and
dog. The pooling layers commonly found in CNNs are often replaced with
similar inverse operations, mainly interpolation and extrapolation with
biased assumptions (if a pooling layer uses max pooling, you can invent
exclusively lower new data when reversing it).

**Deep convolutional inverse graphics networks (DCIGN)** have a somewhat
misleading name, as they are actually VAEs but with CNNs and DNNs for
the respective encoders and decoders. These networks attempt to model
“features” in the encoding as probabilities, so that they can learn to
produce a picture with a cat and a dog together, having only ever seen
one of the two in separate pictures. Similarly, you could feed it a
picture of a cat with your neighbours’ annoying dog on it, and ask it to
remove the dog, without ever having done such an operation. Demos have
shown that these networks can also learn to model complex
transformations on images, such as changing the source of light or the
rotation of a 3D object. These networks tend to be trained with
back-propagation.

**Echo state networks (ESN)** are yet another different type of
(recurrent) network. This one sets itself apart from others by having
random connections between the neurons (i.e. not organised into neat
sets of layers), and they are trained differently. Instead of feeding
input and back-propagating the error, we feed the input, forward it
and update the neurons for a while, and observe the output over time.
The input and the output layers have a slightly unconventional
role as the input layer is used to prime the network and the output
layer acts as an observer of the activation patterns that unfold over
time. During training, only the connections between the observer and the
(soup of) hidden units are changed.

*Kohonen, Teuvo. “Self-organized formation of topologically correct
feature maps.” Biological cybernetics 43.1 (1982): 59-69.*


*Rosenblatt, Frank. “The perceptron: a probabilistic model for
information storage and organization in the brain.” Psychological review
65.6 (1958): 386.*


**Bidirectional recurrent neural networks, bidirectional long / short
term memory networks and bidirectional gated recurrent units (BiRNN,
BiLSTM and BiGRU respectively)** are not shown on the chart because they
look exactly the same as their unidirectional counterparts. The
difference is that these networks are not just connected to the past,
but also to the future. As an example, unidirectional LSTMs might be
trained to predict the word “fish” by being fed the letters one by one,
where the recurrent connections through time remember the last value. A
BiLSTM would also be fed the next letter in the sequence on the backward
pass, giving it access to future information. This trains the network to
fill in gaps instead of advancing information, so instead of expanding
an image on the edge, it could fill a hole in the middle of an image.

*LeCun, Yann, et al. “Gradient-based learning applied to document
recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.*


*Schuster, Mike, and Kuldip K. Paliwal. “Bidirectional recurrent neural
networks.” IEEE Transactions on Signal Processing 45.11 (1997):
2673-2681.*


One problem with drawing them as node maps: it doesn’t really show how they’re used. For example, variational autoencoders (VAE) may look just like autoencoders (AE), but the training process is actually quite different. The use-cases for trained networks differ even more, because VAEs are generators, where you insert noise to get a new sample. AEs simply map whatever they get as input to the closest training sample they “remember”. I should add that this overview in no way clarifies how each of the different node types works internally (but that’s a topic for another day).

And finally, **Kohonen networks (KN, also self organising (feature) map,
SOM, SOFM)** “complete” our zoo. KNs utilise competitive learning to
classify data without supervision. Input is presented to the network,
after which the network assesses which of its neurons most closely match
that input. These neurons are then adjusted to match the input even
better, dragging along their neighbours in the process. How much the
neighbours are moved depends on the distance of the neighbours to the
best matching units. KNs are sometimes not considered neural networks
either.
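
The update described above can be sketched in a few lines of Python. This is a toy version under my own assumptions: the neurons sit on a 1D grid, the neighbourhood is a hard cutoff (`radius`), and the influence falls off as `1 / (1 + grid_distance)`. Real SOMs usually use a 2D grid and a Gaussian neighbourhood, and the function names here are illustrative only.

```python
def best_matching_unit(neurons, sample):
    """Index of the neuron whose weights are closest to the sample."""
    distances = [
        sum((w - x) ** 2 for w, x in zip(neuron, sample)) for neuron in neurons
    ]
    return distances.index(min(distances))

def som_update(neurons, sample, learning_rate=0.5, radius=1):
    """Pull the best matching unit and its grid neighbours towards the sample."""
    bmu = best_matching_unit(neurons, sample)
    updated = []
    for index, neuron in enumerate(neurons):
        grid_distance = abs(index - bmu)
        if grid_distance > radius:
            updated.append(list(neuron))  # too far away: left untouched
            continue
        influence = learning_rate / (1 + grid_distance)
        updated.append([w + influence * (x - w) for w, x in zip(neuron, sample)])
    return bmu, updated

neurons = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
bmu, updated = som_update(neurons, sample=[0.9, 1.1])
```

The winner moves the most, its direct neighbours move half as much, and everything further away stays put.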

*Hayes, Brian. “First links in the Markov chain.” American Scientist
101.2 (2013): 252.*


*Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.”
Advances in neural information processing systems 19 (2007): 153.*


**Gated recurrent units (GRU)** are a slight variation on LSTMs. They
have one fewer gate and are wired slightly differently: instead of an
input, output and a forget gate, they have an update gate and a reset
gate. The update gate determines both how much information to keep from
the last state and how much information to let in from the previous
layer. The reset gate functions much like the forget gate of an LSTM but
it’s located slightly differently. GRUs always send out their full
state; they don’t have an output gate. In most cases, they function very similarly to
LSTMs, with the biggest difference being that GRUs are slightly faster
and easier to run (but also slightly less expressive). In practice these
tend to cancel each other out, as you need a bigger network to regain
some expressiveness which then in turn cancels out the performance
benefits. In some cases where the extra expressiveness is not needed,
GRUs can outperform LSTMs.
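
Here is a minimal sketch of a single GRU step with a scalar input and a scalar state, just to show how the two gates interact. The parameter names and values are arbitrary assumptions for illustration. Note that papers differ on which side of the blend gets `z` and which gets `1 - z`; the sketch picks one convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One step of a single-unit GRU with scalar input and state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # The reset gate decides how much of the old state feeds the candidate.
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_candidate

# Arbitrary illustrative weights; a trained network would learn these.
params = {
    "wz": 0.5, "uz": 0.5, "bz": 0.0,
    "wr": 0.5, "ur": 0.5, "br": 0.0,
    "wh": 1.0, "uh": 1.0, "bh": 0.0,
}

h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = gru_step(x, h, params)

# With the update gate forced shut, the old state passes through unchanged.
frozen = dict(params, bz=-100.0)
h_frozen = gru_step(1.0, 0.3, frozen)
```

There is no separate memory cell and no output gate: the state `h` is both what the unit remembers and what it sends on.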

**Deep belief networks (DBN)** is the name given to stacked
architectures of mostly RBMs or VAEs. These networks have been shown to
be effectively trainable stack by stack, where each AE or RBM only has
to learn to encode the previous network. This technique is also known as
greedy training, where greedy means making locally optimal choices to
get to a decent but possibly not optimal answer. DBNs can be trained
through contrastive divergence or back-propagation and learn to
represent the data as a probabilistic model, just like regular RBMs or
VAEs. Once trained or converged to a (more) stable state through
unsupervised learning, the model can be used to generate new data. If
trained with contrastive divergence, it can even classify existing data
because the neurons have been taught to look for different features.

Composing a complete list is practically impossible, as new architectures are invented all the time. Even once published, they can be quite hard to find, even when you’re looking for them, and sometimes you simply overlook some. So while this list may provide you with some insights into the world of AI, please do not take it to be comprehensive, especially if you read this post long after it was written.

**Feed forward neural networks (FF or FFNN) and perceptrons (P)** are
very straightforward: they feed information from the front to the back
(input and output, respectively). Neural networks are often described as
having layers, where each layer consists of either input, hidden or
output cells in parallel. Cells within a single layer are never connected
to each other, and in general two adjacent layers are fully connected
(every neuron from one layer to every neuron in the other layer). The simplest somewhat practical
network has two input cells and one output cell, which can be used to
model logic gates. One usually trains FFNNs through back-propagation,
giving the network paired datasets of “what goes in” and “what we want
to have coming out”. This is called supervised learning, as opposed to
unsupervised learning where we only give it input and let the network
fill in the blanks. The error being back-propagated is often some
variation of the difference between the network’s output and the desired
output (like MSE or just the linear difference). Given that the network has enough hidden
neurons, it can theoretically always model the relationship between the
input and output. Practically their use is a lot more limited but they
are popularly combined with other networks to form new networks.
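
The “two input cells and one output cell” network is small enough to write out in full. The sketch below trains a single perceptron on the AND gate using the classic perceptron learning rule (not back-propagation, which needs a differentiable activation). The function names and the choice of AND are mine, picked purely for illustration.

```python
def step(value):
    """Threshold activation: fire only when the weighted sum is positive."""
    return 1 if value > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = step(weights[0] * inputs[0] + weights[1] * inputs[1] + bias)
            error = target - output  # "what we want" minus "what came out"
            weights[0] += learning_rate * error * inputs[0]
            weights[1] += learning_rate * error * inputs[1]
            bias += learning_rate * error
    return weights, bias

# Paired dataset: "what goes in" and "what we want to have coming out".
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [
    step(weights[0] * a + weights[1] * b + bias) for (a, b), _ in and_gate
]
```

A single cell like this can only model linearly separable gates such as AND and OR; XOR needs at least one hidden layer.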

*Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in
Neural Information Processing Systems. 2014.*


So I decided to compose a cheat sheet containing many of those architectures. Most of these are neural networks, some are completely different beasts. Though all of these architectures are presented as novel and unique, when I drew the node structures… their underlying relations started to make more sense.

[Update 29 september 2016] Added links and citations to all the original papers. A follow up post is planned, since I found at least 9 more architectures. I will not include them in this post for better consistency in terms of content.

*He, Kaiming, et al. “Deep residual learning for image recognition.”
arXiv preprint arXiv:1512.03385 (2015).*


*Elman, Jeffrey L. “Finding structure in time.” Cognitive science 14.2
(1990): 179-211.*


**Support vector machines (SVM)** find optimal solutions for
classification problems. Classically they were only capable of
categorising linearly separable data; say finding which images are of
Garfield and which of Snoopy, with any other outcome not being possible.
During training, SVMs can be thought of as plotting all the data
(Garfields and Snoopys) on a graph (2D) and figuring out how to draw a
line between the data points. This line would separate the data, so that
all Snoopys are on one side and the Garfields on the other. This line
moves to an optimal line in such a way that the margins between the data
points and the line are maximised on both sides. Classifying new data
would be done by plotting a point on this graph and simply looking on
which side of the line it is (Snoopy side or Garfield side). Using the
kernel trick, they can be taught to classify n-dimensional data. This
entails plotting points in a 3D plot, allowing it to distinguish between
Snoopy, Garfield AND Simon’s cat, or even higher dimensions
distinguishing even more cartoon characters. SVMs are not always
considered neural networks.

*Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural
networks on sequence modeling.” arXiv preprint arXiv:1412.3555
(2014).*


**Long / short term memory (LSTM)** networks try to combat the vanishing
/ exploding gradient problem by introducing gates and an explicitly
defined memory cell. These are inspired mostly by circuitry, not so much
biology. Each neuron has a memory cell and three gates: input, output
and forget. The function of these gates is to safeguard the information
by stopping or allowing the flow of it. The input gate determines how
much of the information from the previous layer gets stored in the cell.
The output gate takes the job on the other end and determines how much
the next layer gets to know about the state of this cell. The forget
gate seems like an odd inclusion at first but sometimes it’s good to
forget: if it’s learning a book and a new chapter begins, it may be
necessary for the network to forget some characters from the previous
chapter. LSTMs have been shown to be able to learn complex sequences,
such as writing like Shakespeare or composing primitive music. Note that
each of these gates has a weight to a cell in the previous neuron, so
they typically require more resources to run.
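
The three gates are easier to follow when written out. Below is a minimal sketch of one LSTM step for a single unit with scalar input and state, using the commonly seen formulation with a forget gate. All parameter names and values are arbitrary assumptions for illustration, not trained weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with scalar input and state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old cell, add part of the new
    h = o * math.tanh(c)     # reveal only part of the cell to the next layer
    return h, c

# Arbitrary illustrative weights; a trained network would learn these.
params = {
    "wi": 0.5, "ui": 0.1, "bi": 0.0,
    "wf": 0.5, "uf": 0.1, "bf": 0.0,
    "wo": 0.5, "uo": 0.1, "bo": 0.0,
    "wg": 1.0, "ug": 0.1, "bg": 0.0,
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)

# With the forget gate held open and the input gate held shut,
# the cell simply carries its old value forward.
hold = dict(params, bf=100.0, bi=-100.0)
_, c_held = lstm_step(1.0, 0.0, 0.7, hold)
```

Each gate has its own set of weights, which is where the extra cost compared to a plain recurrent unit comes from.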

**Generative adversarial networks (GAN)** are from a different breed of
networks, they are twins: two networks working together. GANs consist of
any two networks (although often a combination of FFs and CNNs), with
one tasked to generate content and the other to judge content. The
discriminating network receives either training data or generated
content from the generative network. How well the discriminating network
was able to correctly predict the data source is then used as part of
the error for the generating network. This creates a form of competition
where the discriminator is getting better at distinguishing real data
from generated data and the generator is learning to become less
predictable to the discriminator. This works well in part because even
quite complex noise-like patterns are eventually predictable but
generated content similar in features to the input data is harder to
learn to distinguish. GANs can be quite difficult to train, as you don’t
just have to train two networks (either of which can pose its own
problems) but their dynamics need to be balanced as well. If prediction
or generation becomes too good compared to the other, a GAN won’t
converge as there is intrinsic divergence.

*Cambria, Erik, et al. “Extreme learning machines [trends &
controversies].” IEEE Intelligent Systems 28.6 (2013): 30-59.*


*Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.”
Neural computation 9.8 (1997): 1735-1780.*


**Radial basis function (RBF)** networks are FFNNs with radial basis
functions as activation functions. There’s nothing more to it. Doesn’t
mean they don’t have their uses, but most FFNNs with other activation
functions don’t get their own name. This mostly has to do with inventing
them at the right time.


**Denoising autoencoders (DAE)** are AEs where we don’t feed just the
input data, but we feed the input data with noise (like making an image
more grainy). We compute the error the same way though, so the output of
the network is compared to the original input without noise. This
encourages the network not to learn details but broader features, as
learning smaller features often turns out to be “wrong” due to it
constantly changing with noise.
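
The key detail is that noise goes into the network but the error is measured against the clean input. Here is a small sketch of just that loss computation; `model` stands in for any autoencoder (here a plain function), and the Gaussian noise and the helper names are assumptions made for the example.

```python
import random

def add_noise(values, noise_level, rng):
    """Corrupt the input, e.g. to make an image more grainy."""
    return [v + rng.gauss(0.0, noise_level) for v in values]

def mean_squared_error(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)

def denoising_loss(model, clean_input, noise_level=0.1, rng=None):
    rng = rng or random.Random(0)
    noisy_input = add_noise(clean_input, noise_level, rng)
    reconstruction = model(noisy_input)
    # Compare against the ORIGINAL input, not the noisy one.
    return mean_squared_error(reconstruction, clean_input)

clean = [0.0, 0.5, 1.0, 0.5]

# A "model" that merely copies its input is punished for copying the noise.
copy_loss = denoising_loss(lambda values: list(values), clean)

# A hypothetical perfect denoiser would reach zero error.
perfect_loss = denoising_loss(lambda values: list(clean), clean)
```

Simply passing the input through is no longer a winning strategy, so the network is nudged towards features that survive the noise.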

**Restricted Boltzmann machines (RBM)** are remarkably similar to BMs
(surprise) and therefore also similar to HNs. The biggest difference
between BMs and RBMs is that RBMs are more usable because they are
more restricted. They don’t trigger-happily connect every neuron to
every other neuron but only connect every different group of neurons to
every other group, so no input neurons are directly connected to other
input neurons and no hidden to hidden connections are made either. RBMs
can be trained like FFNNs with a twist: instead of passing data forward
and then back-propagating, you forward pass the data and then backward
pass the data (back to the first layer). After that you train with
forward-and-back-propagation.

*Smolensky, Paul. Information processing in dynamical systems:
Foundations of harmony theory. No. CU-CS-321-86. COLORADO UNIV AT
BOULDER DEPT OF COMPUTER SCIENCE, 1986.*


**Autoencoders (AE)** are somewhat similar to FFNNs as AEs are more like
a different use of FFNNs than a fundamentally different architecture.
The basic idea behind autoencoders is to encode information (as in
compress, not encrypt) automatically, hence the name. The entire network
always resembles an hourglass-like shape, with smaller hidden layers
than the input and output layers. AEs are also always symmetrical around
the middle layer(s) (one or two depending on an even or odd number of
layers). The smallest layer(s) are almost always in the middle, the
place where the information is most compressed (the chokepoint of the
network). Everything up to the middle is called the encoding part,
everything after the middle the decoding and the middle (surprise) the
code. One can train them using backpropagation by feeding input and
setting the error to be the difference between the input and what came
out. AEs can be built symmetrically when it comes to weights as well, so
the encoding weights are the same as the decoding weights.

[Update 15 september 2016] I would like to thank everybody for their insights and corrections, all feedback is hugely appreciated. I will add links and a couple more suggested networks in a future update, stay tuned.

**Extreme learning machines (ELM)** are basically FFNNs but with random
connections. They look very similar to LSMs and ESNs, but they are not
recurrent nor spiking. They also do not use backpropagation. Instead,
they start with random weights and train the weights in a single step
according to the least-squares fit (lowest error across all functions).
This results in a much less expressive network but it’s also much faster
than backpropagation.

*Hopfield, John J. “Neural networks and physical systems with emergent
collective computational abilities.” Proceedings of the national academy
of sciences 79.8 (1982): 2554-2558.*


A **Hopfield network (HN)** is a network where every neuron is connected
to every other neuron; it is a completely entangled plate of spaghetti
as even all the nodes function as everything. Each node is input before
training, then hidden during training and output afterwards. The
networks are trained by setting the value of the neurons to the desired
pattern after which the weights can be computed. The weights do not
change after this. Once trained for one or more patterns, the network
will always converge to one of the learned patterns because the network
is only stable in those states. Note that it does not always conform to
the desired state (it’s not a magic black box sadly). It stabilises in
part due to the total “energy” or “temperature” of the network being
reduced incrementally during training. Each neuron has an activation
threshold which scales to this temperature, which if surpassed by
summing the input causes the neuron to take the form of one of two
states (usually -1 or 1, sometimes 0 or 1). Updating the network can be
done synchronously or more commonly one by one. If updated one by one, a
fair random sequence is created to organise which cells update in what
order (fair random being all options (n) occurring exactly once every n
items). This is so you can tell when the network is stable (done
converging), once every cell has been updated and none of them changed,
the network is stable (annealed). These networks are often called
associative memory because they converge to the state most similar to the
input; if humans see half a table we can imagine the other half, and this
network will converge to a table if presented with half noise and half a
table.
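
The “half a table” behaviour can be reproduced with a tiny network. The sketch below assumes the standard Hebbian rule for computing the weights and a plain sign threshold (no temperature), storing a single 8-cell pattern; cells are updated one by one in a shuffled order until a full sweep changes nothing. Function names and the pattern are made up for the example.

```python
import random

def train_hopfield(patterns):
    """Compute the weights once from the desired patterns (Hebbian rule)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights

def recall(weights, state, rng, max_sweeps=10):
    """Update cells one by one until a whole sweep leaves them unchanged."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        order = list(range(n))
        rng.shuffle(order)  # every cell is visited exactly once per sweep
        changed = False
        for i in order:
            total = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if total >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # stable: nothing moved during a full sweep
    return state

stored = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])
corrupted = [1, -1, 1, -1, 1, -1, -1, 1]  # last two cells flipped
recovered = recall(weights, corrupted, random.Random(0))
```

With only one stored pattern and a mostly correct input, the network falls back into the stored pattern; with many stored patterns it can also settle in a state you did not ask for.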

With new neural network architectures popping up every now and then, it’s hard to keep track of them all. Knowing all the abbreviations being thrown around (DCIGN, BiLSTM, DCGAN, anyone?) can be a bit overwhelming at first.

*Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann
LeCun. “Efficient learning of sparse representations with an
energy-based model.” Proceedings of NIPS. 2007.*


*Broomhead, David S., and David Lowe. Radial basis functions,
multi-variable functional interpolation and adaptive networks. No.
RSRE-MEMO-4148. ROYAL SIGNALS AND RADAR ESTABLISHMENT MALVERN (UNITED
KINGDOM), 1988.*


*Kingma, Diederik P., and Max Welling. “Auto-encoding variational
bayes.” arXiv preprint arXiv:1312.6114 (2013).*


*Zeiler, Matthew D., et al. “Deconvolutional networks.” Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.*


*Maass, Wolfgang, Thomas Natschläger, and Henry Markram. “Real-time
computing without stable states: A new framework for neural computation
based on perturbations.” Neural computation 14.11 (2002): 2531-2560.*


**Recurrent neural networks (RNN)** are FFNNs with a time twist: they
are not stateless; they have connections between passes, connections
through time. Neurons are fed information not just from the previous
layer but also from themselves from the previous pass. This means that
the order in which you feed the input and train the network matters:
feeding it “milk” and then “cookies” may yield different results
compared to feeding it “cookies” and then “milk”. One big problem with
RNNs is the vanishing (or exploding) gradient problem
where, depending on the activation functions used, information
rapidly gets lost over time, just like very deep FFNNs lose information
in depth. Intuitively this wouldn’t be much of a problem because these
are just weights and not neuron states, but the weights through time are
actually where the information from the past is stored; if the weight
reaches a value of 0 or 1 000 000, the previous state won’t be very
informative. RNNs can in principle be used in many fields as most forms
of data that don’t actually have a timeline (i.e. unlike sound or video)
can be represented as a sequence. A picture or a string of text can be
fed one pixel or character at a time, so the time dependent weights are
used for what came before in the sequence, not actually from what
happened x seconds before. In general, recurrent networks are a good
choice for advancing or completing information, such as autocompletion.
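
The “milk” and “cookies” point is easy to check with a one-neuron recurrent network. In this sketch the two words are encoded as arbitrary numbers and the weights are fixed by hand; both are assumptions purely for illustration.

```python
import math

def rnn_step(x, h_prev, w_in=0.8, w_rec=0.5, bias=0.0):
    """The new state depends on the current input AND the previous state."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)

def run_sequence(inputs, h0=0.0):
    h = h0
    for x in inputs:
        h = rnn_step(x, h)
    return h

# Hypothetical numeric encodings of the two words.
milk, cookies = 1.0, -0.5

state_a = run_sequence([milk, cookies])
state_b = run_sequence([cookies, milk])
```

The same two inputs leave the network in clearly different states depending on their order, which a stateless FFNN fed one item at a time could never do.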

*Vincent, Pascal, et al. “Extracting and composing robust features with
denoising autoencoders.” Proceedings of the 25th international
conference on Machine learning. ACM, 2008.*


*Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning
in Boltzmann machines.” Parallel distributed processing: Explorations in
the microstructure of cognition 1 (1986): 282-317.*


*Jaeger, Herbert, and Harald Haas. “Harnessing nonlinearity: Predicting
chaotic systems and saving energy in wireless communication.” science
304.5667 (2004): 78-80.*


**Markov chains (MC or discrete time Markov Chain, DTMC)** are kind of
the predecessors to BMs and HNs. They can be understood as follows: from
this node where I am now, what are the odds of me going to any of my
neighbouring nodes? They are memoryless (i.e. they have the Markov
property), which means that the next state depends only on the current
state. While not really neural networks, they do resemble neural
networks and form the theoretical basis for BMs and HNs. MCs aren’t
always considered neural networks, as goes for BMs, RBMs and HNs. Markov
chains aren’t always fully connected either.
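
As a small illustration, here is a two-state chain in Python. The weather states and the probabilities are invented for the example; the only thing `next_state` ever looks at is the node it is standing on right now.

```python
import random

# Hypothetical transition table: from each node, the odds of each next node.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """Pick the next node using only the current node (memoryless)."""
    options = transitions[current]
    states = list(options)
    probabilities = [options[s] for s in states]
    return rng.choices(states, weights=probabilities, k=1)[0]

def walk(start, steps, rng):
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

path = walk("sunny", 10, random.Random(42))
```

Nothing about the earlier history of `path` is consulted when choosing the next step, which is exactly the Markov property.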

**Variational autoencoders (VAE)** have the same architecture as AEs but
are “taught” something else: an approximated probability distribution of
the input samples. It’s a bit back to the roots as they are a bit more
closely related to BMs and RBMs. They do however rely on Bayesian
mathematics regarding probabilistic inference and independence, as well
as a re-parametrisation trick to achieve this different representation.
The inference and independence parts make sense intuitively, but they
rely on somewhat complex mathematics. The basics come down to this: take
influence into account. If one thing happens in one place and something
else happens somewhere else, they are not necessarily related. If they
are not related, then the error propagation should consider that. This
is a useful approach because neural networks are large graphs (in a
way), so it helps if you can rule out influence from some nodes to other
nodes as you dive into deeper layers.

**Boltzmann machines (BM)** are a lot like HNs, but: some neurons are
marked as input neurons and others remain “hidden”. The input neurons
become output neurons at the end of a full network update. It starts
with random weights and learns through back-propagation, or more
recently through contrastive divergence (a Markov chain is used to
determine the gradients between two informational gains). Compared to a
HN, the neurons mostly have binary activation patterns. As hinted by
being trained by MCs, BMs are stochastic networks. The training and
running process of a BM is fairly similar to a HN: one sets the input
neurons to certain clamped values after which the network is set free
(it doesn’t get a sock). While free the cells can get any value and we
repetitively go back and forth between the input and hidden neurons. The
activation is controlled by a global temperature value, which if lowered
lowers the energy of the cells. This lower energy causes their
activation patterns to stabilise. The network reaches an equilibrium
given the right temperature.

**Liquid state machines (LSM)** are similar soups, looking a lot like
ESNs. The real difference is that LSMs are a type of spiking neural
networks: sigmoid activations are replaced with threshold functions and
each neuron is also an accumulating memory cell. So when updating a
neuron, the value is not set to the sum of the neighbours, but rather
added to itself. Once the threshold is reached, it releases its energy
to other neurons. This creates a spiking-like pattern, where nothing
happens for a while until a threshold is suddenly reached.
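
A single accumulating neuron of this kind fits in a few lines. This is a bare-bones sketch that assumes the potential resets to zero after a spike and that nothing leaks away between inputs; real spiking models are usually more elaborate, and the function name is mine.

```python
def run_spiking_neuron(inputs, threshold=1.0):
    """Accumulate input; emit a spike and reset once the threshold is hit."""
    potential = 0.0
    spikes = []
    for value in inputs:
        potential += value  # added to itself, not overwritten
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # energy released to the other neurons
        else:
            spikes.append(0)
    return spikes

spikes = run_spiking_neuron([0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 0.125])
# spikes == [0, 0, 0, 1, 0, 1, 0]
```

Most of the time the output is silent; the information is carried by when the spikes happen.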

**Deep residual networks (DRN)** are very deep FFNNs with extra
connections passing input from one layer to a later layer (often 2 to 5
layers) as well as the next layer. Instead of trying to find a solution
for mapping some input to some output across say 5 layers, the network
is forced to learn to map some input to some output + some input.
Basically, it adds an identity to the solution, carrying the older input
over and serving it freshly to a later layer. It has been shown that
these networks are very effective at learning patterns up to 150 layers
deep, much more than the regular 2 to 5 layers one could expect to
train. However, it has been proven that these networks are in essence
just RNNs without the explicit time based construction and they’re often
compared to LSTMs without gates.
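
The “output + input” idea looks like this in code. The sketch uses two small dense layers with a ReLU in between as the transformation, which is an assumption for the example; the block in the original paper is convolutional and applies another ReLU after the addition.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(values, weights, biases):
    return [
        sum(w * v for w, v in zip(row, values)) + b
        for row, b in zip(weights, biases)
    ]

def residual_block(x, w1, b1, w2, b2):
    """Return transformation(x) + x: the input skips ahead and is added back."""
    hidden = relu(dense(x, w1, b1))
    transformed = dense(hidden, w2, b2)
    return [t + xi for t, xi in zip(transformed, x)]

x = [1.0, -2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

# With all-zero weights the block is exactly the identity function.
out = residual_block(x, zero_w, zero_b, zero_w, zero_b)
```

Because doing nothing is the default behaviour, each block only has to learn a correction on top of its input, which is part of why stacking very many of them remains trainable.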

Any feedback and criticism is welcome. At the Asimov Institute we do deep learning research and development, so be sure to follow us on twitter for future updates and posts! Thank you for reading!

For each of the architectures depicted in the picture, I wrote a *very,
very* brief description. You may find some of these to be useful if
you’re quite familiar with some architectures, but you aren’t familiar
with a particular one.
