# Hand-Coding a Neural Network's Forward and Backward Pass

### Forward propagation

$$
\begin{aligned}
a_{1}^{(2)} &= g\left(\Theta_{10}^{(1)} x_{0}+\Theta_{11}^{(1)} x_{1}+\Theta_{12}^{(1)} x_{2}+\Theta_{13}^{(1)} x_{3}\right) \\
a_{2}^{(2)} &= g\left(\Theta_{20}^{(1)} x_{0}+\Theta_{21}^{(1)} x_{1}+\Theta_{22}^{(1)} x_{2}+\Theta_{23}^{(1)} x_{3}\right) \\
a_{3}^{(2)} &= g\left(\Theta_{30}^{(1)} x_{0}+\Theta_{31}^{(1)} x_{1}+\Theta_{32}^{(1)} x_{2}+\Theta_{33}^{(1)} x_{3}\right) \\
h_{\Theta}(x) &= g\left(\Theta_{10}^{(2)} a_{0}^{(2)}+\Theta_{11}^{(2)} a_{1}^{(2)}+\Theta_{12}^{(2)} a_{2}^{(2)}+\Theta_{13}^{(2)} a_{3}^{(2)}\right)
\end{aligned} \tag{1}
$$
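As a sketch of equation (1) in vectorized form: stack the weights into matrices $\Theta^{(1)}$ and $\Theta^{(2)}$ and apply $g$ layer by layer. All names and values below are illustrative, not from the original post:

```python
import numpy as np

def g(z):
    # Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes for the network in equation (1):
# Theta1 maps the 3 inputs (plus bias x0) to 3 hidden units,
# Theta2 maps those hidden units (plus bias a0) to 1 output.
rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((3, 4))   # Theta^(1): rows = hidden units, cols = x0..x3
Theta2 = rng.standard_normal((1, 4))   # Theta^(2): cols = a0..a3

x = np.array([1.0, 0.5, -0.2, 0.8])    # x0 = 1 is the bias term

a2 = g(Theta1 @ x)                     # hidden activations a1..a3 of layer 2
a2 = np.insert(a2, 0, 1.0)             # prepend the bias unit a0 = 1
h = g(Theta2 @ a2)                     # h_Theta(x), the last line of equation (1)

print(a2.shape, h.shape)
```

Each row of `Theta1 @ x` is exactly one of the weighted sums inside $g(\cdot)$ in equation (1).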

### Backward propagation

Start from the error of the last layer: the error between the activation units' predictions $(a_k^{(4)})$ and the actual values $(y_k)$, for $k = 1{:}K$.

$$
\delta^{(4)} = a^{(4)} - y \tag{2}
$$

$$
\delta^{(3)} = \left(\Theta^{(3)}\right)^{T} \delta^{(4)} \mathbin{.*} g^{\prime}\left(z^{(3)}\right) \tag{3}
$$

Here $g^{\prime}(z^{(3)})$ is the derivative of the $Sigmoid$ function, $g^{\prime}(z^{(3)}) = a^{(3)} \mathbin{.*} \left(1 - a^{(3)}\right)$, where $.*$ denotes element-wise multiplication and $z$ is the node value before the sigmoid is applied. $(\Theta^{(3)})^{T}\delta^{(4)}$ is the sum of the errors propagated back through the weights.
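The identity $g^{\prime}(z) = a \mathbin{.*} (1 - a)$ is easy to verify against a central finite difference; the following check is an illustrative sketch:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
a = g(z)

analytic = a * (1 - a)                           # g'(z) = a .* (1 - a)
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)  # central finite difference

print(np.max(np.abs(analytic - numeric)))        # tiny: the two agree
```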

$$
\delta^{(2)} = \left(\Theta^{(2)}\right)^{T} \delta^{(3)} \mathbin{.*} g^{\prime}\left(z^{(2)}\right) \tag{4}
$$

The term $(\Theta^{(n)})^{T}\delta^{(n+1)}$ is interpreted as in the figure below. The last node of layer 2 sends out two arrows, so it participates in both nodes of layer 3; therefore $\delta_2^{(2)} = \Theta_{12}^{(2)}\delta_1^{(3)} + \Theta_{22}^{(2)}\delta_2^{(3)}$. $\delta^{(2)}$ can be read as the error of all the nodes in layer 2, so its length is $3$.
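With made-up numbers, the element-wise reading of $\delta_2^{(2)}$ agrees with the matrix form $(\Theta^{(2)})^{T}\delta^{(3)}$ (bias terms omitted for clarity; all values below are illustrative):

```python
import numpy as np

# Illustrative: layer 2 has 3 units, layer 3 has 2 units.
Theta2 = np.array([[0.1, 0.2, 0.3],    # weights into node 1 of layer 3
                   [0.4, 0.5, 0.6]])   # weights into node 2 of layer 3
delta3 = np.array([0.7, -0.2])         # delta^(3)

# Matrix form: (Theta^(2))^T delta^(3), before the elementwise g'(z^(2)) factor.
back = Theta2.T @ delta3

# Element form for node 2 of layer 2: theta_12 * delta_1 + theta_22 * delta_2.
manual = Theta2[0, 1] * delta3[0] + Theta2[1, 1] * delta3[1]

print(back[1], manual)   # both equal 0.2*0.7 + 0.5*(-0.2) = 0.04
```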

Assuming $\lambda = 0$, i.e. without any regularization, we have $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_{j}^{(l)} \delta_{i}^{(l+1)}$: $i$ indexes the error of the $i$-th node of the next layer, $j$ runs over each value of $\theta$ (each activation $a$ of the previous layer), and $\delta^{(l+1)}$ is the quantity paired with $a^{(l)}$.

Now consider regularization, and a training set that is a feature matrix rather than a vector. In the special case above we computed each layer's error terms in order to obtain the partial derivatives of the cost function. In the general case we still compute each layer's error terms, but we must do so for the entire training set, and the error terms then form a matrix. We write this error matrix as $\Delta_{ij}^{(l)}$: the error of the $i$-th activation unit of layer $l$ caused by the $j$-th parameter. (For example, if layer $l$ has 10 nodes and layer $l-1$ has 30 nodes, the error matrix has shape $(10, 30)$.)

$$
\begin{aligned}
&\text{for } i = 1{:}m \ \{ \\
&\qquad \text{set } a^{(1)} = x^{(i)} \\
&\qquad \text{perform forward propagation to compute } a^{(l)} \text{ for } l = 1, 2, 3, \ldots, L \\
&\qquad \text{using } \delta^{(L)} = a^{(L)} - y^{(i)}, \text{ perform back propagation to compute all previous layer error vectors} \\
&\qquad \Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_{j}^{(l)} \delta_{i}^{(l+1)} \\
&\}
\end{aligned} \tag{5}
$$

In $a_{j}^{(l)} \delta_{i}^{(l+1)}$, the factor $a_{j}^{(l)}$ means that every node $a$ of layer $l$ is multiplied with the $i$-th node of layer $l+1$, giving a matrix of shape $(m, n)$, where $m$ is the number of nodes in layer $l+1$ and $n$ is the number of nodes in layer $l$.

$\delta^{(n)}$ is computed from the weights $\theta^{(n)}$ that leave the layer. In $\Delta_{ij}^{(l)}$, the range of $j$ comes from the nodes $a^{(l)}$ and the range of $i$ comes from $\delta^{(l+1)}$; each entry is obtained by multiplying a node of layer $l$ with the $i$-th error of layer $l+1$. Since $a^{(l)}$ is paired with $\delta^{(l+1)}$, these two are used to compute the error matrix $\Delta$ of $\theta$: $\delta^{(l+1)T} a^{(l)}$ (for example, $(1,26)^{T} * (1,401) \to (26,401)$).
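The shape bookkeeping in the $(1,26)^{T} * (1,401) \to (26,401)$ example can be checked directly (placeholder values only):

```python
import numpy as np

# delta^(l+1): one error per node of layer l+1; a^(l): one activation per node of layer l.
delta_next = np.ones((1, 26))   # e.g. 25 hidden-unit errors plus the bias slot
a_l = np.ones((1, 401))         # e.g. 400 inputs plus the bias unit

Delta = delta_next.T @ a_l      # (26, 1) @ (1, 401) -> (26, 401), the shape of Theta^(l)
print(Delta.shape)
```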

The distinction: $\delta^{(n)}$ takes its own layer $n$ as the center and is computed from the outgoing weights, while $\Delta_{ij}^{(l)}$ takes layer $l+1$ as the center (the $\theta$ of layer $l$ corresponds to layer $l+1$) and multiplies with the $a$ of the layer before it.

$$
\begin{aligned}
D_{ij}^{(l)} &:= \frac{1}{m} \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)} & \text{if } j \neq 0 \\
D_{ij}^{(l)} &:= \frac{1}{m} \Delta_{ij}^{(l)} & \text{if } j = 0
\end{aligned} \tag{6}
$$
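Equation (6) in code, on a small random example (shapes and values are illustrative):

```python
import numpy as np

m, lam = 5, 1.0
rng = np.random.default_rng(1)
Delta = rng.standard_normal((3, 4))    # accumulated errors, same shape as Theta^(l)
Theta = rng.standard_normal((3, 4))    # column 0 holds the bias weights (j = 0)

D = Delta / m                          # (1/m) * Delta for every entry
D[:, 1:] += lam * Theta[:, 1:]         # add lambda * Theta only where j != 0

# The bias column stays unregularized:
print(np.allclose(D[:, 0], Delta[:, 0] / m))
```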

### Code example

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# X and y are assumed to have been loaded beforehand, e.g. from the
# handwritten-digit .mat file via scipy.io.loadmat.
X = data['X']
y = data['y']

# One-hot encode the labels: each label becomes an indicator vector.
encoder = OneHotEncoder(sparse=False)  # on sklearn >= 1.2 use sparse_output=False
y_onehot = encoder.fit_transform(y)
print(y_onehot.shape)
```


```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    # g'(z) = g(z) .* (1 - g(z)); needed by backprop below but missing
    # from the original listing.
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))

def forward_propagate(X, theta1, theta2):
    m = X.shape[0]

    # Prepend the bias column a0 = 1, then propagate layer by layer.
    # The inputs are np.matrix, so * is matrix multiplication here.
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)
    z2 = a1 * theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
    z3 = a2 * theta2.T
    h = sigmoid(z3)

    return a1, z2, a2, z3, h
```

```python
def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    # Unregularized cross-entropy cost.
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    # Unroll the flat parameter vector into the two weight matrices.
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    J = 0
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)

    J = J / m

    return J
```

```python
def cost_n(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    # Regularized cross-entropy cost.
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    J = 0
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)

    J = J / m

    # Regularization term; the bias columns (index 0) are excluded.
    J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))

    return J
```



```python
def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    J = 0
    delta1 = np.zeros(theta1.shape)  # gradient accumulator for theta1
    delta2 = np.zeros(theta2.shape)  # gradient accumulator for theta2

    # Regularized cost, same computation as cost_n.
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)

    J = J / m

    J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))

    # Accumulate the gradients one training example at a time, as in equation (5).
    for t in range(m):
        a1t = a1[t, :]
        z2t = z2[t, :]
        a2t = a2[t, :]
        ht = h[t, :]
        yt = y[t, :]

        d3t = ht - yt  # output-layer error, equation (2)

        z2t = np.insert(z2t, 0, values=np.ones(1))  # add the bias term before applying g'
        d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # equation (4)

        delta1 = delta1 + (d2t[:, 1:]).T * a1t  # drop the bias error, then take the outer product
        delta2 = delta2 + d3t.T * a2t

    delta1 = delta1 / m
    delta2 = delta2 / m

    # Regularize everything except the bias column, as in equation (6).
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * learning_rate) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * learning_rate) / m

    # Unroll the gradients into one flat vector so the optimizer can consume them;
    # the original listing was missing this return statement.
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))

    return J, grad
```
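A hand-written backprop like the one above is usually validated with a finite-difference gradient check. The helper below is an illustrative sketch, demonstrated on a function with a known gradient rather than the network itself:

```python
import numpy as np

def numerical_grad(f, params, eps=1e-5):
    # Central-difference estimate of df/dparams, one coordinate at a time.
    grad = np.zeros_like(params)
    for i in range(params.size):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        grad[i] = (f(p_plus) - f(p_minus)) / (2 * eps)
    return grad

# Demo on a function with a known gradient: f(p) = sum(p^2), so grad = 2p.
p = np.array([1.0, -2.0, 3.0])
approx = numerical_grad(lambda q: np.sum(q ** 2), p)
print(np.max(np.abs(approx - 2 * p)))   # tiny
```

In practice you would pass a closure over `cost_n` as `f` (on a small network, since each coordinate costs two forward passes) and compare the result against the `grad` returned by `backprop`.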







```python
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1  # this argument plays the role of lambda, the regularization strength

# Random initialization of the full parameter vector, values in [-0.125, 0.125].
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25

m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
print(params.shape)

theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

print(theta1.shape, theta2.shape)

# Sanity check: one forward/backward pass on the initial parameters.
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
```



```python
from scipy.optimize import minimize

# jac=True tells the optimizer that backprop returns (cost, gradient) as a pair.
fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate),
                method='TNC', jac=True, options={'maxiter': 250})
print(fmin)

X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

# Predict: the class is the index of the largest output unit (+1 because labels are 1-based).
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)

correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))
```


Original: https://blog.csdn.net/Mars1533/article/details/124432136
Author: 无情码手
Title: Machine Learning Notes (2): Hand-coding Neural Network Forward and Back Propagation

## Programming Languages Are Actually Magic Spells!

If you are a Muggle who knows nothing of magic, it doesn't matter; let me unravel this great mystery of the century for you little by little.

## Alchemy

In a magic array made up of simple language tags, every object inside the array is endowed with specific magical effects, or is transformed into other objects.

In fact, this is also a necessary condition for all magic; it is the basic law of operation of the world it lives in.

As we all know, different magic systems obey different laws of the world, and on our barren plane there are no basic elements such as "mana" or "spirituality" from those systems.

This is a virtual container: put a spell into it in the appropriate way, and it will achieve its intended effect.

Yes, this is a function that bolds and enlarges text; of course, that is just a trivial trick.

But if you really want to summon a devil, in terms of complexity, that might take us too far.

## Summoning

Constructing objects out of nothingness. Since this white pigeon is, for now, only a paper pigeon, we call it a static "object"; there are also dynamic "objects", and the spell that brings them to life is called "animation programming."

Low-level simple spells are short; high-level magic is often powerful, but its incantations can be very lengthy.

In actual combat, building every function from scratch takes a great deal of time, and in a fast-moving war that makes you a sitting target.

## Magic Scrolls

Scroll makers spend spare time and effort engraving practical, complex spells into scrolls.

Of course, the makers of these scrolls are not necessarily their users; they may be made by others, and users obtain them by purchase or by free download.

There are also some well-known scroll guilds, which we call open-source platforms, where scrolls are available for free and the spell details are open for any viewer to inspect.

## Magic Arrays

The essence of a magic array is still a spell, so anything done with visual programming can still be achieved by writing code by hand; the array merely offers some advantages in development efficiency.

## Silent Casting

A caster can recite a spell in their mind and, by casting silently, release the spell very quickly.

## Cooperative Casting

At this point, on seeing this picture, some friends may think of a basic question.

## Elemental Mages

0 and 1 combine into further logical elements: concepts such as AND, OR, NOT, and XOR. One layer up are the "assembly language" elements built from CPU instructions and these logical symbols: shift left, shift right, store, copy, and so on.

The destruction magic of the dark school corresponds to hackers, and the healing magic of the light school corresponds to defensive white hats.

It can create illusions or destroy heaven and earth; it can turn stone into gold, or strike from a thousand miles away.

Original: https://www.cnblogs.com/7rhythm/p/10428889.html
Author: 鬼柒
Title: Programming Languages Are Actually Magic Spells!

