Artificial Intelligence: Week 4 Notes

Artificial Intelligence and Applications, Notes 4

Week 4

The other groups presented the instructor's lecture slides.
Everyone has been watching Mofan (莫凡)'s tutorials on neural networks.

The CSM Algorithm

The procedure for adding nodes (7 steps)

The proposed CSM learning procedure

  • Determine whether the case is familiar

  • When we encounter a new case (a new input/output relationship), we first check if it is familiar to us.

    • If it is, there is no spontaneous learning effort involved. Later the new case is merged into our knowledge system.
    • If it is not, we might cram this unfamiliar case first. The cramming results in a strict rule with respect to this unfamiliar case. Then we will soften the strictness of the new case and do our best to merge the new case into our knowledge system.

    • If the incoming problem involves data we have seen before, the relevant stored information is used to respond.

    • If the incoming problem involves data we have never seen before, a new response is added instead (see the Python sketch after this list).
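A minimal Python sketch of this case-handling flow. The functions is_familiar, cram, soften, and merge are placeholders for the mechanisms described later in these notes, not actual library calls:

```python
def handle_new_case(case, knowledge, is_familiar, cram, soften, merge):
    """Dispatch one new (input, output) case through the CSM flow.

    is_familiar / cram / soften / merge are placeholders for the
    mechanisms described later in these notes (Steps 4, 6, 7.1, 7.2-7.4).
    """
    if is_familiar(case, knowledge):
        # Familiar case: no spontaneous learning effort;
        # the case is simply merged into the knowledge system.
        return merge(case, knowledge)
    # Unfamiliar case: cram it first (a strict rule for this one case),
    # then soften that strictness and merge it into the knowledge system.
    knowledge = cram(case, knowledge)
    knowledge = soften(knowledge)
    return merge(case, knowledge)
```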

ASLFN (a three-layer neural network: input → hidden → output layer)

  • The Adaptive Single-hidden Layer Feed-forward Neural Network (ASLFN; i.e., the number of adopted hidden nodes is variable) (references: Tsaih 1993; Tsaih 1998; Tsaih and Cheng 2009; Huang, Yu, Tsaih and Huang 2014; Kuo, Lin, and Hsu 2018)
  • A 2-class categorization problem with binary inputs in $\{-1, 1\}^m$

Activation Function

  • tanh, the hyperbolic tangent activation function, is used in all hidden nodes.
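For reference, the hyperbolic tangent squashes any real input into the open interval $(-1, 1)$, which matches the desired outputs $\pm 1.0$ used below:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \in (-1, 1)$$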

The Activation Value of Hidden Nodes

  • The activation function must not be chosen carelessly; otherwise exploding gradients can occur.

$\mathbf{x}^c \equiv (x_1^c, x_2^c, \ldots, x_m^c)^T$: the input vector of the $c^{th}$ case

$w_{i0}^{H}$: the threshold value of the $i^{th}$ hidden node

$w_{ij}^{H}$: the weight between the $i^{th}$ hidden node and the $j^{th}$ input node

$\mathbf{w}_i^{H} \equiv (w_{i0}^{H}, w_{i1}^{H}, w_{i2}^{H}, \ldots, w_{im}^{H})^T$; $\mathbf{w}^{H} \equiv ({\mathbf{w}_1^{H}}^T, {\mathbf{w}_2^{H}}^T, \ldots, {\mathbf{w}_p^{H}}^T)^T$
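The slides' formula for the hidden-node activation value is not copied above; a plausible reconstruction, assuming the standard tanh hidden layer with the weights just defined, is:

$$a_i^c = \tanh\left(w_{i0}^{H} + \sum_{j=1}^{m} w_{ij}^{H}\, x_j^{c}\right), \qquad i = 1, \ldots, p$$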

The Activation Value of the Output Node
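The slides' formula is likewise not copied here; a sketch consistent with Step 7.2 below (where $f(\mathbf{x}^c, \mathbf{w}_k') \equiv f(\mathbf{x}^c, \mathbf{w}) - w_k^{o} a_k^{c}$) is a linear output node:

$$f(\mathbf{x}^c, \mathbf{w}) = w_0^{o} + \sum_{i=1}^{p} w_i^{o}\, a_i^{c}$$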

Parameters and Indexes

  • N denotes the number of all reference observations
  • m denotes the number of input nodes
  • p denotes the number of adopted hidden nodes; p equals 1 at the beginning and is adaptive
  • $y^c$ denotes the desired output of the $c^{th}$ case, with 1.0 and -1.0 being the desired outputs of classes 1 and 2, respectively
  • n denotes the $n^{th}$ stage of handling n reference observations $\{(\mathbf{x}^1, y^1), (\mathbf{x}^2, y^2), \ldots, (\mathbf{x}^n, y^n)\}$, and $\mathbf{I}(n)$ is the set of indices of these observations.
  • At the $n^{th}$ stage, the loss function is $E_n(\mathbf{w}) \equiv \frac{\sum_{c=1}^{n}(f(\mathbf{x}^c, \mathbf{w}) - y^c)^2}{n} + 10^{-3}\|\mathbf{w}\|^2$ (see the NumPy sketch after this list)
  • $\mathbf{I}(n) \equiv \mathbf{I}_1(n) \cup \mathbf{I}_2(n)$, where $\mathbf{I}_1(n)$ and $\mathbf{I}_2(n)$ are the sets of indices of the n given cases in classes 1 and 2, respectively
  • At the $n^{th}$ stage with the reference observations $\{(\mathbf{x}^c, y^c): c \in \mathbf{I}(n)\}$, we look for an acceptable SLFN, in which the condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$ is satisfied.
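A minimal NumPy sketch of the SLFN output and the loss $E_n(\mathbf{w})$, under the assumptions above (tanh hidden layer, linear output node, $10^{-3}\|\mathbf{w}\|^2$ regularization); the weight layout is an assumption made only for illustration:

```python
import numpy as np

def forward(x, w_hidden, w_out):
    """SLFN output f(x, w): tanh hidden layer, linear output node.

    x        : (m,) input vector
    w_hidden : (p, m + 1) hidden weights; column 0 holds the thresholds w_i0^H
    w_out    : (p + 1,) output weights; w_out[0] is the output threshold
    """
    a = np.tanh(w_hidden[:, 0] + w_hidden[:, 1:] @ x)   # hidden activations a_i^c
    return w_out[0] + w_out[1:] @ a                      # linear output node

def loss_En(X, y, w_hidden, w_out):
    """E_n(w): mean squared error over the n reference observations
    plus the 1e-3 * ||w||^2 regularization term."""
    preds = np.array([forward(x, w_hidden, w_out) for x in X])
    mse = np.mean((preds - y) ** 2)
    reg = 1e-3 * (np.sum(w_out ** 2) + np.sum(w_hidden ** 2))
    return mse + reg
```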

The condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$
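The slide spelling out condition L is not reproduced in these notes. Based on the learning goal and Step 7.2 below, a plausible reading (an assumption, not a statement from the slides) is that the outputs of the two classes are separated:

$$\alpha \equiv \min_{c \in \mathbf{I}_1(n)} f(\mathbf{x}^c, \mathbf{w}) \;>\; \beta \equiv \max_{c \in \mathbf{I}_2(n)} f(\mathbf{x}^c, \mathbf{w})$$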

The Learning Goal

  • At the $n^{th}$ stage, through minimizing the loss function
  • $E_n(\mathbf{w}) \equiv \frac{\sum_{c=1}^{n}(f(\mathbf{x}^c, \mathbf{w}) - y^c)^2}{n} + 10^{-3}\left(\sum_{i=0}^{p}(w_i^{o})^2 + \sum_{i=1}^{p}\sum_{j=0}^{m}(w_{ij}^{H})^2\right)$, the learning goal is to seek $\mathbf{w}$ where $f(\mathbf{x}^c, \mathbf{w}) > \nu, \forall c \in \mathbf{I}_1(n)$ and $f(\mathbf{x}^c, \mathbf{w}) \le -\nu, \forall c \in \mathbf{I}_2(n)$, with $1 > \nu > 0$. An alternative goal of learning is to seek $\mathbf{w}$ that satisfies the condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$ (see the sketch after this list).
  • When $\alpha > \beta$ is satisfied, $f(\mathbf{x}^c, \mathbf{w}) \ge \nu, \forall c \in \mathbf{I}_1(n)$ and $f(\mathbf{x}^c, \mathbf{w}) \le -\nu, \forall c \in \mathbf{I}_2(n)$ can be achieved by directly adjusting $\mathbf{w}^{o}$ accordingly (the adjustment formula from the slides is not copied into these notes).
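A small sketch of checking this learning goal, reusing forward() from the earlier NumPy sketch; nu stands for the margin $\nu$ and its value here is only illustrative. This is not the $\mathbf{w}^{o}$ adjustment rule mentioned above, which is not reproduced in these notes.

```python
def learning_goal_satisfied(X1, X2, w_hidden, w_out, nu=0.5):
    """Learning goal: f(x, w) > nu for every class-1 case and
    f(x, w) <= -nu for every class-2 case (0 < nu < 1)."""
    f1 = [forward(x, w_hidden, w_out) for x in X1]   # class-1 outputs
    f2 = [forward(x, w_hidden, w_out) for x in X2]   # class-2 outputs
    return all(v > nu for v in f1) and all(v <= -nu for v in f2)
```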

The Proposed CSM Learning Algorithm

  • Step 1: Initialize an SLFN with one hidden node and a randomized $\mathbf{w}$. Obtain the first reference observation $(\mathbf{x}^1, y^1)$. Set $n = 2$.

  • Step 2: If $n > N$, STOP.

  • Step 3: Present the n reference observations $\{(\mathbf{x}^c, y^c): c \in \mathbf{I}(n)\}$.

  • Step 4: If the condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$ is satisfied, go to Step 7.1.

  • Step 5: Save $\mathbf{w}$.

  • Step 6: Apply the weight-tuning mechanism to $\min_{\mathbf{w}} E_n(\mathbf{w})$ to adjust $\mathbf{w}$ until one of the following two cases occurs:

    • a. If the condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$ is satisfied, go to Step 7.1.
    • b. If an unacceptable result is obtained, then
      • Restore $\mathbf{w}$
      • Let $p + 1 \to p$ and add to the existing SLFN a new $p^{th}$ hidden node with the following $\mathbf{w}_p^{H}$ and $\mathbf{w}_p^{o}$ (the cramming weights from the slides are not copied into these notes):

  • Step 7.1: Apply the weight-tuning mechanism one hundred times to minimize $E_n(\mathbf{w})$ and adjust $\mathbf{w}$, while keeping the condition L regarding $\{f(\mathbf{x}^c, \mathbf{w}), \forall c \in \mathbf{I}(n)\}$ satisfied.
  • Step 7.2: Calculate $g_k', \forall k$, where $\mathbf{w}_k' \equiv \mathbf{w} - \{w_k^{o}, \mathbf{w}_k^{H}\}$, $f(\mathbf{x}^c, \mathbf{w}_k') \equiv f(\mathbf{x}^c, \mathbf{w}) - w_k^{o} a_k^{c}$, $\alpha_k' \equiv \min_{c \in \mathbf{I}_1(n)} f(\mathbf{x}^c, \mathbf{w}_k')$, $\beta_k' \equiv \max_{c \in \mathbf{I}_2(n)} f(\mathbf{x}^c, \mathbf{w}_k')$, and $g_k' \equiv \alpha_k' - \beta_k'$ (see the NumPy sketch after this list).
  • Step 7.3: If $-\theta > \max_{1 \le k \le p} g_k'$, where $\theta$ is a given constant, go to Step 2.
  • Step 7.4: If $\max_{1 \le k \le p} g_k' > 0$, prune the $i^{th}$ hidden node, where $i$ is the first index of $\arg\max_{1 \le k \le p} g_k'$; let $p - 1 \to p$ and $\mathbf{w}_i' \to \mathbf{w}$, and go to Step 7.1.
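A NumPy sketch of Steps 7.2 to 7.4 (computing $g_k'$ and the pruning decision), reusing forward() and the weight layout assumed in the earlier sketch. theta is an assumed constant, and the remaining case $-\theta \le \max_k g_k' \le 0$ (handled by the Step 7.5 referenced in the explanation below, which is not listed here) is left out:

```python
def g_prime(k, X1, X2, w_hidden, w_out):
    """g_k' from Step 7.2: class separation of f(x, w_k'), the network
    output with the k-th hidden node (1-indexed) removed."""
    def f_without_k(x):
        a_k = np.tanh(w_hidden[k - 1, 0] + w_hidden[k - 1, 1:] @ x)
        return forward(x, w_hidden, w_out) - w_out[k] * a_k
    alpha_k = min(f_without_k(x) for x in X1)   # alpha_k': min over class 1
    beta_k = max(f_without_k(x) for x in X2)    # beta_k' : max over class 2
    return alpha_k - beta_k

def prune_if_irrelevant(X1, X2, w_hidden, w_out, theta=0.05):
    """Steps 7.3-7.4: if some g_k' > 0, prune that node (p - 1 -> p);
    if every g_k' < -theta, keep all nodes and move on."""
    p = w_hidden.shape[0]
    g = [g_prime(k, X1, X2, w_hidden, w_out) for k in range(1, p + 1)]
    i = int(np.argmax(g))                       # first index of arg max g_k'
    if g[i] > 0:                                # an irrelevant hidden node
        w_hidden = np.delete(w_hidden, i, axis=0)
        w_out = np.delete(w_out, i + 1)         # keep the output threshold w_out[0]
        return w_hidden, w_out, True            # pruned: go back to Step 7.1
    return w_hidden, w_out, False               # nothing pruned
```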

Flowchart of the proposed algorithm

Explanation of the Proposed CSM Learning Algorithm

  • Step 6 conducts the cramming mechanism.
  • The Appendix shows the properness of the cramming mechanism.
  • All hidden nodes use the same activation function, but some of them exhibit heterogeneity due to their large associated weights.
  • Step 7.1 and Step 7.5.c are designed to soften the heterogeneity.
  • At Step 6, Step 7.1 and Step 7.5.c, the weight-tuning mechanism is applied to minimize $E_n(\mathbf{w})$ and adjust the weights; note that $E_n(\mathbf{w})$ contains the regularization term.
  • The total number of used hidden nodes will be large if new hidden nodes are added frequently.
  • Step 7.2 to Step 7.5 are designed to merge the unfamiliar case into the knowledge system by reducing the total number of used hidden nodes.
  • The pruning-issue file shows the logic of the merging mechanism.

The irrelevant hidden node

The irrelevance examination mechanism

No under-fitting

The Regularization term in $𝐸_𝑛(𝐰)$

  • Regarding the overfitting issue, the regularization term in the loss function $E_n(\mathbf{w})$ may prevent the model from fitting the training data too closely.

Notes

  • The algorithm's stopping criterion is not necessarily related to the learning goal.

Cramming

Softening

Merging