
ML Course Review for Neural Networks

how to deal with non-linear data

  • explicitly project features to a higher-dimensional space
  • use an SVM with a kernel function (implicitly)

use \(\tanh(x)\) instead of \(\mathrm{sign}(x)\) to enable gradient descent (the sign function has zero gradient almost everywhere)

patterns in backpropagation

  • add: gradient distributor (both inputs receive the upstream gradient)
  • max: gradient router (only the larger input receives the gradient)
  • mul: gradient switcher (each input's gradient is the upstream gradient scaled by the other input)
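
A minimal sketch of these three patterns on scalar inputs (variable names are illustrative, not from the course):

```python
x, y = 2.0, 5.0
grad_out = 1.0  # upstream gradient dL/dz, taken as 1 for illustration

# add: z = x + y -> both inputs receive the upstream gradient unchanged (distributor)
dx_add, dy_add = grad_out, grad_out

# max: z = max(x, y) -> only the larger input receives the gradient (router)
dx_max = grad_out if x >= y else 0.0
dy_max = grad_out if y > x else 0.0

# mul: z = x * y -> each input's gradient is the upstream gradient times the OTHER input (switcher)
dx_mul, dy_mul = grad_out * y, grad_out * x

print(dx_add, dy_add)  # 1.0 1.0
print(dx_max, dy_max)  # 0.0 1.0
print(dx_mul, dy_mul)  # 5.0 2.0
```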

without padding, border pixels contribute to fewer output activations than interior pixels

receptive field

  • the region of the input space that affects a particular unit of the network
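  • e.g. stacking \(n\) 3×3 convolutions with stride 1 gives a receptive field of \((2n + 1) \times (2n + 1)\) on the input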

pooling

  • can reduce the spatial dimensions of the feature maps and save computation
  • a reduction operation such as max or average
  • the number of trainable parameters is 0
  • enlarges receptive fields
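
A small NumPy sketch of max pooling (2×2 window, stride 2) on a single feature map, purely illustrative:

```python
import numpy as np

def max_pool2d(x, k=2):
    """k x k max pooling with stride k on an (H, W) feature map; no trainable parameters."""
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
pooled = max_pool2d(fmap)
print(pooled.shape)  # (2, 2): spatial dimensions halved, so later layers see larger receptive fields
```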

why ReLU is popular

  • the gradient does not vanish for positive inputs (it is exactly 1 there)

attention layer

  • different queries produce different attention weights
  • ideas:
    • input: a query vector \(q_j\), key vectors \(k_i\), and value vectors \(v_i\)
    • similarity scores are computed as \(q_j \cdot k_i\) (query against each key)
    • the similarity scores are normalized with softmax to obtain attention weights
    • the weights are applied to the value vectors → the output is their weighted sum (sketched below)
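
A NumPy sketch of the steps above (unscaled dot-product attention; the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Dot-product attention: scores = Q K^T, softmax over keys, weighted sum of values."""
    scores = Q @ K.T                                # similarity of each query with every key
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: weights for each query sum to 1
    return weights @ V                              # weighted sum of value vectors

Q = np.random.randn(2, 8)    # 2 queries
K = np.random.randn(5, 8)    # 5 keys
V = np.random.randn(5, 16)   # 5 values
print(attention(Q, K, V).shape)  # (2, 16): one output per query
```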

limitation of convolution layer

  • convolution extracts local features → capturing longer-range features requires stacking multiple layers
  • stacking many layers risks over-fitting and is inefficient
  • so we need a global feature extractor

compare different layers

  • convolution vs fully-connected
    • a convolution layer is a special case of a fully-connected layer (sketched after this list)
    • with sparse connections
    • and weight sharing
    • convolution constrains the weights so that local features are extracted
  • attention layer
    • the weights are not trainable but are computed from similarities
  • summary:
    • all are matrix multiplications → linear combinations of features
    • FC: trained weights
    • Conv: trained but constrained weights
    • Attn: computed weights
  • pooling: can perform non-linear operations (e.g. max)
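
A sketch of "conv is a special FC": a 1D convolution written as multiplication by a matrix that is sparse (local support) and weight-shared (the same kernel on every row); the kernel and sizes are illustrative:

```python
import numpy as np

x = np.random.randn(6)
kernel = np.array([1.0, -2.0, 1.0])      # kernel size 3, no padding, stride 1
out_len = len(x) - len(kernel) + 1

W = np.zeros((out_len, len(x)))
for i in range(out_len):
    W[i, i:i + len(kernel)] = kernel      # constrained weights: zeros elsewhere, kernel repeated

conv_as_fc = W @ x                                                     # FC view
direct_conv = np.array([kernel @ x[i:i + 3] for i in range(out_len)])  # sliding-window view
print(np.allclose(conv_as_fc, direct_conv))  # True
```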

CNN models

  • Why can't VGG go deep?
    • repeated matrix multiplications cause gradients to explode (or vanish)
    • successive convolutions compound the effect
  • ResNet
    • skip connections → a gradient highway around the convolutions
    • element-wise feature summation
  • DenseNet
    • uses feature concatenation instead of summation
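
A toy sketch of the structural difference (the `block` function is a stand-in for conv + ReLU, not the real architectures):

```python
import numpy as np

def block(x, W):
    """Stand-in for a conv layer followed by ReLU."""
    return np.maximum(0.0, W @ x)

x = np.random.randn(8)
W1, W2 = np.random.randn(8, 8), np.random.randn(8, 8)

# ResNet: element-wise summation through a skip connection; gradients can bypass the blocks.
resnet_out = x + block(block(x, W1), W2)          # shape stays (8,)

# DenseNet: concatenate input and new features; the channel dimension grows instead.
densenet_out = np.concatenate([x, block(x, W1)])  # shape (16,)

print(resnet_out.shape, densenet_out.shape)
```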

transfer learning

  • useful for small datasets
  • what is transferred?
    • well-trained feature extractors
    • basic features like edges and shapes
  • train on a large dataset
  • fine-tune the weights on the new (small) dataset
  • why not transfer the FC layer?
    • the weights in the FC layer encode how the model arranges and uses the extracted features, which differs between tasks
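
A common recipe, sketched with torchvision's ResNet-18 (the weight name and class count below are assumptions, and older torchvision versions use `pretrained=True` instead of `weights=`):

```python
import torch.nn as nn
import torchvision.models as models

# Feature extractor trained on a large dataset (ImageNet).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the transferred convolutional feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace only the FC head: it encodes how features are used, which is task-specific.
num_classes = 10  # hypothetical new (small) dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optionally unfreeze some later layers to fine-tune the transferred weights as well.
```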

data augmentation and dropout

the shared idea behind both: regularization

data augmentation

  • increases the effective size of the dataset
  • limitation: the augmented data carry the same labels as the originals
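
A NumPy sketch of a few label-preserving transforms on one (hypothetical) image tensor:

```python
import numpy as np

img = np.random.rand(32, 32, 3)   # H x W x C image in [0, 1]

flipped = img[:, ::-1, :]         # horizontal flip
dy, dx = np.random.randint(0, 9, size=2)
padded = np.pad(img, ((4, 4), (4, 4), (0, 0)), mode="reflect")
cropped = padded[dy:dy + 32, dx:dx + 32, :]                          # random shifted crop
noisy = np.clip(img + 0.05 * np.random.randn(*img.shape), 0.0, 1.0)  # small additive noise

# All three new samples keep the original label -- the limitation noted above.
```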

dropout

  • why it's good
    • training with dropout is essentially training a large ensemble of sub-models
    • adds robustness
  • randomly drop neurons during training but keep all of them at test time
  • one problem remains: the expectation of a neuron's output differs between training and testing
    • if the dropout rate is \(p\) and \(E[a]\) is the expectation at test time, then during training the expectation is \((1-p) E[a]\)
    • how to fix? multiply the activations by \(1-p\) at test time
    • why is this a problem?
      • for example, a threshold learned during training might not work at test time because of the shifted expectation
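
A NumPy sketch of the scheme above: drop with probability \(p\) during training, scale by \(1 - p\) at test time so the expectations match:

```python
import numpy as np

p = 0.5  # dropout rate

def dropout_train(a, p):
    """Training: zero each neuron with probability p, so E[output] = (1 - p) * E[a]."""
    mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
    return a * mask

def dropout_test(a, p):
    """Testing: keep every neuron but scale by (1 - p) to match the training expectation."""
    return a * (1.0 - p)

a = np.ones(100_000)
print(dropout_train(a, p).mean())  # ~0.5 on average
print(dropout_test(a, p).mean())   # exactly 0.5
```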

weights initialization and batch normalization

weights initialization

  • Xavier initialization
    • set the variance of the weights to the reciprocal of the number of input features: \(\frac{1}{fea_{in}}\)
  • Kaiming initialization
    • if ReLU is used: since about half of the activations are zeroed on average, use \(\frac{2}{fea_{in}}\)
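
A NumPy sketch of both rules for one layer (sizes are illustrative); only the target variance of the weights differs:

```python
import numpy as np

fea_in, fea_out = 256, 128

# Xavier: Var[w] = 1 / fea_in, keeps activation variance roughly constant for tanh-like units.
W_xavier = np.random.randn(fea_out, fea_in) * np.sqrt(1.0 / fea_in)

# Kaiming: Var[w] = 2 / fea_in, compensating for ReLU zeroing about half of the activations.
W_kaiming = np.random.randn(fea_out, fea_in) * np.sqrt(2.0 / fea_in)

print(W_xavier.var(), W_kaiming.var())  # ≈ 1/256 and ≈ 2/256
```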

batch normalization

  • during training
    • compute mini-batch mean and variance
    • normalize
    • scale and shift using learnable per-channel parameters (\(\gamma\) and \(\beta\))
  • at test time
    • the running (empirical) mean and std accumulated during training are used, kept fixed
  • why normalize and then scale/shift
    • the scale and shift parameters can then be shared across different batches
  • benefits
    • improves gradient flow through the network
    • allows higher learning rates
    • reduces the strong dependence on initialization
  • note that each input channel needs its own normalization statistics and parameters
    • for \(C\) input channels, we need \(C\) sets of means, variances, scales, and shifts
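
A NumPy sketch of the training vs. test behaviour for an \((N, C)\) input; `gamma` and `beta` are the per-channel scale and shift parameters:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Training: normalize with mini-batch statistics, then scale and shift per channel."""
    mu, var = x.mean(axis=0), x.var(axis=0)          # one mean/variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Keep running estimates for test time.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return gamma * x_hat + beta, running_mean, running_var

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test: use the fixed running mean/variance accumulated during training."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

N, C = 32, 4                          # C channels -> C means, variances, gammas, betas
x = np.random.randn(N, C)
gamma, beta = np.ones(C), np.zeros(C)
rm, rv = np.zeros(C), np.ones(C)
y, rm, rv = batchnorm_train(x, gamma, beta, rm, rv)
y_test = batchnorm_test(x, gamma, beta, rm, rv)
```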

Graph Neural Networks

note

  • pixels have an order along the \(H\) and \(W\) dimensions but not along the \(C\) dimension, so we cannot apply a 3D convolution to a 2D input of shape \(H \times W \times C\)
  • locality is important for convolution
  • but graph data carry no such locality information
    • there is no left or right among the neighbors of a node
    • differences between images and graphs
      • the number of neighboring pixels is fixed in an image, while the number of neighboring nodes varies in a graph
      • neighboring pixels in an image are ordered by their relative positions; graph neighbors are not

ML with graphs

  • node classification
  • link prediction
  • graph classification

graph convolutional networks

\(g_{\mathrm{conv}}(H, A) = AHW\)

  • \(A\): the adjacency matrix
  • \(H\): the node feature matrix
    • \(AH\): summation of features from neighboring nodes
  • \(W\): the weight matrix
    • a transformation from input features to output features
    • i.e. a projection
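
A NumPy sketch of one layer following the formula above (practical GCNs usually also add self-loops and normalize \(A\); both are omitted here):

```python
import numpy as np

def g_conv(H, A, W):
    """One graph convolution: aggregate neighbor features (A @ H), then project them (@ W)."""
    return A @ H @ W

# Tiny 3-node graph with edges 0-1 and 1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)     # node features, 4 input features per node
W = np.random.randn(4, 2)     # projection from 4 input features to 2 output features
print(g_conv(H, A, W).shape)  # (3, 2): new features for each node
```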