deal with non-linear data
- explicitly project features to a higher-dimensional space
- use an SVM with a kernel function (implicit projection)
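A minimal numpy sketch (my own toy example) contrasting an explicit degree-2 feature projection with the equivalent polynomial kernel computed implicitly:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D input: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: computes phi(x).phi(z) without ever forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both give the same similarity, but the kernel never builds the high-dimensional vectors.
print(np.dot(phi(x), phi(z)))   # explicit projection
print(poly_kernel(x, z))        # implicit (kernel trick)
```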
use tanh(x) instead of sign(x) to enable gradient descent
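A tiny numpy check (my own illustration) of why: sign(x) has zero gradient almost everywhere, while tanh(x) gives a usable gradient everywhere:

```python
import numpy as np

x = np.linspace(-3, 3, 7)

# d/dx sign(x) is 0 everywhere except at x = 0 (where it is undefined),
# so gradient descent gets no signal from a sign activation.
grad_sign = np.zeros_like(x)

# d/dx tanh(x) = 1 - tanh(x)^2 is non-zero everywhere, so gradients can flow.
grad_tanh = 1.0 - np.tanh(x) ** 2

print(grad_sign)
print(grad_tanh)
```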
patterns in backpropagation
- add: gradient distributor (passes the upstream gradient to both inputs unchanged)
- max: gradient router (routes the gradient only to the input that was larger)
- mul: gradient switcher (each input's gradient is the upstream gradient scaled by the other input)
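A small numpy sketch of these three backward rules (function names are mine):

```python
def add_backward(grad_out):
    # add: gradient distributor, both inputs get the upstream gradient unchanged
    return grad_out, grad_out

def max_backward(a, b, grad_out):
    # max: gradient router, only the winning input receives the gradient
    return (grad_out, 0.0) if a >= b else (0.0, grad_out)

def mul_backward(a, b, grad_out):
    # mul: gradient switcher, each input's gradient is scaled by the other input
    return grad_out * b, grad_out * a

print(add_backward(1.0))            # (1.0, 1.0)
print(max_backward(2.0, 5.0, 1.0))  # (0.0, 1.0)
print(mul_backward(2.0, 5.0, 1.0))  # (5.0, 2.0)
```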
without padding, border pixels contribute to fewer output positions than interior pixels
receptive field
- the region of the input space that affects a particular unit of the network
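A small helper (my own sketch, using the standard recurrence; the layer configuration is made up) for tracking the receptive field of stacked conv/pool layers:

```python
# Receptive field of stacked layers, using the standard recurrence:
#   rf_out   = rf_in + (kernel - 1) * jump_in
#   jump_out = jump_in * stride
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# e.g. two 3x3 convs (stride 1) followed by a 2x2 max-pool (stride 2)
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # 6
```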
pooling
- reduces the spatial dimensions of the feature maps and saves computation
- a reduction operation such as max or average
- has no trainable parameters
- enlarges the receptive field
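A minimal numpy sketch of 2x2 max pooling (array contents are arbitrary):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map; no trainable parameters."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))  # 4x4 -> 2x2, keeps the max of each 2x2 block
```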
why ReLU is popular
- does not saturate for positive inputs, so the gradient does not vanish there
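A quick numpy comparison (illustrative values) of activation gradients: sigmoid and tanh saturate for large \(|x|\), while ReLU's gradient stays 1 for any positive input:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.5, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
grad_sigmoid = sig * (1.0 - sig)        # ~0 for large |x|: gradient vanishes
grad_tanh = 1.0 - np.tanh(x) ** 2       # also ~0 for large |x|
grad_relu = (x > 0).astype(float)       # exactly 1 for every positive input

print(grad_sigmoid)  # [~0, 0.197, 0.235, ~0]
print(grad_tanh)     # [~0, 0.420, 0.786, ~0]
print(grad_relu)     # [0., 0., 1., 1.]
```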
attention layer
- different queries have different attentions
- ideas:
- input: query vector \(q_j\), key vectors \(k_i\), and value vectors \(v_i\)
- similarity scores are computed using \(q_j \cdot k_i\) (query and key)
- similarity scores are normalized with softmax to obtain attention weights
- weights are applied to the value vectors –> weighted sum is the output
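A minimal numpy sketch of the dot-product attention described above, for a single query (the toy keys and values are made up; the \(\sqrt{d}\) scaling used in Transformers is omitted since the notes do not mention it):

```python
import numpy as np

def attention(q, K, V):
    """Dot-product attention for one query q against key matrix K and value matrix V."""
    scores = K @ q                      # similarity of the query with every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax -> normalized attention weights
    return weights @ V                  # weighted sum of the value vectors

K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 keys of dim 2
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 values of dim 2
q = np.array([1.0, 0.0])

print(attention(q, K, V))  # output leans toward values whose keys are similar to q
```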
limitation of convolution layer
- convolution extracts local features –> capturing larger-range features requires stacking many layers
- which leads to over-fitting and inefficiency
- so we need a global feature extractor
compare different layers
- convolution vs fully-connected
- conv is a special FC
- with sparse connections
- and weight sharing
- conv applies constraints on the weights to extract local features
- attention layer
- the combination weights are not trained directly but computed from query/key similarities
- summary:
- all are matrix multiplications –> linear combinations of features
- FC: trained weights
- Conv: trained but constrained weights
- Attn: computed weights
- pooling: can perform non-linear operations
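To make "conv is a special FC" concrete, a small numpy illustration (toy sizes, my own code) writes a 1-D convolution as multiplication by a sparse, weight-shared matrix:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # 1-D input of length 5
w = np.array([0.2, 0.5, 0.3])              # conv kernel of size 3 (no padding)

# Plain convolution (correlation) output: length 3
conv_out = np.array([w @ x[i:i + 3] for i in range(3)])

# The same operation as a fully-connected layer whose weight matrix is
# sparse (zeros off the band) and weight-shared (same kernel in every row).
W_fc = np.array([
    [0.2, 0.5, 0.3, 0.0, 0.0],
    [0.0, 0.2, 0.5, 0.3, 0.0],
    [0.0, 0.0, 0.2, 0.5, 0.3],
])
fc_out = W_fc @ x

print(np.allclose(conv_out, fc_out))  # True
```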
CNN models
- Why can’t VGG go deep?
- successive convolutions are repeated matrix multiplications, which make the gradients explode (or vanish)
- ResNet
- skip connections –> gradient highway
- element-wise feature summation
- DenseNet
- uses feature concatenation instead of summation
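A minimal PyTorch sketch (my own toy block definitions, not the original architectures) contrasting the residual block's element-wise summation with the dense block's channel concatenation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style block: output = F(x) + x (element-wise summation, gradient highway)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)                    # skip connection: add

class DenseBlock(nn.Module):
    """DenseNet-style block: output = concat(x, F(x)) along the channel dimension."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        new_features = self.relu(self.conv(x))
        return torch.cat([x, new_features], dim=1)   # skip connection: concatenate

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)   # torch.Size([1, 16, 8, 8])  (channels unchanged)
print(DenseBlock(16, 8)(x).shape)   # torch.Size([1, 24, 8, 8])  (channels grow by 8)
```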
transfer learning
- useful when the target dataset is small
- what is transferred?
- well-trained feature extractors
- basic features like edges and shapes
- pre-train on a large dataset
- fine-tune the weights on the new (small) dataset
- why not transfer the FC?
- the weights in the FC layer show how the model arranges and uses the extracted features, which should be different for different tasks.
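A schematic PyTorch sketch of the recipe (the feature extractor here is a stand-in, not a real pretrained network): keep and freeze the transferred backbone, replace the FC head, and train only the head on the new data:

```python
import torch
import torch.nn as nn

# Stand-in for a feature extractor pre-trained on a large dataset
# (in practice this would be a pretrained CNN backbone).
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze the transferred weights: basic features (edges, shapes) are reused as-is.
for p in feature_extractor.parameters():
    p.requires_grad = False

# The FC head encodes how the features are arranged and used for a specific task,
# so it is replaced and trained from scratch on the new (small) dataset.
num_new_classes = 5
new_head = nn.Linear(16, num_new_classes)

model = nn.Sequential(feature_extractor, new_head)

# Only the new head's parameters are optimized.
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 5])
```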
data augmentation and dropout
idea of regularization
data augmentation
- increase the size of the dataset
- limitation: the augmented data have the same labels as the originals
dropout
- why it’s good
- dropout effectively trains a large ensemble of sub-networks
- robustness
- randomly drop neurons during training but don’t drop during testing
- one problem remains: the expected activation differs between training and testing
- if the dropout rate is \(p\) and \(E[a]\) is the expected activation at test time (no dropping), then during training the expectation is \((1-p) E[a]\)
- how to fix? multiply the activations by \(1-p\) at test time
- why is this a problem?
- for example, a threshold learned during training might not work at test time because the activation scale differs
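A small numpy sketch of the scaling fix (the "inverted dropout" variant at the end is a standard alternative not covered in the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                      # dropout rate
a = np.ones(100_000)         # pretend activations with E[a] = 1

# Training: each neuron is kept with probability (1 - p)
mask = rng.random(a.shape) > p
train_out = a * mask
print(train_out.mean())      # ~ (1 - p) * E[a] = 0.5

# Fix from the notes: multiply by (1 - p) at test time so the expectations match
test_out = a * (1 - p)
print(test_out.mean())       # 0.5, matches the training expectation

# Equivalent alternative ("inverted dropout"): rescale during training instead,
# so nothing needs to change at test time.
train_out_inverted = a * mask / (1 - p)
print(train_out_inverted.mean())  # ~ 1.0 = E[a]
```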
weight initialization and batch normalization
weight initialization
- Xavier initialization
- set the variance of the weights to the inverse of the number of input features: \(\mathrm{Var}(w) = \frac{1}{\text{fea}_{in}}\)
- Kaiming initialization
- if ReLU is used: since about half of the activations are zeroed on average, double the variance: \(\mathrm{Var}(w) = \frac{2}{\text{fea}_{in}}\)
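A quick numpy check (toy layer sizes, my own code) that the \(\frac{1}{\text{fea}_{in}}\) and \(\frac{2}{\text{fea}_{in}}\) scalings keep activations from shrinking or exploding with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
fea = 512
x = rng.standard_normal((10_000, fea))

def run(scale, depth=10):
    """Push activations through `depth` ReLU layers with Var(w) = scale / fea_in."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((fea, fea)) * np.sqrt(scale / fea)
        h = np.maximum(h @ W, 0.0)
    return h.var()

# Xavier scaling (1 / fea_in): ReLU halves the signal each layer -> activations shrink
print(run(1.0))   # tiny (roughly halved at every layer)
# Kaiming scaling (2 / fea_in): the factor 2 compensates for ReLU -> stable variance
print(run(2.0))   # stays on the order of the input variance, regardless of depth
```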
batch normalization
- during training
- compute mini-batch mean and variance
- normalize
- scale and shift with learnable parameters (\(\gamma\), \(\beta\))
- at test time
- the fixed running (empirical) mean and std collected during training are used
- why normalize then shift
- the scale and shift parameters can then be shared by different batches
- benefits
- improves gradient flow through the network
- allows higher learning rates
- reduces the strong dependence on initialization
- note that each input channel is normalized separately
- for \(C\) input channels, we need \(C\) sets of batch-norm statistics and parameters
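A small numpy sketch of batch normalization for a single channel (for \(C\) channels this is repeated per channel); variable names are mine:

```python
import numpy as np

eps = 1e-5
gamma, beta = 1.0, 0.0          # learnable scale and shift (one pair per channel)
running_mean, running_var = 0.0, 1.0
momentum = 0.1

def batchnorm_train(x):
    """Training: normalize with mini-batch statistics, then scale and shift."""
    global running_mean, running_var
    mu, var = x.mean(), x.var()
    # keep running estimates for use at test time
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def batchnorm_test(x):
    """Test: use the fixed running (empirical) statistics collected during training."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(5.0, 2.0, size=64)      # one channel, batch of 64 activations
out = batchnorm_train(batch)
print(out.mean(), out.std())               # ~0, ~1 after normalization
print(batchnorm_test(rng.normal(5.0, 2.0, size=8)))
```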
Graph Neural Networks
note
- pixels are ordered along the \(H\) and \(W\) dimensions but not along the \(C\) dimension, so we cannot apply a 3D convolution to a 2D input (\(H \times W \times C\))
- locality is important for convolution
- but graph data have no such locality information
- the neighbors of a node have no "left" or "right"
- difference between image and graph
- the number of neighboring pixels is fixed in an image but varies across nodes in a graph
- neighboring pixels in an image are ordered by their relative positions; neighbors in a graph are not
ML with graphs
- node classification
- link prediction
- graph classification
graph convolutional networks
\(g_{\mathrm{conv}}(H, A) = AHW\)
- \(A\): adjacency matrix
- \(H\): node feature matrix
- \(AH\): sums the features of neighboring nodes
- \(W\): weight matrix
- a transformation (projection) from input features to output features
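A small numpy example of \(AHW\) on a toy 3-node graph (all matrices made up):

```python
import numpy as np

# Toy graph: 3 nodes, edges 0-1 and 1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)     # adjacency matrix

H = np.array([[1.0, 0.0],                  # node feature matrix, 2 features per node
              [0.0, 1.0],
              [1.0, 1.0]])

W = np.array([[1.0, -1.0, 0.5],            # weights: project 2 input features to 3 outputs
              [0.5,  1.0, 1.0]])

AH = A @ H        # each row sums the features of that node's neighbors
out = AH @ W      # then project to the output feature space
print(AH)
print(out)        # shape: (3 nodes, 3 output features)
# Practical GCNs usually also add self-loops and normalize A, which the notes omit.
```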