5 min read
ML Review Summary
2022-03-31
Logistic Regression
- Assumptions
- independent observations
- no multicollinearity among the predictors
- binary dependent variable
- linear relationship between the log odds and the predictors
- coefficients
- a one-unit increase in a predictor adds \(\beta\) to the log odds, i.e., multiplies the odds by \(e^\beta\) (sketch below)
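A minimal sketch of reading coefficients as odds ratios with scikit-learn; the synthetic data and feature indices are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# illustrative synthetic binary-classification data
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# each coefficient beta is the change in log odds per unit increase of the feature;
# exp(beta) is the multiplicative change in the odds
for i, beta in enumerate(model.coef_[0]):
    print(f"feature {i}: log-odds change = {beta:.3f}, odds ratio = {np.exp(beta):.3f}")
```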
Random Forest
- intro
- bagging and OOB:
- bootstrap aggregation: train DTs with bootstrap samples (with replacement), aggregate by the majority vote
- out-of-bag (OOB): the rows left out of a tree's bootstrap sample; they provide a built-in, validation-like error estimate
- “random”
- random rows: bagging
- random cols: use random features in DTs’ training
- diversity: each DT is trained based on different training data and different features
- diverse committee to cast votes
- relatively robust to multicollinearity
- split criteria:
- Gini impurity vs Entropy
- best split: the candidate whose children have the lowest weighted Gini impurity (or entropy); worked example below
- each node can compute this measurement; if a split decreases it relative to the parent, make the split
- the measurement of the parent node is already computed
- the measurement of a split is the sample-weighted average of the child nodes' measurements
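A small worked example of the weighted-Gini comparison above; the class counts are made up and the helper function is just for illustration.

```python
# Gini impurity of a node given its class counts: 1 - sum(p_k^2)
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# hypothetical parent node: 40 positives, 60 negatives
parent = [40, 60]
# a candidate split producing two child nodes
left, right = [30, 10], [10, 50]

n_left, n_right = sum(left), sum(right)
n = n_left + n_right

# the split's measurement is the sample-weighted average of the children
split_gini = (n_left / n) * gini(left) + (n_right / n) * gini(right)

print(f"parent Gini: {gini(parent):.3f}")   # 0.480
print(f"split Gini:  {split_gini:.3f}")     # 0.317
print(f"decrease:    {gini(parent) - split_gini:.3f}")  # positive -> the split helps
```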
- parameters:
- number of trees; max depth; min_samples_leaf; number of features; bootstrap sample size
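A minimal sketch wiring these parameters into scikit-learn's RandomForestClassifier; the specific values are illustrative, not recommendations, and `oob_score=True` surfaces the out-of-bag estimate mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=10,          # max depth of each tree
    min_samples_leaf=5,    # minimum samples per leaf
    max_features="sqrt",   # random subset of features considered at each split
    max_samples=0.8,       # bootstrap sample size (fraction of rows per tree)
    bootstrap=True,
    oob_score=True,        # out-of-bag error estimate
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
```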
Resampling
- cross-validation:
- split the data in different ways; for each split, train and validate; aggregate validation results
- hyperparameters: nested cross-validation
- an inner cross-validation loop selects the hyperparameters; an outer loop cross-validates that whole selection strategy to confirm it generalizes (sketch after this list)
- jackknife vs bootstrap
- the bootstrap can be seen as a random approximation of the more general delete-m jackknife
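A minimal nested cross-validation sketch with scikit-learn; the random-forest model and the tiny parameter grid are just assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# inner loop: pick hyperparameters by cross-validation
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 10], "min_samples_leaf": [1, 5]},
    cv=3,
)

# outer loop: cross-validate the whole "search + refit" strategy
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```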
Missing Data
- Very interesting one (Airbnb):
- Naive ways: remove or replace with median/mean for numeric and mode for categorical
- cons: deletion leaves gaps in the data, and mean/mode imputation can introduce artificial structure
- suggested: KNN median with a special distance metric and normalizing method
- normalizing features so that both numerical and categorical features are mapped to the interval [0,1]
- how to normalize? conditional CDF given Y = 1
- impute the missing by the median of the K nearest neighbors
- proximity imputation, on-the-fly imputation (and other imputation methods for random forest models), link (paper: Tang, 2017)
- A summary
- Delete: lose information
- impute with means/medians or modes: can bias the distribution
- predict: train a model on the rows without missing values and use it to predict the missing ones
- KNN: impute using the values of the nearest neighbors (generic sketch below)
- Time-series: linear interpolation + seasonal adjustments
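A generic KNN-imputation sketch with scikit-learn's KNNImputer, not the Airbnb scheme above: KNNImputer averages the neighbors under a nan-aware Euclidean distance rather than taking the median under a custom metric; the tiny matrix assumes features already scaled to [0, 1].

```python
import numpy as np
from sklearn.impute import KNNImputer

# illustrative matrix with missing entries, features already mapped to [0, 1]
X = np.array([
    [0.1, 0.9, np.nan],
    [0.2, 0.8, 0.3],
    [0.1, np.nan, 0.4],
    [0.9, 0.1, 0.8],
])

imputer = KNNImputer(n_neighbors=2)   # impute each gap from the 2 nearest rows
print(imputer.fit_transform(X))
```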
Imbalanced Data
- intro
- under- and over-sampling
- ensemble different resampled datasets
- chunk the data of the abundant class into m chunks, combine each chunk with the rare class data to train m models; ensemble m models
- each chunk can have a different ratio between the abundant and rare class
- penalize misclassifying the rare class more heavily than misclassifying the abundant class (e.g., via class weights)
- SMOTE
- randomly select an instance of the minority class; find its k nearest minority-class neighbors; randomly pick one neighbor; create a synthetic point as a random convex combination of the instance and that neighbor (sketch below)
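A bare-bones NumPy sketch of the SMOTE step just described; a real project would more likely reach for imbalanced-learn's SMOTE, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_minority, k=5):
    """Generate one synthetic minority-class point by interpolation."""
    # randomly pick a minority instance
    x = X_minority[rng.integers(len(X_minority))]
    # find its k nearest minority neighbors (excluding itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbors = np.argsort(dists)[1 : k + 1]
    # pick one neighbor and take a random convex combination
    nb = X_minority[rng.choice(neighbors)]
    lam = rng.random()
    return x + lam * (nb - x)

X_min = rng.normal(size=(20, 2))  # illustrative minority-class points
synthetic = np.array([smote_sample(X_min) for _ in range(10)])
print(synthetic.shape)            # (10, 2)
```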
Overfitting
- lower the capacity of the model to memorize the training data, link
- reduce the number of parameters
- regularization: penalize large weights
- dropout
- 8 simple techniques
- cross-validation
- data augmentation: increase the sample size
- feature selection: reduce the number of features
- regularization
- ensembling
- bagging: a large number of strong learners (relatively unconstrained) in parallel; then combine
- boosting: weak learners in sequence; learning from the mistakes of the previous one
- overfitting, underfitting, bias, variance
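A minimal sketch of the "penalize large weights" idea: sweeping the L2 regularization strength in scikit-learn's LogisticRegression (smaller C means a stronger penalty); the values and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# few informative features, many noisy ones -> easy to overfit
X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

# smaller C => stronger L2 penalty => smaller weights, lower capacity
for C in [100.0, 1.0, 0.01]:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    print(f"C={C:>6}: CV accuracy = {scores.mean():.3f}")
```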
ROC, Precision-Recall, Specificity-Sensitivity
- link-wiki-roc
- AUC: the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative instance link-roc
- F1 score: harmonic mean of precision and recall
- it penalizes the extreme values
- accuracy is appropriate when true positives and true negatives matter most and the class distribution is roughly balanced
- F1 is the better metric when false positives and false negatives are costly or the classes are imbalanced
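A small sketch computing these metrics with scikit-learn, plus a brute-force pairwise check of the AUC-as-ranking-probability interpretation; the imbalanced synthetic data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("F1:       ", f1_score(y_te, pred))
print("AUC:      ", roc_auc_score(y_te, proba))

# AUC = P(a random positive gets a higher score than a random negative)
pos, neg = proba[y_te == 1], proba[y_te == 0]
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print("pairwise: ", pairwise)
```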
Features
- PCA
- Multicollinearity
- detect:
- variance inflation factor
- Coefficients have signs opposite to what you’d expect from theory
- high standard errors
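A minimal VIF sketch using statsmodels' variance_inflation_factor; the near-collinear x2 is constructed deliberately, and the usual VIF > 5 or 10 thresholds are conventions, not hard rules.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each column (the constant's VIF is usually ignored)
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")
```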
Feature Selection
- feature selection with real and categorical data
- Feature Selection: Select a subset of input features from the dataset.
- Unsupervised: Do not use the target variable (e.g. remove redundant variables).
- Supervised: Use the target variable (e.g. remove irrelevant variables).
- Wrapper: Search for well-performing subsets of features.
- Recursive Feature Elimination
- Filter: Select subsets of features based on their relationship with the target.
- Statistical Methods (ANOVA, Chi^2, Correlation)
- Feature Importance Methods
- Intrinsic: Algorithms that perform automatic feature selection during training.
- Dimensionality Reduction: Project input data into a lower-dimensional feature space.
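A quick wrapper-method sketch using scikit-learn's RFE around a logistic regression; keeping 5 features is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# recursively drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("selected feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```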
- Feature Importance
- importance = the feature's contribution to the decrease of the Gini impurity/entropy (or variance, in regression); used by random forests
- permutation importance: randomly shuffle one feature, re-score the model, and compare the metric; the larger the drop, the more important the feature (sketch below)
- similar to drop-column feature selection: drop one feature, retrain, and see how the metric changes
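A sketch of the shuffle-and-re-score idea using scikit-learn's permutation_importance on a random forest, shown next to the impurity-based importances; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# impurity-based importance: decrease in Gini contributed by each feature during training
print("impurity-based:", rf.feature_importances_.round(3))

# permutation importance: shuffle each feature on held-out data and measure the score drop
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean.round(3))
```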
Boosting vs Bagging
- link-difference between bagging and boosting
- boosting:
- build learners in a sequential way
- misclassification errors become weights so that the next learner can learn from the previous mistakes
- mainly reduces bias, which tends to improve performance
- overfitting might be a problem
- bagging:
- build learners with bootstrap samples in a parallel way
- if the individual learners are poor (high bias), bagging cannot fix that
- helps with overfitting, since averaging reduces variance (comparison sketch after this list)
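A compact side-by-side sketch with scikit-learn: bagged deep trees vs boosted shallow trees; all settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# bagging: strong (deep) trees trained in parallel on bootstrap samples, then averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# boosting: weak (shallow) trees trained sequentially, each correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: CV accuracy = {scores.mean():.3f}")
```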
Explain to non-tech
- Use visual content to explain technical information and processes
- Avoid technical terminology when possible
- Focus on impact and initiatives when explaining technical concepts
- focus on why we need it rather than how it works (though this also depends on the purpose of the talk)