ML Review Summary

Logistic Regression

  • Assumptions
    • independent observations
    • no (severe) multicollinearity among the predictors
    • binary dependent variable
    • linear relationship between the log-odds and the predictors
  • coefficients
    • a one-unit increase in a predictor raises the log-odds by \(\beta\), i.e. multiplies the odds by \(e^\beta\) (see the sketch below)
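
A minimal sketch of the odds-ratio interpretation, using scikit-learn on made-up synthetic data (the data and parameter choices are illustrative assumptions, not part of the original notes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one predictor x, binary outcome y (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x[:, 0])))   # true log-odds = 0.5 + 1.2 * x
y = rng.binomial(1, p)

# Large C ~ almost no regularization, so the coefficient is close to the MLE.
model = LogisticRegression(C=1e6).fit(x, y)
beta = model.coef_[0][0]

# A one-unit increase in x multiplies the odds of y = 1 by exp(beta).
print(f"beta = {beta:.2f}, odds ratio = {np.exp(beta):.2f}")
```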

Random Forest

  • intro
  • bagging and OOB:
    • bootstrap aggregation: train DTs on bootstrap samples (drawn with replacement), then aggregate predictions by majority vote
    • out-of-bag (OOB) error: each tree is evaluated on the samples left out of its bootstrap sample, giving a built-in estimate of generalization error
  • “random”
    • random rows: bagging (bootstrap samples)
    • random columns: each split considers only a random subset of the features
  • diversity: each DT is trained based on different training data and different features
    • diverse committee to cast votes
    • relatively robust to multicollinearity
  • split criteria:
    • Gini impurity vs Entropy
      • best split: lowest Gini impurity, lowest entropy
    • Each node can compute this measurement; if a candidate split decreases it, make the split (see the Gini sketch after this list).
      • the measurement of the parent node is already computed
      • the measurement of a split is the weighted average of the child nodes’ measurements, weighted by the fraction of samples in each child
  • parameters:
    • number of trees; max depth; min_samples_leaf; number of features; bootstrap sample size
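
A minimal sketch of the split criterion using Gini impurity on a toy label array (the function and example values are my own, for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy parent node and one candidate split into two children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

# Split impurity = weighted average of the children's impurities.
n = len(parent)
split_impurity = len(left) / n * gini(left) + len(right) / n * gini(right)

# Split only if it decreases the impurity relative to the parent.
print(gini(parent), split_impurity, split_impurity < gini(parent))
```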

Resampling

  • cross-validation:
    • split the data in different ways; for each split, train and validate; aggregate the validation results
    • hyperparameters: nested cross-validation
      • an inner cross-validation loop picks the best hyperparameters; an outer loop then cross-validates that whole tuning strategy, so the reported performance is not biased by the tuning itself (see the sketch after this list)
  • jackknife vs bootstrap
    • the bootstrap can be seen as a random approximation of the more general delete-m jackknife (which leaves out m observations at a time)
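
A minimal sketch of nested cross-validation with scikit-learn (the estimator, parameter grid, and fold counts are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Inner loop: GridSearchCV picks hyperparameters on each training fold.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=3,
)

# Outer loop: cross-validate the whole "tune, then fit" strategy.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```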

Missing Data

  • A very interesting approach (Airbnb):
    • Naive ways: remove rows with missing values, or replace them with the median/mean for numeric features and the mode for categorical features
      • cons: deletion can leave many gaps in the data, and mean/median/mode imputation can introduce significant artificial structure
    • suggested: KNN median with a special distance metric and normalizing method
      • normalizing features so that both numerical and categorical features are mapped to the interval [0,1]
      • how to normalize? conditional CDF given Y = 1
      • impute the missing by the median of the K nearest neighbors
  • proximity imputation, on-the-fly imputation (and other imputation methods for random forest models), link (Tang, 2017)
  • A summary
    • Delete: loses information
    • Impute with means/medians/modes: can bias the distribution
    • Predict: treat the feature with missing values as a target, train on the complete rows, and predict the missing entries
    • KNN: impute from the values of the nearest neighbors (see the sketch after this list)
    • Time-series: linear interpolation + seasonal adjustments
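
A minimal sketch of KNN imputation using scikit-learn’s KNNImputer (the toy matrix is made up; note that KNNImputer averages the neighbors rather than taking their median):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with missing entries (illustrative only).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is filled from its 2 nearest complete neighbors.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```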

Imbalanced Data

  • intro
  • under- and over-sampling
  • ensemble different resampled datasets
    • chunk the data of the abundant class into m chunks, combine each chunk with the rare class data to train m models; ensemble m models
    • each chunk can have a different ratio between the abundant class and the rare class
  • penalize misclassifying the rare class more heavily than misclassifying the abundant class (class weights / cost-sensitive learning)
  • SMOTE
    • randomly select an instance of the minority class; find its k nearest minority-class neighbors; randomly select one neighbor; create a synthetic sample as a random convex combination of the instance and that neighbor (see the sketch below)
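
A minimal NumPy sketch of the SMOTE idea (toy data, for illustration only; in practice a library such as imbalanced-learn would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))          # toy minority-class points
k = 5

# Pick a random minority instance and its k nearest minority-class neighbors.
i = rng.integers(len(minority))
dists = np.linalg.norm(minority - minority[i], axis=1)
neighbors = np.argsort(dists)[1:k + 1]       # exclude the point itself

# Synthetic sample = random convex combination of the point and one neighbor.
j = rng.choice(neighbors)
lam = rng.uniform()
synthetic = minority[i] + lam * (minority[j] - minority[i])
print(synthetic)
```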

Overfitting
  • lower the capacity of the model to memorize the training data, link
    • reduce the number of parameters
    • regularization: penalize large weights
    • dropout (see the sketch after this list)
  • 8 simple techniques
    • cross-validation
    • data augmentation: increase the sample size
    • feature selection: reduce the number of features
    • regularization
  • ensembling
    • bagging: a large number of strong learners (relatively unconstrained) in parallel; then combine
    • boosting: weak learners in sequence; learning from the mistakes of the previous one
  • overfitting ↔ high variance; underfitting ↔ high bias (the bias–variance trade-off)
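
A minimal NumPy sketch of the dropout idea (inverted dropout; the array is a made-up stand-in for one layer’s activations):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 8))   # toy layer activations
p_drop = 0.5

# Training time: randomly zero out units and rescale the survivors
# so the expected activation stays the same (inverted dropout).
mask = rng.random(activations.shape) >= p_drop
dropped = activations * mask / (1 - p_drop)
print(dropped)
```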

ROC, Precision-Recall, Specificity-Sensitivity

  • link-wiki-roc
  • AUC: the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative instance link-roc
  • F1 score: harmonic mean of precision and recall
    • the harmonic mean penalizes extreme imbalance between precision and recall
    • Accuracy is appropriate when true positives and true negatives matter most, while the F1-score is appropriate when false negatives and false positives are the main concern
    • Accuracy works well when the class distribution is roughly balanced, while the F1-score is the better metric for imbalanced classes (see the sketch below)
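
A minimal sketch of these metrics with scikit-learn (the toy labels and scores are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy ground truth, hard predictions, and predicted scores (illustrative only).
y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.1]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)          # harmonic mean of the two
auc       = roc_auc_score(y_true, y_score)    # rank-based, uses the scores

print(precision, recall, f1, auc)
```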

Features

  • PCA
  • Multicollinearity
    • detect:
      • variance inflation factor (VIF)
      • coefficients with signs opposite to what you’d expect from theory
      • high standard errors on the coefficients
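
A minimal sketch of the VIF check using statsmodels (the toy feature matrix is made up so that x3 is nearly collinear with x1 and x2):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy design matrix where x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.05, size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on
# all other features; values well above ~5-10 flag multicollinearity.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```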

Feature Selection

  • feature selection with real and categorical data
  • Feature Selection: Select a subset of input features from the dataset.
    • Unsupervised: Do not use the target variable (e.g. remove redundant variables).
      • Correlation
    • Supervised: Use the target variable (e.g. remove irrelevant variables).
      • Wrapper: Search for well-performing subsets of features.
        • Recursive Feature Elimination
      • Filter: Select subsets of features based on their relationship with the target.
        • Statistical Methods (ANOVA, Chi^2, Correlation)
        • Feature Importance Methods
      • Intrinsic: Algorithms that perform automatic feature selection during training.
        • Decision Trees, LASSO
  • Dimensionality Reduction: Project input data into a lower-dimensional feature space.
  • Feature Importance
    • impurity-based importance: the feature’s total contribution to the decrease in Gini impurity/entropy (or variance, in the regression case); useful for random forests
    • permutation importance: randomly shuffle one feature’s values, re-score the model, and check the metric; the larger the drop in the metric, the more important the feature (see the sketch after this list)
    • drop-column importance (like feature selection): retrain without one feature and compare the metric
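
A minimal sketch of permutation importance with scikit-learn (the model and data choices are arbitrary, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```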

Regularization
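
A minimal sketch of the “penalize large weights” idea mentioned under Overfitting, comparing L1 and L2 penalties in scikit-learn (the data and parameter choices are arbitrary, for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=3, random_state=0)

# Smaller C = stronger penalty on large coefficients.
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# L2 shrinks weights toward zero; L1 drives many of them exactly to zero.
print("L2 mean |coef|:", np.abs(l2.coef_).mean())
print("L1 zero coefs :", (l1.coef_ == 0).sum(), "of", l1.coef_.size)
```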

Boosting vs Bagging

  • link-difference between bagging and boosting
  • boosting:
    • build learners in a sequential way
    • misclassified examples receive higher weights so that the next learner focuses on the previous learners’ mistakes
    • mainly reduces bias, which tends to improve performance
    • overfitting might be a problem
  • bagging:
    • build learners with bootstrap samples in a parallel way
    • if every individual learner is poor (high bias), bagging cannot make them much better
    • mainly reduces variance, so it helps with overfitting
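
A minimal sketch contrasting the two with scikit-learn (the estimator and parameter choices are arbitrary, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: deep (low-bias, high-variance) trees trained in parallel on bootstrap samples.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=None), n_estimators=50, random_state=0)

# Boosting: shallow (weak) trees trained sequentially, each reweighting the previous mistakes.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```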

Outliers

Explain to non-tech

  • Use visual content to explain technical information and processes
  • Avoid technical terminology when possible
  • Focus on impact and initiatives when explaining technical concepts
    • why we need it instead of how it works (of course, it also depends on the purpose of the talk)