5 min read
ML Review Summary
2022-03-31
Logistic Regression
- Assumptions
- independent observations
- no multicollinearity among the predictors
- binary dependent variable
- linear relationship between the log odds and the predictors
- coefficients
- a one-unit increase in a predictor adds \(\beta\) to the log odds, i.e., multiplies the odds by \(e^\beta\) (sketch below)
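A minimal sketch of reading coefficients as odds ratios with scikit-learn; the synthetic data and feature indices are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# illustrative synthetic binary-classification data
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# each coefficient beta is the change in log odds per unit increase of the feature;
# exp(beta) is the multiplicative change in the odds
for i, beta in enumerate(model.coef_[0]):
    print(f"feature {i}: log-odds change = {beta:.3f}, odds ratio = {np.exp(beta):.3f}")
```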
Random Forest
- intro
- bagging and OOB:
- bootstrap aggregation: train DTs with bootstrap samples (with replacement), aggregate by the majority vote
- out-of-bag (OOB): the rows left out of a tree's bootstrap sample; they provide a built-in, validation-like error estimate
- “random”
- random rows: bagging
- random cols: use random features in DTs’ training
- diversity: each DT is trained based on different training data and different features
- diverse committee to cast votes
- relatively robust to multicollinearity
- split criteria:
- Gini impurity vs Entropy
- best split: the candidate whose children have the lowest weighted Gini impurity (or entropy); worked example below
- each node can compute this measurement; if a split decreases it relative to the parent, make the split
- the measurement of the parent node is already computed
- the measurement of a split is the sample-weighted average of the child nodes' measurements
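A small worked example of the weighted-Gini comparison above; the class counts are made up and the helper function is just for illustration.

```python
# Gini impurity of a node given its class counts: 1 - sum(p_k^2)
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# hypothetical parent node: 40 positives, 60 negatives
parent = [40, 60]
# a candidate split producing two child nodes
left, right = [30, 10], [10, 50]

n_left, n_right = sum(left), sum(right)
n = n_left + n_right

# the split's measurement is the sample-weighted average of the children
split_gini = (n_left / n) * gini(left) + (n_right / n) * gini(right)

print(f"parent Gini: {gini(parent):.3f}")   # 0.480
print(f"split Gini:  {split_gini:.3f}")     # 0.317
print(f"decrease:    {gini(parent) - split_gini:.3f}")  # positive -> the split helps
```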
- parameters:
- number of trees; max depth; min_samples_leaf; number of features; bootstrap sample size
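A minimal sketch wiring these parameters into scikit-learn's RandomForestClassifier; the specific values are illustrative, not recommendations, and `oob_score=True` surfaces the out-of-bag estimate mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=10,          # max depth of each tree
    min_samples_leaf=5,    # minimum samples per leaf
    max_features="sqrt",   # random subset of features considered at each split
    max_samples=0.8,       # bootstrap sample size (fraction of rows per tree)
    bootstrap=True,
    oob_score=True,        # out-of-bag error estimate
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
```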
Resampling
- cross-validation:
- split the data in different ways; for each split, train and validate; aggregate validation results
- hyperparameters: nested cross-validation
- an inner cross-validation loop selects the hyperparameters; an outer loop cross-validates that whole selection strategy to confirm it generalizes (sketch after this list)
- jackknife vs bootstrap
- the bootstrap can be seen as a random approximation of the more general delete-m jackknife
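A minimal nested cross-validation sketch with scikit-learn; the random-forest model and the tiny parameter grid are just assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# inner loop: pick hyperparameters by cross-validation
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 10], "min_samples_leaf": [1, 5]},
    cv=3,
)

# outer loop: cross-validate the whole "search + refit" strategy
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```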
Missing Data
- Very interesting one (Airbnb):
- Naive ways: remove or replace with median/mean for numeric and mode for categorical
- cons: deletion leaves gaps in the data, and mean/mode imputation can introduce artificial structure
- suggested: KNN median with a special distance metric and normalizing method
- normalizing features so that both numerical and categorical features are mapped to the interval [0,1]
- how to normalize? conditional CDF given Y = 1
- impute the missing by the median of the K nearest neighbors
- proximity imputation, on-the-fly imputation (and other imputation methods for random forest models), link (paper: Tang, 2017)
- A summary
- Delete: lose information
- impute with means/medians or modes: can bias the distribution
- predict: train a model on the rows without missing values and use it to predict the missing ones
- KNN: impute using the values of the nearest neighbors (generic sketch below)
- Time-series: linear interpolation + seasonal adjustments
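A generic KNN-imputation sketch with scikit-learn's KNNImputer, not the Airbnb scheme above: KNNImputer averages the neighbors under a nan-aware Euclidean distance rather than taking the median under a custom metric; the tiny matrix assumes features already scaled to [0, 1].

```python
import numpy as np
from sklearn.impute import KNNImputer

# illustrative matrix with missing entries, features already mapped to [0, 1]
X = np.array([
    [0.1, 0.9, np.nan],
    [0.2, 0.8, 0.3],
    [0.1, np.nan, 0.4],
    [0.9, 0.1, 0.8],
])

imputer = KNNImputer(n_neighbors=2)   # impute each gap from the 2 nearest rows
print(imputer.fit_transform(X))
```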
Imbalanced Data
- intro
- under- and over-sampling
- ensemble different resampled datasets
- chunk the data of the abundant class into m chunks, combine each chunk with the rare class data to train m models; ensemble m models
- each chunk can have a different ratio between the abundant and rare class
- penalize misclassifying the rare class more heavily than misclassifying the abundant class (e.g., via class weights)
- SMOTE
- randomly select an instance of the minority class; find its k nearest minority-class neighbors; randomly pick one neighbor; create a synthetic point as a random convex combination of the instance and that neighbor (sketch below)
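A bare-bones NumPy sketch of the SMOTE step just described; a real project would more likely reach for imbalanced-learn's SMOTE, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_minority, k=5):
    """Generate one synthetic minority-class point by interpolation."""
    # randomly pick a minority instance
    x = X_minority[rng.integers(len(X_minority))]
    # find its k nearest minority neighbors (excluding itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbors = np.argsort(dists)[1 : k + 1]
    # pick one neighbor and take a random convex combination
    nb = X_minority[rng.choice(neighbors)]
    lam = rng.random()
    return x + lam * (nb - x)

X_min = rng.normal(size=(20, 2))  # illustrative minority-class points
synthetic = np.array([smote_sample(X_min) for _ in range(10)])
print(synthetic.shape)            # (10, 2)
```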
Overfitting
- lower the capacity of the model to memorize the training data, link
- reduce the number of parameters
- regularization: penalize large weights
- dropout
- 8 simple techniques
- cross-validation
- data augmentation: increase the sample size
- feature selection: reduce the number of features
- regularization
- ensembling
- bagging: a large number of strong learners (relatively unconstrained) in parallel; then combine
- boosting: weak learners in sequence; learning from the mistakes of the previous one
- overfitting, underfitting, bias, variance
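A minimal sketch of the "penalize large weights" idea: sweeping the L2 regularization strength in scikit-learn's LogisticRegression (smaller C means a stronger penalty); the values and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# few informative features, many noisy ones -> easy to overfit
X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

# smaller C => stronger L2 penalty => smaller weights, lower capacity
for C in [100.0, 1.0, 0.01]:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    print(f"C={C:>6}: CV accuracy = {scores.mean():.3f}")
```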
ROC, Precision-Recall, Specificity-Sensitivity
- link-wiki-roc
- AUC: the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative instance link-roc
- F1 score: harmonic mean of precision and recall
- it penalizes the extreme values
- accuracy is appropriate when true positives and true negatives matter most and the class distribution is roughly balanced
- F1 is the better metric when false positives and false negatives are costly or the classes are imbalanced
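A small sketch computing these metrics with scikit-learn, plus a brute-force pairwise check of the AUC-as-ranking-probability interpretation; the imbalanced synthetic data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("F1:       ", f1_score(y_te, pred))
print("AUC:      ", roc_auc_score(y_te, proba))

# AUC = P(a random positive gets a higher score than a random negative)
pos, neg = proba[y_te == 1], proba[y_te == 0]
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print("pairwise: ", pairwise)
```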
Features
- PCA
- Multicollinearity
- detect:
- variance inflation factor
- Coefficients have signs opposite to what you’d expect from theory
- high standard errors
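A minimal VIF sketch using statsmodels' variance_inflation_factor; the near-collinear x2 is constructed deliberately, and the usual VIF > 5 or 10 thresholds are conventions, not hard rules.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each column (the constant's VIF is usually ignored)
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")
```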
Feature Selection
- feature selection with real and categorical data
- Feature Selection: Select a subset of input features from the dataset.
- Unsupervised: Do not use the target variable (e.g. remove redundant variables).
- Supervised: Use the target variable (e.g. remove irrelevant variables).
- Wrapper: Search for well-performing subsets of features.
- Recursive Feature Elimination
- Filter: Select subsets of features based on their relationship with the target.
- Statistical Methods (ANOVA, Chi^2, Correlation)
- Feature Importance Methods
- Intrinsic: Algorithms that perform automatic feature selection during training.
- Dimensionality Reduction: Project input data into a lower-dimensional feature space.
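A quick wrapper-method sketch using scikit-learn's RFE around a logistic regression; keeping 5 features is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# recursively drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("selected feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```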
- Feature Importance
- importance = the feature's contribution to the decrease of the Gini impurity/entropy (or variance, in regression); used by random forests
- permutation importance: randomly shuffle one feature, re-score the model, and compare the metric; the larger the drop, the more important the feature (sketch below)
- similar to drop-column feature selection: drop one feature, retrain, and see how the metric changes
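A sketch of the shuffle-and-re-score idea using scikit-learn's permutation_importance on a random forest, shown next to the impurity-based importances; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# impurity-based importance: decrease in Gini contributed by each feature during training
print("impurity-based:", rf.feature_importances_.round(3))

# permutation importance: shuffle each feature on held-out data and measure the score drop
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean.round(3))
```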
Boosting vs Bagging
- link-difference between bagging and boosting
- boosting:
- build learners in a sequential way
- misclassification errors become weights so that the next learner can learn from the previous mistakes
- mainly reduces bias, which tends to improve performance
- overfitting might be a problem
- bagging:
- build learners with bootstrap samples in a parallel way
- if the individual learners are poor (high bias), bagging cannot fix that
- helps with overfitting, since averaging reduces variance (comparison sketch after this list)
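A compact side-by-side sketch with scikit-learn: bagged deep trees vs boosted shallow trees; all settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# bagging: strong (deep) trees trained in parallel on bootstrap samples, then averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# boosting: weak (shallow) trees trained sequentially, each correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: CV accuracy = {scores.mean():.3f}")
```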
Explain to non-tech
- Use visual content to explain technical information and processes
- Avoid technical terminology when possible
- Focus on impact and initiatives when explaining technical concepts
- focus on why we need it rather than how it works (though this also depends on the purpose of the talk)