Importance Marker Scale: each item below is rated from 1 to 5 for how widely used and how critical it is among ML practitioners (5 = most widely used/critical; 1 = less common or context-dependent).
COMPLETE MACHINE LEARNING PIPELINE
│
├── 1. DATA PREPROCESSING & CLEANING (CRITICAL - First Priority) (5/5)
│ ├── 1.1 Data Cleaning (5/5)
│ │ ├── Handling Inconsistent Data (5/5)
│ │ │ ├── pandas.DataFrame.replace()
│ │ │ ├── regex
│ │ │ └── ✓ Correcting typos, standardizing formats
│ │ ├── Removing Duplicates (5/5)
│ │ │ ├── pandas.DataFrame.drop_duplicates()
│ │ │ └── ✓ Ensures unique records
│ │ └── Handling Noise (4/5)
│ │ ├── Binning (4/5)
│ │ ├── Regression (3/5)
│ │ └── Clustering (3/5)
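│ │
│ │ A minimal cleaning sketch (toy DataFrame; column names are illustrative):
│ │     import pandas as pd
│ │     df = pd.DataFrame({"city": ["NYC", "nyc", "N.Y.C.", "Boston", "Boston"]})
│ │     # Standardize inconsistent spellings with a regex, then drop exact duplicates
│ │     df["city"] = df["city"].str.upper().str.replace(r"[^A-Z]", "", regex=True)
│ │     df = df.drop_duplicates()
│ │     # Smooth a noisy numeric column by equal-width binning
│ │     age_bins = pd.cut(pd.Series([23, 25, 47, 52, 46, 51]), bins=3)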
│ │
│ ├── 1.2 Imputation Strategies (Handling Missing Values) (5/5)
│ │ ├── Simple Imputation (5/5)
│ │ │ ├── Mean/Median/Mode Imputation (5/5)
│ │ │ │ ├── sklearn.impute.SimpleImputer
│ │ │ │ ├── pandas.DataFrame.fillna()
│ │ │ │ └── ✓ Fast, simple baseline for numerical/categorical data
│ │ │ ├── Forward/Backward Fill (4/5)
│ │ │ │ ├── pandas.DataFrame.ffill()
│ │ │ │ ├── pandas.DataFrame.bfill()
│ │ │ │ └── ✓ Ideal for time-series data with temporal dependencies
│ │ │ └── Constant Value Imputation (4/5)
│ │ │ ├── sklearn.impute.SimpleImputer(strategy='constant')
│ │ │ └── ✓ Domain-specific constants (e.g., 0, 'Unknown', 'Missing')
│ │ ├── Advanced Imputation (4/5)
│ │ │ ├── Iterative Imputation (MICE) (4/5)
│ │ │ │ ├── sklearn.impute.IterativeImputer
│ │ │ │ ├── fancyimpute.IterativeImputer
│ │ │ │ └── ✓ Robust multivariate imputation using chained equations
│ │ │ ├── K-Nearest Neighbors Imputation (4/5)
│ │ │ │ ├── sklearn.impute.KNNImputer
│ │ │ │ └── ✓ Preserves feature relationships, good for mixed data types
│ │ │ └── Matrix Factorization (3/5)
│ │ │ ├── fancyimpute.MatrixFactorization
│ │ │ ├── fancyimpute.NuclearNormMinimization
│ │ │ └── ✓ Advanced technique for high-dimensional sparse data
│ │ └── Domain-Specific Imputation (3/5)
│ │ ├── Time Series Interpolation (3/5)
│ │ │ ├── pandas.DataFrame.interpolate()
│ │ │ ├── scipy.interpolate.interp1d
│ │ │ └── ✓ Linear, polynomial, spline interpolation for time series
│ │ └── Seasonal Decomposition (3/5)
│ │ ├── statsmodels.tsa.seasonal.seasonal_decompose
│ │ └── ✓ Handles seasonal patterns in time series data
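│ │
│ │ An imputation sketch (note that IterativeImputer still requires the experimental enable import):
│ │     import numpy as np
│ │     from sklearn.experimental import enable_iterative_imputer  # noqa: F401
│ │     from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
│ │     X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
│ │     X_mean = SimpleImputer(strategy="mean").fit_transform(X)
│ │     X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
│ │     X_mice = IterativeImputer(random_state=0).fit_transform(X)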
│ │
│ ├── 1.3 Outlier Detection & Treatment (4/5)
│ │ ├── Statistical Methods (4/5)
│ │ │ ├── Z-Score Method (4/5)
│ │ │ │ ├── scipy.stats.zscore
│ │ │ │ └── ✓ Identifies outliers beyond 3 standard deviations
│ │ │ ├── Interquartile Range (IQR) (4/5)
│ │ │ │ ├── pandas.DataFrame.quantile()
│ │ │ │ └── ✓ Robust to extreme values, uses Q1 and Q3
│ │ │ └── Modified Z-Score (3/5)
│ │ │ ├── scipy.stats.median_abs_deviation
│ │ │ └── ✓ Uses median instead of mean, more robust
│ │ └── Machine Learning Methods (3/5)
│ │ ├── Isolation Forest (3/5)
│ │ │ ├── sklearn.ensemble.IsolationForest
│ │ │ └── ✓ Unsupervised anomaly detection, good for high dimensions
│ │ ├── Local Outlier Factor (LOF) (3/5)
│ │ │ ├── sklearn.neighbors.LocalOutlierFactor
│ │ │ └── ✓ Density-based outlier detection
│ │ └── One-Class SVM (2/5)
│ │ ├── sklearn.svm.OneClassSVM
│ │ └── ✓ Kernel-based outlier detection
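│ │
│ │ An outlier-flagging sketch (IQR rule plus IsolationForest on toy data):
│ │     import numpy as np
│ │     from sklearn.ensemble import IsolationForest
│ │     x = np.array([10.0, 11.0, 12.0, 11.5, 95.0])
│ │     q1, q3 = np.percentile(x, [25, 75])
│ │     iqr_mask = (x < q1 - 1.5 * (q3 - q1)) | (x > q3 + 1.5 * (q3 - q1))  # flags 95.0
│ │     labels = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1))  # -1 marks outliers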
│ │
│ ├── 1.4 Data Integration (3/5)
│ │ ├── Schema Integration (3/5)
│ │ │ ├── pandas.DataFrame.merge()
│ │ │ └── ✓ Combining data from heterogeneous sources
│ │ └── Entity Resolution (2/5)
│ │ ├── recordlinkage
│ │ └── ✓ Identifying and linking records that refer to the same entity
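│ │
│ │ A schema-integration sketch (hypothetical tables joined on a shared key):
│ │     import pandas as pd
│ │     customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Bo"]})
│ │     orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10, 20, 5]})
│ │     merged = customers.merge(orders, on="cust_id", how="left")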
│ │
│ └── 1.5 Data Transformation (4/5)
│ ├── Aggregation (4/5)
│ │ ├── pandas.DataFrame.groupby().agg()
│ │ └── ✓ Summarizing data (e.g., sum, mean, count)
│ ├── Generalization (3/5)
│ │ ├── Custom logic
│ │ └── ✓ Replacing low-level data with high-level concepts (e.g., age ranges)
│ └── Normalization (Data Warehousing Context) (3/5)
│ ├── SQL techniques
│ └── ✓ Reducing data redundancy and improving data integrity
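│
│ An aggregation/generalization sketch (toy data; bin edges are illustrative):
│     import pandas as pd
│     df = pd.DataFrame({"region": ["N", "N", "S"], "sales": [10, 20, 5], "age": [23, 41, 67]})
│     summary = df.groupby("region").agg(total=("sales", "sum"), mean=("sales", "mean"))
│     # Generalize raw ages into coarse, higher-level ranges
│     df["age_range"] = pd.cut(df["age"], bins=[0, 30, 60, 120], labels=["young", "mid", "senior"])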
│
├── 2. FEATURE ENGINEERING (HIGH IMPORTANCE - Second Priority) (4/5)
│ ├── Categorical Encoding (4/5)
│ │ ├── One-Hot Encoding (5/5)
│ │ │ ├── sklearn.preprocessing.OneHotEncoder
│ │ │ ├── pandas.get_dummies()
│ │ │ └── ✓ Best for nominal categories, avoids ordinal assumptions
│ │ ├── Label Encoding (4/5)
│ │ │ ├── sklearn.preprocessing.LabelEncoder
│ │ │ └── ✓ Encodes target labels as integers; for ordinal features, prefer OrdinalEncoder
│ │ ├── Ordinal Encoding (4/5)
│ │ │ ├── sklearn.preprocessing.OrdinalEncoder
│ │ │ └── ✓ Preserves ordinal relationships between categories
│ │ ├── Target Encoding (3/5)
│ │ │ ├── category_encoders.TargetEncoder
│ │ │ ├── category_encoders.MEstimateEncoder
│ │ │ └── ✓ Uses target statistics, good for high-cardinality categories
│ │ ├── Binary Encoding (2/5)
│ │ │ ├── category_encoders.BinaryEncoder
│ │ │ └── ✓ Reduces dimensionality compared to one-hot
│ │ ├── Hash Encoding (2/5)
│ │ │ ├── category_encoders.HashingEncoder
│ │ │ └── ✓ Fixed-size output, handles unseen categories
│ │ └── Frequency Encoding (2/5)
│ │ ├── category_encoders.CountEncoder
│ │ └── ✓ Replaces categories with their occurrence frequency
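│ │
│ │ An encoding sketch (nominal column one-hot, ordinal column with an explicit order):
│ │     import pandas as pd
│ │     from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
│ │     X = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})
│ │     onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(X[["color"]])
│ │     ordinal = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(X[["size"]])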
│ │
│ ├── Numerical Transformations (4/5)
│ │ ├── Scaling & Normalization (5/5)
│ │ │ ├── StandardScaler (Z-score) (5/5)
│ │ │ │ ├── sklearn.preprocessing.StandardScaler
│ │ │ │ └── ✓ Mean=0, Std=1, best for normally distributed data
│ │ │ ├── MinMaxScaler (5/5)
│ │ │ │ ├── sklearn.preprocessing.MinMaxScaler
│ │ │ │ └── ✓ Scales to [0,1] range, preserves relationships
│ │ │ ├── RobustScaler (4/5)
│ │ │ │ ├── sklearn.preprocessing.RobustScaler
│ │ │ │ └── ✓ Uses median and IQR, robust to outliers
│ │ │ ├── MaxAbsScaler (3/5)
│ │ │ │ ├── sklearn.preprocessing.MaxAbsScaler
│ │ │ │ └── ✓ Scales by maximum absolute value, preserves sparsity
│ │ │ └── Normalizer (3/5)
│ │ │ ├── sklearn.preprocessing.Normalizer
│ │ │ └── ✓ Scales individual samples to unit norm
│ │ ├── Distribution Transformation (4/5)
│ │ │ ├── Log Transform (4/5)
│ │ │ │ ├── numpy.log1p()
│ │ │ │ ├── numpy.log()
│ │ │ │ └── ✓ Reduces right skew; log needs positive values, log1p also handles zeros
│ │ │ ├── Square Root Transform (3/5)
│ │ │ │ ├── numpy.sqrt()
│ │ │ │ └── ✓ Moderate skewness reduction
│ │ │ ├── Box-Cox Transform (3/5)
│ │ │ │ ├── scipy.stats.boxcox()
│ │ │ │ ├── sklearn.preprocessing.PowerTransformer(method='box-cox')
│ │ │ │ └── ✓ Optimal power transformation for positive data
│ │ │ └── Yeo-Johnson Transform (3/5)
│ │ │ ├── sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
│ │ │ └── ✓ Handles both positive and negative values
│ │ └── Binning/Discretization (4/5)
│ │ ├── Equal-Width Binning (4/5)
│ │ │ ├── sklearn.preprocessing.KBinsDiscretizer(strategy='uniform')
│ │ │ ├── pandas.cut()
│ │ │ └── ✓ Equal-sized intervals, may have unequal frequencies
│ │ ├── Equal-Frequency Binning (4/5)
│ │ │ ├── sklearn.preprocessing.KBinsDiscretizer(strategy='quantile')
│ │ │ ├── pandas.qcut()
│ │ │ └── ✓ Equal sample sizes per bin
│ │ └── K-Means Binning (3/5)
│ │ ├── sklearn.preprocessing.KBinsDiscretizer(strategy='kmeans')
│ │ └── ✓ Clusters data points into bins using K-means
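│ │
│ │ A transformation sketch (scaling, skew reduction, and quantile binning on one toy column):
│ │     import numpy as np
│ │     from sklearn.preprocessing import StandardScaler, PowerTransformer, KBinsDiscretizer
│ │     X = np.array([[1.0], [10.0], [100.0], [1000.0]])
│ │     X_std = StandardScaler().fit_transform(X)
│ │     X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)  # tames right skew
│ │     X_bin = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile").fit_transform(X)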
│ │
│ ├── Feature Creation (4/5)
│ │ ├── Polynomial Features (4/5)
│ │ │ ├── sklearn.preprocessing.PolynomialFeatures
│ │ │ └── ✓ Creates polynomial and interaction terms
│ │ ├── Interaction Features (4/5)
│ │ │ ├── sklearn.preprocessing.PolynomialFeatures(interaction_only=True)
│ │ │ └── ✓ Only interaction terms, no polynomial terms
│ │ ├── Domain-Specific Features (5/5)
│ │ │ ├── Date/Time Features (5/5)
│ │ │ │ ├── pandas.Series.dt.year, .dt.month, .dt.dayofweek
│ │ │ │ ├── featuretools.primitives (Year, Month, Weekday)
│ │ │ │ └── ✓ Extract temporal patterns and cyclical features
│ │ │ ├── Text Features (TF-IDF, N-grams) (4/5)
│ │ │ │ ├── sklearn.feature_extraction.text.TfidfVectorizer
│ │ │ │ ├── sklearn.feature_extraction.text.CountVectorizer
│ │ │ │ └── ✓ Convert text to numerical features
│ │ │ └── Geospatial Features (3/5)
│ │ │ ├── geopy.distance
│ │ │ └── ✓ Distance calculations, coordinate transformations
│ │ └── Automated Feature Engineering (2/5)
│ │ ├── featuretools.dfs()
│ │ ├── tsfresh.extract_features()
│ │ └── ✓ Automatically generates features from relational data
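│ │
│ │ A feature-creation sketch (date parts, pure interaction terms, and TF-IDF text features):
│ │     import pandas as pd
│ │     from sklearn.preprocessing import PolynomialFeatures
│ │     from sklearn.feature_extraction.text import TfidfVectorizer
│ │     dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-01"]))
│ │     month, dow = dates.dt.month, dates.dt.dayofweek
│ │     inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
│ │     X_inter = inter.fit_transform([[2.0, 3.0], [4.0, 5.0]])  # adds the x1*x2 column
│ │     tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(["cheap flights", "cheap hotel deals"])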
│
├── 3. FEATURE SELECTION (MEDIUM-HIGH IMPORTANCE - Third Priority) (3/5)
│ ├── Filter Methods (Univariate) (4/5)
│ │ ├── Statistical Tests (4/5)
│ │ │ ├── Chi-Square Test (4/5)
│ │ │ │ ├── sklearn.feature_selection.chi2
│ │ │ │ ├── sklearn.feature_selection.SelectKBest(chi2)
│ │ │ │ └── ✓ For categorical features vs categorical target
│ │ │ ├── ANOVA F-Test (4/5)
│ │ │ │ ├── sklearn.feature_selection.f_classif
│ │ │ │ ├── sklearn.feature_selection.f_regression
│ │ │ │ └── ✓ For numerical features vs categorical/numerical target
│ │ │ ├── Mutual Information (4/5)
│ │ │ │ ├── sklearn.feature_selection.mutual_info_classif
│ │ │ │ ├── sklearn.feature_selection.mutual_info_regression
│ │ │ │ └── ✓ Captures non-linear relationships
│ │ │ └── Kendall's Tau (2/5)
│ │ │ ├── scipy.stats.kendalltau
│ │ │ └── ✓ Non-parametric correlation measure
│ │ ├── Correlation-Based (4/5)
│ │ │ ├── Pearson Correlation (4/5)
│ │ │ │ ├── pandas.DataFrame.corr()
│ │ │ │ ├── numpy.corrcoef()
│ │ │ │ ├── scipy.stats.pearsonr()
│ │ │ │ └── ✓ Linear relationships, normally distributed data
│ │ │ ├── Spearman Correlation (4/5)
│ │ │ │ ├── scipy.stats.spearmanr()
│ │ │ │ ├── pandas.DataFrame.corr(method='spearman')
│ │ │ │ └── ✓ Monotonic relationships, rank-based
│ │ │ └── Kendall Correlation (2/5)
│ │ │ ├── scipy.stats.kendalltau()
│ │ │ └── ✓ Robust to outliers, small sample sizes
│ │ └── Variance-Based (4/5)
│ │ ├── Low Variance Filter (4/5)
│ │ │ ├── sklearn.feature_selection.VarianceThreshold
│ │ │ └── ✓ Removes features with low variance (near-constant)
│ │ └── High Correlation Filter (4/5)
│ │ ├── Custom implementation with pandas.DataFrame.corr()
│ │ └── ✓ Removes highly correlated features (multicollinearity)
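│ │
│ │ A filter-method sketch (variance threshold, then top-k by ANOVA F-score):
│ │     from sklearn.datasets import load_iris
│ │     from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
│ │     X, y = load_iris(return_X_y=True)
│ │     X_var = VarianceThreshold(threshold=0.1).fit_transform(X)
│ │     X_top = SelectKBest(f_classif, k=2).fit_transform(X, y)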
│ │
│ ├── Wrapper Methods (Model-Based) (3/5)
│ │ ├── Forward Selection (3/5)
│ │ │ ├── sklearn.feature_selection.SequentialFeatureSelector(direction='forward')
│ │ │ ├── mlxtend.feature_selection.SequentialFeatureSelector
│ │ │ └── ✓ Starts empty, adds features iteratively
│ │ ├── Backward Elimination (3/5)
│ │ │ ├── sklearn.feature_selection.SequentialFeatureSelector(direction='backward')
│ │ │ ├── mlxtend.feature_selection.SequentialFeatureSelector
│ │ │ └── ✓ Starts with all features, removes iteratively
│ │ ├── Recursive Feature Elimination (RFE) (4/5)
│ │ │ ├── sklearn.feature_selection.RFE
│ │ │ ├── sklearn.feature_selection.RFECV
│ │ │ └── ✓ Recursively eliminates least important features
│ │ └── Genetic Algorithms (2/5)
│ │ ├── sklearn_genetic.GAFeatureSelectionCV (sklearn-genetic-opt package)
│ │ ├── DEAP
│ │ └── ✓ Evolutionary approach to feature selection
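│ │
│ │ A wrapper-method sketch (RFECV picks the feature count by cross-validation):
│ │     from sklearn.datasets import load_breast_cancer
│ │     from sklearn.feature_selection import RFECV
│ │     from sklearn.linear_model import LogisticRegression
│ │     X, y = load_breast_cancer(return_X_y=True)
│ │     selector = RFECV(LogisticRegression(max_iter=5000), step=1, cv=5).fit(X, y)
│ │     kept = selector.support_  # boolean mask of retained features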
│ │
│ ├── Embedded Methods (Intrinsic) (4/5)
│ │ ├── Tree-Based Importance (5/5)
│ │ │ ├── Random Forest Importance (5/5)
│ │ │ │ ├── sklearn.ensemble.RandomForestClassifier.feature_importances_
│ │ │ │ ├── sklearn.ensemble.RandomForestRegressor.feature_importances_
│ │ │ │ └── ✓ Gini/entropy-based importance, handles interactions
│ │ │ ├── Extra Trees Importance (4/5)
│ │ │ │ ├── sklearn.ensemble.ExtraTreesClassifier.feature_importances_
│ │ │ │ └── ✓ More randomized than Random Forest
│ │ │ ├── XGBoost Importance (5/5)
│ │ │ │ ├── xgboost.XGBClassifier.feature_importances_
│ │ │ │ ├── xgboost.plot_importance()
│ │ │ │ └── ✓ Gain, weight, cover importance metrics
│ │ │ ├── LightGBM Importance (5/5)
│ │ │ │ ├── lightgbm.LGBMClassifier.feature_importances_
│ │ │ │ ├── lightgbm.plot_importance()
│ │ │ │ └── ✓ Split-based importance, fast training
│ │ │ └── CatBoost Importance (4/5)
│ │ │ ├── catboost.CatBoostClassifier.feature_importances_
│ │ │ └── ✓ Handles categorical features natively
│ │ └── Regularization-Based (4/5)
│ │ ├── L1 Regularization (Lasso) (5/5)
│ │ │ ├── sklearn.linear_model.Lasso
│ │ │ └── ✓ Drives coefficients to zero, performs feature selection
│ │ ├── L2 Regularization (Ridge) (4/5)
│ │ │ ├── sklearn.linear_model.Ridge
│ │ │ └── ✓ Shrinks coefficients, prevents overfitting, no explicit selection
│ │ └── Elastic Net (L1+L2) (4/5)
│ │ ├── sklearn.linear_model.ElasticNet
│ │ └── ✓ Combines L1 and L2, robust to correlated features
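│ │
│ │ An embedded-method sketch (tree importances and L1-induced sparsity on one dataset):
│ │     from sklearn.datasets import load_diabetes
│ │     from sklearn.ensemble import RandomForestRegressor
│ │     from sklearn.linear_model import Lasso
│ │     X, y = load_diabetes(return_X_y=True)
│ │     importances = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y).feature_importances_
│ │     kept = Lasso(alpha=0.1).fit(X, y).coef_ != 0  # L1 zeroes out weak features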
│ │
│ └── Hybrid Methods (3/5)
│ ├── SelectFromModel (3/5)
│ │ ├── sklearn.feature_selection.SelectFromModel
│ │ └── ✓ Uses model's feature importance/coefficients to select features
│ └── Boruta Algorithm (2/5)
│ ├── boruta.BorutaPy
│ └── ✓ All-relevant feature selection using Random Forest
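│
│ A SelectFromModel sketch (keeps features above the median importance):
│     from sklearn.datasets import load_breast_cancer
│     from sklearn.ensemble import RandomForestClassifier
│     from sklearn.feature_selection import SelectFromModel
│     X, y = load_breast_cancer(return_X_y=True)
│     sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold="median")
│     X_sel = sfm.fit_transform(X, y)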
│
├── 4. MODEL SELECTION (HIGH IMPORTANCE - Fourth Priority) (5/5)
│ ├── 4.1 Algorithm Selection (5/5)
│ │ ├── Supervised Learning Algorithms (5/5)
│ │ │ ├── Classification (e.g., Logistic Regression, SVM, Decision Trees, Random Forest, XGBoost)
│ │ │ │ ├── sklearn.linear_model
│ │ │ │ ├── sklearn.svm
│ │ │ │ ├── sklearn.tree
│ │ │ │ ├── sklearn.ensemble
│ │ │ │ ├── xgboost
│ │ │ │ └── ✓ Choosing the right algorithm based on problem type and data characteristics
│ │ │ └── Regression (e.g., Linear Regression, Ridge, Lasso, SVR, Gradient Boosting)
│ │ │ ├── sklearn.linear_model
│ │ │ ├── sklearn.svm
│ │ │ ├── sklearn.ensemble
│ │ │ └── ✓ Selecting models for continuous target variables
│ │ └── Unsupervised Learning Algorithms (4/5)
│ │ ├── Clustering (e.g., K-Means, DBSCAN, Hierarchical Clustering)
│ │ │ ├── sklearn.cluster
│ │ │ └── ✓ Grouping similar data points
│ │ └── Dimensionality Reduction (e.g., PCA, t-SNE)
│ │ ├── sklearn.decomposition
│ │ ├── sklearn.manifold
│ │ └── ✓ Reducing feature space for visualization or efficiency
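│ │
│ │ An algorithm-shortlisting sketch (same CV protocol for every candidate):
│ │     from sklearn.datasets import load_iris
│ │     from sklearn.ensemble import RandomForestClassifier
│ │     from sklearn.linear_model import LogisticRegression
│ │     from sklearn.model_selection import cross_val_score
│ │     X, y = load_iris(return_X_y=True)
│ │     for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
│ │         print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())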
│ │
│ ├── 4.2 Hyperparameter Tuning (5/5)
│ │ ├── Grid Search (5/5)
│ │ │ ├── sklearn.model_selection.GridSearchCV
│ │ │ └── ✓ Exhaustive search over a specified parameter grid
│ │ ├── Random Search (4/5)
│ │ │ ├── sklearn.model_selection.RandomizedSearchCV
│ │ │ └── ✓ Random search over parameters from a distribution
│ │ ├── Bayesian Optimization (3/5)
│ │ │ ├── scikit-optimize
│ │ │ ├── hyperopt
│ │ │ └── ✓ Uses probabilistic model to find optimal hyperparameters efficiently
│ │ └── Automated ML (AutoML) (2/5)
│ │ ├── Auto-Sklearn
│ │ ├── TPOT
│ │ └── ✓ Automates hyperparameter tuning and model selection
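│ │
│ │ A grid-search sketch (grid values are illustrative, not recommendations):
│ │     from sklearn.datasets import load_iris
│ │     from sklearn.model_selection import GridSearchCV
│ │     from sklearn.svm import SVC
│ │     X, y = load_iris(return_X_y=True)
│ │     grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5).fit(X, y)
│ │     print(grid.best_params_, grid.best_score_)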
│ │
│ └── 4.3 Cross-Validation Strategies (5/5)
│ ├── K-Fold Cross-Validation (5/5)
│ │ ├── sklearn.model_selection.KFold
│ │ ├── sklearn.model_selection.cross_val_score
│ │ └── ✓ Standard for robust model evaluation, reduces variance
│ ├── Stratified K-Fold (5/5)
│ │ ├── sklearn.model_selection.StratifiedKFold
│ │ └── ✓ Preserves class proportions in each fold, essential for imbalanced data
│ ├── Leave-One-Out Cross-Validation (LOOCV) (2/5)
│ │ ├── sklearn.model_selection.LeaveOneOut
│ │ └── ✓ Nearly unbiased but computationally expensive; reserved for small datasets
│ └── Time Series Cross-Validation (3/5)
│ ├── sklearn.model_selection.TimeSeriesSplit
│ └── ✓ Preserves temporal order, crucial for time series models
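│
│ A CV-strategy sketch (stratified folds vs. forward-chaining time-series folds):
│     import numpy as np
│     from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
│     X, y = np.arange(20).reshape(10, 2), np.array([0, 1] * 5)
│     for tr, te in StratifiedKFold(n_splits=5).split(X, y):
│         pass  # each fold keeps the 50/50 class ratio
│     for tr, te in TimeSeriesSplit(n_splits=3).split(X):
│         pass  # the training window always precedes the test window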
│
├── 5. MODEL TRAINING & EVALUATION (CRITICAL - Fifth Priority) (5/5)
│ ├── 5.1 Model Training (5/5)
│ │ ├── Data Splitting (Train/Test/Validation) (5/5)
│ │ │ ├── sklearn.model_selection.train_test_split
│ │ │ └── ✓ Essential for unbiased evaluation of model performance
│ │ ├── Model Fitting (5/5)
│ │ │ ├── model.fit(X_train, y_train)
│ │ │ └── ✓ The core process of learning patterns from training data
│ │ └── Ensemble Methods (4/5)
│ │ ├── Bagging (e.g., Random Forest) (4/5)
│ │ │ ├── sklearn.ensemble.BaggingClassifier
│ │ │ └── ✓ Training multiple models independently and averaging predictions
│ │ ├── Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost) (5/5)
│ │ │ ├── sklearn.ensemble
│ │ │ ├── xgboost
│ │ │ ├── lightgbm
│ │ │ ├── catboost
│ │ │ └── ✓ Sequentially building models to correct errors of previous models
│ │ └── Stacking (3/5)
│ │ ├── sklearn.ensemble.StackingClassifier
│ │ └── ✓ Training a meta-model on predictions of multiple base models
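│ │
│ │ A train-and-stack sketch (a single base learner keeps the example small):
│ │     from sklearn.datasets import load_iris
│ │     from sklearn.ensemble import RandomForestClassifier, StackingClassifier
│ │     from sklearn.linear_model import LogisticRegression
│ │     from sklearn.model_selection import train_test_split
│ │     X, y = load_iris(return_X_y=True)
│ │     X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
│ │     stack = StackingClassifier(estimators=[("rf", RandomForestClassifier(random_state=0))],
│ │                                final_estimator=LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
│ │     print(stack.score(X_te, y_te))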
│ │
│ ├── 5.2 Model Evaluation (Metrics) (5/5)
│ │ ├── Classification Metrics (5/5)
│ │ │ ├── Accuracy (5/5)
│ │ │ │ ├── sklearn.metrics.accuracy_score
│ │ │ │ └── ✓ Fraction of correct predictions; can mislead on imbalanced data
│ │ │ ├── Precision, Recall, F1-Score (5/5)
│ │ │ │ ├── sklearn.metrics.precision_score
│ │ │ │ ├── sklearn.metrics.recall_score
│ │ │ │ ├── sklearn.metrics.f1_score
│ │ │ │ └── ✓ For evaluating classification model performance
│ │ │ ├── ROC AUC (4/5)
│ │ │ │ ├── sklearn.metrics.roc_auc_score
│ │ │ │ └── ✓ Measures classifier's ability to distinguish between classes
│ │ │ └── Confusion Matrix (5/5)
│ │ │ ├── sklearn.metrics.confusion_matrix
│ │ │ └── ✓ Visualizes performance of a classification model
│ │ └── Regression Metrics (5/5)
│ │ ├── Mean Squared Error (MSE), Root Mean Squared Error (RMSE) (5/5)
│ │ │ ├── sklearn.metrics.mean_squared_error
│ │ │ └── ✓ Common metrics for regression, penalize larger errors more
│ │ ├── Mean Absolute Error (MAE) (4/5)
│ │ │ ├── sklearn.metrics.mean_absolute_error
│ │ │ └── ✓ Less sensitive to outliers than MSE
│ │ └── R-squared ($R^2$) (4/5)
│ │ ├── sklearn.metrics.r2_score
│ │ └── ✓ Proportion of variance in dependent variable predictable from independent variables
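│ │
│ │ A metrics sketch (toy labels; RMSE taken as the root of MSE for version safety):
│ │     from sklearn.metrics import confusion_matrix, f1_score, mean_squared_error, r2_score
│ │     y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]
│ │     print(f1_score(y_true, y_pred), confusion_matrix(y_true, y_pred))
│ │     yt, yp = [2.0, 3.5, 4.0], [2.1, 3.0, 4.2]
│ │     print(mean_squared_error(yt, yp) ** 0.5, r2_score(yt, yp))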
│ │
│ └── 5.3 Overfitting/Underfitting Diagnosis (5/5)
│ ├── Learning Curves (4/5)
│ │ ├── sklearn.model_selection.learning_curve
│ │ └── ✓ Visualizing model performance with increasing training data size
│ ├── Validation Curves (4/5)
│ │ ├── sklearn.model_selection.validation_curve
│ │ └── ✓ Visualizing model performance with varying hyperparameter values
│ └── Bias-Variance Trade-off (5/5)
│ ├── Conceptual understanding
│ └── ✓ Balancing model complexity to minimize generalization error
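│
│ A diagnosis sketch (a persistent train-validation gap suggests overfitting):
│     from sklearn.datasets import load_iris
│     from sklearn.model_selection import learning_curve
│     from sklearn.tree import DecisionTreeClassifier
│     X, y = load_iris(return_X_y=True)
│     sizes, train_s, val_s = learning_curve(DecisionTreeClassifier(), X, y, cv=5)
│     gap = train_s.mean(axis=1) - val_s.mean(axis=1)  # large gap = high variance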
│
├── 6. REGULARIZATION STRATEGIES (MEDIUM IMPORTANCE - Sixth Priority) (4/5)
│ ├── Linear Model Regularization (4/5)
│ │ ├── L1 Regularization (Lasso) (5/5)
│ │ │ ├── sklearn.linear_model.Lasso
│ │ │ └── ✓ Feature selection property, drives weights to zero
│ │ ├── L2 Regularization (Ridge) (4/5)
│ │ │ ├── sklearn.linear_model.Ridge
│ │ │ └── ✓ Shrinks coefficients, prevents overfitting, no explicit selection
│ │ ├── Elastic Net (L1+L2) (4/5)
│ │ │ ├── sklearn.linear_model.ElasticNet
│ │ │ └── ✓ Hybrid of L1 and L2, handles correlated features well
│ │ └── Group Lasso (2/5)
│ │ ├── group_lasso.GroupLasso (group-lasso package)
│ │ └── ✓ Selects/eliminates groups of features together
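│ │
│ │ An L1-vs-L2 sketch (alpha=1.0 is illustrative; tune it by cross-validation):
│ │     from sklearn.datasets import load_diabetes
│ │     from sklearn.linear_model import Lasso, Ridge
│ │     X, y = load_diabetes(return_X_y=True)
│ │     n_zero_l1 = (Lasso(alpha=1.0).fit(X, y).coef_ == 0).sum()  # exact zeros: feature selection
│ │     n_zero_l2 = (Ridge(alpha=1.0).fit(X, y).coef_ == 0).sum()  # shrunk, but rarely zero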
│ │
│ ├── Tree-Based Regularization (3/5)
│ │ ├── Max Depth Control (4/5)
│ │ │ ├── sklearn.tree.DecisionTreeClassifier(max_depth)
│ │ │ ├── xgboost.XGBClassifier(max_depth)
│ │ │ └── ✓ Limits tree growth, prevents overfitting
│ │ ├── Min Samples Split/Leaf (4/5)
│ │ │ ├── sklearn.tree.DecisionTreeClassifier(min_samples_split)
│ │ │ ├── sklearn.tree.DecisionTreeClassifier(min_samples_leaf)
│ │ │ └── ✓ Controls minimum samples required to split/form a leaf
│ │ └── Feature Subsetting (e.g., Random Forest) (3/5)
│ │ ├── sklearn.ensemble.RandomForestClassifier(max_features)
│ │ └── ✓ Randomly selects a subset of features for each tree
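│ │
│ │ A tree-constraint sketch (depth and leaf-size limits rein in memorization):
│ │     from sklearn.datasets import load_iris
│ │     from sklearn.tree import DecisionTreeClassifier
│ │     X, y = load_iris(return_X_y=True)
│ │     shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)
│ │     deep = DecisionTreeClassifier().fit(X, y)  # unconstrained tree can memorize the data
│ │     print(shallow.get_depth(), deep.get_depth())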
│ │
│ └── Neural Network Regularization (3/5)
│ ├── Dropout (4/5)
│ │ ├── tensorflow.keras.layers.Dropout
│ │ ├── torch.nn.Dropout
│ │ └── ✓ Randomly sets a fraction of input units to 0 at each update
│ ├── Batch Normalization (4/5)
│ │ ├── tensorflow.keras.layers.BatchNormalization
│ │ └── ✓ Normalizes layer inputs, reduces internal covariate shift
│ └── Weight Decay (L1/L2 penalty) (3/5)
│ ├── tensorflow.keras.regularizers.l1_l2
│ ├── torch.optim.Adam(weight_decay)
│ └── ✓ Adds penalty to weights, similar to L1/L2 regularization
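│
│ A combined Keras sketch (assumes tensorflow is installed; layer sizes and rates are illustrative):
│     from tensorflow import keras
│     model = keras.Sequential([
│         keras.layers.Dense(64, activation="relu",
│                            kernel_regularizer=keras.regularizers.l1_l2(l1=1e-5, l2=1e-4)),
│         keras.layers.BatchNormalization(),
│         keras.layers.Dropout(0.3),  # drops 30% of units each training step
│         keras.layers.Dense(1, activation="sigmoid"),
│     ])
│     model.compile(optimizer="adam", loss="binary_crossentropy")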
│
└── 7. DIMENSIONALITY REDUCTION (CONTEXT-DEPENDENT - Seventh Priority) (1/5)
├── Linear Methods (3/5)
│ ├── Principal Component Analysis (PCA) (4/5)
│ │ ├── sklearn.decomposition.PCA
│ │ └── ✓ Transforms data to orthogonal components, retains variance
│ ├── Linear Discriminant Analysis (LDA) (3/5)
│ │ ├── sklearn.discriminant_analysis.LinearDiscriminantAnalysis
│ │ └── ✓ Maximizes class separability, supervised
│ ├── Independent Component Analysis (ICA) (2/5)
│ │ ├── sklearn.decomposition.FastICA
│ │ └── ✓ Separates multivariate signal into independent components
│ └── Factor Analysis (2/5)
│ ├── sklearn.decomposition.FactorAnalysis
│ └── ✓ Explains variance using a smaller number of latent factors
│
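│
│ A PCA sketch for the linear methods above (keeps components covering 95% of variance):
│     from sklearn.datasets import load_digits
│     from sklearn.decomposition import PCA
│     X, _ = load_digits(return_X_y=True)
│     X_low = PCA(n_components=0.95).fit_transform(X)
│     print(X.shape[1], "->", X_low.shape[1])  # 64 pixel features shrink substantially
│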
├── Non-Linear Methods (3/5)
│ ├── t-Distributed Stochastic Neighbor Embedding (t-SNE) (4/5)
│ │ ├── sklearn.manifold.TSNE
│ │ └── ✓ Best for visualization, preserves local structure
│ ├── UMAP (Uniform Manifold Approximation and Projection) (4/5)
│ │ ├── umap.UMAP (umap-learn package)
│ │ └── ✓ Faster than t-SNE, good for visualization and general embedding
│ ├── Kernel PCA (2/5)
│ │ ├── sklearn.decomposition.KernelPCA
│ │ └── ✓ Non-linear PCA via the kernel trick
│ └── Autoencoders (3/5)
│ ├── tensorflow.keras.models.Sequential
│ ├── torch.nn.Module
│ └── ✓ Neural network for learning compressed data representation
│
└── Sparse Methods (2/5)
├── Sparse PCA (2/5)
│ ├── sklearn.decomposition.SparsePCA
│ └── ✓ PCA with sparse components, improves interpretability
└── Dictionary Learning (2/5)
├── sklearn.decomposition.DictionaryLearning
└── ✓ Learns a dictionary of sparse components
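
A t-SNE sketch for 2-D visualization (perplexity must stay below the sample count):
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    X, y = load_digits(return_X_y=True)
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)  # shape (n_samples, 2)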