2. FEATURE ENGINEERING (HIGH IMPORTANCE - Second Priority) (4/5)
│
├── Categorical Encoding (4/5)
│   ├── One-Hot Encoding (5/5)
│   │   ├── sklearn.preprocessing.OneHotEncoder
│   │   ├── pandas.get_dummies()
│   │   └── ✓ Best for nominal categories, avoids ordinal assumptions
│   ├── Label Encoding (4/5)
│   │   ├── sklearn.preprocessing.LabelEncoder
│   │   └── ✓ Intended for target labels (y); codes follow sorted order, so use OrdinalEncoder for ordered features
│   ├── Ordinal Encoding (4/5)
│   │   ├── sklearn.preprocessing.OrdinalEncoder
│   │   └── ✓ Preserves ordinal relationships when the category order is passed explicitly
│   ├── Target Encoding (3/5)
│   │   ├── category_encoders.TargetEncoder
│   │   ├── category_encoders.MEstimateEncoder
│   │   └── ✓ Uses target statistics, good for high-cardinality categories; fit on training folds only to avoid leakage
│   ├── Binary Encoding (2/5)
│   │   ├── category_encoders.BinaryEncoder
│   │   └── ✓ Reduces dimensionality compared to one-hot
│   ├── Hash Encoding (2/5)
│   │   ├── category_encoders.HashingEncoder
│   │   └── ✓ Fixed-size output, handles unseen categories (hash collisions possible)
│   └── Frequency Encoding (2/5)
│       ├── category_encoders.CountEncoder
│       └── ✓ Replaces categories with their occurrence count (normalize=True gives relative frequency)
│
├── Numerical Transformations (4/5)
│   ├── Scaling & Normalization (5/5)
│   │   ├── StandardScaler (Z-score) (5/5)
│   │   │   ├── sklearn.preprocessing.StandardScaler
│   │   │   └── ✓ Mean=0, Std=1, best for roughly normally distributed data
│   │   ├── MinMaxScaler (5/5)
│   │   │   ├── sklearn.preprocessing.MinMaxScaler
│   │   │   └── ✓ Scales to [0,1] range by default, preserves distribution shape, sensitive to outliers
│   │   ├── RobustScaler (4/5)
│   │   │   ├── sklearn.preprocessing.RobustScaler
│   │   │   └── ✓ Uses median and IQR, robust to outliers
│   │   ├── MaxAbsScaler (3/5)
│   │   │   ├── sklearn.preprocessing.MaxAbsScaler
│   │   │   └── ✓ Scales by maximum absolute value, preserves sparsity
│   │   └── Normalizer (3/5)
│   │       ├── sklearn.preprocessing.Normalizer
│   │       └── ✓ Scales individual samples (rows) to unit norm
│   ├── Distribution Transformation (4/5)
│   │   ├── Log Transform (4/5)
│   │   │   ├── numpy.log1p()
│   │   │   ├── numpy.log()
│   │   │   └── ✓ Reduces right skewness; log() needs strictly positive values, log1p() also accepts zero
│   │   ├── Square Root Transform (3/5)
│   │   │   ├── numpy.sqrt()
│   │   │   └── ✓ Moderate skewness reduction for non-negative values
│   │   ├── Box-Cox Transform (3/5)
│   │   │   ├── scipy.stats.boxcox()
│   │   │   ├── sklearn.preprocessing.PowerTransformer(method='box-cox')
│   │   │   └── ✓ Power transformation with fitted lambda, strictly positive data only
│   │   └── Yeo-Johnson Transform (3/5)
│   │       ├── sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
│   │       └── ✓ Handles both positive and negative values
│   └── Binning/Discretization (4/5)
│       ├── Equal-Width Binning (4/5)
│       │   ├── sklearn.preprocessing.KBinsDiscretizer(strategy='uniform')
│       │   ├── pandas.cut()
│       │   └── ✓ Equal-width intervals, bin counts may be unequal
│       ├── Equal-Frequency Binning (4/5)
│       │   ├── sklearn.preprocessing.KBinsDiscretizer(strategy='quantile')
│       │   ├── pandas.qcut()
│       │   └── ✓ Roughly equal sample counts per bin
│       └── K-Means Binning (3/5)
│           ├── sklearn.preprocessing.KBinsDiscretizer(strategy='kmeans')
│           └── ✓ Bin edges derived from 1-D K-means clustering of the values
│
└── Feature Creation (4/5)
    ├── Polynomial Features (4/5)
    │   ├── sklearn.preprocessing.PolynomialFeatures
    │   └── ✓ Creates polynomial and interaction terms
    ├── Interaction Features (4/5)
    │   ├── sklearn.preprocessing.PolynomialFeatures(interaction_only=True)
    │   └── ✓ Only products of distinct features, no pure power terms
    ├── Domain-Specific Features (5/5)
    │   ├── Date/Time Features (5/5)
    │   │   ├── pandas Series.dt.year, Series.dt.month, Series.dt.dayofweek
    │   │   ├── featuretools.primitives (e.g. Year, Month, Weekday)
    │   │   └── ✓ Extract temporal patterns and cyclical features
    │   ├── Text Features (TF-IDF, N-grams) (4/5)
    │   │   ├── sklearn.feature_extraction.text.TfidfVectorizer
    │   │   ├── sklearn.feature_extraction.text.CountVectorizer
    │   │   └── ✓ Convert text to numerical features
    │   └── Geospatial Features (3/5)
    │       ├── geopy.distance
    │       └── ✓ Distance calculations, coordinate transformations
    └── Automated Feature Engineering (2/5)
        ├── featuretools.dfs()
        ├── tsfresh.extract_features()
        └── ✓ Automatically generates features from relational and time-series data
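
Several of the branches above (one-hot encoding, ordinal encoding, a log transform followed by standard scaling, and date/time extraction) can be combined in one preprocessing step. The block below is a minimal sketch, not a recommended recipe: the toy DataFrame, its column names, and the category order for "size" are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],          # nominal
    "size": ["small", "large", "medium", "small"],      # ordinal
    "income": [20_000.0, 35_000.0, 1_200_000.0, 48_000.0],  # right-skewed
    "signup": pd.to_datetime(
        ["2023-01-15", "2023-06-30", "2024-02-29", "2024-12-01"]
    ),
})

# Date/time features: pull calendar parts out of the timestamp column.
df["signup_month"] = df["signup"].dt.month
df["signup_dow"] = df["signup"].dt.dayofweek

# Log-transform the skewed column, then standardize it.
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        # Nominal category: one column per level, unseen levels become all zeros.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        # Ordinal category: the order is stated explicitly (assumed here).
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
        ("skewed", log_then_scale, ["income"]),
        ("dates", "passthrough", ["signup_month", "signup_dow"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

The output columns follow the order of the transformer list: three one-hot columns for "color", one ordinal column for "size", one scaled log-income column, then the two date features. In a real project the transformer would be fitted on the training split only and then applied to validation and test data.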