1. DATA PREPROCESSING & CLEANING (CRITICAL - First Priority) (5/5)
│
├── 1.1 Data Cleaning (5/5)
│   ├── Handling Inconsistent Data (5/5)
│   │   ├── pandas.DataFrame.replace()
│   │   ├── regex
│   │   └── ✓ Correcting typos, standardizing formats
│   ├── Removing Duplicates (5/5)
│   │   ├── pandas.DataFrame.drop_duplicates()
│   │   └── ✓ Ensures unique records
│   └── Handling Noise (4/5)
│       ├── Binning (4/5)
│       ├── Regression (3/5)
│       └── Clustering (3/5)
│
├── 1.2 Imputation Strategies (Handling Missing Values) (5/5)
│   ├── Simple Imputation (5/5)
│   │   ├── Mean/Median/Mode Imputation (5/5)
│   │   │   ├── sklearn.impute.SimpleImputer
│   │   │   ├── pandas.DataFrame.fillna()
│   │   │   └── ✓ Fast, simple baseline for numerical/categorical data
│   │   ├── Forward/Backward Fill (4/5)
│   │   │   ├── pandas.DataFrame.ffill()
│   │   │   ├── pandas.DataFrame.bfill()
│   │   │   └── ✓ Ideal for time-series data with temporal dependencies
│   │   └── Constant Value Imputation (4/5)
│   │       ├── sklearn.impute.SimpleImputer(strategy='constant')
│   │       └── ✓ Domain-specific constants (e.g., 0, 'Unknown', 'Missing')
│   ├── Advanced Imputation (4/5)
│   │   ├── Iterative Imputation (MICE) (4/5)
│   │   │   ├── sklearn.impute.IterativeImputer
│   │   │   ├── fancyimpute.IterativeImputer
│   │   │   └── ✓ Robust multivariate imputation using chained equations
│   │   ├── K-Nearest Neighbors Imputation (4/5)
│   │   │   ├── sklearn.impute.KNNImputer
│   │   │   └── ✓ Preserves feature relationships, good for mixed data types
│   │   └── Matrix Factorization (3/5)
│   │       ├── fancyimpute.MatrixFactorization
│   │       ├── fancyimpute.NuclearNormMinimization
│   │       └── ✓ Advanced technique for high-dimensional sparse data
│   └── Domain-Specific Imputation (3/5)
│       ├── Time Series Interpolation (3/5)
│       │   ├── pandas.DataFrame.interpolate()
│       │   ├── scipy.interpolate.interp1d
│       │   └── ✓ Linear, polynomial, spline interpolation for time series
│       └── Seasonal Decomposition (3/5)
│           ├── statsmodels.tsa.seasonal.seasonal_decompose
│           └── ✓ Handles seasonal patterns in time series data
│
├── 1.3 Outlier Detection & Treatment (4/5)
│   ├── Statistical Methods (4/5)
│   │   ├── Z-Score Method (4/5)
│   │   │   ├── scipy.stats.zscore
│   │   │   └── ✓ Identifies outliers beyond 3 standard deviations
│   │   ├── Interquartile Range (IQR) (4/5)
│   │   │   ├── pandas.DataFrame.quantile()
│   │   │   └── ✓ Robust to extreme values, uses Q1 and Q3
│   │   └── Modified Z-Score (3/5)
│   │       ├── scipy.stats.median_abs_deviation
│   │       └── ✓ Uses median instead of mean, more robust
│   └── Machine Learning Methods (3/5)
│       ├── Isolation Forest (3/5)
│       │   ├── sklearn.ensemble.IsolationForest
│       │   └── ✓ Unsupervised anomaly detection, good for high dimensions
│       ├── Local Outlier Factor (LOF) (3/5)
│       │   ├── sklearn.neighbors.LocalOutlierFactor
│       │   └── ✓ Density-based outlier detection
│       └── One-Class SVM (2/5)
│           ├── sklearn.svm.OneClassSVM
│           └── ✓ Kernel-based outlier detection
│
├── 1.4 Data Integration (3/5)
│   ├── Schema Integration (3/5)
│   │   ├── pandas.DataFrame.merge()
│   │   └── ✓ Combining data from heterogeneous sources
│   └── Entity Resolution (2/5)
│       ├── recordlinkage
│       └── ✓ Identifying and linking records that refer to the same entity
│
└── 1.5 Data Transformation (4/5)
    ├── Aggregation (4/5)
    │   ├── pandas.DataFrame.groupby().agg()
    │   └── ✓ Summarizing data (e.g., sum, mean, count)
    ├── Generalization (3/5)
    │   ├── Custom logic
    │   └── ✓ Replacing low-level data with high-level concepts (e.g., age ranges)
    └── Normalization (Data Warehousing Context) (3/5)
        ├── SQL techniques
        └── ✓ Reducing data redundancy and improving data integrity
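The imputation strategies in 1.2 can be sketched with the listed APIs. A minimal example on toy data (the column names and values are illustrative, not from the source):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing values (hypothetical example)
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, np.nan, 58_000.0],
})

# Simple baseline: fill each column with its median
simple = SimpleImputer(strategy="median")
df_simple = pd.DataFrame(simple.fit_transform(df), columns=df.columns)

# KNN imputation: each missing entry is estimated from the
# k most similar rows, preserving relationships between features
knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

# Forward fill, suited to time series with temporal dependencies
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s_filled = s.ffill()  # → [1.0, 1.0, 1.0, 4.0]
```

Note that `sklearn.impute.IterativeImputer` (the MICE variant above) additionally requires `from sklearn.experimental import enable_iterative_imputer` before import, as it is still marked experimental.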
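The outlier detectors in 1.3 can be compared on the same series. A sketch with two planted extreme points (synthetic data, thresholds as listed in the tree):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 100 normal points plus two planted outliers (synthetic)
s = pd.Series(np.concatenate([rng.normal(0, 1, 100), [8.0, -9.0]]))

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score rule: flag |z| > 3 (sensitive to the outliers it hunts,
# hence the modified z-score via median_abs_deviation in the tree)
z_mask = np.abs(stats.zscore(s.to_numpy())) > 3

# Isolation Forest: unsupervised; fit_predict returns -1 for anomalies
iso = IsolationForest(contamination=0.02, random_state=0)
iso_mask = iso.fit_predict(s.to_frame()) == -1
```

Treatment (capping, removal, or winsorizing) is a separate decision once the masks agree on which points are suspect.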
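The cleaning and transformation leaves (1.1 and 1.5) chain naturally in pandas. A small sketch using made-up records; `pd.cut` stands in for the "custom logic" generalization node:

```python
import pandas as pd

# Hypothetical records with an inconsistent label and a duplicate row
df = pd.DataFrame({
    "city": ["NYC", "N.Y.C.", "Boston", "NYC", "Boston"],
    "sales": [100, 120, 90, 100, 110],
    "age": [23, 37, 45, 23, 61],
})

# 1.1 Handling inconsistent data: standardize label variants
df["city"] = df["city"].replace({"N.Y.C.": "NYC"})

# 1.1 Removing duplicates: drop exact duplicate records
df = df.drop_duplicates()

# 1.5 Generalization: replace raw ages with high-level age ranges
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["<30", "30-50", "50+"])

# 1.5 Aggregation: summarize sales per city
summary = df.groupby("city")["sales"].agg(["sum", "mean", "count"])
```

Note that the replace must run before drop_duplicates here, since the two "NYC" variants only become exact duplicates after standardization.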
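For schema integration (1.4), `pandas.DataFrame.merge()` joins heterogeneous sources on a shared key. A minimal sketch with invented tables:

```python
import pandas as pd

# Two hypothetical sources sharing a key column
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ana", "Bo", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [20.0, 35.0, 15.0]})

# Left join keeps every customer; those without orders get NaN,
# which then feeds back into the imputation step (1.2)
merged = customers.merge(orders, on="cust_id", how="left")
```

Fuzzy matching when the sources lack a clean shared key is the entity-resolution case, which is what the `recordlinkage` leaf covers.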