1. DATA PREPROCESSING & CLEANING (CRITICAL - First Priority) (5/5)
│ ├── 1.1 Data Cleaning (5/5)
│ │ ├── Handling Inconsistent Data (5/5)
│ │ │ ├── pandas.DataFrame.replace()
│ │ │ ├── regex
│ │ │ └── ✓ Correcting typos, standardizing formats
│ │ ├── Removing Duplicates (5/5)
│ │ │ ├── pandas.DataFrame.drop_duplicates()
│ │ │ └── ✓ Ensures unique records
│ │ └── Handling Noise (4/5)
│ │   ├── Binning (4/5)
│ │   ├── Regression (3/5)
│ │   └── Clustering (3/5)
│ │
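Example for 1.1: a minimal sketch of the cleaning steps above on a toy frame; the `city`/`price` columns, the spelling map, and the two-bin smoothing are illustrative assumptions, not tied to any real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["new  york", "New York", "NYC", "Boston", "Boston"],
    "price": [10.0, 10.0, 12.5, 200.0, 9.0],
})

# Handling inconsistent data: regex to normalize whitespace, replace() to fix spellings
df["city"] = df["city"].str.replace(r"\s+", " ", regex=True).str.strip()
df["city"] = df["city"].replace({"new york": "New York", "NYC": "New York"})

# Removing duplicates: keep only the first occurrence of identical rows
df = df.drop_duplicates()

# Handling noise by binning: smooth the numeric column into equal-frequency bins
df["price_bin"] = pd.qcut(df["price"], q=2, labels=["low", "high"])
print(df)
```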
│ ├── 1.2 Imputation Strategies (Handling Missing Values) (5/5)
│ │ ├── Simple Imputation (5/5)
│ │ │ ├── Mean/Median/Mode Imputation (5/5)
│ │ │ │ ├── sklearn.impute.SimpleImputer
│ │ │ │ ├── pandas.DataFrame.fillna()
│ │ │ │ └── ✓ Fast, simple baseline for numerical/categorical data
│ │ │ ├── Forward/Backward Fill (4/5)
│ │ │ │ ├── pandas.DataFrame.ffill()
│ │ │ │ ├── pandas.DataFrame.bfill()
│ │ │ │ └── ✓ Ideal for time-series data with temporal dependencies
│ │ │ └── Constant Value Imputation (4/5)
│ │ │   ├── sklearn.impute.SimpleImputer(strategy='constant')
│ │ │   └── ✓ Domain-specific constants (e.g., 0, 'Unknown', 'Missing')
│ │ ├── Advanced Imputation (4/5)
│ │ │ ├── Iterative Imputation (MICE) (4/5)
│ │ │ │ ├── sklearn.impute.IterativeImputer
│ │ │ │ ├── fancyimpute.IterativeImputer
│ │ │ │ └── ✓ Robust multivariate imputation using chained equations
│ │ │ ├── K-Nearest Neighbors Imputation (4/5)
│ │ │ │ ├── sklearn.impute.KNNImputer
│ │ │ │ └── ✓ Preserves feature relationships, good for mixed data types
│ │ │ └── Matrix Factorization (3/5)
│ │ │   ├── fancyimpute.MatrixFactorization
│ │ │   ├── fancyimpute.NuclearNormMinimization
│ │ │   └── ✓ Advanced technique for high-dimensional sparse data
│ │ └── Domain-Specific Imputation (3/5)
│ │   ├── Time Series Interpolation (3/5)
│ │   │ ├── pandas.DataFrame.interpolate()
│ │   │ ├── scipy.interpolate.interp1d
│ │   │ └── ✓ Linear, polynomial, spline interpolation for time series
│ │   └── Seasonal Decomposition (3/5)
│ │     ├── statsmodels.tsa.seasonal.seasonal_decompose
│ │     └── ✓ Handles seasonal patterns in time series data
│ │
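Example for 1.2 (simple imputation): a minimal sketch on a made-up mixed-type frame; the sentinel values (0, 'Unknown') are domain assumptions, and the ffill/bfill pair stands in for ordered time-series data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Boston", np.nan, "Boston", "Austin"],
})

# Constant-value imputation with domain-specific sentinels (returns a new frame)
df_const = df.fillna({"age": 0, "city": "Unknown"})

# Mean imputation for the numeric column ('median' and 'most_frequent' work the same way)
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Mode imputation for the categorical column via pandas
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward/backward fill for ordered (time-series) data
ts = pd.Series([1.0, np.nan, np.nan, 4.0])
ts_filled = ts.ffill().bfill()
```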
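Example for 1.2 (advanced imputation): a sketch of KNN and iterative (MICE-style) imputation on hypothetical `height`/`weight` data; note that sklearn's IterativeImputer still sits behind an experimental-enable import, and `n_neighbors`/`random_state` are arbitrary choices here.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- required to expose IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

X = pd.DataFrame({
    "height": [1.60, 1.75, np.nan, 1.82],
    "weight": [55.0, np.nan, 80.0, 90.0],
})

# KNN imputation: fill each gap from the k nearest rows (nan-aware Euclidean distance)
X_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X), columns=X.columns)

# Iterative (MICE-style) imputation: model each feature from the others over several rounds
X_mice = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(X), columns=X.columns)
```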
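Example for 1.2 (domain-specific imputation): a sketch on a synthetic monthly series; time-based interpolation uses the DatetimeIndex, and seasonal_decompose runs after filling because it expects a gap-free series. The seasonal shape and period=12 are assumptions of the toy data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a mild trend, yearly seasonality, and a few gaps
idx = pd.date_range("2022-01-01", periods=36, freq="MS")
y = pd.Series(10 + 0.1 * np.arange(36) + 3 * np.sin(2 * np.pi * np.arange(36) / 12), index=idx)
y.iloc[[7, 8, 20]] = np.nan

# Time-series interpolation: 'time' weights the fill by the actual timestamps
y_filled = y.interpolate(method="time")

# Seasonal decomposition: separate trend / seasonal / residual components
components = seasonal_decompose(y_filled, model="additive", period=12)
trend, seasonal = components.trend, components.seasonal
```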
│ ├── 1.3 Outlier Detection & Treatment (4/5)
│ │ ├── Statistical Methods (4/5)
│ │ │ ├── Z-Score Method (4/5)
│ │ │ │ ├── scipy.stats.zscore
│ │ │ │ └── ✓ Identifies outliers beyond 3 standard deviations
│ │ │ ├── Interquartile Range (IQR) (4/5)
│ │ │ │ ├── pandas.DataFrame.quantile()
│ │ │ │ └── ✓ Robust to extreme values; flags points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
│ │ │ └── Modified Z-Score (3/5)
│ │ │   ├── scipy.stats.median_abs_deviation
│ │ │   └── ✓ Uses the median and MAD instead of the mean and standard deviation; more robust to extremes
│ │ └── Machine Learning Methods (3/5)
│ │   ├── Isolation Forest (3/5)
│ │   │ ├── sklearn.ensemble.IsolationForest
│ │   │ └── ✓ Unsupervised anomaly detection, good for high dimensions
│ │   ├── Local Outlier Factor (LOF) (3/5)
│ │   │ ├── sklearn.neighbors.LocalOutlierFactor
│ │   │ └── ✓ Density-based outlier detection
│ │   └── One-Class SVM (2/5)
│ │     ├── sklearn.svm.OneClassSVM
│ │     └── ✓ Kernel-based outlier detection
│ │
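Example for 1.3 (statistical methods): the z-score, IQR, and modified z-score rules on synthetic data with one planted outlier; the 3-sigma, 1.5*IQR, and 3.5 cutoffs are the usual conventions, not fixed rules.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 120.0))  # one planted outlier

# Z-score: flag points more than 3 standard deviations from the mean
z = np.abs(stats.zscore(s))
z_outliers = s[z > 3]

# IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Modified z-score: median and MAD instead of mean and standard deviation
mad = stats.median_abs_deviation(s, scale="normal")
mod_z_outliers = s[np.abs((s - s.median()) / mad) > 3.5]
```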
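Example for 1.3 (ML methods): Isolation Forest and LOF on 2-D synthetic data with one planted anomaly; `contamination` and `n_neighbors` are illustrative values to tune, and One-Class SVM (sklearn.svm.OneClassSVM) follows the same fit/predict pattern.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])  # last row is the planted anomaly

# Isolation Forest: anomalies are isolated in fewer random splits (-1 = outlier, 1 = inlier)
iso_labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)

# Local Outlier Factor: compares each point's local density to its neighbours' (-1 = outlier)
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

print(np.where(iso_labels == -1)[0], np.where(lof_labels == -1)[0])
```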
│ ├── 1.4 Data Integration (3/5)
│ │ ├── Schema Integration (3/5)
│ │ │ ├── pandas.DataFrame.merge()
│ │ │ └── ✓ Combining data from heterogeneous sources
│ │ └── Entity Resolution (2/5)
│ │   ├── recordlinkage
│ │   └── ✓ Identifying and linking records that refer to the same entity
│ │
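Example for 1.4 (schema integration): a sketch joining two hypothetical sources on a shared key; the table and column names are made up. Entity resolution, when there is no clean shared key, is the job of the recordlinkage package (not shown here).

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann Lee", "Bo Chen", "Cara Diaz"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [20.0, 35.0, 15.0]})

# Schema integration: combine the heterogeneous sources on their shared key
combined = customers.merge(orders, on="customer_id", how="left")
```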
│ └── 1.5 Data Transformation (4/5)
│   ├── Aggregation (4/5)
│   │ ├── pandas.DataFrame.groupby().agg()
│   │ └── ✓ Summarizing data (e.g., sum, mean, count)
│   ├── Generalization (3/5)
│   │ ├── Custom logic
│   │ └── ✓ Replacing low-level data with high-level concepts (e.g., age ranges)
│   └── Normalization (Data Warehousing Context) (3/5)
│     ├── SQL techniques
│     └── ✓ Reducing redundancy and improving integrity via database normal forms (1NF-3NF); distinct from feature scaling
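Example for 1.5: aggregation with groupby().agg() and generalization of raw ages into coarser ranges; the table, the bin edges, and the labels are illustrative assumptions. Warehouse-style normalization happens on the SQL side and is not shown.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "age": [23, 67, 45, 31],
    "amount": [100.0, 250.0, 80.0, 120.0],
})

# Aggregation: summarize amount per region (sum, mean, count)
summary = sales.groupby("region").agg(total=("amount", "sum"),
                                      mean_amount=("amount", "mean"),
                                      n_orders=("amount", "count"))

# Generalization: replace raw ages with high-level age ranges
sales["age_group"] = pd.cut(sales["age"], bins=[0, 30, 60, 120],
                            labels=["young", "middle-aged", "senior"])
```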