Post

Feature Engineering for Tabular ML: A Python Guide

Learn practical feature engineering techniques in Python for tabular machine learning — encoding, scaling, binning, interaction features, and target encoding with real examples.

Feature Engineering for Tabular ML: A Python Guide

Why Features Matter More Than the Model

In tabular ML, a well-engineered feature set with a simple model usually beats a complex model with raw, untransformed columns. This guide covers the techniques that consistently move the needle — the same groundwork used in Fraud Detection with Machine Learning before any model gets touched.

1
pip install pandas scikit-learn category_encoders numpy

Setting Up a Sample Dataset

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 34, 45, 23, 52, 38],
    "income": [42000, 58000, 95000, 31000, 110000, 67000],
    "city": ["NYC", "LA", "NYC", "SF", "LA", "SF"],
    "signup_date": pd.to_datetime(["2024-01-15", "2023-06-20", "2022-11-03", "2024-03-10", "2021-08-25", "2023-09-12"]),
    "purchased": [0, 1, 1, 0, 1, 0],
})
print(df)

Encoding Categorical Variables

One-hot encoding works for low-cardinality columns, but a high-cardinality column like city (with hundreds of values) explodes into too many sparse columns. Target encoding handles that better:

1
2
3
4
5
6
7
8
9
# One-hot for low cardinality
df_onehot = pd.get_dummies(df, columns=["city"], prefix="city")

# Target encoding for high cardinality
from category_encoders import TargetEncoder

encoder = TargetEncoder(cols=["city"])
df["city_target_enc"] = encoder.fit_transform(df["city"], df["purchased"])
print(df[["city", "city_target_enc"]])
1
2
3
4
5
6
7
  city  city_target_enc
0  NYC          0.612
1   LA          0.598
2  NYC          0.612
3   SF          0.401
4   LA          0.598
5   SF          0.401

Target encoding replaces each category with the average target value for that category — but it must be fit only on training data, or it leaks the label into the features.

Extracting Date Features

A raw timestamp is nearly useless to a model. Decompose it into signal the model can actually use:

1
2
3
4
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["days_since_signup"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
df["signup_is_weekend"] = df["signup_date"].dt.dayofweek >= 5

days_since_signup (tenure) is almost always more predictive than the raw date — it captures “how established is this customer” directly.

Binning Continuous Variables

Binning can help tree-based models find non-linear thresholds faster, and makes features interpretable:

1
2
3
4
5
6
7
df["age_bracket"] = pd.cut(
    df["age"],
    bins=[0, 25, 35, 50, 100],
    labels=["18-25", "26-35", "36-50", "50+"],
)

df["income_quantile"] = pd.qcut(df["income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

pd.qcut bins by quantile rather than fixed ranges — useful when the distribution is skewed and fixed bins would put most rows in one bucket.

Creating Interaction Features

Sometimes the relationship between two features matters more than either alone:

1
2
df["income_per_age"] = df["income"] / df["age"]
df["high_income_young"] = ((df["income"] > df["income"].median()) & (df["age"] < 35)).astype(int)

high_income_young is a manually-crafted interaction a linear model can’t discover on its own — tree-based models can find some interactions automatically, but explicit ones still often help.

Scaling Numeric Features

Required for linear models, SVMs, and neural networks; not needed for tree-based models (Random Forest, XGBoost):

1
2
3
4
5
6
7
8
from sklearn.preprocessing import StandardScaler, RobustScaler

scaler = StandardScaler()
df["income_scaled"] = scaler.fit_transform(df[["income"]])

# RobustScaler is better when outliers are present — uses median/IQR instead of mean/std
robust_scaler = RobustScaler()
df["income_robust_scaled"] = robust_scaler.fit_transform(df[["income"]])

Putting It in a Reusable Pipeline

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ["age", "income"]
categorical_features = ["city"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([("preprocessor", preprocessor)])
X_transformed = pipeline.fit_transform(df)
print(X_transformed.shape)

Wrapping transforms in a ColumnTransformer + Pipeline means the exact same fitted preprocessing applies automatically at inference time — no risk of training/serving skew from manually reapplying steps.

Key Takeaways

  • Target encoding beats one-hot for high-cardinality categorical features, but must be fit only on training data to avoid leakage
  • Decompose dates into tenure, day-of-week, and seasonality features — raw timestamps carry little signal
  • Binning helps interpretability and can help linear models capture non-linear thresholds
  • Manually crafted interaction features still add value even with tree-based models
  • Tree-based models don’t need scaling; linear models, SVMs, and neural nets do
  • Always wrap preprocessing in a Pipeline/ColumnTransformer to guarantee training and inference use identical transforms
Khushal Jethava
Khushal Jethava

Machine Learning Engineer at Codiste, specializing in Generative AI, NLP, and Computer Vision. Building production AI systems with Python.

This post is licensed under CC BY 4.0 by the author.