Introduction
Feature scaling is a crucial preprocessing step in machine learning, ensuring that all features contribute equally to the learning process. Without scaling, machine learning algorithms like Principal Component Analysis (PCA), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and neural networks may underperform. This guide will explain why scaling matters, cover common techniques like standardization and min-max scaling, and show their impact on PCA and classification accuracy.
By the end of this guide, you’ll understand:
- The role of feature scaling in machine learning.
- Key differences between standardization and min-max scaling.
- How scaling improves PCA and classification performance.
Why Is Feature Scaling Important in Machine Learning?
Feature scaling ensures that all features in a dataset contribute comparably to a model, preventing features with larger ranges from dominating. This is particularly important in distance-based algorithms like K-Nearest Neighbors (KNN) and in variance-based techniques like Principal Component Analysis (PCA), where unscaled data can lead to poor results.
Example of the Impact of Scaling
Consider a dataset with age values between 0 and 100 and income ranging from 0 to 100,000. If left unscaled, the algorithm will heavily weigh the income feature over age, skewing predictions and reducing the model’s accuracy.
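To make this concrete, here is a minimal sketch (the ages and incomes are made up for illustration) showing how the Euclidean distance between two samples is driven almost entirely by the feature with the larger range until the features are rescaled:
import numpy as np
# Two hypothetical samples: [age, income]
person_a = np.array([25, 50_000])
person_b = np.array([65, 51_000])
# Unscaled: the income gap (1,000) dwarfs the age gap (40)
print(np.linalg.norm(person_a - person_b))   # ~1000.8, almost entirely due to income
# Rescale each feature to a comparable range (age / 100, income / 100,000)
scale = np.array([100, 100_000])
print(np.linalg.norm(person_a / scale - person_b / scale))   # ~0.40, now driven by the age difference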
Feature Scaling Techniques
1. Standardization (Z-Score Normalization)
Standardization transforms each feature so it has a mean of 0 and a standard deviation of 1: each value x becomes z = (x − μ) / σ, where μ and σ are the feature's mean and standard deviation. This makes it well suited to algorithms that expect zero-centered, comparably scaled (often roughly Gaussian) inputs, such as PCA, SVM, and logistic regression.
Python Code Example: Standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler
from typing import List, Tuple


def load_data(url: str, usecols: List[int], column_names: List[str]) -> pd.DataFrame:
    """
    Load data from a CSV file with specified columns and column names.

    Args:
        url (str): URL of the CSV file.
        usecols (List[int]): Indices of columns to use.
        column_names (List[str]): Names for the selected columns.

    Returns:
        pd.DataFrame: Loaded dataframe with specified columns.
    """
    return pd.read_csv(url, usecols=usecols, names=column_names, header=None)


def standardize_features(df: pd.DataFrame, features: List[str]) -> Tuple[pd.DataFrame, StandardScaler]:
    """
    Standardize selected features of a dataframe.

    Args:
        df (pd.DataFrame): Input dataframe.
        features (List[str]): List of feature names to standardize.

    Returns:
        Tuple[pd.DataFrame, StandardScaler]: Copy of the dataframe with the selected
        features standardized, and the fitted scaler.
    """
    df = df.copy()  # work on a copy so the original dataframe keeps its raw values
    scaler = StandardScaler()
    df[features] = scaler.fit_transform(df[features].values)  # .values passes the underlying NumPy array
    return df, scaler


def main():
    url = 'https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/wine_data.csv'
    usecols = [0, 1, 2]
    column_names = ['Class label', 'Alcohol', 'Malic acid']
    features_to_standardize = ['Alcohol', 'Malic acid']

    df = load_data(url, usecols, column_names)
    df_std, scaler = standardize_features(df, features_to_standardize)

    print("Original data:")
    print(df.head())
    print("\nStandardized data:")
    print(df_std.head())
    print(f"\nScaler mean: {scaler.mean_}")
    print(f"Scaler variance: {scaler.var_}")


if __name__ == "__main__":
    main()
When to Use Standardization
- When your model relies on normally distributed data (PCA, SVM).
- When you need equal importance across features, especially in distance-based models like KNN.
2. Min-Max Scaling (Normalization)
Min-max scaling rescales each feature to a fixed range, typically [0, 1]: each value x becomes x' = (x − min) / (max − min). It's commonly applied to datasets that don't follow a Gaussian distribution, especially in neural networks or image processing.
Python Code Example: Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler

# Define features for scaling (df is the wine dataframe loaded in the previous example)
features = ['Alcohol', 'Malic acid']

# Initialize and fit the Min-Max Scaler
minmax_scaler = MinMaxScaler()
df_minmax = minmax_scaler.fit_transform(df[features])

# Print the scaled features
print(df_minmax)
When to Use Min-Max Scaling
- For non-Gaussian distributions.
- This is ideal for neural networks, where the features need to be on a similar scale to speed up convergence.
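To see how the two techniques differ on the same data, the sketch below (using a handful of made-up alcohol values) applies both scalers side by side: standardization centers the values around 0, while min-max scaling squeezes them into [0, 1].
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A few made-up alcohol values, shaped as a single-feature column
values = np.array([[12.0], [13.5], [14.2], [13.1]])

print(StandardScaler().fit_transform(values).ravel())   # ~[-1.51  0.38  1.25 -0.13]
print(MinMaxScaler().fit_transform(values).ravel())     # ~[0.    0.68  1.    0.5 ]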
Visualizing Feature Scaling
To fully grasp the effects of standardization and min-max scaling, it helps to plot the transformed values alongside the original data.
import matplotlib.pyplot as plt
def plot_scaling(df, df_std, df_minmax):
    """
    Plot the original, standardized, and Min-Max scaled features for comparison.

    Parameters:
        df (DataFrame): Original DataFrame with 'Alcohol' and 'Malic acid'.
        df_std (DataFrame): DataFrame with standardized 'Alcohol' and 'Malic acid'.
        df_minmax (array): Min-Max scaled feature values.
    """
    plt.figure(figsize=(8, 6))

    # Plot original scale
    plt.scatter(df['Alcohol'], df['Malic acid'], color='green', label='Original Scale', alpha=0.5)
    # Plot standardized scale
    plt.scatter(df_std['Alcohol'], df_std['Malic acid'], color='red', label='Standardized', alpha=0.5)
    # Plot Min-Max scaled
    plt.scatter(df_minmax[:, 0], df_minmax[:, 1], color='blue', label='Min-Max Scaled', alpha=0.5)

    plt.title('Feature Scaling on Wine Dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.show()


# Call the function with the dataframe and scaled data from the previous examples
plot_scaling(df, df_std, df_minmax)
The Role of Feature Scaling in PCA
What Is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique that projects data onto new axes (principal components) chosen to capture maximum variance. However, PCA is sensitive to feature scaling: without it, features with larger scales dominate the principal components, reducing PCA's effectiveness.
Example: PCA with and Without Standardization
Let’s see how PCA performs on both standardized and non-standardized data.
from sklearn.decomposition import PCA

# Perform PCA on non-standardized data
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df[['Alcohol', 'Malic acid']])

# Perform PCA on standardized data (use only the feature columns, not the class label)
pca_std = PCA(n_components=2)
df_std_pca = pca_std.fit_transform(df_std[['Alcohol', 'Malic acid']])

# Print PCA results for non-standardized and standardized data
print("PCA on non-standardized data:\n", df_pca)
print("PCA on standardized data:\n", df_std_pca)
Visualizing PCA with and without Standardization
import matplotlib.pyplot as plt
# Create subplots for PCA visualizations
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
# Plot PCA results on non-standardized data
ax1.scatter(df_pca[:, 0], df_pca[:, 1], color='blue', alpha=0.5)
ax1.set_title('PCA on Non-Standardized Data')
ax1.set_xlabel('1st Principal Component')
ax1.set_ylabel('2nd Principal Component')
# Plot PCA results on standardized data
ax2.scatter(df_std_pca[:, 0], df_std_pca[:, 1], color='red', alpha=0.5)
ax2.set_title('PCA on Standardized Data')
ax2.set_xlabel('1st Principal Component')
ax2.set_ylabel('2nd Principal Component')
# Adjust layout for better fit
plt.tight_layout()
plt.show()
Key Observations:
- Without scaling, the first principal component is dominated by features with larger ranges, making PCA ineffective.
- With scaling, all features contribute equally, allowing PCA to better capture the structure of the data.
Classifier Performance After PCA: Naive Bayes Example
To demonstrate the effect of scaling on classification, we’ll use a Naive Bayes classifier on PCA-transformed data.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Train Naive Bayes on non-standardized PCA data
gnb = GaussianNB()
gnb.fit(df_pca, df['Class label'])
predictions_pca = gnb.predict(df_pca)
# Train Naive Bayes on standardized PCA data
gnb_std = GaussianNB()
gnb_std.fit(df_std_pca, df['Class label'])
predictions_std_pca = gnb_std.predict(df_std_pca)
# Compare and print accuracy scores
print("Accuracy on Non-Standardized PCA:", accuracy_score(df['Class label'], predictions_pca))
print("Accuracy on Standardized PCA:", accuracy_score(df['Class label'], predictions_std_pca))
Results:
- Without scaling: classifier accuracy is noticeably lower, because the principal components are skewed toward the feature with the larger scale and preserve less of the class-relevant structure.
- With scaling: both features shape the components, and the classifier achieves higher accuracy.
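Note that the snippet above trains and evaluates on the same data, which overstates accuracy for both variants. For a fairer comparison you would hold out a test set; here is a minimal sketch using train_test_split, assuming the df_pca, df_std_pca, and df objects from the previous examples:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

y = df['Class label']

# Evaluate each PCA representation on a held-out 30% test split
for name, X in [("Non-standardized PCA", df_pca), ("Standardized PCA", df_std_pca)]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
    model = GaussianNB().fit(X_train, y_train)
    print(name, "test accuracy:", accuracy_score(y_test, model.predict(X_test)))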
Conclusion
Feature scaling plays a pivotal role in machine learning, especially in algorithms sensitive to the scale of input features, like PCA and KNN. Techniques like standardization and min-max scaling ensure that all features contribute equally, preventing skewed model performance.
Key Takeaways:
- Standardization is crucial for models that rely on normally distributed data.
- Min-max scaling is best for models like neural networks and datasets with non-Gaussian distributions.
- Scaling significantly improves PCA performance and overall classification accuracy.
By understanding and applying feature scaling techniques, you can enhance your machine learning model’s performance in real-world scenarios.