Under the Hood: Principal Component Analysis (PCA)
Introduction
In the realm of data analysis and machine learning, Principal Component Analysis (PCA) stands as a powerful technique for dimensionality reduction, feature extraction, and data visualization. Originally introduced by Karl Pearson in 1901, PCA has since become a staple tool in various domains, ranging from image processing and signal analysis to bioinformatics and finance.
At its core, PCA aims to transform a high-dimensional dataset into a lower-dimensional subspace while preserving as much of the original information as possible. This transformation is achieved by identifying the directions (principal components) along which the data exhibits the highest variance. By projecting the data onto these principal components, we can effectively capture the most significant patterns and structures within the dataset, thereby reducing its dimensionality and computational complexity.
This article delves into the theory behind PCA, providing a comprehensive understanding of its mathematical foundations. Additionally, we will explore how to implement PCA from scratch using Python (with a little help from NumPy), without relying on external libraries. Finally, we will walk through a real-world application of PCA, demonstrating its practical utility in a data analysis project.
Mathematical Foundations of PCA
To fully grasp the mechanics of PCA, we must first understand the fundamental concepts of linear algebra and statistics that underpin this powerful technique.
Variance and Covariance
Variance: A measure of how spread out a set of data points is from its mean value.
Covariance: A measure of how two random variables vary together, indicating the strength and direction of their linear relationship.
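As a quick illustration on a small, made-up dataset, NumPy computes both quantities directly:
import numpy as np

# Five observations of two loosely related features (values chosen arbitrarily)
data = np.array([[2.0, 4.1],
                 [3.0, 6.2],
                 [4.0, 7.9],
                 [5.0, 10.1],
                 [6.0, 12.0]])

print(np.var(data[:, 0], ddof=1))   # sample variance of the first feature
print(np.cov(data, rowvar=False))   # 2x2 covariance matrix of both features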
Eigenvectors and Eigenvalues
Eigenvectors: Non-zero vectors that, when multiplied by a square matrix, result in a scalar multiple of themselves.
Eigenvalues: The scalar multipliers associated with each eigenvector.
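To make the definition concrete, here is a small, hypothetical example that verifies the defining relationship A v = λ v with NumPy:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # a small symmetric matrix
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]               # first eigenvector (stored as a column)
lam = eigenvalues[0]                 # its associated eigenvalue
print(np.allclose(A @ v, lam * v))   # True: multiplying by A only rescales v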
Singular Value Decomposition (SVD)
SVD is a matrix factorization technique that decomposes a matrix into three matrices: U, Σ, and V^T (the transpose of V).
The columns of U and V represent the left and right singular vectors, respectively, while Σ is a diagonal matrix containing the singular values.
SVD plays a crucial role in the mathematical formulation of PCA. If you would like to see an article on the breakdown of SVD, leave a comment.
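As a quick sanity check on an arbitrary random matrix, NumPy's np.linalg.svd returns exactly these three factors, and multiplying them back together recovers the original matrix:
import numpy as np

M = np.random.default_rng(0).normal(size=(4, 3))   # an arbitrary 4x3 matrix
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# U @ diag(S) @ Vt reconstructs M from its singular vectors and singular values
print(np.allclose(M, U @ np.diag(S) @ Vt))          # True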
The PCA Algorithm
Given a dataset X with n observations and p features, the goal of PCA is to find a lower-dimensional subspace that captures the maximum amount of variance present in the original data. This subspace is defined by a set of orthogonal vectors called principal components.
The PCA algorithm can be broken down into six steps as follows:
- Standardize the data by subtracting the mean from each feature and dividing by the standard deviation, ensuring numerical stability and preventing features with larger scales from dominating the analysis.
- Compute the covariance matrix Σ of the standardized data, which captures the pairwise relationships between features.
- Compute the eigenvectors and corresponding eigenvalues of the covariance matrix Σ. The eigenvectors represent the principal components, and the eigenvalues represent the amount of variance captured by each principal component.
- Rearrange the eigenvectors in descending order of their corresponding eigenvalues. The eigenvectors with the highest eigenvalues capture the most significant directions of variance in the data.
- Choose the top k eigenvectors, corresponding to the k largest eigenvalues, to form the new basis for the lower-dimensional subspace.
- Project the original data onto the subspace spanned by the top k eigenvectors by multiplying the data matrix X by the matrix of selected eigenvectors. The resulting matrix represents the transformed data in the lower-dimensional subspace, capturing the most important patterns and structures.
It’s important to note that PCA captures only linear structure in the data and constrains the principal components to be orthogonal (perpendicular) to each other. This constraint allows for an efficient decomposition of the data into uncorrelated components.
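A brief check on random data illustrates why this works: the eigenvectors of a symmetric covariance matrix form an orthogonal set, so projecting onto them yields components that carry no redundant, correlated information.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
cov = np.cov(X, rowvar=False)                  # symmetric 3x3 covariance matrix
_, vecs = np.linalg.eigh(cov)                  # eigh handles symmetric matrices

print(np.allclose(vecs.T @ vecs, np.eye(3)))   # True: the eigenvectors are orthonormal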
Implementing PCA from Scratch in Python
Now that we have a solid grasp of the theoretical foundations of PCA, let’s ground that understanding by implementing it from scratch in Python. We will break the implementation into smaller steps, allowing for a better understanding of the underlying computations.
1. Standardize the data by subtracting the mean from each feature and dividing by the standard deviation.
import numpy as np

def standardize(X):
    # Center each feature at zero and scale it to unit variance
    means = np.mean(X, axis=0)
    stds = np.std(X, axis=0)
    X_standardized = (X - means) / stds
    return X_standardized
2. Compute the covariance matrix of the standardized data.
def compute_covariance(X):
    # For standardized (zero-mean) data, X^T X / (n - 1) is the sample covariance matrix
    n, p = X.shape
    covariance = np.dot(X.T, X) / (n - 1)
    return covariance
3. Compute the eigenvectors and eigenvalues of the covariance matrix.
def compute_eigen(covariance):
    # np.linalg.eigh is tailored to symmetric matrices such as a covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return eigenvalues, eigenvectors
4. Sort the eigenvectors in descending order of their corresponding eigenvalues.
def sort_eigen(eigenvalues, eigenvectors):
    # argsort returns ascending order, so reverse it to put the largest eigenvalues first
    sorted_indices = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalues = eigenvalues[sorted_indices]
    sorted_eigenvectors = eigenvectors[:, sorted_indices]
    return sorted_eigenvalues, sorted_eigenvectors
5. Select the top k eigenvectors, based on the desired number of principal components.
def select_k_eigenvectors(sorted_eigenvalues, sorted_eigenvectors, k):
    selected_eigenvalues = sorted_eigenvalues[:k]
    selected_eigenvectors = sorted_eigenvectors[:, :k]
    return selected_eigenvalues, selected_eigenvectors
6. Project the original data onto the subspace spanned by the top k eigenvectors.
def project_data(X, selected_eigenvectors):
    # Each row of X is mapped to its coordinates along the k principal components
    X_projected = np.dot(X, selected_eigenvectors)
    return X_projected
7. Putting it all together, we can define a function that performs PCA on a given dataset.
def pca(X, k):
    X_standardized = standardize(X)
    covariance = compute_covariance(X_standardized)
    eigenvalues, eigenvectors = compute_eigen(covariance)
    sorted_eigenvalues, sorted_eigenvectors = sort_eigen(eigenvalues, eigenvectors)
    selected_eigenvalues, selected_eigenvectors = select_k_eigenvectors(sorted_eigenvalues, sorted_eigenvectors, k)
    X_projected = project_data(X_standardized, selected_eigenvectors)
    return X_projected, selected_eigenvalues
This function takes a dataset X and the desired number of principal components k as input, and returns the projected data in the lower-dimensional subspace along with the corresponding eigenvalues.
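As a quick check on synthetic data, calling pca returns a matrix with one column per retained component:
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))      # 100 observations, 5 features (synthetic)

X_reduced, top_eigenvalues = pca(X_demo, k=2)
print(X_reduced.shape)                  # (100, 2): two principal components retained
print(top_eigenvalues)                  # variance captured by each retained component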
Real-World Application: Facial Recognition
Now that we have implemented PCA from scratch, let’s explore a real-world application of this powerful technique: facial recognition. In this section, we will walk through the process of using PCA for dimensionality reduction and feature extraction in a facial recognition system.
Suppose we have a dataset of grayscale facial images, each represented as a flattened vector of pixel intensities. Our goal is to develop a facial recognition system that can accurately identify individuals based on their facial features.
1. Prepare the data by loading the dataset of facial images, converting each image to a flattened vector of pixel intensities, and assembling the vectors into a matrix X, where each row represents an image.
import numpy as np
from PIL import Image

def load_images(image_paths):
    images = []
    for path in image_paths:
        image = Image.open(path).convert('L')    # Convert to grayscale
        image_array = np.asarray(image).ravel()  # Flatten the image to a vector
        images.append(image_array)
    X = np.array(images)
    return X
2. Perform PCA by applying the pca function we implemented earlier to the matrix X. Choose the number of principal components k based on the desired dimensionality reduction.
X = load_images(image_paths)  # image_paths: a list of file paths to the face images
k = 100                       # Choose the number of principal components
X_pca, eigenvalues = pca(X, k)
3. Train a classifier by splitting the PCA-transformed data into training and testing sets and fitting a classifier (e.g., logistic regression, support vector machines, or neural networks) on the training set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# labels: one identity label per image, aligned with the rows of X_pca
X_train, X_test, y_train, y_test = train_test_split(X_pca, labels, test_size=0.2, random_state=42)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
4. Evaluate the trained classifier on the testing set. Compute performance metrics such as accuracy, precision, recall, and F1-score.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-score: {f1:.3f}")
5. Analyze the results by interpreting the performance metrics and assessing the effectiveness of the facial recognition system. Explore the principal components and their relationship to facial features.
By applying PCA to the facial image dataset, we have effectively reduced the dimensionality of the data while preserving the most salient features. This dimensionality reduction not only improves computational efficiency but also helps mitigate the curse of dimensionality, which can negatively impact the performance of machine learning models.
The transformed data, represented by the principal components, captures the essential variations in facial features, making it easier for the classifier to learn and discriminate between different individuals. By analyzing the principal components and their corresponding eigenvalues, we can gain insights into the most prominent facial features that contribute to the recognition process.
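One simple way to quantify that trade-off, sketched here by reusing the helper functions defined earlier (and assuming no pixel is perfectly constant across the dataset, which would make standardization divide by zero), is to compare the variance captured by the top k eigenvalues against the total variance of the standardized data:
# Recompute the full eigenvalue spectrum of the standardized image data
X_std = standardize(X)
cov = compute_covariance(X_std)
all_eigenvalues, _ = compute_eigen(cov)
all_eigenvalues = np.sort(all_eigenvalues)[::-1]

explained_ratio = all_eigenvalues[:k].sum() / all_eigenvalues.sum()
print(f"Top {k} components capture {explained_ratio:.1%} of the total variance")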
Conclusion
Throughout this article, we have explored Principal Component Analysis (PCA) in depth, covering its theoretical foundations, implementing it from scratch in Python, and demonstrating its application in a real-world facial recognition project. However, it’s important to note that the implementation we covered represents the classical form of PCA, known as “standard PCA” or “linear PCA.”
While standard PCA is a powerful and widely-used technique, it has certain limitations and assumptions that have led to the development of various extensions and variations. These alternative approaches aim to address specific challenges or cater to different types of data and applications.
One notable extension is Kernel PCA, a non-linear variant of PCA. Standard PCA captures only linear structure in the data and constrains the principal components to be orthogonal. However, in many real-world scenarios, the data may exhibit non-linear relationships. Kernel PCA addresses this issue by first mapping the data into a higher-dimensional feature space using a non-linear kernel function, and then performing PCA in this new space. This approach allows for the extraction of non-linear principal components, enabling a more accurate representation of complex data structures.
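For readers who want to experiment, scikit-learn ships a ready-made implementation. A minimal sketch, with an arbitrarily chosen RBF kernel and toy data, might look like this:
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: a classic toy dataset with clearly non-linear structure
X_circles, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_circles)   # non-linear principal components
print(X_kpca.shape)                      # (300, 2)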
Another variation is Incremental PCA, which is particularly useful when dealing with large or streaming datasets that cannot be loaded into memory all at once. Unlike standard PCA, which requires the entire dataset to be available upfront, Incremental PCA updates the principal components incrementally as new data arrives. This makes it more computationally efficient and scalable for real-time applications or scenarios where data is continuously generated.
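scikit-learn also provides this variant; the sketch below, with made-up batch sizes, fits the model one chunk at a time instead of requiring the full dataset in memory:
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)
rng = np.random.default_rng(0)

# Feed the model one batch at a time, as if the data were streaming in
for _ in range(5):
    batch = rng.normal(size=(200, 50))   # hypothetical batch: 200 samples, 50 features
    ipca.partial_fit(batch)

X_new = rng.normal(size=(20, 50))
print(ipca.transform(X_new).shape)       # (20, 10)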
Robust PCA is another interesting variant that aims to handle datasets contaminated with noise or outliers. Standard PCA can be sensitive to outliers, as they can significantly influence the calculation of the principal components. Robust PCA employs techniques such as alternating minimization or convex optimization to separate the data into a low-rank component (capturing the essential patterns) and a sparse component (representing the noise or outliers).
Sparse PCA is designed to find principal components that are sparse, meaning that they have a large number of zero or near-zero coefficients. This property can be beneficial in applications where interpretability and feature selection are important, as the sparse principal components can highlight the most relevant features while suppressing irrelevant or redundant ones.
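As a rough illustration, scikit-learn's SparsePCA drives most component coefficients to exactly zero on the built-in digits dataset (the component count and penalty below are arbitrary choices):
from sklearn.datasets import load_digits
from sklearn.decomposition import SparsePCA

X_digits, _ = load_digits(return_X_y=True)    # 8x8 digit images, 64 pixel features

spca = SparsePCA(n_components=5, alpha=1, random_state=0)
components = spca.fit(X_digits).components_

# Each sparse component involves only a handful of pixels
print((components == 0).mean())               # fraction of exactly-zero coefficients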
These are just a few examples of the diverse array of PCA variants and extensions that have been developed to address specific challenges or cater to different types of data and applications. As data analysis and machine learning continue to evolve, we can expect further advancements and innovations in the field of PCA, enabling more robust, efficient, and specialized dimensionality reduction and feature extraction techniques.
While the implementation covered in this article focused on standard PCA, it serves as a foundation for understanding the core principles and mechanics of this powerful technique. By building upon this foundation and exploring the various extensions and variations, researchers and practitioners can unlock new possibilities and tackle increasingly complex data analysis problems across diverse domains.