Introduction to Machine Learning Concepts
In this article, we’ll explore how to code five machine learning concepts in Python. We’ll take the problem statements and starter code from Deep-ML, and alongside each problem we’ll cover a little theory so you understand the idea behind the code.
5 Machine Learning Concepts
The five concepts we’ll be exploring are:
- PCA (Principal Component Analysis)
- Feature Scaling
- Confusion Matrix for Binary Classification
- Overfitting & Underfitting
- Random Shuffle of Dataset
PCA (Principal Component Analysis)
Principal Component Analysis is a dimensionality reduction technique. Suppose we have a dataset with n-1 independent features and 1 dependent feature, giving us an n-dimensional dataset, which in some cases might be very large. Dimensionality reduction can be used here to keep only the most important features (columns), also known as components.
An important thing to keep in mind is that the loss of information should also be minimal while choosing only the important components.
Steps in PCA
- Data Standardization — This step is crucial: PCA chooses the components that maximize variance in the data, so if the data isn’t standardized, PCA will be biased toward features with a large numerical range.
- Covariance Matrix — The next step is to compute the covariance matrix, which captures how pairs of features vary together.
- Eigenvalues and Eigenvectors — Eigenvectors indicate the direction of the principal components, while eigenvalues describe the variance of each component.
- Sort Eigenvalues and Eigenvectors — Rank the principal components in descending order of their eigenvalues. The first component explains the most variance, the second explains the next most, and so on.
- Return Top-K Components — Finally, return the top K components as the new features (principal components).
import numpy as np

def pca(data: np.ndarray, k: int) -> np.ndarray:
    # Step 1: Standardize each feature to zero mean and unit variance
    data_std = (data - np.mean(data, axis=0)) / np.std(data, axis=0)
    # Step 2: Covariance matrix of the standardized data
    cov_matrix = np.cov(data_std, rowvar=False)
    # Step 3: Eigendecomposition (eigh is suited to symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    # Step 4: Sort eigenvectors by descending eigenvalue
    sorted_idx = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_idx]
    # Step 5: Keep the top-k eigenvectors as the principal components
    components = eigenvectors[:, :k]
    # Fix the sign convention so each component's first entry is non-negative
    for i in range(components.shape[1]):
        if components[0, i] < 0:
            components[:, i] *= -1
    return np.round(components, 4)
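To see the function in action, here is a minimal usage sketch; the 4x3 matrix below is made up purely for illustration:

X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
])
top2 = pca(X, k=2)  # principal components as columns, shape (3, 2)
# Project the standardized data onto the components: 3 features reduced to 2
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_reduced = X_std @ top2  # shape (4, 2)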
Feature Scaling
This is where you get to understand data and also play with it. Feature scaling is a pre-processing technique that keeps the values of each column within a certain range, so that features with large numerical ranges don’t dominate those with small ones.
A visual example: before scaling, the dataset’s columns span very different ranges; after scaling, all the values fall within the same range.

The formula for min-max scaling (normalization) is x_norm = (x - x_min) / (x_max - x_min); standardization (z-score scaling) uses x_std = (x - mean) / std. The function below computes both.
import numpy as np

def feature_scaling(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # normalized_data = (data - X_min) / (X_max - X_min)
    # standardized_data = (data - X_mean) / X_std
    X_min = np.min(data, axis=0)
    X_max = np.max(data, axis=0)
    X_mean = np.mean(data, axis=0)
    X_std = np.std(data, axis=0)
    normalized_data = (data - X_min) / (X_max - X_min)
    standardized_data = (data - X_mean) / X_std
    return standardized_data, normalized_data
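A quick sanity check with a toy array (values invented for illustration):

data = np.array([
    [1.0, 200.0],
    [2.0, 300.0],
    [3.0, 400.0],
])
standardized, normalized = feature_scaling(data)
print(normalized)    # each column now lies in [0, 1]
print(standardized)  # each column now has mean 0 and standard deviation 1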
Confusion Matrix for Binary Classification
In ML, the confusion matrix can live up to its name, but it starts to make sense with practice. Let’s decode the concept step by step.
Before examining the confusion matrix, make sure you understand the classification setup in machine learning: after running our trained model on X_test, we get y_pred (a list of predicted labels), which we compare against the true labels, y_test.
def confusion_matrix(data):
    # Count the four outcomes for binary labels (1 = positive, 0 = negative)
    TP = 0  # True Positive: actual 1, predicted 1
    FP = 0  # False Positive: actual 0, predicted 1
    FN = 0  # False Negative: actual 1, predicted 0
    TN = 0  # True Negative: actual 0, predicted 0
    for y_test, y_pred in data:
        if y_test == 1 and y_pred == 1:
            TP += 1
        elif y_test == 1 and y_pred == 0:
            FN += 1
        elif y_test == 0 and y_pred == 1:
            FP += 1
        elif y_test == 0 and y_pred == 0:
            TN += 1
    return [[TP, FN], [FP, TN]]
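For example, given a handful of hypothetical (y_test, y_pred) pairs:

data = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
print(confusion_matrix(data))  # [[2, 1], [1, 1]] -> TP=2, FN=1, FP=1, TN=1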
Overfitting & Underfitting
These two concepts mainly come up when training and evaluating machine learning models. Overfitting means the model learned the training data too well, including its noise, so it fails to generalize to new data. Underfitting, on the other hand, means the model was not able to learn the underlying patterns from the training data in the first place.
Overfitting — High accuracy on training data, but lower accuracy on test data.
Underfitting — Low accuracy on both training and test data.
def model_fit_quality(training_accuracy, test_accuracy):
    """Determine if the model is overfitting, underfitting, or a good fit
    based on training and test accuracy.

    :param training_accuracy: float, training accuracy of the model (0 <= training_accuracy <= 1)
    :param test_accuracy: float, test accuracy of the model (0 <= test_accuracy <= 1)
    :return: int, 1 (overfitting), -1 (underfitting), or 0 (good fit).
    """
    if training_accuracy - test_accuracy > 0.2:
        return 1   # large train/test gap -> overfitting
    elif training_accuracy < 0.7 and test_accuracy < 0.7:
        return -1  # low accuracy on both -> underfitting
    else:
        return 0   # otherwise -> good fit
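A few hypothetical accuracy pairs to sanity-check the thresholds:

print(model_fit_quality(0.95, 0.65))  # 1  -> overfitting (train/test gap of 0.30)
print(model_fit_quality(0.60, 0.58))  # -1 -> underfitting (both accuracies below 0.7)
print(model_fit_quality(0.85, 0.80))  # 0  -> good fit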
Random Shuffle of Dataset
This is an often overlooked but very important concept. When we talk about shuffling a dataset, we mean shuffling its rows. Shuffling helps reduce overfitting by preventing order-related bias in the data; for example, it is used when implementing the cross-validation technique.
import numpy as np

def shuffle_data(X, y, seed=None):
    # Seed the generator so the shuffle is reproducible when a seed is given
    np.random.seed(seed)
    # Permute row indices and apply the same permutation to X and y together
    indices = np.arange(len(X))
    np.random.shuffle(indices)
    return X[indices], y[indices]
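A short sketch with made-up arrays, confirming that X rows and y labels stay aligned after the shuffle:

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
X_shuffled, y_shuffled = shuffle_data(X, y, seed=42)
print(X_shuffled)  # same rows as X, in a new order
print(y_shuffled)  # labels permuted with the same indices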
Conclusion
These were only five examples; you can visit Deep-ML to solve more problems like these, and I highly encourage you to do so if you’re preparing for an AI/ML role.
FAQs
- What is PCA in machine learning?
PCA stands for Principal Component Analysis, a dimensionality reduction technique.
- What is feature scaling in machine learning?
Feature scaling is a pre-processing technique that keeps the values of each column within a certain range, so that features with large ranges don’t dominate.
- What is a confusion matrix in machine learning?
A confusion matrix is a table used to evaluate the performance of a classification model.
- What is overfitting in machine learning?
Overfitting occurs when a model is too complex and learns the noise in the training data, resulting in poor performance on new, unseen data.
- What is underfitting in machine learning?
Underfitting occurs when a model is too simple and fails to capture the underlying patterns in the training data, resulting in poor performance on both training and test data.