Breast Cancer Classification

Overview

This project applies basic machine learning methods to the Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI repository.
The goal is to predict whether a tumor is malignant or benign from cell nuclei measurements.

I used logistic regression with and without Principal Component Analysis (PCA) to evaluate how dimensionality reduction affects model performance.


Dataset


Methods

  1. Data Preprocessing
    • Standardized features (mean 0, unit variance)
    • Stratified train/test split
  2. Models
    • Logistic Regression (baseline, no dimensionality reduction)
    • Logistic Regression after PCA (reduced to 2 components)
  3. Evaluation
    • Accuracy on train and test sets
    • Confusion matrix and classification report
    • PCA scatter plot to visualize separation
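
The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: it assumes scikit-learn and its bundled copy of the dataset (`load_breast_cancer`), and the 80/20 split, `random_state=42`, and `max_iter=1000` are placeholder choices rather than values taken from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the Wisconsin (Diagnostic) data.
X, y = load_breast_cancer(return_X_y=True)

# Stratified split keeps the malignant/benign ratio the same in both sets.
# test_size and random_state are illustrative values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: standardize, then logistic regression on all features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# PCA variant: standardize, project onto 2 components, then classify.
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000)
)

for name, model in [("baseline", baseline), ("pca-2", pca_model)]:
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

Wrapping the scaler and PCA in a pipeline means both are fit on the training split only, so no information from the test set leaks into preprocessing.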

Results

PCA reduced the 30 input features to 2 components while retaining most of the variance.
Logistic regression on the 2 components still performed well, but the model trained on the full standardized feature set gave slightly higher accuracy.
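
How much variance the two components retain, and how the errors are distributed, can be checked directly. The snippet below is a hedged sketch under the same assumptions as before (scikit-learn's bundled dataset, an illustrative 80/20 stratified split, `random_state=42`); it prints the values rather than asserting the numbers reported by this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000)
)
pca_model.fit(X_train, y_train)

# Fraction of total (standardized) variance captured by the 2 components.
retained = pca_model.named_steps["pca"].explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")

# Rows of the confusion matrix are true classes, columns are predictions.
y_pred = pca_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

In scikit-learn's copy of the data, label 0 is malignant and label 1 is benign, so the first row of the confusion matrix corresponds to malignant tumors.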


Visualization
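
A PCA scatter plot of the kind listed under Methods could be produced along these lines. This is only a sketch: it assumes matplotlib and scikit-learn's bundled dataset, fits PCA on the full standardized data for plotting purposes, and writes to a hypothetical file name (`pca_scatter.png`).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize, then project all samples onto the first 2 principal components.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(data.data))

fig, ax = plt.subplots(figsize=(6, 5))
for label, name in enumerate(data.target_names):
    mask = data.target == label
    ax.scatter(Z[mask, 0], Z[mask, 1], s=12, alpha=0.7, label=name)

ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("Breast cancer data projected onto 2 PCA components")
ax.legend()
fig.savefig("pca_scatter.png", dpi=150)  # hypothetical output path
```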


Takeaways


Tech Stack


Repository

Full code and notebook: GitHub Repo