Breast Cancer Classification
Overview
This project applies basic machine learning methods to the Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI repository.
The goal is to predict whether a tumor is malignant or benign based on cell nuclei measurements.
I used logistic regression with and without Principal Component Analysis (PCA) to evaluate how dimensionality reduction affects model performance.
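Dataset
- Breast Cancer Wisconsin (Diagnostic) Dataset, UCI Machine Learning Repository
- 569 samples with 30 numeric features computed from digitized images of fine needle aspirates of breast masses
- Binary label: malignant (212 samples) or benign (357 samples)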
Methods
- Data Preprocessing
  - Standardized features (mean 0, unit variance)
  - Stratified train/test split
- Models
  - Logistic Regression (baseline, no dimensionality reduction)
  - Logistic Regression after PCA (reduced to 2 components)
- Evaluation
  - Accuracy on train and test sets
  - Confusion matrix and classification report
  - PCA scatter plot to visualize separation
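
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`), an 80/20 split, `random_state=42`, and `max_iter=1000`, none of which are stated in this README. It also assumes the scaler and PCA are fit on the training split only (via a pipeline), so accuracies may differ slightly from the numbers reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the dataset (assumption: same data as the UCI files).
X, y = load_breast_cancer(return_X_y=True)

# Assumed settings: 80/20 split, stratified on the label, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: standardize, then logistic regression on all features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# PCA variant: standardize, project to 2 components, then logistic regression.
with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000)
)

scores = {}
for name, model in [("no PCA", baseline), ("PCA (2)", with_pca)]:
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

Wrapping the scaler and PCA in a pipeline keeps test-set statistics out of the fitted transforms, which is one common way to avoid leakage when standardizing.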
Results
- Without PCA:
  - Train Accuracy ≈ 0.99
  - Test Accuracy ≈ 0.96
- With PCA (2 components):
  - Train Accuracy ≈ 0.96
  - Test Accuracy ≈ 0.95
PCA reduced the 30 standardized features to 2 components while retaining most of the variance.
Logistic regression still performed well on the reduced data, but the full feature set gave slightly higher test accuracy (≈ 0.96 vs. ≈ 0.95).
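
How much variance two components retain can be checked directly from the fitted PCA. The sketch below assumes scikit-learn's bundled copy of the dataset and, for simplicity, fits on all samples rather than only the training split, so the figure is indicative rather than the notebook's exact value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component, and their sum.
ratios = pca.explained_variance_ratio_
retained = ratios.sum()
print(f"PC1: {ratios[0]:.3f}, PC2: {ratios[1]:.3f}, total: {retained:.3f}")
```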
Visualization
- PCA scatter plot shows malignant and benign samples forming distinct clusters.
- Confusion matrices show that both approaches achieve high precision and recall.
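
A scatter plot and confusion matrix like the ones described above could be produced along these lines. This is a sketch under assumptions: the split settings, the output filename `pca_scatter.png`, and the plot styling are illustrative choices, not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target  # in this copy, 0 = malignant and 1 = benign

# Assumed settings: 80/20 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize and project to 2 components, fitting on the training split only.
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
Z_train = projector.fit_transform(X_train)
Z_test = projector.transform(X_test)

# Scatter plot of the training samples in PCA space, one color per class.
fig, ax = plt.subplots(figsize=(6, 5))
for label, name in enumerate(data.target_names):
    mask = y_train == label
    ax.scatter(Z_train[mask, 0], Z_train[mask, 1], s=12, alpha=0.7, label=name)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
fig.savefig("pca_scatter.png", dpi=150)  # hypothetical output path

# Logistic regression on the 2 components, evaluated on the held-out split.
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
y_pred = clf.predict(Z_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

Because malignant is encoded as class 0 in scikit-learn's copy, the top-right cell of the printed matrix counts malignant tumors predicted as benign, which is the error type that matters most in a diagnostic setting.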
Takeaways
- Even simple models like Logistic Regression can achieve high accuracy on structured biomedical data.
- PCA is useful for visualization and dimensionality reduction, though it may slightly reduce accuracy.
- This project reflects how machine learning can support early cancer diagnosis, and demonstrates my ability to apply scikit-learn workflows to biomedical datasets.
Tech Stack
- Python
- scikit-learn
- NumPy, pandas, matplotlib
Repository
Full code and notebook: GitHub Repo