Cracking the Code of Online Popularity: Lessons from Feature Selection and PCA
Predicting whether an article will go viral is a puzzle that blends data science with human behavior. In this project, we worked with a large dataset of online news articles, aiming to forecast popularity (measured as the number of shares) using dozens of explanatory variables. The assignment was straightforward in its goal but complex in its execution: reduce dimensionality, train models, and report performance. Along the way, we uncovered lessons about interpretability, complexity, and the limits of linear regression on messy, real-world data.

The Dimensionality Challenge

Our dataset contained nearly 40,000 articles with 60+ explanatory variables — ranging from keyword frequency to sentiment polarity. This posed the classic curse of dimensionality: too many features relative to the predictive signal, which often leads to overfitting, inefficiency, and inscrutable models. To tackle this, we explored three modeling paths:

Full features — a baseline model with all predictors.
Feature...