Multiple Regression in Biology & Analytics Project

A regression case study in which I modeled forest-fire burned area using environmental predictors, diagnosed the limits of OLS, and translated a weak but meaningful signal into a responsible ecological interpretation.

This BANA482 case study, created 1 April 2026 at 2:21 PM, was my multiple regression project using the UCI Forest Fires dataset. I modeled burned area in Montesinho Natural Park, transformed the highly skewed response with log(area + 1), tested environmental predictors, and treated the model's limitations as part of the finding rather than something to hide.

Tech stack: R, R Markdown, tidyverse, dplyr, GGally, car, MASS, caret, lmtest, multiple linear regression, OLS diagnostics, VIF analysis, Breusch-Pagan test, Shapiro-Wilk test, train/test validation

Narrated walkthrough

This audio is not a word-for-word copy of the case below. You can read the written case while listening to me explain the project in more detail.

I modeled ecosystem stress, not just a regression equation

I framed this project around a real analytical tension: forest-fire burned area is not a neat classroom response variable. Most observations are small, a few are extreme, and the process is shaped by weather, fuel moisture, wind, terrain, vegetation, and human response. My job was to use multiple regression carefully without pretending that the available variables could explain the whole ecological system.

Figure: Slide summarizing the Forest Fires dataset with 517 fire events, 8 predictors, no missing values, and mean burned area. I started by defining the response and predictor set before fitting any model.

I transformed burned area before trusting OLS

The original burned-area variable was heavily right-skewed. The median fire burned only 0.52 hectares, while the largest burned 1090.84 hectares. I used log(area + 1) to compress the extreme values and make the response more workable for OLS; the + 1 keeps fires with zero recorded area in the model, since log(0) is undefined. The transformation did not make the problem easy, but it made the model more statistically defensible.
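
As a minimal sketch of the transformation step (not the project's actual code), the snippet below builds a synthetic burned-area vector with many zeros and a long right tail, then compares skewness before and after log(area + 1). The simulated values and the `skewness` helper are assumptions made purely for illustration.

```r
set.seed(42)

# Synthetic stand-in for burned area (hectares): many zeros plus a long right tail.
# These are NOT the real Forest Fires values.
area <- c(rep(0, 240), rlnorm(277, meanlog = 1, sdlog = 1.5))

# log1p(x) computes log(1 + x), so zero-area fires map to 0 instead of -Inf
log_area <- log1p(area)

# Simple moment-based skewness (hypothetical helper, defined here for the sketch)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Compare the shape of the response before and after the transformation
round(c(raw = skewness(area), logged = skewness(log_area)), 2)
```

On data shaped like this, the logged response should be far less skewed than the raw one, which is the property that makes it more workable for OLS.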

Figure: Exploratory analysis slide comparing original burned area with log-transformed burned area and showing weak correlations with log area. The transformation made the response more usable, while the correlation table warned me that no single predictor had a strong linear association with fire size.

I explored the predictors before estimating the model

The scatterplots showed a pattern that shaped the whole case: the predictors contained signal, but not the kind of clean linear signal that produces a model with high explanatory power. DMC and DC were strongly related to each other, temperature and humidity moved in opposite directions, and wind showed one of the more useful relationships with log burned area.

Figure: Faceted scatterplots of environmental predictors against log burned area. The exploratory plots made the weak but visible environmental signal clear before model fitting.

I fit the full environmental model first

The full model included FFMC, DMC, DC, ISI, temperature, relative humidity, wind, and rain. The result was sobering: adjusted R-squared was only 0.004, and the p-value for the overall F-test was 0.247. Wind was the only statistically significant predictor in that model, with beta = 0.0758 and p = 0.039. I treated that as a meaningful but limited finding: higher wind is associated with larger fires, but weather variables alone are not enough to predict burned area well.
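
Because the response is log(area + 1), the wind coefficient is easier to read as a percentage change. The short calculation below back-transforms the reported beta of 0.0758; it assumes wind is recorded in km/h, as in the UCI dataset description, and it holds the other predictors fixed. The variable names are illustrative only.

```r
# Reported wind coefficient from the full model (response: log(area + 1))
beta_wind <- 0.0758

# Multiplicative change in (area + 1) for each additional 1 km/h of wind
mult_per_kmh <- exp(beta_wind)

# Same effect expressed as a percentage change
pct_per_kmh <- (mult_per_kmh - 1) * 100

# Effect of a larger, 5 km/h increase in wind (effects compound on the log scale)
pct_per_5kmh <- (exp(5 * beta_wind) - 1) * 100

round(c(per_1_kmh = pct_per_kmh, per_5_kmh = pct_per_5kmh), 1)
```

Under these assumptions, each extra km/h of wind corresponds to roughly a 7.9% increase in (area + 1). That is a directional reading of one coefficient in a weak model, not a forecast.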

Figure: Regression results slide showing coefficient estimates, model fit, and wind as the significant predictor. The full model gave one clear signal: wind mattered, but the overall predictive power was weak.

I checked assumptions instead of only reporting p-values

I used VIF to check whether multicollinearity was distorting the coefficients. Every VIF was below 5, with the highest value at 2.66 for temperature, so multicollinearity was not the core issue. The Breusch-Pagan test did not reject homoscedasticity, but the Shapiro-Wilk test strongly rejected residual normality. That result made sense because fire outcomes still have heavy tails even after transformation.
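
The three checks can be sketched in base R on synthetic data. This is an illustration under stated assumptions, not the project's code: the data are simulated, only three predictors are used, and the VIF and studentized Breusch-Pagan statistics are computed by hand so the block runs without extra packages. In practice, `car::vif()` and `lmtest::bptest()` provide the same diagnostics.

```r
set.seed(1)
n <- 200

# Synthetic predictors; names mirror the case study but the values are made up
temp <- rnorm(n, 19, 6)
RH   <- 70 - 1.5 * temp + rnorm(n, 0, 10)    # humidity falls as temperature rises
wind <- rnorm(n, 4, 1.8)
y    <- 1 + 0.08 * wind + rnorm(n, 0, 1.4)   # stand-in for log(area + 1)
dat  <- data.frame(y, temp, RH, wind)

fit   <- lm(y ~ temp + RH + wind, data = dat)
preds <- c("temp", "RH", "wind")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(preds, function(v) {
  others <- setdiff(preds, v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})

# Studentized Breusch-Pagan: n * R^2 from regressing squared residuals
# on the predictors, compared with a chi-squared distribution
dat$res2 <- resid(fit)^2
aux  <- lm(res2 ~ temp + RH + wind, data = dat)
bp   <- n * summary(aux)$r.squared
bp_p <- pchisq(bp, df = length(preds), lower.tail = FALSE)

# Shapiro-Wilk test of residual normality
sw <- shapiro.test(resid(fit))

round(vif_manual, 2)
c(bp_statistic = bp, bp_p_value = bp_p, shapiro_p_value = sw$p.value)
```

A small Breusch-Pagan p-value would point to heteroscedasticity, and a small Shapiro-Wilk p-value to non-normal residuals. Because the synthetic errors here are drawn from a normal distribution, the sketch will not necessarily reproduce the rejection of normality reported for the real data.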

Figure: Model-assumption slide showing VIF values, Breusch-Pagan result, Shapiro-Wilk result, and residual warning. The diagnostics helped me separate model limitations from technical problems like multicollinearity.

I compared candidate models honestly

I tested temporal controls, a temperature-humidity interaction, and a stepwise candidate model. Month/day controls raised adjusted R-squared to 0.021, but the improvement was still modest. The interaction term was not significant. Stepwise selection reduced the model to DMC, RH, and wind, with the lowest AIC and BIC among the candidates, but even that model remained practically limited. The important point was not to crown a model too loudly when the data did not earn it.
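
A comparison of this kind can be sketched as follows. The data are synthetic, the predictor set is reduced for brevity, and base R's `step()` stands in for whatever stepwise routine the project used (`MASS::stepAIC()` is a similar option), so the numbers it prints say nothing about the real dataset.

```r
set.seed(7)
n <- 300

# Synthetic predictors and response; NOT the real Forest Fires values
dat <- data.frame(
  DMC  = rnorm(n, 110, 60),
  RH   = rnorm(n, 44, 16),
  wind = rnorm(n, 4, 1.8),
  temp = rnorm(n, 19, 6)
)
dat$y <- 1 + 0.08 * dat$wind + rnorm(n, 0, 1.4)   # stand-in for log(area + 1)

# Candidate models: main effects, a temperature-humidity interaction, stepwise
full    <- lm(y ~ DMC + RH + wind + temp, data = dat)
inter   <- lm(y ~ DMC + RH + wind + temp + temp:RH, data = dat)
stepped <- step(full, direction = "both", trace = 0)   # AIC-based selection

models <- list(full = full, interaction = inter, stepwise = stepped)

# One row per candidate: lower AIC/BIC is better, higher adjusted R-squared is better
comparison <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC),
  row.names = NULL
)
comparison
```

Putting adjusted R-squared next to AIC and BIC makes the point of this section visible in one table: a model can win on information criteria while still explaining very little variance.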

Figure: Model-comparison slide showing AIC comparison and stepwise regression result. The candidate comparison showed that a cleaner model can be statistically preferable while still having weak explanatory power.

I validated the model on held-out data

I used an 80/20 train/test split with caret to check how the model behaved outside the training data. The test-set RMSE was 1.71 and MAE was 1.30 on the log scale. That validation result supported the same conclusion as the regression summary: the model can identify directional environmental influence, but it should not be sold as a precise fire-size prediction engine.
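
The validation logic can be sketched in base R. This is a hedged illustration: the data are simulated, the three predictors are a hypothetical choice, and `sample()` replaces caret's partitioning helper (`caret::createDataPartition()` is one such function) so the block runs without extra packages. A mean-only baseline is added as an assumption of this sketch, because it gives the RMSE something to be compared against.

```r
set.seed(123)
n <- 517   # same row count as the dataset; the values themselves are synthetic

dat <- data.frame(
  DMC  = rnorm(n, 110, 60),
  RH   = rnorm(n, 44, 16),
  wind = rnorm(n, 4, 1.8)
)
dat$log_area <- 1 + 0.08 * dat$wind + rnorm(n, 0, 1.4)   # stand-in for log(area + 1)

# 80/20 split by random row index
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(log_area ~ DMC + RH + wind, data = train)
pred <- predict(fit, newdata = test)

# Error metrics on the log scale
rmse <- sqrt(mean((test$log_area - pred)^2))
mae  <- mean(abs(test$log_area - pred))

# Baseline: always predict the training-set mean
rmse_baseline <- sqrt(mean((test$log_area - mean(train$log_area))^2))

round(c(rmse = rmse, mae = mae, rmse_baseline = rmse_baseline), 3)
```

If the model's test RMSE is close to the baseline RMSE, the predictors add little out-of-sample accuracy, which fits the cautious reading of the validation results.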

Figure: Train-test validation slide showing split size, RMSE, MAE, and training coefficients. The validation step kept the case grounded in predictive performance rather than in-sample fit alone.

I turned a weak model into a useful conclusion

The final takeaway was not that regression failed. It was that the response is complex and the available variables only describe part of the fire process. Wind repeatedly appeared as the strongest predictor, but the low R-squared showed that burned area also depends on missing variables such as slope, vegetation, fuel continuity, ignition location, suppression speed, and spatial spread. I used the model to say something careful, not something exaggerated.

Figure: Conclusion slide summarizing wind, model comparison, VIF, assumption checks, and limitations. The case ends by translating the statistical evidence into a responsible ecological interpretation.

What this project says about how I work

This project is personal to me because it shows the way I want to handle analytics work: I do not just chase a significant coefficient or a beautiful slide. I clean the data, question the assumptions, compare models, validate performance, and then explain the result in language a decision-maker can use. The strongest part of this case is the honesty of the conclusion: weak prediction can still be valuable when it is interpreted with discipline.

Key features

01 Cleaned and validated the UCI Forest Fires dataset with 517 observations and no missing values

02 Transformed the heavily right-skewed burned-area response using log(area + 1)

03 Explored fire-weather, moisture-code, wind, rain, temperature, and humidity predictors

04 Estimated full, temporal-control, interaction, stepwise, and train/test regression models

05 Checked multicollinearity with VIF and confirmed that every predictor's VIF remained below the common threshold of 5

06 Tested assumptions using residual plots, Breusch-Pagan, Shapiro-Wilk, leverage, and influence diagnostics

07 Compared candidate models with adjusted R-squared, AIC, BIC, RMSE, and MAE

08 Translated the weak model fit into a defensible ecological and analytics conclusion
