MBS Mohammed Baobaid
University project | Data Analytics | Created 1 April 2026 at 2:21 PM

Multiple Regression in Biology & Analytics

A regression case study where I modeled forest-fire burned area using environmental predictors, diagnosed the limits of OLS, and translated a weak but meaningful signal into responsible ecological interpretation.

This BANA482 case study was my multiple regression project using the UCI Forest Fires dataset. I modeled burned area in Montesinho Natural Park, transformed the highly skewed response with log(area + 1), tested environmental predictors, and treated the model limitations as part of the finding rather than something to hide.

517 Fire events
8 Predictors
0.039 Wind p-value
2.66 Max VIF

Role

Lead regression analyst, report author, and presentation designer

Outcome

I analyzed 517 fire events with 8 environmental predictors, found that wind was the only statistically significant predictor in the full model (p = 0.039), confirmed all VIF values were below 5, and showed that weather-only linear models explain very little of burned-area variation without richer spatial, vegetation, and suppression variables.

The Challenge

The dataset looked simple at first: predict burned forest area from environmental and fire-weather variables. The hard part was that burned area is extremely skewed: many fires burn almost nothing, while a few become extreme events. The available predictors also do not capture terrain, vegetation, ignition context, firefighting response, or spatial spread. I had to build a statistically careful model while being honest that a low R-squared can be a real property of the problem, not a failure of effort.

The Approach

I treated the project as a full regression workflow. I cleaned and inspected the Forest Fires data, transformed the response as log(area + 1), explored distributions and correlations, fitted a full OLS model, added optional month/day controls, tested a temperature-humidity interaction, compared candidate models, checked multicollinearity, examined diagnostics, and validated performance on an 80/20 split. I wrote the interpretation around evidence, uncertainty, and the ecological limits of the data.

How it works

I modeled ecosystem stress, not just a regression equation

I framed this project around a real analytical tension: forest-fire burned area is not a neat classroom response variable. Most observations are small, a few are extreme, and the process is shaped by weather, fuel moisture, wind, terrain, vegetation, and human response. My job was to use multiple regression carefully without pretending that the available variables could explain the whole ecological system.

[Figure: slide summarizing the Forest Fires dataset with 517 fire events, 8 predictors, no missing values, and mean burned area]
I started by defining the response and predictor set before fitting any model.

I transformed burned area before trusting OLS

The original burned-area variable was heavily right-skewed. The median fire burned only 0.52 hectares, while the maximum reached 1090.84 hectares. I used log(area + 1) to compress extreme values and make the response more workable for OLS. This did not magically make the problem easy, but it made the model more statistically defensible.

[Figure: exploratory analysis slide comparing original burned area with log-transformed burned area and showing weak correlations with log area]
The transformation made the response more usable, while the correlation table warned me that no single predictor had strong linear association with fire size.
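
The effect of the transformation can be illustrated with a minimal Python sketch. The project itself was done in R, so this is an illustration rather than the original code; `log_area` and `back_transform` are hypothetical helper names, and the only inputs are the summary values quoted in this case.

```python
import math

# Summary values reported in this case study, in hectares.
reported = {"median": 0.52, "mean": 12.85, "max": 1090.84}


def log_area(area):
    """Return log(area + 1); log1p keeps a zero-hectare fire at exactly 0."""
    return math.log1p(area)


def back_transform(z):
    """Invert log(area + 1) to get back to hectares."""
    return math.expm1(z)


for name, hectares in reported.items():
    print(f"{name}: {hectares} ha -> {log_area(hectares):.2f} on the log scale")
```

On the raw scale the maximum is more than 2,000 times the median. After the transformation the same range runs from 0 to about 7.0, so a single extreme fire no longer dominates the squared-error fit.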

I explored the predictors before estimating the model

The scatterplots showed a pattern that shaped the whole case: the predictors contained signal, but not the kind of clean linear signal that produces a high explanatory model. DMC and DC were strongly related to each other, temperature and humidity moved in opposite directions, and wind showed one of the more useful relationships with log burned area.

[Figure: faceted scatterplots of environmental predictors against log burned area]
The exploratory plots made the weak but visible environmental signal clear before model fitting.

I fit the full environmental model first

The full model included FFMC, DMC, DC, ISI, temperature, relative humidity, wind, and rain. The result was sobering: adjusted R-squared was only 0.004 and the overall model p-value was 0.247. Wind was the only statistically significant predictor in that model, with beta = 0.0758 and p = 0.039. I treated that as a meaningful but limited finding: higher wind is associated with larger fires, but weather variables alone are not enough to predict burned area well.

[Figure: regression results slide showing coefficient estimates, model fit, and wind as the significant predictor]
The full model gave one clear signal: wind mattered, but the overall predictive power was weak.
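
The mechanics of an OLS fit, and what a coefficient on a log-scale response means, can be sketched in Python. This is a toy illustration under stated assumptions, not the R `lm()` fit used in the project: the data are made up, and `solve`, `ols`, and `pct_change_per_unit` are hypothetical helpers. Only the wind coefficient (0.0758) comes from the case.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular (no perfectly collinear columns).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def ols(X, y):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    Xd = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xd, y)) for i in range(k)]
    return solve(XtX, Xty)


def pct_change_per_unit(beta):
    """Percent change in (area + 1) per one-unit predictor increase."""
    return (math.exp(beta) - 1.0) * 100.0


# Made-up, noise-free toy data: y = 0.5 + 0.2 * x1 - 0.1 * x2.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
y = [0.5 + 0.2 * a - 0.1 * b for a, b in X]
print([round(c, 3) for c in ols(X, y)])

# Reading the reported wind coefficient on the original scale.
print(round(pct_change_per_unit(0.0758), 1))
```

Because the response is log(area + 1), a coefficient of 0.0758 corresponds to roughly a 7.9% increase in (area + 1) per one-unit increase in wind speed (km/h in the UCI data), holding the other predictors fixed.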

I checked assumptions instead of only reporting p-values

I used VIF to check whether multicollinearity was distorting the coefficients. Every VIF was below 5, with the highest value at 2.66 for temperature, so multicollinearity was not the core issue. The Breusch-Pagan test did not reject homoscedasticity, but the Shapiro-Wilk test strongly rejected residual normality. That result made sense because even after transformation, fire outcomes still have heavy tails.

[Figure: model-assumption slide showing VIF values, Breusch-Pagan result, Shapiro-Wilk result, and residual warning]
The diagnostics helped me separate model limitations from technical problems like multicollinearity.
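
The VIF figures are easier to read through the identity VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The short Python sketch below inverts that identity for the values quoted in this case. The helper names are hypothetical, and this is not the project's R code (the tech stack lists car, whose `vif()` function is the usual R route).

```python
def vif_from_r2(r2):
    """VIF implied by the R-squared of a predictor regressed on the others."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    if vif < 1.0:
        raise ValueError("VIF cannot be below 1")
    return 1.0 - 1.0 / vif


print(round(r2_from_vif(2.66), 3))  # highest VIF in this case (temperature)
print(round(r2_from_vif(5.0), 3))   # the common rule-of-thumb threshold
```

Read this way, the largest VIF of 2.66 implies that about 62% of the variance in temperature is explained by the other predictors, well short of the 80% that a VIF of 5 would represent.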

I compared candidate models honestly

I tested temporal controls, a temperature-humidity interaction, and a stepwise candidate model. Month/day controls raised adjusted R-squared to 0.021, but the improvement was still modest. The interaction term was not significant. Stepwise selection reduced the model to DMC, RH, and wind, with the lowest AIC and BIC among the candidates, but even that model remained practically limited. The important point was not to crown a model too loudly when the data did not earn it.

[Figure: model-comparison slide showing AIC comparison and stepwise regression result]
The candidate comparison showed that a cleaner model can be statistically preferable while still having weak explanatory power.
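
The comparison criteria can be written out in a small Python sketch. Everything numeric here except n = 517 and the predictor counts is an assumption: the residual sums of squares are hypothetical, and the raw R² of 0.0194 is back-calculated from the reported adjusted R² rather than taken from the case. The formulas are the standard Gaussian log-likelihood versions, which to my knowledge match what R's `AIC()` and `BIC()` report for `lm` fits.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (no intercept in p)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def gaussian_aic_bic(rss, n, n_coef):
    """Return (AIC, BIC) for an OLS fit with Gaussian errors.

    n_coef counts the intercept and slopes; one more parameter is added
    for the error variance.
    """
    k = n_coef + 1
    loglik = -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * k, -2.0 * loglik + math.log(n) * k


n = 517
# Hypothetical residual sums of squares, chosen only for illustration.
full_aic, full_bic = gaussian_aic_bic(rss=1000.0, n=n, n_coef=9)    # 8 predictors + intercept
small_aic, small_bic = gaussian_aic_bic(rss=1003.0, n=n, n_coef=4)  # DMC, RH, wind + intercept
print(small_aic < full_aic, small_bic < full_bic)

# Back-calculated illustration: a raw R-squared near 0.0194 with 8 predictors
# lands at an adjusted R-squared of about 0.004.
print(round(adjusted_r2(0.0194, n, 8), 3))
```

In this toy comparison the smaller model wins on both criteria even though its fit is slightly worse, because dropping five coefficients saves more in penalty than the extra residual error costs. With n = 517, BIC charges about 6.25 per parameter against 2 for AIC, so it favors the reduced model even more strongly.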

I validated the model on held-out data

I used an 80/20 train/test split with caret to check how the model behaved outside the training data. The test-set RMSE was 1.71 and MAE was 1.30 on the log scale. That validation result supported the same conclusion as the regression summary: the model can identify directional environmental influence, but it should not be sold as a precise fire-size prediction engine.

[Figure: train-test validation slide showing split size, RMSE, MAE, and training coefficients]
The validation step kept the case grounded in predictive performance rather than in-sample fit alone.
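
The split and the two error metrics can be sketched in a few lines of Python. This uses a plain seeded shuffle rather than caret's partitioning, so the seed, the exact train/test counts, and the selected rows are assumptions and will not match the project's split. `train_test_split`, `rmse`, and `mae` are hypothetical helpers.

```python
import math
import random


def train_test_split(n, train_frac=0.8, seed=42):
    """Shuffle row indices 0..n-1 and cut them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error; the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


train_idx, test_idx = train_test_split(517)
print(len(train_idx), len(test_idx))

# Reading the reported log-scale RMSE as a multiplicative error on (area + 1).
print(round(math.exp(1.71), 2))
```

Because the errors are on the log scale, an RMSE of 1.71 means predictions of (area + 1) are off by a factor of roughly exp(1.71), or about 5.5, in a typical case. That is why the model is better read as directional evidence than as a fire-size forecast.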

I turned a weak model into a useful conclusion

The final takeaway was not that regression failed. It was that the response is complex and the available variables only describe part of the fire process. Wind repeatedly appeared as the strongest predictor, but the low R-squared showed that burned area also depends on missing variables such as slope, vegetation, fuel continuity, ignition location, suppression speed, and spatial spread. I used the model to say something careful, not something exaggerated.

[Figure: conclusion slide summarizing wind, model comparison, VIF, assumption checks, and limitations]
The case ends by translating the statistical evidence into responsible ecological interpretation.

What this project says about how I work

This project is personal to me because it shows the way I want to handle analytics work: I do not just chase a significant coefficient or a beautiful slide. I clean the data, question the assumptions, compare models, validate performance, and then explain the result in language a decision-maker can use. The strongest part of this case is the honesty of the conclusion: weak prediction can still be valuable when it is interpreted with discipline.

Results

  • The analysis used 517 fire events from Montesinho Natural Park in Portugal, with no missing values in the modeling dataset.
  • Burned area was highly right-skewed: the mean was 12.85 hectares, the median was 0.52 hectares, and the maximum was 1090.84 hectares.
  • The response was transformed to log(area + 1) so the regression would not be dominated entirely by extreme fires.
  • The full model had an adjusted R-squared of 0.004, with wind as the only statistically significant predictor at p = 0.039.
  • The temporal-control model improved adjusted R-squared to 0.021, but the overall model remained weak and only marginally significant.
  • The temperature-by-humidity interaction was not significant, so I did not force an unsupported interaction story.
  • All VIF values were below 5; the highest was temperature at 2.66, so multicollinearity was not the main reason for weak prediction.
  • The Breusch-Pagan test did not detect a major heteroscedasticity problem, but the Shapiro-Wilk test rejected residual normality, which is consistent with heavy-tailed fire outcomes.
  • Train/test validation produced RMSE of 1.71 and MAE of 1.30 on the log scale, reinforcing that weather-only predictors cannot fully explain fire size.

Key features

01 Cleaned and validated the UCI Forest Fires dataset with 517 observations and no missing values
02 Transformed the heavily right-skewed burned-area response using log(area + 1)
03 Explored fire-weather, moisture-code, wind, rain, temperature, and humidity predictors
04 Estimated full, temporal-control, interaction, and stepwise regression models, then validated on an 80/20 train/test split
05 Checked multicollinearity with VIF and confirmed every predictor remained below the common threshold of 5
06 Tested assumptions using residual plots, Breusch-Pagan, Shapiro-Wilk, leverage, and influence diagnostics
07 Compared candidate models with adjusted R-squared, AIC, BIC, RMSE, and MAE
08 Translated the weak model fit into a defensible ecological and analytics conclusion

Tech stack

R, R Markdown, tidyverse, dplyr, GGally, car, MASS, caret, lmtest; multiple linear regression, OLS diagnostics, VIF analysis, Breusch-Pagan test, Shapiro-Wilk test, train/test validation