What is Multiple Linear Regression (MLR)?
Multiple Linear Regression (MLR) is a method used to predict an outcome based on two or more factors. These factors are called independent variables, and the outcome we are trying to predict is called the dependent variable. MLR helps us understand how changes in the independent variables affect the dependent variable.
For example, if you want to predict store sales, you might use factors like advertising money, store size, and inventory to see how they influence sales.
Key Terminology
- Dependent Variable: This is what you are trying to predict or explain (e.g., sales).
- Independent Variables: These are the factors that influence or predict the dependent variable (e.g., advertising money, store size).
- Coefficients: These numbers show how much the dependent variable changes when one independent variable increases by one unit while the others are held constant.
- Residuals (Errors): The difference between what the model predicts and the actual value.
The Multiple Linear Regression Formula
In MLR, the relationship between variables is represented by this formula:
Outcome = Intercept + Coefficient 1 (Factor 1) + Coefficient 2 (Factor 2) + ... + Error
- Outcome: The dependent variable you want to predict.
- Intercept: The starting point or predicted outcome when all factors are zero.
- Coefficients: Show how much each independent variable affects the outcome.
- Error: The difference between the predicted and actual outcome.
Example
Let’s say you want to predict sales using factors like advertising money, store size, and inventory. The formula might look like this:
Sales = -18.86 + 11.53(Advertising) + 16.2(Store Size) + 0.17(Inventory)
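- For each additional dollar spent on advertising, predicted sales increase by $11.53, holding store size and inventory constant.
- For each extra square foot of store size, predicted sales increase by $16.20, holding the other factors constant.
- For each extra dollar of inventory, predicted sales increase by $0.17, holding the other factors constant.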
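The interpretation above can be checked with a short Python sketch. The function name `predict_sales` and the input values are hypothetical, chosen only for illustration; the coefficients are copied from the example equation.

```python
# Coefficients copied from the example equation above.
INTERCEPT = -18.86
COEFFICIENTS = {"advertising": 11.53, "store_size": 16.2, "inventory": 0.17}


def predict_sales(advertising, store_size, inventory):
    """Return predicted sales from the example equation.

    Inputs must be in the same units the model was fitted with.
    """
    return (
        INTERCEPT
        + COEFFICIENTS["advertising"] * advertising
        + COEFFICIENTS["store_size"] * store_size
        + COEFFICIENTS["inventory"] * inventory
    )


# Hypothetical inputs: raise advertising by one unit, keep the rest fixed.
base = predict_sales(advertising=5, store_size=3, inventory=100)
more_ads = predict_sales(advertising=6, store_size=3, inventory=100)

# The prediction changes by exactly the advertising coefficient.
print(round(more_ads - base, 2))  # 11.53
```

The same check works for the other two factors: changing one input by one unit while the others stay fixed shifts the prediction by that factor's coefficient.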
Steps to Perform Multiple Linear Regression
- Collect Data: Gather information about the outcome (dependent variable) and at least two factors (independent variables).
- Explore the Data: Look at your data to understand how the factors relate to each other and to the outcome. Use graphs like scatterplots to visualize relationships.
- Check the Assumptions:
  - Linearity: The relationship between the factors and the outcome should be a straight line.
  - Independence of Errors: The errors (differences between predicted and actual outcomes) should not depend on each other.
  - Equal Error Spread (Homoscedasticity): The size of the errors should be the same across all values of the factors.
  - Normal Error Distribution: The errors should follow a bell-shaped curve.
- Create the Model: Use software like Excel, Python, or R to build the MLR model based on your data.
- Interpret the Coefficients: Each coefficient tells you how much the dependent variable is expected to change when one factor increases by one unit and the other factors stay the same.
- Evaluate the Model: Use measures like R-squared, adjusted R-squared, and p-values to see how well your model explains the outcome.
- Predict New Outcomes: Once the model is created, you can use it to predict outcomes for new data.
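As a rough illustration of the create, interpret, and predict steps, here is a minimal Python sketch that fits a model by ordinary least squares with NumPy. The data are synthetic and generated inside the script (they are not real store data), and the "true" coefficients are borrowed from the example equation only so the estimates have something to be compared against.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data, made up for illustration: 200 stores, 3 factors.
n = 200
advertising = rng.uniform(1, 10, n)
store_size = rng.uniform(1, 6, n)
inventory = rng.uniform(50, 300, n)
noise = rng.normal(0, 5, n)
sales = -18.86 + 11.53 * advertising + 16.2 * store_size + 0.17 * inventory + noise

# Design matrix: a column of ones for the intercept, then one column per factor.
X = np.column_stack([np.ones(n), advertising, store_size, inventory])

# Ordinary least squares: pick the coefficients that minimize squared residuals.
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_adv, b_size, b_inv = coefs

# Residuals: actual outcome minus the model's prediction.
residuals = sales - X @ coefs

# Predict for a new, hypothetical store; the leading 1.0 multiplies the intercept.
new_store = np.array([1.0, 6.0, 3.6, 200.0])
predicted = float(new_store @ coefs)

print(coefs.round(2), round(predicted, 2))
```

The estimated coefficients should land close to, but not exactly on, the values used to generate the data, because of the random noise. Dedicated packages in Python or R also report standard errors and p-values, which this sketch does not compute.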
Assumptions of Multiple Linear Regression
- Linearity: There should be a straight-line relationship between the outcome and the factors.
- No Multicollinearity: The factors should not be too closely related to each other.
- Equal Error Spread: The spread of errors should be about the same for different levels of the factors.
- Normal Error Distribution: The errors should form a bell-shaped curve.
- Independent Errors: Errors should not influence each other.
How to Check the Assumptions
- Linearity: Use scatterplots to check if the relationship between factors and the outcome is a straight line.
- Multicollinearity: Use a tool like VIF (Variance Inflation Factor) to check if the factors are too closely related. As a common rule of thumb, a VIF higher than 10 suggests a problem.
- Equal Error Spread: Look at a residual plot to see if the errors are evenly spread.
- Normal Error Distribution: Make a histogram or Q-Q plot to check if the errors follow a bell-shaped curve.
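- Independent Errors: Use the Durbin-Watson test to check if the errors are independent. The statistic ranges from 0 to 4, and values near 2 suggest the errors are not correlated with each other.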
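Two of these checks, VIF and the Durbin-Watson statistic, are simple enough to compute by hand. The sketch below uses NumPy and synthetic data generated in the script; the helper names `vif` and `durbin_watson` are my own and are not taken from any particular library.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.

    X holds the factors only (no intercept column). Each factor is regressed
    on the remaining factors, and VIF = 1 / (1 - R^2) of that regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Synthetic factors: x3 is almost a copy of x1, so the two are collinear.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.1, size=300)

vif_independent = vif(np.column_stack([x1, x2]))    # unrelated factors
vif_collinear = vif(np.column_stack([x1, x2, x3]))  # x1 and x3 overlap
dw = durbin_watson(rng.normal(size=300))            # independent "errors"

print(vif_independent.round(2), vif_collinear.round(2), round(dw, 2))
```

In this setup the unrelated factors should have VIFs near 1, while x1 and x3 should both exceed the rule-of-thumb threshold of 10. In practice the Durbin-Watson statistic would be computed on the residuals of the fitted model rather than on random numbers.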
Goodness of Fit Measures
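- R-Squared: Shows the proportion of the variation in the outcome that is explained by the independent variables, on a scale from 0 to 1. A higher R-squared means the model fits the data more closely, but R-squared never decreases when you add more variables, so it should not be used alone to compare models.
- Adjusted R-Squared: Adjusts R-squared to account for the number of independent variables in the model, so adding a factor that does not help can lower it.
- P-Values: Tell you whether each factor's coefficient is statistically distinguishable from zero. A p-value less than 0.05 is typically considered significant.
- F-Statistic: Tells you if the overall model is significant, that is, whether the factors together explain the outcome better than a model with no factors.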
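R-squared and adjusted R-squared can be computed directly from actual and predicted values. The following Python sketch uses made-up numbers purely to show the arithmetic; the predictions are not the output of a real fitted model, and the function names are hypothetical.

```python
def r_squared(actual, predicted):
    """1 minus (unexplained variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_factors):
    """Penalize R-squared for the number of factors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_factors - 1)


# Made-up values for illustration only.
actual = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
predicted = [9.0, 13.0, 15.0, 20.0, 23.0, 30.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_factors=2)
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value comes out lower than the plain R-squared here, and the gap grows as more factors are added relative to the number of observations.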
Dummy Variables
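Sometimes, you need to include categories, such as whether a store offers free samples or which location it is in (A, B, or C). Since you can’t use these directly in the model, you create dummy variables. A dummy variable is either 0 or 1. For a yes/no category such as free samples:
- If a store offers free samples, the dummy variable is 1.
- If the store doesn’t offer free samples, the dummy variable is 0.
- A category with three levels, such as location, needs two dummy variables (for example, one for B and one for C), with the remaining level (A) serving as the baseline.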
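Here is one way dummy variables might be built in plain Python. The helper `make_dummies` and the sample values are hypothetical; the sketch drops one level of each category as the baseline, so a category with k levels yields k - 1 dummy columns.

```python
def make_dummies(values, baseline):
    """Return one 0/1 column per category level, skipping the baseline level."""
    levels = sorted(set(values) - {baseline})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical data for three and four stores.
free_samples = ["yes", "no", "yes"]
location = ["A", "B", "C", "A"]

print(make_dummies(free_samples, baseline="no"))
# {'is_yes': [1, 0, 1]}
print(make_dummies(location, baseline="A"))
# {'is_B': [0, 1, 0, 0], 'is_C': [0, 0, 1, 0]}
```

A store at the baseline location (A) has 0 in both location columns, so its effect is absorbed by the intercept and the other coefficients are read relative to it.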
Using MLR to Make Predictions
Once you have built the MLR model, you can use it to predict outcomes. For example, suppose a store spends $6,000 on advertising, has 3,600 square feet, and holds $200,000 in inventory. With every variable measured in thousands, the predicted sales would be:
Predicted Sales = -18.86 + 11.53(6) + 16.2(3.6) + 0.17(200) = -18.86 + 69.18 + 58.32 + 34.00 = 142.64
Because sales are also measured in thousands of dollars, the store is expected to make about $142,640 in sales under these conditions.
Applications of Multiple Linear Regression
- Business: Predicting sales based on factors like advertising, store size, and inventory.
- Healthcare: Predicting health outcomes using factors like age, diet, and physical activity.
- Marketing: Estimating how factors like ad spending and product pricing affect sales.
- Social Sciences: Studying how factors like education and family income affect academic performance.
Conclusion
Multiple Linear Regression is a powerful tool to understand how several factors influence an outcome. By following the steps, checking the assumptions, and interpreting the results correctly, you can make better predictions and decisions using real-world data.