# What is the correct way of adding bias terms in the residuals of the linear regression model?

First, I fit a linear model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

Now I want to visualize $y$ after the effects of $x_1$ and $x_2$ have been removed or adjusted for. I can visualize the $y$ vs. $x_1$ or $x_2$ relationship by using only the residuals $\epsilon$. The problem is that I want to add the bias term to the residuals. Say, I want to plot the adjusted $y$ as a boxplot with respect to another independent variable (e.g. diagnostic group).

For now, I am adjusting the effects of $x_1$, $x_2$ on $y$ as below:

$y_a = y - \beta_1 (x_1 - \bar{x}_1) - \beta_2 (x_2 - \bar{x}_2) \qquad (1)$

Here, for one data point, I am defining the effect of $x_1$ as the change in $y$ caused by the difference of $x_1$ from the mean of $x_1$, i.e. $\bar{x}_1$.

After a few algebraic manipulations:

$y_a = (\beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2) + \epsilon = \text{bias} + \text{residuals} \qquad (2)$
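The step from (1) to (2) is just substituting the fitted model for $y$ in (1), after which the $\beta_1 x_1$ and $\beta_2 x_2$ terms cancel. A minimal numerical check of that identity, assuming an ordinary least squares fit with an intercept and simulated data (all values and variable names below are illustrative, not from the article), might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (illustrative values only)
x1 = rng.normal(50.0, 10.0, n)
x2 = rng.normal(0.0, 1.0, n)
y = 3.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(0.0, 1.0, n)

# Ordinary least squares fit of y = b0 + b1*x1 + b2*x2 + e
X = np.column_stack([np.ones(n), x1, x2])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
b0, b1, b2 = coef
residuals = y - X @ coef

# Equation (1): remove each covariate's deviation-from-mean effect
y_adj_eq1 = y - b1 * (x1 - x1.mean()) - b2 * (x2 - x2.mean())

# Equation (2): constant ("bias") term plus residuals
constant = b0 + b1 * x1.mean() + b2 * x2.mean()
y_adj_eq2 = constant + residuals

print(np.allclose(y_adj_eq1, y_adj_eq2))   # the two forms agree
print(np.isclose(constant, y.mean()))      # the constant is the sample mean of y
```

For a least squares fit that includes an intercept, the residuals average to zero, so the constant in (2) equals the sample mean of $y$. Under that assumption, the adjusted values are simply the residuals shifted back to the overall mean of $y$.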

First, I am not 100% convinced by this technique myself. However, this article https://surfer.nmr.mgh.harvard.edu/ftp/articles/buckner2004.pdf also uses it for covariate adjustment (see Equation 1 on page 728).

Question 1: Is this technique correct, and why or why not? Or, asking the same question in terms of equation (2): is it correct to add the bias term $(\beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2)$ to the residuals?

Let’s assume the above adjustment technique is correct.
Let’s say $x_1$ is a categorical variable with more than two levels. How would I calculate the mean of $x_1$ ($\bar{x}_1$)?

Question 2:
How do I calculate the mean of a categorical variable? Strictly speaking, it doesn’t even make sense to calculate the mean or any other summary statistic of a categorical variable. Is there any workaround for this?

Cross Validated Asked by gruangly on November 14, 2021

I think I am confused by the way you are using the word bias. It seems like they adjusted the HCV for head size by fitting a linear regression model. So for people with above-average head size they artificially reduced HCV, and for people with below-average head size they increased (adjusted) HCV, in order to reduce the variance caused by head size.

Then they used HCVadj as a factor to model dementia status.

The reason they adjust beforehand is that they want to compute Cohen's D, which uses the mean HCVs in the demented and non-demented populations (there is no room for head size or other external factors in this model). So now they have a Cohen's D adjusted for head size. I see no problem here, but I am no expert in Cohen's D.
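The two-step idea described above (adjust first, then compare group means) can be sketched as follows. This is only an illustration on simulated data: the variable names, the numbers, the choice to estimate the slope from all subjects pooled together, and the pooled-standard-deviation form of Cohen's D are assumptions made here, not details taken from the article.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
n = 300
head_size = rng.normal(1500.0, 150.0, n)        # hypothetical covariate
demented = rng.integers(0, 2, n).astype(bool)   # hypothetical group labels
volume = 2.0 + 0.003 * head_size - 0.5 * demented + rng.normal(0.0, 0.4, n)

# Slope of volume on head size (simple regression with an intercept)
slope, intercept = np.polyfit(head_size, volume, 1)

# Adjust: remove the part of volume explained by deviation from mean head size
volume_adj = volume - slope * (head_size - head_size.mean())

d_raw = cohens_d(volume[~demented], volume[demented])
d_adj = cohens_d(volume_adj[~demented], volume_adj[demented])
print(f"d (raw) = {d_raw:.2f}, d (adjusted) = {d_adj:.2f}")
```

The adjustment leaves the overall mean of the outcome unchanged and only removes the spread attributable to the covariate, which is why the adjusted effect size can differ from the raw one.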

If you have a categorical predictor, you can accomplish the same thing with unbalanced effect coding (see here: What is effect coding?). Include the 4-level class variable in the model using 4 dummy variables ($x_{21}, x_{22}, \ldots$) coded as shown on that page.

Fit the model $y = \beta_1 (x_1 - \bar{x}_1) + \beta_2 x_{21} + \beta_3 x_{22} + \beta_4 x_{23} + \beta_5 x_{24}$.

Calculate $y_a$ for each individual as in your equation (1) (without the intercept). Then use the $y_a$ values to calculate your Cohen's D. If you are not calculating Cohen's D or something similar that requires 2 groups, then you don't need this method. Maybe you can find some other way that takes the other factors into account?
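As one possible workaround for the "mean of a categorical variable" problem, the sketch below applies the same adjustment to indicator columns. Note that it uses plain treatment (0/1 dummy) coding with an intercept rather than the effect coding mentioned above, and that the data, the `site` variable, and its effects are hypothetical. The point it illustrates is that the mean of a 0/1 indicator column is just the proportion of observations at that level, so equation (1) can be applied column by column.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

x1 = rng.normal(0.0, 1.0, n)             # continuous covariate
site = rng.integers(0, 4, n)             # hypothetical 4-level categorical covariate
site_effect = np.array([0.0, 1.0, -0.5, 2.0])
y = 10.0 + 1.5 * x1 + site_effect[site] + rng.normal(0.0, 1.0, n)

# Treatment (dummy) coding: one 0/1 column per non-reference level
dummies = (site[:, None] == np.arange(1, 4)).astype(float)   # shape (n, 3)

X = np.column_stack([np.ones(n), x1, dummies])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
b0, b1, b_site = coef[0], coef[1], coef[2:]
residuals = y - X @ coef

# The "mean" of each dummy column is the proportion of that level
level_props = dummies.mean(axis=0)

# Same form as equation (1), applied to every indicator column
y_adj = y - b1 * (x1 - x1.mean()) - (dummies - level_props) @ b_site

# Same form as equation (2): a constant plus the residuals
constant = b0 + b1 * x1.mean() + level_props @ b_site
print(np.allclose(y_adj, constant + residuals))
```

Because the adjusted values again reduce to a constant plus the residuals, they should not depend on which coding scheme is used for the categorical variable, as long as the model contains an intercept and the same set of levels.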

Answered by Derrick Kaufman on November 14, 2021
