How to run svy: regress in R or get R-squared in R for complex survey data

I am trying to get R-squared (explained variation) from an OLS linear regression fitted to complex survey data.

In Stata, this can be done with svy: regress. In R, however, the ‘survey’ package has no dedicated function for OLS linear regression. There is svyglm, which fits a generalized linear model (GLM), but its summary does not report explained variation (R-squared) because the fit isn’t OLS. Is there a way to get R-squared for complex survey data in R?

library(survey)

design <- svydesign(id = ~psu, strata = ~strata, weights = ~w_mec, nest = TRUE, data = sample)

model1 <- svyglm(bmi ~ age + black + hispanics + others + female + edu2 + edu3 + edu4 + near_poor + middle + high, design = design, family = gaussian(link = "identity"))

summary(model1)


Above is an example of what I ran in R. It doesn’t give R-squared because the fit is a GLM. You don’t need to reproduce anything; this isn’t a code issue. I just want to know whether there is a way to get R-squared for complex survey data in R.

Cross Validated Asked on November 14, 2021

To get the multiple R^2 of a svyglm model, you could also use the summ function in the jtools package.
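For instance (a minimal sketch, assuming the jtools package is installed; model1 is the svyglm fit from the question):

```r
library(jtools)

# summ() has a method for svyglm objects and reports a pseudo-R-squared
# alongside the coefficient table.
summ(model1)
```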

Answered by David F on November 14, 2021

Because "svyglm" objects are also "lm" objects by default -- see class(model1) in your example -- you can just run summary.lm(model1) to get the R^2.
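As a sketch (model1 being the svyglm fit from the question):

```r
class(model1)
# "svyglm" "glm" "lm"  -- svyglm objects inherit from "lm"

# Call the lm summary method explicitly to get the Multiple R-squared.
# Note this is a point estimate from the weighted fit, with no
# design-based standard error attached.
summary.lm(model1)$r.squared
```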

Answered by user697473 on November 14, 2021

For a Gaussian GLM (where the population parameter is the OLS parameter) you can just divide the dispersion parameter by the population variance and subtract the result from 1.

Using one of the examples from the svyglm help page:

> data(api)
> dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
> api.reg <- svyglm(api.stu~enroll, design=dstrat)
> summary(api.reg)

Call:
svyglm(formula = api.stu ~ enroll, design = dstrat)

Survey design:
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.34383   11.46399   1.164    0.246
enroll       0.81454    0.02459  33.120   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 7331.633)

Number of Fisher Scoring iterations: 2

> nullmodel<-svyglm(api.stu~1,design=dstrat)
> summary(nullmodel)

Call:
svyglm(formula = api.stu ~ 1, design = dstrat)

Survey design:
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   498.23      16.06   31.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 137086.3)

Number of Fisher Scoring iterations: 2

> 1-7331.633/137086.3
[1] 0.9465181


You could also get the null-model variance using svyvar:

> svyvar(~api.stu,design=dstrat)
variance    SE
api.stu   137086 19197


And in this case we have the whole population (apipop), so we can run lm on the population and compare the survey estimate of R-squared with the population value:

> summary(lm(api.stu ~ enroll,data=apipop))

Call:
lm(formula = api.stu ~ enroll, data = apipop)

Residuals:
Min       1Q   Median       3Q      Max
-1021.20   -13.76     6.13    29.56   498.98

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.613245   1.953709   6.968 3.55e-12 ***
enroll       0.813556   0.002522 322.581  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 92.16 on 6155 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared:  0.9442,    Adjusted R-squared:  0.9441
F-statistic: 1.041e+05 on 1 and 6155 DF,  p-value: < 2.2e-16


As an added bonus answer: if you want the Nagelkerke or Cox-Snell R-squared for binary or count data, the survey package provides the function psrsq.
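For example, an illustrative sketch reusing the dstrat design from above with a binary outcome from the api data (the choice of outcome and covariates here is mine, not from the original answer):

```r
# Design-based logistic regression for a binary outcome
logit.mod <- svyglm(sch.wide ~ ell + mobility, design = dstrat,
                    family = quasibinomial())

# Pseudo-R-squared; method can also be "Cox-Snell"
psrsq(logit.mod, method = "Nagelkerke")
```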

Answered by Thomas Lumley on November 14, 2021

You can get the amount of explained variation just by looking at the residual variance of your model and the residual variance of a null model (the same model with all explanatory covariates removed, keeping only an intercept). So the formula you need is:

R2_var = 1 - residual_variance / residual_variance_of_a_null_model
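With svyglm fits, the dispersion reported by summary() plays the role of the residual variance, so a sketch (using the question's design object and a reduced, hypothetical set of its covariates) might look like:

```r
# Full model and intercept-only null model fitted to the same design
fullmodel <- svyglm(bmi ~ age + female, design = design)
nullmodel <- svyglm(bmi ~ 1, design = design)

# Design-based estimate of explained variation
1 - summary(fullmodel)$dispersion / summary(nullmodel)$dispersion
```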

Answered by Tomas on November 14, 2021
