Cook County Property Tax Model in R

What is it?

A while back Tanya Schlusser published an excellent blog post along with a very accomplished Python Notebook modeling property taxes in her neighborhood. Turns out her neighborhood is next to mine, and Python is close to R so figured I should examine / replicate / extend this.

This repo, package and write-up are my current crack at it. See the README.md for some background. We bundled the code in an R package (in the same repo) as packages are a good way to bundle up code and data.

Here, we use a pre-made csv file (or, rather, data.frame resulting from reading it). The steps from getting initial Property Index Number (PIN) data and accumulation of per-property data given each PIN are described at the repo.

Data

library(ccptm)                          # load this package
suppressMessages({
    library(data.table)                 # data manipulation
    library(ggplot2)                    # plotting
    library(GGally)                     # additional pairs plot
    })
data(riverforest)  # our sample from River Forest, IL -- raw

data <- data.table(riverforest)

## shorten some column names
nm <- names(data)
setnames(data, nm, gsub("^Prop", "", nm))
setnames(data, nm, gsub("^Char", "", nm))
setnames(data, nm, gsub("^Info", "", nm))

## minimal transformation
data[, `:=`(MktValCurrYear       = as.numeric(gsub("[$,]", "", MktValCurrYear)),
            MktValPrevYear       = as.numeric(gsub("[$,]", "", MktValPrevYear)),
            AsdValTotalCertified = as.numeric(gsub("[$,]", "", AsdValTotalCertified)),
            AsdValTotalFirstPass = as.numeric(gsub("[$,]", "", AsdValTotalFirstPass)),
            SqFt                 = as.numeric(gsub("[,]", "", SqFt)),
            BldgSqFt             = as.numeric(gsub("[,]", "", BldgSqFt)),
            CentAir              = as.factor(ifelse(CentAir == "Yes", 1L, 0L)),
            Frpl                 = as.ordered(as.integer(Frpl)),
            Age                  = as.integer(Age),
            FullBaths            = as.ordered(as.integer(FullBaths)),
            HalfBaths            = as.ordered(as.integer(HalfBaths)),
            Basement             = as.factor(Basement),
            Garage               = as.factor(Garage),
            Attic                = as.factor(Attic),
            Use                  = as.factor(Use)
            )]
## re-order / re-level data
data[, `:=`(Basement  = stats::reorder(Basement, MktValCurrYear, FUN=median),
            Attic     = stats::reorder(Attic, MktValCurrYear, FUN=median))]
data[, `:=`(Basement  = stats::relevel(Basement, "None"),
            Attic     = stats::relevel(Attic, "None"))]
data[]
##       CurYear                PIN             Address         City
##    1:    2019 15-01-104-003-0000      1519 PARK AVE  RIVER FOREST
##    2:    2019 15-01-104-005-0000      1519 PARK AVE  RIVER FOREST
##    3:    2019 15-01-104-006-0000      1515 PARK AVE  RIVER FOREST
##    4:    2019 15-01-104-007-0000      1509 PARK AVE  RIVER FOREST
##    5:    2019 15-01-104-008-0000      1501 PARK AVE  RIVER FOREST
##   ---                                                            
## 1204:    2019 15-01-418-006-0000 821 BONNIE BRAE PL  RIVER FOREST
## 1205:    2019 15-01-418-007-0000 815 BONNIE BRAE PL  RIVER FOREST
## 1206:    2019 15-01-418-008-0000 811 BONNIE BRAE PL  RIVER FOREST
## 1207:    2019 15-01-418-009-0000 803 BONNIE BRAE PL  RIVER FOREST
## 1208:    2019 15-01-418-015-0000   826 N HARLEM AVE  RIVER FOREST
##           Township Classification  SqFt NBHD Taxcode
##    1: River Forest            206  7240   15   33001
##    2: River Forest            206  3620   15   33001
##    3: River Forest            206 10860   15   33001
##    4: River Forest            206 12645   15   33001
##    5: River Forest            206 13394   15   33001
##   ---                                               
## 1204: River Forest            206 11960   50   33001
## 1205: River Forest            206 11960   50   33001
## 1206: River Forest            206 11908   50   33001
## 1207: River Forest            206 12880   50   33001
## 1208: River Forest            206  9200   50   33001
##                 AsdValCurYear AsdValPrevYear AsdValLandFirstPass
##    1: 2019 Assessor Certified           2018               8,688
##    2: 2019 Assessor Certified           2018               4,344
##    3: 2019 Assessor Certified           2018              13,032
##    4: 2019 Assessor Certified           2018              15,174
##    5: 2019 Assessor Certified           2018              16,072
##   ---                                                           
## 1204: 2019 Assessor Certified           2018              11,362
## 1205: 2019 Assessor Certified           2018              11,362
## 1206: 2019 Assessor Certified           2018              11,312
## 1207: 2019 Assessor Certified           2018              12,236
## 1208: 2019 Assessor Certified           2018              20,470
##       AsdValLandCertified AsdValBldgFirstPass AsdValBldgCertified
##    1:               8,688              37,810              37,810
##    2:               4,344               9,452               9,452
##    3:              13,032              49,637              49,637
##    4:              15,174              48,633              48,633
##    5:              16,072              84,919              84,919
##   ---                                                            
## 1204:              11,362              65,016              65,016
## 1205:              11,362              56,680              56,680
## 1206:              11,312              56,651              56,651
## 1207:              12,236              69,345              69,345
## 1208:              20,470              21,638              21,638
##       AsdValTotalFirstPass AsdValTotalCertified CurrYear MktValCurrYear
##    1:                46498                46498     2019         464980
##    2:                13796                13796     2019         137960
##    3:                62669                62669     2019         626690
##    4:                63807                63807     2019         638070
##    5:               100991               100991     2019        1009910
##   ---                                                                  
## 1204:                76378                76378     2019         763780
## 1205:                68042                68042     2019         680420
## 1206:                67963                67963     2019         679630
## 1207:                81581                81581     2019         815810
## 1208:                42108                42108     2019         421080
##       PrevYear MktValPrevYear
##    1:     2018         464980
##    2:     2018         137960
##    3:     2018         626690
##    4:     2018         638070
##    5:     2018        1009910
##   ---                        
## 1204:     2018         763780
## 1205:     2018         680420
## 1206:     2018         679630
## 1207:     2018         815810
## 1208:     2018         421080
##                                                                    Desc
##    1: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##    2: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##    3: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##    4: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##    5: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##   ---                                                                  
## 1204: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
## 1205: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
## 1206: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
## 1207: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
## 1208: Two or more story residence, over 62 years, 2,201 to 4,999 sq.ft.
##         ResType           Use Apts      ExtConst FullBaths HalfBaths
##    1: Two Story Single Family    0       Masonry         2         1
##    2: Two Story Single Family    0       Masonry         2         1
##    3: Two Story Single Family    0 Frame/Masonry         3         1
##    4: Two Story Single Family    0       Masonry         2         1
##    5: Two Story Single Family    0       Masonry         3         0
##   ---                                                               
## 1204: Two Story Single Family    0 Frame/Masonry         3         1
## 1205: Two Story Single Family    0 Frame/Masonry         2         1
## 1206: Two Story Single Family    0       Masonry         3         1
## 1207: Two Story Single Family    0       Masonry         2         2
## 1208: Two Story Single Family    0       Masonry         3         1
##                     Basement                  Attic CentAir Frpl
##    1:    Full and Unfinished                   None       1    1
##    2:    Full and Unfinished                   None       1    1
##    3:   Partial and Rec Room                   None       1    1
##    4: Partial and Unfinished    Full and Unfinished       1    1
##    5:   Partial and Rec Room Partial and Unfinished       1    2
##   ---                                                           
## 1204:      Full and Rec Room Partial and Unfinished       0    1
## 1205: Partial and Unfinished                   None       1    1
## 1206:   Partial and Rec Room Partial and Unfinished       1    2
## 1207:   Partial and Rec Room                   None       0    1
## 1208:      Full and Rec Room                   None       1    1
##                 Garage Age BldgSqFt           AsmtPass
##    1:   2 car attached  77     2463 Assessor Certified
##    2:   2 car attached  77     2463 Assessor Certified
##    3: 1.5 car attached  74     2608 Assessor Certified
##    4:   2 car detached  92     2750 Assessor Certified
##    5: 2.5 car attached  77     4486 Assessor Certified
##   ---                                                 
## 1204:   3 car detached  93     3616 Assessor Certified
## 1205:   2 car detached  93     3109 Assessor Certified
## 1206:   2 car detached  93     3094 Assessor Certified
## 1207:   3 car attached  93     3854 Assessor Certified
## 1208:   2 car detached  86     2516 Assessor Certified

Initial Plot

The market value and assed value columns are linearly dependent via a simple ‘10x’ multiplication. We will focus the MktValCurrYear column, either of the assessed value columns could be used equally well.

We start with a simple histogram plus density estimate and added rug plot of data points (using alpha blending to account for very high density density in the middle price range).m

## histogram + density + rug
ggplot(data, aes(x=MktValCurrYear)) +
    geom_histogram(aes(y=..density..), bins=40, color="darkgrey", fill="lightblue") +
    geom_density(color="darkblue") + geom_rug(alpha=0.1)

Scatter Plot

Several additional variables are of interest. The following pairs plot show several of them.

We split this into two plots to not scale each pairs-plot cell down by too much.

## pairs
## Frpl, FullBaths, HalfBaths, MktValCurrYear) -- CentAir,
ggpairs(data[, .(MktValCurrYear, SqFt, BldgSqFt, Age)],
        diag = list(continuous="densityDiag"),
        upper = list(continuous = wrap("points", size=0.5, alpha=0.1),
                     discrete="facetbar"),
        lower = list(continuous = wrap("points", size=0.5, alpha=0.1),
                     combo = "facetdensity"))

## pairs
## ,
ggpairs(data[HalfBaths<=2 & FullBaths<=4,   # suppresses some warnings from empty cells
             .(MktValCurrYear, Frpl, FullBaths, HalfBaths, CentAir)],
        mapping = aes(alpha = 0.001),
        diag = list(continuous="densityDiag"),
        upper = list(continuous = "points",
                     discrete="facetbar"),
        lower = list(continuous = "points",
                     combo = "facetdensity"))

Linear Models

We fit a few first exploratory models. We omitted the summary for brevity here.

## minimal first fit, overall fairly weak
fit0 <- lm(log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + CentAir + Frpl + log(Age) +
              FullBaths + HalfBaths + Basement + Attic - 1, data)
#summary(fit0)

fit1 <- lm(log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + log(Age) + FullBaths + HalfBaths - 1, data)
#summary(fit1)

fit2 <- lm(log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + log(Age) - 1, data)
#summary(fit2)

We can also plot the fits:

ggFittedVsActual <- function(fitted, actual) {
    ll <- range(range(fitted), range(actual))
    ggplot(data.frame(x=fitted, y=actual)) +
        geom_abline(intercept=0, slope=1, color="darkgrey") + #, style="dotted")
        geom_point(aes(x=fitted,y=actual),alpha=0.1,color="mediumblue") +
        coord_fixed(xlim=ll, ylim=ll) +
        xlab("Model Predictions") + ylab("Actual Values")
}


fit4 <- lm(MktValCurrYear ~ SqFt + BldgSqFt + Age - 1, data)
summary(fit4)
## 
## Call:
## lm(formula = MktValCurrYear ~ SqFt + BldgSqFt + Age - 1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -718358  -30610    1913   44613  390780 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## SqFt       18.889      0.787    24.0   <2e-16 ***
## BldgSqFt  113.331      3.759    30.1   <2e-16 ***
## Age      1208.889    103.217    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88600 on 1205 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.984 
## F-statistic: 2.42e+04 on 3 and 1205 DF,  p-value: <2e-16
ggFittedVsActual(predict(fit4), data[,MktValCurrYear])

fit5 <- lm(log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + Age - 1, data)
summary(fit5)
## 
## Call:
## lm(formula = log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + 
##     Age - 1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6651 -0.0970  0.0343  0.1328  0.6061 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## log(SqFt)      0.59529    0.02136   27.87   <2e-16 ***
## log(BldgSqFt)  0.94702    0.02474   38.28   <2e-16 ***
## Age            0.00295    0.00037    7.96    4e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.214 on 1205 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 1.58e+06 on 3 and 1205 DF,  p-value: <2e-16
ggFittedVsActual(exp(predict(fit5)), data[,MktValCurrYear])

The basic log-linear model (see Tanya’s write-up for more; this appears to be prescribed by the County) appears to lack a little bit of additional structure:

Adding a squared term for squared footage reduces this effect and appears to control for underprediction for more highly-priced homes:

fit6 <- lm(log(MktValCurrYear) ~ log(SqFt) + I(log(BldgSqFt)^2) + log(BldgSqFt) + Age - 1, data)
summary(fit6)
## 
## Call:
## lm(formula = log(MktValCurrYear) ~ log(SqFt) + I(log(BldgSqFt)^2) + 
##     log(BldgSqFt) + Age - 1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4325 -0.0489  0.0108  0.0756  0.4846 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## log(SqFt)           0.416726   0.017665   23.59  < 2e-16 ***
## I(log(BldgSqFt)^2) -0.090931   0.003191  -28.50  < 2e-16 ***
## log(BldgSqFt)       1.895748   0.038394   49.38  < 2e-16 ***
## Age                 0.001895   0.000289    6.56  7.8e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.165 on 1204 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 1.99e+06 on 4 and 1204 DF,  p-value: <2e-16
ggFittedVsActual(exp(predict(fit6)), data[,MktValCurrYear])

Additional Variables: Attic ?

Versus MarketValue

A slight trend towards higher values with living area attics:

attic <- data[, .(Attic, MktValCurrYear)]
ggplot(attic, aes(y=MktValCurrYear, x=Attic, color=Attic)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")

Residuals: Linear Model

attic2 <- data[, .(Attic, Resid=resid(fit4))]
ggplot(attic2, aes(y=Resid, x=Attic, color=Attic)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")

Residuals: Log-linear Model

attic3 <- data[, .(Attic, Resid=exp(resid(fit5)))]
ggplot(attic3, aes(y=Resid, x=Attic, color=Attic)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")

Improved Fits?

So lets reduces the five-valued factor variable into just two values: with or without attic.

data[, finishedAttic := grepl("Living", as.character(Attic))]
fit4a <- lm(MktValCurrYear ~ SqFt + BldgSqFt + Age + finishedAttic - 1, data)
summary(fit4a)
## 
## Call:
## lm(formula = MktValCurrYear ~ SqFt + BldgSqFt + Age + finishedAttic - 
##     1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -714748  -33603    1848   44028  385007 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## SqFt               1.88e+01   7.91e-01   23.71  < 2e-16 ***
## BldgSqFt           1.11e+02   4.56e+00   24.24  < 2e-16 ***
## Age                1.07e+03   1.59e+02    6.72  2.8e-11 ***
## finishedAtticFALSE 2.40e+04   1.97e+04    1.22     0.22    
## finishedAtticTRUE  2.01e+04   2.25e+04    0.89     0.37    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88600 on 1203 degrees of freedom
## Multiple R-squared:  0.984,  Adjusted R-squared:  0.984 
## F-statistic: 1.45e+04 on 5 and 1203 DF,  p-value: <2e-16
fit5a <- lm(log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + Age + finishedAttic - 1, data)
summary(fit5a)
## 
## Call:
## lm(formula = log(MktValCurrYear) ~ log(SqFt) + log(BldgSqFt) + 
##     Age + finishedAttic - 1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4328 -0.0477  0.0106  0.0751  0.4846 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## log(SqFt)          0.415661   0.017734   23.44  < 2e-16 ***
## log(BldgSqFt)      0.432440   0.027049   15.99  < 2e-16 ***
## Age                0.001907   0.000296    6.44  1.8e-10 ***
## finishedAtticFALSE 5.892568   0.211883   27.81  < 2e-16 ***
## finishedAtticTRUE  5.890236   0.214992   27.40  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.165 on 1203 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 1.58e+06 on 5 and 1203 DF,  p-value: <2e-16

It does not seem to add much value beyond what is already covered by square footage and building square footage.

Additional Variables: Basement?

Versus MarketValue

bsmt <- data[, .(Basement, MktValCurrYear)]
ggplot(bsmt, aes(y=MktValCurrYear, x=Basement, color=Basement)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")

Residuals: Linear Model

bsmt2 <- data[, .(Basement, Resid=resid(fit4))]
ggplot(bsmt2, aes(y=Resid, x=Basement, color=Basement)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")
## notch went outside hinges. Try setting notch=FALSE.

Residuals: Log-linear Model

bsmt3 <- data[, .(Basement, Resid=exp(resid(fit5)))]
ggplot(bsmt3, aes(y=Resid, x=Basement, color=Basement)) +
    geom_boxplot(notch=TRUE) + geom_jitter(position=position_jitter(0.25), cex=0.75, alpha=0.5) +
    coord_flip() + theme(legend.position = "none")
## notch went outside hinges. Try setting notch=FALSE.

The results appear counter-intuitive: higher value without a basement? Tanya discusses this a little bit. One angle also is that there are very few property without a basement. The negative coefficient can be seen in light of the added contribution to the other variable: an added basement, when holding square footage constant, actually takes away from non-basement living space which could explain the sign of the coefficient. Overall once again a variable which we may keep excluded.

Summary

We present a simple R implementation of a model of per-property attributes explaining the assessed value as a proxy for market value. It replicates the analysis in the original writeup by Tanya Schlusser on new data obtained with additional data fetching code included with our model.