1 Introduction

As emergent urban sprawl takes place in many cities, decision making appears to be vital to influence regional land use and land cover change, as well as various land development within the city. These decisions either on policy or economy, are made from different agents, including developers, planners, governors, as well as property owner and tenants. Each stakeholder seeks for interest and set up bottom lines. Property owner and tenants look for maximum profit and low price. Developers buy lower value land with high development potential and resell them when fully developed. Planners and governors are more interested in holistically urban growth through considering the regional growth in a systematically way. Sustainability urban sprawl are more appealing nowadays given the severe environmental challenges and climate change.

Austin, the Texas capital, had one of the highest rates in the nation for sprawl — the decrease in average neighborhood density — from 2010-2016, at the same time the population has soared. Given the lower tax rate and suitable geographic location, Austin becomes more appealing to those technology companies and new industries. Large high-tech firms and manufactures, including Dell, Samsung, Facebook, Google as well as Apple are all intended to relocate their headquarters or main offices in Austin or its suburb areas such as Round Rock. The circular process of sprawl—where residential development on the urban fringe spawns commercial and industrial growth which in turn generates more residential growth—is facilitated by the construction and expansion of roadways and other infrastructure. However, the negative effects of urbanization increased as the rate of decentralized suburban development increased. Sensitive land including forest and wetland were used for redevelopment, which in turn decrease the land value and create pollution.

In following memo, the urban growth model successfully predicts the land development and population change in 2020. Data collection, data analysis, model building on both demand and supply side, model accuracy as well as allocation strategy will be illustrated with informative diagrams and graphics.

2 Data Analysis

In terms of data collection, land cover data downloaded from the Multi-Resolution Land Characteristics Consortium’s National Land Cover Database (NLCD) includes annual land cover in 2001 and 2011 for the entire Austin. These data are sampled to a 1,000 by 1,000 ft^2 fishnet, which will be used for Population data is downloaded from the U.S. Census and joined to the fishnet by distributing Census Tract population to each grid cell.

2.1 land cover change from 2001 to 2011

As land cover data downloaded, we analysis the land cover in both 2001 and 2011. It is clear to see that majority of urbanization took place in suburb area. Central Austin were well-developed from 2011. To calculate the land cover change, all the land which were undeveloped in 2001 and developed in 2011 are classified as land cover change (value 1, land cover change map). Here, we define that the open space, low intensity, medium and high intensity area are all developed area. As this land cover change map illustrates, most of the land development are in north and south area.

2.2 Fishnet Creation

Fishnet is created at 1000 by 1000 foot resolution. Each land cover value, land change value, population as well as other data value will be aggregated into this fishnet for prediction model.

ggplot() +
  geom_sf(data=AustinMSA_fishnet) +
  labs(title="Fishnet,1000 foot resolution") +
  mapTheme()

This below map is a new land cover development change based on the fishnet aggregation. It will be slightly different from the original map considering the resolution factors.

ggplot() +
  geom_sf(data=changePoints) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=lc_change)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land cover development change", subtitle = "As fishnet centroids") +
  mapTheme()

2.3 land cover data 2001

As the land cover map shown above, numbers of categories were used which could not be clear enough statistically. In this case, the data is reclassified into new categories.

For more clarify, A series of graphics below elaborates the developed, farm, forest, otherUndeveloped, water as well as wetlands in Austin 2001.

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%  
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
    geom_sf(data=AustinMSA) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land cover types, 2001",
         subtitle = "As fishnet centroids") +
   mapTheme()

2.4 Import population data

After finishing the 2001 data analysis procedure, census population data based on census tract were downloaded for both 2000 and 2010.

grid.arrange(
ggplot() +
  geom_sf(data = AustinPop00, aes(fill=factor(ntile(pop_2000,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(AustinPop00,"pop_2000"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Austin MSA: 2000") +
  mapTheme(),

ggplot() +
  geom_sf(data = AustinPop10, aes(fill=factor(ntile(pop_2010,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(AustinPop10,"pop_2010"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Austin MSA: 2010") +
  mapTheme(), ncol=2)

To distribute population into fishnet, we used a weighted interpolation function, assigns a proportion of a tract’s population to a grid cell weighted by the proportion of the tract that intersects the grid cell. This approach works as we assume that the tract population is uniformly distributed across the tract. Compared with the following two graphics, it clear to see the differences between original data and population on fishnet.

grid.arrange(
ggplot() +
  geom_sf(data=AustinMSA) +
  geom_sf(data=AustinPop10, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=substr(quintileBreaks(AustinPop10,"pop_2010"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Austin MSA: 2010",
       subtitle="Represented as tracts; Boundaries omitted") +
  mapTheme(),

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"pop_2010"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Houston MSA: 2010",
       subtitle="Represented as fishnet gridcells; Boundaries omitted") +
  mapTheme(), ncol=2)

2.5 commercial land use area in 2010

Commercial land use is another important factor to be fully weighed. When development occurs, residential area would like to be reasonably closer to commercial district. The project uses 2010 Austin land use data and filter the main commercial area.

Below is the map showing the commercial district in fishnet.Based on each cell, a half-mile buffer also be created to analysis the total commercial space around each cell.

ggplot() +
  geom_sf(data=commercial_fishnet,aes(fill=commercial)) +
  labs(title = "commercial area") +
  mapTheme()

2.5 highway distance

Transportation accessibility is extremely important in urbanization. As development sprawls further afield, roadways through and adjacent to established neighborhoods will carry increased traffic loads. Distance to the highway could be a significant factor to analysis the urban sprawl.

On the New Development and Highways map, majority of the new development between 2001 and 2011 are within certain distance of highway system.

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=lc_change),size=1.5) +
  geom_sf(data=AustinHighways, color = "#E1704C") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids") +
  mapTheme()

After calculating the distance to highways, the map below shows the mean distance by grid cell. The darker area represents closer to highway.

ggplot() +
  geom_sf(data=AustinMSA) +
  geom_point(data=highwayPoints_fishnet, aes(x=xyC(highwayPoints_fishnet)[,1], 
                                             y=xyC(highwayPoints_fishnet)[,2], 
                 colour=factor(ntile(distance_highways,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(highwayPoints_fishnet,"distance_highways"),1,8),
                      name="Quintile\nBreaks") +
  geom_sf(data=AustinHighways, colour = "red") +
  labs(title = "Distance to Highways",
       subtitle = "As fishnet centroids; Highways visualized in red") +
  mapTheme()

2.6 Spatial lag of development

Plenty of factors would affect the development decision. However, accessibility to the existing developed area should be taken into consideration. As new, low-density development is constructed further away from established urban areas the costs of streets, power lines, water and sewer mains increases.

Accessibility is measured by way of a spatial lag hypothesizing that new development is a fucntion of distance to existing development. The shorter the distance, the more accessible a grid cell is to existing development. This is measured by calculating the average distance from each grid cell to its 5 nearest developed neighboring grid cells in 2001. Since most of the developed area are in the central Austin, calculating the 2 nearest developed grid cells would not create significant correlation. Based on each cell, a quarter-mile buffer also be created to analysis the total developed space around each cell.

All the central Austin area have the same distance to the 5 nearest developed cells, which is 932, as the map shown below.

ggplot() +
  geom_sf(data=AustinMSA) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
                 colour=factor(ntile(lagDevelopment,5))), size=1.5) +
  scale_colour_manual(values = palette5,
                     labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
                     name="Quintile\nBreaks") +
  labs(title = "Spatial lag to 2001 development",
       subtitle = "As fishnet centroids") +
  mapTheme()

2.7 Using council district to analysis urban development

Most of city Austin area are within Travis County. It would be hard to analysis the regional demand based on county. As a result, we use council district as study areas.

3 Prediction for 2010

The prediction model we created is logistic regression model. In other word, a binnary dependent variable is generated to determine whether the area will be developed.

3.1 Statistical relationship analysis

Several variables are created above. The relevant question is whether for a given feature, there is a statistically significant difference between areas that changed and areas that did not. To evaluate these relationships, the differences are explored in a set of plots below.

Closer to commercial district or highway systems would cause more probabilities of new development. Also, proximity of surrounding developed area would result in new development. However, this relationship is not as significant as two variables before.

dat %>%
  dplyr::select(distance_highways,lagDevelopment,commercial_distance,lc_change) %>%
  gather(Variable, Value, -lc_change, -geometry) %>%
  ggplot(., aes(lc_change, Value, fill=lc_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New development as a function of the countinuous variables") +
    plotTheme() 

Another diagram shows the commercial buffer and developed buffer for each cell. Both no change and new development area have some even value, which means the relationship is not very significant.

dat %>%
  dplyr::select(developed_buffer,commercial_buffer,lc_change) %>%
  gather(Variable, Value, -lc_change, -geometry) %>%
  ggplot(., aes(lc_change, Value, fill=lc_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New development as a function of the countinuous variables") +
    plotTheme() 

2000 and 2010 Population, as well as population change are illustrated below. Even though new development area has lower population, the population change is very significant, representing new moved-in residents.

dat %>%
  dplyr::select(pop_2000,pop_2010,pop_Change,lc_change) %>%
  gather(Variable, Value, -lc_change, -geometry) %>%
  ggplot(., aes(lc_change, Value, fill=lc_change)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New development as a function of factor variables") +
    plotTheme()

3.2 model setup

The following six models are created.

Model1 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped, family=“binomial”(link=“logit”), data = datTrain)

Model2 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped + lagDevelopment, family=“binomial”(link=“logit”), data = datTrain)

Model3 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped + lagDevelopment + pop_2000 + pop_2010, family=“binomial”(link=“logit”), data = datTrain)

Model4 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped + lagDevelopment + pop_Change, family=“binomial”(link=“logit”), data = datTrain)

Model5 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped + lagDevelopment + pop_Change + distance_highways + commercial_distance, family=“binomial”(link=“logit”), data = datTrain)

Model6 <- glm(lc_change ~ wetlands + forest + farm + otherUndeveloped + lagDevelopment + pop_Change + distance_highways+commercial_distance+developed_buffer+commercial_buffer, family=“binomial”(link=“logit”), data = datTrain)

3.3 model accuracy

R-squared is a factor to evaluate the accuracy of the model. Model 5 and model 6 have the highest value. However, as we mentioned above, buffer variable is not as significant as others. In this case, we decide to use model 5.

modelList <- paste0("Model", 1:6)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:6)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme()
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2
## fitting null model for pseudo-r2

A density plot visualizing the distribution of predicted probabilities by observed class.This diagram can be used to determine our model threshold latter.

ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme()

The Sensitivity in our model is the rate of developed areas actually predicted as such. The Specificity in our model is the rate of No Change areas that were correctly predicted as No change.The purpose of this prediction model is to optimize the results. Based on the graphic above, we choose different threholds to test the model accuaracy.0.55,0.65,0.40 as well as 0.45 have accuracy of 0.80. However, the specificity of 0.65 is relatively low, which means the prediction of no change would not be precise. As a result, 0.40 threshold will be used for final model.

A seris of maps below to show the difference between 0.4 and 0.55 threhold. It is clear to see that 0.4 is more precise compared with the other one.

ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development predictions - Low threshold") + mapTheme()

3.4 Generalizability

It matters whether the model we created would be suitable for other district or even another year. It is important to use spatial cross-validation to test the generalizability.The approach helps us understand whether our model is comparable to each county in the study area despite any possible differences in land use or land use planning.

In the following table, District 1, 6 and 8 have the most land change within this decade.District 4 which contains the central area of Austin, has the greatest Sensitivity level, which is reasonable given it contains most of the development.For the most part, these confusion matrix metrics suggest the model is generalizable to those counties that underwent significant development change.

4 population prediction for 2020 and development demand

4.1 population prediction for 2020

Since the population data we have are only 2000 and 2010, another prediction model is required for population in 2020. The project uses the exsiting dataset to build another model, shown below.

Develop,farm,otherundeveloped,population,distance to highway and commercial area, as well as distance to 5 nearest developed cells are used for population model. All of these variables are significant.

summary(popmodel1)
## 
## Call:
## lm(formula = pop_Change ~ developed + farm + otherUndeveloped + 
##     pop + distance_highways + lagDevelopment + commercial_distance, 
##     data = dat00)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -156.627   -9.388   -2.775    6.651  211.738 
## 
## Coefficients:
##                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)          7.53771536  0.46661259  16.154 < 0.0000000000000002 ***
## developed1           3.28547934  0.40238867   8.165 0.000000000000000345 ***
## farm1               -1.22121210  0.49284455  -2.478               0.0132 *  
## otherUndeveloped1    3.95860623  0.32044843  12.353 < 0.0000000000000002 ***
## pop                 -0.02801935  0.00390894  -7.168 0.000000000000793999 ***
## distance_highways    0.00042921  0.00005029   8.534 < 0.0000000000000002 ***
## lagDevelopment      -0.00040348  0.00006160  -6.550 0.000000000059423028 ***
## commercial_distance -0.00090643  0.00006212 -14.592 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.57 on 16075 degrees of freedom
## Multiple R-squared:  0.04138,    Adjusted R-squared:  0.04097 
## F-statistic: 99.13 on 7 and 16075 DF,  p-value: < 0.00000000000000022

The final population prediction result shows that district 6 will have the highest population,

districtPopulation_2020 %>%
  st_set_geometry(NULL) %>% 
  gather(Variable,Value, -council_di) %>%
  ggplot(aes(reorder(council_di,-Value),Value)) +
  geom_bar(aes(fill=Variable), stat = "identity") +
  scale_fill_manual(values = palette2,
                    labels=c("2020","2010"),
                    name="Population") +
  labs(title="Population Change by district: 2010 - 2020",
       x="Council District", y="Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

The map of predicted probabilities that results is best thought of as a measure of predicted development demand in 2020.Those surburb district, including 1,2,6 are all in high development demand.

5 Comparing predicted development demand & environmental sensitivity

The negative effects of urbanization increased the rate of losing Sensitive land including forest and wetland. To sustainablaly evaluating the planning vision, this prediction model take the following aspects into consideration:

1. The total amount of wetlands and forest land cover area in 2011.
2. The amount of sensitive land (wetland and forest) lost between 2001 and 2011.
3. The total area of sensitive region (more than 1 acre) in 2011.

5.1 land cover data in 2011

A simple comparison of land cover in both 2001 and 2011.

Similar with 2001 land cover analysis approach,a series of graphics below elaborates the developed, farm, forest, otherUndeveloped, water as well as wetlands in Austin 2011.

dat2 %>%
  gather(var,value,developed11:water11) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=AustinMSA) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land cover types, 2011",
         subtitle = "As fishnet centroids") +
   mapTheme()

5.2 Sensitive land lover lost

2001 Wetland and Forest that are developed in 2011 are defined as sensitive lost. Austin do has some sensitive lost when urban growth accelarates within this decade.

ggplot() +
  geom_sf(data=AustinMSA) +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost11))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2001 - 2011",
       subtitle = "As fishnet centroids") +
  mapTheme()

Given the development factors, urban growth could inevitably decrease the forest and wetland. However, larger natural resources should be preserved during the development process. Below is the map showing these sensitive regions which are larger than 1 acre in Austin.

ggplot() +
  geom_sf(data=AustinMSA) +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitiveRegions))) +
  scale_colour_manual(values = palette2,
                      labels=c("Other","Sensitive Regions"),
                      name="") +
  labs(title = "Sensitive regions",
       subtitle = "Continous areas of either wetlands or forests\ngreater than 1 acre") +
  mapTheme()

5.3 Summary by District

To better understand the demand and supply side of development by district, we create the following graphic. Both district 1 and district 2 has high development demand in 2020 as well as population growth. However, by analyzing sensitive land area, district 2 are more suitable for development since the quantity of forest and wetland are lower. In this case, we will choose district 2 as study area for the supply-side analysis.

6 Supply-side change

In this scenario, a new highway was planned to connect the Airport and W William Cannon Dr. The new highway will be updated from E William Cannon Dr and McKinney Falls Pkwy. We expected that this new highway will stimulate more real estate development alongside and hence attract more population.

Then we used the population prediction model trained on population change from 2000 to 2010 to predict 2020 population of district 2 with and without this new development. The results, however, indicated that the new development will reduce predicted population.

In contrast, the prediction of development potential near new highway is higher than the prediction before, which means the new highway will generate new development according to our model.

7 Allocation

After building both demand and supply side scenarios,allocation is the final stage of the urban growth modeling process.Planners and governors are supposed to allocate resources based on the condition of each district. District 1 and 2 are more suitable for 2020 urban growth compared with other districts. When mentioning suitable, we decide to use sensitive region(sensitive area larger than 1 acre) as a factor to determine whether these areas are suitable.In this case, majority of sensitive region are located in west of Austin, near Colorado river.

A new graphic shows the development potential 2020 in Austin. North and east of Austin have some ideal area for urban growth.

Being more detailed in allocation process, district 1 has some population growth opportunites near northeast area. Some new development could be considered near highway. Given the city general environment, these area could be used for large single-family residential development or commercial district.It can also be used for industry as more companies are planning to move in Austin.

District 2, which we used as a study area for new highway development before, is also suitable for new development. Given its unique location (near airport), some TOD concepts and development could be implemented in these places. A new transportation hub and commercial center could be an option.