Abstract

This study is a replication of:

Meng, Yunliang. 2021. Crime rates and contextual characteristics: A case study in Connecticut, USA. Human Geographies 15, (2) (11): 209-228, https://www.proquest.com/scholarly-journals/crime-rates-contextual-characteristics-case-study/docview/2638089143/se-2 (accessed April 6, 2025).

Study metadata

Original study spatio-temporal metadata

  • Spatial Coverage: Connecticut, USA
  • Spatial Resolution: County Subdivisions
  • Spatial Reference System: EPSG: 2234
  • Temporal Coverage: 2013 - 2017
  • Temporal Resolution: 1 year

Study design

This is a replication of a study on crime and contextual characteristics in Connecticut. The original study uses geographically weighted regression to test how crime rates at the county subdivision level vary based on several socio-demographic characteristics.

The original study is observational using socio-demographic indicators from the Census Bureau’s American Community Survey 5-year estimates and crime data from the Uniform Crime Report disseminated by the Federal Bureau of Investigation.

We will attempt to use the same methods and data sources as the original authors to see if there is any variation in our results or missing methods in their research.

Materials and procedure

Computational environment

Data and variables

There are two data sources for this study: demographic data from the American Community Survey and crime rate statistics from the Uniform Crime Report gathered by the FBI.

Census County Subdivisions

  • Title: CT Census Subdivision Socio-demographic Data
  • Abstract: CT Census County Subdivision Socio-demographic Data
  • Spatial Coverage: Connecticut
  • Spatial Resolution: County Subdivision
  • Spatial Representation Type: vector
  • Spatial Reference System: EPSG: 2234
  • Temporal Coverage: 2013-2017
  • Temporal Resolution: 1 year
  • Lineage: collected using the census API and tidycensus package in R
  • Distribution: Publicly available
  • Constraints: Public data
  • Data Quality: trustworthy
## Reading layer `county_subdivision' from data source 
##   `/Users/dermotmcmillan/Desktop/GitHub/RPr-CT-crime/data/raw/public/county_subdivision.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 173 features and 98 fields (with 4 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74 ymin: 41 xmax: -72 ymax: 42
## Geodetic CRS:  NAD83

| Label | Alias | Definition |
|---|---|---|
| total_population | B01003_001 | Total population (estimate) |
| age_20m | B01001_008 | Population of males aged 20 |
| age_21m | B01001_009 | Population of males aged 21 |
| age_22_24m | B01001_010 | Population of males aged 22-24 |
| age_25_29m | B01001_011 | Population of males aged 25-29 |
| age_30_34m | B01001_012 | Population of males aged 30-34 |
| age_20f | B01001_032 | Population of females aged 20 |
| age_21f | B01001_033 | Population of females aged 21 |
| age_22_24f | B01001_034 | Population of females aged 22-24 |
| age_25_29f | B01001_035 | Population of females aged 25-29 |
| age_30_34f | B01001_036 | Population of females aged 30-34 |
| education_total | B15003_001 | Total population |
| education_assoc | B15003_021 | Highest degree or highest level of school completed = Associate's degree |
| education_ba | B15003_022 | Highest degree or highest level of school completed = Bachelor's degree |
| education_ma | B15003_023 | Highest degree or highest level of school completed = Master's degree |
| education_pro | B15003_024 | Highest degree or highest level of school completed = Professional school degree |
| education_phd | B15003_025 | Highest degree or highest level of school completed = Doctorate degree |
| median_income | B19013_001 | Median household income |
| poverty_total_pop | B17001_001 | Total population |
| poverty_below | B17001_002 | Income below the poverty level in last 12 months |
| unemployment_total | B23025_001 | Total population |
| unemployment_total_in_labor | B23025_002 | Population in labor force |
| unemployment_unemployed | B23025_005 | Unemployed population considered to be in labor force |
| housing_total | B25003_001 | Occupied housing units |
| housing_renter | B25003_003 | Renter-occupied housing units |
| housing_units_total | B25024_001 | Housing units |
| housing_units_2 | B25024_004 | Housing units w/ 2 units |
| housing_units_3_4 | B25024_005 | Housing units w/ 3 or 4 units |
| housing_units_5_9 | B25024_006 | Housing units w/ 5 to 9 units |
| housing_units_10_19 | B25024_007 | Housing units w/ 10 to 19 units |
| housing_units_20_49 | B25024_008 | Housing units w/ 20 to 49 units |
| housing_units_50 | B25024_009 | Housing units w/ 50 or more units |
| moved_total | B07001_001 | Population 1 year or more in the US |
| moved_within_12_months | B07001_017 | Population that has moved homes in the past 12 months |
| households_total | B11003_001 | Family Type by Presence and Age of Own Children Under 18 Years |
| lone_parent_families_m | B11003_010 | Male householder, no wife present |
| lone_parent_families_f | B11003_016 | Female householder, no husband present |
| hispanic | B03002_012 | Hispanic |
| race_white | B03002_003 | Not Hispanic or Latino, White alone |
| race_black | B03002_004 | Not Hispanic or Latino, Black or African American alone |
| race_asian | B03002_006 | Not Hispanic or Latino, Asian alone |
| race_native | B03002_005 | Not Hispanic or Latino, American Indian and Alaska Native alone |
| race_pacific | B03002_007 | Not Hispanic or Latino, Native Hawaiian and Other Pacific Islander alone |
| race_other | B03002_008 | Not Hispanic or Latino, Some Other Race alone |
| race_two_or_more | B03002_009 | Not Hispanic or Latino, Two or more races |

The remaining metadata fields (Type, Accuracy, Domain, Missing Data Value(s), Missing Data Frequency) are blank for every variable.

Connecticut Crime Rate/Type

  • Title: CT
  • Abstract: CT Census town level Crime Data
  • Spatial Coverage: Connecticut
  • Spatial Resolution: town
  • Spatial Representation Type: non-spatial
  • Temporal Coverage: 2013-2017
  • Temporal Resolution: 1 year
  • Lineage: gathered on 04/06/2024 from http://data.ctdata.org/dataset/ucr-crime-index
  • Distribution: Publicly available
  • Constraints: Public data
  • Data Quality: good, reported from local law enforcement agencies

Bias and threats to validity

The threat most relevant to this study is the Modifiable Areal Unit Problem (MAUP), since crime rates show different social and spatial patterns at different scales. There are also potential sources of error related to endogeneity and spatial autocorrelation, both of which are moderately accounted for in the original study. Additionally, the results do not have predictive power because the GWR is too regionally specific and overfit. Instead, these results should be interpreted as exploratory, requiring more rigorous research to contextualize and verify any findings. Bias is also inherent to crime data, since crime is socially constructed and criminality is at least partially defined around race and class in America. Over-policing and over-reporting in low-income areas and in Black and brown neighborhoods introduce bias into the measurement of crime itself.

Data transformations / analysis

There are several methodological choices that the original authors did not specify, which we will have to infer by comparing results and summary statistics. Specifically, we need to choose a spatial weights matrix for the GWR. We will start with the default ArcGIS spatial weights matrix (since they used the ArcGIS tool for their analysis) and go from there. If we cannot determine which one they used, we will choose our own and compare results. There are also some transformation choices for the census data that we will have to infer by comparing our data to the summary statistics provided (i.e., which denominator to use for percentages).

Data transformations for Crime and Census data are provided in the following workflow:

Workflow

Analysis

Summary Stats

Crime Data

| Statistic | Min | Median | Max | IQR | SD |
|---|---|---|---|---|---|
| Total Violent Crime | 0 | 53 | 951 | 59 | 140 |
| Total Property Crime | 134 | 783 | 3911 | 1204 | 815 |

Table 1

Unplanned Deviation: Because our minimum values differ from those reported in the original study, it appears that the author treated some empty or 0 values as nulls. Since we have no way of discerning which ones these were, we move forward by treating all empty values as 0.
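
The zero-filling choice and the columns of Table 1 can be illustrated with a short sketch. This is a Python illustration written for this explanation, not the R workflow used in the report; the counts, populations, and the helper names `fill_missing`, `rate_per_100k`, and `summarise` are all hypothetical.

```python
import statistics

def fill_missing(counts):
    """Treat empty (None) crime counts as 0, as in the deviation above."""
    return [0 if c is None else c for c in counts]

def rate_per_100k(count, population):
    """Crime rate per 100,000 residents."""
    return count * 100_000 / population

def summarise(values):
    """Min, median, max, IQR and SD, mirroring the columns of Table 1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),
    }

# Hypothetical counts and populations for five towns (None = empty cell).
violent = fill_missing([12, None, 3, 40, 0])
population = [20_000, 5_000, 10_000, 50_000, 8_000]
rates = [rate_per_100k(c, p) for c, p in zip(violent, population)]
summary = summarise(rates)
```

Under this choice an empty cell lowers the minimum to 0, which is one way the minimum could differ from a study that dropped those rows instead.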

Visualize Crime

Census

| Statistic | Min | Median | Max | IQR | SD |
|---|---|---|---|---|---|
| age | 5.92 | 14.61 | 41.40 | 5.86 | 5.36 |
| poverty_rate | 0.27 | 5.13 | 30.49 | 4.54 | 5.12 |
| education | 14.74 | 51.04 | 77.92 | 20.11 | 13.80 |
| median_income | 33841.00 | 85296.00 | 219868.00 | 28534.00 | 28102.93 |
| unemployment_rate | 1.21 | 5.58 | 16.02 | 2.61 | 2.27 |
| rent_rate | 2.26 | 18.98 | 76.20 | 15.58 | 13.67 |
| multi_unit_rate | 0.00 | 17.62 | 94.23 | 21.61 | 18.49 |
| res_mobility | 0.96 | 6.96 | 23.18 | 4.73 | 3.45 |
| pop_density | 11.38 | 169.49 | 3260.72 | 358.20 | 495.35 |
| shannon_eq | 0.07 | 0.24 | 0.65 | 0.19 | 0.15 |

Table 2

Unplanned Deviation: Variables were calculated using each table's respective total population. We cross-compared against the summary table (Table 2) and were able to match most of the values. Population density, housing type (multi_unit_rate), and residential mobility calculations yielded slightly different results. For population density, this is likely because of minor differences in calculating area. For housing type and residential mobility, we were unable to explain the differences. The discrepancies may arise because the original authors cleaned the data without reporting it.
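
As a rough illustration of what "each table's respective total" means, the sketch below derives three of the rates from the labels in the variable table. It is a Python sketch, not the report's R code; the estimates are hypothetical, and the choice of numerators (for example, summing every multi-unit category) is our assumption about how such rates would typically be built.

```python
def pct(numerator, denominator):
    """Percentage of a table's own total; None when the total is 0 or missing."""
    if not denominator:
        return None
    return 100 * numerator / denominator

# Hypothetical ACS estimates for one county subdivision, keyed by the labels
# used in the variable table.
row = {
    "poverty_below": 150, "poverty_total_pop": 3000,
    "housing_renter": 240, "housing_total": 1200,
    "housing_units_2": 50, "housing_units_3_4": 30, "housing_units_5_9": 20,
    "housing_units_10_19": 0, "housing_units_20_49": 0, "housing_units_50": 0,
    "housing_units_total": 1250,
}

# Every housing_units_* field except the table total counts as multi-unit.
multi_unit_fields = [
    k for k in row
    if k.startswith("housing_units_") and k != "housing_units_total"
]

derived = {
    "poverty_rate": pct(row["poverty_below"], row["poverty_total_pop"]),
    "rent_rate": pct(row["housing_renter"], row["housing_total"]),
    "multi_unit_rate": pct(
        sum(row[k] for k in multi_unit_fields), row["housing_units_total"]
    ),
}
```

Swapping a denominator (for instance, dividing renters by all housing units rather than occupied units) shifts the rate, which is the kind of unreported choice that could produce small mismatches with the original summary statistics.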

The Shannon index we calculated produced very different summary statistics from those in the original study. Initially, we thought this was a calculation error, but we re-ran the analysis several ways (a hand-built method and a ChatGPT-generated workflow) and got the same results. To further verify, we compared the spatial distribution of the Shannon index to maps of other diversity measures in CT, and they were almost identical. This, along with some concerning deviations in our analysis, leads us to believe that the original authors calculated the metric incorrectly.
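
For reference, a common definition of the Shannon equitability index is H / ln(k), where H = -Σ p_i ln(p_i) and k is the number of groups. The Python sketch below follows that definition; it is not the report's R code, the function name `shannon_equitability` and the counts are hypothetical, and we assume the eight race/ethnicity groups from the variable table are the categories.

```python
import math

def shannon_equitability(counts):
    """Shannon entropy of the group shares divided by its maximum, ln(k).

    Returns a value between 0 (one group only) and 1 (all groups equal),
    or None when there is no population or fewer than two groups.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return None
    # Groups with zero population contribute nothing (p * ln p -> 0).
    h = sum(-(c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(k)

# Hypothetical populations for eight groups in three county subdivisions.
even = shannon_equitability([100] * 8)
single = shannon_equitability([800, 0, 0, 0, 0, 0, 0, 0])
mixed = shannon_equitability([700, 50, 20, 10, 5, 5, 5, 5])
```

Dividing by ln(k) is what keeps the index between 0 and 1; omitting that step, or using a different log base in only one place, would change the summary statistics considerably.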

Basic Regression

Property Crime

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 529.27 | 64.16 | 8.2 | <0.0001 |
| pop_density | 0.77 | 0.13 | 5.9 | <0.0001 |
| multi_unit_rate | 15.31 | 3.49 | 4.4 | <0.0001 |

Violent Crime

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -71.24 | 21.56 | -3.3 | 0.0012 |
| pop_density | 0.19 | 0.02 | 11.3 | <0.0001 |
| education | 1.26 | 0.42 | 3.0 | 0.0029 |
| poverty_rate | 8.25 | 1.51 | 5.5 | <0.0001 |
| shannon_eq | -73.11 | 53.25 | -1.4 | 0.1716 |

Table 3

The ordinary least squares regression with Total Property Crime (in fact the crime rate per 100,000) as the response variable gave results surprisingly similar to the original OLS model in the study. The beta estimates differed slightly for both predictors, which makes sense given that all three variables had minor discrepancies. We only used the predictor variables selected by the original authors. To expand on this section (and explore the tree of forking paths), it may make sense to run a variable selection process on our data to see whether we would have chosen different predictors.

The OLS coefficients for Total Violent Crime were all similar to the original study except for the Shannon equitability index (diversity), which was not statistically significant (p = 0.17).
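
To make the mechanics behind the estimate column concrete, here is a small Python sketch that solves the least-squares normal equations directly. It is an illustration only (the report's models were fit in R), the function name `ols` is hypothetical, and the data are synthetic values constructed so that the true coefficients are known.

```python
def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Synthetic example: (pop_density, multi_unit_rate) pairs and a response built
# from known coefficients, so the fit should recover 50, 0.8 and 15.
X = [(100, 10), (200, 5), (300, 20), (400, 15), (500, 30)]
y = [50 + 0.8 * density + 15 * multi for density, multi in X]
coefficients = ols(X, y)
```

Because the estimates depend on every predictor value, small discrepancies in the derived census variables carry through to slightly different betas, as seen in Table 3.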

Moran’s I

In this section we only visualized the Local Moran’s I values; due to time constraints, we did not calculate a Global Moran’s I.

Planned Deviation: The original author did not report which spatial weights matrix was used to calculate Local Moran’s I scores, so we went with what seems to be the default in ArcGIS: a fixed distance band based on the maximum of the nearest-neighbor distances.
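
The fixed-distance rule can be sketched as follows: take each unit's distance to its nearest neighbor, use the largest of those as the threshold, and treat every unit within that distance as a neighbor. This is a Python illustration with hypothetical centroid coordinates and function names, not the report's R code, and it reflects our understanding of the ArcGIS default rather than a documented specification.

```python
import math

def max_nearest_neighbor_distance(points):
    """Largest of the nearest-neighbour distances across all points."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return max(nearest)

def fixed_distance_neighbors(points, threshold):
    """Indices of all other points within `threshold` of each point."""
    return [
        [j for j, q in enumerate(points) if j != i and math.dist(p, q) <= threshold]
        for i, p in enumerate(points)
    ]

# Hypothetical projected centroids (same units on both axes).
centroids = [(0, 0), (3, 0), (3, 4), (10, 4)]
threshold = max_nearest_neighbor_distance(centroids)
neighbors = fixed_distance_neighbors(centroids, threshold)
```

Choosing the threshold this way guarantees that every unit has at least one neighbor, at the cost of giving units in dense areas many more neighbors than isolated ones.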

Unplanned Deviation: It was difficult to determine the exact classification scheme used in ArcGIS for the cluster analysis, as the GUI offers multiple options and limited transparency. After researching the default settings and discussing with ChatGPT, we concluded that areas with statistically significant Local Moran’s I results were classified based on whether their own crime rate and the spatial lag (the average crime rate of neighboring areas) were above or below the global mean. This combination allowed us to assign clusters such as High-High, Low-Low, High-Low, and Low-High.
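
The classification rule described above can be written down compactly. The Python sketch below is an illustration, not the report's R code: the rates, neighbor lists, and significance flags are hypothetical, and the significance test itself (for example, a permutation test on Local Moran's I) is taken as given rather than computed.

```python
def spatial_lag(values, neighbors):
    """Mean value among each area's neighbours (row-standardised weights)."""
    return [
        sum(values[j] for j in nb) / len(nb) if nb else None
        for nb in neighbors
    ]

def classify_clusters(values, neighbors, significant):
    """Label each area by comparing its value and its lag to the global mean."""
    mean = sum(values) / len(values)
    labels = []
    for value, lag, sig in zip(values, spatial_lag(values, neighbors), significant):
        if not sig or lag is None:
            labels.append("Not significant")
            continue
        own = "High" if value > mean else "Low"
        nearby = "High" if lag > mean else "Low"
        labels.append(f"{own}-{nearby}")
    return labels

# Hypothetical crime rates, neighbour lists and significance flags.
rates = [900, 800, 100, 120, 850]
neighbors = [[1], [0, 4], [3], [2, 4], [1, 3]]
significant = [True, True, True, False, True]
labels = classify_clusters(rates, neighbors, significant)
```

With these inputs the fifth area is labeled High-Low: its own rate is above the global mean, while the average of its neighbors is below it.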

Figure 2

GWR

## Adaptive q: 0.38 CV score: 53374871 
## Adaptive q: 0.62 CV score: 46640010 
## Adaptive q: 0.76 CV score: 47949831 
## Adaptive q: 0.65 CV score: 46898852 
## Adaptive q: 0.53 CV score: 46310686 
## Adaptive q: 0.54 CV score: 46175445 
## Adaptive q: 0.57 CV score: 46236944 
## Adaptive q: 0.55 CV score: 46178513 
## Adaptive q: 0.54 CV score: 46118022 
## Adaptive q: 0.54 CV score: 46134633 
## Adaptive q: 0.54 CV score: 46108956 
## Adaptive q: 0.54 CV score: 46091356 
## Adaptive q: 0.54 CV score: 46120000 
## Adaptive q: 0.54 CV score: 46098094 
## Adaptive q: 0.54 CV score: 46101446 
## Adaptive q: 0.54 CV score: 46093134 
## Adaptive q: 0.54 CV score: 46094716 
## Adaptive q: 0.54 CV score: 46091893 
## Adaptive q: 0.54 CV score: 46092195 
## Adaptive q: 0.54 CV score: 46091356
## Adaptive q: 0.38 CV score: 998007 
## Adaptive q: 0.62 CV score: 1036577 
## Adaptive q: 0.24 CV score: 1126640 
## Adaptive q: 0.47 CV score: 1033412 
## Adaptive q: 0.33 CV score: 1020719 
## Adaptive q: 0.39 CV score: 998600 
## Adaptive q: 0.38 CV score: 997949 
## Adaptive q: 0.36 CV score: 1004380 
## Adaptive q: 0.37 CV score: 1001825 
## Adaptive q: 0.38 CV score: 998121 
## Adaptive q: 0.38 CV score: 997871 
## Adaptive q: 0.38 CV score: 997839 
## Adaptive q: 0.38 CV score: 997840 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838

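As in the original study, we chose an adaptive bandwidth through an automated search. The original author minimized AIC, whereas the search log above reports cross-validation (CV) scores, so the selection criterion may not be identical. Contrary to what we expected, we did not have to specify a spatial weights matrix for the GWR.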
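
The log above reports a CV score for each trial value of the adaptive proportion q. The Python sketch below shows one way such a score can be computed; it is not the R routine that produced the log. It makes several simplifying assumptions, all ours: a single predictor, a Gaussian kernel whose bandwidth at each location is the distance to the k-th nearest observation (k = q × n), leave-one-out prediction, and a plain grid of candidate q values in place of an iterative optimiser. All data and function names are hypothetical.

```python
import math

def adaptive_weights(coords, focal, q):
    """Gaussian weights with a bandwidth equal to the distance from the focal
    point to its k-th nearest other observation, k = round(q * n). The focal
    point itself gets weight 0 so it is left out of its own fit."""
    n = len(coords)
    dists = [math.dist(coords[focal], c) for c in coords]
    k = max(1, min(n - 1, round(q * n)))
    bandwidth = sorted(d for i, d in enumerate(dists) if i != focal)[k - 1]
    return [
        0.0 if i == focal else math.exp(-0.5 * (d / bandwidth) ** 2)
        for i, d in enumerate(dists)
    ]

def local_fit(x, y, w):
    """Weighted least squares for one predictor: returns (intercept, slope)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ym - slope * xm, slope

def cv_score(coords, x, y, q):
    """Sum of squared leave-one-out prediction errors for proportion q."""
    total = 0.0
    for i in range(len(coords)):
        intercept, slope = local_fit(x, y, adaptive_weights(coords, i, q))
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total

# Synthetic 5 x 5 grid in which the slope of y on x grows from west to east.
coords = [(i, j) for i in range(5) for j in range(5)]
x = [(3 * i + 7 * j) % 11 + 1 for i, j in coords]
y = [(1 + 0.2 * i) * xv for (i, _), xv in zip(coords, x)]

candidates = [0.2, 0.4, 0.6, 0.8]
scores = {q: cv_score(coords, x, y, q) for q in candidates}
best_q = min(scores, key=scores.get)
```

A smaller q lets the coefficients vary more across space, and a larger q pulls them toward a single global fit; the CV score is one way of choosing between the two, and an AIC-based criterion can select a different bandwidth from the same data.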

Summary

| Statistic | Min | Max | under_1.96 | between_1.96_2.58 | above_2.58 |
|---|---|---|---|---|---|
| pop_density.1 | 0.44 | 1.53 | 0.59 | 0.83 | 0.15 |
| multi_unit_rate.1 | 2.25 | 25.94 | 0.34 | 0.56 | 0.11 |
| pop_density.2 | 0.04 | 0.46 | 0.03 | 0.88 | 0.09 |
| education.1 | -3.22 | 2.85 | 0.41 | 0.50 | 0.09 |
| poverty_rate.1 | -2.29 | 24.21 | 0.59 | 0.33 | 0.08 |
| shannon_eq.1 | -250.19 | 143.80 | 0.91 | 0.01 | 0.08 |

Table 4

Coefficient ranges for each of the models (Property/Violent Crime) varied considerably between the original study and our replication. This difference can be explained by the discrepancies in the underlying data and by the tendency of GWR to overfit (explaining too much of the error).
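
The last three columns of Table 4 can be read as the share of local estimates falling in each t-value band. The Python sketch below shows one way to compute such shares; it is an illustration rather than the report's R code, the coefficients and standard errors are hypothetical, and the use of absolute t-values is our assumption.

```python
def t_value_shares(coefficients, std_errors):
    """Share of local estimates whose |t| falls in each significance band."""
    t_abs = [abs(b / se) for b, se in zip(coefficients, std_errors)]
    n = len(t_abs)
    return {
        "under_1.96": sum(t < 1.96 for t in t_abs) / n,
        "between_1.96_2.58": sum(1.96 <= t < 2.58 for t in t_abs) / n,
        "above_2.58": sum(t >= 2.58 for t in t_abs) / n,
    }

# Hypothetical local coefficients and standard errors for four units.
local_coefficients = [0.5, 1.0, -1.2, 2.0]
local_std_errors = [0.5, 0.5, 0.5, 0.5]
shares = t_value_shares(local_coefficients, local_std_errors)
```

Because every unit falls in exactly one band, the three shares for a variable should sum to 1, up to rounding.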

Property Crime Model Visualized

Figure 3

Figure 4

Violent Crime Model Visualized

Figure 5

Figure 6

Results

For each set of GWR maps, we visualized the coefficients in quartile bins (the same method as the authors). This is not a particularly helpful way of visualizing coefficients, especially with a diverging color scheme, since the sign of the coefficient has meaning and is misrepresented when the central color is not at 0. We created a function for binning so that anyone who wishes to work further on this can easily change the visual representation of the data, but we did not implement a more statistically accurate color scheme. Despite this, many of the maps maintained a spatial distribution of high and low coefficients (with respect to the mean) similar to that of the original study. Some of the patterns, though, such as the distribution of the coefficients for the Shannon Equitability Index, were wildly different. This is concerning, given that the original author made some problematic and unfounded claims (based on these results) about the role of diversity in fostering crime.
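
The difference between quartile bins and bins centered on zero can be seen in a few lines. The Python sketch below is hypothetical and is not the binning function from our workflow; the coefficient values and the function names `quartile_breaks` and `zero_centered_breaks` are invented for illustration.

```python
import statistics

def quartile_breaks(values):
    """Class breaks at the minimum, the three quartiles and the maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [min(values), q1, q2, q3, max(values)]

def zero_centered_breaks(values, classes_per_side=2):
    """Symmetric, equal-width breaks with the middle break fixed at 0."""
    limit = max(abs(v) for v in values)
    step = limit / classes_per_side
    return [step * i for i in range(-classes_per_side, classes_per_side + 1)]

# Hypothetical local coefficients, mostly positive with one negative value.
local_coefficients = [-1.0, 0.5, 1.0, 2.0, 4.0]
by_quartile = quartile_breaks(local_coefficients)
by_sign = zero_centered_breaks(local_coefficients)
```

With quartile breaks the middle of the color ramp sits at the median (1.0 here), so a positive coefficient of 0.5 is drawn on the "low" side; the zero-centered breaks keep the neutral color at 0 so that the hue reflects the sign.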

The spatial variation in t-scores was also completely different. This makes sense given that the ranges of coefficients were very different, so the county subdivisions with beta values near zero changed between the studies. The only predictor (for both models) with significant results across the board was population density.

GWR and Moran’s I testing do not have much predictive power, so many of the claims made by the author are unreasonable given the evidence. The only thing they can tell us is whether there is spatial variation in the relationship between the response and predictor variables, which our replication also shows.

Discussion

While replicating this study, we had to make a number of deviations, and we encountered several limitations and issues with the original study. One limitation in the conceptualization of this study lies in its reliance on predefined administrative boundaries, such as county subdivisions, which may not accurately reflect the social and environmental contexts influencing crime. This approach can lead to the Modifiable Areal Unit Problem (MAUP), where statistical results vary based on the spatial units used, potentially obscuring localized crime patterns and leading to ecological fallacies when inferring individual behaviors from aggregate data.

The study’s use of the Shannon Index, an ecological species diversity measure, as a metric for human diversity and its effect on crime rates is especially problematic and misleading. With the Shannon Index, diversity is treated as the level of evenness across eight racial groups, which does not represent true diversity, especially when analyzing a smaller geographic area such as Connecticut. The assumptions made on the back of the Shannon index numbers are also troubling. The study writes: “racial/ethnic heterogeneity weakens residents’ attachment to the areas where they live and reduces community organization and involvement”. The implication that community engagement is greater in less diverse communities is unfounded given both the original and replication study results, which showed coefficients ranging across space from positive to negative. This implies that some geographically determined exogenous effect has more impact on crime rates than diversity does. The claim is especially unfounded given that beta values for this variable were not significant in either the OLS model or the GWR (save for a few county subdivisions). Additionally, the study’s focus on certain socioeconomic variables may overlook other influential factors like cultural dynamics, community cohesion, or informal social controls that are harder to quantify but significantly impact crime rates.

Regarding the measurement of crime, the UCR crime data are reported by county subdivision, so differences in law enforcement practices, reporting standards, and resource allocation across these jurisdictions are built into the data. The result may be inconsistencies in crime data, complicating comparisons and trend analyses. Studies like this can influence policy decisions, potentially amplifying the effect of these inconsistencies. There is also always error and bias built into crime data, since law enforcement has historically over-policed and over-reported on marginalized groups, specifically Black Americans. In crime research this can create a sort of self-fulfilling prophecy where over-reporting reinforces police bias, and that bias fuels over-reporting.

ACS estimates carry margins of error, and for smaller, less populous geographic areas such as those in Connecticut these margins are often quite large. When these data are used as predictors in regression models or spatial analyses, the underlying uncertainty can lead to unstable or misleading results. This is part of a larger problem that we faced in our replication study: the original study’s lack of clarity regarding its choices of variables, models, and more. We encountered a significant number of deviations in the data, and although each may be minor, these differences compound to create large gaps between the models and limit our ability to replicate the original study’s results.
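
One simple way to gauge this uncertainty is the coefficient of variation (CV) of an estimate. ACS margins of error are published at the 90 percent confidence level, so the standard error is the margin divided by 1.645. The Python sketch below applies that conversion; the estimates and margins are hypothetical, the function name is ours, and we did not carry out this check in the replication.

```python
def coefficient_of_variation(estimate, moe, z=1.645):
    """CV in percent: the standard error (MOE / z) relative to the estimate."""
    if estimate == 0:
        return None
    return 100 * (moe / z) / estimate

# Hypothetical estimates with 90 percent margins of error.
small_town_cv = coefficient_of_variation(estimate=200, moe=164.5)
large_town_cv = coefficient_of_variation(estimate=5000, moe=329.0)
```

In this example the small-town estimate has a CV of about 50 percent against about 4 percent for the larger town, which illustrates why rates for sparsely populated units can be unstable inputs to a regression.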

The author importantly notes that discussions of crime should be more locally focused and context specific. Policy and policing of crime should reflect this context and the natural spatial variation of all forms of crime. The data and models used in the study show that there is spatial variation in the data, but the models have little to no predictive power, so conclusions regarding the effect of diversity on crime and social capital, for example, should not be given too much weight. Further work and analysis need to be done beyond the geographically weighted regression before conclusions can be drawn about these relationships.

Integrity Statement

The authors of this preregistration state that they completed this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and research.

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)

References

Bivand, Roger. 2025. Spdep: Spatial Dependence: Weighting Schemes, Statistics. https://github.com/r-spatial/spdep/.
Bivand, Roger S., Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.
Bivand, Roger, and David W. S. Wong. 2018. “Comparing Implementations of Global and Local Indicators of Spatial Association.” TEST 27 (3): 716–48. https://doi.org/10.1007/s11749-018-0599-x.
Bivand, Roger, and Danlin Yu. 2024. Spgwr: Geographically Weighted Regression. https://github.com/rsbivand/spgwr/.
Henry, Lionel, and Hadley Wickham. 2025. Rlang: Functions for Base Types and Core r and Tidyverse Features. https://rlang.r-lib.org.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
———. 2024. Sf: Simple Features for r. https://r-spatial.github.io/sf/.
Pebesma, Edzer, and Roger Bivand. 2023a. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Pebesma, Edzer, and Roger S. Bivand. 2023b. Spatial Data Science with Applications in R. Chapman & Hall. https://r-spatial.org/book/.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, Alex Hayes, and Simon Couch. 2024. Broom: Convert Statistical Objects into Tidy Tibbles. https://broom.tidymodels.org/.
Bivand, Roger. 2022. “R Packages for Analyzing Spatial Data: A Comparative Case Study with Areal Data.” Geographical Analysis 54 (3): 488–518. https://doi.org/10.1111/gean.12319.
Tennekes, Martijn. 2018. “tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.
———. 2025. Tmap: Thematic Maps. https://github.com/r-tmap/tmap.
Walker, Kyle. 2024. Tigris: Load Census TIGER/Line Shapefiles. https://github.com/walkerke/tigris.
Walker, Kyle, and Matt Herman. 2025. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. Scales: Scale Functions for Visualization. https://scales.r-lib.org.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with Kable and Pipe Syntax. http://haozhu233.github.io/kableExtra/.