Abstract

This study is a replication of:

Meng, Yunliang. 2021. Crime rates and contextual characteristics: A case study in Connecticut, USA. Human Geographies 15, (2) (11): 209-228, https://www.proquest.com/scholarly-journals/crime-rates-contextual-characteristics-case-study/docview/2638089143/se-2 (accessed April 6, 2025).

Study metadata

Key words: Connecticut, crime, inequality, contextual characteristics
Subject: Social and Behavioral Sciences: Geography: Human Geography
Date created: 04/06/2024
Date modified: 2025-05-20
Spatial Coverage: Connecticut, USA
Spatial Resolution: County Subdivisions
Spatial Reference System: EPSG: 2234
Temporal Coverage: 2013 - 2017
Temporal Resolution: 1 year

Original study spatio-temporal metadata

Spatial Coverage: Connecticut, USA
Spatial Resolution: County Subdivisions
Spatial Reference System: EPSG: 2234
Temporal Coverage: 2013 - 2017
Temporal Resolution: 1 year

Study design

This is a replication of a study on crime and contextual characteristics in Connecticut. The original study uses geographically weighted regression to test how crime rates at the county subdivision level vary based on several socio-demographic characteristics.

The original study is observational, using socio-demographic indicators from the Census Bureau’s American Community Survey 5-year estimates and crime data from the Uniform Crime Report disseminated by the Federal Bureau of Investigation.

We will attempt to use the same methods and data sources as the original authors to see if there is any variation in our results or missing methods in their research.

Materials and procedure

Computational environment

Data and variables

This study uses two data sources: demographic data from the American Community Survey and crime statistics from the Uniform Crime Report gathered by the FBI.

Census County Subdivisions

Title: CT Census Subdivision Socio-demographic Data
Abstract: CT Census County Subdivision Socio-demographic Data
Spatial Coverage: Connecticut
Spatial Resolution: County Subdivision
Spatial Representation Type: vector
Spatial Reference System: EPSG: 2234
Temporal Coverage: 2013-2017
Temporal Resolution: 1 year
Lineage: collected using the census API and tidycensus package in R
Distribution: Publicly available
Constraints: Public data
Data Quality: trustworthy

## Reading layer `county_subdivision' from data source 
##   `/Users/dermotmcmillan/Desktop/GitHub/RPr-CT-crime/data/raw/public/county_subdivision.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 173 features and 98 fields (with 4 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74 ymin: 41 xmax: -72 ymax: 42
## Geodetic CRS:  NAD83

Label	Alias	Definition	Type	Accuracy	Domain	Missing Data Value(s)	Missing Data Frequency
total_population	B01003_001	Total US population (Estimate)	…	…	…	…	…
age_20m	B01001_008	Population of Males aged 20	…	…	…	…	…
age_21m	B01001_009	Population of Males aged 21	…	…	…	…	…
age_22_24m	B01001_010	Population of Males aged 22-24	…	…	…	…	…
age_25_29m	B01001_011	Population of Males aged 25-29	…	…	…	…	…
age_30_34m	B01001_012	Population of Males aged 30-34	…	…	…	…	…
age_20f	B01001_032	Population of Females aged 20	…	…	…	…	…
age_21f	B01001_033	Population of Females aged 21	…	…	…	…	…
age_22_24f	B01001_034	Population of Females aged 22-24	…	…	…	…	…
age_25_29f	B01001_035	Population of Females aged 25-29	…	…	…	…	…
age_30_34f	B01001_036	Population of Females aged 30-34	…	…	…	…	…
education_total	B15003_001	Total population	…	…	…	…	…
education_assoc	B15003_021	Highest degree or the highest level of school completed = Associate's degree	…	…	…	…	…
education_ba	B15003_022	Highest degree or the highest level of school completed = Bachelor's degree	…	…	…	…	…
education_ma	B15003_023	Highest degree or the highest level of school completed = Master's degree	…	…	…	…	…
education_pro	B15003_024	Highest degree or the highest level of school completed = Professional school degree	…	…	…	…	…
education_phd	B15003_025	Highest degree or the highest level of school completed = Doctorate degree	…	…	…	…	…
median_income	B19013_001	Median Household Income	…	…	…	…	…
poverty_total_pop	B17001_001	Total Population	…	…	…	…	…
poverty_below	B17001_002	Income below the poverty level in last 12 months	…	…	…	…	…
unemployment_total	B23025_001	Total Population	…	…	…	…	…
unemployment_total_in_labor	B23025_002	Population in Labor Force	…	…	…	…	…
unemployment_unemployed	B23025_005	Unemployed population considered to be in labor force	…	…	…	…	…
housing_total	B25003_001	Occupied Housing Units	…	…	…	…	…
housing_renter	B25003_003	Renter occupied Housing Units	…	…	…	…	…
housing_units_total	B25024_001	Housing Units	…	…	…	…	…
housing_units_2	B25024_004	Housing Units w/ 2 units	…	…	…	…	…
housing_units_3_4	B25024_005	Housing Units w/ 3 or 4 units	…	…	…	…	…
housing_units_5_9	B25024_006	Housing Units w/ 5 to 9 units	…	…	…	…	…
housing_units_10_19	B25024_007	Housing Units w/ 10 to 19 units	…	…	…	…	…
housing_units_20_49	B25024_008	Housing Units w/ 20-49 units	…	…	…	…	…
housing_units_50	B25024_009	Housing Units w/ 50 or more units	…	…	…	…	…
moved_total	B07001_001	Population 1 year or more in the US	…	…	…	…	…
moved_within_12_months	B07001_017	Population that has moved homes in the past 12 months	…	…	…	…	…
households_total	B11003_001	Family Type by Presence and Age of Own Children Under 18 Years	…	…	…	…	…
lone_parent_families_m	B11003_010	Male householder, no wife present	…	…	…	…	…
lone_parent_families_f	B11003_016	Female householder, no husband present	…	…	…	…	…
hispanic	B03002_012	Hispanic	…	…	…	…	…
race_white	B03002_003	Not Hispanic or Latino, White alone	…	…	…	…	…
race_black	B03002_004	Not Hispanic or Latino, Black or African American alone	…	…	…	…	…
race_asian	B03002_006	Not Hispanic or Latino, Asian alone	…	…	…	…	…
race_native	B03002_005	Not Hispanic or Latino, American Indian and Alaska Native Alone	…	…	…	…	…
race_pacific	B03002_007	Not Hispanic or Latino, Native Hawaiian and Other Pacific Islander Alone	…	…	…	…	…
race_other	B03002_008	Not Hispanic or Latino, Some Other Race Alone	…	…	…	…	…
race_two_or_more	B03002_009	Not Hispanic or Latino, Two or more races	…	…	…	…	…

Connecticut Crime Rate/ Type

Title: CT
Abstract: CT Census town level Crime Data
Spatial Coverage: Connecticut
Spatial Resolution: town
Spatial Representation Type: non-spatial
Temporal Coverage: 2013-2017
Temporal Resolution: 1 year
Lineage: gathered on 04/06/2024 from http://data.ctdata.org/dataset/ucr-crime-index
Distribution: Publicly available
Constraints: Public data
Data Quality: good, reported from local law enforcement agencies

Bias and threats to validity

The threat most relevant to this study is the Modifiable Areal Unit Problem, since crime rates will show different social and spatial patterns at different scales. There are also potential sources of error related to endogeneity and spatial autocorrelation, both of which are moderately accounted for in the original study. Additionally, the results do not have predictive power because the GWR is too regionally specific and overfit. Instead, these results can be interpreted as exploratory, requiring more rigorous research to contextualize and verify any findings. Bias is also inherent to crime data, since crime is socially constructed and criminality is at least partially defined around race and class in America. Over-policing and over-reporting in low-income areas and in Black and brown neighborhoods introduce bias into the measurement of crime itself.

Data transformations / analysis

There are several methodological choices that the original authors did not specify and that we will have to infer by comparing results and summary statistics. Specifically, we need to choose a spatial weights matrix for the GWR. We will start with the default ArcGIS spatial weights matrix (since they used the ArcGIS tool for their analysis) and go from there. If we cannot determine which one they used, we will choose our own and compare results. There are also some transformation choices with the census data that we will have to infer by comparing our data to the summary statistics provided (e.g., which denominator to use for percentages).

Data transformations for Crime and Census data are provided in the following workflow:

Workflow

## `summarise()` has grouped output by 'Town'. You can override using the
## `.groups` argument.

Analysis

Summary Stats

Crime Data

Statistic	Min	Median	Max	IQR	SD
Total Violent Crime	0	53	951	59	140
Total Property Crime	134	783	3911	1204	815

Table 1

Unplanned Deviation: Because the minimum values differ from those reported in the original study, it is clear that the author treated some empty or 0 values as nulls. Since we have no way of discerning which values these were, we moved forward by treating all empty values as 0.
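
The effect of this choice on the summary statistics can be illustrated with a small Python sketch. The counts below are hypothetical, not values from the UCR data, and `summarize` is a helper introduced only for this illustration.

```python
from statistics import median

def summarize(values):
    """Return (min, median, max) for a list of numbers."""
    return min(values), median(values), max(values)

# Hypothetical town-level counts; None marks an empty cell in the source table.
raw_counts = [12, None, 53, 4, 140, None, 7]

# Option 1 (used in this replication): treat every empty cell as 0.
as_zero = [0 if v is None else v for v in raw_counts]

# Option 2: treat empty cells as nulls and drop them before summarizing.
as_null = [v for v in raw_counts if v is not None]

print("empty as 0:   ", summarize(as_zero))
print("empty as null:", summarize(as_null))
```

In this toy case the minimum and the median both shift depending on the choice, which is the kind of difference a mismatch in reported minimums would suggest.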

Visualize Crime

Census

Statistic	Min	Median	Max	IQR	SD
age	5.92	14.61	41.40	5.86	5.36
poverty_rate	0.27	5.13	30.49	4.54	5.12
education	14.74	51.04	77.92	20.11	13.80
median_income	33841.00	85296.00	219868.00	28534.00	28102.93
unemployment_rate	1.21	5.58	16.02	2.61	2.27
rent_rate	2.26	18.98	76.20	15.58	13.67
multi_unit_rate	0.00	17.62	94.23	21.61	18.49
res_mobility	0.96	6.96	23.18	4.73	3.45
pop_density	11.38	169.49	3260.72	358.20	495.35
shannon_eq	0.07	0.24	0.65	0.19	0.15

Table 2

Unplanned Deviation: Variables were calculated using each table's respective total population. We compared our values against the summary table (Table 2) and were able to match most of them. Population density, housing type (multi_unit_rate), and residential mobility calculations yielded slightly different results. For population density, this is likely because of minor differences in calculating area. For housing type and residential mobility, we were unable to explain the differences. The discrepancies may be because the original authors cleaned the data and did not report it.
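
As a rough illustration of the denominator choice, the following Python sketch computes rates from each table's own total. The field names follow the labels in the variable table above; the numeric values are hypothetical, and `percent` is an illustrative helper rather than part of the original workflow.

```python
def percent(numerator, denominator):
    """Share of the denominator as a percentage; None when it is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return 100.0 * numerator / denominator

# Hypothetical estimates for one county subdivision.
town = {
    "poverty_below": 400, "poverty_total_pop": 8000,   # table B17001
    "housing_renter": 600, "housing_total": 3200,      # table B25003
}

# Each rate uses the total from its own ACS table, not total_population.
poverty_rate = percent(town["poverty_below"], town["poverty_total_pop"])
rent_rate = percent(town["housing_renter"], town["housing_total"])

print(poverty_rate, rent_rate)
```

Swapping in a different denominator (for example, total population instead of the table total) changes every rate, so comparing against published summary statistics is one way to narrow down which denominator a study used.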

The Shannon index we calculated produced very different summary statistics from those in the original study. Initially, we thought this was a calculation error, but we re-ran the analysis several ways (a hand-built method and a ChatGPT-generated workflow) and got the same results. To further verify, we compared the spatial distribution of the Shannon index to maps of other diversity measures in CT, and they were almost identical. This, along with some concerning deviations in our analysis, leads us to believe that the original authors calculated the metric incorrectly.
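
For reference, one common definition of the Shannon equitability index divides the Shannon entropy H = -Σ p_i ln(p_i) by ln(S), where p_i is the share of group i and S is the number of groups. The Python sketch below assumes that definition, with S fixed at the number of groups supplied. It is not confirmed to be the formula used in the original study, and the function name and counts are illustrative.

```python
import math

def shannon_equitability(counts):
    """Shannon entropy of group shares divided by ln(number of groups)."""
    total = sum(counts)
    n_groups = len(counts)
    if total == 0 or n_groups < 2:
        return None
    entropy = 0.0
    for count in counts:
        if count > 0:  # empty groups contribute nothing to the entropy
            share = count / total
            entropy -= share * math.log(share)
    return entropy / math.log(n_groups)

# Hypothetical counts for eight racial/ethnic groups in one subdivision.
even = [100] * 8                        # perfectly even composition
single = [800, 0, 0, 0, 0, 0, 0, 0]     # one group only
mixed = [700, 50, 30, 10, 5, 3, 1, 1]   # dominated by one group

for label, counts in [("even", even), ("single", single), ("mixed", mixed)]:
    print(label, round(shannon_equitability(counts), 3))
```

Under this definition the index runs from 0 (one group only) to 1 (all groups equally represented). Whether S counts all eight groups or only the groups present in a subdivision changes the result, which is one way two implementations can disagree.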

Basic Regression

Property Crime

term	estimate	std.error	statistic	p.value
(Intercept)	529.27	64.16	8.2	<0.0001
pop_density	0.77	0.13	5.9	<0.0001
multi_unit_rate	15.31	3.49	4.4	<0.0001

Violent Crime

term	estimate	std.error	statistic	p.value
(Intercept)	-71.24	21.56	-3.3	0.0012
pop_density	0.19	0.02	11.3	<0.0001
education	1.26	0.42	3.0	0.0029
poverty_rate	8.25	1.51	5.5	<0.0001
shannon_eq	-73.11	53.25	-1.4	0.1716

Table 3

The ordinary least squares regression with Total Property Crime (really the crime rate per 100,000) as the response variable gave surprisingly similar results to the original OLS model in the study. The beta estimates were slightly different for both predictors, but this makes sense given that all three variables had minor discrepancies. We only used the predictor variables selected by the original authors. To expand on this section (and explore the tree of forking paths), it may make sense to run a variable selection process with our data to see whether we would have chosen different predictors.

The OLS coefficients for Total Violent Crime were all similar to the original study except for the Shannon equitability index (diversity), which was not statistically significant (p = 0.17).

Moran’s I

In this section we only visualized the Local Moran’s I values; for the sake of time, we did not calculate a Global Moran’s I.

Planned Deviation: We did not know which spatial weights matrix the original author used to calculate local Moran’s I scores, so we went with what seems to be the default in ArcGIS: a fixed distance band based on the maximum of the nearest-neighbor distances.
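
A fixed distance band of this kind can be sketched in Python as follows. The centroid coordinates are hypothetical, distances are planar, and the function names are illustrative; this is not the ArcGIS implementation.

```python
import math

def max_nearest_neighbor_distance(points):
    """Largest nearest-neighbor distance, so every point gets at least one neighbor."""
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    return max(nearest)

def distance_band_neighbors(points, threshold):
    """For each point, the indices of other points within the threshold distance."""
    return [
        [j for j, (xj, yj) in enumerate(points)
         if j != i and math.hypot(xi - xj, yi - yj) <= threshold]
        for i, (xi, yi) in enumerate(points)
    ]

# Hypothetical centroids in projected coordinates.
centroids = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (4.0, 1.0)]

threshold = max_nearest_neighbor_distance(centroids)
neighbors = distance_band_neighbors(centroids, threshold)

print(threshold)
print(neighbors)
```

Using the maximum nearest-neighbor distance as the band guarantees that the most isolated unit still has one neighbor, at the cost of giving densely packed units many neighbors.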

Unplanned Deviation: It was difficult to determine the exact classification scheme used in ArcGIS for the cluster analysis, as the GUI offers multiple options and limited transparency. After researching the default settings and consulting ChatGPT, we concluded that areas with statistically significant Local Moran’s I results were classified based on whether their own crime rate and the spatial lag (the average crime rate of neighboring areas) were above or below the global mean. This combination allowed us to assign clusters such as High-High, Low-Low, High-Low, and Low-High.
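
The classification rule described above can be sketched in Python. The rates, neighbor lists, and p-values are hypothetical inputs (the significance test itself is not shown), and the 0.05 cutoff and function names are assumptions made for illustration.

```python
def spatial_lag(values, neighbors):
    """Mean of neighboring values for each unit; None when a unit has no neighbors."""
    return [
        sum(values[j] for j in nbrs) / len(nbrs) if nbrs else None
        for nbrs in neighbors
    ]

def classify_cluster(value, lag, global_mean, p_value, alpha=0.05):
    """Label a unit by comparing its value and its spatial lag to the global mean."""
    if lag is None or p_value >= alpha:
        return "Not significant"
    own = "High" if value > global_mean else "Low"
    around = "High" if lag > global_mean else "Low"
    return f"{own}-{around}"

# Hypothetical crime rates, neighbor index lists, and Local Moran's I p-values.
rates = [900.0, 850.0, 120.0, 100.0, 800.0]
neighbors = [[1], [0], [3, 4], [2], [2]]
p_values = [0.01, 0.20, 0.03, 0.04, 0.02]

global_mean = sum(rates) / len(rates)
lags = spatial_lag(rates, neighbors)
labels = [
    classify_cluster(v, lag, global_mean, p)
    for v, lag, p in zip(rates, lags, p_values)
]
print(labels)
```

Comparing against the global mean rather than, say, the median or a standardized score is itself a choice, so a different reference value could move units between categories.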

Figure 2

GWR

## Adaptive q: 0.38 CV score: 53374871 
## Adaptive q: 0.62 CV score: 46640010 
## Adaptive q: 0.76 CV score: 47949831 
## Adaptive q: 0.65 CV score: 46898852 
## Adaptive q: 0.53 CV score: 46310686 
## Adaptive q: 0.54 CV score: 46175445 
## Adaptive q: 0.57 CV score: 46236944 
## Adaptive q: 0.55 CV score: 46178513 
## Adaptive q: 0.54 CV score: 46118022 
## Adaptive q: 0.54 CV score: 46134633 
## Adaptive q: 0.54 CV score: 46108956 
## Adaptive q: 0.54 CV score: 46091356 
## Adaptive q: 0.54 CV score: 46120000 
## Adaptive q: 0.54 CV score: 46098094 
## Adaptive q: 0.54 CV score: 46101446 
## Adaptive q: 0.54 CV score: 46093134 
## Adaptive q: 0.54 CV score: 46094716 
## Adaptive q: 0.54 CV score: 46091893 
## Adaptive q: 0.54 CV score: 46092195 
## Adaptive q: 0.54 CV score: 46091356

## Adaptive q: 0.38 CV score: 998007 
## Adaptive q: 0.62 CV score: 1036577 
## Adaptive q: 0.24 CV score: 1126640 
## Adaptive q: 0.47 CV score: 1033412 
## Adaptive q: 0.33 CV score: 1020719 
## Adaptive q: 0.39 CV score: 998600 
## Adaptive q: 0.38 CV score: 997949 
## Adaptive q: 0.36 CV score: 1004380 
## Adaptive q: 0.37 CV score: 1001825 
## Adaptive q: 0.38 CV score: 998121 
## Adaptive q: 0.38 CV score: 997871 
## Adaptive q: 0.38 CV score: 997839 
## Adaptive q: 0.38 CV score: 997840 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838 
## Adaptive q: 0.38 CV score: 997838

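We used the same method as the original author to choose an adaptive bandwidth, which the original study describes as AIC minimization; note that the search log above reports cross-validation (CV) scores. We did not have to specify a spatial weights matrix for the GWR as we had expected.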
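
The search log above steps through candidate values of the adaptive proportion q and reports a score for each. As a rough illustration of how such a one-dimensional search narrows in on a minimum, here is a golden-section search in Python. This is a sketch of the general technique, not the optimizer used by the GWR software, and `toy_cv_score` is a hypothetical stand-in for a real cross-validation score.

```python
import math

def golden_section_minimize(f, lower, upper, tol=1e-3):
    """Shrink [lower, upper] around a minimum of a unimodal function f."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lower, upper
    c = b - inv_phi * (b - a)  # interior point at about 0.382 of the interval
    d = a + inv_phi * (b - a)  # interior point at about 0.618 of the interval
    fc, fd = f(c), f(d)
    while (b - a) > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new upper interior point.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new lower interior point.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

evaluated = []  # records every q that gets scored, like the log lines above

def toy_cv_score(q):
    """Hypothetical score with its minimum at q = 0.54."""
    evaluated.append(q)
    return (q - 0.54) ** 2

best_q = golden_section_minimize(toy_cv_score, 0.0, 1.0)
print(round(best_q, 2), [round(q, 2) for q in evaluated[:2]])
```

On the interval from 0 to 1 the first two candidates fall near 0.38 and 0.62, the same opening values that appear in both logs, after which each step discards the part of the interval that cannot contain the minimum.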

Summary

Statistic	Min	Max	under_1.96	between_1.96_2.58	above_2.58
pop_density.1	0.44	1.53	0.59	0.83	0.15
multi_unit_rate.1	2.25	25.94	0.34	0.56	0.11
pop_density.2	0.04	0.46	0.03	0.88	0.09
education.1	-3.22	2.85	0.41	0.50	0.09
poverty_rate.1	-2.29	24.21	0.59	0.33	0.08
shannon_eq.1	-250.19	143.80	0.91	0.01	0.08

Table 4
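
The three rightmost columns of the summary table can be read as the share of county subdivisions whose local t-values fall in each range. A minimal Python sketch follows, assuming the bins are mutually exclusive and based on absolute t-values; both are assumptions, and the t-values are hypothetical.

```python
def t_value_shares(t_values):
    """Share of local t-values in each significance band, using |t|."""
    n = len(t_values)
    under = sum(1 for t in t_values if abs(t) < 1.96)
    between = sum(1 for t in t_values if 1.96 <= abs(t) < 2.58)
    above = sum(1 for t in t_values if abs(t) >= 2.58)
    return {
        "under_1.96": under / n,
        "between_1.96_2.58": between / n,
        "above_2.58": above / n,
    }

# Hypothetical local t-values for one coefficient across four subdivisions.
local_t = [0.5, -2.0, 3.1, 1.0]
shares = t_value_shares(local_t)
print(shares)
```

Under these assumptions the three shares in a row sum to 1, which offers a quick consistency check on rows such as pop_density.1, whose reported shares sum to more than 1.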

Coefficient ranges for each of the models (Property/Violent Crime) varied significantly between the original study and our reproduction. This difference can be explained by the discrepancies in the underlying data and the fact that GWR overfits (explains too much of the error).

Property Crime Model Visualized

Figure 3

Violent Crime Model Visualized

Figure 5

Results

For each set of GWR maps, we visualized the coefficients using quartile bins (the same method as the authors). This is not a particularly helpful way of visualizing coefficients, especially with a diverging color scheme, since the sign of the coefficient has meaning and is misrepresented when the central color does not fall at 0. We created a function for binning so that anyone who wishes to work further on this can easily change the visual representation of the data, but we did not implement a more statistically accurate color scheme. Despite this, many of the maps did maintain a similar spatial distribution of high and low coefficients (with respect to the mean) compared to the original study. Some of the patterns, though, such as the distribution of the coefficients for the Shannon Equitability Index, were wildly different. This is concerning, given that the original author made some problematic and unfounded claims (based on these results) about the role of diversity in fostering crime.
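
The difference between quartile bins and bins centered on zero can be sketched in Python. The coefficients are hypothetical, and both functions are illustrative helpers, not the binning function from the replication workflow.

```python
from statistics import quantiles

def quartile_breaks(values):
    """Class breaks at the minimum, the three quartiles, and the maximum."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return [min(values), q1, q2, q3, max(values)]

def zero_centered_breaks(values, classes_per_side=2):
    """Equal-width breaks symmetric around 0, so the middle break is exactly 0."""
    limit = max(abs(min(values)), abs(max(values)))
    step = limit / classes_per_side
    return [-limit + i * step for i in range(2 * classes_per_side + 1)]

# Hypothetical local coefficients that change sign across the study area.
coefficients = [-4.0, -1.0, 0.5, 1.0, 2.0]

print(quartile_breaks(coefficients))
print(zero_centered_breaks(coefficients))
```

With quartile breaks the middle of the color ramp lands on the median coefficient (0.5 in this toy case), so a diverging palette would color some positive coefficients as if they were negative. Symmetric breaks keep the neutral color at 0.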

Also, the spatial variation in t-scores was completely different. This makes sense given that the ranges of coefficients were very different, so the county subdivisions with beta values near zero changed between the studies. The only predictor (for both models) with significant results across the board was population density.

The truth is that GWR and Moran’s I testing do not have much predictive power, so many of the claims made by the author are unreasonable given the evidence. The only thing they can tell us is whether there is spatial variation in the relationship between the response and predictor variables, which our replication also shows.

Discussion

Whilst replicating this study, we came across a number of deviations that we had to make, as well as some limitations and issues with the original study. One limitation in the conceptualization of this study lies in its reliance on predefined administrative boundaries, such as county subdivisions, which may not accurately reflect the social and environmental contexts influencing crime. This approach can lead to the Modifiable Areal Unit Problem (MAUP), where statistical results vary based on the spatial units used, potentially obscuring localized crime patterns and leading to ecological fallacies when inferring individual behaviors from aggregate data.

The study’s use of the Shannon Index (an ecological species diversity measure) as a metric for human diversity, and its claims about how this diversity impacts crime rates, are especially problematic and misleading. When using the Shannon Index, diversity is measured as the level of evenness across eight racial groups, which does not represent true diversity, especially when analyzing a smaller geographic area such as Connecticut. The assumptions built on the Shannon index values are also troubling. The study writes: “racial/ethnic heterogeneity weakens residents’ attachment to the areas where they live and reduces community organization and involvement”. The implication that community engagement is greater in less diverse communities is unfounded given both the original and replication study results, which showed coefficients ranging across space from positive to negative. This implies that some geographically determined exogenous effect has more impact on crime rates than diversity does. The claim is especially unfounded given that beta values for this variable were not significant in either the OLS model or the GWR (save for a few county subdivisions). Additionally, the study’s focus on certain socioeconomic variables may overlook other influential factors like cultural dynamics, community cohesion, or informal social controls that are harder to quantify but significantly impact crime rates.

Regarding the measurement of crime, the UCR crime data are reported by county subdivision, so variations in law enforcement practices, reporting standards, and resource allocation across these jurisdictions may carry into the data. The result may be inconsistencies in crime data, complicating comparisons and trend analyses. Studies like this can influence policy decisions, potentially amplifying the effect of these inconsistencies. There is also always error and bias built into crime data, since law enforcement has historically over-policed and over-reported marginalized groups, specifically Black Americans. In crime research this can create a sort of self-fulfilling prophecy in which over-reporting reinforces police bias, and that bias fuels over-reporting.

ACS estimates carry margins of error, and for smaller, less populous geographic areas, such as those in Connecticut, these margins are often quite large. When these data are used as predictors in regression models or spatial analyses, the underlying uncertainty can lead to unstable or misleading results. This is part of a larger problem that we faced in our replication study: the original study’s lack of clarity regarding its choices of variables, models, and more. We encountered a significant number of deviations in the data, and although each may be minor, these differences compound to create large gaps between the models and limit our ability to replicate the original study’s models.
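
One way to gauge how unreliable an ACS estimate is uses the published margin of error (MOE), which the Census Bureau reports at the 90 percent confidence level. The Python sketch below converts an MOE to a standard error and a coefficient of variation. The estimate and MOE values are hypothetical, and the function names are illustrative.

```python
Z_90 = 1.645  # z-score for the 90 percent confidence level used for ACS MOEs

def standard_error(moe):
    """Standard error implied by a 90 percent margin of error."""
    return moe / Z_90

def coefficient_of_variation(estimate, moe):
    """Standard error as a percentage of the estimate; None for a zero estimate."""
    if estimate == 0:
        return None
    return 100.0 * standard_error(moe) / estimate

# Hypothetical estimate for a small subdivision: 400 people, MOE of +/- 164.5.
cv = coefficient_of_variation(400, 164.5)
print(round(cv, 1))
```

A coefficient of variation of this size means the standard error is a quarter of the estimate itself, which illustrates how much noise a small-area predictor can carry into a regression.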

The author importantly notes that discussions of crime should be more locally focused and context specific. Policy and policing of crime should reflect this context and the natural spatial variation of all forms of crime. The data and models used in the study show that there is spatial variation in the data, but the models have little to no predictive power; conclusions regarding the effect of diversity on crime and social capital, for example, should therefore not be given too much weight. Further work and analysis need to be done beyond the geographically weighted regression before drawing conclusions about these relationships.

Integrity Statement

The authors of this preregistration state that they completed this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and research.

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)

References

Bivand, Roger. 2025. Spdep: Spatial Dependence: Weighting Schemes, Statistics. https://github.com/r-spatial/spdep/.

Bivand, Roger S., Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.

Bivand, Roger, and David W. S. Wong. 2018. “Comparing Implementations of Global and Local Indicators of Spatial Association.” TEST 27 (3): 716–48. https://doi.org/10.1007/s11749-018-0599-x.

Bivand, Roger, and Danlin Yu. 2024. Spgwr: Geographically Weighted Regression. https://github.com/rsbivand/spgwr/.

Henry, Lionel, and Hadley Wickham. 2025. Rlang: Functions for Base Types and Core r and Tidyverse Features. https://rlang.r-lib.org.

Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

———. 2024. Sf: Simple Features for r. https://r-spatial.github.io/sf/.

Pebesma, Edzer, and Roger Bivand. 2023a. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.

Pebesma, Edzer, and Roger S. Bivand. 2023b. Spatial Data Science with Applications in R. Chapman & Hall. https://r-spatial.org/book/.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robinson, David, Alex Hayes, and Simon Couch. 2024. Broom: Convert Statistical Objects into Tidy Tibbles. https://broom.tidymodels.org/.

Bivand, Roger. 2022. “R Packages for Analyzing Spatial Data: A Comparative Case Study with Areal Data.” Geographical Analysis 54 (3): 488–518. https://doi.org/10.1111/gean.12319.

Tennekes, Martijn. 2018. “tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.

———. 2025. Tmap: Thematic Maps. https://github.com/r-tmap/tmap.

Walker, Kyle. 2024. Tigris: Load Census TIGER/Line Shapefiles. https://github.com/walkerke/tigris.

Walker, Kyle, and Matt Herman. 2025. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.

Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. Scales: Scale Functions for Visualization. https://scales.r-lib.org.

Zhu, Hao. 2024. kableExtra: Construct Complex Table with Kable and Pipe Syntax. http://haozhu233.github.io/kableExtra/.

CT Crime Reproduction Study

Gus Howard & Dermot McMillan

2025-05-20