Chapter 12 Hurricane Dorian

Install packages

install.packages("tidyverse")
install.packages("here")
install.packages("tidytext")
install.packages("tm")
install.packages("igraph")
install.packages("ggraph")
install.packages("qdapRegex")

Load packages
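A minimal sketch of attaching the packages installed above, which the rest of this chapter assumes are loaded (recent tidyverse attaches lubridate for with_tz; otherwise add library(lubridate)):

# attach packages used throughout this chapter
library(tidyverse)   # dplyr, ggplot2, lubridate, etc.
library(here)        # project-relative file paths
library(tidytext)    # unnest_tokens and the stop_words data
library(tm)          # removeWords and stop word lists
library(igraph)      # graph_from_data_frame
library(ggraph)      # network graphs with ggplot syntax
library(qdapRegex)   # rm_twitter_url, rm_tag, rm_hash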

12.0.1 Load and organize data

Check for and create private and public data folders

if(!dir.exists(here("data_private"))) {dir.create(here("data_private"))}
if(!dir.exists(here("data_public"))) {dir.create(here("data_public"))}

Set up data

Create a file path for Dorian data

# set file path
dorianFile <- here("data_private", "dorian_raw.RDS")
# show the file path
dorianFile
## [1] "C:/GitHub/opengisci/wt25_josephholler/data_private/dorian_raw.RDS"

Check if Dorian data exists locally. If not, download the data.

if(!file.exists(dorianFile)){
  download.file("https://geography.middlebury.edu/jholler/wt25/dorian/dorian_raw.RDS", 
                dorianFile, 
                mode="wb")
  }

Load Dorian data from RDS file

dorian_raw <- readRDS(dorianFile)

View Dorian data column names

dorian_columns <- colnames(dorian_raw)

Save Dorian data column names to a CSV file for documentation and reference

write.csv(dorian_columns, here("data_public", "dorian_columns.csv"))

12.1 Temporal analysis

What data format is our created_at data using? Try the str function to reveal the structure of a data object. The $ operator picks out a column of a data frame by name.

str(dorian_raw$created_at)
##  POSIXct[1:209387], format: "2019-09-10 19:22:34" "2019-09-10 19:22:34" "2019-09-10 17:14:39" ...

Twitter data’s created_at column is in the POSIXct date-time format. Under the hood, all date-time data is stored as the number of seconds since January 1, 1970. We wrap functions around that data to transform it into human-readable formats. This data does not record a time zone and uses UTC for its times. Let’s convert that into local (EST) time for the majority of the region of interest using the with_tz function, part of the lubridate package that comes with tidyverse.
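Before converting, a quick sketch to confirm the seconds-since-1970 representation for yourself:

# POSIXct is stored as seconds since 1970-01-01 00:00:00 UTC
as.numeric(as.POSIXct("1970-01-01 00:00:10", tz = "UTC"))  # returns 10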

dorian_est <- dorian_raw |> 
  mutate(created_est = with_tz(created_at, tzone = "EST"))

Try graphing the tweets over time using geom_histogram. Once you get one graph working, keep copying the code into the subsequent block to make improvements.

dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is the number of bins, and therefore the bin width, defined?
How is the x-axis formatted?

12.1.1 Customize bin widths

To control bin widths, geom_histogram accepts a binwidth parameter; for date-time data, the units are seconds. Try setting the bin width to one hour, and create the graph.

Try setting the bin width to 6-hour intervals, or even days.

dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est), binwidth = 3600)
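For reference, a sketch of the coarser binning suggested above (6 hours = 21600 seconds; one day = 86400 seconds):

# 6-hour bins
dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est), binwidth = 21600)

# daily bins
dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est), binwidth = 86400)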

Reset the bin width to one hour (3600 seconds) before moving on.

12.1.2 Customize a date-time axis

To control a date-time x-axis, try adding a scale_x_datetime function.

Add a date_breaks parameter with the value "day" to label and display grid lines for each day.

Add a date_labels parameter to format labels according to codes defined in strptime. Look up documentation with ?strptime. Try "%b-%d" for a label with Month abbreviation and day.

Add a name parameter to customize the x-axis label.

dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est), binwidth = 3600) +
  scale_x_datetime(date_breaks = "day",
                   date_labels = "%b-%d",
                   name = "time")

Now, try limiting the x-axis to fewer days, e.g. Sept 5 through Sept 8.

scale_x_datetime accepts a limits parameter as a vector of two date-times. You can create an eastern time zone date with as.POSIXct("2019-09-05", tz="EST").

dorian_est |> 
  ggplot() +
  geom_histogram(aes(x = created_est), binwidth = 3600) +
  scale_x_datetime(date_breaks = "day",
                   date_labels = "%b-%d",
                   name = "time",
                   limits = c(as.POSIXct("2019-09-06", tz="EST"),
                              as.POSIXct("2019-09-09", tz="EST")
                              )
                   )
## Warning: Removed 118438 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale
## range (`geom_bar()`).

Finally, let’s remove the x-axis limits and switch from geom_histogram to geom_freqpoly.

dorian_est |> 
  ggplot() +
  geom_freqpoly(aes(x = created_est), binwidth = 3600) +
  scale_x_datetime(date_breaks = "day",
                   date_labels = "%b-%d",
                   name = "time")

Note that temporal data can be binned into factors using the cut function. Look up cut.POSIXt.
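For example, a minimal sketch of binning tweet times into daily factor levels (the day column name is just illustrative):

# cut.POSIXt turns date-times into a factor with one level per time unit
dorian_est |> 
  mutate(day = cut(created_est, breaks = "day")) |> 
  count(day)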

12.2 Text analysis

Refer to: https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/

Tokenize the tweet content

dorian_words <- dorian_raw |> 
  select(status_id, text) |> 
  mutate(text = rm_hash(rm_tag(rm_twitter_url(tolower(text))))) |> 
  unnest_tokens(input = text, output = word, format = "html")

Remove stop words

dorian_words_cleaned <- dorian_words |> 
  filter(!(word %in% stop_words$word))

Remove search terms and junk words

junk <- c("hurricane",
          "sharpiegate",
          "dorian",
          "de",
          "https",
          "hurricanedorian",
          "por",
          "el",
          "en",
          "las",
          "del")

dorian_words_cleaned <- dorian_words_cleaned |> 
  filter(!(word %in% junk)) |> 
  filter(!(word %in% stop_words$word))

12.2.1 Graph word frequencies

dorian_words_cleaned %>%
  count(word, sort = TRUE) %>%
  head(15) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(x = n, y = word)) +
  labs(
    x = "Count",
    y = "Unique words",
    title = "Count of unique words found in tweets"
  )

12.2.2 Graph word associations

Create an n-gram data structure. N-grams are pairs (or longer sequences) of co-located words; the wrapper function unnest_ngrams creates them, while skip_ngrams forms pairs of words separated by up to a defined distance.

It would be keen to remove @ handles here as well. Also consider removing Spanish stop words: https://kshaffer.github.io/2017/02/mining-twitter-data-tidy-text-tags/

dorian_text_clean <- dorian_raw |> 
  select(text) |> 
  mutate(text_clean = rm_hash(rm_tag(rm_twitter_url(tolower(text)))),
         text_clean = removeWords(text_clean, stop_words$word),
         text_clean = removeWords(text_clean, junk)) 
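Following the suggestion above, a sketch of also stripping Spanish stop words, assuming the stopwords("spanish") list that ships with tm (the dorian_text_clean_es name is illustrative and is not used below):

# additionally remove Spanish stop words from the cleaned text
dorian_text_clean_es <- dorian_text_clean |> 
  mutate(text_clean = removeWords(text_clean, tm::stopwords("spanish")))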

Separate and count ngrams

dorian_ngrams <- dorian_text_clean |>  
  unnest_tokens(output = paired_words,
                input = text_clean, 
                token = "ngrams") |> 
  separate(paired_words, c("word1", "word2"), sep = " ") |> 
  count(word1, word2, sort = TRUE)
## Warning: Expected 2 pieces. Additional pieces discarded in 1398438 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
## ...].
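As an aside, the skip_ngrams tokenizer mentioned earlier works the same way; a sketch forming word pairs that allow up to one intervening word, assuming unnest_tokens passes the n, n_min, and k arguments through to the tokenizer:

# skip-grams: word pairs separated by up to k intervening words
dorian_skipgrams <- dorian_text_clean |> 
  unnest_tokens(output = paired_words,
                input = text_clean,
                token = "skip_ngrams",
                n = 2,
                n_min = 2,
                k = 1)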

Graph ngram word associations

# graph a word network with edge darkness and width indicating association frequency.
# you may adjust the filter to include pairs with more or fewer than 500 instances
dorian_ngrams %>%
  filter(n >= 500 & !is.na(word1) & !is.na(word2)) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(
    title = "Hurricane Dorian Word Associations",
    x = "", y = ""
  ) +
  theme(
    plot.background = element_rect(
    fill = "grey95",
    colour = "black",
    size = 1
    ),
    legend.background = element_rect(fill = "grey95")
  )
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2
## 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this
## warning was generated.
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of
## ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this
## warning was generated.

12.3 Geographic analysis

Install and load packages for spatial analysis at the beginning of your markdown document, including:

  • sf for simple (geographic vector) features
  • tmap for thematic mapping
  • tidycensus for acquiring census data

install.packages("sf")
install.packages("tidycensus")

Load packages
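A sketch of attaching the spatial packages listed above (tmap is installed and loaded later, in section 12.4):

library(sf)          # simple features for vector geographic data
library(tidycensus)  # census tables and geometries; may require a key, see ?census_api_key

If you are starting a fresh session, reload the Dorian data from the RDS file: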

dorianFile <- here("data_private", "dorian_raw.RDS")
dorian_raw <- readRDS(dorianFile)

12.3.1 Construct geographic points from tweets

Before conducting geographic analysis, we need to make geographic points for those Tweets with enough location data to do so. Our end goal will be to map Twitter activity by county, so any source of location data precise enough to pinpoint a user’s county will suffice. The three primary sources of geographic information in tweets are:

  • Geographic coordinates from location services
  • Place names that users tag to a specific tweet
  • Location descriptions in users’ profiles

12.3.2 Location services

The most precise geographic data is the GPS coordinates from a device with location services, stored in the geo_coords column, which contains a vector of two decimal numbers.

In data science, unnest refers to operations in which complex content from a single column is parsed out or divided into multiple columns and/or rows. We can use unnest_wider to spread the list content of one column into multiple columns. The challenge is figuring out what the new unique column names will be so that the data frame still functions. In this case, the list items inside geo_coords do not have names. We must therefore use the names_sep argument to add a separator and suffix number to the resulting columns, producing the names geo_coords_1 and geo_coords_2.

dorian_geo <- dorian_raw |> 
  unnest_wider(geo_coords, names_sep = "_")

We know that most of the geo_coords data is NA because Twitter users had location services disabled in their privacy settings.

How many Tweets had location services enabled, and thus not NA values for geographic coordinates? Answer with the next code block.

dorian_geo |> filter(!is.na(geo_coords_1)) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1274

We know that geographic coordinates in North America have negative x longitude values (west) and positive y latitude values (north) when the coordinates are expressed in decimal degrees. Discern which geo_coords column is which by sampling 5 of them in the code block below. Hint: select can select a subset of columns using many helper functions, including starts_with to pick out columns that begin with the same prefix.

dorian_geo |> 
  filter(!is.na(geo_coords_1)) |> 
  select(starts_with("geo")) |> 
  sample_n(5)
## # A tibble: 5 × 2
##   geo_coords_1 geo_coords_2
##          <dbl>        <dbl>
## 1         35.6        -78.7
## 2         39.3        -74.6
## 3         32.8        -79.9
## 4         26.3        -80.1
## 5         34.2        -77.9

Replace ??? below with the correct column suffix.

The x coordinate (longitude) is stored in column geo_coords_???
The y coordinate (latitude) is stored in column geo_coords_???

12.3.3 Place name

Many users also record a place name for their Tweet, but how precise are those place names?

Let’s summarize place types

count(dorian_raw, place_type)
## # A tibble: 6 × 2
##   place_type        n
##   <chr>         <int>
## 1 admin          2893
## 2 city          10462
## 3 country         346
## 4 neighborhood     43
## 5 poi             441
## 6 <NA>         195202

And graph a bar chart of place types

dorian_raw |> 
  filter(!is.na(place_type)) |> 
  count(place_type) |> 
  ggplot() +
  geom_col(aes(x = n, y = place_type))

Twitter then translated the place name into a bounding box with four coordinate pairs, stored as a list of eight decimal numbers in bbox_coords. We will need to find the center of the bounding box by averaging the x coordinates and averaging the y coordinates.

In the code block below, convert the single bbox_coords column into one column for each coordinate.

dorian_geo_place <- dorian_geo |> 
  unnest_wider(bbox_coords,
               names_sep = "_")

Sample 5 of the resulting bbox_coords columns to determine which columns may be averaged to find the center of the bounding box.

dorian_geo_place |> 
  filter(!is.na(bbox_coords_1)) |> 
  select(starts_with("bbox")) |> 
  sample_n(5)
## # A tibble: 5 × 8
##   bbox_coords_1 bbox_coords_2 bbox_coords_3 bbox_coords_4 bbox_coords_5
##           <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
## 1         -82.5         -82.4         -82.4         -82.5          27.1
## 2         -81.3         -81.2         -81.2         -81.3          29.0
## 3         -83.7         -83.5         -83.5         -83.7          41.6
## 4         -81.3         -81.1         -81.1         -81.3          34.0
## 5         -95.7         -95.6         -95.6         -95.7          29.5
## # ℹ 3 more variables: bbox_coords_6 <dbl>, bbox_coords_7 <dbl>,
## #   bbox_coords_8 <dbl>

The minimum x coordinate (longitude) is stored in column bbox_coords_???
The maximum x coordinate (longitude) is stored in column bbox_coords_???
The minimum y coordinate (latitude) is stored in column bbox_coords_???
The maximum y coordinate (latitude) is stored in column bbox_coords_???

For only place names at least as precise as counties (poi, city, or neighborhood), calculate the center coordinates in columns named bbox_meanx and bbox_meany. It will be easier to find averages using simple math than to use rowwise and c_across for row-wise aggregate functions.

good_places <- c("poi", "city", "neighborhood")
dorian_geo_place <- dorian_geo_place |> 
  mutate(bbox_meanx = case_when(place_type %in% good_places ~ 
                                  (bbox_coords_1 + bbox_coords_2) / 2),
         bbox_meany = case_when(place_type %in% good_places ~ 
                                  (bbox_coords_6 + bbox_coords_7) / 2)
         )
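For comparison only, a sketch of the rowwise() and c_across() alternative mentioned above; it produces the same means but runs slower, so the result is stored under an illustrative name and not used below:

# row-wise aggregate alternative (equivalent result, slower)
dorian_geo_place_rowwise <- dorian_geo_place |> 
  rowwise() |> 
  mutate(bbox_meanx = case_when(place_type %in% good_places ~ 
                                  mean(c_across(c(bbox_coords_1, bbox_coords_2)))),
         bbox_meany = case_when(place_type %in% good_places ~ 
                                  mean(c_across(c(bbox_coords_6, bbox_coords_7))))) |> 
  ungroup()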

Sample 5 rows of the place type column and columns starting with bbox for tweets with non-NA place types to see whether the code above worked as expected.

dorian_geo_place |> 
  filter(!is.na(place_type)) |> 
  select(place_type, starts_with("bbox")) |> 
  sample_n(5)
## # A tibble: 5 × 11
##   place_type bbox_coords_1 bbox_coords_2 bbox_coords_3 bbox_coords_4
##   <chr>              <dbl>         <dbl>         <dbl>         <dbl>
## 1 city               -80.6         -80.4         -80.4         -80.6
## 2 city               -74.3         -74.1         -74.1         -74.3
## 3 city               -79.1         -78.9         -78.9         -79.1
## 4 admin              -83.4         -78.5         -78.5         -83.4
## 5 city               -85.9         -85.8         -85.8         -85.9
## # ℹ 6 more variables: bbox_coords_5 <dbl>, bbox_coords_6 <dbl>,
## #   bbox_coords_7 <dbl>, bbox_coords_8 <dbl>, bbox_meanx <dbl>,
## #   bbox_meany <dbl>

To be absolutely sure, find the aggregate mean of bbox_meanx and bbox_meany by place_type

dorian_geo_place |> 
  group_by(place_type) |> 
  summarize(mean(bbox_meanx), mean(bbox_meany))
## # A tibble: 6 × 3
##   place_type   `mean(bbox_meanx)` `mean(bbox_meany)`
##   <chr>                     <dbl>              <dbl>
## 1 admin                      NA                 NA  
## 2 city                      -80.9               35.7
## 3 country                    NA                 NA  
## 4 neighborhood              -81.9               34.4
## 5 poi                       -78.8               34.7
## 6 <NA>                       NA                 NA

Finally, create xlng (longitude) and ylat (latitude) columns and populate them with coordinates, using geographic coordinates first and filling in with place coordinates second. Hint: if none of the cases are TRUE in a case_when function, then .default = can be used at the end to assign a default value.

dorian_geo_place <- dorian_geo_place |> 
  mutate(xlng = case_when(!is.na(geo_coords_2) ~ geo_coords_2,
                          .default = bbox_meanx),
         ylat = case_when(!is.na(geo_coords_1) ~ geo_coords_1,
                          .default = bbox_meany)
  )

Sample the xlng and ylat data alongside the geographic coordinates and bounding box coordinates to verify. Re-run the code block until you find at least some records with geographic coordinates.

dorian_geo_place |> 
  filter(!is.na(xlng)) |> 
  select(xlng, ylat, geo_coords_1, geo_coords_2, bbox_meanx, bbox_meany, place_type) |> 
  sample_n(5)
## # A tibble: 5 × 7
##    xlng  ylat geo_coords_1 geo_coords_2 bbox_meanx bbox_meany place_type
##   <dbl> <dbl>        <dbl>        <dbl>      <dbl>      <dbl> <chr>     
## 1 -79.3  43.6           NA           NA      -79.3       43.6 city      
## 2 -81.6  41.5           NA           NA      -81.6       41.5 city      
## 3 -77.0  38.9           NA           NA      -77.0       38.9 city      
## 4 -76.9  34.9           NA           NA      -76.9       34.9 city      
## 5 -71.5  41.8           NA           NA      -71.5       41.8 city

12.3.4 Geographic point data

The sf package allows us to construct geographic data according to open GIS standards, making tidyverse data frames into geodata frames analogous to geopackages and shapefiles.

Select the status ID, created date-time, x and y coordinate columns. Filter the data for only those Tweets with geographic coordinates. Then, use the st_as_sf function to create geographic point data.

A crs is a coordinate reference system; geographic coordinates are generally stored in the system numbered 4326 (WGS 1984 geographic coordinates).

tweet_points <- dorian_geo_place |> 
  select(status_id, created_at, xlng, ylat) |> 
  filter(!is.na(xlng)) |> 
  st_as_sf(coords = c("xlng", "ylat"), crs = 4326)

Graph (map) the tweets!

tweet_points |> 
  ggplot() +
  geom_sf(alpha=0.01)

12.3.5 Acquire Census data

For details on the variables, see https://api.census.gov/data/2020/dec/dp/groups/DP1.html

countyFile <- here("data_public", "counties.RDS")
censusVariablesFile <- here("data_public", "census_variables.csv")

if(!file.exists(countyFile)){
  counties <- get_decennial(
    geography = "county",
    table = "DP1",
    sumfile = "dp",
    output = "wide",
    geometry = TRUE,
    keep_geo_vars = TRUE,
    year = 2020
  )
  
  saveRDS(counties, countyFile)
  
  census_variables <- load_variables(2020, dataset = "dp")
  
  write_csv(census_variables, censusVariablesFile)
  
} else {
  counties <- readRDS(countyFile)
  census_variables <- read_csv(censusVariablesFile)
}
## Rows: 320 Columns: 3
## ── Column specification ─────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, label, concept
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

12.3.6 Select counties of interest

Select only the states you want, using FIPS state codes. Look up FIPS codes here: https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code

counties <- filter(
  counties,
  STATEFP %in% c(
    "54", "51", "50", "47", "45", "44", "42", "39", "37", "36", "05", "01",
    "34", "33", "29", "28", "25", "24", "23", "22", "21", "18", "17", "13",
    "12", "11", "10", "09"
  )
)

# saveRDS(counties, here("data_public", "counties_east.RDS"))

12.3.7 Map population density and tweet points

Calculate population density in terms of people per square kilometer of land. DP1_0001C is the total population; ALAND is the area of land in square meters.

counties <- mutate(counties,
                   popdensity = DP1_0001C / (ALAND / 1000000))

Map population density

We can use ggplot, but rather than the geom_ functions we have seen before for the data geometry, use geom_sf for our simple features geographic data.

cut_interval is an equal interval classifier, while cut_number is a quantile / equal count classifier function. Try switching between the two and regraphing the map below.

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_number(popdensity, 5)), color = "grey") +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Population Density")) 
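For comparison, a sketch of the same map with the equal-interval classifier swapped in:

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_interval(popdensity, 5)), color = "grey") +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Population Density"))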

Add the tweet points to the map as a second geom_sf data layer.

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_number(popdensity, 5)), color = "grey") +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Population Density")) +
  geom_sf(data = tweet_points,
    colour = "purple", alpha = 0.05, size = 1
  ) +
  labs(title = "Tweet Locations After Hurricane Dorian") 

12.3.8 Join tweets to counties

  • Spatial functions for vector geographies almost universally start with st_
  • Twitter data uses a global geographic coordinate system (WGS 1984, with code 4326)
  • Census data uses North American geographic coordinates (NAD 1983, with code 4269)
  • Before making geographic comparisons between datasets, everything should be stored in the same coordinate reference system (crs).
  • st_crs checks the coordinate reference system of geographic data
  • st_transform mathematically translates coordinates from one system to another.

Transform Tweet points into the North American coordinate system and then check their CRS.

tweet_points <- st_transform(tweet_points, 4269)
st_crs(tweet_points)
## Coordinate Reference System:
##   User input: EPSG:4269 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Geodesy."],
##         AREA["North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands."],
##         BBOX[14.92,167.65,86.45,-40.73]],
##     ID["EPSG",4269]]
  • Here, we want to map the number, density, or ratio of tweets by county.
  • Later on, we may want to filter tweets by their location, or which county and state they are in.
  • For both goals, the first step is to join information about counties to individual tweets.
  • You’ve already seen data joined by column attributes (GDP per capita and life expectancy by country and year) using left_join. Data can also be joined by geographic location with a spatial version, st_join. If you look up the default parameters of st_join, it will be a left join with the join criteria st_intersects.

But first, which columns from counties should we join to tweet_points, if we ultimately want to re-attach tweet data to counties and we might want to filter tweets by state, or by state and county? Hint for those unfamiliar with Census data: the Census uses GEOIDs to uniquely identify geographic features, including counties. County names are not unique: some states have counties of the same name.

Select the GEOID and useful county and state names from counties, and join this information to tweet points by geographic location. In the select below, = is used to rename columns as they are selected.

Display a sample of the results and save these useful results as an RDS file.

counties_select_columns <- counties |> 
  select(GEOID, county = NAME.x, state = STATE_NAME)

tweet_points <- tweet_points |> st_join(counties_select_columns)

tweet_points |> sample_n(5)
## Simple feature collection with 5 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -82.76477 ymin: 36.84176 xmax: -69.82584 ymax: 43.90888
## Geodetic CRS:  NAD83
## # A tibble: 5 × 6
##   status_id     created_at                      geometry GEOID county     state 
## * <chr>         <dttm>                       <POINT [°]> <chr> <chr>      <chr> 
## 1 117055457755… 2019-09-08 04:28:53 (-69.82584 43.90888) 23023 Sagadahoc  Maine 
## 2 117013477001… 2019-09-07 00:40:43 (-82.76477 39.97482) 39089 Licking    Ohio  
## 3 116967038334… 2019-09-05 17:55:25 (-76.35592 36.84176) 51740 Portsmouth Virgi…
## 4 116997307936… 2019-09-06 13:58:13 (-79.27257 43.62931) <NA>  <NA>       <NA>  
## 5 117002116673… 2019-09-06 17:09:18 (-79.27257 43.62931) <NA>  <NA>       <NA>

saveRDS(tweet_points, here("data_public", "tweet_locations.RDS"))

Pro Tips:

  • If you join data more than once (e.g. by re-running the same code block), R will create duplicate field names and start adding suffixes to them (e.g. .x, .y). If this happens to you, go back and recreate the data frame and then join only once.
  • Once a data frame is geographic, the geometry column is sticky: it stays with the data even if it is not explicitly selected. The only way to be rid of it is with a special st_drop_geometry() function. Fun fact: you cannot join two data frames by attribute (e.g. through left_join()) if they both have geometries.

Now, count the number of tweets per county and remove the geometry column as you do so.

tweet_counts <- tweet_points |> 
  st_drop_geometry() |> 
  count(GEOID)

tweet_counts
## # A tibble: 602 × 2
##    GEOID     n
##    <chr> <int>
##  1 01001     1
##  2 01003     5
##  3 01007     2
##  4 01015     6
##  5 01017     1
##  6 01021     1
##  7 01043     1
##  8 01045     1
##  9 01047     1
## 10 01051     3
## # ℹ 592 more rows

Join the count of tweets to counties and save counties with this data.

counties_tweet_counts <- counties |> 
  left_join(tweet_counts, by = "GEOID") 

counties_tweet_counts |> 
  select(-starts_with(c("DP1", "A")), -ends_with("FP")) |> 
  sample_n(5)
## Simple feature collection with 5 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -88.36278 ymin: 32.5842 xmax: -78.03319 ymax: 41.37651
## Geodetic CRS:  NAD83
## # A tibble: 5 × 11
##   COUNTYNS GEOID NAME.x    NAMELSAD         STUSPS STATE_NAME   LSAD  NAME.y    
## * <chr>    <chr> <chr>     <chr>            <chr>  <chr>        <chr> <chr>     
## 1 01480124 51069 Frederick Frederick County VA     Virginia     06    Frederick…
## 2 00161554 01057 Fayette   Fayette County   AL     Alabama      06    Fayette C…
## 3 01213674 42065 Jefferson Jefferson County PA     Pennsylvania 06    Jefferson…
## 4 00424241 17079 Jasper    Jasper County    IL     Illinois     06    Jasper Co…
## 5 00346824 13319 Wilkinson Wilkinson County GA     Georgia      06    Wilkinson…
## # ℹ 3 more variables: geometry <MULTIPOLYGON [°]>, popdensity <dbl>, n <int>

12.3.9 Calculate county tweet rates

Calculate the tweet_rate as tweets per person. Many counties remain NA because they had no tweets. Rather than missing data, this actually means there are zero tweets. Therefore, convert NA to 0 using replace_na() before calculating the tweet rate.

counties_tweet_counts <- counties_tweet_counts |> 
  mutate(
    n = replace_na(n, 0),
    tweet_rate = n / DP1_0001C * 10000
    )

counties_tweet_counts |> 
  select(-starts_with(c("DP1", "A")), -ends_with("FP")) |> 
  sample_n(5)
## Simple feature collection with 5 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -94.61409 ymin: 34.91246 xmax: -82.2613 ymax: 38.8525
## Geodetic CRS:  NAD83
## # A tibble: 5 × 12
##   COUNTYNS GEOID NAME.x    NAMELSAD         STUSPS STATE_NAME     LSAD  NAME.y  
## * <chr>    <chr> <chr>     <chr>            <chr>  <chr>          <chr> <chr>   
## 1 00424252 17101 Lawrence  Lawrence County  IL     Illinois       06    Lawrenc…
## 2 00069907 05147 Woodruff  Woodruff County  AR     Arkansas       06    Woodruf…
## 3 00516902 21111 Jefferson Jefferson County KY     Kentucky       06    Jeffers…
## 4 00758461 29013 Bates     Bates County     MO     Missouri       06    Bates C…
## 5 01008562 37089 Henderson Henderson County NC     North Carolina 06    Henders…
## # ℹ 4 more variables: geometry <MULTIPOLYGON [°]>, popdensity <dbl>, n <int>,
## #   tweet_rate <dbl>

Show a table of the county name and tweet rate for the 10 counties with the highest rates

counties_tweet_counts |> 
  arrange(desc(tweet_rate)) |> 
  select(NAME.y, tweet_rate) |> 
  st_drop_geometry() |> 
  head(10)
## # A tibble: 10 × 2
##    NAME.y                             tweet_rate
##    <chr>                                   <dbl>
##  1 Dare County, North Carolina             12.7 
##  2 Carteret County, North Carolina         11.2 
##  3 Nantucket County, Massachusetts         11.2 
##  4 Hyde County, North Carolina              8.72
##  5 Georgetown County, South Carolina        7.26
##  6 Charleston County, South Carolina        6.52
##  7 Horry County, South Carolina             5.95
##  8 Brunswick County, North Carolina         5.93
##  9 New Hanover County, North Carolina       5.58
## 10 Worcester County, Maryland               5.53

Should we multiply the tweet rate by a constant, so that it can be interpreted as, e.g. tweets per 100 people? What constant makes the best sense? Adjust and re-run the rate calculation code block above and note the rate you have chosen.

Save the counties data with Twitter counts and rates

saveRDS(counties_tweet_counts, here("data_public", "counties_tweet_counts.RDS"))

12.3.10 Map tweet rates by county

Make a choropleth map of the tweet rates!

ggplot() +
  geom_sf(data = counties_tweet_counts, 
    aes(fill = cut_interval(tweet_rate, 7)), color = "grey") +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Tweets per 10,000 People")) 

12.3.11 Conclusions

Many things here could still be improved upon! We could…

  • Use regular expressions to filter URLs, user handles, and other weird content out of tweet content.
  • Combine state names into single words for analysis, e.g. NewYork, NorthCarolina, using str_replace
  • Calculate sentiment of each tweet
  • Draw word clouds in place of bar charts of the most frequent words
  • Geocode profile location descriptions in order to map more content
  • Animate our map of tweets or our graphs of word frequencies by time
  • Switch from using ggplot for cartography to using tmap

For the second assignment, by afternoon class on Tuesday:

  • Complete and knit this Rmd
  • To the conclusions, add two or three ideas about improvements or elements of interactivity that we could add to this analysis.
  • Research and try to implement any one of the improvements suggested. This part will be evaluated more based on effort than on degree of success: try something out for 1-2 hours and document the resources you used and what you learned, whether it ultimately worked or not.

12.4 Map hurricane tracks with tmap

Install new spatial packages

install.packages("spdep")
install.packages("tmap")

Load packages

Let’s use the tmap package for thematic mapping (https://r-tmap.github.io/tmap/). Note that tmap is rolling out a new version, so some of the online documentation is ahead of the package installed by default.
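A sketch of attaching the newly installed packages:

library(tmap)   # thematic mapping: tm_shape, tm_lines, tm_bubbles
library(spdep)  # spatial analysis tools installed above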

12.5 Acquire hurricane track data

First, download the NA (North Atlantic) point and line shapefiles from https://www.ncei.noaa.gov/products/international-best-track-archive (go to the shapefile data at https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r01/access/shapefile/ and find the two files with “NA” for North Atlantic: points and lines). Unzip the contents (all 4 files in each zip) into your data_private folder.

Read in the lines data

hurricaneTracks <- st_read(here("data_private", "IBTrACS.NA.list.v04r01.lines.shp"))
## Reading layer `IBTrACS.NA.list.v04r01.lines' from data source 
##   `C:\GitHub\opengisci\wt25_josephholler\data_private\IBTrACS.NA.list.v04r01.lines.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 124915 features and 179 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -136.9 ymin: 7 xmax: 63 ymax: 83
## Geodetic CRS:  WGS 84

Filter for 2019, and for Dorian specifically.

tracks2019 <- hurricaneTracks |> filter(SEASON == 2019)
dorianTrack <- tracks2019 |> filter(NAME == "DORIAN")

Load hurricane points

hurricanePoints <- st_read(here("data_private", "IBTrACS.NA.list.v04r01.points.shp"))
## Reading layer `IBTrACS.NA.list.v04r01.points' from data source 
##   `C:\GitHub\opengisci\wt25_josephholler\data_private\IBTrACS.NA.list.v04r01.points.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 127219 features and 179 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -136.9 ymin: 7 xmax: 63 ymax: 83
## Geodetic CRS:  WGS 84

Filter for Dorian in 2019

dorianPoints <- hurricanePoints |> filter(SEASON == 2019, NAME == "DORIAN")

12.6 Map 2019 North Atlantic storm season

The view mode produces an interactive Leaflet map.

tmap_mode(mode = "view")
tm_shape(tracks2019) +
  tm_lines(col = "NAME",
           lwd = 3) +
tm_shape(dorianTrack) +
  tm_lines(col = "red",
           lwd = "USA_WIND")

12.7 Map Hurricane Dorian track

The plot mode produces a static map. The width of lines can represent the wind speed.

tmap_mode(mode = "plot")
## tmap mode set to plotting
tm_shape(dorianTrack) +
  tm_lines(
           lwd = "USA_WIND",
           scale = 20)
## Legend labels were too wide. Therefore, legend.text.size has been set to 0.64. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

The size of bubbles can also represent wind speed. We can produce a more legible map by filtering for one point every 12 hours.

dorianPoints <- dorianPoints |> 
  mutate(hour = hour(as.POSIXlt(ISO_TIME)))

dorian12hrPts <- dorianPoints |> 
  filter(hour %% 12 == 0)

tmap_mode(mode = "plot")
tm_shape(dorian12hrPts) +
  tm_bubbles("USA_WIND",
             col = "USA_WIND",
             alpha = 0.5)