Text Analysis
Reading for discussion
What can you do with spatial data science? Things like this:
- Zou, L., Lam, N. S. N., Cai, H., & Qiang, Y. (2018). Mining Twitter Data for Improved Understanding of Disaster Resilience. Annals of the American Association of Geographers, 108(5), 1422–1441. https://doi.org/10.1080/24694452.2017.1421897
Questions for discussion:
- What research questions are the authors asking?
- How do the authors deal with the problem of unequal Twitter participation over space? (both due to uneven population over space and due to uneven social media use by people)
- How do the authors analyze the timing of Twitter activity?
- How does sentiment analysis work and what do the outputs of sentiment analysis mean?
- What is the story/interpretation of the tables and figures in the paper?
What have you learned?
What did you learn from the first assignment?
Time and Text Analysis
Wrapping up lecture…
- First, pull any changes into your local computer or, if you are using a new computer, clone your repository from scratch.
- I suggest adding a data_public folder to your repository and saving the Life_Expectancy.csv file into it.
- When you attend the next lecture, use the path here("data_public", "Life_Expectancy.csv") to load your data from the CSV file.
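As a minimal sketch of that last step, the path can be built with the here package and passed to a CSV reader. The use of read.csv and the object name life_expectancy are assumptions for illustration; the file is only read if it already exists in your repository.

```r
library(here)

# Build a path relative to the project root, so it works on any computer
csv_path <- here("data_public", "Life_Expectancy.csv")

# Only read the file if it has actually been saved into data_public
if (file.exists(csv_path)) {
  life_expectancy <- read.csv(csv_path)  # object name is a hypothetical choice
  str(life_expectancy)
} else {
  message("File not found at: ", csv_path)
}
```

Building the path with here() rather than typing a full file path keeps the script portable between classmates and computers.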
Organizing today…
- Please save this Rmd file into your scripts folder.
- We’ll follow along with this instructional Rmd file today!
Date-time data
- We’ve been treating years as integers so far, but it’s not so easy when time includes the month, day, hours and minutes.
- Dates require special date data types. For Twitter, it’s POSIXct.
- You should expect to treat dates differently, and use special functions like as.POSIXct to create them.
- ggplot even has a special axis function type, scale_*_datetime, for temporal data.
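The points above can be sketched as follows. The timestamps are made up for illustration (they are not from the course data), and the column name created_at and the one-day bin width are assumptions.

```r
library(ggplot2)

# Hypothetical timestamps stored as plain text
tweets <- data.frame(
  created_at = c("2017-08-25 22:15:00", "2017-08-26 03:40:00",
                 "2017-08-26 09:05:00", "2017-08-27 14:30:00"),
  stringsAsFactors = FALSE
)

# Convert text to POSIXct; stating tz avoids depending on the computer's time zone
tweets$created_at <- as.POSIXct(tweets$created_at,
                                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Date-time arithmetic works once the column is POSIXct
time_span <- difftime(max(tweets$created_at), min(tweets$created_at),
                      units = "hours")

# Histogram of tweets per day; for POSIXct data, binwidth is given in seconds
tweet_plot <- ggplot(tweets, aes(x = created_at)) +
  geom_histogram(binwidth = 60 * 60 * 24) +
  scale_x_datetime(date_breaks = "1 day", date_labels = "%b %d")
```

Printing tweet_plot draws the chart; scale_x_datetime controls where the axis breaks fall and how the labels are formatted.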
Cleaning Text
- non-text
- stop words
- search terms
- unnest text content into tokens
- can we remove handles, e.g. @nytimes, from text content?
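One possible way to carry out these cleaning steps uses the dplyr, stringr and tidytext packages. This is a sketch under assumptions: the example tweets, the regular expressions, and the search term "storm" are all hypothetical, and the lecture's own Rmd file may take a different approach.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical tweets for illustration
tweets <- tibble(
  id = 1:2,
  text = c("Flooding near the river, stay safe @nytimes https://example.com",
           "The storm is getting worse and the power is out")
)

# Hypothetical search term used to collect the tweets; it appears by design,
# so it tells us little about content
search_terms <- c("storm")

tidy_words <- tweets %>%
  # Remove non-text content: URLs first, then handles such as @nytimes
  mutate(text = str_remove_all(text, "https?://\\S+"),
         text = str_remove_all(text, "@\\w+")) %>%
  # Unnest text into one lowercase word token per row
  unnest_tokens(word, text) %>%
  # Drop stop words using the stop_words table that ships with tidytext
  anti_join(stop_words, by = "word") %>%
  # Drop the search terms
  filter(!word %in% search_terms)
```

Handles are removed before tokenizing because the default word tokenizer strips the @ symbol, which would leave the handle behind as an ordinary word.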
Word Frequency
- histograms
- word graphs
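A word-frequency chart can be sketched by counting tokens and plotting the counts as bars. The example text is hypothetical, and the choice of a horizontal bar chart with geom_col is an assumption about how the frequencies might be displayed.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical tweets for illustration
tweets <- tibble(text = c("power out after the flood",
                          "flood water rising near the bridge",
                          "bridge closed, power still out"))

# Tokenize, drop stop words, then count how often each word appears
word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Bar chart of word counts, ordered from most to least frequent
freq_plot <- ggplot(word_counts, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = "Word")
```

With real data, keeping only the most frequent words (for example with slice_max from dplyr) stops the chart from becoming unreadable.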
Word Association
- n-grams are sequences of n words found in order; bigrams (n = 2) are pairs of adjacent words
- we can count common co-occurrences of words
- build a graph data model of them
- and visualize the graph!
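The steps above can be sketched with tidytext for the bigrams and igraph for the graph. The example text is hypothetical, and igraph's basic plot method stands in here for whatever graph-drawing package the lecture uses.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)

# Hypothetical tweets for illustration
tweets <- tibble(text = c("storm surge warning issued",
                          "storm surge flooding downtown",
                          "flash flood warning issued"))

bigram_counts <- tweets %>%
  # Tokenize into bigrams: pairs of adjacent words within each tweet
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram into its two words
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Drop pairs in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Count how often each pair occurs
  count(word1, word2, sort = TRUE)

# Graph data model: words are nodes, bigrams are directed edges,
# and the count n is stored as an edge attribute
bigram_graph <- graph_from_data_frame(bigram_counts)

# Basic visualization, with edge width scaled by the bigram count
plot(bigram_graph, edge.width = E(bigram_graph)$n)
```

graph_from_data_frame treats the first two columns as the start and end of each edge, which is why word1 and word2 come first in the counted table.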