Guest Speaker

  • Surprise! Alumnus Liam Smith will join us at the beginning of class to speak about his experience learning data science and applying it to an internship with the National Park Service. Some Links

A Forum

Twitter/X Social Media Data

  • Characteristics of ``big data’’ generally and Twitter/X specifically
  • Legal and ethical constraints and considerations
  • Twitter application programming interface and queries
  • Twitter metadata and example Tweet
  • slides

Data Science Tools and Organization

  • Structure your project with folders for:
    • data_raw (for unmodified given files) data_private (for very large or secret files), and data_derived (for data files you create to share)
    • docs for your final documents or websites, e.g. your rendered .html files. GitHub and GitLabs use the docs folders to make websites like this one from repositories
    • code or scripts for, you know… your code.
  • Project management files
    • .Rproj is at the root of an R project and maintains information about the .RData file, any version control, and what files you have had open
    • .RData contains all data loaded in the R Environment (i.e. everything in the environment pane). It is convenient for temporary use on a local computer (e.g. for shutting down your computer after lecture and re-opening your project after lunch), but it is major problem to rely on it in the context of reproducibility, Git, and collaboration across different computers and team members.
    • .gitignore contains items & instructions for which files Git and GitHub should ignore.
    • .Rproj.user folder contains temporary files for the R project. Pay no attention.
    • .git folder contains essential information about the Git repository, e.g. the history of commits. Git maintains this folder and it is important for Git version control, so you should leave it alone. It may even be hidden from you.
  • here package helps you manage links to files
    • starts looking for files in the root folder of the project (where Rproj file is, or where Git repository starts)
    • searches for files and folders in a sequential list and builds an operating-system specific file path: e.g. here("data", "givens", "mytable.csv") will find "\data\givens\mytable.csv"

Managing the folder structure with Git:

  • Add data_private\ to your .gitignore so that private things will not be tracked
  • commit the changes to .gitignore
  • Git pays no attention to folders until they have content inside

Create a new folder with codde

The following line of code can check if a folder exists and if not, create it.

if(!dir.exists(here("data_private"))) {dir.create(here("data_private"))}

We can break down how it works below:

Code Purpose
here(“data_private”) creates a path to a data_private folder in your project
dir.exists(here(“data_private”) checks if the folder is there, resulting if TRUE if it is, and FALSE if it is not.
!dir.exists(here(“data_private”) ! negates the above, equivalent to checking if a folder does NOT exist
if( … ) { … } if() controls flow of code. If whatever is in the ( parenthesis ) is TRUE, then do the stuff inside the { curly brackets }. Otherwise, skip the stuff in curly brackets.
dir.create(here("data_private")) makes a new folder

Example demo repository

Example Repository Structure and File Management

First Assignment

Let’s start analyzing this data by investigating some preliminary questions about it’s quality and structure.

  • How many tweets are there?
  • How many tweets are tagged with geographic coordinates?
  • How many tweets are tagged with a place name, and how many are there by each place type?
  • Which users have the most followers, retweets, or quotes?
  • How many tweets are there from each country in the dataset?
  • Graph the frequency of tweets over time, by different time increments

Submit the first assignment by completing an Rmarkdown document and knitting it, including a graph of how many tweets are tagged with place names of which types, and another interesting and creative graph from this dataset.

Data is available in this private repository github.com/opengisci/opengisci_restricted

Transactions

  • Computational Notebooks are due Friday at 12 noon
  • First Assignment is due Monday at 5pm
  • There is one article to read for the Tuesday Workshop