Chapter 7 Web Scraping

Install and load packages

Install here package to help manage file paths

install.packages("here")
install.packages("tidyverse")
install.packages("rvest")

Load tidyverse and here packages

library(tidyverse)
library(here)
library(rvest)
## Warning: package 'rvest' was built under R version 4.3.3
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
urloi <- "https://en.wikipedia.org/wiki/Dolphin"

7.1 Scrape all paragraph text from a webpage

html_text cleans up html tags out of text html_elements selects content within particular html tags. p is for paragraph.

all_text <- urloi |> 
  read_html() |> 
  html_elements("p") |> 
  html_text()

all_text %>% head(5)
## [1] "\n\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [2] "A dolphin is an aquatic mammal in the clade Odontoceti (toothed whale). Dolphins belong to the families Delphinidae (the oceanic dolphins), Platanistidae (the Indian river dolphins), Iniidae (the New World river dolphins), Pontoporiidae (the brackish dolphins), and possibly extinct Lipotidae (baiji or Chinese river dolphin). There are 40 extant species named as dolphins.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [3] "Dolphins range in size from the 1.7-metre-long (5 ft 7 in) and 50-kilogram (110-pound) Maui's dolphin to the 9.5 m (31 ft) and 10-tonne (11-short-ton) orca. Various species of dolphins exhibit sexual dimorphism where the males are larger than females. They have streamlined bodies and two limbs that are modified into flippers. Though not quite as flexible as seals, they are faster; some dolphins can briefly travel at speeds of 29 kilometres per hour (18 mph) or leap about 9 metres (30 ft).[1] Dolphins use their conical teeth to capture fast-moving prey. They have well-developed hearing which is adapted for both air and water; it is so well developed that some can survive even if they are blind. Some species are well adapted for diving to great depths. They have a layer of fat, or blubber, under the skin to keep warm in the cold water.\n"
## [4] "Dolphins are widespread. Most species prefer the warm waters of the tropic zones, but some, such as the right whale dolphin, prefer colder climates. Dolphins feed largely on fish and squid, but a few, such as the orca, feed on large mammals such as seals. Male dolphins typically mate with multiple females every year, but females only mate every two to three years. Calves are typically born in the spring and summer months and females bear all the responsibility for raising them. Mothers of some species fast and nurse their young for a relatively long period of time. \n"                                                                                                                                                                                                                                                                                 
## [5] "Dolphins produce a variety of vocalizations, usually in the form of clicks and whistles.\n"

7.2 Scrape a specific item from a webpage

Right-click and inspect a page to investigate how it’s structured. Right-click an item and copy x-path If the path has quotes in it, use alternative form of quotes (single instead of double or vis a versa)

/html/body/div/main/article/section/div/div/p[4]

all_text <- urloi |> 
  read_html() |> 
  html_element(xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/figure[1]/figcaption") |> 
  html_text()

all_text
## [1] " A common bottlenose dolphin (Tursiops truncatus)"

7.3 Ethics of web scraping

In the United States, web scraping is legal. However, there are many illegal things you may do with the data you scraped, including copyright infringement. Poor scraping design may also crash web servers through denial of service. add /robots.txt to a web address to see what you are not supposed to crawl.

Web scrapers can add pauses into their code to detect scraping behavior. Rtweet was throttled and re-queried every 12 to 15 minutes.

7.4 Scrape data

Scrape data on Tom Brady from Wikipedia.

urloi <- "https://en.wikipedia.org/wiki/Tom_Brady"

Scrape table-based text from a page

all_tables <- urloi |> 
  read_html() |> 
  html_elements("table")  
#  html_text()

all_tables
## {xml_nodeset (58)}
##  [1] <table class="box-Very_long plainlinks metadata ambox ambox-style ambox- ...
##  [2] <table class="infobox vcard" style="width: 25em; line-height: 1.2em;">\n ...
##  [3] <table class="infobox" style="border-collapse:collapse; border-spacing:0 ...
##  [4] <table class="infobox" style="border-collapse:collapse; border-spacing:0 ...
##  [5] <table class="wikitable" style="text-align:center;"><tbody>\n<tr>\n<th r ...
##  [6] <table class="box-Cleanup_reorganize plainlinks metadata ambox ambox-sty ...
##  [7] <table class="wikitable" style="text-align:center">\n<caption>Pre-draft  ...
##  [8] <table class="wikitable"><tbody>\n<tr><th colspan="2">Legend\n</th></tr> ...
##  [9] <table class="wikitable" style="text-align:center;"><tbody>\n<tr>\n<th r ...
## [10] <table class="wikitable" style="text-align:center;"><tbody>\n<tr>\n<th r ...
## [11] <table class="wikitable" style="text-align:center;"><tbody>\n<tr>\n<th r ...
## [12] <table class="col-begin" role="presentation"><tbody><tr><td class="col-b ...
## [13] <table class="wikitable">\n<caption>\n</caption>\n<tbody>\n<tr>\n<th>Yea ...
## [14] <table class="wikitable">\n<caption>\n</caption>\n<tbody>\n<tr>\n<th>Yea ...
## [15] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [16] <table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" styl ...
## [17] <table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" styl ...
## [18] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [19] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [20] <table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" styl ...
## ...

Hold on, scrape a specific table

reg_season <- urloi |> 
  read_html() |> 
  html_element(xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[7]")  |> 
  html_table()

reg_season
## # A tibble: 25 × 23
##    Year  Team  Games Games Games Passing Passing Passing Passing Passing Passing
##    <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 Year  Team  GP    GS    Reco… Cmp     Att     Pct     Yds     Y/A     Lng    
##  2 2000  NE    1     0     —     1       3       33.3    6       2.0     6      
##  3 2001  NE    15    14    11–3  264     413     63.9    2,843   6.9     91     
##  4 2002  NE    16    16    9–7   373     601     62.1    3,764   6.3     49     
##  5 2003  NE    16    16    14–2  317     527     60.2    3,620   6.9     82     
##  6 2004  NE    16    16    14–2  288     474     60.8    3,692   7.8     50     
##  7 2005  NE    16    16    10–6  334     530     63.0    4,110   7.8     71     
##  8 2006  NE    16    16    12–4  319     516     61.8    3,529   6.8     62     
##  9 2007  NE    16    16    16–0  398     578     68.9    4,806   8.3     69     
## 10 2008  NE    1     1     1–0   7       11      63.6    76      6.9     26     
## # ℹ 15 more rows
## # ℹ 12 more variables: Passing <chr>, Passing <chr>, Passing <chr>,
## #   Rushing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>,
## #   Sacked <chr>, Sacked <chr>, Fumbles <chr>, Fumbles <chr>

Summarize average First you need unique column names! You can make a list of new column names as follows: colnames(reg_season) <- c("Year", "Team", ... "LastVar")

Better solution: paste data in row 1 into the column names Subset the first row using brackets [row(s), column(s)]

7.5 Cardinal sin of data science

Overwrite a dataset that you cannot get back. The solution to this is save a raw version of your data and work with a new version of the data. Please save an RDS file!

Paste with a sep option!

brady_data_clean <- reg_season
paste(colnames(brady_data_clean), reg_season[1, ])
##  [1] "Year Year"    "Team Team"    "Games GP"     "Games GS"     "Games Record"
##  [6] "Passing Cmp"  "Passing Att"  "Passing Pct"  "Passing Yds"  "Passing Y/A" 
## [11] "Passing Lng"  "Passing TD"   "Passing Int"  "Passing Rtg"  "Rushing Att" 
## [16] "Rushing Yds"  "Rushing Y/A"  "Rushing Lng"  "Rushing TD"   "Sacked Sck"  
## [21] "Sacked Yds"   "Fumbles Fum"  "Fumbles Lost"
colnames(brady_data_clean) <- paste(colnames(reg_season), reg_season[1, ])
brady_data_clean
## # A tibble: 25 × 23
##    `Year Year` `Team Team` `Games GP` `Games GS` `Games Record` `Passing Cmp`
##    <chr>       <chr>       <chr>      <chr>      <chr>          <chr>        
##  1 Year        Team        GP         GS         Record         Cmp          
##  2 2000        NE          1          0          —              1            
##  3 2001        NE          15         14         11–3           264          
##  4 2002        NE          16         16         9–7            373          
##  5 2003        NE          16         16         14–2           317          
##  6 2004        NE          16         16         14–2           288          
##  7 2005        NE          16         16         10–6           334          
##  8 2006        NE          16         16         12–4           319          
##  9 2007        NE          16         16         16–0           398          
## 10 2008        NE          1          1          1–0            7            
## # ℹ 15 more rows
## # ℹ 17 more variables: `Passing Att` <chr>, `Passing Pct` <chr>,
## #   `Passing Yds` <chr>, `Passing Y/A` <chr>, `Passing Lng` <chr>,
## #   `Passing TD` <chr>, `Passing Int` <chr>, `Passing Rtg` <chr>,
## #   `Rushing Att` <chr>, `Rushing Yds` <chr>, `Rushing Y/A` <chr>,
## #   `Rushing Lng` <chr>, `Rushing TD` <chr>, `Sacked Sck` <chr>,
## #   `Sacked Yds` <chr>, `Fumbles Fum` <chr>, `Fumbles Lost` <chr>

Can we graph Brady’s passing yards over time?

First remove line 1

brady_data_clean_types <- brady_data_clean |> 
  slice(-1) |> 
  filter(`Year Year` != "Career") |> 
  mutate(
    passing_yards_numeric = as.numeric(str_remove(`Passing Yds`, ",")),
    year_numeric = as.numeric(`Year Year`)
  )
brady_data_clean_types |> 
  ggplot() +
  geom_col(aes(x = year_numeric, y = passing_yards_numeric))