Chapter 3 Graph values and qualities of diamonds

there are many different functions and ways to explore data and graph it
the approaches shown in this chapter are among the most common and efficient, but more specialized ways exist

3.1 Install the tidyverse package

install the tidyverse package with install.packages

install.packages("tidyverse")

use the library function to load an installed package into R
library needs to be run every time you start a new R session
it is more efficient for R to load only the packages you need for a particular session / analysis
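The notes above describe library but do not show the call. A minimal sketch, assuming the tidyverse has already been installed with install.packages:

```r
# Attach the tidyverse (ggplot2, dplyr, and related packages) for this session.
library(tidyverse)

# Check that ggplot2 is now attached; this should print TRUE.
"package:ggplot2" %in% search()
```

Unlike install.packages, which is run once per computer, this line goes at the top of every script or session that uses tidyverse functions.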

3.2 Explore Diamonds Dataset

diamonds is a dataset included in ggplot2, one of the packages installed and loaded with the tidyverse

diamonds

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows
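Printing the tibble shows only the first rows. A few other ways to get an overview are sketched below; the choice of functions here is an illustration, not part of the original walkthrough.

```r
library(tidyverse)

# Number of rows and columns in the dataset.
dim(diamonds)

# One line per column: its type and its first few values.
glimpse(diamonds)

# cut is an ordered category; list its levels from worst to best.
levels(diamonds$cut)
```

Typing ?diamonds in the console brings up a description of each column.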

Let’s graph the weight of a diamond (carat) against its dollar value (price) using the ggplot function
Call up help on a function or package with a ? followed by a function name in the console. For example, ?ggplot brings up help on the graph plotting function.
ggplot creates a graphing object, in which all arguments are optional
ggplot technically does not need any arguments (calling it with none makes a blank graph)
graphing is like painting layers on a canvas
ggplot() starts a blank graphing canvas, on which you add data and aesthetic options.
add layers to ggplot with +
check out the ggplot2 cheat sheet pdf from posit.co
the help documentation and the cheat sheet show what you can customize
let’s make a scatterplot with geom_point
we need to give a dataset to ggplot and define x and y coordinates
aes tells R which columns in your dataset to map to aesthetics (visual properties such as x position, y position, and color)

ggplot(data = diamonds) + geom_point(aes(x = carat, y = price))
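The "layers on a canvas" idea can be made explicit by building the same scatterplot in two steps. This is only a sketch; the object names canvas and scatter are made up for illustration.

```r
library(tidyverse)

# Step 1: a canvas that knows about the data but has no layers yet.
canvas <- ggplot(data = diamonds)

# Step 2: add a layer of points with +; aes() maps columns to x and y.
scatter <- canvas + geom_point(aes(x = carat, y = price))

# Printing the object draws the graph.
scatter
```

Printing canvas on its own gives an empty gray panel, which is a quick way to see what each added layer contributes.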

Let’s interpret the graph

price increases with carat weight
data clusters at round-number thresholds: 1 carat, 1.5 carats, 2 carats
you need domain knowledge (content expertise) to explain patterns
more variation in price as weight increases
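The clustering at particular carat weights can also be checked numerically. A rough sketch using count from dplyr (loaded with the tidyverse); the name carat_counts is made up for illustration.

```r
library(tidyverse)

# Count how many diamonds have each distinct carat weight, most common first.
carat_counts <- diamonds %>%
  count(carat, sort = TRUE)

# Show the ten most common weights.
head(carat_counts, 10)
```

If the vertical stripes in the scatterplot are real, weights such as 1, 1.5 and 2 should have noticeably larger counts than the weights just below them.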

Reflection

Every graph will have this fundamental code structure, with modifications
copy code you have that works and modify what you need!

ggplot(data = diamonds) + geom_point(aes(x = depth, y = price))

Try graphing three variables: diamond color, carat and price. How could we approach this?

use a scatterplot and symbolize the dots by mapping diamond color to the color aesthetic

ggplot(data = arrange(diamonds, desc(color))) + geom_point(aes(x = carat, y = price, color = color))

Interpret:

D appears more valuable than J
Few instances of very large D diamonds
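Note that ggplot draws points in the row order of the data, so sorting with arrange(diamonds, desc(color)) draws the J diamonds first and the D diamonds last (on top)
many +’s in the console indicate that a line of code is incomplete (missing parenthesis, a line ending with a + and nothing after it, etc.)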
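The effect of arrange(diamonds, desc(color)) on drawing order can be checked by looking at the first and last rows of the sorted data. A small sketch; the name sorted is made up for illustration.

```r
library(tidyverse)

# Sort so that the last color level (J) comes first and the first (D) comes last.
sorted <- arrange(diamonds, desc(color))

# color is an ordered category running from D to J.
levels(diamonds$color)

# First row is plotted first (underneath); last row is plotted last (on top).
head(sorted$color, 1)
tail(sorted$color, 1)
```

Because the D diamonds sit at the bottom of the sorted data, their points end up on top of the others where points overlap.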

3.3 Facet the data

subdivide the data by a variable & create different graphs for each subgroup

ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color = color)) +
  facet_wrap(~color)

Interpret:

now it’s possible to analyze trends of each color separately
maybe facet by cut instead

ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color = color)) +
  facet_wrap(~cut)

Interpret:

as cut quality improves, prices tend to be higher
price rises faster with carat weight for better cuts

ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color = color)) +
  facet_grid(clarity~cut)

Interpret:

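Each column represents a cut, and each row represents a clarity (in facet_grid(clarity~cut), the variable before the ~ defines the rows and the variable after it defines the columns)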
This visualization shows 5 variables
IF (internally flawless) is clearer (more valuable) than I1 (included)
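The shape of the grid follows from the number of categories in each faceting variable. A quick check, sketched with base R functions:

```r
library(tidyverse)

# facet_grid(clarity ~ cut): clarity defines the rows, cut defines the columns.
nlevels(diamonds$clarity)  # number of rows in the grid
nlevels(diamonds$cut)      # number of columns in the grid

# Clarity levels, ordered from least clear (I1) to clearest (IF).
levels(diamonds$clarity)
```

Multiplying the two counts gives the number of panels, which is worth checking before faceting by variables with many categories.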

3.4 Restart R and Remake last graph

If you get an error about a function not being found, you either mistyped its name or, more likely, you need to load the package that contains the function with library. You need to do this every time you open a new R session.

ggplot(data = diamonds) + 
  geom_point(aes(x = carat, y = price, color = color)) +
  facet_grid(clarity~cut)

Ctrl+Shift+R adds a section to an R script in RStudio. In an R Markdown (Rmd) file, headings take the place of sections

3.5 Explore gap in prices between 1k and 2k

Let’s make a histogram showing the distribution of a single quantitative variable

ggplot(data = diamonds) + 
  geom_histogram(aes(x = price))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There’s your basic histogram. The graph shows the aggregation of the diamond data: each bar’s width represents a price range, and its height represents the number of data points within that price range.

Interpretation: the distribution is strongly right-skewed; the tallest bars are at the low end of the price range (below roughly 2k), and the counts taper off as price increases
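The stat_bin message suggests choosing a bin width instead of relying on the default of 30 bins. A sketch with binwidth = 500 (dollars), which is an assumed value picked for illustration rather than a recommendation; price_hist is a made-up name.

```r
library(tidyverse)

# Each bar now covers a 500-dollar price range (assumed value).
price_hist <- ggplot(data = diamonds) +
  geom_histogram(aes(x = price), binwidth = 500)

price_hist
```

Setting binwidth also silences the message, because ggplot no longer has to guess the number of bins. Smaller widths show more detail; larger widths give a smoother overall shape.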

We can optionally change the fill and outline colors.

ggplot(data = diamonds) + 
  geom_histogram(aes(x = price), 
                 fill = "darkblue",
                 color = "black")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s “zoom in” on the area of data with a gap in prices by changing the limits of the x-axis to 1000 and 2000

ggplot(data = diamonds) + 
  geom_histogram(aes(x = price), 
                 fill = "darkblue",
                 color = "black") +
  xlim(1000, 2000)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 44232 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale
## range (`geom_bar()`).
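The first warning appears because xlim drops every row with a price outside the limits before the bins are computed. One alternative, sketched below, is to filter the data explicitly and then use narrow bins; the 50-dollar bin width and the name mid_priced are assumptions for illustration.

```r
library(tidyverse)

# Keep only diamonds priced from 1000 to 2000 dollars (inclusive).
mid_priced <- filter(diamonds, price >= 1000, price <= 2000)

# Should equal 53,940 minus the 44,232 rows reported in the warning.
nrow(mid_priced)

# Narrow bins (assumed 50 dollars wide) make the gap easier to see.
ggplot(data = mid_priced) +
  geom_histogram(aes(x = price),
                 binwidth = 50,
                 fill = "darkblue",
                 color = "black")
```

Filtering first makes it explicit which rows are being plotted, so there is no warning about removed rows to interpret.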

Data science cannot tell us why this price gap exists: you need domain knowledge to explain.