This vignette is based on sections of Edzer Pebesma’s Statistical modelling with stars objects: https://r-spatial.github.io/stars/articles/stars7.html

stars objects as data.frames

The analogy of stars objects to data.frame is this:

To see how this works with the 6-band example dataset, consider this:

library(stars)
l7 <- read_stars(here("L7.tif"))
l7
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##         Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7.tif     1      54     69 68.91242      86  255
## dimension(s):
##      from  to  offset delta                     refsys point x/y
## x       1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
## y       1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
## band    1   6      NA    NA                         NA    NA
as.data.frame(l7) %>% head()
##          x       y band L7.tif
## 1 288790.5 9120747    1     69
## 2 288819.0 9120747    1     69
## 3 288847.5 9120747    1     63
## 4 288876.0 9120747    1     60
## 5 288904.5 9120747    1     61
## 6 288933.0 9120747    1     61

We see that we get one single variable with the object (array) name, and added columns with the dimension values (x, y, band). In a typical case, we would like to have the six bands distributed over six variables, and have a single observation (row) for each x/y pair. For this, we could use e.g. utils::unstack or dplyr::pivot_wider on this data.frame, but a more efficient way is to use the dedicated split method for stars objects, which resolves a dimension and splits it over attributes, one for each dimension value:

l7 %>% split("band") %>%
  as.data.frame() %>% 
  head()
##          x       y X1 X2 X3 X4 X5 X6
## 1 288790.5 9120747 69 56 46 79 86 46
## 2 288819.0 9120747 69 57 49 75 88 49
## 3 288847.5 9120747 63 52 45 66 75 41
## 4 288876.0 9120747 60 45 35 66 69 38
## 5 288904.5 9120747 61 52 44 76 92 60
## 6 288933.0 9120747 61 50 37 78 74 38

The reason that split is more efficient than the mentioned alternatives is that (i) split does not have to match records based on dimensions (x/y), and (ii) it works for out-of-memory (stars_proxy) arrays, in the chunked process/write loop of write_stars().

Principal components

First, let’s visualize six bands of data included in a Landsat 7 scene.

plot(l7, breaks = "equal", join_zlim = FALSE)

As you can see, many of the bands contain similar information.

We can visualize information from at most three bands at a time with red (band 3), green (band 2), and blue (band 1) color. For example, here is a true color visualization with red color = red energy, green = green energy, and blue = blue energy.

plot(l7, rgb = c(3,2,1) )

Or, here is a false color visualization with red color = infrared energy, green = red energy, and blue = green energy. In these visualizations, vegetation appears red and urban areas appear blue-green.

plot(l7, rgb = c(4,3,2) )

It looks like some of the bands might have correlated information. Let’s see…

cormatrix <- l7 %>%  split("band") %>% as.data.frame() %>% select(-c(1,2)) %>% cor()
cormatrix
##             X1          X2         X3         X4         X5        X6
## X1  1.00000000  0.97567501  0.8462535 -0.4732270 0.02842741 0.2459045
## X2  0.97567501  1.00000000  0.8510621 -0.4405952 0.01840558 0.2154793
## X3  0.84625347  0.85106208  1.0000000 -0.1065046 0.48748755 0.6481330
## X4 -0.47322703 -0.44059520 -0.1065046  1.0000000 0.63283395 0.3900406
## X5  0.02842741  0.01840558  0.4874875  0.6328339 1.00000000 0.9507440
## X6  0.24590445  0.21547930  0.6481330  0.3900406 0.95074402 1.0000000

Indeed, band 1 (Blue) is highly correlated with band 2 (green) and 3 (red). This means that when we look at a true color image, we’re seeing a lot of autocorrelated data and we’re missing out on a lot of other information collected by the satellite.

Bands 5 and 6 are also highly correlated with one another.

What if we could create one band that contains the most possible information from all six?

And then a second band containing the most possible remaining information?

and so on…

We probably wouldn’t even need all six bands to convey most of the information, because several of the bands are correlated with one another anyway.

Let’s calculate the principal components with principal component analysis (PCA).

tif = system.file("tif/L7_ETMs.tif", package = "stars")
r = split(read_stars(tif))
pc = prcomp(as.data.frame(r)[,-(1:2)]) # based on all data
out = predict(r, pc)
write_stars(merge(out), "pca.tif")

Display the variance explained by each component

summary(pc)
## Importance of components:
##                            PC1     PC2      PC3     PC4     PC5     PC6
## Standard deviation     53.4767 31.6520 13.66676 3.76537 3.14947 2.00866
## Proportion of Variance  0.7015  0.2458  0.04582 0.00348 0.00243 0.00099
## Cumulative Proportion   0.7015  0.9473  0.99310 0.99658 0.99901 1.00000

The first component explains 70% of variance in the six satellite image bands.
The second component explains 25% of the variance, for a cumulative 95% of all variance.
By the time we add a third component, we have already explained 99.7% of all of the variance from six bands of satellite data.

Let’s display the loadings for each principal component

pc$rotation
##           PC1        PC2         PC3         PC4         PC5         PC6
## X1 0.04706455 -0.4401597 -0.22069196  0.56918675 -0.09517157 -0.64985249
## X2 0.04856085 -0.4853616 -0.34138332  0.30211452  0.33805170  0.66330332
## X3 0.24563190 -0.5167370 -0.31139565 -0.72502261 -0.17800094 -0.13541961
## X4 0.23746268  0.5088377 -0.76133736  0.10559668 -0.29895023  0.06737388
## X5 0.71114486  0.1740749  0.06240705 -0.02946564  0.64498197 -0.20786127
## X6 0.61071777 -0.1202025  0.39275440  0.21697140 -0.58275733  0.26764983

How can we interpret this? The first component is dominated by information from Bands 5 and 6 (short wave infrared) The second component is dominated by a lack of visual light (bands 1 2 and 3) and presence of near infrared (band 4)

plot(merge(out), breaks = "equal", join_zlim = FALSE)

First of all, the band with the most contrast is PC1, in which water/land is very clear.
In PC2 and PC3, we can see more subtle features, including clouds/smoke, beaches, and vegetation / bare or built up areas. PC3 is starting to emphasize less important elements, including shadows in the topography and differences in the water surface. PC4 and above have very little contrast and increasing amounts of noise.

Let’s rescale the first three principal components for visualization in a single RGB image. I have arbitrarily inverted the second and third bands to improve the visualization. That is ok, because PCA loadings are arbitrarily negative or positive in direction. It’s the magnitude of rotation that matters, not the direction (+/-).

outmin <- min(out$PC1)
outmax <- max(out$PC1)
out$PC1 <- (out$PC1 - outmin) / (outmax - outmin) * 255
outmin <- min(out$PC2)
outmax <- max(out$PC2)
out$PC2 <- 255- (out$PC2 - outmin) / (outmax - outmin) * 255
outmin <- min(out$PC3)
outmax <- max(out$PC3)
out$PC3 <- 255- (out$PC3 - outmin) / (outmax - outmin) * 255

What does our false color composite of the first three components look like?

plot(merge(out[1:3]), interpolate = TRUE, breaks = "equal", rgb = c(1,2,3), maxColorValue = 255)