This vignette is based on sections of Edzer Pebesma’s Statistical modelling with stars objects: https://r-spatial.github.io/stars/articles/stars7.html
The analogy of stars objects to data.frame
is this:
To see how this works with the 6-band example dataset, consider this:
library(stars)
l7 <- read_stars(here("L7.tif"))
l7
## stars object with 3 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## L7.tif 1 54 69 68.91242 86 255
## dimension(s):
## from to offset delta refsys point x/y
## x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
## y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
## band 1 6 NA NA NA NA
as.data.frame(l7) %>% head()
## x y band L7.tif
## 1 288790.5 9120747 1 69
## 2 288819.0 9120747 1 69
## 3 288847.5 9120747 1 63
## 4 288876.0 9120747 1 60
## 5 288904.5 9120747 1 61
## 6 288933.0 9120747 1 61
We see that we get one single variable with the
object (array) name, and added columns with the dimension values (x, y,
band). In a typical case, we would like to have the six bands
distributed over six variables, and have a single observation (row) for
each x/y pair. For this, we could use
e.g. utils::unstack
or dplyr::pivot_wider
on
this data.frame, but a more efficient way is to use the dedicated
split
method for stars
objects, which resolves
a dimension and splits it over attributes, one for each dimension
value:
l7 %>% split("band") %>%
as.data.frame() %>%
head()
## x y X1 X2 X3 X4 X5 X6
## 1 288790.5 9120747 69 56 46 79 86 46
## 2 288819.0 9120747 69 57 49 75 88 49
## 3 288847.5 9120747 63 52 45 66 75 41
## 4 288876.0 9120747 60 45 35 66 69 38
## 5 288904.5 9120747 61 52 44 76 92 60
## 6 288933.0 9120747 61 50 37 78 74 38
The reason that split
is more efficient than the
mentioned alternatives is that (i) split
does not have to
match records based on dimensions (x/y), and (ii) it works for
out-of-memory (stars_proxy) arrays, in the chunked process/write loop of
write_stars()
.
First, let’s visualize six bands of data included in a Landsat 7 scene.
plot(l7, breaks = "equal", join_zlim = FALSE)
As you can see, many of the bands contain similar information.
We can visualize information from at most three bands at a time with red (band 3), green (band 2), and blue (band 1) color. For example, here is a true color visualization with red color = red energy, green = green energy, and blue = blue energy.
plot(l7, rgb = c(3,2,1) )
Or, here is a false color visualization with red color = infrared energy, green = red energy, and blue = green energy. In these visualizations, vegetation appears red and urban areas appear blue-green.
plot(l7, rgb = c(4,3,2) )
It looks like some of the bands might have correlated information. Let’s see…
cormatrix <- l7 %>% split("band") %>% as.data.frame() %>% select(-c(1,2)) %>% cor()
cormatrix
## X1 X2 X3 X4 X5 X6
## X1 1.00000000 0.97567501 0.8462535 -0.4732270 0.02842741 0.2459045
## X2 0.97567501 1.00000000 0.8510621 -0.4405952 0.01840558 0.2154793
## X3 0.84625347 0.85106208 1.0000000 -0.1065046 0.48748755 0.6481330
## X4 -0.47322703 -0.44059520 -0.1065046 1.0000000 0.63283395 0.3900406
## X5 0.02842741 0.01840558 0.4874875 0.6328339 1.00000000 0.9507440
## X6 0.24590445 0.21547930 0.6481330 0.3900406 0.95074402 1.0000000
Indeed, band 1 (Blue) is highly correlated with band 2 (green) and 3 (red). This means that when we look at a true color image, we’re seeing a lot of autocorrelated data and we’re missing out on a lot of other information collected by the satellite.
Bands 5 and 6 are also highly correlated with one another.
What if we could create one band that contains the most possible information from all six?
And then a second band containing the most possible remaining information?
and so on…
We probably wouldn’t even need all six bands to convey most of the information, because several of the bands are correlated with one another anyway.
Let’s calculate the principal components with principal component analysis (PCA).
tif = system.file("tif/L7_ETMs.tif", package = "stars")
r = split(read_stars(tif))
pc = prcomp(as.data.frame(r)[,-(1:2)]) # based on all data
out = predict(r, pc)
write_stars(merge(out), "pca.tif")
Display the variance explained by each component
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 53.4767 31.6520 13.66676 3.76537 3.14947 2.00866
## Proportion of Variance 0.7015 0.2458 0.04582 0.00348 0.00243 0.00099
## Cumulative Proportion 0.7015 0.9473 0.99310 0.99658 0.99901 1.00000
The first component explains 70% of variance in the six satellite
image bands.
The second component explains 25% of the variance, for a cumulative 95%
of all variance.
By the time we add a third component, we have already explained 99.7% of
all of the variance from six bands of satellite data.
Let’s display the loadings for each principal component
pc$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## X1 0.04706455 -0.4401597 -0.22069196 0.56918675 -0.09517157 -0.64985249
## X2 0.04856085 -0.4853616 -0.34138332 0.30211452 0.33805170 0.66330332
## X3 0.24563190 -0.5167370 -0.31139565 -0.72502261 -0.17800094 -0.13541961
## X4 0.23746268 0.5088377 -0.76133736 0.10559668 -0.29895023 0.06737388
## X5 0.71114486 0.1740749 0.06240705 -0.02946564 0.64498197 -0.20786127
## X6 0.61071777 -0.1202025 0.39275440 0.21697140 -0.58275733 0.26764983
How can we interpret this? The first component is dominated by information from Bands 5 and 6 (short wave infrared) The second component is dominated by a lack of visual light (bands 1 2 and 3) and presence of near infrared (band 4)
plot(merge(out), breaks = "equal", join_zlim = FALSE)
First of all, the band with the most contrast is PC1, in which
water/land is very clear.
In PC2 and PC3, we can see more subtle features, including clouds/smoke,
beaches, and vegetation / bare or built up areas. PC3 is starting to
emphasize less important elements, including shadows in the topography
and differences in the water surface. PC4 and above have very little
contrast and increasing amounts of noise.
Let’s rescale the first three principal components for visualization in a single RGB image. I have arbitrarily inverted the second and third bands to improve the visualization. That is ok, because PCA loadings are arbitrarily negative or positive in direction. It’s the magnitude of rotation that matters, not the direction (+/-).
outmin <- min(out$PC1)
outmax <- max(out$PC1)
out$PC1 <- (out$PC1 - outmin) / (outmax - outmin) * 255
outmin <- min(out$PC2)
outmax <- max(out$PC2)
out$PC2 <- 255- (out$PC2 - outmin) / (outmax - outmin) * 255
outmin <- min(out$PC3)
outmax <- max(out$PC3)
out$PC3 <- 255- (out$PC3 - outmin) / (outmax - outmin) * 255
What does our false color composite of the first three components look like?
plot(merge(out[1:3]), interpolate = TRUE, breaks = "equal", rgb = c(1,2,3), maxColorValue = 255)