--- title: 2. Computing signal correlations description: Calculate correlations over space and time between multiple signals. output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{2. Computing signal correlations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The covidcast package provides some simple utilities for exploring the correlations between two signals, over space or time, which may be helpful for simple analyses and explorations of data. For these examples, we'll load confirmed case and death rates per 100,000 population, and restrict our analysis to counties with at least 500 total cases by August 15th. ```r library(covidcast) library(dplyr) start_day <- "2020-03-01" end_day <- "2020-08-15" iprop <- covidcast_signal( data_source = "jhu-csse", signal = "confirmed_7dav_incidence_prop", start_day = start_day, end_day = end_day) dprop <- covidcast_signal( data_source = "jhu-csse", signal = "deaths_7dav_incidence_prop", start_day = start_day, end_day = end_day) # Restrict attention to "active" counties with at least 500 total cases case_num <- 500 geo_values <- covidcast_signal( data_source = "jhu-csse", signal = "confirmed_cumulative_num", start_day = end_day, end_day = end_day) %>% filter(value >= case_num) %>% pull(geo_value) cases <- iprop %>% filter(geo_value %in% geo_values) deaths <- dprop %>% filter(geo_value %in% geo_values) ``` ## Correlations sliced by time The `covidcast_cor()` function is your primary way to calculate correlations. The first option we have is to "slice by time": this calculates, for each time, correlation between the signals over all geographic locations. This is obtained by setting `by = "time_value"`: ```r library(ggplot2) # Compute correlation per time, over all counties df_cor <- covidcast_cor(cases, deaths, by = "time_value") # Plot the correlation time series ggplot(df_cor, aes(x = time_value, y = value)) + geom_line() + labs(title = "Correlation between case and death rates", subtitle = sprintf("Per day, over counties with at least %i cases", case_num), x = "Date", y = "Correlation") ``` ![](figures/corr-unnamed-chunk-3-1.png) The above plot addresses the question: "on any given day, are case and death rates linearly associated, over US counties?". We might be interested in broadening this question, instead asking: "on any given day, do higher case rates tend to associate with higher death rates?", removing the dependence on a linear relationship. The latter can be addressed using Spearman correlation, accomplished by setting `method = "spearman"` in the call to `covidcast_cor()`. Spearman correlation is highly robust and invariant to monotone transformations (it doesn't rely on any particular functional form for the dependence between two variables). ### Lagged correlations We might also be interested in how case rates associate with death rates in the *future*. Using the `dt_x` parameter in `covidcast_cor()`, we can lag case rates back any number of days we want, before calculating correlations. Here we set `dt_x = -10`. This means that `cases` will be lagged by 10 days, meaning that cases on June 1st will be correlated with deaths on June 11th. (It might help to think of it this way: deaths on a certain day will be correlated with cases with an offset of -10 days.) ```r # Use Spearman correlation, with case rates and 10-day lagged case rates df_cor1 <- covidcast_cor(cases, deaths, by = "time_value", method = "spearman") df_cor2 <- covidcast_cor(cases, deaths, by = "time_value", dt_x = -10, method = "spearman") # Stack rowwise into one data frame, then plot time series df_cor <- rbind(df_cor1, df_cor2) df_cor$dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2)))) ggplot(df_cor, aes(x = time_value, y = value)) + geom_line(aes(color = dt)) + labs(title = "Correlation between case and death rates", subtitle = sprintf("Per day, over counties with at least %i cases", case_num), x = "Date", y = "Correlation") + theme(legend.position = "bottom") ``` ![](figures/corr-unnamed-chunk-4-1.png) We can see that, for the most part, the Spearman measure has bolstered the correlations; and generally, lagging the case rates back by 10 days improves correlations, confirming case rates are better correlated with death rates 10 days from now. ## Correlations sliced by county The second option we have is to "slice by location": this calculates, for each geographic location, correlation between the time series of two signals. This is obtained by setting `by = "geo_value"`. We'll again look at correlations both for observations at the same time and for 10-day lagged case rates: ```r # Compute correlation per county, over all times df_cor1 <- covidcast_cor(cases, deaths, by = "geo_value") df_cor2 <- covidcast_cor(cases, deaths, by = "geo_value", dt_x = -10) # Stack rowwise into one data frame, then plot densities df_cor <- rbind(df_cor1, df_cor2) df_cor$dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2)))) ggplot(df_cor, aes(value)) + geom_density(aes(color = dt, fill = dt), alpha = 0.5) + labs(title = "Correlation between case and death rates", subtitle = "Computed separately for each county, over all times", x = "Date", y = "Density") + theme(legend.position = "bottom") ``` ![](figures/corr-unnamed-chunk-5-1.png) Using some tricks, we can attach the necessary properties to the data frame so we can plot these correlations in space as a choropleth map, using `plot.covidcast_signal()`: ```r # Turn the data into a covidcast_signal data frame so it can be plotted df_cor2$time_value <- start_day df_cor2 <- as.covidcast_signal(df_cor2, geo_type = "county", signal = "correlations") # Plot choropleth maps, using the covidcast plotting functionality plot(df_cor2, title = "Correlations between 10-day lagged case and death rates", range = c(-1, 1), choro_col = c("orange", "lightblue", "purple")) ``` ![](figures/corr-unnamed-chunk-6-1.png) ## More systematic lag analysis You could also imagine trying to move the signals with various lags to see at what lag one signal is most correlated with the other. A simple way to achieve this: ```r # Loop over values for dt, and compute correlations per county dt_vec <- -(0:15) df_list <- vector("list", length(dt_vec)) for (i in seq_along(dt_vec)) { df_list[[i]] <- covidcast_cor(cases, deaths, dt_x = dt_vec[i], by = "geo_value") df_list[[i]]$dt <- dt_vec[i] } # Stack into one big data frame, and then plot the median correlation by dt df <- do.call(rbind, df_list) df %>% group_by(dt) %>% summarize(median = median(value, na.rm = TRUE), .groups = "drop_last") %>% ggplot(aes(x = dt, y = median)) + geom_line() + geom_point() + labs(title = "Median correlation between case and death rates", x = "dt", y = "Correlation") + theme(legend.position = "bottom", legend.title = element_blank()) ``` ![](figures/corr-unnamed-chunk-7-1.png) We can see that the median correlation between case and death rates (where the correlations come from slicing by location) is maximized when we lag the case incidence rates back 8 days in time.