American Kestrel near Saxapahaw
Bats of Bolin Creek, Chapel Hill
A Critical Skill for Modern Humans
HOW TO SHARPEN PENCILS from Pricefilms on Vimeo.
Bats and Public Health: An Emerging Concern
Since researchers identified horseshoe bats as a likely reservoir for the Severe Acute Respiratory Syndrome (SARS) coronavirus, research on the connection between bats and emerging infectious diseases (EIDs) has increased. Bats have been known reservoirs for the rabies virus since the early 1900s, when vampire bats caused widespread outbreaks of rabies in South American cattle. The enormous health and economic costs of disease outbreaks such as SARS make it imperative to identify the sources of disease and develop policies to prevent outbreaks. In April 2012, another coronavirus, Middle East Respiratory Syndrome (MERS), began affecting humans in Saudi Arabia. As with SARS, bats are implicated as possible reservoirs for this new disease.
Read the full paper here:
201311_bats_public_health
Quick! Grab the Camera!
Math is Beautiful
Here’s a fun video showing the mathematics behind different phenomena as equation, visualization, and realization. It’s best viewed in full-screen mode.
BEAUTY OF MATHEMATICS from PARACHUTES.TV on Vimeo.
via FlowingData.
Battle for Bats
Battle For Bats: Surviving White Nose Syndrome from Ravenswood Media on Vimeo.
Bats are rapidly dying off across the eastern United States due to White Nose Syndrome. In less than 10 years, populations of certain bat species have collapsed from hundreds of thousands of individuals to near extinction.
Forget Big Data. MegaData is here.
MegaData sounds better, don’t you think? Or maybe Megalodata.
Yesterday, Dr. Jeremy Wu spoke at our Biostatistics seminar about “Statistics 2.0” and big data. I think his main point was to get people thinking about where statistics is going as a field.
I had a couple of thoughts. First, rather than statistics version 2.0, we’re looking at statistics version 103.543; statistics is well past version 2. Second, the big data conversation focuses on leveraging massive integrated datasets built, for the most part, by corporations and governments. That focus ignores the huge number of “small” datasets generated by individuals and small organizations. Not only are datasets getting larger, but the tools to generate digital data are also more widely available.
My question is: how can statisticians help the citizen, business owner, or community leader understand and make use of the “small” data they have?
UNIREP Anova Expected Mean Squares
“It is hoped that the material here will be sufficiently illustrative to show what is involved generally, and also to enable the reader to decide whether he wants to be involved generally” (Crowder and Hand)
When your statistics professor says to the class, “You should know how to derive these results (but you won’t be tested on this)” think to yourself: “I could do that.” Be satisfied at that point. Or be satisfied that there is likely to be one idiot in the class (like me) who takes the declaration seriously.
In today’s post, I will show how to derive the expected mean squares of the univariate repeated measures ANOVA model. I’ll start with the between-groups sum of squares. I’ll also end there, as I would like to keep my hair from turning gray faster than it already is.
First, the preliminaries must be hashed out. Assume a linear model for observation \(Y_{hlj}\) of individual \(h\) \((1, \dots, r_l)\) in group \(l\) \((1, \dots, q)\) at time (or repeated measure) \(j\) \((1, \dots, n)\) such that:
\[Y_{hlj} = \mu + \tau_l + \gamma_j + (\tau\gamma)_{lj} + b_{hl} + e_{hlj}\]
The parameters are:
- \(\mu\) is the overall mean.
- \(\tau_l\) is the deviation from \(\mu\) associated with group \(l\).
- \(\gamma_j\) is the deviation from \(\mu\) associated with time \(j\).
- \((\tau\gamma)_{lj}\) is the deviation associated with group \(l\) at time \(j\) (i.e., the interaction of time and group).
- \(b_{hl}\) is the random effect associated with unit \(h\) in group \(l\).
- \(e_{hlj}\) represents the within-unit sources of variation.
Additional assumptions:
- \(b_{hl} \sim N(0, \sigma^2_b)\) and all independent.
- \(e_{hlj} \sim N(0, \sigma^2_e)\) and all independent.
- \(b_{hl}\) and \(e_{hlj}\) are mutually independent.
- To force identifiability of the parameter estimates, we impose these constraints:
\(\sum_{l=1}^q r_l\tau_l = 0\), \(\sum_{j=1}^n \gamma_j = 0\), \(\sum_{l=1}^q (\tau\gamma)_{lj} = 0 = \sum_{j=1}^n (\tau\gamma)_{lj}\). (The group constraint is written in weighted form; with equal group sizes it reduces to the usual \(\sum_{l=1}^q \tau_l = 0\), and the weighting is what makes the unbalanced case below work out.)
The distributional assumptions imply a compound symmetric covariance structure for \(\mathbf{Y}_{hl}\). That is, \(\text{Cov}(\mathbf{Y}_{hl}) = \sigma^2_b\mathbf{J}_n + \sigma^2_e\mathbf{I}_n\), where \(\mathbf{J}_n\) is an \(n \times n\) matrix of ones and \(\mathbf{I}_n\) is an \(n \times n\) identity matrix.
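To make the setup concrete, here is a minimal simulation sketch of this model in Python/NumPy. The design, effect values, and variance components are made-up illustrative numbers, not anything from Crowder and Hand:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up illustrative design: q = 3 groups, n = 4 repeated measures
r = np.array([5, 7, 6])          # group sizes r_l; m = r.sum() subjects total
n = 4
mu = 10.0
sigma2_b, sigma2_e = 2.0, 1.0    # variance components

# Group effects satisfying the weighted constraint sum_l r_l * tau_l = 0
tau = np.array([1.0, 0.0, -1.0])
tau = tau - np.average(tau, weights=r)

# Time effects satisfying sum_j gamma_j = 0 (interaction taken as zero here)
gamma = np.array([0.5, -0.5, 1.0, -1.0])
gamma = gamma - gamma.mean()

def simulate_subject(l):
    """One subject's n observations: Y_hl = mu + tau_l + gamma + b_hl + e_hl."""
    b = rng.normal(0.0, np.sqrt(sigma2_b))          # between-subject effect
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n)  # within-subject errors
    return mu + tau[l] + gamma + b + e

# The implied covariance of Y_hl is compound symmetric:
# sigma2_b * J_n + sigma2_e * I_n
print(sigma2_b * np.ones((n, n)) + sigma2_e * np.eye(n))

# Empirical check: covariance across many simulated subjects from group 0
sample = np.array([simulate_subject(0) for _ in range(50_000)])
print(np.cov(sample, rowvar=False).round(2))
```

The empirical covariance should show the common within-subject correlation that compound symmetry implies: one value on the diagonal, another everywhere else.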
Let \(m = \sum_{l=1}^q r_l\) denote the total number of subjects. Define \(\bar{Y}_{.l.} = \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n Y_{hlj}\) as the sample mean for group \(l\) and \(\bar{Y}_{...} = \frac{1}{mn}\sum_{l=1}^{q} \sum_{h=1}^{r_l} \sum_{j=1}^n Y_{hlj}\) as the overall sample mean. Now we can get started.
\[\begin{aligned}
E(MS_G) = & E\left(\frac{SS_G}{q-1}\right) = \frac{1}{q-1} E\left[\sum_{l=1}^q n r_l(\bar{Y}_{.l.} - \bar{Y}_{...})^2\right] \\
= & \frac{n}{q-1} \sum_{l=1}^q r_l E\left[(\bar{Y}_{.l.} - \bar{Y}_{...})^2\right]\\
= & \frac{n}{q-1} \sum_{l=1}^q r_l E\left[ \left( \sum_{h=1}^{r_l} \sum_{j=1}^n (\mu + \tau_l + \gamma_j + (\tau\gamma)_{lj} + b_{hl} + e_{hlj})/r_ln \right.\right.\\
& \qquad \left. \left. {} - \sum_{l=1}^{q} \sum_{h=1}^{r_l} \sum_{j=1}^n (\mu + \tau_l + \gamma_j + (\tau\gamma)_{lj} + b_{hl} + e_{hlj})/mn\right)^2\right] \\
= & \frac{n}{q-1} \sum_{l=1}^q r_l E\left[ \left(\tau_l + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} \right.\right.\\
& \qquad \left. \left. {} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right)^2 \right]
\end{aligned}\]
To see how \(\mu\), \(\gamma_j\), and \((\tau\gamma)_{lj}\) fall out, remember the constraints! For example, \[\sum_{h=1}^{r_l} \sum_{j=1}^n \frac{\tau_l}{r_ln} - \sum_{l=1}^{q} \sum_{h=1}^{r_l} \sum_{j=1}^n \frac{\tau_l}{mn} = \tau_l - \sum_{l=1}^{q} \frac{r_l\tau_l}{m} = \tau_l - 0 = \tau_l,\] where the last equality uses the weighted constraint \(\sum_{l=1}^q r_l\tau_l = 0\).
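The time effects drop out the same way: by the constraint \(\sum_{j=1}^n \gamma_j = 0\), every average of \(\gamma_j\) over \(j\) vanishes, so
\[\sum_{h=1}^{r_l} \sum_{j=1}^n \frac{\gamma_j}{r_ln} - \sum_{l=1}^{q} \sum_{h=1}^{r_l} \sum_{j=1}^n \frac{\gamma_j}{mn} = \frac{1}{n}\sum_{j=1}^n \gamma_j - \frac{1}{n}\sum_{j=1}^n \gamma_j = 0.\]
The interaction terms vanish by the same argument applied to \(\sum_{j=1}^n (\tau\gamma)_{lj} = 0\).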
Recall the following: \(E[X^2] = E[X]^2 + V(X)\); \(E(b_{hl}) = 0\) and \(E(e_{hlj}) = 0\); \(E(\tau_l) = \tau_l\) and \(V(\tau_l) = 0\), since \(\tau_l\) is a constant; and \(b_{hl}\) is independent of \(e_{hlj}\), so \(Cov(b_{hl}, e_{hlj}) = 0\).
\[\begin{aligned}
E(MS_G) = & \frac{n}{q-1} \sum_{l=1}^q r_l\left\{ \left[E\left(\tau_l + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right)\right]^2 \right.\\
& \qquad \qquad \left. {} + V\left(\tau_l + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} + \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right) \right\} \\
= & \frac{n}{q-1} \sum_{l=1}^q r_l\left\{ \tau_l^2 + V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) \right.\\
& \qquad \qquad \qquad \left.{} + V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right) \right\}
\end{aligned}\]
Let’s take the variance terms one at a time. (We can split the variance into a \(b\) piece and an \(e\) piece because \(b_{hl}\) and \(e_{hlj}\) are independent, so the covariance between the two contrasts is zero.)
\[\begin{aligned}
& V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) \\
= & V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) + V\left(\frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) - 2Cov\left( \frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}, \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) \\
= & \frac{1}{r_l^2}\sum_{h=1}^{r_l} V(b_{hl}) + \frac{1}{m^2}\sum_{l=1}^{q}\sum_{h=1}^{r_l} V(b_{hl}) - \frac{2}{r_lm}Cov\left( \sum_{h=1}^{r_l} b_{hl}, \sum_{l=1}^{q}\sum_{h=1}^{r_l} b_{hl}\right) \\
= & \frac{1}{r_l^2}\sum_{h=1}^{r_l} V(b_{hl}) + \frac{1}{m^2}\sum_{l=1}^{q}\sum_{h=1}^{r_l} V(b_{hl}) - \frac{2}{r_lm}V\left( \sum_{h=1}^{r_l} b_{hl}\right) \\
= & \frac{1}{r_l^2}\sum_{h=1}^{r_l} V(b_{hl}) + \frac{1}{m^2}\sum_{l=1}^{q}\sum_{h=1}^{r_l} V(b_{hl}) - \frac{2}{r_lm}\sum_{h=1}^{r_l} V(b_{hl}) \\
= & \frac{1}{r_l}\sigma_b^2 + \frac{1}{m} \sigma_b^2 - 2\frac{1}{m}\sigma_b^2 \\
= & \sigma_b^2\left(\frac{1}{r_l} - \frac{1}{m} \right)
\end{aligned}\]
The fourth line in the above section follows from the assumption that the units are independent, so \(Cov(b_{hl}, b_{h'l'}) = 0\) whenever the indices are not equal.
Following a similar process, we can show:
\[V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right) = \sigma_e^2\left(\frac{1}{r_ln} - \frac{1}{mn} \right)\]
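This one can be checked quickly by simulation. Here is a sketch under the same made-up design as above (group sizes and \(\sigma_e^2\) are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(7)
r, n, sigma2_e = np.array([5, 7, 6]), 4, 1.0
m, reps = r.sum(), 50_000
l = 0  # check the contrast for the first group

# All errors at once: reps x m subjects x n times; group l is the first r[l] rows
e = rng.normal(0.0, np.sqrt(sigma2_e), size=(reps, m, n))
contrast = e[:, :r[l], :].mean(axis=(1, 2)) - e.mean(axis=(1, 2))

print(contrast.var())                              # Monte Carlo estimate
print(sigma2_e * (1 / (r[l] * n) - 1 / (m * n)))   # claimed value
```

The two printed values should agree to within Monte Carlo error.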
Now, let’s go back to where we left off.
\[\begin{aligned}
E(MS_G) & = \frac{n}{q-1} \sum_{l=1}^q r_l\left\{ \tau_l^2 + V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n b_{hl}\right) \right.\\
& \qquad \qquad \qquad \left.{} + V\left(\frac{1}{r_ln}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj} - \frac{1}{mn}\sum_{l=1}^{q}\sum_{h=1}^{r_l} \sum_{j=1}^n e_{hlj}\right) \right\} \\
& = \frac{n}{q-1} \sum_{l=1}^q r_l \tau_l^2 + \frac{n\sigma_b^2}{q-1} \sum_{l=1}^q r_l \left(\frac{1}{r_l} - \frac{1}{m} \right) + \frac{n\sigma_e^2}{q-1} \sum_{l=1}^q r_l \left(\frac{1}{r_ln} - \frac{1}{mn} \right) \\
& = \frac{n}{q-1} \sum_{l=1}^q r_l \tau_l^2 + \frac{n\sigma_b^2}{q-1}\left( \sum_{l=1}^q 1 - \frac{\sum_{l=1}^q r_l}{m} \right) + \frac{\sigma_e^2}{q-1} \left( \sum_{l=1}^q 1 - \frac{\sum_{l=1}^q r_l}{m} \right) \\
& = \frac{n}{q-1} \sum_{l=1}^q r_l \tau_l^2 + \frac{n\sigma_b^2}{q-1}\left( q - 1 \right) + \frac{\sigma_e^2}{q-1} \left(q - 1 \right) \\
& = \frac{n}{q-1} \sum_{l=1}^q r_l \tau_l^2 + n\sigma_b^2 + \sigma_e^2
\end{aligned}\]
And there we are.
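As a final sanity check, here is a short Monte Carlo sketch in Python/NumPy (same made-up design as earlier) comparing the simulated average of \(MS_G\) to the derived expectation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up design: q = 3 groups, n = 4 repeated measures
r = np.array([5, 7, 6])
q, n, m = len(r), 4, r.sum()
mu, sigma2_b, sigma2_e = 10.0, 2.0, 1.0

tau = np.array([1.0, 0.0, -1.0])
tau = tau - np.average(tau, weights=r)   # weighted constraint sum_l r_l tau_l = 0

def ms_group():
    """One replicate of MS_G = sum_l n r_l (Ybar_.l. - Ybar_...)^2 / (q - 1)."""
    ybar_l = np.empty(q)
    total, count = 0.0, 0
    for l in range(q):
        b = rng.normal(0.0, np.sqrt(sigma2_b), size=r[l])       # subject effects
        e = rng.normal(0.0, np.sqrt(sigma2_e), size=(r[l], n))  # within-subject errors
        # Time and interaction effects omitted: by the constraints they
        # cancel out of the group and grand means, so they don't affect MS_G.
        y = mu + tau[l] + b[:, None] + e
        ybar_l[l] = y.mean()
        total += y.sum()
        count += y.size
    ybar = total / count
    return (n * r * (ybar_l - ybar) ** 2).sum() / (q - 1)

ms = np.array([ms_group() for _ in range(20_000)])
expected = n / (q - 1) * (r * tau ** 2).sum() + n * sigma2_b + sigma2_e
print(ms.mean(), expected)   # the two should agree closely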