Project Think Tank

Statistics in R - 8. One-Way ANOVA

One-Way Analysis of Variance

AKA: One-Way ANOVA

I. Purpose:

A one-way analysis of variance compares the means of two or more unrelated (independent) groups of data. The null hypothesis of a one-way ANOVA states that there are no differences between the group means.

The ANOVA uses the F distribution, which is directly related to the t distribution: the square of a t statistic with k degrees of freedom follows an F distribution with 1 and k degrees of freedom. However, the F test gives us more flexibility in terms of how many group means we can compare. The t test is limited to two groups, while the ANOVA can evaluate two or more groups.
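The link between the two statistics can be checked numerically. The sketch below uses two small made-up groups (hypothetical values, not data from this post) and assumes equal variances in both tests, so the squared t statistic should match the F statistic.

```r
# Hypothetical scores for two independent groups (illustration only)
g1 <- c(5, 7, 8, 6, 9)
g2 <- c(3, 4, 6, 5, 4)

# Long format: one vector of values, one factor of group labels
scores <- c(g1, g2)
group  <- factor(rep(c("g1", "g2"), each = 5))

# Pooled-variance t test (var.equal = TRUE matches the classical ANOVA assumption)
t.stat <- unname(t.test(g1, g2, var.equal = TRUE)$statistic)

# Classical one-way ANOVA on the same two groups
f.stat <- summary(aov(scores ~ group))[[1]][["F value"]][1]

t.stat^2   # squared t statistic
f.stat     # F statistic; equal to t.stat^2 up to rounding
```

With more than two groups there is no single t statistic to square, which is where the F test becomes necessary.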

The ANOVA can only describe whether there is a difference between the groups, not where the difference lies. In an ANOVA run on two groups, any difference must be between those two groups. However, when the ANOVA is run on three or more groups, it is not obvious whether the difference is between the first and second groups, the second and third groups, the first and third groups, or if there are differences present between all groups. Because of this, post hoc tests are run after the ANOVA to determine where the difference is.

In this analysis, we will be using the Tukey HSD (honestly significant difference) post hoc test. The Tukey HSD test is used for each pair (so for three groups there will be three tests run) to determine whether there is a difference in the pair. The null hypothesis of the Tukey HSD test is that there is no difference between the two groups being analyzed.

II. Formula:

One-Way ANOVA

Tukey HSD

Where Ma is the larger mean and Mb is the smaller mean of the two groups that are being compared. SE is standard error.
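For reference, the two statistics named above are commonly written as follows. This is a standard textbook form, not necessarily the exact notation the authors used: k is the number of groups, n_j the size of group j, N the total sample size, and n the per-group sample size (assumed equal across groups for the standard error).

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}, \qquad
MS_{\text{between}} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2}{k - 1}, \qquad
MS_{\text{within}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{N - k}

q = \frac{M_a - M_b}{SE}, \qquad
SE = \sqrt{\frac{MS_{\text{within}}}{n}}
```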

III. Code in R:

stacked<- stack(mydata)

Stacks the data (one column per group) into two columns, values and ind, so it can be used in the one-way test.

aov(VAR1 ~ VAR2, data=stacked)

This code runs the one-way ANOVA, where VAR1 is the outcome variable and VAR2 is the grouping variable.

aov.out = aov(VAR1 ~ VAR2, data=stacked)

This code saves the output so we can run post hoc tests.

TukeyHSD(aov.out)

This code runs the Tukey HSD between all groups.
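To make the shape of the stacked data concrete, here is a minimal sketch of the whole sequence on a small made-up data frame. The column names A, B, and C and their values are hypothetical; the point is that after stack(), the outcome lives in `values` and the group label in `ind`.

```r
# Hypothetical wide-format data: one column per group (illustration only)
mydata <- data.frame(A = c(4, 5, 6, 5),
                     B = c(7, 8, 6, 9),
                     C = c(10, 9, 11, 12))

# stack() returns a long data frame with columns `values` and `ind`
stacked <- stack(mydata)
names(stacked)      # "values" "ind"
nrow(stacked)       # 12 rows: 3 groups x 4 observations

# With stacked data, VAR1 is `values` and VAR2 is `ind`
aov.out <- aov(values ~ ind, data = stacked)
summary(aov.out)    # ANOVA table with the F statistic and p-value

# One row per pair of groups: B-A, C-A, C-B
tukey <- TukeyHSD(aov.out)
rownames(tukey$ind)
```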

IV. Scenario:

Dr. Blank wants to examine the differences in amount of time spent studying between his undergraduate, master's, and doctoral students. He asks them to report the average number of hours spent studying per week over the semester. Using this data, he wants to run a one way ANOVA to determine if there is a difference between the three groups of students. If there is a difference, he wants to run a Tukey HSD post hoc test to determine where the differences are.

Hypotheses being tested:

H0 = Groups will all be the same.

μ undergrad = μ master = μ doctoral

H1 = At least two of the groups will be different, so at least one of the following holds:

μ undergrad ≠ μ master

μ master ≠ μ doctoral

μ undergrad ≠ μ doctoral

Instructions

1. Open onewayaov.csv file in R

onewayaov <- read.csv("onewayaov.csv", header=TRUE)

2. View data in R

onewayaov

3. Run descriptive statistics

summary(onewayaov$Undergrad)

summary(onewayaov$Master)

summary(onewayaov$Doctoral)

4. Run descriptive statistics for standard deviation

sd(onewayaov$Undergrad)

sd(onewayaov$Master)

sd(onewayaov$Doctoral)

5. Stack the data

stacked<-stack(onewayaov)

6. Define the variables

hours <- stacked$values

student <- stacked$ind

7. Run a one-way ANOVA (by default, oneway.test does not assume equal variances and applies Welch's correction)

oneway.test(hours ~ student)

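8. Fit the model with aov and save the output for the post hoc test (aov assumes equal variances, so its F statistic can differ from the one in step 7)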

aov.out = aov(hours ~ student, data=stacked)

9. Run a Tukey HSD test

TukeyHSD(aov.out)
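Steps 7 and 8 use two different functions, and they are not guaranteed to report the same F statistic. The sketch below illustrates this on made-up study hours (hypothetical values, not the contents of onewayaov.csv): oneway.test with var.equal = TRUE reproduces the classical F from aov, while the default call gives Welch's version.

```r
# Hypothetical study hours per week (made-up values for illustration)
demo <- data.frame(Undergrad = c(10, 12, 9, 14, 11),
                   Master    = c(13, 15, 12, 16, 14),
                   Doctoral  = c(18, 20, 17, 22, 19))
stacked.demo <- stack(demo)

# Default: Welch's ANOVA, whose denominator df is usually not a whole number
welch <- oneway.test(values ~ ind, data = stacked.demo)
welch$parameter

# Classical ANOVA via oneway.test, assuming equal variances
classic <- oneway.test(values ~ ind, data = stacked.demo, var.equal = TRUE)

# Classical ANOVA via aov, the model that TukeyHSD() works from
aov.demo <- aov(values ~ ind, data = stacked.demo)
f.aov <- summary(aov.demo)[[1]][["F value"]][1]

unname(classic$statistic)  # matches f.aov
f.aov
```

A fractional denominator degrees of freedom in a reported F statistic is a sign that the Welch version was used.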

V. Results Write-Up

One-Way Analysis of Variance

A one-way analysis of variance with Welch's correction for unequal variances (the default in R's oneway.test) was run to determine whether there was a difference in the amount of time studying in hours by undergraduate, master's, and doctoral students. There was a significant difference found among undergraduate, master's, and doctoral students in how much time they spent studying, F(2, 17.89) = 10.95, p < .01.

Tukey HSD Test

A Tukey HSD test was run to determine where the difference was in study hours among undergraduate, master's, and doctoral students. There was a significant difference between undergraduate and doctoral students (p < .01) and between master's and doctoral students (p = .04) in number of hours studied. No significant difference was found between undergraduate and master's students (p = .06) in number of hours studied.

Reference/Citation:

Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.

Statistics in R - 7. Chi-Square Test

Chi-Square Test

I. Purpose:

The chi-square test is used to determine whether there is a difference in frequencies based on group membership.

II. Formula:
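The usual form of the statistic for an r x c contingency table is given here for reference. The notation is a common textbook convention and an assumption on our part: O is an observed count, E the count expected under independence, and N the total number of observations.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \, (\text{column } j \text{ total})}{N}, \qquad
df = (r - 1)(c - 1)
```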

III. Code in R:

chisq.test(table(OBJECT))

Where OBJECT is previously defined.
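The call above relies on table() to turn raw observations into a contingency table. A minimal sketch of that step follows, using made-up responses; the column names group and answer and all counts are hypothetical. It also shows the equivalent call when the counts are already summarized.

```r
# Hypothetical raw responses, one row per person (illustration only)
OBJECT <- data.frame(
  group  = rep(c("A", "B"), times = c(30, 40)),
  answer = c(rep(c("yes", "no"), times = c(18, 12)),
             rep(c("yes", "no"), times = c(22, 18)))
)

# table() cross-tabulates the two columns into a 2 x 2 contingency table
counts <- table(OBJECT)
dim(counts)           # 2 2

# Test of independence; a 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
result <- chisq.test(counts)
result$parameter      # df
result$expected       # expected counts under independence

# The same test from already-summarized counts, entered as a matrix
summarized <- matrix(c(12, 18, 18, 22), nrow = 2,
                     dimnames = list(group = c("A", "B"),
                                     answer = c("no", "yes")))
chisq.test(summarized)$statistic
```

If the degrees of freedom reported by chisq.test do not equal (rows - 1) * (columns - 1), the table passed in probably does not have the intended shape.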

IV. Scenario:

Professor Jackson's undergraduate political science class is learning about voter turnout with relation to political party alignment. She takes a poll in her class, asking students to write down their political party (Republican or Democrat) and whether or not they voted in the last election. She wants you to run a chi-square test to determine whether there is a significant difference in voter turnout between Republicans and Democrats. This is her data:

Political Party	Vote	No Vote
Republican	21	12
Democrat	32	15

Use R to run a chi-square test to determine whether the voter turnout was significantly different between the two political parties for Professor Jackson.
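As a check on what to expect, the counts above can be entered directly as a 2 x 2 matrix. The sketch below is one possible way to do this, not necessarily how chisquare.csv is laid out: it computes the expected counts by hand and compares a hand-computed statistic with chisq.test, which applies Yates' continuity correction to 2 x 2 tables by default.

```r
# Counts from the scenario table, entered as a 2 x 2 matrix
votes <- matrix(c(21, 32, 12, 15), nrow = 2,
                dimnames = list(Party = c("Republican", "Democrat"),
                                Turnout = c("Vote", "No Vote")))

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(votes), colSums(votes)) / sum(votes)
expected

# Pearson chi-square with Yates' continuity correction,
# which chisq.test() applies by default to 2 x 2 tables
chi.manual <- sum((abs(votes - expected) - 0.5)^2 / expected)

result <- chisq.test(votes)
chi.manual            # hand computation
result$statistic      # same value from chisq.test()
result$parameter      # df = (2 - 1) * (2 - 1) = 1
```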

Instructions

1. Open chisquare.csv file in R

chisquare <- read.csv("chisquare.csv", header=TRUE)

2. View data in R

chisquare

3. Run descriptive statistics for data

summary(chisquare)

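4. Run chi-square test

chisq.test(table(chisquare))

Note: table() cross-tabulates raw observations, so this call assumes chisquare.csv has one row per student, with one column for political party and one for voting. If the file instead holds the summarized counts shown in the scenario, table() does not rebuild the 2 x 2 table, and the counts should be passed as a matrix:

chisq.test(matrix(c(21, 32, 12, 15), nrow=2))

The warning message “Chi-squared approximation may be incorrect” is due to having small expected values in the analysis, which can create incorrect p-values. With the counts above, every expected value is greater than 5, so this warning should not appear; if it does, check how the table was built.

V. Results Write-Up

Chi-square

A chi-square test was run to determine whether there was a difference in voter turnout with regard to political party alignment. No significant difference in voter turnout was found based on group membership of either Republican or Democrat, χ2(1) = 0.03, p = .86 (with the continuity correction that chisq.test applies to 2 x 2 tables by default).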

Reference/Citation:

Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.

Statistics in R - 6. Pearson’s r Test

Pearson’s r Test

I. Purpose:

The Pearson statistic is used to determine whether there is a linear relationship between two continuous variables. This relationship is reported as a number between -1.0 and 1.0. The strength of the relationship is represented by the absolute value of the statistic, with higher numbers representing stronger relationships.

This linear relationship can also be positive or negative. A positive relationship occurs when one variable increases while the other variable increases; this relationship is represented by a positive number. A negative relationship occurs when one variable increases while the other variable decreases; this relationship is represented by a negative number.

II. Formula:
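The statistic is commonly defined as below, where x and y are the two variables, n is the number of paired observations, and the bars denote sample means. The notation is a standard convention and an assumption on our part.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```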

III. Code in R:

cor.test(table$VAR1, table$VAR2)

Where table refers to the dataset, and ‘VAR1’ and ‘VAR2’ refer to the variables being investigated.

IV. Scenario:

Mr. Demagio, who owns an ice cream parlor, would like to find out if there is a relationship between the day’s temperature and the number of ice cream sales on a given day. Mr. Demagio wants to know if it might be profitable to keep his store open for longer hours on warmer days, but first he wants to establish a relationship between temperature and sales. From the previous few months, he selected twenty days of data with both the day’s temperature and the number of ice creams his parlor sold.

Ice Cream	Temperature
118	88
140	94
126	97
97	96
100	95
147	101
113	98
100	87
103	85
129	95
111	93
148	104
68	77
73	82
130	99
90	85
67	82
137	96
100	89
109	92

Use R to run a Pearson statistic to determine if there is a relationship between the temperature and the number of ice cream sales.

Instructions

1. Open pearson.csv file in R

pearson <- read.csv("pearson.csv", header=TRUE)

2. View data in R

pearson

3. Run Pearson’s r

cor.test(pearson$icecream, pearson$temperature)
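To see what cor.test computes, the coefficient can be rebuilt from deviations around each mean. The sketch below types in the twenty days from the scenario; how the ice cream and temperature values pair up is an assumption based on the table layout, and the vector names are chosen here for illustration.

```r
# Values typed in from the scenario table; the pairing of ice cream
# sales with temperatures is an assumption about the table layout
icecream <- c(118, 140, 126, 97, 100, 147, 113, 100, 103, 129,
              111, 148, 68, 73, 130, 90, 67, 137, 100, 109)
temperature <- c(88, 94, 97, 96, 95, 101, 98, 87, 85, 95,
                 93, 104, 77, 82, 99, 85, 82, 96, 89, 92)

# Deviations from each mean
dx <- icecream - mean(icecream)
dy <- temperature - mean(temperature)

# Pearson's r: sum of cross-products over the root of the product of sums of squares
r.manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r.manual                                   # hand computation, about 0.86
cor(icecream, temperature)                 # built-in, same value
cor.test(icecream, temperature)$estimate   # estimate reported by cor.test
```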

V. Results Write-Up

Pearson’s r

Pearson’s r was used to find if there was a linear relationship between the day’s temperature and ice cream sales. A significant relationship was found between temperature and ice cream sales, r = .86, p < .001. As the temperature rises, ice cream sales tend to increase as well.

Reference/Citation:

Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.

Thursday, July 2, 2015
