Thursday, July 2, 2015

Statistics in R - 8. One-Way ANOVA

One-Way Analysis of Variance
AKA: One-Way ANOVA


I.  Purpose:
A one-way analysis of variance (ANOVA) compares the means of two or more unrelated (independent) groups of data. The null hypothesis of a one-way ANOVA states that the group means are all equal.
The ANOVA uses the F distribution, which is directly related to the t distribution: the square of a t statistic with v degrees of freedom follows an F distribution with 1 and v degrees of freedom. The F test, however, gives us more flexibility in how many group means we can compare. The t test is limited to two groups, while the ANOVA can evaluate two or more groups.
The ANOVA can only tell us whether there is a difference among the groups, not where that difference lies. In an ANOVA run on two groups, any difference is obviously between those two groups. However, when the ANOVA is run on three or more groups, it is not obvious whether the difference is between the first and second groups, the second and third groups, the first and third groups, or between all of the groups. Because of this, post hoc tests are run after the ANOVA to determine where the difference is.
In this analysis, we will be using the Tukey HSD (honestly significant difference) post hoc test. The Tukey HSD test is run on each pair of groups (so for three groups there will be three tests) to determine whether the two groups in the pair differ. The null hypothesis of each Tukey HSD test is that there is no difference between the two group means being compared.
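The link between the t and F statistics can be checked directly in R. The sketch below uses two small made-up groups (the values and the names group_a, group_b, and long are hypothetical, chosen only for illustration) and compares a pooled-variance t test with a classic one-way ANOVA on the same data.

```r
# Hypothetical study-hours data for two groups (values made up for illustration).
group_a <- c(10, 12, 9, 14, 11, 13)
group_b <- c(15, 17, 14, 18, 16, 20)

# Long format: one column of values, one factor of group labels.
long <- data.frame(
  hours = c(group_a, group_b),
  group = factor(rep(c("a", "b"), each = 6))
)

# Pooled-variance (Student's) t test; var.equal = TRUE matches the classic ANOVA assumption.
t_out <- t.test(hours ~ group, data = long, var.equal = TRUE)

# Classic one-way ANOVA on the same two groups.
f_out <- summary(aov(hours ~ group, data = long))[[1]]

t_stat <- unname(t_out$statistic)
f_stat <- f_out[["F value"]][1]

t_stat^2  # squared t statistic
f_stat    # F statistic; equal to t_stat^2 up to floating-point error
```

With only two groups the two tests also return the same p-value, which is why the ANOVA can be seen as a generalization of the two-group t test.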


II.  Formula:
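One-Way ANOVA
$$F = \frac{MS_{between}}{MS_{within}}$$
    Where MS between is the between-groups mean square (the between-groups sum of squares divided by k - 1, for k groups) and MS within is the within-groups mean square (the within-groups sum of squares divided by N - k, for N total observations).
Tukey HSD
$$q = \frac{M_a - M_b}{SE}$$
    Where Ma is the larger mean and Mb is the smaller mean of the two groups that are being compared. SE is standard error.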
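The one-way ANOVA F statistic is the ratio of the between-groups mean square to the within-groups mean square. As a rough sketch, it can be computed by hand and checked against aov(). The data and the names scores, values, and groups below are hypothetical and only serve to illustrate the arithmetic.

```r
# Hypothetical hours studied per week (values made up for illustration).
scores <- list(
  g1 = c(10, 12, 9, 14, 11),
  g2 = c(13, 15, 12, 16, 14),
  g3 = c(18, 20, 17, 22, 19)
)

values <- unlist(scores, use.names = FALSE)
groups <- factor(rep(names(scores), times = lengths(scores)))

k <- nlevels(groups)        # number of groups
n_total <- length(values)   # total number of observations
grand_mean <- mean(values)

group_means <- tapply(values, groups, mean)
group_sizes <- tapply(values, groups, length)

# Between-groups and within-groups sums of squares.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((values - group_means[as.character(groups)])^2)

# Mean squares and the F ratio.
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n_total - k)
f_manual   <- ms_between / ms_within

# Compare with R's built-in ANOVA table.
f_builtin <- summary(aov(values ~ groups))[[1]][["F value"]][1]
c(manual = f_manual, builtin = f_builtin)
```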


III.  Code in R:
stacked<- stack(mydata)
  Stacks the data so it can be used in the one way test
aov(VAR1 ~ VAR2, data=stacked)
  This code runs the one way anova.
aov.out = aov(VAR1 ~ VAR2, data=stacked)
  This code saves the output so we can run post hoc tests.
TukeyHSD(aov.out)
  This code runs the Tukey HSD between all groups.


IV.  Scenario:
Dr. Blank wants to examine the differences in amount of time spent studying between his undergraduate, master's, and doctoral students. He asks them to report the average number of hours spent studying per week over the semester. Using this data, he wants to run a one way ANOVA to determine if there is a difference between the three groups of students. If there is a difference, he wants to run a Tukey HSD post hoc test to determine where the differences are.


Hypotheses being tested:
H0: All group means are the same.
μ undergrad = μ master = μ doctoral
H1: At least two of the group means are different.
μ undergrad ≠ μ master, or
μ master ≠ μ doctoral, or
μ undergrad ≠ μ doctoral


Instructions
1.  Open onewayaov.csv file in R
onewayaov <- read.csv("onewayaov.csv", header=TRUE)


2.  View data in R
onewayaov


3.  Run descriptive statistics
summary(onewayaov$Undergrad)
summary(onewayaov$Master)
summary(onewayaov$Doctoral)


4.  Run descriptive statistics for standard deviation
sd(onewayaov$Undergrad)
sd(onewayaov$Master)
sd(onewayaov$Doctoral)


5. Stack the data
stacked<-stack(onewayaov)


6. Define the variables
hours <- stacked$values
student <- stacked$ind


7.  Run a one-way ANOVA (oneway.test() applies Welch's correction for unequal variances by default)
oneway.test(hours ~ student)


8. Fit the model with aov() and save the output for the post hoc test
aov.out = aov(hours ~ student, data=stacked)


9. Run a Tukey HSD test
TukeyHSD(aov.out)
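Since onewayaov.csv is not reproduced here, the following sketch runs the same steps on a made-up data frame with the same three column names. All values are hypothetical, so the output will not match the write-up below. It also shows that oneway.test() with var.equal = TRUE gives the same F as the aov() fit that TukeyHSD() relies on.

```r
# Hypothetical stand-in for onewayaov.csv (values made up for illustration).
onewayaov <- data.frame(
  Undergrad = c(8, 10, 7, 12, 9, 11, 10, 8),
  Master    = c(12, 14, 11, 15, 13, 16, 12, 14),
  Doctoral  = c(18, 20, 16, 22, 19, 21, 17, 20)
)

# Group means and standard deviations in one call each.
sapply(onewayaov, mean)
sapply(onewayaov, sd)

# Long format: 'values' holds the hours, 'ind' holds the group label.
stacked <- stack(onewayaov)

# Welch's ANOVA (the oneway.test() default, var.equal = FALSE).
welch <- oneway.test(values ~ ind, data = stacked)

# Classic ANOVA assuming equal variances; same F as the aov() fit below.
classic <- oneway.test(values ~ ind, data = stacked, var.equal = TRUE)

aov.out <- aov(values ~ ind, data = stacked)
summary(aov.out)            # ANOVA table with F and p
tukey <- TukeyHSD(aov.out)
tukey                       # pairwise differences with adjusted p-values
```

Passing data = stacked and using the column names directly avoids creating separate hours and student vectors, though either approach fits the same model.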


V.  Results Write-Up

One-Way Analysis of Variance
A one-way analysis of variance with Welch's correction for unequal variances (the oneway.test() default) was run to determine whether undergraduate, master's, and doctoral students differed in the number of hours they spent studying. A significant difference was found among undergraduate, master's, and doctoral students in how much time they spent studying, F(2, 17.89) = 10.95, p < .01.


Tukey HSD Test
A Tukey HSD test was run to determine where the difference was in study hours among undergraduate, master's, and doctoral students. There was a significant difference between undergraduate and doctoral students (p < .01) and between master's and doctoral students (p = .04) in number of hours studied. No significant difference was found between undergraduate and master's students (p = .06) in number of hours studied.




Reference/Citation:
Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.

Statistics in R - 7. Chi-Square Test

Chi-Square Test


I. Purpose:
The chi-square test is used to determine whether there is a difference in frequencies based on group membership.


II.  Formula:
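For reference, the standard Pearson chi-square statistic for a table with r rows and c columns can be written as follows. The symbols O, E, R, C, and N are notation assumed here: observed count, expected count, row total, column total, and grand total.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1)
```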


III.  Code in R:
chisq.test(table(OBJECT))
  Where OBJECT is previously defined.

IV.  Scenario:
Professor Jackson's undergraduate political science class is learning about voter turnout in relation to political party alignment. She takes a poll in her class, asking students to write down their political party (Republican or Democrat) and whether or not they voted in the last election. She wants you to run a chi-square test to determine whether there is a significant difference in voter turnout between Republicans and Democrats. This is her data:


Political Party    Vote    No Vote
Republican         21      12
Democrat           32      15


Use R to run a chi-square test to determine whether the voter turnout was significantly different between the two political parties for Professor Jackson.  


Instructions
1. Open chisquare.csv file in R
chisquare <- read.csv("chisquare.csv", header=TRUE)


2. View data in R
chisquare


3.  Run descriptive statistics for data
summary(chisquare)


4. Run chi-square test
chisq.test(table(chisquare))
Note: The warning message “Chi-squared approximation may be incorrect” appears when some expected cell counts are small (a common rule of thumb is below 5), which can make the p-value inaccurate.
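When only the summary counts are available, the test can also be run on a matrix of counts instead of a CSV file. The sketch below enters the counts from Professor Jackson's table by hand; the object name votes and the dimension labels are assumptions made for illustration.

```r
# Counts from Professor Jackson's table in the scenario.
votes <- matrix(
  c(21, 12,
    32, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(Party = c("Republican", "Democrat"),
                  Turnout = c("Vote", "No Vote"))
)

# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default.
result <- chisq.test(votes)

result$expected    # expected counts under independence (row total * column total / N)
result$parameter   # degrees of freedom: (2 - 1) * (2 - 1) = 1
result             # test statistic and p-value
```

Inspecting result$expected is a quick way to see whether the small-expected-count warning is a concern for a given table.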
V.  Results Write-Up


Chi-square
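A chi-square test was run to determine whether there was a difference in voter turnout with regard to political party alignment. No significant difference in voter turnout was found based on group membership of either Republican or Democrat, χ²(7) = 6.00, p = .54. Note that a 2 x 2 table like the one in the scenario has (2 - 1)(2 - 1) = 1 degree of freedom, so the 7 degrees of freedom reported here indicate that table(chisquare) tabulated more than two categories for at least one variable; check that chisquare.csv has one row per student, with one column for party and one for vote.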



Reference/Citation:
Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.

Statistics in R - 6. Pearson’s r Test

Pearson’s r Test


I.  Purpose:
The Pearson statistic is used to determine whether there is a linear relationship between two continuous variables. This relationship is reported as a number between -1.0 and 1.0. The strength of the relationship is represented by the absolute value of the statistic, with values closer to 1 representing stronger relationships.
This linear relationship can also be positive or negative. A positive relationship occurs when one variable increases while the other variable increases; this relationship is represented by a positive number. A negative relationship occurs when one variable increases while the other variable decreases; this relationship is represented by a negative number.


II.  Formula:
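For reference, the standard formula for Pearson's r for n paired observations is shown below. The symbols x, y, and their means are notation assumed here for the two variables.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```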


III.  Code in R:
cor.test(table$VAR1, table$VAR2)
  Where table refers to dataset, and ‘VAR1’ and ‘VAR2’ refer to the variables being investigated.


IV.  Scenario:
Mr. Demagio, who owns an ice cream parlor, would like to find out if there is a relationship between the day’s temperature and the number of ice cream sales on a given day. Mr. Demagio wants to know if it might be profitable to keep his store open for longer hours on warmer days, but first he wants to establish a relationship between temperature and sales. From the previous few months, he selected twenty days of data with both the day’s temperature and the number of ice creams his parlor sold.
Each ice cream count is paired with the temperature listed beside it.

Ice Cream    Temperature    Ice Cream    Temperature
118          88             111          93
140          94             148          104
126          97             68           77
97           96             73           82
100          95             130          99
147          101            90           85
113          98             67           82
100          87             137          96
103          85             100          89
129          95             109          92


Use R to run a Pearson statistic to determine if there is a relationship between the temperature and the number of ice cream sales.


Instructions
1. Open pearson.csv file in R
pearson <- read.csv("pearson.csv", header=TRUE)


2. View data in R
pearson


3.  Run Pearson’s r
cor.test(pearson$icecream, pearson$temperature)
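If pearson.csv is not at hand, the scenario data can be typed in directly. The sketch below assumes that each ice cream count pairs with the temperature in the same position of the scenario table (an assumption about the table layout), and it also computes r by hand from sums of deviations as a cross-check on cor.test().

```r
# Scenario data entered by hand; the pairing of sales with temperatures is an
# assumption about how the scenario table is laid out.
pearson <- data.frame(
  icecream    = c(118, 111, 140, 148, 126, 68, 97, 73, 100, 130,
                  147, 90, 113, 67, 100, 137, 103, 100, 129, 109),
  temperature = c(88, 93, 94, 104, 97, 77, 96, 82, 95, 99,
                  101, 85, 98, 82, 87, 96, 85, 89, 95, 92)
)

# Pearson correlation test (two-sided by default).
r_test <- cor.test(pearson$icecream, pearson$temperature)

# Manual computation of r from the deviations about each mean.
dx <- pearson$icecream - mean(pearson$icecream)
dy <- pearson$temperature - mean(pearson$temperature)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_test$estimate   # correlation coefficient from cor.test()
r_manual          # same value computed by hand
r_test$p.value    # two-sided p-value
```

Under that pairing assumption, the result should round to the r = .86 reported in the write-up.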


V.  Results Write-Up


Pearson’s r  
Pearson’s r was used to determine whether there was a linear relationship between the day’s temperature and ice cream sales. A significant positive relationship was found between temperature and ice cream sales, r = .86, p < .001. As the temperature rises, ice cream sales tend to increase as well.




Reference/Citation:
Caddick, Z., Leonard, M., and Laraway, S. Statistics in R. 2014.