Title: | Companion to "Statistics Using R: An Integrative Approach" |
---|---|
Description: | Access to the datasets and many of the functions used in "Statistics Using R: An Integrative Approach". These datasets include a subset of the National Education Longitudinal Study, the Framingham Heart Study, as well as several simulated datasets used in the examples throughout the textbook. The functions included in the package reproduce some of the functionality of 'Stata' that is not directly available in 'R'. The package also contains a tutorial on basic data frame management, including how to handle missing data. |
Authors: | Daphna Harel [cre, aut], Sharon Weinberg [ctb], Sarah Abramowitz [ctb] |
Maintainer: | Daphna Harel <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.4 |
Built: | 2025-02-01 03:11:46 UTC |
Source: | https://github.com/cran/sur |
This dataset is used to illustrate the importance of statistical display as an adjunct to summary statistics. Anscombe (1973) fabricated four different bivariate datasets such that, for all datasets, the respective X and Y means, X and Y standard deviations, and correlations, slopes, intercepts, and standard errors of estimate are equal. Accordingly, without a visual representation of these four panels, one might assume that the data values for all four datasets are the same. Scatterplots illustrate, however, the extent to which these datasets are different from one another.
Anscombe
A data frame with 11 rows and 8 variables:
values of X for the first dataset
values of Y for the first dataset
values of X for the second dataset
values of Y for the second dataset
values of X for the third dataset
values of Y for the third dataset
values of X for the fourth dataset
values of Y for the fourth dataset
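The point of the description above can be checked with a short sketch. It uses base R's `datasets::anscombe`, which stores the quartet in columns `x1`..`x4` and `y1`..`y4`; that these values match the package's Anscombe data frame (whose column names may differ) is an assumption.

```r
# Base R's datasets::anscombe holds Anscombe's quartet in columns x1..x4 and y1..y4.
# Assumption: these values match the Anscombe data frame documented above,
# whose column names may differ.
quartet = datasets::anscombe

# Compute the summary statistics that Anscombe matched across the four datasets.
summary_stats = sapply(1:4, function(i) {
  x = quartet[[paste0("x", i)]]
  y = quartet[[paste0("y", i)]]
  fit = lm(y ~ x)
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
})
colnames(summary_stats) = paste("dataset", 1:4)
print(round(summary_stats, 2))

# Scatterplots show how different the four datasets really are.
old_par = par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Dataset", i))
}
par(old_par)
```

The printed table should show nearly identical columns, while the four panels look nothing alike.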
The dataset consists of the heights and weights of the 20 scoring leaders, 10 each from the U.S. Women’s and Men’s National Basketball Association, for the 2014 – 2015 season. These data are taken from the ESPN website at espn.com.
Basketball
A data frame with 20 rows and 6 variables:
name of player
gender of player
height of player in inches
weight of player in pounds
number of games played
average total points scored per game
The data were collected to determine whether an increase in calcium intake reduces blood pressure among African-American adult males. The data are based on a sample of 21 African-American adult males selected randomly from the population of African-American adult males. Ten of the 21 men were randomly assigned to a treatment condition that required them to take a calcium supplement for 12 weeks. The remaining 11 men received a placebo for the 12 weeks. At both the beginning and the end of this time period, systolic blood pressure readings of all men were recorded. These data are adapted from the Data and Story Library (DASL) website.
Blood
A data frame with 21 rows and 4 variables:
case number
treatment condition
initial blood pressure
final blood pressure
Function to obtain a sampling distribution of means by bootstrapping.
boot.mean(x, B, n = length(x))
x |
original sample, given as a numeric or logical object, to be used to generate bootstrapped samples. |
B |
number of bootstrapped samples to be generated by randomly sampling with replacement. |
n |
size of each bootstrapped sample. Default setting is the size of the original sample. |
A list with components:
Replications |
number of bootstrapped means computed. |
mean |
mean of bootstrapped means. |
se |
standard error, estimated as the standard deviation of bootstrapped means. |
bootstrap.samples |
means of bootstrapped samples. |
# using simple vector
a = 1:10
set.seed(1234)
boot.mean(a, B = 500)

# using variable from data frame
set.seed(1234)
boot.mean(Framingham$AGE3, B = 1000)
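To make the resampling idea concrete, here is a minimal base-R sketch of a bootstrap of the mean. The helper `boot_mean_sketch` is hypothetical and only mirrors the documented arguments and return components; it is not the package's implementation.

```r
# Hypothetical helper for illustration; not the package's implementation.
boot_mean_sketch = function(x, B, n = length(x)) {
  # Draw B resamples of size n with replacement and record each resample's mean.
  boot_means = replicate(B, mean(sample(x, size = n, replace = TRUE)))
  list(Replications = B,
       mean = mean(boot_means),          # mean of the bootstrapped means
       se = sd(boot_means),              # SD of the bootstrapped means
       bootstrap.samples = boot_means)
}

set.seed(1234)
res = boot_mean_sketch(1:10, B = 500)
res$mean  # should be close to mean(1:10)
res$se    # bootstrap estimate of the standard error of the mean
```

Setting the seed first, as in the examples above, makes the resamples reproducible.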
The data are based on a study by Willerman et al. (1991) of the relationships between brain size, gender, and intelligence. The research participants consisted of 40 right-handed introductory psychology students with no history of alcoholism, unconsciousness, brain damage, epilepsy, or heart disease who were selected from a larger pool of introductory psychology students with total Scholastic Aptitude Test Scores higher than 1350 or lower than 940. The students in the study took four subtests (Vocabulary, Similarities, Block Design, and Picture Completion) of the Wechsler (1981) Adult Intelligence Scale-Revised. Among the students with Wechsler full-scale IQ’s less than 103, 10 males and 10 females were randomly selected. Similarly, among the students with Wechsler full-scale IQ’s greater than 130, 10 males and 10 females were randomly selected, yielding a randomized blocks design. MRI scans were performed at the same facility for all 40 research participants to measure brain size. The scans consisted of 18 horizontal MRI images. The computer counted all pixels with non-zero gray scale in each of the 18 images, and the total count served as an index for brain size. The dataset and description are adapted from the Data and Story Library (DASL) website.
Brainsz
A data frame with 40 rows and 7 variables:
case number
gender of student
full-scale IQ score based on WAIS-R
verbal IQ score based on WAIS-R
performance IQ score based on WAIS-R
pixel count from 18 MRI scans
group membership based on FSIQ score
This dataset contains simulated data for the figures accompanying Exercise 14.1 of Chapter 14. The data represent the results of a fictional study to determine whether there is a relationship between gender, teaching method, and achievement in reading. Each set of scores reflects a scenario with a different relationship among the variables.
Chapter14_Figures
A data frame with 12 rows and 7 variables:
individual's sex
reading achievement score for first scenario
teaching method
reading achievement score for second scenario
reading achievement score for third scenario
reading achievement score for fourth scenario
reading achievement score for fifth scenario
Returns, as a named vector, the cumulative percentage frequency distribution of a variable x at each unique value.
cumulative.table(x)
x |
object containing data for a single variable. |
If x contains NA values (missing data), the cumulative percentage table will not reach 100. The table will end with the cumulative percentage of non-missing data within the object; the value remaining after subtracting this value from 100 represents the percentage of NA values within the object.
A named numeric vector containing cumulative percentage frequencies, named by unique values of x and ordered numerically or alphabetically by name.
# using variable without NA values
cumulative.table(NELS$famsize)

# using variable with NA values
cumulative.table(NELS$parmarl8)
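The NA behaviour described above can be reproduced with a small base-R sketch. The helper `cumulative_percent_sketch` is hypothetical; it divides by the full length of x (missing values included), which is one way to obtain a table that stops short of 100 when NA values are present.

```r
# Hypothetical helper for illustration; not the package's implementation.
cumulative_percent_sketch = function(x) {
  counts = table(x)                 # table() drops NA values by default
  cumsum(counts) / length(x) * 100  # denominator counts NA values too
}

x = c(1, 2, 2, 3, NA)
cum_pct = cumulative_percent_sketch(x)
cum_pct  # ends at 80, so the remaining 20 percent of x is NA
```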
This dataset contains, for the smaller bill denominations, the value of the bill and the total value in circulation. The source for these data is The World Almanac and Book of Facts 2014.
Currency
A data frame with 5 rows and 3 variables:
denomination
total currency in circulation in U.S. dollars
total number of bills in circulation
A fabricated dataset constructed by Darlington (1990) to demonstrate the importance of including all relevant variables in an analysis. This dataset contains information about exercise, food intake, and weight loss for a fictional set of dieters.
Exercise
A data frame with 10 rows and 4 variables:
case number
average daily number of hours exercised in that week
average daily number of calories consumed in one particular week in excess of a baseline of 1,000 calories, measured in increments of 100 calories
number of pounds lost in that week
"Regression and linear models." Darlington, R. B. (1990, ISBN:978-0070153721)
This dataset contains simulated data for Exercise 14.5 of Chapter 14. The data represent the results of a fictional study in which a college professor examines the effect of the grade level of the students and the time of the course on how well undergraduate students at her college do in her course.
Exercise14_5
A data frame with 40 rows and 3 variables:
time of day student takes the course
year of college in which the student is enrolled
final exam score
This dataset contains simulated data for Figure 15.1 of Chapter 15.
Figure15_1
A list with 3 elements:
an integer-scaled independent variable
an integer-scaled outcome variable
frequency of value pair
This dataset contains simulated data for Figures 15.12 - 15.13 of Chapter 15.
Figure15_12
A data frame with 9 rows and 4 variables:
a numeric independent variable for Figure 15.12
a numeric outcome variable for Figure 15.12
a numeric independent variable for Figure 15.13
a numeric outcome variable for Figure 15.13
This dataset contains simulated data for Figures 15.9 - 15.11 of Chapter 15.
Figure15_9
A data frame with 24 rows and 4 variables:
a numeric independent variable for Figure 15.9
a numeric outcome variable for Figure 15.9
residual value for regression of y on x
log of the outcome variable y
This dataset contains data on causes of death in New York City that were used for Figure 2.4 of Chapter 2.
Figure2_4
A data frame with 591,200 rows and 1 variable:
cause of death
This dataset contains simulated test scores of Spanish fluency used to generate Figure 3.2 of Chapter 3.
Figure3_2
A data frame with 100 rows and 1 variable:
score on test of Spanish fluency
This dataset contains simulated scores used to generate Figure 3.3 of Chapter 3.
Figure3_3
A data frame with 45 rows and 1 variable:
numeric score from rectangular distribution
This dataset contains simulated scores used to generate Figure 3.5(A) of Chapter 3.
Figure3_5a
A data frame with 121 rows and 1 variable:
numeric score from a symmetric distribution
This dataset contains simulated scores used to generate Figure 3.5(B) of Chapter 3.
Figure3_5b
A data frame with 75 rows and 1 variable:
numeric score from a symmetric distribution
This dataset contains simulated scores used to generate Figures 3.6 and 3.7 of Chapter 3.
Figure3_6and7
A data frame with 69 rows and 2 variables:
numeric score from a distribution with severe negative skew
numeric score from a distribution with severe positive skew
This dataset contains simulated scores used to generate Figures 5.5(A) - 5.5(I) of Chapter 5.
Figure5_5
A data frame with 10 rows and 18 variables:
days elapsed in a given year
days remaining in that same year
age of elementary school student
number of seconds to run a 100-yard dash
introversion score of adolescent boy
aggression score of adolescent boy
moodiness score of college freshman
English ability score of college freshman
weight of male college student
achievement score in statistics of male college student
expected grade in course of college student
course evaluation score given by college student
IQ score of child in grades K – 3
reading achievement score of child in grades K – 3
arithmetic reasoning score of elementary school student
arithmetic fundamentals score of elementary school student
diameter of tree
circumference of tree
The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of non-institutionalized people in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects. The study began in 1948 and 5,209 subjects were initially enrolled in the study. In our dataset, we included variables from the first examination in 1956 and the third examination in 1968. Clinic examination data has included cardiovascular disease risk factors and markers of disease such as blood pressure, blood chemistry, lung function, smoking history, health behaviors, ECG tracings, echocardiography, and medication use. Through regular surveillance of area hospitals, participant contact, and death certificates, the Framingham Heart Study reviews and adjudicates events for the occurrence of any of the following types of coronary heart disease (CHD): angina pectoris, myocardial infarction, heart failure, and cerebrovascular disease.
Framingham
A data frame with 400 rows and 33 variables:
case number
sex
serum cholesterol (mg/dL) at initial examination
age (years) at initial examination
systolic blood pressure (mmHg) at initial examination
diastolic blood pressure (mmHg) at initial examination
indicator that participant currently is a cigarette smoker at initial examination
cigarettes smoked per day at initial examination
Body Mass Index (kg/(M*M)) at initial examination
indicator that participant is diabetic at initial examination
use of anti-hypertensive medication at initial examination
ventricular rate (beats/min) at initial examination
casual glucose (mg/dL) at initial examination
prevalent CHD (angina pectoris, myocardial infarction, or coronary insufficiency) at initial examination
days since initial examination
days from initial examination to any CHD event
serum cholesterol (mg/dL) at third examination
age (years) at third examination
systolic blood pressure (mmHg) at third examination
diastolic blood pressure (mmHg) at third examination
indicator that participant currently is a cigarette smoker at third examination
cigarettes smoked per day at third examination
Body Mass Index (kg/(M*M)) at third examination
indicator that participant is diabetic at third examination
use of anti-hypertensive medication at third examination
ventricular rate (beats/min) at third examination
casual glucose (mg/dL) at third examination
prevalent CHD (angina pectoris, myocardial infarction, or coronary insufficiency) at third examination
days since initial examination at third examination
HDL cholesterol (mg/dL) at third examination
LDL cholesterol (mg/dL) at third examination
days from initial examination to any CHD event at third examination
indicator of event of hospitalized myocardial infarction, angina pectoris, coronary insufficiency, or fatal CHD by the end of the study
The associated dataset is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 400 participants. Participants were chosen so that, among all male participants, 100 smokers and 100 non-smokers were selected at random. A similar procedure resulted in 100 female smokers and 100 female non-smokers. This procedure resulted in an over-sampling of smokers. The data for each participant are on one row. People who had any type of CHD in the initial examination period are not included in the dataset.
This dataset contains the fat grams and calories associated with the different types of hamburger sold by McDonald’s. The data are from McDonald’s Nutrition Information Center.
Hamburger
A data frame with 5 rows and 4 variables:
type of burger
grams of fat
total calories
cheese added
This dataset contains fabricated data for the temperature, relative humidity, and ice cream sales for 30 days randomly selected between May 15th and September 6th.
IceCream
A data frame with 30 rows and 4 variables:
case number
temperature in degrees Fahrenheit
number of ice cream bars sold
relative humidity
On February 12, 1999, for only the second time in the nation’s history, the U.S. Senate voted on whether to remove a president, based on impeachment articles passed by the U.S. House. Professor Alan Reifman of Texas Tech University created the dataset consisting of descriptions of each senator that can be used to understand some of the reasons that the senators voted the way they did. The data are taken from the Journal of Statistics Education [online].
Impeach
A data frame with 100 rows and 11 variables:
senator’s name
state the senator represents
geographic region of the U.S.
vote on perjury
vote on obstruction of justice
total number of guilty votes
political party of senator
conservatism score, defined as the senator’s degree of ideological conservatism, based on 1997 voting records as judged by the American Conservative Union, where the scores ranged from 0 to 100 and 100 is most conservative
state voter support for Clinton, defined as the percent of the vote Clinton received in the 1996 presidential election in the senator’s state
year the senator’s seat is up for reelection
indicator for whether the senator is in their first term
This dataset is a subset of data from a study by Susan Tomasi and Sharon L. Weinberg (1999), which profiled learning disabled students in an urban setting. According to Public Law 94.142, enacted in 1976, a team may determine that a child has a learning disability (LD) if a severe discrepancy exists between a child’s actual achievement in, for example, math or reading, and his or her intellectual ability. The dataset consists of six variables, described below, on 105 elementary school children from an urban area who were classified as LD and who, as a result, had been receiving special education services for at least three years. Of the 105 children, 42 are female and 63 are male. There are two main types of placements for these students: part-time resource room placements, in which the students get additional instruction to supplement regular classroom instruction, and self-contained classroom placements, in which students are segregated full time. In this dataset, 66 students are in resource room placements while 39 are in self-contained classroom placements. For inferential purposes, we consider the children in the dataset to be a random sample of all children attending public elementary school in a certain city who have been diagnosed with learning disabilities. Many students in the dataset have missing values for either math or reading comprehension, or both. Such omissions can lead to problems when generalizing results. There are statistical remedies for missing data that are beyond the scope of this text. In this case, we will assume that there is no pattern to the missing values, so that our sample is representative of the population.
Learndis
A data frame with 105 rows and 6 variables:
student’s grade level
student’s gender
type of placement: “RR” for part time in resource room or “MIS” for full time in self-contained classroom
reading comprehension score, with possible range of 0 to 200
math comprehension score, with possible range of 0 to 200
student’s intellectual ability, as measured by IQ score with possible range of 0 to 200
"Classifying children as learning disabled: An analysis of current practice in an urban setting." Tomasi, S., & Weinberg, S. L. (1999) <doi:10.2307/1511150>
Function to test the homogeneity of variance for two populations, an assumption of the independent samples t-test. The null hypothesis tested is that the two population variances are equal; the alternative is that the two population variances are not equal.
levenes.test(y, group)
y |
outcome variable of interest, given as a numeric object. |
group |
a factor or character object with two levels indicating group membership. |
An anova table containing test results: two values for degrees of freedom, the F-value, and the p-value.
# using simple data frame
value = c(7,2,4,4,8,3,61,2,80,4)
grp = rep(c("A","B"), each = 5)
ex_data = data.frame(value = value, grp = grp)
levenes.test(ex_data$value, group = ex_data$grp)

# using variable without NA values
levenes.test(NELS$famsize, group = NELS$gender)

# using variable with NA values
levenes.test(NELS$achrdg12, group = NELS$gender)
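Levene's test can be computed as a one-way ANOVA on the absolute deviations of each observation from its group center. The sketch below centers on group means; whether levenes.test centers on means or medians, and how it treats missing values, are assumptions here, and `levene_sketch` is a hypothetical name.

```r
# Hypothetical helper for illustration; not the package's implementation.
levene_sketch = function(y, group) {
  group = factor(group)
  keep = !is.na(y) & !is.na(group)   # assumption: incomplete cases are dropped
  y = y[keep]
  group = droplevels(group[keep])
  # Absolute deviation of each observation from its own group mean.
  abs_dev = abs(y - ave(y, group, FUN = mean))
  # One-way ANOVA on the absolute deviations gives the F test.
  anova(lm(abs_dev ~ group))
}

value = c(7, 2, 4, 4, 8, 3, 61, 2, 80, 4)
grp = rep(c("A", "B"), each = 5)
lev = levene_sketch(value, grp)
lev
```

A small p-value in the resulting ANOVA table is evidence against equal population variances.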
Returns the leverage values for a linear regression model.
leverage(x)
x |
linear regression model given as an lm object. |
A numeric vector of leverage values.
lm, rstudent(), cooks.distance()
mod = lm(Framingham$SYSBP1 ~ Framingham$TOTCHOL1 + Framingham$AGE1)
leverage(mod)
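Leverage values are the diagonal elements of the hat matrix H = X (X'X)^(-1) X'. The sketch below computes them two ways with base R on the built-in `mtcars` data, used only as a stand-in; that leverage() returns the same values as `stats::hatvalues()` is an assumption.

```r
# Stand-in model on built-in data; the package example uses Framingham instead.
fit = lm(mpg ~ wt + hp, data = mtcars)

# Leverage values from base R (stats::hatvalues).
h = hatvalues(fit)

# The same values taken from the diagonal of the hat matrix.
X = model.matrix(fit)
h_manual = diag(X %*% solve(crossprod(X)) %*% t(X))

same_values = isTRUE(all.equal(unname(h), unname(h_manual)))
same_values
sum(h)                   # equals the number of estimated coefficients
which(h > 2 * mean(h))   # a common rule of thumb for flagging high leverage
```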
This dataset contains fabricated data for a single survey item measured on a Likert scale. It is given that a survey was administered to 30 individuals and included an item measuring assertiveness by having the individual indicate agreement with the statement: “I have the ability to stand up for my own rights without denying the rights of others.” The response options were: 1 = "strongly agree"; 2 = "agree"; 3 = "neutral"; 4 = "disagree"; 5 = "strongly disagree." Notice that on this scale, high scores are associated with low levels of assertiveness.
Likert
A data frame with 30 rows and 1 variable:
five-point Likert-scale score of assertiveness, with high scores associated with low levels of assertiveness
Function to plot the estimated density values of a variable as a line.
line.graph(x, ...)
x |
numeric object to be plotted. |
... |
additional arguments to be passed to the |
A line graph of the estimated density distribution of a variable.
line.graph(Temp$Temperature[Temp$City == "SanFrancisco"])
line.graph(IceCream$barsold)
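A comparable plot can be drawn in base R by estimating the density and plotting it as a line. This is a sketch that assumes a kernel density estimate from `stats::density()`; how line.graph estimates the density is not stated here, and the built-in `faithful` data is only a stand-in.

```r
# Built-in example data, used here as a stand-in for a numeric variable.
x = faithful$eruptions

# Kernel density estimate; density() does not accept NA values, so drop them first.
dens = density(x[!is.na(x)])

# Draw the estimated density as a line.
plot(dens$x, dens$y, type = "l",
     xlab = "x", ylab = "Estimated density")
```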
This fictional dataset contains the treatment group number and the manual dexterity scores for 30 individuals selected by the director of a drug rehabilitation center. There are three treatments, and the individuals are randomly assigned ten to a treatment. After five weeks of treatment, a manual dexterity test is administered for which a higher score indicates greater manual dexterity.
ManDext
A data frame with 30 rows and 3 variables:
manual dexterity score
individual’s sex
treatment group assignment
This is a second fictional dataset that expands on ManDext, adding predicted outcome variables from regression analyses under alternative scenarios.
ManDext2
A data frame with 30 rows and 9 variables:
manual dexterity score
individual’s sex
treatment group assignment
predicted outcome for disordinal interaction scenario
predicted outcome for ordinal interaction scenario
predicted outcome for first no-interaction scenario
predicted outcome for second no-interaction scenario
predicted outcome for third no-interaction scenario
predicted outcome for fourth no-interaction scenario
The dataset contains the year and percentage of twelfth graders who have ever used marijuana for several recent years. The source for these data is The World Almanac and Book of Facts 2014.
Marijuana
A data frame with 23 rows and 2 variables:
year for which data was collected
percentage of twelfth graders who reported that they have ever used marijuana
In response to pressure from federal and state agencies to monitor school effectiveness in the United States, the National Center for Education Statistics (NCES) of the U.S. Department of Education conducted a survey in the spring of 1988, the National Education Longitudinal Study (NELS). The participants were a nationally representative sample of approximately 25,000 eighth graders, surveyed to measure achievement outcomes in four core subject areas (English, history, mathematics, and science), in addition to personal, familial, social, institutional, and cultural factors that might relate to these outcomes. Details on the design and initial analysis of this survey may be found in Horn, Hafner, and Owings (1992). A follow-up of these students was conducted during tenth grade in the spring of 1990; a second follow-up was conducted during the twelfth grade in the spring of 1992; and, finally, a third follow-up was conducted in the spring of 1994.
NELS
A data frame with 500 rows and 48 variables:
case number
indicator for whether advanced math taken in eighth grade
urbanicity, a measure of the type of environment in which the student lives
geographic region of school
student's gender
student’s family size
parents' marital status in eighth grade
home language background
self-concept in eighth grade
self-concept in tenth grade
self-concept in twelfth grade
school type in eighth grade
Likert-scale variable classifying student agreement with the statement, “My teachers are interested in students”
number of times late for school in twelfth grade
number of times skipped/cut classes in twelfth grade
number of times student missed school in twelfth grade
indicator for whether advanced placement program taken
time spent on homework in school per week in twelfth grade
time spent on homework out of school per week in twelfth grade
time spent weekly on extracurricular activities in twelfth grade, in hours
indicator for whether computer owned by family in eighth grade
type of high school program
units in English (NAEP), or number of years of English taken in high school
units in mathematics (NAEP), or number of years of math taken in high school
units in calculus (NAEP), or number of years of calculus taken in high school
school average daily attendance rate
number of advanced placement courses offered by school
indicator for whether nursery school attended
indicator for whether algebra taken in eighth grade
number of post-secondary institutions attended
highest level of education expected
expected income at age 30, in dollars
reading achievement in eighth grade
math achievement in eighth grade
science achievement in eighth grade
social studies achievement in eighth grade
reading achievement in tenth grade
math achievement in tenth grade
science achievement in tenth grade
social studies achievement in tenth grade
reading achievement in twelfth grade
math achievement in twelfth grade
science achievement in twelfth grade
social studies achievement in twelfth grade
indicator for whether smoked cigarettes ever
indicator for whether ever binged on alcohol
indicator for whether smoked marijuana ever
socioeconomic status score, ranging from 0 to 35, and given as a composite of father’s education level, mother’s education level, father’s occupation, mother’s occupation, and family income
For this dataset, we have selected a sub-sample of 500 cases and 48 variables. The cases were sampled randomly from the approximately 5,000 students who responded to all four administrations of the survey, who were always at grade level (neither repeated nor skipped a grade), and who pursued some form of post-secondary education. The particular variables were selected to explore the relationships between student and home background variables, self-concept, educational and income aspirations, academic motivation, risk-taking behavior, and academic achievement.
"A profile of American eighth-grade mathematics and science instruction." Horn, L., Hafner, & Owings (1992) <https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=92486>
For one variable, returns a frequency distribution table given in percentages. For two variables, returns a contingency table given in percentages.
percent.table(x, y = NULL)
x |
object containing data for a single variable. |
y |
optional second object to create a contingency table given in percentages. Default setting ignores second object by setting y = NULL. |
A table of frequency percentages (for one variable) or a contingency table of percentages (for two variables).
# frequency table for one variable
percent.table(NELS$region)

# cross-tabulation for two variables
percent.table(Wages$south, Wages$occup)
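A rough base-R equivalent combines `table()` with `prop.table()`. The helper `percent_table_sketch` is hypothetical, and for two variables it reports each cell as a percentage of the grand total; whether percent.table uses grand-total, row, or column percentages is an assumption.

```r
# Hypothetical helper for illustration; not the package's implementation.
percent_table_sketch = function(x, y = NULL) {
  counts = if (is.null(y)) table(x) else table(x, y)
  prop.table(counts) * 100   # assumption: percentages of the grand total
}

# One variable: frequency distribution in percentages.
region = c("N", "S", "S", "W")
one_way = percent_table_sketch(region)
one_way

# Two variables: contingency table in percentages.
south = c("yes", "no", "no", "yes")
occup = c("A", "A", "B", "B")
two_way = percent_table_sketch(south, occup)
two_way
```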
This dataset contains data on a fabricated random sample of 200 individuals, 100 females and 100 males, drawn from a population of interest. There are only two variables, both of which are categorical: gender and political party affiliation.
Politics
A data frame with 200 rows and 2 variables:
individual's gender
individual's political party affiliation
Function to obtain the standard error of the skewness of a distribution of values.
se.skew(x)
x |
numeric object containing the values for a variable. |
Standard error of skewness is computed on non-missing values using the following equation.
Standard error of skewness for x.
se.skew(Temp$Temperature[Temp$City == "Springfield"])
se.skew(Temp$Temperature[Temp$City == "SanFrancisco"])
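One widely used formula for the standard error of skewness depends only on the number of non-missing values n: SE = sqrt(6n(n - 1) / ((n - 2)(n + 1)(n + 3))). The sketch below implements that formula; whether se.skew uses exactly this form is an assumption, and `se_skew_sketch` is a hypothetical name.

```r
# Hypothetical helper for illustration; not the package's implementation.
se_skew_sketch = function(x) {
  n = sum(!is.na(x))   # only non-missing values are counted
  # Assumption: the common large-sample formula for the SE of sample skewness.
  sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
}

se_10 = se_skew_sketch(1:10)
se_10
```

Under this formula the standard error shrinks as n grows, so the same skewness value is stronger evidence of asymmetry in a larger sample.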
Function to obtain the skewness value of a distribution of values.
skew(x)
x |
numeric object containing the values for a variable. |
Skewness value computed on non-missing values using the ratio of to
.
Skewness value of x.
skew(IceCream$relhumid)
skew(IceCream$temp)
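A moment-based skewness statistic can be sketched as the ratio of the third central moment to the second central moment raised to the power 3/2. This is one common definition; the exact ratio used by skew (for example, any small-sample adjustment) is an assumption, and `skew_sketch` is a hypothetical name.

```r
# Hypothetical helper for illustration; not the package's implementation.
skew_sketch = function(x) {
  x = x[!is.na(x)]        # computed on non-missing values
  dev = x - mean(x)
  # Assumption: moment-based skewness, third central moment over variance^(3/2).
  mean(dev^3) / mean(dev^2)^(3/2)
}

sym = skew_sketch(1:10)              # symmetric values
right = skew_sketch(c(1, 2, 3, 10))  # long right tail
left = skew_sketch(c(-10, 1, 2, 3))  # long left tail
c(sym, right, left)
```

Symmetric data give a value of zero, a long right tail gives a positive value, and a long left tail gives a negative value.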
Returns the ratio of a distribution's skewness value to its standard error of skewness.
skew.ratio(x)
x |
numeric object containing the values for a variable. |
skew.ratio relies on the functions skew and se.skew to compute the skewness value and standard error of skewness, respectively.
Skewness ratio of x.
# skew ratio computed two ways
skew.ratio(NELS$achmat12)
skew(NELS$achmat12) / se.skew(NELS$achmat12)
This dataset includes different educational measures of the 50 states and Washington, D.C. These data are from The 2014 World Almanac and Book of Facts.
States
A data frame with 51 rows and 10 variables:
name of state
region of the country in which the state is located
total public school enrollment 2011 - 2012
average number of pupils per teacher 2011 - 2012
average annual salary for public school teachers 2011 - 2012
average expenditure per pupil 2011 - 2012
average SAT Critical Reading score 2013
average SAT Math score 2013
average SAT Writing score 2013
percentage of eligible students taking the SAT 2012
This dataset includes data on 12 statisticians who each have contributed significantly to the field of modern statistics.
Statisticians
A data frame with 12 rows and 5 variables:
name of statistician
gender of statistician, where 1 = “Female” and 2 = “Male”
year of birth
year of death
number of references in The American Statistician, 1995-2005
Students at Ohio State University conducted an experiment in the fall of 1993 to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The response variable, heart rate, was measured in beats per minute. For each person, the resting heart rate was measured before a trial (HRInit) and after stepping (HRFinal). There were two different step heights (Height): 5.75 inches (coded as 1 = Low), and 11.5 inches (coded as 2 = High). There were three rates of stepping (Freq): 14 steps/min. (coded as 1 = Slow), 21 steps/min. (coded as 2 = Medium), and 28 steps/min. (coded as 3 = Fast). This resulted in six possible height/frequency combinations. Each subject performed the activity for three minutes. Subjects were kept on pace by the beat of an electric metronome. One experimenter counted the subject's heart rate, in beats per minute, for 20 seconds before and after each trial. The subject always rested between trials until her or his heart rate returned to close to the beginning rate. Another experimenter kept track of the time spent stepping. Each subject was always measured and timed by the same pair of experimenters to reduce variability in the experiment. The dataset and description are adapted from the Data and Story Library (DASL) website.
Stepping
A data frame with 30 rows and 6 variables:
overall performance order of the trial
subject and experimenters' block number
step height
rate of stepping
resting heart rate of the subject before a trial, in beats per minute
final heart rate of the subject after a trial, in beats per minute
This dataset gives the average monthly temperatures (in degrees Fahrenheit) for Springfield, MO and San Francisco, CA. These data are from Burrill and Hopensperger (1993).
Temp
A data frame with 24 rows and 2 variables:
city where temperature was measured
average monthly temperature, in degrees Fahrenheit
"Exploring Statistics with the TI-81" Burrill, G., & Hopensperger, P. (1993, ISBN:9780201524321)
Function to obtain the mode(s) of a distribution.
the.mode(x)
x |
object containing data for a single variable. |
A numeric vector of the value(s) of the distribution that have the highest frequency of occurrence.
# single mode for factor variable
the.mode(NELS$urban)

# bimodal numeric variable
a = c(14,24,62,12,12,12,36,17,11,99,99,99)
the.mode(a)
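Base R has no built-in function for the statistical mode, so a sketch of the idea may help: count how often each distinct value occurs and keep every value tied for the highest count. The helper `mode_sketch` is hypothetical and is not the package's implementation.

```r
# Hypothetical helper for illustration; not the package's implementation.
mode_sketch = function(x) {
  values = unique(x[!is.na(x)])            # distinct non-missing values
  counts = tabulate(match(x, values))      # frequency of each distinct value
  values[counts == max(counts)]            # all values tied for the top count
}

a = c(14, 24, 62, 12, 12, 12, 36, 17, 11, 99, 99, 99)
modes = mode_sketch(a)
modes  # both 12 and 99 occur three times
```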
This simulated dataset consists of the number of hours eight individuals spend at the gym on a weekly basis along with measures of their upper body strength.
UpperBodyStrength
A data frame with 8 rows and 3 variables:
number of hours spent at the gym weekly
upper body strength score
individual's gender
This is a subsample of 100 males and 100 females randomly selected from the 534 cases that comprised the 1985 Current Population Survey in a way that controls for highest education level attained. The sample of 200 contains 20 males and 20 females with less than a high school diploma, 20 males and 20 females with a high school diploma, 20 males and 20 females with some college training, 20 males and 20 females with a college diploma, and 20 males and 20 females with some graduate school training. The data include information about gender, highest education level attained, and hourly wage.
Wages
A data frame with 200 rows and 9 variables:
case number
number of years of education
indicator for whether individual lives in the South
individual’s sex
number of years of work experience
wage (dollars per hour)
occupation category
marital status
highest education level