MANE 3332.05
Lecture 17
Agenda
- Midterm exams are not graded yet; still contacting students who missed the exam
- Linear Combination Practice Problems (assigned 10/28, due 10/30)
- Chapter Six
- Attendance
- Questions?
Handouts
Numerical Summaries
- Called Descriptive Statistics in Chapter 6
- Descriptive statistics help us understand the location or central tendency of data and the scatter or variability in data
- Included in all statistical software packages; R does a good job of calculating descriptive statistics
Central Tendency
- Ostle et al. (1996) define central tendency as "the tendency of sample data to cluster about a particular numerical value"
- Population mean
- Sample mean
- Sample median - middle value
- Sample mode - most commonly occurring number(s)
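As a minimal sketch of these measures in R, assuming a small made-up vector of scores (the `scores` values and the helper `sample_mode` are hypothetical, not course data). Base R has no built-in function for the statistical mode, so the sketch tabulates the values instead:

```r
# Hypothetical exam scores (illustrative values, not the course data)
scores <- c(72, 85, 85, 90, 64, 78, 85, 90)

# Sample mean: sum of the observations divided by n
x_bar <- mean(scores)

# Sample median: middle value of the sorted data
x_med <- median(scores)

# Sample mode: tabulate the values and keep every value
# tied for the highest count (there may be more than one)
sample_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- sample_mode(scores)

print(c(mean = x_bar, median = x_med))
print(x_mode)
```

The population mean is computed the same way, but over every member of the population rather than a sample.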
Measures of Variability
- There are several statistics that measure the variability or spread present in data
- Population variance
- Sample variance
- Shortcut (Computational) Formula
- Standard deviation is often used because it is measured in the original units
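For reference, the standard definitions behind these bullets can be written as follows. The notation is an assumption and may differ from the textbook's: a population of size \(N\) with mean \(\mu\), and a sample \(x_1,\ldots,x_n\) with mean \(\bar{x}\).

$$ \sigma^2 = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}, \qquad s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} $$

Expanding the square in the numerator of \(s^2\) gives the shortcut (computational) form, which avoids computing each deviation from the mean:

$$ s^2 = \frac{\sum_{i=1}^{n}x_i^2-\frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}}{n-1}, \qquad s = \sqrt{s^2} $$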
R Function Summary - Data Frame
- R code
summary(midterm)
- Output is from Spring 2024 results

R Function Summary - Variable
- R code
summary(midterm$MidtermExam)
- Output is from Spring 2024 results

R Function Describe
- summary() does not report variability
- describe() is not part of base R; its package has to be installed and loaded first
- describe() is part of the package psych
- R Code for descriptive statistics using psych package
library(psych)
describe(midterm)
- Psych package output from Spring 2024
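If the psych package is not available, base R can report the same measures of spread one at a time. A minimal sketch, assuming a made-up data frame named `exams` (hypothetical; it stands in for the course's `midterm` data, which is not reproduced here):

```r
# Hypothetical data frame standing in for the course's `midterm` data
exams <- data.frame(
  MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90),
  Homework    = c(80, 92, 88, 95, 70, 75, 90, 85)
)

# Base R measures of spread for one column
s2 <- var(exams$MidtermExam)           # sample variance (divides by n - 1)
s  <- sd(exams$MidtermExam)            # sample standard deviation
r  <- diff(range(exams$MidtermExam))   # range = max - min

# Apply sd() to every column of the data frame
col_sd <- sapply(exams, sd)

print(c(variance = s2, sd = s, range = r))
print(col_sd)
```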

Describe Output, part 2

Calculating Quantiles

Quantile Example
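A minimal sketch of a quantile calculation in R, assuming a made-up `scores` vector (hypothetical values, not the course data). Note that `quantile()` interpolates between data points by default, so its results can differ slightly from a hand calculation that uses a different quantile definition; the `type` argument selects among the definitions R supports.

```r
# Hypothetical exam scores (illustrative values, not the course data)
scores <- c(64, 72, 78, 85, 85, 85, 90, 90)

# Quartiles: the 25th, 50th, and 75th percentiles
q <- quantile(scores, probs = c(0.25, 0.50, 0.75))

# Interquartile range: third quartile minus first quartile
iqr <- IQR(scores)

print(q)
print(iqr)
```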

Exploratory Data (Graphical) Analysis
- Exploratory data analysis (EDA) is the use of graphical procedures to analyze data.
- John Tukey was a pioneer in this field and invented several of the procedures
- Tools include stem-and-leaf diagrams, box plots, time series plots and digidot plots
Stem and Leaf Diagram
- Excellent tool that maintains data integrity
- The stem is the leading digit or digits
- The leaf is the remaining digit
- Make sure to include units
- R Code
stem(midterm$MidtermExam)
Stem and Leaf Example
- R output of a Stem and Leaf diagram

Histogram
- A histogram is a bar chart displaying the frequency distribution information
- There are three types of histograms: frequency, relative frequency and cumulative relative frequency
- R code
hist(midterm$MidtermExam)
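By default `hist()` draws the frequency version. The other two types can be derived from the same bin counts. A minimal sketch, assuming made-up scores and bin edges (both hypothetical):

```r
# Hypothetical exam scores (illustrative values, not the course data)
scores <- c(52, 58, 64, 67, 71, 72, 75, 78, 78, 81,
            83, 85, 85, 85, 88, 90, 90, 93, 95, 98)

# Bin the data without drawing, so the counts can be reused
h <- hist(scores, breaks = seq(50, 100, by = 10), plot = FALSE)

freq     <- h$counts            # frequency in each bin
rel_freq <- freq / sum(freq)    # relative frequency
cum_rel  <- cumsum(rel_freq)    # cumulative relative frequency

# Label each bar with its bin and draw the three versions
bins <- paste(head(h$breaks, -1), tail(h$breaks, -1), sep = "-")
barplot(freq,     names.arg = bins, space = 0, main = "Frequency")
barplot(rel_freq, names.arg = bins, space = 0, main = "Relative frequency")
barplot(cum_rel,  names.arg = bins, space = 0,
        main = "Cumulative relative frequency")
```

The three plots share the same bins; only the vertical scale changes, and the last cumulative bar always reaches 1.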
Histogram Example
- R output of histogram

Boxplot
- Graphical display that simultaneously describes several important features of a data set such as center, spread, departure from symmetry and outliers
- Requires the calculation of quantiles (quartiles)
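The quantities behind a box plot can be computed directly. A minimal sketch, assuming made-up scores and the common 1.5 × IQR rule for flagging outliers (the rule and the variable names are assumptions, not taken from the slides):

```r
# Hypothetical exam scores with one unusually low value
scores <- c(35, 64, 72, 78, 85, 85, 85, 90, 90, 95)

# Quartiles and interquartile range
q1  <- quantile(scores, 0.25, names = FALSE)
q3  <- quantile(scores, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences under the 1.5 * IQR rule; points beyond them
# are flagged as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(outliers)
```

R's `boxplot()` computes the box edges as hinges, which can differ slightly from the `quantile()` values above on small samples.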
Box Plot 1

Box Plot 2

Box Plot 3
- R code for Box Plot
boxplot(midterm$MidtermExam,xlab='Score',main='Boxplot of Midterm Exam Scores')
- R Box Plot output

Time Series Plot
- A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say \(x\)) and the horizontal axis denotes time
- Excellent tool for detecting:
  - trends,
  - cycles,
  - other non-random patterns
- Time Series Plot in R
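A minimal sketch of a time series plot in base R, assuming made-up measurements recorded in time order (the values and the names `x` and `time_index` are hypothetical):

```r
# Hypothetical measurements recorded in time order (made-up values)
x <- c(10.2, 10.5, 10.1, 10.8, 11.0, 10.9, 11.4, 11.6, 11.3, 11.9)

# Observation number stands in for time
time_index <- seq_along(x)

# Points joined by lines make trends and cycles easier to see
plot(time_index, x, type = "b",
     xlab = "Observation order (time)", ylab = "x",
     main = "Time series plot")
```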

Probability Plotting
- Probability plotting is a graphical method of determining whether sample data conform to a hypothesized distribution
- Used for validating assumptions
- Alternative to hypothesis testing
Construction
- Sort the data from smallest to largest: $$ x_{(1)},x_{(2)},\ldots,x_{(n)} $$
- Calculate the observed cumulative frequency \((j-0.5)/n\)
- For the normal distribution, find \(z_j\) that satisfies \((j-0.5)/n = P(Z \le z_j) = \Phi(z_j)\)
- Plot \(z_j\) versus \(x_{(j)}\) on ordinary axes (or plot \((j-0.5)/n\) versus \(x_{(j)}\) on normal probability paper)
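The construction steps can be sketched in base R as follows, assuming a made-up sample (the data and variable names are hypothetical). `qnorm()` returns the standard normal quantile, that is, the \(z_j\) whose cumulative probability equals \((j-0.5)/n\).

```r
# Hypothetical sample (made-up values, not course data)
x <- c(72, 85, 81, 90, 64, 78, 88, 93, 69, 76)
n <- length(x)

# Step 1: sort the data from smallest to largest
x_sorted <- sort(x)

# Step 2: observed cumulative frequency (j - 0.5) / n
j <- seq_len(n)
cum_freq <- (j - 0.5) / n

# Step 3: standard normal quantile z_j with P(Z <= z_j) = (j - 0.5) / n
z <- qnorm(cum_freq)

# Step 4: plot z_j against the ordered data on ordinary axes
plot(x_sorted, z,
     xlab = "Ordered data x(j)", ylab = "Standard normal score z_j",
     main = "Normal probability plot (manual construction)")
```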
Usage
- If the data plot as an approximately straight line, the assumed distribution is a reasonable model for the data

Probability Plot Example 1 in R
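A minimal sketch of a normal probability plot using base R's built-in functions, assuming a made-up sample (hypothetical values). By default `qqnorm()` puts the theoretical normal quantiles on the horizontal axis and the data on the vertical axis, and its plotting positions can differ slightly from \((j-0.5)/n\) for small samples.

```r
# Hypothetical sample (made-up values, not course data)
x <- c(72, 85, 81, 90, 64, 78, 88, 93, 69, 76)

# Normal Q-Q plot; the plotted coordinates are returned invisibly
qq <- qqnorm(x, main = "Normal Q-Q plot")

# Reference line through the first and third quartiles
qqline(x)
```

Setting `datax = TRUE` in both calls puts the data on the horizontal axis instead.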

Probability Plot Example 2
- The difficulty in example one is deciding how close to straight is "good enough"
- Add confidence bands to the normal probability plot
- Requires the car package to be installed and loaded in R
- If all points are within the band, we are 95% confident that the sample is from a normal distribution. However, if one or more points fall outside the band, we conclude that the data are not from a normal distribution
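A minimal sketch of a probability plot with a confidence band, assuming the car package is installed and using a made-up sample (hypothetical values). The `requireNamespace()` guard is an addition so that the script still runs when car is missing.

```r
# Hypothetical sample (made-up values, not course data)
x <- c(72, 85, 81, 90, 64, 78, 88, 93, 69, 76)

# The car package is not part of base R; install it once with
# install.packages("car") before running this
has_car <- requireNamespace("car", quietly = TRUE)

if (has_car) {
  # Normal probability plot with a 95% confidence envelope
  car::qqPlot(x, distribution = "norm", envelope = 0.95,
              ylab = "Score",
              main = "Normal probability plot with 95% band")
} else {
  message("Package 'car' is not installed; skipping the plot.")
}
```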

Multivariate Data
Matrix of Scatter Plots in R
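A minimal sketch of a scatter plot matrix in base R, assuming a made-up data frame named `grades` (hypothetical; the column names and values are illustrative only):

```r
# Hypothetical data frame with three related variables (made-up values)
grades <- data.frame(
  Homework = c(80, 92, 88, 95, 70, 75, 90, 85),
  Midterm  = c(72, 85, 85, 90, 64, 78, 85, 90),
  Final    = c(75, 88, 82, 94, 60, 74, 89, 86)
)

# One scatter plot for every pair of variables
pairs(grades, main = "Matrix of scatter plots")
```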

Covariance in R
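A minimal sketch of the sample covariance in R, assuming two made-up variables (`homework` and `midterm_score` are hypothetical). The manual calculation is included only to show what `cov()` computes.

```r
# Hypothetical paired observations (made-up values)
homework      <- c(80, 92, 88, 95, 70, 75, 90, 85)
midterm_score <- c(72, 85, 85, 90, 64, 78, 85, 90)
n <- length(homework)

# Sample covariance between the two variables
s_xy <- cov(homework, midterm_score)

# Same quantity from the definition: sum of cross-products of
# deviations from the means, divided by n - 1
s_xy_manual <- sum((homework - mean(homework)) *
                   (midterm_score - mean(midterm_score))) / (n - 1)

# Covariance matrix for all columns of a data frame
cov_matrix <- cov(data.frame(homework, midterm_score))

print(s_xy)
print(cov_matrix)
```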

Correlation
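A minimal sketch of the sample correlation coefficient in R, assuming two made-up variables (`homework` and `midterm_score` are hypothetical). Correlation is the covariance rescaled by both standard deviations, so it is unit-free and lies between -1 and 1.

```r
# Hypothetical paired observations (made-up values)
homework      <- c(80, 92, 88, 95, 70, 75, 90, 85)
midterm_score <- c(72, 85, 85, 90, 64, 78, 85, 90)

# Pearson correlation coefficient
r <- cor(homework, midterm_score)

# Same quantity from covariance and standard deviations
r_manual <- cov(homework, midterm_score) /
  (sd(homework) * sd(midterm_score))

print(r)
```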
