MANE 3332.05

Lecture 17

Agenda

Midterm exams are not graded; still contacting students who missed the exam
Linear Combination Practice Problems (assigned 10/28, due 10/30)
Chapter Six
Attendance
Questions?

Handouts

Numerical Summaries

Called Descriptive Statistics in Chapter 6
- Descriptive statistics help us understand the location or central tendency of data and the scatter or variability in data
- Included in all statistical software packages; R does a good job of calculating descriptive statistics

Central Tendency

Ostle et al. (1996) define central tendency as "the tendency of sample data to cluster about a particular numerical value"
Population mean

\[ \mu=\frac{1}{N}\sum_{i=1}^Nx_i \]

Sample mean

\[ \bar{x}=\hat{\mu}=\frac{1}{n}\sum_{i=1}^nx_i \]

Sample median - middle value
Sample mode - most commonly occurring number(s)
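A minimal R sketch of the three measures above. The scores are invented for illustration and are not the class data. Base R has no function for the statistical mode (its `mode()` reports the storage type), so the sketch tabulates the values instead.

```r
# Hypothetical exam scores, invented for illustration only
scores <- c(72, 85, 85, 90, 64, 78, 85, 90)

x_bar <- mean(scores)     # sample mean: sum of the values divided by n
x_med <- median(scores)   # sample median: middle value of the sorted data

# Statistical mode: tabulate the values and keep the most frequent one(s)
counts <- table(scores)
x_mode <- as.numeric(names(counts)[counts == max(counts)])

print(c(mean = x_bar, median = x_med))
print(x_mode)
```

If several values tie for the highest count, `x_mode` holds all of them, which matches the "number(s)" wording in the definition.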

Measures of Variability

There are several statistics that measure the variability or spread present in data
Population variance

\[ \sigma^2=\frac{\sum_{i=1}^N\left(x_i-\mu\right)^2}{N} \]

Sample variance

\[ s^2=\hat{\sigma}^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1} \]

Shortcut (Computational) Formula

\[ s^2=\frac{\sum_{i=1}^nx_i^2-\frac{\left(\sum_{i=1}^nx_i\right)^2}{n}}{n-1} \]

Standard deviation is often used because it is measured in the original units

\[ \sigma=\sqrt{\sigma^2};\;s=\sqrt{s^2} \]
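As a quick check, the definitional and shortcut formulas for $s^2$ can be compared with R's built-in `var()`, which also divides by $n-1$. The data vector is hypothetical.

```r
# Hypothetical sample, invented for illustration only
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)

# Definitional formula: squared deviations from the sample mean over n - 1
s2_def <- sum((x - mean(x))^2) / (n - 1)

# Shortcut (computational) formula
s2_short <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Built-in functions use the same n - 1 divisor
s2_builtin <- var(x)
s <- sd(x)   # sample standard deviation, same units as x

print(c(definition = s2_def, shortcut = s2_short, var = s2_builtin, sd = s))
```

The shortcut formula is convenient for hand calculation, but on a computer it can lose precision when the values are large relative to their spread, so the built-in `var()` is usually the safer choice.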

R Function Summary - Data Frame

R code

summary(midterm)

Output is from Spring 2024 results

Descriptive Statistics

R Function Summary - Variable

R code

summary(midterm$MidtermExam)

Output is from Spring 2024 results

Descriptive Statistics

R Function Describe

summary() does not report variability
describe() is not part of base R
describe() is part of the psych package, which must be installed and then loaded with library()
R Code for descriptive statistics using psych package

library(psych)
describe(midterm)
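If the psych package is not available, measures of variability can also be obtained with base R. The sketch below is one possible approach; the data frame `midterm_demo` and its values are hypothetical stand-ins for the class data.

```r
# Hypothetical stand-in for the midterm data frame
midterm_demo <- data.frame(MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90))

# Apply several spread measures to every column of the data frame
spread <- sapply(midterm_demo, function(col) {
  c(sd    = sd(col),           # sample standard deviation
    var   = var(col),          # sample variance
    IQR   = IQR(col),          # interquartile range
    range = diff(range(col)))  # maximum minus minimum
})

print(spread)
```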

Psych package output from Spring 2024

Describe() Output

Describe Output, part 2

Describe Output

Calculating Quantiles

reference for calculating quantiles

Quantile Example
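A short sketch of quantile calculation in R, using hypothetical data. R's `quantile()` supports nine interpolation rules through its `type` argument, so its answers may differ slightly from a hand method; the default is `type = 7`, and `type = 6` is shown only for comparison.

```r
# Hypothetical sample, invented for illustration only
x <- c(64, 72, 78, 85, 85, 85, 90, 90)

# Quartiles with R's default rule (type = 7)
q7 <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Same quartiles with an alternative rule (type = 6)
q6 <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)

print(q7)
print(q6)
print(IQR(x))   # interquartile range, computed with type = 7 by default
```

When comparing software output with a textbook calculation, it helps to state which rule was used, since small samples can give noticeably different quartiles.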

Exploratory Data (Graphical) Analysis

Exploratory data analysis (EDA) is the use of graphical procedures to analyze data.
John Tukey was a pioneer in this field and invented several of the procedures
Tools include stem-and-leaf diagrams, box plots, time series plots and digidot plots

Stem and Leaf Diagram

Excellent tool that maintains data integrity
The stem is the leading digit or digits
The leaf is the remaining digit
Make sure to include units
R Code

stem(midterm$MidtermExam)

Stem and Leaf Example

R output of a Stem and Leaf diagram

Stem and Leaf Plot of Midterm Exam Scores

Histogram

A histogram is a bar chart that displays the frequency distribution of the data
There are three types of histograms: frequency, relative frequency and cumulative relative frequency
R code

hist(midterm$MidtermExam)
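The three histogram types can be derived from the same bin counts. The following is a sketch with hypothetical scores; it lets `hist()` choose the bins and then rescales the counts. Note that `hist(..., freq = FALSE)` plots densities, which equal relative frequencies only when the bin width is 1.

```r
# Hypothetical exam scores, invented for illustration only
scores <- c(52, 58, 61, 64, 67, 70, 72, 75, 75, 78,
            80, 82, 85, 85, 88, 90, 93, 96)

# Compute the bins without drawing anything
h <- hist(scores, plot = FALSE)

freq         <- h$counts          # frequency
rel_freq     <- freq / sum(freq)  # relative frequency
cum_rel_freq <- cumsum(rel_freq)  # cumulative relative frequency

# Draw the relative frequency version with touching bars, labeled by bin midpoint
barplot(rel_freq, names.arg = h$mids, space = 0,
        xlab = "Score", ylab = "Relative frequency")

print(data.frame(mid = h$mids, freq, rel_freq, cum_rel_freq))
```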

Histogram Example

R output of histogram

Histogram of Midterm Exam Scores

Boxplot

Graphical display that simultaneously describes several important features of a data set such as center, spread, departure from symmetry and outliers
Requires the calculation of quantiles (quartiles)
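The quantities behind a box plot can be computed directly. This sketch uses hypothetical data and follows the defaults of R's `boxplot()`, which takes the box edges from `fivenum()` (Tukey's hinges) and flags points more than 1.5 box lengths beyond the box as outliers; a textbook may define the quartiles slightly differently.

```r
# Hypothetical scores with one unusually low value
x <- c(64, 72, 78, 85, 85, 85, 90, 90, 31)

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1       # length of the box

# Fences used by boxplot() with its default range = 1.5
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(five)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

`boxplot.stats(x)$out` returns the same outliers that `boxplot()` would draw, which gives an easy way to check the hand calculation.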

Box Plot 1

Box plot with explanation

Box Plot 2

examples of boxplots

Box Plot 3

R code for Box Plot

boxplot(midterm$MidtermExam,xlab='Score',main='Boxplot of Midterm Exam Scores')

R Box Plot output

Boxplot of Midterm Exam Scores

Time Series Plot

A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say $x$) and the horizontal axis denotes time
Excellent tool for detecting:
- trends,
- cycles,
- other non-random patterns

Time Series Plot in R

Time Series Plot
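One possible way to draw a time series plot in base R is sketched below. The observations are hypothetical and are assumed to be recorded in time order at equal intervals.

```r
# Hypothetical measurements listed in the order they were observed
x <- c(12.1, 12.4, 12.3, 12.8, 13.0, 12.9, 13.4, 13.6, 13.5, 14.0)

# Convert to a time series object; time is the observation index 1, 2, ...
x_ts <- ts(x)

# Points joined by lines make trends and cycles easier to see
plot(x_ts, type = "b", xlab = "Time (observation order)", ylab = "x",
     main = "Time Series Plot")

# Dashed reference line at the sample mean
abline(h = mean(x_ts), lty = 2)
```

A steady drift away from the reference line, as in these made-up values, is the kind of trend the plot is meant to reveal.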

Probability Plotting

Probability plotting is a graphical method of determining whether sample data conform to a hypothesized distribution
Used for validating assumptions
Alternative to hypothesis testing

Construction

Sort the data from smallest to largest: $x_{(1)},x_{(2)},\ldots,x_{(n)}$
Calculate the observed cumulative frequency $(j-0.5)/n$

For the normal distribution find $z_j$ that satisfies

\[ \frac{j-0.5}{n}=P(Z\leq z_j)=\Phi(z_j) \]

Plot $z_j$ versus $x_{(j)}$ on ordinary axes, or plot $(j-0.5)/n$ versus $x_{(j)}$ on normal probability paper
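The construction steps can be sketched in R as follows, drawing the plot on ordinary axes with the standard normal quantiles $z_j$ on the vertical axis. The data are hypothetical.

```r
# Hypothetical sample, invented for illustration only
x <- c(74.2, 68.9, 81.5, 77.0, 70.3, 85.1, 79.4, 72.8, 76.1, 66.5)
n <- length(x)

x_sorted <- sort(x)    # step 1: order statistics x_(1), ..., x_(n)
j <- seq_len(n)
p <- (j - 0.5) / n     # step 2: observed cumulative frequency
z <- qnorm(p)          # step 3: z_j such that Phi(z_j) = (j - 0.5) / n

# Step 4: plot z_j against x_(j)
plot(x_sorted, z, xlab = "x", ylab = "z",
     main = "Normal Probability Plot")
```

Base R's `qqnorm(x)` produces a similar plot, but it puts the theoretical quantiles on the horizontal axis and, for samples of 10 or fewer, uses slightly different plotting positions than $(j-0.5)/n$.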

Usage

If the points fall approximately along a straight line, the hypothesized distribution is a reasonable model for the data

normal probability plots from textbook, figure 6.21 on page 215

Probability Plot Example 1 in R

Normal Probability Plot

Probability Plot Example 2

Difficulty from example one is how close to straight is "good enough"
Add confidence bands to normal probability plot
- Requires the car package to be installed and loaded in R
- If all points fall within the 95% band, the data are consistent with a normal distribution. If one or more points fall outside the band, that is evidence that the data are not from a normal distribution
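A hedged sketch of a normal probability plot with a confidence band, assuming the car package is installed. The sample is simulated rather than taken from the class data, and the `requireNamespace()` guard is only there so the script does not fail when car is missing.

```r
# Simulated sample standing in for real data (not the class results)
set.seed(1)
x <- rnorm(30, mean = 75, sd = 10)

if (requireNamespace("car", quietly = TRUE)) {
  # QQ plot against the normal distribution with a 95% envelope
  car::qqPlot(x, distribution = "norm", envelope = 0.95,
              ylab = "Simulated score")
} else {
  message("Package 'car' is not installed; run install.packages('car') first.")
}
```

The envelope drawn by `qqPlot()` is point-wise, so it is best read as a visual guide to "close enough to straight" rather than as a formal test of normality.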

QQ Plot with band

Multivariate Data

Matrix of Scatter Plot in R

Scatter Plots

Covariance in R

Covariance Matrix

Correlation

Correlation Matrix
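For multivariate data, a scatter plot matrix, covariance matrix, and correlation matrix can each be produced with one base R call. The sketch below uses a hypothetical data frame; the column names and values are invented for illustration.

```r
# Hypothetical scores for six students, invented for illustration only
exams <- data.frame(
  Homework = c(88, 92, 75, 60, 95, 81),
  Midterm  = c(82, 90, 70, 65, 93, 78),
  Final    = c(85, 94, 68, 59, 97, 80)
)

pairs(exams)      # matrix of scatter plots, one panel per pair of variables

S <- cov(exams)   # sample covariance matrix (divisor n - 1)
R <- cor(exams)   # sample correlation matrix (Pearson)

print(S)
print(R)
```

The diagonal of `S` holds the sample variances, and `R` is `S` rescaled so that every diagonal entry is 1, which is what `cov2cor(S)` computes.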