
MANE 3332.05

Lecture 17

Agenda

  • Midterm exams are not yet graded; still contacting students who missed the exam
  • Linear Combination Practice Problems (assigned 10/28, due 10/30)
  • Chapter Six
  • Attendance
  • Questions?

Handouts


Numerical Summaries

  • Called Descriptive Statistics in Chapter 6

    • Descriptive statistics help us understand the location or central tendency of data and the scatter or variability in data
    • Included in all statistical software packages; R does a good job of calculating descriptive statistics

Central Tendency

  • Ostle et al. (1996) define central tendency as "the tendency of sample data to cluster about a particular numerical value"

  • Population mean

\[ \mu=\frac{1}{N}\sum_{i=1}^Nx_i \]
  • Sample mean
\[ \bar{x}=\hat{\mu}=\frac{1}{n}\sum_{i=1}^nx_i \]
  • Sample median - middle value of the sorted data (the average of the two middle values when \(n\) is even)

  • Sample mode - most commonly occurring number(s)
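The three measures of central tendency can be sketched in R as follows. The scores are made up for illustration and are not the course data. Note that base R's mode() returns an object's storage type rather than the statistical mode, so the sketch tabulates the values instead.

```r
# Hypothetical exam scores, used only for illustration
scores <- c(72, 85, 85, 90, 64, 78, 85, 90)

x_bar <- mean(scores)    # sample mean: sum(scores) / length(scores)
x_med <- median(scores)  # middle value of the sorted data

# Tabulate the values and keep every value that attains the highest count,
# so ties (multiple modes) are all reported
counts <- table(scores)
x_mode <- as.numeric(names(counts)[counts == max(counts)])

print(c(mean = x_bar, median = x_med))
print(x_mode)
```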


Measures of Variability

  • There are several statistics that measure the variability or spread present in data

  • Population variance

\[ \sigma^2=\frac{\sum_{i=1}^N\left(x_i-\mu\right)^2}{N} \]
  • Sample variance
\[ s^2=\hat{\sigma}^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1} \]
  • Shortcut (Computational) Formula
\[ s^2=\frac{\sum_{i=1}^nx_i^2-\frac{\left(\sum_{i=1}^nx_i\right)^2}{n}}{n-1} \]
  • Standard deviation is often used because it is measured in the original units
\[ \sigma=\sqrt{\sigma^2};\;s=\sqrt{s^2} \]
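As a quick check, the definition and the shortcut formula for the sample variance can be compared with R's built-in var() and sd(), both of which divide by \(n-1\). The data below are hypothetical.

```r
# Hypothetical data, used only for illustration
x <- c(72, 85, 85, 90, 64, 78, 85, 90)
n <- length(x)

# Sample variance from the definition
s2_definition <- sum((x - mean(x))^2) / (n - 1)

# Sample variance from the shortcut (computational) formula
s2_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Sample standard deviation, in the original units of x
s <- sqrt(s2_definition)

# Population variance, if x were treated as the entire population (divide by N)
sigma2 <- sum((x - mean(x))^2) / n

print(c(definition = s2_definition, shortcut = s2_shortcut, builtin = var(x)))
print(c(s = s, builtin = sd(x)))
```

The first two values should agree with var(x) up to floating-point rounding, and sigma2 is smaller than the sample variance by the factor \((n-1)/n\).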

R Function Summary - Data Frame

  • R code
summary(midterm)
  • Output is from Spring 2024 results

Descriptive Statistics


R Function Summary - Variable

  • R code
summary(midterm$MidtermExam)
  • Output is from Spring 2024 results

Descriptive Statistics
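The course data frame is not reproduced in these notes, so the sketch below builds a small hypothetical stand-in named midterm to show what the two calls return. The Homework column and all values are invented for illustration.

```r
# Hypothetical stand-in for the course data (values are invented)
midterm <- data.frame(
  MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90),
  Homework    = c(80, 92, 88, 95, 70, 75, 90, 85)
)

# Applied to a data frame: one block of statistics per column
print(summary(midterm))

# Applied to a single variable: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(midterm$MidtermExam)
print(s)
```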


R Function Describe

  • summary() does not report the variance or standard deviation
  • describe() is not part of base R; its package has to be installed and loaded first
  • describe() is part of the package psych
  • R Code for descriptive statistics using psych package
library(psych)
describe(midterm)
  • Psych package output from Spring 2024

Describe() Output


Describe Output, part 2

Describe Output
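If the psych package is not available, a few of the same measures can be assembled in base R. This is only a sketch on hypothetical data, not a replacement for describe(), which reports more statistics; the psych call is guarded so the code still runs without the package.

```r
# Hypothetical stand-in for the course data (values are invented)
midterm <- data.frame(
  MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90),
  Homework    = c(80, 92, 88, 95, 70, 75, 90, 85)
)

# sapply() applies each function to every column of the data frame
variability <- data.frame(
  n    = sapply(midterm, length),
  mean = sapply(midterm, mean),
  sd   = sapply(midterm, sd),
  min  = sapply(midterm, min),
  max  = sapply(midterm, max)
)
variability$range <- variability$max - variability$min
print(variability)

# Only runs when psych is installed
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(midterm))
}
```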


Calculating Quantiles

reference for calculating quantiles


Quantile Example

Quantile Example
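In R, quantiles can be computed with quantile(). The function supports nine interpolation rules through its type argument, and different textbooks and packages use different rules, so hand calculations may not match the default (type = 7). The reference slide is not reproduced here, so the sketch below makes no assumption about which rule the course uses; it simply contrasts two of them on hypothetical data.

```r
# Hypothetical exam scores, used only for illustration
scores <- c(72, 85, 85, 90, 64, 78, 85, 90)
probs <- c(0.25, 0.50, 0.75)

# R's default rule (type = 7)
q_default <- quantile(scores, probs = probs)

# An alternative rule (type = 6), which places the p-th quantile at position (n + 1) * p
q_type6 <- quantile(scores, probs = probs, type = 6)

print(q_default)
print(q_type6)

# Interquartile range, using the default rule
iqr <- IQR(scores)
print(iqr)
```

The median is the same under both rules here, while the first and third quartiles differ, which is why it helps to state the rule when reporting quartiles.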


Exploratory Data (Graphical) Analysis

  • Exploratory data analysis (EDA) is the use of graphical procedures to analyze data.

  • John Tukey was a pioneer in this field and invented several of the procedures

  • Tools include stem-and-leaf diagrams, box plots, time series plots and digidot plots


Stem and Leaf Diagram

  • Excellent tool that maintains data integrity

  • The stem is the leading digit or digits

  • The leaf is the remaining digit

  • Make sure to include units

  • R Code

stem(midterm$MidtermExam)

Stem and Leaf Example

  • R output of a Stem and Leaf diagram

Stem and Leaf Plot of Midterm Exam Scores


Histogram

  • A histogram is a bar chart displaying the frequency distribution of the data

  • There are three types of histograms: frequency, relative frequency and cumulative relative frequency

  • R code

hist(midterm$MidtermExam)
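hist() draws a frequency histogram by default. One way to obtain the other two types is to take the bin counts from hist(..., plot = FALSE) and rescale them, as in this sketch. The scores and the bin boundaries are hypothetical choices made for illustration.

```r
# Hypothetical exam scores, used only for illustration
scores <- c(72, 85, 85, 90, 64, 78, 85, 90, 55, 93, 81, 68)

# Compute the bins without drawing; by default each bin includes its right endpoint
h <- hist(scores, breaks = seq(50, 100, by = 10), plot = FALSE)

freq         <- h$counts             # frequency
rel_freq     <- freq / sum(freq)     # relative frequency
cum_rel_freq <- cumsum(rel_freq)     # cumulative relative frequency

print(data.frame(lower = head(h$breaks, -1), upper = h$breaks[-1],
                 freq, rel_freq, cum_rel_freq))

# Draw the relative frequency version as a bar chart labeled by bin midpoint
barplot(rel_freq, names.arg = h$mids, xlab = "Score", ylab = "Relative frequency")
```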

Histogram Example

  • R output of histogram

Histogram of Midterm Exam Scores


Boxplot

  • Graphical display that simultaneously describes several important features of a data set such as center, spread, departure from symmetry and outliers

  • Requires the calculation of quantiles (quartiles)
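The quantities behind a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which matches the default in R's boxplot(); the slide's own explanation is not reproduced here, so treat that rule as an assumption. The data are hypothetical, with one unusually low score added.

```r
# Hypothetical exam scores with one unusually low value
scores <- c(72, 85, 85, 90, 64, 78, 85, 90, 20)

# Quartiles (R's default quantile rule)
q  <- quantile(scores, probs = c(0.25, 0.50, 0.75))
q1 <- q[["25%"]]
q3 <- q[["75%"]]
iqr <- q3 - q1

# Points beyond these fences are flagged as outliers under the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(c(Q1 = q1, median = q[["50%"]], Q3 = q3, IQR = iqr))
print(outliers)

# boxplot.stats() returns the values that boxplot() actually draws
print(boxplot.stats(scores))
```

boxplot.stats() uses hinges rather than quantile() for the box edges, so for some sample sizes its box can differ slightly from the quartiles computed above.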

Box Plot 1

Box plot with explanation


Box Plot 2

examples of boxplots


Box Plot 3

  • R code for Box Plot
boxplot(midterm$MidtermExam,xlab='Score',main='Boxplot of Midterm Exam Scores')
  • R Box Plot output

Boxplot of Midterm Exam Scores


Time Series Plot

  • A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say \(x\)) and the horizontal axis denotes time

  • Excellent tool for detecting:

    • trends,

    • cycles,

    • other non-random patterns


Time Series Plot in R

Time Series Plot
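The R code on the original slide is not reproduced here. A minimal sketch of a time series plot in base R, using hypothetical measurements recorded in time order, could look like this:

```r
# Hypothetical measurements listed in the order they were observed
x <- c(10.2, 10.5, 10.1, 10.8, 11.0, 10.9, 11.4, 11.2, 11.7, 11.9)
time_index <- seq_along(x)  # 1, 2, ..., n stands in for time

# type = "b" draws both points and connecting lines
plot(time_index, x, type = "b",
     xlab = "Observation order (time)", ylab = "x",
     main = "Time Series Plot")

# Dashed horizontal line at the sample mean makes a trend easier to see
abline(h = mean(x), lty = 2)
```

In this made-up series the points drift from below the mean to above it, the kind of trend that a histogram or box plot of the same values would hide.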


Probability Plotting

  • Probability plotting is a graphical method of determining whether sample data conform to a hypothesized distribution

  • Used for validating assumptions

  • Alternative to hypothesis testing


Construction

  1. Sort the data from smallest to largest: \(x_{(1)},x_{(2)},\ldots,x_{(n)}\)

  2. Calculate the observed cumulative frequency \((j-0.5)/n\)

For the normal distribution, find \(z_j\) that satisfies

\[ \frac{j-0.5}{n}=P(Z\leq z_j)=\Phi(z_j) \]
  3. Plot \(z_j\) versus \(x_{(j)}\) on ordinary graph paper (or plot \((j-0.5)/n\) versus \(x_{(j)}\) on normal probability paper)
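The construction can be carried out by hand in R, as sketched below on hypothetical data. qnorm() is the inverse of \(\Phi\), so it returns the \(z_j\) that satisfies the equation above. The reference line is one possible choice (the normal distribution with the sample mean and standard deviation) and is an addition for illustration.

```r
# Hypothetical data, used only for illustration
x <- c(72, 85, 85, 90, 64, 78, 85, 90, 55, 93, 81, 68)
n <- length(x)

x_sorted <- sort(x)     # ordered data x_(1), ..., x_(n)
j <- seq_len(n)
p <- (j - 0.5) / n      # observed cumulative frequency
z <- qnorm(p)           # z_j such that Phi(z_j) = (j - 0.5) / n

# z_j on the vertical axis against x_(j) on the horizontal axis
plot(x_sorted, z, xlab = "x (sorted)", ylab = "z",
     main = "Normal Probability Plot")

# Reference line z = (x - mean) / sd for the fitted normal distribution
abline(a = -mean(x) / sd(x), b = 1 / sd(x), lty = 2)
```

R's built-in qqnorm() produces a similar plot, but it puts the data on the vertical axis by default and uses slightly different plotting positions for small samples.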

Usage

  • If the points fall approximately along a straight line, the hypothesized distribution is a reasonable model for the data

normal probability plots from textbook, figure 6.21 on page 215


Probability Plot Example 1 in R

Normal Probability Plot


Probability Plot Example 2

  • The difficulty with example one is deciding how close to a straight line is "good enough"
  • Add confidence bands to the normal probability plot
    • Requires the car package to be installed and loaded in R
    • If all points fall within the band, the data are consistent with a normal distribution at the 95% confidence level. If one or more points fall outside the band, that is evidence that the data are not from a normal distribution

QQ Plot with band
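The code behind this figure is not reproduced here. A likely candidate in the car package is qqPlot(), which by default draws a 95% pointwise confidence envelope around the reference line; that choice of function is an assumption. The sketch uses simulated data and falls back to base R when car is not installed.

```r
set.seed(17)  # arbitrary seed so the simulated data are reproducible

# Simulated scores (not the course data)
x <- rnorm(30, mean = 75, sd = 10)

if (requireNamespace("car", quietly = TRUE)) {
  # Normal QQ plot with the default confidence envelope
  car::qqPlot(x, distribution = "norm", ylab = "Score")
} else {
  # Base R fallback: QQ plot with a reference line but no band
  qqnorm(x)
  qqline(x)
}
```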


Multivariate Data

Matrix of Scatter Plots in R

Scatter Plots
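In base R, a matrix of scatter plots can be produced with pairs(). The data frame below is a hypothetical stand-in with invented columns and values; the original slide's code is not reproduced here.

```r
# Hypothetical stand-in for the course data (columns and values are invented)
midterm <- data.frame(
  MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90),
  Homework    = c(80, 92, 88, 95, 70, 75, 90, 85),
  Quiz        = c(7, 9, 8, 10, 6, 7, 9, 8)
)

# One scatter plot for every pair of numeric columns
pairs(midterm, main = "Scatter Plot Matrix")
```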


Covariance in R

Covariance Matrix


Correlation

Correlation Matrix
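Covariance and correlation matrices are available in base R through cov() and cor(). The sketch below uses a hypothetical data frame and also checks the relationship between the two: each correlation is the covariance divided by the product of the two standard deviations.

```r
# Hypothetical stand-in for the course data (columns and values are invented)
midterm <- data.frame(
  MidtermExam = c(72, 85, 85, 90, 64, 78, 85, 90),
  Homework    = c(80, 92, 88, 95, 70, 75, 90, 85),
  Quiz        = c(7, 9, 8, 10, 6, 7, 9, 8)
)

cov_mat <- cov(midterm)  # sample covariances; the diagonal holds the sample variances
cor_mat <- cor(midterm)  # Pearson correlations; the diagonal is 1

# Rebuild the correlations from the covariances
sds <- sqrt(diag(cov_mat))
cor_from_cov <- cov_mat / outer(sds, sds)

print(round(cov_mat, 3))
print(round(cor_mat, 3))
print(all.equal(cor_mat, cor_from_cov, check.attributes = FALSE))
```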