MANE 3332.05

Lecture 18

Agenda

Midterm exams are not yet graded; still contacting students who missed the exam
Linear Combination Practice Problems (assigned 10/28, due 10/30)
Linear Combination Quiz (assigned 10/30, due 11/4)
Complete Chapter 6 and start Chapter 7
Attendance
Questions?

Handouts

Chapter 6, continued

Calculating Quantiles

reference for calculating quantiles

Quantile Example

Exploratory Data (Graphical) Analysis

Exploratory data analysis (EDA) is the use of graphical procedures to analyze data.
John Tukey was a pioneer in this field and invented several of the procedures
Tools include stem-and-leaf diagrams, box plots, time series plots and digidot plots

Stem and Leaf Diagram

Excellent tool that maintains data integrity
The stem is the leading digit or digits
The leaf is the remaining digit
Make sure to include units
R Code

stem(midterm$MidtermExam)

Stem and Leaf Example

R output of a Stem and Leaf diagram

Stem and Leaf Plot of Midterm Exam Scores

Histogram

A histogram is a bar chart that displays the frequency distribution of the data
There are three types of histograms: frequency, relative frequency and cumulative relative frequency
R code

hist(midterm$MidtermExam)
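The three histogram types can be sketched in R as follows. The `scores` vector is hypothetical data standing in for `midterm$MidtermExam`, which is not reproduced here, and the bin width of 10 is an arbitrary choice.

```r
# Hypothetical exam scores (illustrative only, not the class data)
scores <- c(52, 58, 61, 64, 67, 70, 71, 73, 75, 76,
            78, 80, 81, 83, 85, 86, 88, 90, 93, 97)

# Compute the bins without drawing, using breaks of width 10
h <- hist(scores, breaks = seq(50, 100, by = 10), plot = FALSE)

freq     <- h$counts                  # frequency
rel_freq <- h$counts / sum(h$counts)  # relative frequency
cum_rel  <- cumsum(rel_freq)          # cumulative relative frequency

# Draw the three versions side by side, then restore the plot layout
op <- par(mfrow = c(1, 3))
barplot(freq,     names.arg = h$mids, space = 0, main = "Frequency")
barplot(rel_freq, names.arg = h$mids, space = 0, main = "Relative frequency")
barplot(cum_rel,  names.arg = h$mids, space = 0,
        main = "Cumulative relative frequency")
par(op)
```

All three versions share the same bins; only the height of each bar changes.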

Histogram Example

R output of histogram

Histogram of Midterm Exam Scores

Boxplot

Graphical display that simultaneously describes several important features of a data set such as center, spread, departure from symmetry and outliers
Requires the calculation of quantiles (quartiles)
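A minimal sketch of the quartile calculations behind a box plot, using hypothetical scores. R's `quantile()` offers nine definitions (`type = 1` to `9`), so its values may differ slightly from the method in the quantile reference above.

```r
# Hypothetical exam scores (illustrative only, not the class data)
scores <- c(52, 58, 61, 64, 67, 70, 71, 73, 75, 76,
            78, 80, 81, 83, 85, 86, 88, 90, 93, 97)

# First quartile, median and third quartile (R default, type = 7)
q <- quantile(scores, probs = c(0.25, 0.50, 0.75))
iqr <- unname(q[3] - q[1])   # interquartile range, a measure of spread

# Fences from the common 1.5 * IQR rule for flagging potential outliers
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
```

Note that `boxplot()` draws the box from Tukey's hinges (see `fivenum()`), which can differ slightly from `quantile()` for small samples.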

Box Plot 1

Box plot with explanation

Box Plot 2

examples of boxplots

Box Plot 3

R code for Box Plot

boxplot(midterm$MidtermExam,xlab='Score',main='Boxplot of Midterm Exam Scores')

R Box Plot output

Boxplot of Midterm Exam Scores

Time Series Plot

A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say $x$) and the horizontal axis denotes time
Excellent tool for detecting:
- trends,
- cycles,
- other non-random patterns
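A minimal sketch of a time series plot in R. The series is simulated for illustration (a linear trend plus a 12-period cycle plus noise); none of these values come from the course data.

```r
# Simulated series (illustrative only): trend + cycle + random noise
set.seed(1)
t_idx <- 1:48
x <- 50 + 0.5 * t_idx + 5 * sin(2 * pi * t_idx / 12) + rnorm(48, sd = 2)

# Observed value on the vertical axis, time on the horizontal axis
plot(t_idx, x, type = "b", xlab = "Time", ylab = "x",
     main = "Time Series Plot")
```

Connecting the points (`type = "b"`) makes the upward trend and the repeating cycle easier to see than a bare scatter of points.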

Time Series Plot in R

Time Series Plot

Probability Plotting

Probability plotting is a graphical method of determining whether sample data conform to a hypothesized distribution
Used for validating assumptions
Alternative to hypothesis testing

Construction

Sort the data from smallest to largest: $x_{(1)},x_{(2)},\ldots,x_{(n)}$
Calculate the observed cumulative frequency $(j-0.5)/n$ for $j=1,2,\ldots,n$

For the normal distribution find $z_j$ that satisfies

\[ \frac{j-0.5}{n}=P(Z\leq z_j)=\Phi(z_j) \]

Plot $z_j$ versus $x_{(j)}$ on ordinary graph paper (special normal probability paper is needed only when plotting $(j-0.5)/n$ directly)
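The construction steps can be sketched in R as follows. The sample is simulated for illustration, and the variable names are arbitrary.

```r
# Simulated sample (illustrative only)
set.seed(42)
x <- rnorm(20, mean = 70, sd = 10)
n <- length(x)

x_sorted <- sort(x)      # step 1: order statistics x_(1), ..., x_(n)
j <- seq_len(n)
p <- (j - 0.5) / n       # step 2: observed cumulative frequency
z <- qnorm(p)            # step 3: z_j such that Phi(z_j) = (j - 0.5) / n

# step 4: plot z_j versus x_(j)
plot(x_sorted, z, xlab = "x_(j)", ylab = "z_j",
     main = "Normal Probability Plot")
```

Base R's `qqnorm(x)` draws a similar plot, but with the axes swapped and, for $n \leq 10$, slightly different plotting positions.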

Usage

If the points fall approximately along a straight line, the hypothesized distribution is a reasonable model for the data

normal probability plots from textbook, figure 6.21 on page 215

Probability Plot Example 1 in R

Normal Probability Plot

Probability Plot Example 2

Difficulty from example one is how close to straight is "good enough"
Add confidence bands to normal probability plot
- Requires package car to be added to R
- If all points fall within the 95% band, the data are consistent with a normal distribution. If one or more points fall outside the band, there is evidence that the data are not from a normal distribution
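A minimal sketch of a normal QQ plot with a confidence band, assuming the car package is installed. The sample is simulated for illustration; `envelope = 0.95` requests a 95% pointwise envelope.

```r
# Simulated sample (illustrative only)
set.seed(42)
x <- rnorm(30, mean = 70, sd = 10)

# car is not part of base R; install it once with install.packages("car")
if (requireNamespace("car", quietly = TRUE)) {
  # Normal QQ plot with a 95% pointwise confidence envelope
  car::qqPlot(x, distribution = "norm", envelope = 0.95,
              ylab = "x", main = "QQ Plot with band")
} else {
  message("Package 'car' is not installed; skipping the plot.")
}
```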

QQ Plot with band

Multivariate Data

Scatter Plot Matrix in R

Scatter Plots

Covariance in R

Covariance Matrix

Correlation

Correlation Matrix
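A minimal sketch of the scatter plot matrix, covariance matrix and correlation matrix in R. It uses three columns of the built-in `mtcars` data set as a stand-in; the choice of data set and columns is an assumption made for illustration.

```r
# Built-in data set; the three columns are an arbitrary illustrative choice
vars <- mtcars[, c("mpg", "hp", "wt")]

pairs(vars, main = "Scatter Plot Matrix")  # matrix of pairwise scatter plots

S <- cov(vars)   # covariance matrix (units depend on the variables)
R <- cor(vars)   # correlation matrix (unitless, entries between -1 and 1)

print(round(S, 2))
print(round(R, 2))
```

The correlation matrix is the covariance matrix rescaled by the standard deviations, so `cov2cor(S)` reproduces `R`.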

Chapter 7 Overview

Chapter 7 contains a detailed explanation of point estimates for parameters
Much of this chapter is of a highly statistical nature and will not be covered in this course
Key concepts we will discuss are:
- Statistical inference
- Statistic
- Sampling distribution
- Point estimator
- Unbiased estimate
- Minimum variance unbiased estimator (MVUE)
- Central limit theorem

Statistical Inference

Montgomery gives the following description of statistical inference.

The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions. This chapter begins our study of the statistical methods used for inference and decision making.
Statistical inference may be divided into two major areas: parameter estimation and hypothesis testing

Point Estimate

Montgomery states that "In practice, the engineer will use sample data to compute a number that is in some sense a reasonable value (or guess) of the true mean. This number is called a point estimate."
Discuss examples
A formal definition of a point estimate is

A point estimate of some population parameter $\theta$ is a single numerical value $\hat{\theta}$ of a statistic $\hat{\Theta}$. The statistic $\hat{\Theta}$ is called the point estimator.
Notice the use of the "hat" notation to denote a point estimate

Statistic

A point estimate requires a random sample of observations, say $X_1,X_2,\ldots,X_n$
Any function of the sampled random variables is called a statistic
The function of the random variables is itself a random variable
Thus, the sample mean $\overline{X}$ and the sample variance $S^2$ are both statistics and random variables

Properties of point estimators

We would like point estimates to be both accurate and precise
An unbiased estimator addresses the accuracy criterion
A minimum variance unbiased estimator addresses the precision criterion

Unbiased Estimator

The point estimator $\hat{\Theta}$ is an unbiased estimator for the parameter $\theta$ if

\[ E\left(\hat{\Theta}\right)=\theta \]

If the point estimator is not unbiased, then the difference

\[ E\left(\hat{\Theta}\right)-\theta \]

is called the bias of the estimator $\hat{\Theta}$
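A small simulation can illustrate bias. The sketch below compares the sample variance with divisor $n-1$ to the version with divisor $n$ for normal data; the sample size, number of replications and true variance are arbitrary assumptions.

```r
set.seed(123)
n <- 5; reps <- 20000; sigma2 <- 4   # simulation settings (assumptions)

# Draw reps samples of size n from N(0, sigma2), one sample per column
samples <- matrix(rnorm(n * reps, mean = 0, sd = sqrt(sigma2)), nrow = n)

s2_unbiased <- apply(samples, 2, var)      # divisor n - 1
s2_biased   <- s2_unbiased * (n - 1) / n   # divisor n

mean_unbiased <- mean(s2_unbiased)         # should be close to sigma2
mean_biased   <- mean(s2_biased)           # close to (n - 1) / n * sigma2
bias_estimate <- mean_biased - sigma2      # estimated bias, about -sigma2 / n

print(c(unbiased = mean_unbiased, biased = mean_biased, bias = bias_estimate))
```

Averaged over many samples, the divisor $n-1$ estimator centers on $\sigma^2$, while the divisor $n$ estimator centers below it.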

MVUE

Montgomery gives the following definition of a minimum variance unbiased estimator (MVUE)

If we consider all unbiased estimators of $\theta$, the one with the smallest variance is called the minimum variance unbiased estimator
An important fact is that the sample mean $\overline{X}$ is the MVUE for $\mu$ when the data come from a normal distribution
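The sketch below compares the sample mean with one other unbiased estimator of $\mu$ for normal data, the sample median. It is a simulation under assumed settings, so it illustrates the MVUE property against a single competitor and does not prove it.

```r
set.seed(123)
n <- 25; reps <- 10000   # simulation settings (assumptions)

# Draw reps samples of size n from N(10, 1), one sample per column
samples <- matrix(rnorm(n * reps, mean = 10, sd = 1), nrow = n)

means   <- colMeans(samples)           # sample mean of each sample
medians <- apply(samples, 2, median)   # sample median of each sample

print(c(mean_of_means = mean(means), mean_of_medians = mean(medians)))
print(c(var_of_means = var(means), var_of_medians = var(medians)))
```

Both estimators center on $\mu = 10$, but the sample mean varies less from sample to sample.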

Accuracy vs. Precision

graph of accuracy vs. precision

Sampling Distribution

The probability distribution of a statistic is called a sampling distribution

Central Limit Theorem

The Central Limit Theorem states:

If $X_1,X_2,\ldots,X_n$ is a random sample of size $n$ taken from a population (either finite or infinite) with mean $\mu$ and finite variance $\sigma^2$, and if $\overline{X}$ is the sample mean, the limiting form of the distribution of

\[ Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \]

as $n\rightarrow\infty$, is the standard normal distribution

This is an important result because, for sufficiently large $n$, the sampling distribution of $\overline{X}$ is approximately normal, whatever the shape of the population distribution
This is a fundamental result that will be used extensively in the next four chapters of the textbook.
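A minimal simulation sketch of the Central Limit Theorem. It draws samples from an exponential population, which is strongly skewed, and standardizes the sample means; the population, sample size and number of replications are assumptions chosen for illustration.

```r
set.seed(2024)
n <- 30; reps <- 10000   # simulation settings (assumptions)
rate  <- 1
mu    <- 1 / rate        # mean of the exponential population
sigma <- 1 / rate        # standard deviation of the exponential population

# Draw reps samples of size n, one sample per column
samples <- matrix(rexp(n * reps, rate = rate), nrow = n)
xbar <- colMeans(samples)

# Standardize each sample mean as in the theorem
z <- (xbar - mu) / (sigma / sqrt(n))

# Compare the simulated distribution of Z with the standard normal density
hist(z, breaks = 40, freq = FALSE, xlab = "z",
     main = "Standardized sample means")
curve(dnorm(x), add = TRUE)
```

Repeating the simulation with a smaller $n$, such as 5, shows a visibly skewed histogram, which gives a feel for what "sufficiently large" means for this population.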