MANE 3332.05
Lecture 18
Agenda
- Midterm exams are not yet graded; still contacting students who missed the exam
- Linear Combination Practice Problems (assigned 10/28, due 10/30)
- Linear Combination Quiz (assigned 10/30, due 11/4)
- Complete Chapter Six and Start Chapter 7
- Attendance
- Questions?
Handouts
Chapter 6, continued
Calculating Quantiles

Quantile Example

Exploratory Data (Graphical) Analysis
- Exploratory data analysis (EDA) is the use of graphical procedures to analyze data.
- John Tukey was a pioneer in this field and invented several of the procedures
- Tools include stem-and-leaf diagrams, box plots, time series plots and digidot plots
Stem and Leaf Diagram
- Excellent tool that maintains data integrity
- The stem is the leading digit or digits
- The leaf is the remaining digit
- Make sure to include units
- R code
stem(midterm$MidtermExam)
Stem and Leaf Example
- R output of a Stem and Leaf diagram

Histogram
- A histogram is a bar chart displaying the frequency distribution of the data
- There are three types of histograms: frequency, relative frequency and cumulative relative frequency
- R code
hist(midterm$MidtermExam)
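The three histogram types listed above can be sketched in R as follows. This is only an illustration: the vector scores is simulated, hypothetical data standing in for midterm$MidtermExam, and the bins are whatever hist() chooses by default.

```r
# Hypothetical exam scores (simulated); the lecture's midterm data are not reproduced here.
set.seed(1)
scores <- round(rnorm(40, mean = 75, sd = 10))

# Bin the data once without plotting, then derive the three summaries.
h <- hist(scores, plot = FALSE)
freq         <- h$counts            # frequency
rel_freq     <- freq / sum(freq)    # relative frequency
cum_rel_freq <- cumsum(rel_freq)    # cumulative relative frequency

# Draw the three versions using the same bins (bars labelled by bin midpoint).
barplot(freq, names.arg = h$mids, space = 0,
        xlab = "Score", ylab = "Frequency")
barplot(rel_freq, names.arg = h$mids, space = 0,
        xlab = "Score", ylab = "Relative frequency")
barplot(cum_rel_freq, names.arg = h$mids, space = 0,
        xlab = "Score", ylab = "Cumulative relative frequency")
```

The relative frequencies sum to 1, and the cumulative relative frequency rises to 1 in the last bin.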
Histogram Example
- R output of histogram

Boxplot
- Graphical display that simultaneously describes several important features of a data set such as center, spread, departure from symmetry and outliers
- Requires the calculation of quantiles (quartiles)
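A minimal sketch of the quartile calculations behind a box plot, assuming the common 1.5 x IQR rule for flagging outliers. The scores vector is simulated, hypothetical data with one unusually low score added. Note that boxplot() uses hinges from fivenum(), which can differ slightly from the default quantile() values.

```r
# Hypothetical exam scores (simulated) plus one unusually low score.
set.seed(1)
scores <- c(round(rnorm(40, mean = 75, sd = 10)), 20)

# Quartiles: the first quartile, the median and the third quartile.
q <- quantile(scores, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]                      # interquartile range (spread)

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(q)
print(outliers)
boxplot(scores, horizontal = TRUE, xlab = "Score")
```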
Box Plot 1

Box Plot 2

Box Plot 3
- R code for Box Plot
boxplot(midterm$MidtermExam,xlab='Score',main='Boxplot of Midterm Exam Scores')
- R Box Plot output

Time Series Plot
- A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say \(x\)) and the horizontal axis denotes time
- Excellent tool for detecting:
  - trends,
  - cycles,
  - other non-random patterns
Time Series Plot in R
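One possible way to draw a time series plot in R is sketched below. The series is simulated, hypothetical data built with an upward trend and a 12-period cycle so that both patterns are visible; it is not the data from the lecture.

```r
# Hypothetical series (simulated): linear trend + 12-period cycle + random noise.
set.seed(2)
time_index <- 1:60
x <- 50 + 0.3 * time_index + 5 * sin(2 * pi * time_index / 12) + rnorm(60, sd = 2)

# Store as a time series object and plot observed value against time.
series <- ts(x, start = 1)
plot(series, type = "o", xlab = "Time", ylab = "Observed value x",
     main = "Time series plot (simulated data)")
```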

Probability Plotting
- Probability plotting is a graphical method of determining whether sample data conform to a hypothesized distribution
- Used for validating assumptions
- Alternative to hypothesis testing
Construction
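- Sort the data from smallest to largest: \(x_{(1)},x_{(2)},\ldots,x_{(n)}\)
- Calculate the observed cumulative frequency \((j-0.5)/n\) for \(j=1,2,\ldots,n\)
- For the normal distribution, find \(z_j\) that satisfies $$ \frac{j-0.5}{n} = P(Z \le z_j) = \Phi(z_j) $$
- Plot \(z_j\) versus \(x_{(j)}\) on ordinary axes (equivalently, plot \((j-0.5)/n\) versus \(x_{(j)}\) on normal probability paper)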
Usage
- If the data plot as an approximately straight line, the assumed distribution is a reasonable model for the data
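The construction steps can be carried out by hand in R, as in the sketch below. The sample x is simulated, hypothetical data; qnorm() is used to solve for \(z_j\), and qqline() with datax = TRUE adds a reference line through the quartiles.

```r
# Hypothetical sample (simulated from a normal distribution).
set.seed(3)
x <- rnorm(20, mean = 100, sd = 15)
n <- length(x)

x_sorted <- sort(x)            # step 1: x(1) <= x(2) <= ... <= x(n)
j <- seq_len(n)
p <- (j - 0.5) / n             # step 2: observed cumulative frequency
z <- qnorm(p)                  # step 3: z_j such that P(Z <= z_j) = (j - 0.5)/n

# Step 4: plot z_j versus x(j); roughly straight points support normality.
plot(x_sorted, z, xlab = "x(j)", ylab = "z(j)",
     main = "Normal probability plot (simulated data)")
qqline(x, datax = TRUE)        # reference line through the quartiles
```

For samples with more than 10 observations, the built-in qqnorm(x, datax = TRUE) uses the same plotting positions.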

Probability Plot Example 1 in R

Probability Plot Example 2
- The difficulty with example one is deciding how close to straight is "good enough"
- Add confidence bands to the normal probability plot
- Requires the car package to be installed in R
- If all points fall within the band, the data are consistent with a normal distribution at the 95% confidence level. However, if one or more points fall outside the band, the normality assumption is questionable
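A sketch of a normal probability plot with a 95% confidence band, assuming the qqPlot() function from the car package is the intended tool. The sample x is simulated, hypothetical data, and the call is guarded so the script still runs when car is not installed.

```r
# Hypothetical sample (simulated from a normal distribution).
set.seed(4)
x <- rnorm(30, mean = 100, sd = 15)

if (requireNamespace("car", quietly = TRUE)) {
  # envelope = 0.95 requests a 95% confidence band around the reference line.
  car::qqPlot(x, distribution = "norm", envelope = 0.95,
              ylab = "Sample values")
} else {
  message("Package 'car' is not installed; run install.packages(\"car\") first.")
}
```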

Multivariate Data
Matrix of Scatter Plots in R

Covariance in R

Correlation
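The R output for these multivariate slides is not reproduced here, so the following is a sketch using R's built-in mtcars data set rather than the lecture's data. It draws a scatter plot matrix with pairs(), then computes the sample covariance and correlation matrices with cov() and cor().

```r
# Built-in data set; the choice of columns is arbitrary and for illustration only.
vars <- mtcars[, c("mpg", "hp", "wt")]

# Matrix of scatter plots: one panel per pair of variables.
pairs(vars, main = "Scatter plot matrix (mtcars)")

S <- cov(vars)   # sample covariance matrix
R <- cor(vars)   # sample correlation matrix

# Correlation is covariance rescaled by the two standard deviations:
# r_xy = s_xy / (s_x * s_y)
s <- sqrt(diag(S))
R_from_S <- S / (s %o% s)

print(round(S, 2))
print(round(R, 2))
```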

Chapter 7 Overview
- Chapter 7 contains a detailed explanation of point estimates for parameters
- Much of this chapter is of a highly statistical nature and will not be covered in this course
- Key concepts we will discuss are:
  - Statistical inference
  - Statistic
  - Sampling distribution
  - Point estimator
  - Unbiased estimator
  - Minimum variance unbiased estimator (MVUE)
  - Central limit theorem
Statistical Inference
- Montgomery gives the following description of statistical inference:
"The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions. This chapter begins our study of the statistical methods used for inference and decision making."
- Statistical inference may be divided into two major areas: parameter estimation and hypothesis testing
Point Estimate
- Montgomery states that "In practice, the engineer will use sample data to compute a number that is in some sense a reasonable value (or guess) of the true mean. This number is called a point estimate."
- Discuss examples
- A formal definition of a point estimate is
"A point estimate of some population parameter \(\theta\) is a single numerical value \(\hat{\theta}\) of a statistic \(\hat{\Theta}\). The statistic \(\hat{\Theta}\) is called the point estimator."
- Notice the use of the "hat" notation to denote a point estimate
Statistic
- A point estimate requires a sample of random observations, say \(X_1,X_2,\ldots,X_n\)
- Any function of the sampled random variables is called a statistic
- A function of random variables is itself a random variable
- Thus, the sample mean \(\overline{X}\) and the sample variance \(S^2\) are both statistics and random variables
Properties of point estimators
- We would like point estimates to be both accurate and precise
- An unbiased estimator addresses the accuracy criterion
- A minimum variance unbiased estimator addresses the precision criterion
Unbiased Estimator
- The point estimator \(\hat{\Theta}\) is an unbiased estimator for the parameter \(\theta\) if $$ E(\hat{\Theta}) = \theta $$
- If the point estimator is not unbiased, then the difference $$ E(\hat{\Theta}) - \theta $$ is called the bias of the estimator \(\hat{\Theta}\)
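Bias can be illustrated by simulation. The sketch below uses a standard example that is not taken from the slides: the sample variance with divisor \(n-1\) is unbiased for \(\sigma^2\), while the version with divisor \(n\) has bias \(-\sigma^2/n\). The population values (normal, mean 10, variance 4) are arbitrary assumptions.

```r
# Assumed population and simulation settings (illustrative only).
set.seed(5)
n      <- 5        # sample size
sigma2 <- 4        # true population variance
reps   <- 20000    # number of simulated samples

# Each row of the matrix is one random sample of size n.
samples <- matrix(rnorm(n * reps, mean = 10, sd = sqrt(sigma2)), nrow = reps)

s2_unbiased <- apply(samples, 1, var)        # divisor n - 1
s2_biased   <- s2_unbiased * (n - 1) / n     # divisor n

mean_unbiased <- mean(s2_unbiased)   # approximates E(S^2), close to sigma2
mean_biased   <- mean(s2_biased)     # close to sigma2 * (n - 1) / n

# Estimated bias = (average of the estimator) - (true parameter value).
c(bias_unbiased = mean_unbiased - sigma2,
  bias_biased   = mean_biased - sigma2,
  theory_biased = -sigma2 / n)
```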
MVUE
- Montgomery gives the following definition of a minimum variance unbiased estimator (MVUE):
"If we consider all unbiased estimators of \(\theta\), the one with the smallest variance is called the minimum variance unbiased estimator"
- An important fact is that the sample mean \(\overline{X}\) is the MVUE for \(\mu\) when the data come from a normal distribution
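A small simulation can make the minimum variance idea concrete. The sketch below compares the sample mean with one other unbiased estimator of \(\mu\) for normal data, the sample median. It does not prove the MVUE property, since it examines only two estimators, and the population values are arbitrary assumptions.

```r
# Assumed normal population and simulation settings (illustrative only).
set.seed(6)
n    <- 25
reps <- 5000
mu   <- 50

# Each row is one random sample of size n.
samples <- matrix(rnorm(n * reps, mean = mu, sd = 10), nrow = reps)

means   <- rowMeans(samples)            # sample mean of each sample
medians <- apply(samples, 1, median)    # sample median of each sample

# Both estimators are centred near mu (accuracy) ...
c(avg_mean = mean(means), avg_median = mean(medians))
# ... but the sample mean varies less from sample to sample (precision).
c(var_mean = var(means), var_median = var(medians))
```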
Accuracy vs. Precision

Sampling Distribution
- The probability distribution of a statistic is called a sampling distribution
Central Limit Theorem
- The Central Limit Theorem states:
"If \(X_1,X_2,\ldots,X_n\) is a random sample of size \(n\) taken from a population (either finite or infinite) with mean \(\mu\) and finite variance \(\sigma^2\), and if \(\overline{X}\) is the sample mean, the limiting form of the distribution of $$ Z = \frac{\overline{X}-\mu}{\sigma/\sqrt{n}} $$ as \(n\rightarrow\infty\), is the standard normal distribution"
- Important result because, for sufficiently large \(n\), the sampling distribution of \(\overline{X}\) is approximately normal
- This is a fundamental result that will be used extensively in the next four chapters of the textbook.
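The theorem can be checked numerically. The sketch below draws repeated samples from a clearly non-normal population (exponential with rate 1, so \(\mu = 1\) and \(\sigma = 1\)), standardizes each sample mean, and overlays the standard normal density. The population, sample size and number of replications are assumptions chosen for illustration.

```r
# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
set.seed(7)
mu    <- 1
sigma <- 1
n     <- 40       # sample size
reps  <- 10000    # number of simulated samples

# Each row is one random sample of size n from the skewed population.
samples <- matrix(rexp(n * reps, rate = 1), nrow = reps)
xbar <- rowMeans(samples)

# Standardize the sample means: Z = (Xbar - mu) / (sigma / sqrt(n)).
z <- (xbar - mu) / (sigma / sqrt(n))

# The histogram of Z should resemble the standard normal density.
hist(z, freq = FALSE, breaks = 40, xlab = "z",
     main = "Standardized sample means (simulated)")
curve(dnorm(x), add = TRUE, lwd = 2)

c(mean_z = mean(z), sd_z = sd(z))
```

Repeating the run with a smaller n (say 5) shows visible skewness in the histogram, which is why the approximation calls for a sufficiently large sample.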