Collaborative Statistics: Glossary

Module by: Dr. Barbara Illowsky, Susan Dean

Summary: This module contains a number of glossary terms related to elementary statistics. This module represents the combined glossary information for the Collaborative Statistics textbook/module (col10522).

Note: You are viewing an old version of this document. The latest version is available here.

Glossary

Addition Rule:
For any events $A$ and $B$ in the sample space, $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.
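
As a minimal sketch of the addition rule, the snippet below checks it on one roll of a fair six-sided die. The die and the events "even" and "greater than 3" are illustrative assumptions, not part of the glossary.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # hypothetical event "the roll is even"
B = {4, 5, 6}  # hypothetical event "the roll is greater than 3"

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A or B) computed directly from the union of the two events.
p_union_direct = prob(A | B)
# P(A or B) computed with the addition rule.
p_union_rule = prob(A) + prob(B) - prob(A & B)
```

Both computations give the same value because the outcomes counted twice in P(A) + P(B) are exactly those in "A and B".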
Analysis of Variance:
Also referred to as ANOVA. A method of testing whether or not the means of three or more populations are equal. The method is applicable if:
  • All populations of interest are normally distributed.
  • The populations have equal standard deviations.
  • Samples (not necessarily of the same size) are randomly and independently selected from each population.
The test statistic for analysis of variance is the F-ratio.
AND:
Logical operation over the subsets of a set. In statistics, if $A$ and $B$ are any two events (subsets of the sample space), then the event “$A$ and $B$” consists of all outcomes that are common to both $A$ and $B$.
Arithmetic Mean:
The sum of the values divided by the number of values. The notation for the mean of a sample is $\bar{x}$. The notation for the mean of a population is $\mu$.
Average:
A number that describes the central tendency of the data. There are a number of specialized averages, including the arithmetic mean, weighted mean, median, mode, and geometric mean.
Bayes' Theorem:
Developed by Reverend Bayes in the 1700s. A rule for finding the probability of one event given that another has occurred: for an event $A$ and a finite set of other events $\{B_i, i = 1, 2, \ldots, l\}$, it expresses $P(B_i \mid A)$ in terms of $P(A \mid B_i)$ and $P(B_i)$.
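
The sketch below illustrates Bayes' Theorem in its usual form, $P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{\sum_j P(A \mid B_j) P(B_j)}$, assuming the events $B_i$ are mutually exclusive and exhaustive. The function name `bayes_posterior` and the numbers are hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(B_i | A) for each i, given P(B_i) and P(A | B_i).

    Assumes the B_i are mutually exclusive and cover the sample space.
    """
    # Joint probabilities P(A and B_i) = P(A | B_i) * P(B_i).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # P(A), obtained by summing over all B_i.
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical numbers: B_1 has prior 0.01 and B_2 has prior 0.99;
# A occurs with probability 0.95 under B_1 and 0.05 under B_2.
priors = [0.01, 0.99]
likelihoods = [0.95, 0.05]
posteriors = bayes_posterior(priors, likelihoods)
```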
Bernoulli Trials:
An experiment with the following characteristics:
  • There are only 2 possible outcomes called “success” and “failure” for each trial.
  • The probabilities $p$ of success and $q = 1 - p$ of failure are the same for every trial.
Bias:
A possible consequence if certain members of the population are denied the chance to be selected for the sample.
Binomial Distribution:
A discrete random variable (RV) that arises from Bernoulli trials with the following additional requirements. There is a fixed number, $n$, of independent trials. “Independent” means that the result of any trial (for example, trial 1) in no way affects the results of the following trials, and all trials are conducted under the same conditions. Under these circumstances the binomial RV $X$ is defined as the number of successes in $n$ trials. The notation is $X \sim B(n, p)$; the domain is $\{0, 1, 2, \ldots, n\}$; the mean is $\mu = np$, and the variance is $\sigma^2 = npq$. The probability of exactly $x$ successes in $n$ trials is $P(X = x) = \binom{n}{x} p^x q^{n-x}$.
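
A short sketch of the binomial probability formula follows. It also recomputes the mean and variance from the probabilities so they can be compared with $np$ and $npq$. The values $n = 10$ and $p = 0.3$ are arbitrary illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
```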
Central Limit Theorem:
Given a random variable (RV) with known mean $\mu$ and known variance $\sigma^2$, we sample with size $n$ and are interested in two new RVs: the sample mean, $\bar{X}$, and the sample sum, $\Sigma X$. If the size $n$ of the sample is sufficiently large, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ and $\Sigma X \sim N(n\mu, n\sigma^2)$. In words, if the sample size $n$ is sufficiently large, then the distribution of the sample means and the distribution of the sample sums will approximate a normal distribution regardless of the shape of the population. Moreover, the mean of the sample means will equal the population mean, and the mean of the sample sums will equal $n$ times the population mean. The standard deviation of the distribution of the sample means, $\frac{\sigma}{\sqrt{n}}$, is called the standard error of the mean.
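
The simulation below is a rough sketch of the theorem, not a proof. It assumes a skewed population (exponential with mean 1 and standard deviation 1), a sample size of 40, and 5000 repeated samples; all three are arbitrary illustrative choices, and the seed is fixed only for repeatability.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 40                   # assumed sample size
num_samples = 5000       # assumed number of repeated samples

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be near mu = 1
sd_of_means = statistics.pstdev(sample_means)   # should be near sigma / sqrt(n)
standard_error = 1.0 / n ** 0.5                 # theoretical standard error
```

Under these assumptions the observed spread of the sample means should land close to the standard error of the mean, even though the population itself is far from normal.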
Charts:
Special graphical formats used to visualize a frequency distribution. They include, but are not limited to: histograms, frequency polygons, cumulative frequency polygons, box plots, stemplots, bar charts, Venn and tree diagrams, and pie charts. Some of them, together with explanations of which kind of chart best fits a given situation, can be found in Descriptive Statistics.
Class Mark:
Midpoint of the class.
Chi-square Distribution:
A distribution with the following characteristics:
  • The random variable (RV) is continuous and takes only nonnegative values (in fact, it is a sum of squares of independent standard normal random variables).
  • There is a "family" of chi-square distributions. Each member of the family is completely defined by its number of degrees of freedom, $df$; in a goodness-of-fit test, $df = k - 1$, where $k$ is the number of categories (not the sample size).
  • The pdf is positively skewed; however, as $df$ increases ($df > 90$), the distribution begins to approximate the normal distribution.
The notation is $X \sim \chi^2_{df}$; the mean is $\mu = df$; the variance is $\sigma^2 = 2\,df$. The chi-square distribution is used to calculate the test statistic in the Goodness-of-Fit Test (to determine if a population follows a specified distribution), the Test of Independence (to determine if two factors are or are not related), and the Test for a Single Variance.
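
As a hedged illustration of the goodness-of-fit use, the sketch below computes the usual test statistic $\sum \frac{(O - E)^2}{E}$ over the categories. This formula is standard but is not spelled out in the glossary, and the observed and expected counts are made up.

```python
# Hypothetical observed counts in k = 5 categories, with equal expected counts.
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]

# Chi-square test statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for a goodness-of-fit test: categories minus one.
df = len(observed) - 1
```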
Class:
The interval into which data are grouped. It is convenient to group outcomes into classes when working with a large mass of data, particularly when the data are continuous, because grouped data are easier to visualize. For example, every bar in a histogram corresponds to one class, and the midpoint of the interval can be chosen to represent all outcomes in the class. The midpoint of the class is often called a class mark.
Cluster Sampling:
A procedure used if the population is dispersed over a wide geographic area. The area is divided in some way into smaller units (counties, precincts, blocks, etc.) called primary units. Then a few primary units are chosen, and a random sample is selected from each unit.
Coefficient of Correlation:
A measure developed by Karl Pearson (early 1900s) that gives the strength of association between the independent variable and the dependent variable. The formula is:
$r = \frac{n \sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}}$, (1)
where $n$ is the number of data points. The coefficient cannot be more than 1 or less than -1. The closer the coefficient is to $\pm 1$, the stronger the evidence of a significant linear relationship between $X$ and $Y$.
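
Formula (1) can be sketched directly in code. The function name `correlation` and the five data points are illustrative assumptions.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r computed term by term from formula (1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data points.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
r = correlation(X, Y)

# Points lying exactly on an increasing line give r = 1.
r_perfect = correlation([1, 2, 3], [2, 4, 6])
```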
Cumulative Distribution Function (CDF):
Given a quantitative random variable (RV) $X$ [that is, given ($X$, PDF) for a discrete RV or ($X$, pdf) for a continuous RV], consider, for every $x$ in the domain of $X$, the event {all outcomes that are less than or equal to $x$}. The function $P(X \le x)$ is called the cumulative distribution function.
Cumulative Relative Frequency:
The concept applies to a set of observations ordered from smallest to largest, or vice versa. Cumulative relative frequency is the sum of the relative frequencies for all values that are less than or equal to the given value.
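
A small sketch of the computation, using a made-up list of seven observations: relative frequencies are accumulated over the values in increasing order.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

data = [1, 2, 2, 3, 3, 3, 4]  # hypothetical observations

counts = Counter(data)
values = sorted(counts)  # distinct values, smallest to largest

# Relative frequency of each distinct value.
relative = [Fraction(counts[v], len(data)) for v in values]

# Running total of the relative frequencies.
cumulative = list(accumulate(relative))
cum_rel_freq = dict(zip(values, cumulative))
```

The cumulative relative frequency of the largest value is always 1, since every observation is less than or equal to it.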
Complement Event:
The event consisting of all outcomes that are not in the given event.
Conditional Probability:
The likelihood that an event will occur given that another event has already occurred.
Confidence Interval:
An interval estimate for an unknown population parameter. It depends on:
  • The desired confidence level.
  • What is known about the distribution (for example, a known variance).
  • The information gathered from the sample.
Confidence Level:
The probability, expressed as a percent, that the confidence interval contains the true population parameter. For example, if CL = 90%, then in about 90 out of 100 samples the interval estimate will enclose the true population parameter.
Contingency Table:
A method of displaying a frequency distribution for dependent (contingent) variables; the table provides an easy way to calculate conditional probabilities.
Continuous RV:
A random variable (RV) with continuous domain.

Example:

The height of trees in the forest is a continuous RV.

Correlation Analysis:
A group of statistical procedures used to measure the strength of the relationship between two variables.
Counting Principle:
If there are $m$ ways of doing one thing and $n$ ways of doing another, then there are $m \times n$ ways of doing both.

Example:

A cafe offers $m = 5$ kinds of coffee and $n = 7$ kinds of cake. There are 35 ways to serve coffee with cake.
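
The cafe example can be checked by listing every pairing. The item names below are placeholders invented for the sketch.

```python
from itertools import product

# Placeholder names for the 5 coffees and 7 cakes.
coffees = [f"coffee_{i}" for i in range(1, 6)]
cakes = [f"cake_{j}" for j in range(1, 8)]

# Every way to serve one coffee with one cake.
pairings = list(product(coffees, cakes))
num_ways = len(pairings)
```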

Critical Value:
The dividing point between the region where the null hypothesis is not rejected and the region where it is rejected. For a one-tailed test, there is only one critical value; for a two-tailed test, there are two critical values—one in each tail— with the same absolute value and opposite signs.
Data:
A set of observations (a set of possible outcomes). Most data can be put into two groups: qualitative (hair color, ethnic groups, and many other attributes of the population) and quantitative (distance traveled to college, number of children in a family, etc.). Quantitative data can in turn be separated into two subgroups: discrete and continuous. Roughly speaking, data are discrete if they are the result of counting (the number of students of a given ethnic group in a class, the number of books on a shelf, etc.), and data are continuous if they are the result of measuring (distance traveled, weight of luggage, etc.).
Degrees of Freedom (df):
The number of objects in a sample that are free to vary.
Dependent Samples:
Samples chosen from several populations in such a way that they are not independent of each other. Paired samples are dependent because the same individual or item is a member of both samples.

Example:

If the test scores of 13 individuals were recorded before a new teaching method was introduced, and then after using the new method, the two paired samples would be considered dependent.

Descriptive Statistics:
Methods for describing the important characteristics of data; for example, charts, frequency distributions, measures of central tendency, and measures of spread and skewness.
Discrete RV:
A random variable (RV) that can assume only a countable set of values.

Example:

  • The face values of a six-sided die: $\{1, 2, 3, 4, 5, 6\}$.
  • The number of accidents on HW280 during the Thanksgiving holidays.

Domain:
The set of possible values for an (independent) variable. The domain is a very important part of the definition of a function. For example, the equation $y = x^2$ defines a one-to-one function if the domain is the set of nonnegative real numbers, and a function that is not one-to-one if the domain is the set of all real numbers.

Example:

  • We are interested in the longevity of human life in years; the domain is $\{0, 1, 2, 3, \ldots, 120\}$.
  • We are interested in the suit when dealing from a regular deck; the domain is {spades, hearts, diamonds, clubs}.

Equally Likely:
Each outcome of an experiment has the same probability.
Error Bound for a Population Mean (EBM):
The margin of error. Depends on the confidence level, sample size, and known or estimated population standard deviation.
Error Bound for a Proportion (EBP):
The margin of error. Depends on the confidence level, sample size, and the estimated (from the sample) proportion of success.
Event:
A subset in the set of all outcomes of an experiment. The set of all outcomes of an experiment is called a sample space and denoted, as a rule, by S. An event is any arbitrary subset in S: it can contain one outcome, two outcomes, and even no outcomes (empty subset) or all of them (sample space). Standard notations for events are capital letters such as A, B, C, etc.
Exhaustive:
Each outcome must appear in one class (category).
Expected Value:
The expected arithmetic average when an experiment is repeated many times (also called the mean). Notations: $E(X)$, $\mu$. For a discrete random variable (RV) with probability distribution function $P(x) = P(X = x)$, the definition can also be written in the form $E(X) = \mu = \sum x P(x)$.
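
A minimal sketch of $E(X) = \sum x P(x)$, assuming the RV is the face value of a fair six-sided die (an illustrative choice).

```python
from fractions import Fraction

# Assumed distribution: each face of a fair die has probability 1/6.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum of x * P(x) over the domain.
expected_value = sum(x * p for x, p in distribution.items())
```

The result, 7/2, is not itself a possible outcome; it is the long-run average of many rolls.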
Exponential Distribution:
A continuous random variable (RV) that appears when we are interested in the intervals of time between random events, for example, the length of time between emergency arrivals at a hospital. Notation: $X \sim Exp(m)$; the mean is $\mu = \frac{1}{m}$, and the variance is $\sigma^2 = \frac{1}{m^2}$; the probability density function is $f(x) = m e^{-mx}$, $x \ge 0$, and the cumulative distribution function is $P(X \le x) = 1 - e^{-mx}$.
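
The formulas above translate into a short sketch. The rate $m = 0.5$ and the helper names `exp_pdf` and `exp_cdf` are assumptions made for illustration.

```python
from math import exp

def exp_pdf(x, m):
    """Density f(x) = m * e^(-m x) for x >= 0, and 0 otherwise."""
    return m * exp(-m * x) if x >= 0 else 0.0

def exp_cdf(x, m):
    """P(X <= x) = 1 - e^(-m x) for x >= 0, and 0 otherwise."""
    return 1 - exp(-m * x) if x >= 0 else 0.0

m = 0.5               # hypothetical rate parameter
mean = 1 / m          # mu = 1/m
variance = 1 / m**2   # sigma^2 = 1/m^2

# Probability that the waiting time is at most 2 time units.
p_within_2 = exp_cdf(2, m)
```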
Experiment:
A planned activity carried out under controlled conditions.
F Distribution:
Developed by Sir Ronald Fisher. A distribution with the following characteristics:
  • The random variable (RV) is a ratio (called the F-ratio) of two sums of weighted squares, so it is continuous and takes only nonnegative values.
  • The pdf is positively skewed and approaches the x-axis without ever touching it.
  • There is a "family" of F distributions.
Every member of the family is completely defined by two parameters: the number of degrees of freedom in the numerator of the F-ratio and the number of degrees of freedom in the denominator of the F-ratio. It is used to calculate the test statistic when testing two population variances and in ANOVA problems.
Frequency Distribution:
A grouping of data into mutually exclusive classes showing the number of outcomes in each class.
Frequency:
The number of times a value occurs in the set of all data.
Geometric Distribution:
A discrete random variable (RV) that arises from Bernoulli trials with the following additional requirement: we keep repeating trials until the first success. Under these circumstances the geometric RV $X$ is defined as the number of trials until the first success. The notation is $X \sim G(p)$; the domain is $\{1, 2, 3, \ldots\}$; the mean is $\mu = \frac{1}{p}$, and the variance is $\sigma^2 = \frac{1}{p}\left(\frac{1}{p} - 1\right)$. The probability that the first success occurs on trial $x$ (that is, after exactly $x - 1$ failures) is given by the formula $P(X = x) = p(1 - p)^{x-1}$.
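
The sketch below evaluates the geometric probability formula and approximates the mean and variance by summing over the first 2000 trials. Both the success probability $p = 0.25$ and the cutoff of 2000 are arbitrary assumptions; the terms beyond the cutoff are negligibly small for this $p$.

```python
def geometric_pmf(x, p):
    """P(X = x) = p * (1 - p)^(x - 1): first success occurs on trial x."""
    return p * (1 - p) ** (x - 1)

p = 0.25       # hypothetical probability of success
cutoff = 2000  # truncation point for the infinite sums (an assumption)

trials = range(1, cutoff + 1)
total = sum(geometric_pmf(x, p) for x in trials)
mean = sum(x * geometric_pmf(x, p) for x in trials)
variance = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in trials)
```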
Geometric Mean:
The $n$th root of the product of the $n$ values.
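
A minimal sketch of this definition, assuming all values are positive; the function name and sample values are illustrative.

```python
from math import prod

def geometric_mean(values):
    """nth root of the product of the n values (values assumed positive)."""
    n = len(values)
    return prod(values) ** (1 / n)

gm_two = geometric_mean([2, 8])       # square root of 16
gm_three = geometric_mean([1, 3, 9])  # cube root of 27
```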
Hypergeometric Probability:
A discrete random variable (RV) with the following characteristics:
  • There is a fixed number of trials.
  • The probability of success is not the same from trial to trial, so the trials are not Bernoulli trials.
The typical example is sampling from a mixture of two groups of items when we are interested in only one of the groups. $X$ is defined as the number of successes out of the total number chosen. The notation is $X \sim H(r, b, n)$, where $r$ = number of items in the group of interest, $b$ = number of items in the group not of interest, and $n$ = number of items chosen.
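
The glossary does not state the probability formula, so the sketch below uses the standard one for sampling without replacement, $P(X = x) = \frac{\binom{r}{x}\binom{b}{n-x}}{\binom{r+b}{n}}$. The group sizes $r = 6$, $b = 4$ and the sample size $n = 3$ are made-up values.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(x, r, b, n):
    """P(X = x): x items from the group of interest among n chosen."""
    return Fraction(comb(r, x) * comb(b, n - x), comb(r + b, n))

r, b, n = 6, 4, 3  # hypothetical group sizes and sample size

# Distribution over all feasible numbers of successes.
pmf = {x: hypergeometric_pmf(x, r, b, n) for x in range(0, min(n, r) + 1)}
p_two = pmf[2]
```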
Hypothesis Testing:
A procedure, based on sample evidence, to determine whether the stated hypothesis is a reasonable statement and cannot be rejected, or is unreasonable and should be rejected.
Hypothesis:
A statement about the value of a population parameter. In the case of two hypotheses, the statement assumed to be true is called the null hypothesis (notation $H_0$) and the contradictory statement is called the alternate hypothesis (notation $H_a$).
Independent Events:
The occurrence of one event has no effect on the probability of the occurrence of any other event. Events $A$ and $B$ are independent if one of the following is true: (1) $P(A \mid B) = P(A)$; (2) $P(B \mid A) = P(B)$; (3) $P(A \text{ and } B) = P(A)\,P(B)$.
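
As an illustration, the sketch below checks all three conditions for an assumed experiment: flip a fair coin and roll a fair die, with $A$ = "heads" and $B$ = "the die shows an even number". The experiment and helper names are not from the glossary.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: 12 equally likely (coin, die) outcomes.
sample_space = set(product("HT", range(1, 7)))
A = {o for o in sample_space if o[0] == "H"}     # coin shows heads
B = {o for o in sample_space if o[1] % 2 == 0}   # die shows an even number

def prob(event):
    """P(event) with equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

def cond_prob(event, given):
    """P(event | given), restricting attention to outcomes in `given`."""
    return Fraction(len(event & given), len(given))

condition_1 = cond_prob(A, B) == prob(A)
condition_2 = cond_prob(B, A) == prob(B)
condition_3 = prob(A & B) == prob(A) * prob(B)
```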
Independent Samples:
Samples that are not related in any way.
Inferential Statistics:
Also called statistical inference or inductive statistics. This facet of statistics deals with estimating a population parameter based on a sample statistic. For example, if 4 out of the 100 calculators sampled are defective, we might infer that 4 percent of the production is defective.
Interquartile Range (IQR):
The distance between the third quartile and the first quartile.
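
A brief sketch using Python's `statistics.quantiles`. Quartile conventions differ between textbooks and software, so the library's default method is an assumption here and may not match the convention used in the text; the data are made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical observations

# n=4 returns the three cut points: first quartile, median, third quartile.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: distance between the third and first quartiles.
iqr = q3 - q1
```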
Interval Estimate:
An interval, based on sample information, within which a population parameter probably lies.
Level of Significance of the Test:
Also often referred to as the preconceived $\alpha$, or the probability of a Type I error: the probability of rejecting the null hypothesis when the null hypothesis is, in fact, true.
Linear Regression Equation:
A linear equation of the form $\hat{y} = a + bX$ that defines the relationship between two variables. It is used to predict the dependent variable $Y$ based on a selected value of the independent variable $X$.
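
The glossary does not say how $a$ and $b$ are found. The sketch below assumes the usual least-squares estimates, $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$ and $a = \bar{y} - b\bar{x}$, and applies them to made-up data.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*X fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # intercept
    return a, b

# Hypothetical data points.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
a, b = least_squares_line(X, Y)

# Predicted value of Y for a selected value X = 6.
y_hat = a + b * 6
```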
Mean:
A number that measures the central tendency (average); short for arithmetic mean. By definition, the mean for a sample (usually denoted by $\bar{X}$) is $\bar{X} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$, and the mean for a population (usually denoted by $\mu$) is $\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.
Median:
A number that separates ordered data into halves: half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data.
Mode:
The value that appears most frequently in a set of data.
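
The mean, median, and mode entries can be compared side by side on a small made-up sample using Python's `statistics` module.

```python
import statistics

sample = [2, 3, 3, 5, 7]  # hypothetical sample

sample_mean = statistics.mean(sample)      # sum of values / number of values
sample_median = statistics.median(sample)  # middle value of the ordered data
sample_mode = statistics.mode(sample)      # most frequent value
```

Here the mean is pulled above the median by the larger values 5 and 7, while the median and mode coincide.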
Multiplication Rule:
For any events $A$ and $B$ in the sample space, $P(A \text{ and } B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$.
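
As a sketch of the multiplication rule, the code below assumes two cards are drawn without replacement from a regular 52-card deck, with $A$ = "first card is an ace" and $B$ = "second card is an ace". The rule is compared against a direct count over all ordered draws.

```python
from fractions import Fraction
from itertools import permutations

ranks = "A23456789TJQK"
suits = "SHDC"
deck = [r + s for r in ranks for s in suits]  # 52 distinct cards

# All ordered ways to draw two different cards.
draws = list(permutations(deck, 2))
both_aces = [d for d in draws if d[0][0] == "A" and d[1][0] == "A"]
p_direct = Fraction(len(both_aces), len(draws))

# Multiplication rule: P(A and B) = P(B | A) * P(A).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)
p_rule = p_second_ace_given_first * p_first_ace
```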
Mutually Exclusive:
An observation cannot fall into more than one class (category). Being in one category prevents being in a mutually exclusive category.
Normal Distribution:
A continuous random variable (RV) with pdf $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean of the distribution and $\sigma$ is its standard deviation. Notation: $X \sim N(\mu, \sigma^2)$. If $\mu = 0$ and $\sigma = 1$, the RV is said to have the standard normal distribution, and its values are called z-scores.
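
The density formula can be written out as a short sketch. The standardizing formula $z = \frac{x - \mu}{\sigma}$ used in `z_score` is standard but not stated in the glossary, and the numbers are hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

# Peak of the standard normal density (mu = 0, sigma = 1).
peak = normal_pdf(0, 0, 1)

# Hypothetical value 130 from a distribution with mean 100 and sd 15.
z = z_score(130, 100, 15)
```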
One-Tailed Test:
Used when the alternate hypothesis states a direction, such as $H_a: \mu > 40$. The rejection region lies in only one tail (the right tail in this example).
OR:
Logical operation over the subsets of a set. In statistics, if $A$ and $B$ are any two events (subsets of the sample space), then the event “$A$ or $B$” consists of all outcomes that are in $A$, or in $B$, or in both $A$ and $B$.
Outcome (observation):
A particular result of an experiment.
Outlier:
An observation that does not fit the rest of the data.
Parameter:
A numerical characteristic of the population.

Example:

The mean price to rent a 1-bedroom apartment in California.

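pdf:
Abbreviation for probability density function (used for a continuous RV).
PDF:
Abbreviation for probability distribution function (used for a discrete RV).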
Percentile:
A number that divides ordered data into hundredths ($\frac{1}{100}$ of the data each).

Example:

Let a data set contain 200 ordered observations starting with $\{2.3, 2.7, 2.8, 2.9, 2.9, 3.0, \ldots\}$. Then the first percentile is $\frac{2.7 + 2.8}{2} = 2.75$, because 1% of the data is to the left of this point on the number line and 99% of the data is to its right. The second percentile is $\frac{2.9 + 2.9}{2} = 2.9$, separating 2% of the data. Percentiles may or may not be part of the data. (In this example, the first percentile is not in the data, but the second percentile is.) The median of the data is the second quartile and, at the same time, the 50th percentile. The first and third quartiles are the 25th and 75th percentiles, respectively.

Point Estimate:
A single number computed from a sample and used to estimate a population parameter.
Poisson Distribution:
A discrete random variable (RV) is the number of times a certain event will occur in a specific period of time, or in specific area, or any other units of measurement. The characteristics of the variable are: the probability that an event occurs in a given unit is the