Sunday, October 12, 2014

Confidence Level and Confidence Interval

Being confident makes one feel more reassured. To keep the idea simple, the explanations below cover only two-sided confidence levels/intervals. Saying "two-sided" suggests that there are two limits, and indeed there are: an upper and a lower limit, with the confidence interval lying in between.

Example: Let's look at the population of a specific mobile phone model. Suppose we are interested in the 'weight' property. We found that the weight follows a normal distribution with a mean value of 120 grams and a standard deviation of 1.4 grams.

Weight ~ Normal (Mu, Sigma) = Normal (120, 1.4)

This means that the majority of mobiles tested will weigh very close to 120 grams. Yes, there will be fluctuations above and below the mean value, but they will still be relatively close to it.

Suppose a question: do you expect weights like: 121, 119.5, 122.1, 118.9?
Answer: Yes, I surely expect such values.

Another question: do you expect weights like: 158, 67, 140.8, 82.5?
Answer: No! This seems impossible.

For any normal distribution, values extend to + and - infinity. But, as we have seen, it makes little sense to consider values far away from the mean value, as they will mostly not occur (in statistics we say they have extremely low probability).

Here it comes: the confidence level means we consider only those values (within the confidence interval) that will mostly be seen. The most popular confidence level is 95%, which means we focus on 95% of the possible data/values. Values far from the mean value will mostly not be seen, so we sacrifice them (5% of the data).

This 5% is usually called alpha: the percentage of data sacrificed for being so far from the mean value. With a two-sided confidence interval, alpha is divided into two halves (0.025 each), upper and lower, which logically represent the far values above and below the mean.

For the example above, a simple calculation (120 +/- 1.96 * 1.4) gives the 95% confidence interval [117.26, 122.74], i.e. the [lower, upper] limits respectively. This result simply means that 95% of mobiles will be found to weigh between 117.26 and 122.74 grams.
Note: the confidence interval depends on the population variance or standard deviation. A larger variance means a wider confidence interval, and vice versa.
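
The calculation can be checked with a few lines of Python. This is only a minimal sketch using the standard library's NormalDist; the variable names (weight, alpha, lower, upper) are chosen here for illustration.

```python
from statistics import NormalDist

# Weight ~ Normal(Mu = 120 g, Sigma = 1.4 g), as in the example above.
weight = NormalDist(mu=120, sigma=1.4)

alpha = 0.05  # the 5% we sacrifice, split equally between the two tails
lower = weight.inv_cdf(alpha / 2)      # 2.5% of values lie below this limit
upper = weight.inv_cdf(1 - alpha / 2)  # 2.5% of values lie above this limit

print(round(lower, 2), round(upper, 2))  # 117.26 122.74
```

Changing sigma to a larger value widens the interval, which matches the note about the variance.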

Tuesday, September 30, 2014

Conclusions of Hypothesis Testing

A general hypothesis is defined as follows (e.g. a hypothesis on the population mean):

H0: Mu = Mu0
H1: Mu != Mu0

OK, regardless of whether we have a two-sided or a one-sided hypothesis, after performing the checks and statistical tests our conclusion should be one of the following:
  • Rejecting the null hypothesis (H0).
  • Failing to reject the null hypothesis (H0).
The following ways of stating the conclusion are not accurate:
  • Accepting the null hypothesis (H0).
  • Accepting the alternative hypothesis (H1).

But why?

When we fail to reject H0, it does not mean we accept H0 as a fact, because we still could not prove it. What happened is only that we failed to prove it false. It goes like this: we suspected that new factors may have affected the population mean, then we gathered all possible evidence and ran our checks, but all the checks failed to confirm our suspicions.

Likewise, rejecting H0 does not mean accepting H1 as a fact. What happens in this case is that we show, statistically, that H0 is false, but H1 is not necessarily a true fact. Simply: our evidence and checks on the mean indicated that it has changed, but we still have no guarantee that it changed into the H1 region or whether this was due to different reasons/factors.
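
To make the wording concrete, here is a rough sketch of a two-sided z-test that only ever returns one of the two accurate conclusions. It assumes the population standard deviation is known, and all the numbers (500 ml, sigma of 2 ml, 36 bottles) as well as the function name conclude are hypothetical, picked just for illustration.

```python
from math import sqrt
from statistics import NormalDist

def conclude(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: Mu = mu0, assuming sigma is known."""
    # How many standard errors the sample mean lies away from mu0.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a result at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"  # note: never "accept H0"

# Hypothetical numbers: Mu0 = 500 ml, Sigma = 2 ml, sample of 36 bottles.
print(conclude(501.0, 500, 2, 36))  # z = 3.0
print(conclude(500.5, 500, 2, 36))  # z = 1.5
```

With these made-up inputs the first call rejects H0 and the second fails to reject it.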

Saturday, September 27, 2014

Null and Alternative Hypotheses

Fine... Constructing a statistical hypothesis mainly means defining what are called the null and the alternative hypotheses. In academic life, students are mostly given the hypothesis to test. But in research or a real experiment, constructing the hypotheses correctly is a vital step toward inference or a statistical decision.

Let's focus on hypotheses for the population mean, for simplicity...

The null hypothesis is mainly our initial information or primary belief about the population. Let's consider the production of 500 ml bottles. Here 500 ml is the mean capacity of the bottles. Since this is the information we know from previous knowledge, it will be our null hypothesis.

We write:
H0: Mu = 500

Note: the null hypothesis always has an equality sign!

A hypothesis test is usually done when we are worried that some factors have affected the population, or that the population has changed for any possible reason.
OK, so the null hypothesis is the information we know before the suspected change. Now comes the alternative hypothesis. The alternative hypothesis is usually the region into which we suspect or worry that the population has changed.

For the bottle capacity example: a change in the 500 ml bottle capacity (for any reason) is a bad issue to encounter. For production, it's bad for our bottles to hold either less or more than 500 ml. Lower capacity means failing to meet regulations, and higher capacity means extra content is added.

This is called a two-sided hypothesis because a change in either direction (up or down) is undesired. Thus, we define our alternative hypothesis as:

H1: Mu != 500 (!= here means not equal)

The other type of hypothesis is one-sided, when we are interested in only one direction (> or <) in the alternative hypothesis. In such cases, we suspect or worry that some factors changed our population in one direction only (up only or down only).
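
One way to see the difference between the two types is in how much of the distribution counts as "evidence against H0". The sketch below is only illustrative: it assumes a test statistic z that follows the standard normal distribution under H0, and the function name p_value and the labels "two-sided", "greater" and "less" are naming choices made here, not something from the posts.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """Tail probability of a standard normal statistic z under H0."""
    std = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "two-sided":  # H1: Mu != Mu0, both tails count
        return 2 * (1 - std.cdf(abs(z)))
    if alternative == "greater":    # H1: Mu > Mu0, upper tail only
        return 1 - std.cdf(z)
    if alternative == "less":       # H1: Mu < Mu0, lower tail only
        return std.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

z = 1.5  # hypothetical test statistic
for alt in ("two-sided", "greater", "less"):
    print(alt, round(p_value(z, alt), 4))
```

For a positive z, the two-sided value is exactly twice the "greater" value, because both tails are counted.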

Wednesday, September 24, 2014

Understanding the distribution of sample mean (x_bar)

Cool, say now we have a huge population with characteristics (Mu, Sigma^2). When doing a study by sampling, we take a random sample (of n items), perform the study on the sample, and then carry the results back to the population.

From the Central Limit Theorem, we know that for a sufficiently large sample size n the sample mean approximately follows a normal distribution, whatever the population distribution is (provided its variance is finite), such that:

x_bar ~ N (Mu, Sigma^2/n)
or say:
Expected (x_bar) = Mu
Variance (x_bar) = Sigma^2/n

Well, let's see a simple illustrative example: suppose we have a population with mean Mu=100.
Now, we have taken a sample and computed the sample mean, x_bar. We will mostly have x_bar near 100 but not exactly 100. OK, let's take another 9 separate samples... suppose we get these results:

First sample --> x_bar = 99.8
Second sample --> x_bar = 100.1
10th sample --> x_bar = 100.3

What we see is that the sample mean is usually close to the real population mean; that is the meaning of the expected value of x_bar being Mu.

Regarding the variance of the sample mean (x_bar): it always decreases as the sample size increases (variance of x_bar = Sigma^2/n), which is natural behavior. We may think of it this way: the larger the sample size we use, the more precise our values for the population mean tend to be.
When the sample size goes to infinity (theoretically), the variance of x_bar goes to zero. The reason is that the sample becomes exactly the same as the population (all items). Thus, the sample mean gives the real exact value of the population mean. There is no variability in the sample mean because it fully represents the population mean.
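
A small simulation can make this visible. The sketch below is only an illustration under assumed settings: the population is taken to be uniform on [90, 110] (so Mu = 100 and Sigma^2 = 20^2/12), the sample size is 25, and 2000 samples are drawn with a fixed seed. None of these numbers come from the example above.

```python
import random
from statistics import mean, variance

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Draw n items from the assumed (non-normal) uniform population.
    return mean(rng.uniform(90, 110) for _ in range(n))

n = 25
x_bars = [sample_mean(n) for _ in range(2000)]  # 2000 separate samples

sigma2 = 20 ** 2 / 12  # variance of a uniform distribution on [90, 110]
print("mean of x_bar     :", round(mean(x_bars), 2))
print("variance of x_bar :", round(variance(x_bars), 2))
print("Sigma^2 / n       :", round(sigma2 / n, 2))
```

The mean of the sample means should land near 100 and their variance near Sigma^2/n; increasing n shrinks that variance further.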

Tuesday, September 23, 2014

The Fact and the Hypothesis

A good fact to admit is that we can't easily know the exact true values/parameters of a population. Mostly, population parameters also change slightly over time and/or are affected by different surrounding factors.

Example: a production line for 500 ml bottles is assumed to produce a population of bottles whose mean capacity is exactly 500 ml.

Nice, but what happens in reality?

In reality, several factors will mostly affect the production: human factors, machine factors, environment temperature, etc. Also, each new bottle contributes to the population mean value. This means a continuous slight change, either up or down, in the mean capacity.

Here comes the hypothesis!

As you see, the ground-truth value of the population mean is difficult to determine exactly. However, we have general assumptions/expectations.
OK, constructing a hypothesis should always be driven by our initial knowledge and expectations about the population.
Testing the hypothesis means applying statistical checking methods to judge these beliefs. The test results should push our beliefs toward either:
  • Failure to say the parameter (e.g. the mean value) has changed. Example: the 500 ml bottle capacity should be considered stable/unchanged at 500 ml. This conclusion is known as "failing to reject the null hypothesis".
  • Rejecting this initial assumption (later we call it the null hypothesis). This means that some factors affected the production and the mean value has changed to a different value (up or down). Then further monitoring/improvements should be decided on to solve the issue. This conclusion is known as "rejecting the null hypothesis".

Monday, September 22, 2014

Standard Normal Distribution, what does Z mean?

You mostly know: the Standard Normal Distribution is the special case of the Normal Distribution with:

Mean: Mu=0.0
Variance: Sigma^2=1.0

Cool, the family of normal distributions varies mainly in mean value and/or variance.

The standard one plays the role of the reference distribution. Well, we can convert any normal random variable to its corresponding value in the standard form. Hence, we simplify different computations by using only the standard normal distribution.

OK, let's assume X is a normally distributed random variable with mean Mu and variance Sigma^2. We can convert it to a standard normally distributed random variable as follows:

Z = (X - Mu) / Sigma

Here, we get Z, which follows the standard normal distribution. OK, but what does this mean?
  • Any point in X (with Mu, Sigma) can be treated exactly as the converted point in Z with mean Mu=0.0 and Sigma=1.0.
  • The numerator measures the distance or difference between X and the mean Mu. The division by Sigma asks: how many Sigmas is that difference? Altogether, Z tells how many Sigmas lie between X and Mu. This information is sufficient for further computations using only the standard normal distribution.
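
As a quick check of this idea, the sketch below converts a point to Z and compares probabilities computed on the original scale and on the standard scale. The numbers (Mu = 100, Sigma = 5, x = 110) and the helper name z_score are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    # Distance between x and the mean, measured in units of sigma.
    return (x - mu) / sigma

mu, sigma = 100, 5  # hypothetical X ~ N(100, 5^2)
x = 110

z = z_score(x, mu, sigma)
print(z)  # 2.0, i.e. x lies two Sigmas above Mu

# P(X <= x) on the original scale equals P(Z <= z) on the standard scale.
p_original = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist().cdf(z)
print(round(p_original, 4), round(p_standard, 4))
```

Both printed probabilities agree, which is why tables or functions for the standard normal distribution alone are enough.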

Sunday, September 21, 2014

Normal Probability Distribution

Also called the Gaussian distribution. OK, many things in this world tend to be normally distributed.
Any distribution is a representation of how the information or data is distributed. We mainly look at its central tendency (mean) and variability (variance). That's why the normal distribution is usually written as:

X ~ N (Mu, Sigma^2)

For example: the weights of most young adults will normally be centered around some value. Yes, you're right, there is diversity: some are slim and some are obese.

We may expect the average weight of people (for example, ages 20 to 30) to be between 70 and 74 kg. OK, let's take it as 72 (this is the mean value).

Let x represent the weight of a random person. Thus,

Expected Value [x] = mean [x] = Mu = 72 kg

If we have a sample, we can compute the variance (sigma^2) to indicate variability. But here we may think as follows:

Variance = Sigma^2 = Expected Value [(x-Mu)^2]
Standard Deviation = Sigma = square root [variance]

Got it? The variance is just the expected (average) squared difference between the values of x and their mean Mu.

Assume that the weights vary (on average) about +4 or -4 kg from the mean value. Thus, we have

Sigma approx= 4
Variance approx= 16

We may conclude that the probability distribution of young people's weight is:

weight = x ~ N (72, 16)

Note: this is just an illustrative example; real figures may differ depending on location or other factors.

Facts for any normally distributed data:
  • Within 1 sigma of the mean value (to the left and right) lie about 68% of the data.
  • Within 2 sigmas of the mean value (to the left and right) lie about 95% of the data.
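
These two facts can be verified for the weight example with a minimal sketch like the following; the helper name share_within is made up here for illustration.

```python
from statistics import NormalDist

weight = NormalDist(mu=72, sigma=4)  # x ~ N(72, 16) from the example

def share_within(dist, k):
    """Share of the data lying within k sigmas of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

print(round(share_within(weight, 1), 4))  # between 68 and 76 kg
print(round(share_within(weight, 2), 4))  # between 64 and 80 kg
```

The results come out at roughly 0.68 and 0.95, and they stay the same for any other choice of Mu and Sigma.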

Saturday, September 20, 2014


Variability (as the name suggests) shows how much values/elements differ from each other. When all values are very close to the mean value, it means we have little variability or diversity.

The commonly used variability measure is the variance (or the standard deviation, which is the positive square root of the variance). Besides, the range can indicate the amount of variability within a group of data.

I guess the formula to compute the variance is known, but let's show it again (x_bar is the sample mean and n is the sample size):

s^2 = Sum (x_i - x_bar)^2 / (n - 1)

Suppose we have the following measurement data: {5.2, 5.3, 4.95, 5.17, 5.22}.
The diversity/variability within the values seems small by inspection. Calculating the variance gives s^2=0.01717, which is clearly a small value.

But let's look at this group of data: {30, 45, 39, 28, 42}.
The diversity/variability within the values seems large by inspection. Calculating the variance gives s^2=55.7, which is clearly a large value.
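
Both results can be reproduced with Python's standard library, as a quick sketch. Note that statistics.variance computes the sample variance (dividing by n - 1), which is what matches the two values above; the names group_a and group_b are just labels used here.

```python
from statistics import variance

group_a = [5.2, 5.3, 4.95, 5.17, 5.22]
group_b = [30, 45, 39, 28, 42]

# Sample variance: sum of squared deviations divided by (n - 1).
print(round(variance(group_a), 5))  # 0.01717
print(round(variance(group_b), 1))  # 55.7

# The range (max - min) tells the same story in a cruder way.
print(round(max(group_a) - min(group_a), 2))  # 0.35
print(max(group_b) - min(group_b))            # 17
```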

Central Tendency

Any population is usually best described by its central tendency and variability measures. Well, these parameters are the best way to interpret information about a specific property under study.

Central tendency measures the value around which most of the data/elements are centered or grouped. That is, data values tend to be like/close to this value.

The most common way to describe the central tendency is the mean value, also called the expected value, or the average value (usually for samples). There are other measures of central tendency as well, such as the mode and the median.

The average (or mean) of a sample of data is just the sum divided by the data count, whereas the mode is the most repeated/frequent value within our data. The median is the middle value when the data group is sorted in ascending order.

Example: water (or juice) bottle production is a population where we can think of the bottle capacity as an important feature/property. When we look at all the 500 ml bottles, capacity values tend to be 500 ml (which is the mean or expected value). Notice that although all bottle capacities should be close to 500 ml, they need not be exactly 500 ml. So, some can be 499.5 ml, 500.2 ml, etc. In general, values tend toward a specific center, which is the mean value.
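
Here is a small sketch of the three measures on a handful of bottle capacities. The seven values are hypothetical, made up only to illustrate the calculation.

```python
from statistics import mean, median, mode

# Hypothetical measured capacities (ml) of seven "500 ml" bottles.
capacities = [499.5, 500.2, 500.0, 499.8, 500.0, 500.3, 500.0]

avg = mean(capacities)    # sum of the values divided by their count
mid = median(capacities)  # middle value after sorting in ascending order
top = mode(capacities)    # most frequent value

print(round(avg, 2), mid, top)
```

All three land at or very near 500 ml, which is the center the values tend toward.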

Friday, September 19, 2014

Sample size and measurements accuracy

It seems obvious that a larger sample size will give results closer to what happens in reality. Cool, if it's easy to select many elements for the sample, then we should. Usually a sample of 30 to 40 elements is great to have.

We should increase the sample size whenever we:
  • Can easily handle/select sample items (randomly).
  • Can perform the measurement with little effort/time.
Notice that these depend highly on the nature of the problem or population and on the situation. Another point to keep in mind is to make the sample measurements as accurate as possible.

In conclusion, two important factors should be considered:
  • A good sample size, as large as possible. This helps ensure that more of the information diversity is included in our sample, so it comes closer to describing the whole population.
  • Accurate measuring tools/equipment when measuring the required property of the sample, because these measurements determine the next steps. This requirement becomes more critical with small sample sizes.
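
One rough way to put numbers on "larger is closer" is the margin of error of the sample mean at a 95% confidence level. The sketch below assumes a normal model with a known population standard deviation; the value sigma = 2.0 and the function name margin_of_error are hypothetical, used only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of a two-sided interval for the mean, sigma known."""
    # Critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

for n in (5, 30, 100, 1000):
    print(n, round(margin_of_error(sigma=2.0, n=n), 3))
```

The margin shrinks with the square root of n, so going from 30 to 1000 items helps, but each extra item helps a little less than the one before.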

How to be "random" in sample selection?

Well, there are no fixed criteria in statistics. Yes, there exist many theories and methodologies for creating random sequences or numbers. However, the choice will depend on the population's nature, the situation and the surrounding environment.

In general, we may think as follows:
  • Having many persons/machines share in the sample selection is better than having it done by only one. Diversity leads to better randomness.
  • Selecting at different times/situations/places is better than selecting all at once, in order to get more randomness.
  • Changing the methodologies/media of selection may help.
  • Combining two or more random samples creates a better random sample.

Anyway, since our goal is to get accurate inferences, we should try to be as random as possible in sample selection.
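
When the items can be listed or numbered, a pseudo-random generator can do the picking so that no person's habits influence the choice. This is just one possible sketch: the population of 1000 numbered items, the sample size of 30 and the seed are all hypothetical.

```python
import random

# Hypothetical population: items labelled 1..1000.
population = list(range(1, 1001))

rng = random.Random(42)                # seed fixed only for repeatability
sample = rng.sample(population, k=30)  # 30 distinct items, equal chances

print(sorted(sample))
```

Every item has the same chance of being selected and no item is picked twice; dropping the seed gives a different sample on each run.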

The "Sample"

Anytime you aim to perform a study on the entire population, you will surely find that this task will be:
  • Very time- and/or effort-consuming, as populations are normally huge.
  • Impossible if the population is infinite (such as products).
Here comes the role of taking samples. Yes! We just take a sample from the whole population, perform the study on the chosen sample, and apply the results back to our population.

This is the core of inferential statistics, because what we do is infer parameters/properties of the population using information from a small sample.

Well, this does not mean we will obtain 100% accurate estimates or inferences. But to be as close as possible, sample elements should be taken randomly! At the least, being random in sample selection will mostly capture the diversity of information/facts within our population.

Population examples

Whenever a university wants to do a study on all of its students, the entire student body is the intended population. That does not depend on what we are going to study: students' ages, trends across disciplines, performance, etc.

In industry, a company that produces pens of a specific type would usually like to re-evaluate its production. Here, the population is the entire production of pens, which is obviously infinite. The study may aim to examine major properties such as length, shape, weight, lifetime, etc.

OK, let's look at an environmental example. The ministry of environment in some city regularly checks the pollutant concentration in a flowing river. Based on the results, it advises people whether or not to use the river's water, e.g. for swimming, washing or irrigation. The whole amount of water is the population to be studied here.

Wednesday, September 17, 2014

The "Population"

We are used to saying "population" to mean the total number of people in a country. Yeah, for example the population of Singapore is in the range of 5 to 6 million.

Cool, in statistics we also call the whole count or entire amount of things under study a "population".

Example: for a company that produces pens, the whole amount of product ("pens") that was produced, and that will be produced in the future, is considered our "population".

You see? Yep, a "population" is usually a large count/amount of items. It could be infinite as well.

So, can you think of several examples of "populations"?

Tuesday, September 16, 2014

Welcome to test of hypothesis

Hello All

This will be the blog for understanding statistical hypothesis testing!
