Business Statistics Notes
STATISTICS
"A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances." - Bowley
A.L. Bowley has defined statistics as: (i) statistics is the science of counting, (ii) statistics may rightly be called the science of averages, and (iii) statistics is the science of the measurement of the social organism, regarded as a whole in all its manifestations.
Boddington defined statistics as: Statistics is the science of estimates and probabilities.
Seligman explained that statistics is a science that deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data collected to throw some light on any sphere of enquiry.
Spiegel defines statistics by highlighting its role in decision-making, particularly under uncertainty.
CHARACTERISTICS OF STATISTICS
· Statistics are the aggregates of facts.
· Statistics are affected by a number of factors.
· Statistics must be reasonably accurate.
· Statistics must be collected in a systematic manner.
· Statistics must be collected for a pre-determined purpose.
TYPES OF DATA AND DATA
SOURCES
Statistical data are the basic raw material of statistics. Any object, subject, phenomenon, or activity that generates data is termed a variable. In other words, a variable is one that shows a degree of variability when successive measurements are recorded.
In statistics, data are classified into two broad categories: quantitative data and qualitative data.
Quantitative data are those that can be quantified in definite units of measurement. These refer to characteristics whose successive measurements yield quantifiable observations. Depending on the nature of the variable observed for measurement, quantitative data can be further categorized as continuous and discrete data. Accordingly, a variable may be either a continuous variable or a discrete variable.
Continuous data. A continuous variable is one that can assume any value between any two points on a line segment, thus representing an interval of values. The data recorded on such characteristics are called continuous data. It may be noted that a continuous variable assumes the finest unit of measurement, finest in the sense that it enables measurements to the maximum degree of precision.
Discrete data. A discrete variable is one whose outcomes are measured in fixed numbers. Such data are essentially count data.
Qualitative data. A characteristic is qualitative in nature when its observations are defined and noted in terms of the presence or absence of a certain attribute in discrete numbers. These data are further classified as nominal and rank data.
Nominal data are the outcome of classification into two or more categories of items or units comprising a sample or a population, according to some quality characteristic. Classification of students according to sex (as males and females), of workers according to skill (as skilled, semi-skilled, and unskilled), and of employees according to level of education (as matriculates, undergraduates, and post-graduates) all result in nominal data. Given any such basis of classification, it is always possible to assign each item to a particular class and to count the items belonging to each class. The count data so obtained are called nominal data.
Rank data, on the other hand, are the result of assigning ranks to specify order in terms of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a test.
Data sources could be seen as of two types, viz., secondary and primary. The two can be defined as under:
(i) Secondary data: These already exist in some form, published or unpublished, in an identifiable secondary source. They are generally available from published sources, though not necessarily in the form actually required.
(ii) Primary data: Those data which do not already exist in any form, and thus have to be collected for the first time from the primary source(s). By their very nature, these data require fresh and first-time collection covering the whole population or a sample drawn from it.
TYPES
OF STATISTICS
Descriptive statistics deals with collecting, summarizing, and simplifying data, which are otherwise quite unwieldy and voluminous. It seeks to achieve this in a manner that allows meaningful conclusions to be readily drawn from the data. Descriptive statistics may thus be seen as comprising methods for bringing out and highlighting the latent characteristics present in a set of numerical data.
Inferential statistics, also known as inductive statistics, goes beyond describing a given problem situation by means of collecting, summarizing, and meaningfully presenting the related data. Instead, it consists of methods that are used for drawing inferences, or making broad generalizations, about a totality of observations on the basis of knowledge about a part of that totality.
Inferential statistics helps to evaluate the risks involved in reaching inferences or generalizations about an unknown population on the basis of sample information.
LIMITATIONS OF STATISTICS
(i) Sources of data not given
(ii) Defective data
(iii) Unrepresentative sample
(iv) Inadequate sample
(v) Unfair comparisons
(vi) Unwanted conclusions
(vii) Confusion of correlation and causation
CENTRAL
TENDENCY
ARITHMETIC MEAN
Adding all the observations and dividing the sum by the number of observations yields the arithmetic mean.
For grouped data, the arithmetic mean may be calculated by applying any of the following methods: (i) direct method, (ii) short-cut method, (iii) step-deviation method. It may be noted that the mid-point of each class is taken as a good approximation of the true mean of the class.
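As a rough sketch of the direct method, with a hypothetical grouped frequency distribution, each class is replaced by its mid-point:

```python
# Direct method for a grouped frequency distribution (hypothetical
# class intervals and frequencies, for illustration only). Each class
# is represented by its mid-point, taken as an approximation of the
# true mean of the items in that class.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]   # class intervals
freqs = [5, 8, 12, 5]                               # frequencies

midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Direct method: mean = sum(f * m) / sum(f)
mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
print(round(mean, 2))
```

The short-cut and step-deviation methods give the same result; they only simplify the arithmetic by working with deviations from an assumed mean.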
CHARACTERISTICS OF THE ARITHMETIC MEAN
1. The
sum of the deviations of the individual items from the arithmetic mean is
always zero.
2. The
sum of the squared deviations of the individual items from the arithmetic mean
is always minimum.
3. As
the arithmetic mean is based on all the items in a series, a change in the
value of any item will lead to a change in the value of the arithmetic mean.
4. In
the case of highly skewed distribution, the arithmetic mean may get distorted
on account of a few items with extreme values.
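The first two properties can be checked numerically on a small, made-up sample:

```python
# Numerical check of properties 1 and 2 on illustrative data: the
# deviations from the mean sum to zero, and the sum of squared
# deviations is smaller about the mean than about nearby values.
data = [4, 8, 15, 16, 23, 42]
mean = sum(data) / len(data)

sum_dev = sum(x - mean for x in data)   # property 1: this is (near) zero

def sum_sq_dev(about):
    return sum((x - about) ** 2 for x in data)

# property 2: the sum of squared deviations is minimised at the mean
assert abs(sum_dev) < 1e-9
assert sum_sq_dev(mean) <= min(sum_sq_dev(mean + d) for d in (-1, -0.5, 0.5, 1))
print(mean)
```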
MEDIAN
The median is defined as the value of the middle item (or the mean of the values of the two middle items) when the data are arranged in ascending or descending order of magnitude. Thus, in an ungrouped frequency distribution, if the n values are arranged in ascending or descending order of magnitude, the median is the middle value if n is odd; when n is even, the median is the mean of the two middle values.
To understand these measures, we should first know that the median belongs to a general class of statistical descriptions called fractiles. A fractile is a value below which lies a given fraction of a set of data; in the case of the median, this fraction is one-half (1/2). Other fractiles include deciles (where the series is divided into 10 parts) and percentiles (where the series is divided into 100 parts).
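Python's standard library can illustrate the median and other fractiles (the data here are assumed for illustration):

```python
# Median for an ungrouped series: middle value when n is odd, mean of
# the two middle values when n is even.
import statistics

odd = [7, 3, 9, 1, 5]            # sorted: 1, 3, 5, 7, 9  -> middle is 5
even = [7, 3, 9, 1, 5, 11]       # sorted middle pair 5, 7 -> (5 + 7) / 2

print(statistics.median(odd))    # 5
print(statistics.median(even))   # 6.0

# Other fractiles, e.g. deciles (9 cut points dividing data into 10 parts):
deciles = statistics.quantiles(range(1, 101), n=10)
```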
CHARACTERISTICS OF THE MEDIAN
1. Unlike the arithmetic mean, the median can be computed from open-ended distributions. This is because it is located in the median class-interval, which would not be an open-ended class.
2. The median can also be determined graphically, whereas the arithmetic mean cannot be ascertained in this manner.
3. As it is not influenced by extreme values, it is preferred in the case of a distribution having extreme values.
4. In the case of qualitative data where the items are not counted or measured but are scored or ranked, it is the most appropriate measure of central tendency.
MODE
The mode is another measure of central tendency. It
is the value at the point around which the items are most heavily concentrated.
Mode = 3 Median - 2 Mean
This empirical relation gives only approximate results, so its frequent use should be avoided; however, when the mode is ill-defined or the series is bimodal, it may be used.
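A minimal sketch of the empirical relation, with assumed values for the mean and median:

```python
# Estimating an ill-defined mode from the empirical relation
# Mode = 3*Median - 2*Mean (mean and median values assumed here).
mean, median = 45.0, 42.0
mode_estimate = 3 * median - 2 * mean
print(mode_estimate)   # 36.0
```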
RELATIONSHIPS OF THE MEAN, MEDIAN AND MODE
(i) When a distribution is symmetrical, the mean, median and mode are the same.
(ii) In case a distribution is skewed to the right, then mean > median > mode. Generally, income distribution is skewed to the right, where a large number of families have relatively low income and a small number of families have extremely high income.
(iii) When a distribution is skewed to the left, then mode > median > mean. This is because here the mean is pulled down below the median by extremely low values.
(iv) Given the mean and median of a unimodal distribution, we can determine whether it is skewed to the right or left. When mean > median, it is skewed to the right; when median > mean, it is skewed to the left. It may be noted that the median is always in the middle between the mean and the mode.
BEST MEASURE OF CENTRAL TENDENCY
The arithmetic mean is the sum of the values divided by the total number of observations in the series.
The median is the value of the middle observation that divides the series into two equal parts.
The mode is the value around which the observations tend to concentrate.
GEOMETRIC MEAN
The geometric mean is more important than the harmonic mean. The geometric mean is defined as the nth root of the product of n observations of a distribution. Similarly, if there are three observations, then we have to calculate the cube root of the product of these three observations; and so on.
When the number of items is large, it becomes
extremely difficult to multiply the numbers and to calculate the root. To
simplify calculations, logarithms are used.
The geometric mean is most suitable in the following
three cases:
1. Averaging rates of change.
2. The compound interest formula.
3. Discounting and capitalization.
This process of ascertaining the present value of
future income by using the interest rate is known as discounting.
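A sketch of the logarithmic computation and of averaging rates of change; the growth figures are assumed for illustration:

```python
# Geometric mean via logarithms: the antilog of the arithmetic mean of
# the logs of the observations.
import math

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Averaging rates of change (assumed figures): growth of 10%, 20% and
# 30% over three years corresponds to growth factors 1.10, 1.20, 1.30.
factors = [1.10, 1.20, 1.30]
g = geometric_mean(factors)
print(f"average growth rate: {(g - 1) * 100:.2f}%")
```

The arithmetic mean of the rates (20%) would overstate the true average growth, which is why the geometric mean is preferred here.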
ADVANTAGES OF G. M.
1. Geometric mean is based
on each and every observation in the data set.
2. It is rigidly defined.
3. It is more suitable
while averaging ratios and percentages as also in calculating growth
rates.
4. As compared to the
arithmetic mean, it gives more weight to small values and less weight to large
values. As a result of this characteristic of the geometric mean, it is
generally less than the arithmetic mean. At times it may be equal to the
arithmetic mean.
5. It is capable of algebraic manipulation. If the geometric means of two or more series are known along with their respective numbers of observations, a combined geometric mean can be calculated using logarithms.
LIMITATIONS OF G.M.
1. As compared to the
arithmetic mean, geometric mean is difficult to understand.
2. Both computation of
the geometric mean and its interpretation are rather difficult.
3. When there is a
negative item in a series or one or more observations have zero value, then the
geometric mean cannot be calculated.
In view of the limitations mentioned above, the
geometric mean is not frequently used.
HARMONIC MEAN
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of individual observations. Symbolically, HM = n / (1/x1 + 1/x2 + ... + 1/xn).
The calculation of harmonic mean becomes very
tedious when a distribution has a large number of observations.
The main advantage of the harmonic mean is that it is based on all observations in a distribution and is amenable to further algebraic treatment. When we desire to give greater weight to smaller observations and less weight to larger observations, the use of the harmonic mean will be more suitable.
The limitations of the harmonic mean are as follows:
First, it is difficult to
understand as well as difficult to compute.
Second, it cannot be calculated
if any of the observations is zero or negative.
Third, it is only a summary
figure, which may not be an actual observation in the distribution.
It is worth noting that the harmonic mean is always
lower than the geometric mean, which is lower than the arithmetic mean. This is
because the harmonic mean assigns lesser importance to higher values. Since the
harmonic mean is based on reciprocals, it becomes clear that as reciprocals of
higher values are lower than those of lower values, it is a lower average than
the arithmetic mean as well as the geometric mean.
QUADRATIC MEAN
Geometric mean is the antilogarithm of the
arithmetic mean of the logarithms, and the harmonic mean is the reciprocal of
the arithmetic mean of the reciprocals. Likewise, the quadratic mean (Q) is the
square root of the arithmetic mean of the squares.
The quadratic mean can be used when averaging deviations, as when the standard deviation is to be calculated. Moreover, Q > x̄ > G > H, provided that all the individual observations in a series are positive and they are not all the same.
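The ordering Q > x̄ > G > H can be verified numerically on made-up positive, unequal observations:

```python
# Numerical check of the ordering of the four means for positive,
# unequal observations (illustrative data).
import math

data = [2.0, 4.0, 8.0]
n = len(data)

am = sum(data) / n                            # arithmetic mean
gm = math.prod(data) ** (1 / n)               # geometric mean
hm = n / sum(1 / x for x in data)             # harmonic mean
q = math.sqrt(sum(x * x for x in data) / n)   # quadratic mean

assert q > am > gm > hm
print(round(q, 3), round(am, 3), round(gm, 3), round(hm, 3))
```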
DISPERSION AND SKEWNESS
Dispersion or variability provides one more step in increasing our understanding of the pattern of the data. Further, a high degree of uniformity (i.e. a low degree of dispersion) is a desirable quality.
1. "Dispersion is the measure of the variation of the items." -A.L. Bowley
2. "The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data." -Spiegel
3. "Dispersion or spread is the degree of the scatter or variation of the variable about a central value." -Brooks & Dick
4. "The measurement of the scatterness of the mass of figures in a series about an average is called the measure of variation or dispersion." -Simpson & Kafka
Since measures of dispersion give an average of the
differences of various items from an average, they are also called averages of
the second order. An average is more meaningful when it is examined in the
light of dispersion.
SIGNIFICANCE AND PROPERTIES OF MEASURING VARIATION
1. Measures of variation
point out as to how far an average is representative of the mass. When
dispersion is small, the average is a typical value in the sense that it
closely represents the individual value and it is reliable in the sense that it
is a good estimate of the average in the corresponding universe. On the other
hand, when dispersion is large, the average is not so typical, and unless the
sample is very large, the average may be quite unreliable.
2. Another purpose of measuring dispersion is to determine the nature and cause of variation in order to control the variation itself. In matters of health, variations in body temperature, pulse beat and blood pressure are the basic guides to diagnosis, and prescribed treatment is designed to control their variation. In industrial production, efficient operation requires control of quality variation, the causes of which are sought through inspection and quality control programmes. In the social sciences, a special problem requiring the measurement of variability is the measurement of "inequality" in the distribution of income or wealth.
3. Measures of dispersion
enable a comparison to be made of two or more series with regard to their
variability. The study of variation may also be looked upon as a means of
determining uniformity of consistency. A high degree of variation would mean
little uniformity or consistency whereas a low degree of variation would mean
great uniformity or consistency.
4. Many powerful analytical tools in statistics, such as correlation analysis, the testing of hypotheses, the analysis of variance, statistical quality control, and regression analysis, are based on measures of variation of one kind or another.
MEASURES OF DISPERSION
There are five measures of dispersion: Range,
Inter-quartile range or Quartile Deviation, Mean deviation, Standard Deviation,
and Lorenz curve. Among them, the first four are mathematical methods and the
last one is the graphical method.
RANGE
The simplest measure of dispersion is the range,
which is the difference between the maximum value and the minimum value of
data.
When the sample size is very small, the range is considered a quite adequate measure of variability. Thus, it is widely used in quality control, where a continuous check on the variability of raw materials or finished products is needed. The range is also a suitable measure in weather forecasting.
The limitations of the range are as follows:
1. It is based only on two items and does not cover all the items in a distribution.
2. It is subject to wide fluctuations from sample to sample based on the same population.
3. It fails to give any idea about the pattern of the distribution.
4. Finally, in the case
of open-ended distributions, it is not possible to compute the range.
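A minimal sketch of the range, showing how a single extreme value inflates it (the data are assumed):

```python
# Range: difference between the maximum and minimum values.
data = [12, 15, 11, 18, 14]
rng = max(data) - min(data)
print(rng)   # 7

# One extreme item changes the range drastically, illustrating why it
# fluctuates widely from sample to sample.
data_with_outlier = data + [90]
print(max(data_with_outlier) - min(data_with_outlier))   # 79
```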
QUARTILE DEVIATION
The interquartile range or quartile deviation is a better measure of variation in a distribution than the range. Here, the middle 50 percent of the distribution is used, avoiding the 25 percent at each end. In other words, the interquartile range is the difference between the third quartile and the first quartile.
Symbolically, interquartile range = Q3 - Q1
Semi-interquartile range or quartile deviation = (Q3 - Q1)/2
When quartile deviation is small, it means that
there is a small deviation in the central 50 percent items.
It may be noted that in a symmetrical distribution, the two quartiles Q1 and Q3 are equidistant from the median. Symbolically, M - Q1 = Q3 - M.
It may be noted that interquartile range or the
quartile deviation is an absolute measure of dispersion.
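A sketch of the interquartile range and quartile deviation using the standard library; the data are assumed, and the "inclusive" quantile method interpolates within the range of the data:

```python
# Interquartile range and quartile deviation for illustrative data.
import statistics

data = list(range(1, 12))   # 1, 2, ..., 11

# Quartiles: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
quartile_deviation = iqr / 2
print(q1, q3, iqr, quartile_deviation)
```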
MERITS OF QUARTILE DEVIATION
1. As compared to range, it is considered a superior
measure of dispersion.
2. In the case of open-ended distribution, it is
quite suitable.
3. Since it is not influenced by the extreme values
in a distribution, it is particularly suitable in highly skewed or erratic
distributions.
MEAN DEVIATION
The mean deviation is also known as the average deviation. As the name implies, it is the average of the absolute amounts by which the individual items deviate from the mean. Since the positive deviations from the mean are equal in total to the negative deviations, we ignore the signs while computing the mean deviation.
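A minimal sketch of the mean deviation about the mean (data assumed):

```python
# Mean (absolute) deviation about the mean: signs of the deviations
# are ignored by taking absolute values.
data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
print(mean_deviation)   # 2.4
```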
MERITS OF MEAN DEVIATION
1. A major advantage of
mean deviation is that it is simple to understand and easy to calculate.
2. It takes into
consideration each and every item in the distribution. As a result, a change in
the value of any item will have its effect on the magnitude of mean deviation.
3. The values of extreme
items have less effect on the value of the mean deviation.
4. As deviations are taken
from a central value, it is possible to have meaningful comparisons of the
formation of different distributions.
LIMITATIONS OF MEAN DEVIATION
1. It
is not capable of further algebraic treatment.
2. At
times it may fail to give accurate results. The mean deviation gives best
results when deviations are taken from the median instead of from the mean. But
in a series, which has wide variations in the items, median is not a
satisfactory measure.
3. Strictly
on mathematical considerations, the method is wrong as it ignores the algebraic
signs when the deviations are taken from the mean.
In view of these limitations, it is seldom used in
business studies. A better measure known as the standard deviation is more
frequently used.
STANDARD DEVIATION
The standard deviation is similar to the mean
deviation in that here too the deviations are measured from the mean. At the
same time, the standard deviation is preferred to the mean deviation or the
quartile deviation or the range because it has desirable mathematical
properties.
The mean of the squared deviations is known as the variance, and the standard deviation is its positive square root.
When the actual mean turns out to be a fraction, calculating deviations from it becomes cumbersome; in such cases, a short-cut method based on an assumed mean is used.
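A sketch of the variance and standard deviation computed from deviations about the mean (data assumed):

```python
# Variance (mean of the squared deviations) and standard deviation
# (its positive square root) for illustrative data.
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)
print(variance, std_dev)   # 4.0 2.0
```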
USES OF THE STANDARD DEVIATION
The standard deviation is a frequently used measure
of dispersion. It enables us to determine as to how far individual items in a
distribution deviate from its mean. In a symmetrical, bell-shaped curve:
(i) About 68 percent of the values in the population fall within ±1 standard deviation from the mean.
(ii) About 95 percent of the values fall within ±2 standard deviations from the mean.
(iii) About 99.7 percent of the values fall within ±3 standard deviations from the mean.
The standard deviation is an absolute measure of
dispersion as it measures variation in the same units as the original data. As
such, it cannot be a suitable measure while comparing two or more
distributions.
STANDARDISED VARIABLE, STANDARD SCORES
The variable Z = (x - x̄)/s or (x - μ)/σ, which measures the deviation from the mean in units of the standard deviation, is called a standardised variable. Since both the numerator and the denominator are in the same units, a standardised variable is independent of the units used.
If deviations from the mean are given in units of
the standard deviation, they are said to be expressed in standard units or
standard scores.
Through this concept of standardised variable,
proper comparisons can be made between individual observations belonging to two
different distributions whose compositions differ.
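A sketch of standard scores used to compare observations from two different distributions; the marks, means and standard deviations are assumed:

```python
# Standard score: deviation from the mean in units of the standard
# deviation, allowing comparison across distributions.
def z_score(x, mean, sd):
    return (x - mean) / sd

# Assumed example: 70 marks in a test with mean 60 and sd 5, versus
# 80 marks in a test with mean 75 and sd 10.
z1 = z_score(70, 60, 5)    # 2.0
z2 = z_score(80, 75, 10)   # 0.5
print(z1 > z2)             # the first performance is relatively better
```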
LORENZ CURVE
This measure of dispersion is graphical. It is known
as the Lorenz curve named after Dr. Max Lorenz. It is generally used to show the
extent of concentration of income and wealth. The steps involved in plotting
the Lorenz curve are:
1. Convert a frequency
distribution into a cumulative frequency table.
2. Calculate percentage
for each item taking the total equal to 100.
3. Choose a suitable scale and plot the cumulative percentages of persons and income. Use the horizontal X-axis to depict percentages of persons and the vertical Y-axis to depict percentages of income.
4. Show the line of equal distribution, which will join 0 on the X-axis with 100 on the Y-axis.
5. The curve obtained in
(3) above can now be compared with the straight line of equal distribution
obtained in (4) above. If the Lorenz curve is close to the line of equal
distribution, then it implies that the dispersion is much less. If, on the contrary,
the Lorenz curve is farther away from the line of equal distribution, it
implies that the dispersion is considerable.
The Lorenz curve is a simple graphical device to show the disparities of distribution in any phenomenon. It is used in business and economics to represent inequalities in income, wealth, production, savings, and so on.
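The first three steps can be sketched as follows, with hypothetical income totals for five equal-sized groups of persons:

```python
# Cumulative percentage points for a Lorenz curve (hypothetical income
# totals for five equal-sized groups of persons).
incomes = [10, 15, 20, 25, 80]   # total income of each group

total = sum(incomes)
cum_income_pct = []
running = 0
for inc in sorted(incomes):      # poorest group first
    running += inc
    cum_income_pct.append(100 * running / total)

cum_persons_pct = [100 * (i + 1) / len(incomes) for i in range(len(incomes))]

# Plot cum_income_pct against cum_persons_pct; the line of equal
# distribution joins (0, 0) and (100, 100).
print(list(zip(cum_persons_pct, cum_income_pct)))
```

The farther these points fall below the line of equal distribution, the greater the dispersion (inequality).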
SKEWNESS
It may be repeated here that frequency distributions
differ in three ways: Average value, Variability or dispersion, and Shape.
Generally, there are two comparable characteristics
called skewness and kurtosis that help us to understand a distribution. Two
distributions may have the same mean and standard deviation but may differ
widely in their overall appearance.
Some important definitions of skewness are as follows:
1. "When a series is not symmetrical it is said to be asymmetrical or skewed." -Croxton & Cowden
2. "Skewness refers to the asymmetry or lack of symmetry in the shape of a frequency distribution." -Morris Hamburg
3. "Measures of skewness tell us the direction and the extent of skewness. In a symmetrical distribution the mean, median and mode are identical. The more the mean moves away from the mode, the larger the asymmetry or skewness." -Simpson & Kafka
4. "A distribution is said to be 'skewed' when the mean and the median fall at different points in the distribution, and the balance (or centre of gravity) is shifted to one side or the other, to left or right." -Garrett
Symmetrical Distribution. It
is clear from the diagram (a) that in a symmetrical distribution the values of
mean, median and mode coincide. The spread of the frequencies is the same on
both sides of the centre point of the curve.
Asymmetrical Distribution.
A distribution, which is not symmetrical, is called a skewed distribution and
such a distribution could either be positively skewed or negatively skewed as
would be clear from the diagrams (b) and (c).
Positively Skewed Distribution. In the positively skewed distribution the value of the mean is maximum and that of the mode least; the median lies in between the two, as is clear from the diagram (b).
Negatively Skewed Distribution. In a negatively skewed distribution the value of the mode is maximum and that of the mean least; the median lies in between the two. In the positively skewed distribution the frequencies are spread out over a greater range of values on the high-value end of the curve (the right-hand side) than they are on the low-value end. In the negatively skewed distribution the position is reversed, i.e. the excess tail is on the left-hand side. It should be noted that in moderately skewed distributions the interval between the mean and the median is approximately one-third of the interval between the mean and the mode. It is this relationship which provides a means of measuring the degree of skewness.
In order to ascertain whether a distribution is skewed or not, the following tests may be applied. Skewness is present if:
1. The values of mean, median and mode do not coincide.
2. When the data are plotted on a graph they do not give the normal bell-shaped form, i.e. when cut along a vertical line through the centre the two halves are not equal.
3. The sum of the positive deviations from the median is not equal to the sum of the negative deviations.
4. Quartiles are not equidistant from the median.
5. Frequencies are not equally distributed at points of equal deviation from the mode.
MEASURES OF SKEWNESS
There are four measures of skewness, each divided
into absolute and relative measures. The relative measure is known as the
coefficient of skewness and is more frequently used than the absolute measure
of skewness. Further, when a comparison between two or more distributions is
involved, it is the relative measure of skewness, which is used. The measures of skewness are:
(i) Karl Pearson's measure,
(ii) Bowley's measure,
(iii) Kelly's measure, and
(iv) the measure based on moments.
The formula for measuring skewness as given by Karl Pearson is as follows:
Absolute skewness = Mean - Mode
Since Mode = 3 Median - 2 Mean, it follows that Mean - Mode = 3 (Mean - Median).
The direction of skewness is determined by ascertaining whether the mean is greater than the mode or less than the mode. The relative measure, the coefficient of skewness, is
Coefficient of skewness = (Mean - Mode) / Standard Deviation
The value of the coefficient of skewness is zero when the distribution is symmetrical. Normally, this coefficient of skewness lies between -1 and +1. If the mean is greater than the mode, the coefficient of skewness will be positive; otherwise it is negative.
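Pearson's coefficient, in its 3(Mean - Median)/SD form, can be sketched with assumed summary values:

```python
# Karl Pearson's coefficient of skewness, using the empirical relation
# Mode = 3*Median - 2*Mean, so that Mean - Mode = 3*(Mean - Median).
# The summary values below are assumed for illustration.
def pearson_skewness(mean, median, sd):
    return 3 * (mean - median) / sd

print(pearson_skewness(50, 48, 10))   # positive: skewed to the right
print(pearson_skewness(48, 50, 10))   # negative: skewed to the left
```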