Descriptive Statistics

As mentioned previously, a frequency distribution contains all of the observations for a particular

sample, which we refer to as the raw data. Only on rare occasions do we present the raw data. The

purpose of descriptive statistics is to provide a means of summarizing the information contained

within a frequency distribution. The two most important pieces of information that need to be

provided for any distribution are the central tendency of the distribution, and the dispersion of

the distribution. Measures of central tendency essentially describe the position of the distribution

on the X-axis (the value of the variable being measured), whereas measures of dispersion describe

how spread out the observations are along the X-axis. In a number of cases the shape of the

distribution, specifically the degree of symmetry, will also be important to describe.

Central Tendency

(Chapter 3 in Zar, 2010)

Where a particular distribution of data is located on the X-axis (which represents the values of the

variable being measured) is summarized by reference to some value associated with the

approximate center of the distribution. The 3 standard measures of central tendency are mean,

median, and mode.

The mean simply is the arithmetic average of the observations, and is a summary statistic that we

all are familiar with. For this reason, we will use the formula for calculation of the sample mean to

indicate some of the notation that we will be using throughout the semester:

Ȳ = ∑Y / n

We will use Y to denote the value of an observation. The total number of observations, i.e., the

sample size, will be denoted as n. When we wish to identify a specific observation, we will use a

subscript. For example, Y4 would indicate the 4th observation. ∑Y is the notation that we will use

for the sum of all the observations (Y1+Y2+Y3+…+Yn). In this case a subscript for the Y would

indicate a particular group of observations, e.g., ∑Ycontrol would indicate the sum of all the

observations for the control group. The Y with the bar over it is the sample mean, and it is an

estimate, based on our sample, of the actual mean of the statistical population. We can express this

formally as:

Ȳ = μ̂

The population mean is represented by the symbol μ (lower case Mu in the Greek alphabet), and

the caret (^) over the top of it indicates that it is an estimate. Thus, the preceding formula can be

read as “the sample mean is an estimate of the population mean”.
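The notation above can be sketched in a few lines of Python. This is only an illustration: the observations are hypothetical values chosen for the example, and the variable names (Y, n, Y4, sum_Y, Y_bar) are simply stand-ins for the symbols in the text.

```python
# Hypothetical observations Y1..Yn (illustrative values, not course data).
Y = [12.0, 9.5, 10.5, 11.0, 8.0]

n = len(Y)          # sample size, n
Y4 = Y[3]           # the 4th observation (Python counts from 0)
sum_Y = sum(Y)      # sum of all the observations, Y1 + Y2 + ... + Yn
Y_bar = sum_Y / n   # sample mean, our estimate of the population mean

print(n, Y4, sum_Y, Y_bar)
```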

When making estimates, we are far more concerned with accuracy (proximity to the actual value)

than we are with precision (proximity of estimates to one another). Of critical importance to

obtaining accuracy in our estimates is the use of estimators that are unbiased. An unbiased

estimator is as likely to overestimate as it is to underestimate, whereas a biased estimator will tend

to consistently overestimate, or consistently underestimate. Write that down…you are about to be

asked to evaluate bias.

We can use our newfound skills at reading frequency distributions (if you are feeling less than

skillful, review last week’s material) to examine the behavior of the sample mean as an estimate of

the population mean. The following graph was produced by drawing (at random) 1000 samples of

50 observations each from a statistical population where μ=10. This population mean of 10 was

subtracted from each of the sample means (such that a value of 0 would indicate that the sample

mean and population mean were identical) calculated from the 1000 samples to produce the

following distribution:

Note: These data were produced as the “smean” object in this R program
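If you would like to repeat this kind of experiment yourself, the following is a minimal Python sketch of the procedure described above. It is not the R program referred to in the note. It assumes a normally distributed population with a standard deviation of 2, because the text gives only the population mean for this graph, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(1)      # fixed seed so the sketch is repeatable

MU = 10.0           # population mean, from the text
SIGMA = 2.0         # ASSUMPTION: the text does not give the spread for this graph
N_SAMPLES = 1000    # number of random samples, from the text
N_OBS = 50          # observations per sample, from the text

differences = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N_OBS)]
    # A value of 0 would mean the sample mean matched the population mean.
    differences.append(statistics.fmean(sample) - MU)

over = sum(d > 0 for d in differences)    # samples that overestimated
under = sum(d < 0 for d in differences)   # samples that underestimated
print("overestimates:", over, "underestimates:", under)
print("mean difference:", statistics.fmean(differences))
```

Comparing the two counts, and looking at how close the mean difference is to 0, is one way to approach the question that follows.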

Question 1: From these data, does the sample mean appear to be an unbiased estimate of the

population mean? Justify your answer.

The other 2 measures of central tendency, the median and the mode, will return values similar to

the mean for distributions that are symmetrical, like the one above, but can convey different, and

sometimes important, information when applied to asymmetrical distributions. The median is the

middle observation when the observations are aligned in ascending (or descending) order by the

magnitude of their values. This can be a useful measure, because 50% of your observations are

above that value, and 50% of your observations fall below that value. The mode is the value

that occurs the most frequently, i.e., the peak of the frequency distribution.
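As a small illustration of these two measures, the sketch below uses Python's standard statistics module on hypothetical counts (not course data). The two lists differ only in having an odd or an even number of observations.

```python
import statistics

# Hypothetical counts, already sorted in ascending order.
odd_counts = [1, 2, 2, 3, 4, 6, 9]         # 7 observations
even_counts = [1, 2, 2, 3, 4, 6, 9, 15]    # 8 observations

median_odd = statistics.median(odd_counts)     # the middle (4th) observation
median_even = statistics.median(even_counts)   # midpoint of the 4th and 5th observations
mode_odd = statistics.mode(odd_counts)         # the most frequent value
mean_odd = statistics.mean(odd_counts)         # arithmetic mean, for comparison

print(median_odd, median_even, mode_odd, mean_odd)
```

With these right-skewed counts, the mean lands above the median, which in turn lands above the mode.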

For distributions that are symmetrical, the mean, median, and mode should converge on the same

value. The distribution that follows displays observations of feeding rates of fruit fly (Drosophila

melanogaster) larvae, measured by counting the number of times the feeding apparatus

(cephalopharyngeal sclerites) contracted over the period of a minute.

The distribution of feeding rates is (more or less) symmetrical, resulting in the sample mean,

sample median, and sample mode all being approximately 85 contractions per minute. For

symmetrical distributions, all 3 measures convey basically the same information. When

distributions are asymmetrical, you have to carefully consider what information you wish to

convey when choosing a measure of central tendency. The following distribution was created by

examining the distribution of Vicia sp. (vetch), a twining legume growing in the lawn outside of the

Pacer Commons dorm on campus, by counting the number of individuals present in a series of 0.5

m² quadrats:

As you can see, this distribution exhibits a positive (right) skew, resulting in different values for

the sample mean, sample median, and sample mode. Reporting the sample mean will give a value

that does not occur frequently as an observation, and so you would have to weigh whether

frequency is a more important piece of information than the position of the distribution for the

question that you are addressing.

In some instances, a distribution may be suggestive of more than one coherent group of

observations, such as the distribution of exam grades shown below:

In such cases, the sample mean and median are poor indications of the pattern, and one should

report both modes (this type of distribution is referred to as “bimodal”).
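A quick way to see why the mean and median mislead here is to compute all three measures for a bimodal sample. The grades below are hypothetical (they are not the grades in the graph), and statistics.multimode is used because it returns every value tied for most frequent.

```python
import statistics

# Hypothetical exam grades forming two clusters.
grades = [55, 60, 60, 60, 65, 70, 75, 85, 90, 90, 90, 95]

modes = statistics.multimode(grades)       # all values tied for most frequent
mean_grade = statistics.fmean(grades)
median_grade = statistics.median(grades)

print("modes:", modes)
print("mean:", mean_grade, "median:", median_grade)
```

In this made-up sample the mean and median both fall between the two clusters, at values that few students actually earned.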

While it is important to recognize the existence and potential uses for other measures of central

tendency, it will be a rare occasion when a measure of central tendency other than the sample mean

is reported.

Dispersion

(Chapter 4 in Zar, 2010)

While the position of a distribution on the X-axis is a critical piece of information to convey, the

relevance of that measure depends on how wide that distribution is, i.e., the amount of variation in

that variable, especially when making comparisons between or among distributions. Measures of

dispersion are indices of how spread out the observations are along the X-axis.

The simplest measure of dispersion is the range, which involves reporting the lowest and highest

observation, or the difference between them. This measure is very sensitive to outliers, which are

values that are unusually high or low relative to the other observations. While it is not difficult to

find recommendations for excluding outliers from a set of data, unless it is clear that the

observation is impossible, e.g., a human body temperature of 183 degrees C, or it is known that an

error in measurement occurred, one should always be hesitant to remove such observations (see

section 2.5 in chapter 2 of your text).
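The sensitivity of the range to a single observation is easy to demonstrate. The sketch below uses hypothetical body temperatures in degrees C, and value_range is just an illustrative helper name.

```python
# Hypothetical body temperatures in degrees C (not course data).
temps = [36.4, 36.6, 36.8, 37.0, 37.1]
with_outlier = temps + [39.9]   # one unusually high, but possible, reading


def value_range(values):
    """Return the range: highest observation minus lowest observation."""
    return max(values) - min(values)


print(value_range(temps))
print(value_range(with_outlier))
```

Adding one observation changes the range fivefold, even though the other five observations are untouched.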

The reason that range is sensitive to outliers is that it relies on only 2 of your observations. Clearly

a measure of dispersion that relied on all of your observations would be of more value, and better

justify all the hard work that went into collecting those observations. Our newfound, and in-depth,

understanding of central tendency suggests one possible measure: the average distance of the

observations from the center of the distribution.

The distance of an observation from the sample mean can be calculated by subtracting the sample

mean from the observation as follows:

y = Y - Ȳ

This value, indicated by a lowercase y, is called a deviate. Intuitively then, the average distance

would be the sum of the deviates, ∑y, divided by the number of observations, n. The problem with

this can be illustrated by examining the following table of quiz scores from 2 separate sections of a

biology class:

Because the sample mean is the mathematical center of the observations, the sum of the deviates

will always (within rounding error) be equal to zero. The two distributions of quiz scores are

clearly different, but the average deviations will provide no information about these differences.

The solution that we will apply is to square the deviates, making all of the differences positive. The

notation that we will use for a squared deviate will be y², such that ∑y² will indicate the sum of

the squared deviates. The sum of the squared deviates is generally referred to as the sum of

squares, and is a value that will figure prominently in virtually all of the analyses that we will

address, so make sure that you are familiar with how to calculate it, and what it represents.
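To make the contrast between the sum of the deviates and the sum of squares concrete, here is a minimal Python sketch. The two sets of quiz scores are hypothetical, chosen so that both sections share the same mean but differ in spread, and the function names are only illustrative.

```python
# Hypothetical quiz scores for two sections with the same mean (8).
section_1 = [6, 7, 7, 8, 8, 8, 9, 9, 10]
section_2 = [2, 4, 6, 8, 8, 8, 10, 12, 14]


def deviates(values):
    """Return each observation minus the sample mean."""
    mean = sum(values) / len(values)
    return [y - mean for y in values]


def sum_of_squares(values):
    """Return the sum of the squared deviates."""
    return sum(d ** 2 for d in deviates(values))


for name, scores in (("section 1", section_1), ("section 2", section_2)):
    print(name,
          "sum of deviates:", sum(deviates(scores)),
          "sum of squares:", sum_of_squares(scores))
```

Both sums of deviates come out to zero, so they cannot distinguish the sections, whereas the sums of squares differ.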

Applying this to the quiz score data, we can see that the sum of squares (∑y²) better reflects the

differences between the two distributions:

Dividing the sum of the squared deviates by the number of observations (∑y²/n) will give us the

average squared distance of the observations from the mean of the observations. While it should be

intuitive that this is a good measure of the spread of the observations (apart from using squared

distances, which we will address shortly), we cannot lose sight of the fact that the purpose of

deriving this value from a sample is to estimate the same parameter for the statistical population.

Thus, it is important to establish whether calculating this value as described will introduce a bias in

the estimation of the same population parameter.

Calculation of the average squared distance of the observations from the mean for a statistical

population, i.e., using every observation that exists, is a parameter that we call the population

variance, and denote using the symbol: σ². Unfortunately, using the same calculation from sample

data produces a biased estimate of σ². The following distribution was produced by taking 1000

random samples from a statistical population with μ=10, and σ²=4, and calculating the average

squared distance of the observations from the mean of the observations for each sample. For each

sample, the population variance (σ²) was subtracted from the average squared distance of the

observations from the sample mean ((∑y²/n)-σ²) to produce the values shown below, such that an

estimate matching the population variance would result in a value of 0:

Note: These data were produced as the “pvd” object in this R program

Question 2: In what direction is the bias demonstrated for the average squared distance of

the observations from the sample mean as an estimate of σ²?

The distribution above suggests that a different calculation must be used to produce an unbiased

estimate of σ² from sample data. In this instance the correction is a simple one, involving the use

of n-1 in the denominator instead of n. The resulting formula calculates a statistic that we call the

sample variance, denoted as s²:

s² = ∑y² / (n-1)

In the following graph, the sample variance (s²) calculated from the same series of 1000 random

draws has been plotted as a second series (SS/(n-1)):

Note: The additional series was produced as the “svd” object in this R program
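The comparison of the two denominators can be sketched in Python as follows. This is not the R program referred to in the notes. It assumes a normally distributed population and 10 observations per sample, because the text does not state the sample size used for these graphs, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(2)      # fixed seed so the sketch is repeatable

MU = 10.0           # population mean, from the text
VAR = 4.0           # population variance, from the text
N_SAMPLES = 1000    # number of random samples, from the text
N_OBS = 10          # ASSUMPTION: the text does not state the size of each sample

diff_n = []           # (SS / n) minus the population variance
diff_n_minus_1 = []   # (SS / (n - 1)) minus the population variance
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, VAR ** 0.5) for _ in range(N_OBS)]
    mean = statistics.fmean(sample)
    ss = sum((y - mean) ** 2 for y in sample)   # sum of squares
    diff_n.append(ss / N_OBS - VAR)
    diff_n_minus_1.append(ss / (N_OBS - 1) - VAR)

print("mean of SS/n minus variance:    ", statistics.fmean(diff_n))
print("mean of SS/(n-1) minus variance:", statistics.fmean(diff_n_minus_1))
```

An unbiased estimator should give a mean difference close to 0 over many samples, so comparing the two printed values shows which denominator behaves that way.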

From this distribution, we can see that the correction for sample variance removes the bias from

the estimate. Thus, we will use sample variance (s²) as our best estimate of population variance

(σ²):

s² = σ̂²

The only issue one may take with variance as an indication of the spread of the data, is that the

units are squared relative to the values of the observations and, therefore, the mean. The solution to

this, as you might imagine, is a simple one: simply take the square root of the variance. This

produces a value referred to as the standard deviation, which, for a sample, we denote as s, and

for a population, we denote as σ. Obviously (at least I hope that it is obvious), the square root of a

sample variance (calculated with n-1 as the denominator) will produce a sample standard deviation

(s), and the square root of a population variance (calculated using n as the denominator) will

produce a population standard deviation (σ). Given that we will almost always be working with

samples, we will use sample standard deviation as our estimate of population standard deviation:

s = σ̂
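The relationship between the sample and population versions of these statistics can be checked with a short Python sketch. The observations are hypothetical, and the hand calculation is compared against the functions in Python's standard statistics module, where variance and stdev divide by n-1 while pvariance and pstdev divide by n.

```python
import math
import statistics

Y = [4.0, 6.0, 7.0, 9.0, 14.0]   # hypothetical observations

n = len(Y)
mean = sum(Y) / n
ss = sum((y - mean) ** 2 for y in Y)   # sum of squares

s2 = ss / (n - 1)     # sample variance
s = math.sqrt(s2)     # sample standard deviation

# The standard library uses the same n - 1 denominator for these two functions.
assert math.isclose(s2, statistics.variance(Y))
assert math.isclose(s, statistics.stdev(Y))

# pvariance and pstdev divide by n instead, treating Y as an entire population.
print("sample:    ", s2, s)
print("population:", statistics.pvariance(Y), statistics.pstdev(Y))
```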

Now let’s practice calculating some descriptive statistics for some actual data. Download the Excel

workbook for this week’s exercise HERE.

Bird Data

The first worksheet (birds) contains the data from Example 3.3 in your textbook (p. 25). This will

allow you to double-check your calculations, and the ones Excel does for you.

In cell F15, type the formula to calculate the sample mean for species B as:

=SUM(F3:F12)/COUNT(F3:F12)

Type “mean” in the cell immediately adjacent to the cell containing the sample mean (G15), so that

you don’t become confused later (and so that I am not confused when I review your spreadsheet).

Excel has a function to calculate the median that we will use in cell F16:

=MEDIAN(F3:F12)

Add a label for the median in the adjacent cell as you did for the mean. Note that the value for the

median does not occur among the list of observations. The reason for this is that when there are an

even number of observations, we interpolate between the 2 middle observations to get the median

value.

Now highlight the 2 cells containing the formulae for mean and median, use “Ctrl+c” to copy the

cells, click on cell A15, and use “Ctrl+v” to paste. That feeling of anxiety that you are experiencing

is the result of your conscious (or subconscious) recognition that the sample sizes for the 2 groups

of observations differ. Pasting the formulas results in calculations for species A that include a

blank cell. Use “F2” to verify this.

Remember the words inscribed in friendly letters upon each copy of The Hitchhiker’s Guide to the

Galaxy: “Don’t Panic” (if you have yet to read any of the 5 books in this trilogy, please correct this

alarming oversight at your earliest convenience). For now, let’s take an objective and analytical

approach to examining the consequences of our actions.

Because there are an odd number of observations for the life span of species A, and because these

observations have been sorted in order of ascending value, we can see at a glance that there is, in

fact, a middle observation, and that the value of that observation matches the value of the median

as calculated by Excel. It would appear that the “MEDIAN” function ignores blank cells. We can

verify that the same is true for both the “SUM” and “COUNT” functions by recalculating the mean

using the “AVERAGE” function. Type the following into cell A17:

=AVERAGE(A3:A11)

Not only have we verified that several important functions ignore blank cells, which makes life a

little easier (because we can paste formulas) when dealing with unequal sample sizes, but we also

have verified that the “AVERAGE” function follows the formula that we learned (or more likely

were reminded of) for the sample mean. Feel the tension draining away?

We now are going to work on calculating the variance for both samples. In cell G3, type the

formula to calculate the deviate as:

=F3-F$15

Having the anchor ($) for the row number allows you to copy the formula down for the remaining

observations while referencing the same cell for the mean. Anchoring the column is not necessary

when the formula is only being copied down, and leaving the column unanchored will allow you to

copy the column in its entirety to calculate the deviates for the observations for species A, because

the reference will match the location of the sample mean. You will be doing yourself a favor if you

take the time to verify this…

In the next column, type the formula to square the deviate as:

=G3^2

Copy the formula down the column. We could have eliminated a step by using a single formula

(=(F3-F$15)^2) in column G, but this is a good reminder of the steps that we discussed (and besides,

I made you put labels where we would need to calculate sums).

It’s time to take the training wheels off. Let’s remind ourselves of the formula for sample variance:

s² = ∑y² / (n-1)

You should be able to calculate the sample variance using the “SUM” function, and the “COUNT”

function. Presumably you can count the observations on your own, but this will be good practice

for when we use larger sample sizes. Just make sure that you use parentheses in your formula to

get the correct order of operations when subtracting 1 from the count, or you will be subtracting 1

from the population variance! You also should be able to repeat these calculations for Species A by

cutting and pasting if you have been careful with your cell references.
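The order-of-operations warning applies in any language that follows the usual arithmetic precedence, not just in Excel. The Python sketch below uses hypothetical squared deviates to show how the missing parentheses change the result. The names wrong and right are only labels for the two versions.

```python
squared_deviates = [16.0, 4.0, 1.0, 1.0, 36.0]   # hypothetical squared deviates

ss = sum(squared_deviates)   # sum of squares
n = len(squared_deviates)    # number of observations

wrong = ss / n - 1      # divides first, then subtracts 1 from the result
right = ss / (n - 1)    # subtracts 1 from the count first, then divides

print(wrong, right)
```

Only the second version is the sample variance. The first is the average squared deviate with 1 subtracted from it.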

Lastly, calculate the sample standard deviation for the two samples. To find the square root of a

value in Excel, the “SQRT” function is used as:

=SQRT(value)

The value can be an actual number, or the cell location for a value. For example, if your sample

variance was located in cell C15, the sample standard deviation could be calculated as:

=SQRT(C15)

Make sure to label both sample variance and sample standard deviation clearly on your worksheet,

and remember to save your work!

Sunfish Data

The second worksheet (fish) contains mass and standard length measurements for bluegill sunfish

(Lepomis macrochirus) and hybrids of bluegill sunfish and green sunfish (Lepomis cyanellus),

collected from a constructed pond in Sedgwick County, Kansas. These measurements have been

used to calculate a “condition factor” (K), which is a ratio of the mass to the cube of the length (in

cm). Fish with a larger value for K will have more mass for a given length. Because green sunfish

have a larger gape, and tend to be more aggressive, there was some question as to whether the

introduction of the hybrids might have a negative effect on the condition of the bluegill sunfishes.
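The condition factor as described above (mass divided by the cube of length) can be sketched in Python as follows. The function name, the use of grams for mass, and the two fish are all assumptions made for illustration. The worksheet may apply a scaling constant or different units that the text does not mention.

```python
def condition_factor(mass_g, length_cm):
    """Ratio of mass to the cube of length, as described in the text.

    ASSUMPTION: mass is in grams; the text specifies only that length is in cm.
    """
    return mass_g / length_cm ** 3


# Hypothetical fish of equal length but different mass (not worksheet values).
lean = condition_factor(mass_g=30.0, length_cm=10.0)
stout = condition_factor(mass_g=40.0, length_cm=10.0)

print(lean, stout)
```

The heavier of two equally long fish gets the larger K, which is the sense in which K measures mass for a given length.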

The following graph shows the frequency distributions for the condition factor for both species:

It should be immediately evident that the distributions are similar in terms of their central

tendencies, but differ in the degree of dispersion of the data.

Question 3: Calculate the mean, variance, and standard deviation of both sets of condition

factor data and determine whether these summary statistics reflect the similarities and

differences that can be observed between the two distributions.

Let’s move on to examining symmetry and standard error…

Send comments, suggestions, and corrections to: Derek Zelmer

Symmetry and Kurtosis

As has been mentioned previously, distributions that have an equal number of observations spread

similarly on either side of the mode, are said to be symmetrical. For such distributions, the mean

and median will have close to the same value. One example that we have seen of a symmetrical

distribution was for the difference of sample means from the population mean:

For this distribution the mean is 0.00368, and the median is 0.00342. The 4 distributions we

worked with in our Excel workbook also are symmetrical, as you can see by comparing the sample

means to the sample medians. The degree to which a distribution deviates from symmetry is

referred to as the skewness. We saw an example of a skewed distribution with the vetch data:

In this example, the data show a positive skew (or right skew), with the tail stretching to the right.

There is a rule of thumb that suggests that the position of the mean relative to the median will give

you the direction of the skew. In this instance, that appears to be the case, as the mean is to the

right of the median, but this rule of thumb is not a good one to apply, especially for multimodal

distributions, because it is not consistent enou…
