One-Way Frequency Tables using SAS
PROC FREQ
See www.stattutorials.com/SASDATA for files mentioned in this tutorial
© TexaSoft, 2006
These SAS statistics tutorials briefly explain the use and
interpretation of standard statistical analysis techniques for Medical,
Pharmaceutical, Clinical Trials, Marketing or Scientific Research. The examples
include how-to instructions for SAS Software.
Creating One-Way Frequency Tables with PROC FREQ
Data that are collected as counts require a specific kind of data analysis.
It doesn’t make sense to calculate means and standard deviations on
categorical data. Instead, categorical data are analyzed by creating
frequency and crosstabulation tables. The primary SAS procedure for this
kind of analysis is PROC FREQ.
This tutorial covers the creation and analysis of a single-variable
frequency table using PROC FREQ.
The syntax for PROC FREQ is:
PROC FREQ <options>;
TABLES specification;
<statements>;
Commonly used options in PROC FREQ are:
DATA= (Specify which data set to use)
ORDER=FREQ (Output data in frequency order)
A commonly used statement with PROC FREQ is:
BY varlist (Specify a BY list to create subset analyses)
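As a minimal sketch of the BY statement, the code below produces one frequency table per BY group. It borrows the MYDATA.SOMEDATA data set and STATUS variable used later in this tutorial. GENDER is a hypothetical grouping variable chosen only for illustration, and SORTED is an assumed name for the sorted copy. BY-group processing expects the data to be sorted by the BY variable, hence the PROC SORT step.

```sas
* Sketch only: GENDER is a hypothetical variable, not confirmed to exist in SOMEDATA;
PROC SORT DATA=MYDATA.SOMEDATA OUT=SORTED;
  BY GENDER;
RUN;

PROC FREQ DATA=SORTED;
  BY GENDER;          * one frequency table per value of GENDER;
  TABLES STATUS;
RUN;
```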
The TABLES statement specifies which tables will be produced. For example,
to obtain counts of the number of subjects in each GROUP category, use
the code:
PROC FREQ;
TABLES GROUP;
To produce a chi-square test for goodness of fit, use code such as:
PROC FREQ;
TABLES COLOR / CHISQ NOCUM TESTP=(0.5625 0.1875 0.1875 0.0625);
(See details about these options later in the tutorial.)
When only one variable is used in the TABLES statement, PROC FREQ produces
a frequency table. For example, using the data from the SOMEDATA SAS data
set, the following code produces a frequency table using data in the STATUS
variable (PROCFREQ1.SAS):
* ASSUMES YOU HAVE A SAS LIBRARY NAMED MYDATA;
ODS RTF;
PROC FREQ DATA=MYDATA.SOMEDATA;
  TABLES STATUS;
  TITLE 'Simple Example of PROC FREQ';
RUN;
PROC FREQ DATA=MYDATA.SOMEDATA ORDER=FREQ;
  TABLES STATUS;
  TITLE 'Simple Example of PROC FREQ';
RUN;
ODS RTF CLOSE;
The output for this job is:
Socioeconomic Status
STATUS | Frequency | Percent | Cumulative Frequency | Cumulative Percent
1 | 3 | 6.00 | 3 | 6.00
2 | 7 | 14.00 | 10 | 20.00
3 | 6 | 12.00 | 16 | 32.00
4 | 8 | 16.00 | 24 | 48.00
5 | 26 | 52.00 | 50 | 100.00
The Frequency column gives the number of times the STATUS variable took on
the value shown in the STATUS column. The Percent column is the percent of
the total (50). The Cumulative Frequency and Cumulative Percent columns
report a running count and percent across the values of STATUS. Use this
type of analysis to discover the distribution of the categories in your data
set. For example, in these data, over half of the subjects fall into the
STATUS=5 category. If you’d hoped for a representative sample in each
category, this shows you that that criterion was not met.
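To make the column definitions concrete, here is a small hand check that rebuilds the Percent and cumulative columns from the five counts shown above. It is only a sketch: the data set name FREQCHECK and the variable names COUNT, TOTAL, CUMFREQ and CUMPCT are assumptions introduced for illustration, and the total of 50 is typed in rather than computed.

```sas
* Hand check of the columns PROC FREQ reports, using the STATUS counts above;
DATA FREQCHECK;
  INPUT STATUS COUNT;
  TOTAL = 50;                        * total number of subjects;
  PERCENT = 100 * COUNT / TOTAL;     * percent of total;
  CUMFREQ + COUNT;                   * sum statement keeps a running total;
  CUMPCT = 100 * CUMFREQ / TOTAL;    * running percent of total;
  DATALINES;
1 3
2 7
3 6
4 8
5 26
;
PROC PRINT DATA=FREQCHECK NOOBS;
RUN;
```

The printed values should match the PROC FREQ table row by row, which is a quick way to confirm how each column is derived.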
Exercise:
The ORDER=FREQ option orders the table by frequency. Change the PROC FREQ
line to read
PROC FREQ DATA=MYDATA.SOMEDATA ORDER=FREQ; TABLES STATUS;
and rerun the program to get the output sorted by frequency. This helps you
identify which categories have the most and fewest counts.
Socioeconomic Status
STATUS | Frequency | Percent | Cumulative Frequency | Cumulative Percent
5 | 26 | 52.00 | 26 | 52.00
4 | 8 | 16.00 | 34 | 68.00
2 | 7 | 14.00 | 41 | 82.00
3 | 6 | 12.00 | 47 | 94.00
1 | 3 | 6.00 | 50 | 100.00
Suppose your data have already been summarized into counts. In this case,
you can use the WEIGHT statement to tell PROC FREQ that each observation
represents a count. For example (PROCFREQ2.SAS):
DATA CDS;
  INPUT @1 CATEGORY $9. @10 NUMBER 3.;
  DATALINES;
JAZZ     252
POP       49
CLASSICAL 59
RAP       21
GOSPEL    44
JAZZ      21
;
ODS RTF;
PROC FREQ DATA=CDS ORDER=FREQ;
  WEIGHT NUMBER;
  TITLE3 'READ IN SUMMARIZED DATA';
  TABLES CATEGORY;
RUN;
ODS RTF CLOSE;
This program produces the following table:
CATEGORY | Frequency | Percent | Cumulative Frequency | Cumulative Percent
JAZZ | 273 | 61.21 | 273 | 61.21
CLASSICAL | 59 | 13.23 | 332 | 74.44
POP | 49 | 10.99 | 381 | 85.43
GOSPEL | 44 | 9.87 | 425 | 95.29
RAP | 21 | 4.71 | 446 | 100.00
Notice that although the data were summarized, there were two observations
in the data set for “JAZZ” which were combined into a single category in the
table.
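The reverse direction can be sketched as well: PROC FREQ can write its counts to a data set with the OUT= option of the TABLES statement, and that summarized data set can then be analyzed with WEIGHT. The example below assumes the MYDATA.SOMEDATA data set from the earlier example. STATUSCOUNTS is a hypothetical output name. COUNT is the variable that OUT= creates to hold the frequencies.

```sas
* Sketch: summarize raw data into counts, then analyze the counts with WEIGHT;
PROC FREQ DATA=MYDATA.SOMEDATA NOPRINT;
  TABLES STATUS / OUT=STATUSCOUNTS;   * writes STATUS, COUNT and PERCENT;
RUN;

PROC FREQ DATA=STATUSCOUNTS;
  WEIGHT COUNT;                       * each row stands for COUNT subjects;
  TABLES STATUS;
RUN;
```

The second step should reproduce the frequency table obtained from the raw data, since the weights simply restore the original counts.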
A goodness-of-fit test of a single population is a test to determine if the
distribution of observed frequencies in the sample data closely matches the
expected number of occurrences under a hypothetical distribution of the
population. The data observations must be independent and each data value
can be counted in one and only one category. It is also assumed that the
number of observations is fixed. The hypotheses being tested are
Ho: The population follows the hypothesized distribution.
Ha: The population does not follow the hypothesized distribution.
A Chi-Square statistic is calculated and a decision can be made based on
the p-value associated with that statistic. A low p-value indicates
rejection of the null hypothesis. That is, a low p-value indicates that the
data do not follow the hypothesized, or theoretical, distribution.
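For reference, the standard Pearson form of the goodness-of-fit statistic is written out below. The symbols are notation introduced here, not taken from the tutorial: k is the number of categories, n the total number of observations, O_i the observed count in category i, p_i the hypothesized proportion, and E_i the expected count.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = n\,p_i,
\qquad \text{df} = k - 1
```

Under the null hypothesis, this statistic approximately follows a chi-square distribution with k - 1 degrees of freedom, which is where the p-value comes from.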
The data for this example come from Zar (1999), page 465. According
to a genetic theory, crossbred pea plants show a 9:3:3:1 ratio of yellow
smooth, yellow wrinkled, green smooth, and green wrinkled offspring. Out of
250 plants, under the theoretical ratio (distribution) of 9:3:3:1, you would
expect about
(9/16)x250=140.625 yellow smooth peas (56.25%)
(3/16)x250=46.875 yellow wrinkled peas (18.75%)
(3/16)x250=46.875 green smooth peas (18.75%)
(1/16)x250=15.625 green wrinkled peas (6.25%)
After growing 250 of these pea plants, you observe that
152 have yellow smooth peas
39 have yellow wrinkled peas
53 have green smooth peas
6 have green wrinkled peas
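Before turning to PROC FREQ, the arithmetic can be checked by hand. The DATA step below is a minimal sketch that sums (observed - expected)^2 / expected over the four categories and writes the result to the SAS log. The array names OBSERVED and HYPP and the variables N, CHISQ, DF and PVALUE are assumptions introduced for illustration. PROBCHI is the SAS chi-square distribution function.

```sas
* Hand check of the goodness-of-fit statistic, written to the SAS log;
DATA _NULL_;
  ARRAY OBSERVED[4] _TEMPORARY_ (152 39 53 6);
  ARRAY HYPP[4] _TEMPORARY_ (0.5625 0.1875 0.1875 0.0625);
  N = 250;                                   * total number of plants;
  CHISQ = 0;
  DO I = 1 TO 4;
    EXPECTED = N * HYPP[I];                  * expected count under 9:3:3:1;
    CHISQ = CHISQ + (OBSERVED[I] - EXPECTED)**2 / EXPECTED;
  END;
  DF = 4 - 1;                                * categories minus one;
  PVALUE = 1 - PROBCHI(CHISQ, DF);           * upper tail probability;
  PUT CHISQ= DF= PVALUE=;
RUN;
```

With these counts the statistic works out to about 8.97 on 3 degrees of freedom, which should agree with the Chi-Square value that PROC FREQ reports.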
You can perform this analysis using the following SAS program
(PROCFREQ3.SAS):
DATA GENE;
  INPUT @1 COLOR $13. @15 NUMBER 3.;
  DATALINES;
YELLOWSMOOTH  152
YELLOWWRINKLE  39
GREENSMOOTH    53
GREENWRINKLE    6
;
* HYPOTHESIZING A 9:3:3:1 RATIO;
PROC FREQ DATA=GENE ORDER=DATA;
  WEIGHT NUMBER;
  TITLE3 'GOODNESS OF FIT ANALYSIS';
  TABLES COLOR / CHISQ NOCUM TESTP=(0.5625 0.1875 0.1875 0.0625);
RUN;
- The CHISQ option requests that a Chi-Square test be performed.
- The TESTP=() option specifies the hypothesized proportions to be tested.
  (You could have used the TESTF=() option with expected frequencies
  instead.)
- The NOCUM option suppresses cumulative frequencies.
- Use the ORDER=DATA option to cause SAS to display the data in the same
  order as they are entered in the input data set.
The result of
this analysis is: