Re: Normalization variables in regression analysis question

Date: Mon Mar 19 01:50:20 2001
Posted By: Mark Huber, Post-doc/Fellow, Statistics, Stanford University
Area of science: Other
ID: 983474606.Ot

Message:

For those who might not be aware, discriminant analysis is a means for discovering which variables most affect the outcome of an experiment. For instance, a medical researcher might be interested in two categories: (a) people who develop heart disease and (b) people who don't. The data might include height, weight, amount of exercise and other factors. The goal of the discriminant analysis is to find which variables are most useful in predicting the class to which a patient belongs.

As you rightly point out in your question, the variables are usually assumed to be normally distributed, and discrete yes/no variables present a problem for standard analysis packages. I'll suggest two approaches that you might consider. First, simply sum up the {0,1} variables for each line of data. If you have a fair number of these variables, each is roughly equally likely to be yes or no, and they are largely independent, then by the Central Limit Theorem the sum will be roughly normally distributed. This sum can then be used just like the other variables. If two of these yes/no variables are not independent, then they could perhaps be consolidated into one column. Alternatively, the yes/no variables may be broken into one or more groups and summed within their groups to generate several extra variables.
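As a rough illustration of the summing idea, here is a minimal Python sketch. The function names, the choice of 12 yes/no variables, and the simulated answers are all assumptions made up for the example; the simulation draws independent, equally likely answers, so it only shows what happens when the conditions above hold.

```python
import random
import statistics

def sum_indicators(row):
    """Collapse a list of 0/1 answers into a single count of 'yes' answers."""
    return sum(row)

def simulate_scores(n_rows, n_vars, p_yes=0.5, seed=0):
    """Simulate n_rows lines of data, each with n_vars independent yes/no answers."""
    rng = random.Random(seed)
    return [
        sum_indicators([1 if rng.random() < p_yes else 0 for _ in range(n_vars)])
        for _ in range(n_rows)
    ]

scores = simulate_scores(n_rows=5000, n_vars=12)

# Theory for independent answers: mean = n_vars * p, variance = n_vars * p * (1 - p)
print("mean:", statistics.mean(scores))           # should be near 6
print("variance:", statistics.pvariance(scores))  # should be near 3
```

With only two or three yes/no columns the sum takes too few distinct values to look normal, which is one reason the second approach may be needed.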

If there are too few of these variables, or the assumptions above do not hold, or your primary goal is prediction, then one thing to try is to run separate analyses for the different values. Suppose you have just one column of yes/no data. Then run one analysis on only the yes data and a separate analysis on only the no data. If the results are sharply different, the choice is important, and your final prediction should depend on that variable. If, however, the resulting predictions are pretty much independent of the value of the yes/no variable, then it is probably not important and should be ignored.
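The split-and-compare idea can be sketched in Python as follows. This is a deliberately simplified stand-in for a full discriminant analysis: it uses a single continuous variable and takes the cutoff to be the midpoint of the two class means, which is what a one-variable linear discriminant gives under equal priors and a shared variance. The records, field layout, and function names are hypothetical.

```python
from statistics import mean

# Made-up records: (yes_no_flag, weight_kg, developed_disease)
records = [
    (1, 95.0, 1), (1, 90.0, 1), (1, 70.0, 0), (1, 75.0, 0),
    (0, 85.0, 1), (0, 80.0, 1), (0, 60.0, 0), (0, 65.0, 0),
]

def midpoint_cutoff(rows):
    """One-variable discriminant cutoff: midpoint of the two class means."""
    pos = [x for _, x, y in rows if y == 1]
    neg = [x for _, x, y in rows if y == 0]
    return (mean(pos) + mean(neg)) / 2

def cutoffs_by_flag(rows):
    """Run the same analysis separately on the 'no' (0) and 'yes' (1) rows."""
    return {
        flag: midpoint_cutoff([r for r in rows if r[0] == flag])
        for flag in (0, 1)
    }

cutoffs = cutoffs_by_flag(records)
print(cutoffs)
```

In this toy data the two cutoffs differ by 10 kg, so the yes/no variable would change the prediction for anyone falling between them. If the two cutoffs had come out nearly equal, that would suggest the yes/no variable could be ignored.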

As to your final question concerning the way in which SPSS handles missing data, without further information I've got to side with the program on this one. The whole point of the analysis is to compare how all the variables relate in order to classify the final result. If you are missing data from a particular variable for one of your samples, that sample does not help in determining how that variable relates to the others. If that variable is unimportant, then running the analysis without it should increase your N and your ability to discriminate. But if the variable is important, any data you use should have a value for that variable.
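The trade-off between keeping a variable and keeping samples can be seen in a small Python sketch. It assumes the analysis keeps only samples that have a value for every variable in use (often called listwise deletion); the cases and the helper name are made up for illustration and say nothing about how SPSS is implemented.

```python
# Made-up cases; None marks a missing measurement.
cases = [
    {"height": 170, "weight": 70, "exercise": 3},
    {"height": 165, "weight": None, "exercise": 1},
    {"height": 180, "weight": 85, "exercise": None},
    {"height": 175, "weight": 78, "exercise": 2},
    {"height": 160, "weight": 55, "exercise": None},
]

def complete_cases(rows, variables):
    """Keep only the rows that have a value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

n_all = len(complete_cases(cases, ["height", "weight", "exercise"]))
n_without_exercise = len(complete_cases(cases, ["height", "weight"]))
print(n_all, n_without_exercise)
```

Dropping the variable with the most gaps raises the number of usable samples here, which only pays off if that variable was not contributing to the discrimination in the first place.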

Mark Huber
