How to use this manual
This manual provides a set of data analysis laboratories and exercises that will familiarize you with a number of multivariate (and other) procedures as they are implemented in SYSTAT 8. You should work through these exercises at your own pace, which may be rather slow at first but will accelerate as you become more familiar with SYSTAT. Some points to bear in mind:

1. It is assumed that you have some familiarity with SYSTAT 8. As such, the exercises only sketch the sequence of menu commands required to do certain basic things: they do not necessarily provide step-by-step instructions for each and every analysis. If you are completely unfamiliar with SYSTAT, you can consult the Laboratory exercises for the course BIO 4118, available in .pdf format on the BIO 4118 Course Web: while these labs are written for SYSTAT 7, they provide step-by-step instructions about how to do certain things, and in the vast majority of cases these instructions apply equally well to SYSTAT 8.

2. Each lab has one or more associated command files (.syc). These list the set of commands which I used to do each exercise. If you get stuck on an exercise, you can refer to the appropriate command file to see the sequence of commands that you should use. In fact, you can simply open the appropriate command file in the COMMAND pane (Untitled) using File/Open/Command, and run it as a command file (using File/Submit from current line to end). For each laboratory, you could simply submit the appropriate command file(s) and generate exactly the same output as I have done. This would take at most a couple of minutes, but you would learn virtually nothing about multivariate analysis in general and SYSTAT 8’s implementation of various procedures in particular. So I strongly recommend that you refer to my command files only when you are stuck or you want to check your output.
Also, be aware that because these commands were run on my own machines, you will have to change the paths for both input and output files.

3. The exception to the above occurs when the output requires considerable programming either in SYSTAT or in BASIC. In these instances, it would be very difficult for you to do the exercises by yourself unless you have this expertise already. So these exercises consist primarily of interpreting the output and figuring out how the command files work.

4. Most exercises include one or more questions. Consider these questions carefully! They are designed to (hopefully) illuminate important aspects, limitations, constraints and issues in multivariate analysis.

5. As you work through this manual, please bring any errors or infelicities to my attention, and I will correct them in future versions. I would also encourage you to bring to my attention any suggestions about how the manual might be improved.

Displaying multivariate data

Description

This exercise is designed to familiarize you with the different options for displaying multivariate data in SYSTAT 8.

The data

The file STURGDAT.SYD contains data on individual sturgeon captured in the commercial harvest at two locations (The Pas and Cumberland House) over the past several decades. The data include variables relating to size (e.g. FKLNGTH, RDWGHT), age (AGE), sex (SEX$), location of capture (LOCATION$) and year of capture (YEAR$). The file ALSC8.SYD contains data on a variety of characteristics of Adirondack lakes censused in 1984-1986, including dissolved summer oxygen (DO), pH (PH), acid neutralizing capacity (ANC), surface area (SA) and mean depth (MEANDEPTH). Finally, the file DOG.SYD contains 5 different teeth measurements of several different kinds (GROUP$) of fossil dog.

Exercises

1. Using STURGDAT.SYD, make a SPLOM (Graphics/SPLOM) using FKLNGTH, TOTLNGTH, RDWGHT and AGE.
What inferences would you draw from these plots, and how would they affect subsequent analysis? Repeat this procedure for each sex.

2. Using the same data, do some 3-D plots using the Scatterplot command. While you’re at it, play around with various options and look at the results.

3. Using ALSC8, make a SPLOM using DO, PH, ACIDNC and surface conductivity (SCONDUCT). What inferences do you draw from these plots, and how would they affect subsequent analysis? Redo the plot using SA, MEANDEPTH and VOLUME. What inferences do you draw?

4. Using ALSC8, do some 3-D plots with various options. Experiment!

5. Using the file DOG.SYD, do Icon and Profile plots, using GROUP$ as the Grouping variable and the other five variables as the Feature variables. Repeat this procedure, but first use Data/Standardize to convert values to standardized values. How does this change the output, and why? Repeat using Andrews’ Fourier Plot, enabling the Overlay graphs into single frame option.

Output and Command files

The output and command files for this laboratory are LAB2.SYO and LAB2.SYC respectively.

The bootstrap and jackknife

Description

This exercise is designed to familiarize you with bootstrapping and jackknifing in SYSTAT 8. Here we consider these procedures in the context of two simple univariate procedures: the t-test and simple linear regression. You should, however, be aware that SYSTAT 8 has bootstrapping/jackknifing routines implemented for most of its statistical procedures, including most of the multivariate procedures discussed in class. In the exercises that follow, I will not explicitly deal with how these resampling methods can be implemented in the context of a specific procedure, but the following exercises provide models that you can in principle adapt to other more complex analyses, should you feel so inclined.

The data

The data file GJACKAL.SYD contains length of skull measurements on male and female golden jackals.
The file islands.syd was kindly provided by Attila Kamar, and consists of data on the number of non-marine bird species (BIRD) present on 346 oceanic islands. Also included are various island attributes, including size (AREA), precipitation (PRECEPT), mean annual temperature (TEMPERATURE), etc.

Exercises

1. Open the SYSTAT command file BOOTT-TEST.SYC. This file contains a set of commands for producing the upper and lower 95% confidence limits of the bootstrapped average difference between male and female golden jackal skull sizes, as well as a histogram and summary statistics. To begin with, open the file in the Untitled (Command) pane using File/Open/Command. Then use File/Submit from current line to end. What is the output from this file?

2. Run a standard t-test (see lab 3) comparing the mean skull sizes of males and females, and compare this result with the bootstrapped result. What do you conclude? Note that while the means are quite similar, the 95% confidence intervals are quite different. Why?

3. Examine the BOOTT-TEST command file.
This set of commands does the following: (1) opens the input data file (GJACKAL.SYD); (2) sets the random seed (RSEED); (3) runs the bootstrapped t-test procedure 1000 times using a sample size of 20, and writes the output to a file called TEMP; (4) uses a series of BASIC commands to (i) read in the TEMP file; (ii) check through the TEMP file to find lines whose first two fields (A$, B$) are “Pooled” and “Variance”; (iii) skip from these lines to the next line down using the LAG function, and assign the value in the fifth field (E$) of this line to the variable R1 (note that in the t-test output, this field of this line contains the value of the average difference between males and females using a pooled variance t-test); (iv) use the VAL function to convert this character variable to a numeric value; (v) delete any lines that do not have bona fide values for R1; (vi) write the set of R1 values to an output file (CC). This output file is then sorted in ascending order, with the 25th and 975th values representing the lower and upper 95% confidence limits. This same file is then used to produce a histogram of the 1000 values as well as some summary statistics. (Note that the latter is done using SYSTAT, not BASIC, commands. Note also that unless you specify a path name for the output files, they will be saved in the default SYSTAT directory.)

4. Using the command file JACKREGRESS, run a jackknifed multiple regression, fitting a model in which the number of bird species is modelled as a linear combination of island area, precipitation, temperature and elevation. The set of commands in this file: (1) opens the islands data file and, using GLM, runs the linear regression, saving the jackknifed regression coefficients to the file BOOTREG; (2) computes the “standard” multiple regression coefficients (i.e.
using the normal approximation); (3) opens the file of jackknifed coefficients and computes descriptive statistics, as well as producing histograms of the coefficients for each variable. How do the standard coefficients compare to the mean bootstrapped coefficients? Compare the standard errors of the coefficients generated using the normal approximation with the standard deviation of the bootstrapped coefficients. What does this tell you? (Note: this is a common occurrence: resampled measures of precision in regression are often larger than those obtained under the normal approximation!)

Output and Command files

The command files for these exercises are BOOTT-TEST.CMD and JACKREGRESS.CMD. There are no output files: to get the output, run the command files with the appropriate modifications (i.e. paths for input/output files).

Two-sample comparisons

Description

This exercise is designed to familiarize you with two-sample comparisons in SYSTAT 8. Specifically, this exercise considers the case where there are two samples which we want to compare with respect to their multivariate means or variances.

The data

The file BUMPUS.SYD contains morphological measurements of 49 female sparrows collected by Hermon Bumpus in February 1898 in Rhode Island, including total length, alar extent, head length, length of humerus, and length of keel of sternum, all in mm. A brief description of STURGDAT.SYD is given in Laboratory 2. The file ALSC8 gives the characteristics of a large sample of lakes in the Adirondacks, including dissolved oxygen (DO), pH (PH), acid neutralizing capacity (ACIDNC), area (AREA), elevation (ELEVATION), etc., and the presence/absence of a whole bunch of fish species. In the first set of exercises, we compare the characteristics of sparrows which did and didn’t survive a severe winter storm.
In a second set of exercises, we compare the physicochemical properties of lakes which do and do not have brook trout present (with the overall objective of trying to predict the influence of physicochemical properties on the likelihood of success of brook trout stocking programs).

Exercises

1. Using BUMPUS.SYD, run Analysis of variance (ANOVA)/Estimate using all five variables as the dependent variables and SURVIVAL$ as the factor variable, with Save File/Residuals/data enabled. (The default file name here is ANOVA - keep it as it is.) Then run ANOVA/Hypothesis test entering SURVIVAL$ as the effect. What do you conclude?

2. Using the residual file ANOVA.SYD created in the step above, select Graph/Probability, Graph/Scatterplot and Statistics/Time series/ACF plot to create normal probability plots, plots of residuals versus estimates, and autocorrelation plots respectively for the residuals of all variables for each group separately (by using SELECT statements). (Note: in the residual file, residual(1) corresponds to the first listed variable in the ANOVA, residual(2) to the next listed variable, etc. To make your life easier, you may want to rename these variables in the file.) Recall that the ACF plots test for serial autocorrelation of residuals, in this case within each group. Do you think the assumptions of multivariate normality, homogeneity of variances and independence of residuals are met?

3. Using ANOVA.SYD, run Levene’s test for each set of residuals. To do so, first use Data/Transform/Let, and create a new variable ABSRES = ABS(RESIDUAL). This is the absolute value of the difference between an observed value and the mean for the group. (Recall that the sample means are the least-squares estimates of the population means!) Then compare the mean ABSRES for each variable between survivors and non-survivors. What do you conclude?

4.
Using BUMPUS.SYD, generate the sample covariance matrices for both survivors and non-survivors by first enabling Data/By groups, and then running Statistics/Correlation/Simple, entering all 5 variables in the Variables dialog box and making sure the Type Continuous/Covariance option is enabled. Based on your inspection of the covariance matrices for each group, do you think the assumption of compound symmetry holds? (Note: we shall learn how to test the assumption of equality of covariance matrices in a later laboratory exercise.)

5. Test the one-tailed null hypothesis that survivors are less variable in body size than non-survivors using Levene’s test. First, standardize each variable in each group (survivors and non-survivors) using the Data/Transform menu with By groups (SURVIVAL$) enabled. Then use Statistics/Descriptive statistics/Basic statistics to calculate the medians for each standardized variable for each group. Then use a series of Data/Transform/If then let... statements (disabling By groups before doing so!) to calculate the absolute value of the difference between the standardized values for each observation and the sample medians. Finally, compare the vector of mean absolute deviations of the two groups (survivors and non-survivors) using Statistics/Analysis of variance as described above. What do you conclude? (Note: if you are having trouble figuring out how to do this, consult the SYSTAT command file LAB4.SYC.)

6. Repeat the exercise above, but this time, instead of calculating the absolute deviation of standardized values from the median, calculate Van Valen’s measure (see Lecture notes, lecture 5). Now what do you conclude? How do you explain the difference between this result and the previous one? (Note: again, if you are having trouble figuring out how to do this, consult the SYSTAT command file LAB4.SYC.)

7. Using File/Open/Command, import the command file MVNORMALTEST into the command pane (Untitled).
Then run this file using File/Submit from current line to end. This file produces a plot of the Mahalanobis distances of each lake from the mean of the sample from which it is drawn (either the lakes with or without brook trout) versus its chi-square approximation. Recall that if the data are multivariate normal, this relationship should be a straight line. The commands in this file (1) run a discriminant function analysis using five variables (DO, pH, ACIDNC, elevation and log surface area) and produce an output file (DISCRIM.SYD) of the Mahalanobis distances for each observation (don’t worry about what the DISCRIM procedure is doing here - we use it simply to get the distances; this will be the subject of a later laboratory); (2) create a subset of the data file consisting of lakes where brook trout were present (a total of N = 250 lakes); (3) use this file to rank the Mahalanobis distances; (4) compute the chi-square percentiles using the ranked values and the inverse chi-square function (XIF), which returns a critical value of the chi-square distribution given a certain percentile and degrees of freedom (which in this case is p, the number of variables); (5) plot the distances against the chi-square values. This plot gives us an indication of whether, within the sample of lakes with brook trout, the five variables follow a multivariate normal distribution. What do you conclude? Can you find a transformation that improves the situation?

8. The command file BOX2 is designed to test the equality of two covariance matrices using the Box test (see Lecture Notes). This is quite a long program, which prints various and sundry things throughout. Most of the calculations involve matrices, so they are done within SYSTAT’s MATRIX program.
The commands: (1) produce a set of covariance matrices (COV1, COV2) for the two samples of lakes (with and without brook trout), as well as the logarithm of the determinant of each matrix (LNS1, LNS2); (2) calculate a pooled covariance matrix (POOLC) and the logarithm of its determinant (LNPOOLC) using the degrees of freedom for both COV1 and COV2 (DF1, DF2), as well as the total degrees of freedom (DFN); (3) calculate the Box M and C values (BOXM, BOXC); (4) use these values to compute the chi-squared approximation (XAPPROX) and degrees of freedom (DF). On the basis of these calculations, are the covariance matrices equal?

Output and Command files

The command file used in this lab up to the exercise testing multivariate normality is LAB4.SYC; the output is LAB4.SYO. For the other exercises, the command files are as indicated in the exercises themselves. For these exercises, there are no output files: to generate output, run the command files.

Correlation, covariance and distance matrices

Description

This exercise is designed to familiarize you with the calculation of correlation, covariance, SSCP and distance matrices in SYSTAT.

The data

The data file DOG.SYD contains average measurements for six different skull characters for 7 species of dog, both past and present. Manly described this data set in chapter 1 of his book. The file SKULL.SYD contains four measurements of human skulls dating from 5 periods in the past. See Manly for more details.

Exercises

1. Open DOG.SYD. Then use Data/Transpose to create a new data file which transposes rows and columns, so that in this new file, the columns now represent dog species. You can now change the names of these column variables (which by default are called COL(1), COL(2) etc.) to the names of the species by double-clicking on the variable name in the data sheet and renaming it.
Then choose Statistics/Correlation/Simple to generate SSCP, Covariance and Correlation matrices for the seven species. How is the covariance matrix generated from the SSCP matrix? What do you conclude from these matrices? Why is the procedure NOT a good one to use to generate these matrices? (Note: if you run the DISTANCE command file, it will bomb because the TRANSDOG file created by standardizing and transposing doesn’t have the new variable names (pre-dog, etc.). You have to do this manually in the data sheet.)

2. Repeat the above procedure, except that before transposing, use Data/Transform/Standardize/SD to standardize all variables to zero mean and unit standard deviation. How does the covariance matrix generated here compare to the correlation matrix generated using the raw data?

3. Repeat one or the other of the above exercises, but now write the correlation or covariance or distance matrix to an output file using Save file. Then open this file and see what it looks like.

4. SYSTAT does not have a “canned” algorithm for calculating Penrose and Mahalanobis distances between multivariate groups. However, I have written a short command file (MAHALANOBIS.SYC) which (in this particular case) calculates the Mahalanobis distance between the mean vectors for the sample of skulls contained in the data file SKULL.SYD (this is the same data given in Manly, Table 1.2). The procedures are mostly executed in SYSTAT’s MATRIX command. This set of commands (1) inputs the skull data file; (2) selects for a certain period (e.g. early predynastic) and computes the (unstandardized) covariance matrix for each period (C1, C2, etc.) as well as the degrees of freedom on which these matrices are based (df1, df2, etc.); (3) calculates the pooled covariance matrix; (4) calculates the Mahalanobis distance between the mean vectors for early and late pre-dynastic skulls by multiplying various matrices together.
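For readers who find it easier to follow the matrix algebra outside SYSTAT, the four steps just listed can be sketched in Python with NumPy. This is only an illustrative sketch, not a translation of MAHALANOBIS.SYC: the function names and the two tiny arrays are hypothetical (the skull data are not reproduced here), and it assumes the squared distance is computed as D^2 = (m1 - m2)' C^-1 (m1 - m2), where C is the covariance matrix pooled over all groups supplied.

```python
import numpy as np

def pooled_covariance(groups):
    """Pool within-group covariance matrices, weighting each by its df (n - 1)."""
    dfs = [g.shape[0] - 1 for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]  # one covariance matrix per group
    return sum(df * c for df, c in zip(dfs, covs)) / sum(dfs)

def mahalanobis_sq(groups, i, j):
    """Squared Mahalanobis distance between the mean vectors of groups i and j."""
    pooled = pooled_covariance(groups)
    diff = groups[i].mean(axis=0) - groups[j].mean(axis=0)
    # Solve pooled * x = diff rather than inverting the matrix explicitly.
    return float(diff @ np.linalg.solve(pooled, diff))

# Hypothetical toy data: two groups, four cases each, two variables.
group_a = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
group_b = group_a + np.array([2.0, 0.0])  # same spread, mean shifted on variable 1

d2 = mahalanobis_sq([group_a, group_b], 0, 1)
print("Squared Mahalanobis distance:", d2)
```

Because the function takes a list of groups and two indices, looping over all pairs of indices would fill in a whole distance matrix.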
Run this command file, then look at the output that’s produced and try to figure out what’s going on. (Note that the mean vectors and covariance matrices for some periods are slightly different from what Manly reports on pp. 65 - 67 - either his mistake or mine, I’m not sure which!) How would you modify this command file to generate the distance between Ptolemaic and Roman skulls? (This is easy!) Note that this file only produces one distance at a time. How would you modify it to produce the entire distance matrix? (This is substantially harder!)

Output and Command files

The command files for these exercises are DISTANCE and MAHALANOBIS.CMD. There are no output files: to generate them, run the command files with the appropriate modifications.

Single classification MANOVA (k-groups MANOVA)

Description

This exercise is designed to familiarize you with running single classification MANOVA in SYSTAT, including unplanned comparisons and checking assumptions. The question of interest is: for Adirondack lakes which do not have brook trout, are there differences in the physicochemical properties of lakes with different numbers of top piscivore species (e.g. northern pike, largemouth bass, smallmouth bass, etc.)?

The data

The data file is ALSC8.SYD, described briefly in previous laboratories.

Exercises

1. Open ALSC8.SYD. Then use Data/Transform to create a new variable (PSR$) which gives the number of top piscivore species present in the lake. This variable can have three categories: “none”, “one” or “two or more”. Then choose ANOVA/Estimate model and run a single classification ANOVA with PSR$ as the category and PH, DO, ELEVATION and surface area (SA) as the Dependent variables. Make sure you save the residuals and data together using the Save File/Residuals/Data option.
The output file (if you have Long output enabled) includes: (1) the estimate of effects for each category and each variable (note that because effects sum to zero across all categories, only two groups are included here); (2) the total SSCP matrix; (3) the residual SSCP matrix (what we called the W matrix, i.e. the within-group SSCP matrix); (4) the residual covariance matrix (i.e. the covariance matrix of residual values); (5) the least-squares means for each group for each variable (which, as you recall from the lecture on least-squares estimation, are just the sample means); (6) a test of the hypothesis that the constant in the fitted ANOVA model is zero; (7) a test of the null hypothesis that the group effects in the model are simultaneously zero. For (6) and (7), the univariate F-tests for each variable are printed, as well as the multivariate test statistics (Wilks’ lambda, etc.). Don’t worry about what the canonical correlations etc. mean - we will address this in a later exercise. Based on these results, what would you conclude?

2. Open the file of residuals you saved from the previous analysis. By using SELECT statements, run ACF (Time Series/ACF) plots on all 4 residuals for each group. (Note: in this file, residual(1) corresponds to the first listed variable in the ANOVA, i.e., pH; residual(2) to the next listed variable, etc. To make your life easier, you may want to rename these variables in the file.) Recall that the ACF plots test for serial autocorrelation of residuals, in this case within each group. What do you conclude? Using the same file, test for univariate normality of each variable separately by doing a normal probability plot (Graph/Probability/Normal) for each variable in each group. What do you conclude?

3. Using the univariate F tables from the first analysis, calculate the intraclass correlation (see Lecture 5.17) for each of the four variables. What do you conclude?
On the basis of this analysis and the preceding one, do you think that the assumption of independence is valid? What do you think would/should be your next step?

4. Open the command file MVNORMALTEST and try adapting it to check multivariate normality within each PSR$ category. What do you conclude? (If you get stuck, the command file MVNORM2 will do it for you. The output is given in MVNORM2.SYO.)

5. Adapt the command file BOXTEST2 to test for equality of the three covariance matrices (one for each PSR$ category). (Here you can use two of the three files extracted above, namely PSR_0 and PSR_1, as inputs, but you can’t use PSR_2+ because MATRIX doesn’t like “+” in matrix names. So you will have to name this file something else.) What do you conclude? (Note: if you get stuck here, the command file BOXTEST3 will do this for you.)

6. At this stage, it should be apparent that (1) at least for some variables, the assumption of independence within groups is not valid; (2) there are some indications that the data are not multivariate normal; and (3) the assumption of equality of covariance matrices is also unlikely to be true. So, what do you do? How might you try to resolve the first problem (hint: see Lecture 5.19)? How about problems (2) and (3)? Which problems (if any) do you consider particularly worrisome in this specific example?

7. For this exercise, let’s assume that the assumptions did in fact hold, and we wish to find out which pairs of groups differ from one another. In MANOVA, SYSTAT 8 does not allow automatic post-hoc pairwise contrasts (note that when you ran the original ANOVA above, the Post-hoc Tests dialog box was blanked out). Instead you must specify particular hypotheses to test following the original model estimation, by choosing ANOVA/Hypothesis test and filling in the resulting dialog box. Effect means the grouping variable(s), which in this case is PSR$. You then must specify a Contrast matrix (C matrix).
To compare the first (PSR$=’none’) and second (PSR$=’one’) groups, you must type in ‘1 -1 0’; to compare the first and third groups, enter ‘1 0 -1’ (without the single quotes). There are in total three such pairwise contrasts: 1 -1 0, 1 0 -1, 0 -1 1. Run these contrasts. Which groups differ from which others, and with respect to which variables? (Note that regardless of the contrast, the univariate F and multivariate tests always use the same error term: either the pooled (over all three groups) error MS for univariate tests, or the determinant of the total (pooled) SSCP matrix for the multivariate case. This is because these contrasts assume that the within-group variances/covariances are in fact simply random samples of the “population” variances/covariances. That is, the contrasts assume that within-group variances/covariances are equal, so that the best estimate of the “true” population variance/covariance is simply the pooled variances and covariances. Thus, if you were to simply run a two-group comparison (see Lab 4 above) of, say, PSR$=’none’ and PSR$=’one’, the error MS appearing in the univariate ANOVA tables would not be the same as appears here in the 1 -1 0 contrast. If you don’t believe me, try it!)

Output and Command files

The command files for these exercises are SCLASSMANOVA, MVNORM2 and BOXTEST3. The output file is LAB6.SYO. For the output to MVNORM2 and BOXTEST3, run the command files with the appropriate modifications.

Principal component and factor analysis

Description

These laboratory exercises are designed to familiarize you with running principal component analysis and factor analysis in SYSTAT. While mathematically rather similar, these two procedures have quite different theoretical underpinnings, and whether one or the other is more appropriate depends not so much on the nature of the data, but rather on the nature of the question.
Remember, principal components are weighted linear combinations of observed variables, whereas common factors are unobserved (indeed, unobservable) variables that are hypothesized to account for correlations among observed variables. (For more details on this, see Lecture notes, chapter 7.) In SYSTAT, factor analysis and principal component analysis are included in the same statistical module (Statistics/Data Reduction/Factor).

The data

GIRLS.SYD is a classic data file used to test various computer algorithms for principal component and factor analysis, and consists of morphological measurements of 305 girls, including height, arm span, forearm length, lower leg length, weight, upper thigh diameter (BITRO), chest girth and chest width. Since all of these measurements are an index of size, we expect them to be highly correlated. The data file does not list the raw data but rather the correlation matrix.

Exercises

1. Open GIRLS.SYD. Then choose Statistics/Data Reduction/Factor, enter all eight variables as the Model Variables, and make sure the PCA button is highlighted. Then click on OK. From the resulting output, answer the following questions: (1) what do the eigenvalues (latent roots) represent? (2) what do the eigenvectors 1 and 2 represent, and why are there only 2 of them? (3) what do the (unrotated) component loadings represent? (4) where do the values for “Variance explained by component” come from? (5) where does the “Percent of Total Variance explained by component” come from? (6) how would you interpret these results? In particular, which components and loadings are “significant”?

2. Repeat the above procedure, but this time make sure that Sort Loadings is enabled, and click on Rotation... and choose Varimax. What does the “Rotated Loading Matrix” represent, and how do you interpret it? Does this rotation make for easier interpretation?

3. Repeat the above procedure, trying a few different rotations. Which, in your view, is the best, and why?
(Note: when you do Oblimin rotations, the output also includes a panel about “Direct and Indirect contributions of factors to Variance”. These are useful for determining whether part of a factor’s contribution to the explained variance is due to its correlation with other factors. The reason this occurs here and not in the other rotations is that in oblique rotations (like Oblimin), factors need not be uncorrelated, whereas in other rotations, the assumption is that they are orthogonal (uncorrelated).)

4. For Orthomax and Oblimin rotations, you can set a parameter called Gamma. In Orthomax rotations, Gamma (ranging from zero to one) determines the “family” of rotations used. Recall that Orthomax rotations are a sort of hybrid between Varimax and Quartimax rotations, and Gamma can be considered as determining the degree of hybridization: Gamma = 0 is equivalent to a Varimax rotation, while Gamma = 1 is equivalent to a Quartimax rotation. Thus, varying Gamma from zero to one causes a change in the family of rotations from those which minimize the number of variables that have high loadings on each factor (Varimax) to those that minimize the number of factors that load highly on each variable (Quartimax).

5. For Oblimin rotations, Gamma again determines the family of rotations used. In general, the “best” Oblimin rotation depends on the nature of the correlation matrix. If correlations are generally moderate, Gamma near zero will usually give the best results. If the correlation matrix includes many high correlations, positive values for Gamma will often give the best results. In this case, many of the correlations are quite high, but a number are quite low, so a small positive Gamma (e.g. 0.1) gives good results.

6. Repeat the above procedures, except this time use Statistics/Data reduction/Factor and select Maximum Likelihood (ML).
Choosing either IPA or ML causes the module to invoke factor analysis rather than principal components, and IPA or ML determines the analytical method by which factors (which, remember, are unobservable) are extracted. In ML estimation, for every estimate of the communalities for each factor, the negative log-likelihood (see Lecture Notes, L1.10 - L1.17) is computed, and the process continues over a series of iterations until the convergence criterion is satisfied (note that you can set both the maximum number of iterations and the convergence criterion in the Factor Analysis dialog box). From the output, (1) what do the “initial communality estimates” and “final communality estimates” represent, and why are there eight of them? (2) what does the “Factor Pattern” matrix represent? (3) what do the canonical correlations represent? (To answer this question, check out Lecture Notes Ch. 11.) The “Common Variance” is the sum of the communalities: if $A$ is the maximum likelihood factor pattern matrix, then the common variance is $V_c = \mathrm{tr}(A^{T}A)$.

7. Note that in the above example, there are eight variables but the factor pattern matrix lists only four factors, even though we did not specify the number of factors in the Factor Analysis dialog box. This is due to a well-known (in some quarters at least) theorem that in factor analysis, the maximum number of factors for which reasonably accurate estimates of loadings can be obtained equals half the number of variables.

8. Repeat the above procedure using IPA. In IPA (and ML), the initial communality estimate for each variable is simply the squared multiple correlation of that variable with all other variables. At each iteration, communalities are estimated from the factor pattern matrix A. Iterations continue until the largest change in a communality is less than that specified in Convergence.
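The iteration just described (start from squared multiple correlations, re-estimate communalities from the factor pattern matrix, stop when the largest change falls below a convergence value) can be sketched in Python with NumPy. This is a generic textbook version of iterated principal axis factoring, offered as an assumption about what IPA does in outline; it is not SYSTAT’s code, and the function name, its defaults and the four-variable correlation matrix are all hypothetical.

```python
import numpy as np

def iterated_principal_axis(R, n_factors, tol=1e-6, max_iter=1000):
    """Generic iterated principal axis factoring of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Initial communalities: squared multiple correlation of each variable
    # with all the others, computed as 1 - 1 / diag(R^-1).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)               # communalities on the diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)  # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][:n_factors]
        lam = np.clip(eigvals[top], 0.0, None)      # guard against tiny negative roots
        A = eigvecs[:, top] * np.sqrt(lam)          # factor pattern matrix
        new_h2 = (A ** 2).sum(axis=1)               # communalities implied by A
        change = np.max(np.abs(new_h2 - h2))
        h2 = new_h2
        if change < tol:                            # the "Convergence" criterion
            break
    return A, h2

# Hypothetical correlation matrix generated by a single common factor.
true_loadings = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(true_loadings, true_loadings)
np.fill_diagonal(R, 1.0)

A, h2 = iterated_principal_axis(R, n_factors=1, tol=1e-10)
common_variance = np.trace(A.T @ A)  # equals the sum of the communalities
print("Final communalities:", np.round(h2, 4))
print("Common variance:", round(float(common_variance), 4))
```

In this toy case the final communalities should settle near the squared loadings used to build the matrix, which gives a quick check that the loop is doing what the text describes.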
Replacing the diagonal elements of the original correlation (or covariance) matrix with these final communality estimates and computing the eigenvalues gives the eigenvalues reported in the next panel of the output.
9. PCA on the Girls data set is quite effective, insofar as (1) most of the variation in the data set is explained by only a couple of components/factors; (2) the resulting factors/components are easily interpreted. But this is not always the case. For a counterexample, run PCA (or factor analysis) on the skulls data set. How do you interpret these results? Save the scores and data (by enabling Save/Factor scores). Then, using Data/Merge, merge the period$ variable in the original skull file with the factor scores, and use the SPLOM procedure to produce a plot of factors (1) - (3) against one another, with period$ as the grouping variable. You may also want to define a confidence region for each group using confidence ellipses (say, 80%) (Options/Scatterplot matrix options/Confidence ellipse). What do these plots tell you?
10. PCA and factor analysis assume that the variables used in extracting components or factors follow a multivariate normal distribution. Using the procedures outlined in Laboratories 4 and 6, do you think that this assumption is justified for the skulls data set? Note: unfortunately, SYSTAT has no general “stand alone” procedure for calculating Mahalanobis distances, but it will generate such distances for individual observations in the context of the DISCRIM procedure (see Lab 4). In this exercise, we are not concerned with variation in skull dimensions among periods, but rather with doing PCA/Factor analysis for the sample as a whole. So really we have only one group, and if we try to run DISCRIM, SYSTAT will complain bitterly because it cannot run discriminant analysis when there is only one group.
However, we can take the skull data file and create a dummy second group by simply copying the 150 “real” cases to create a file of 300 cases, and creating another character variable group$ which has two values: “real” for the first 150 cases, and “dummy” for the last 150 cases. But DISCRIM won’t work if the two groups are identical, so for all cases in the dummy category, we need to change the values of the various skull variables, e.g. by adding 5.0 to every value. Once we have done this, we can run DISCRIM on the set of 300 cases, and use only the calculated distances for the “real” cases. Once we have the distances, we can check for multivariate normality as we did in Laboratories 4 and 6. (If you get stuck on this, I have already created the dummy file (DUMMYSKULL.SYD), and the modified command file for testing multivariate normality is MVNORM3.)
Output and Command files
The command file for this lab is LAB7. The output file is LAB7.SYO. The command file to check multivariate normality in the skulls data set is MVNORM3.SYO.
Discriminant function analysis
Description
These laboratory exercises are designed to familiarize you with running discriminant function analysis in SYSTAT, using both linear and quadratic analysis functions. The variables in these functions can be selected in a forward or backwards stepwise fashion, either interactively by the user or automatically by SYSTAT. Remember that discriminant analysis is rather like a hybrid between single classification MANOVA and multiple regression: cases are grouped into “treatments” (groups) as in MANOVA, and the predictor variables are used to generate functions (equations) which predict group membership, as in multiple regression.
The data
IRIS.SYD is another classic data set from Sir Ronald Fisher, one of the fathers of modern biostatistical analysis, and dates back to 1936.
The file consists of measurements (sepal length, sepal width, petal length and petal width, in cm) made on 150 iris flowers from three species.
Exercises
1. Open IRIS.SYD. Then choose Statistics/Classification/Discriminant function and enter all four flower variables as the Variable(s), and species as the Grouping Variable. Click on Options... and make sure that Estimation/Complete is enabled. Then click on Continue and then OK.
2. The Between-groups F-matrix tests for equality of group means for a specified pair of groups. These values are the result of pairwise “univariate” F-tests comparing the average within-group Mahalanobis distances to the average between-group distances. Because the degrees of freedom are the same for each pairwise comparison, these F values are essentially measures of distances between group centroids. Which pair of groups is farthest apart?
3. The F-to-remove statistics are a measure of how important a particular variable is to the discriminant model. These values are equivalent to univariate ANOVAs for each variable among groups. For all variables, the denominator df is the same: total sample size - number of groups - number of variables in the model + 1 (i.e., in this case, 150 - 3 - 4 + 1 = 144). On the basis of this result, which variable is the least useful in distinguishing among species? Which is the most useful?
4. What are the canonical discriminant functions, standardized canonical discriminant functions, canonical group means, the raw classification matrix and the jackknifed classification matrix? On the basis of the canonical function plot, how important is the second canonical function in discriminating among species?
5. Repeat the above, but this time under Statistics..., make sure that Long/Mahal is enabled. As we have seen before, this calculates the Mahalanobis distance of each observation from the multivariate mean of each group.
Based on these distances, it calculates an a posteriori probability of group membership for each observation in each group (so that the row sum for a given observation is one): the closer an observation is to a group mean, the greater the posterior probability. An arrow (-->) marks misclassified observations.
6. On the basis of the above analysis, we might conclude that PETALLEN is not very important in discriminating among species. Rerun the above analysis, but this time drop PETALLEN from the variable list. Compare the classification matrices obtained here with those obtained using all four variables. Do you think that including PETALLEN results in a significantly better classification?
7. Consider now a more complex situation. Our goal is to build a model which predicts whether we will find brook trout in Adirondack lakes. Thus, we can ask: what is the set of functions that best discriminates between lakes that do and do not have brook trout? Using the file ALSC8B.SYD, run forward and backwards automatic stepwise discriminant function analysis, beginning with the predictors DO PH ACIDNC PHOSPHOROUS SCONDUCT ELEVATION LOGSA LOGLITT MAXDEPTH MEANDEPTH SHORE FLUSRATE INLET OUTLET BTSTOCK. Can you figure out how these stepwise processes are working (see Lecture Notes, ch. 9)? What do you conclude?
8. From the stepwise procedures above it should be clear that some of the initial variables are not very useful in discriminating among groups. But based on the F-to-enter/remove statistics for each variable, which ones do you think are the most important? Try running a discriminant function analysis using only these variables (Hint: there are 5 of them - which are they?). Compare the jackknifed classification matrices obtained from this analysis with those obtained using forward and backwards stepping above. What do you conclude?
9.
In the analyses above, we assumed that the prior likelihood of a lake having brook trout or not was the same. In reality, this is not the case. In fact, we know that only about 30% of the lakes in the Adirondacks have brook trout. Thus, it is appropriate to modify the priors by including in the Model command an option / Priors = 0.70, 0.30. (Note: there is no menu option to do this; it must be specified in the command file.) Rerun the forward stepping analysis with these priors, and compare the results with those obtained using equal priors. Why the difference? In particular, why does the misclassification rate for “absent” decrease substantially while the misclassification rate for “present” increases dramatically?
10. Like most multivariate procedures, discriminant function analysis relies on multivariate normality within groups and equality of covariance matrices among groups. In previous laboratories (4, 6, and 7), we have seen how these assumptions can be checked. Suppose we consider only the variables PH SCONDUCT ELEVATION FLUSRATE BTSTOCK. As a “quick and dirty” check of the equality of covariance assumption, we can do a SPLOM of these 5 variables for each group of lakes. If the sample size is about the same for each group (or, as in this case, large enough that differences in sample size are largely irrelevant), and if the equality of covariance assumption holds, then the ellipses for each pair of variables should have approximately the same tilt and shape across groups. Do a SPLOM to check whether this assumption is likely to be true. What do you conclude?
11. On the basis of the above, you should conclude that the equality of covariance assumption probably doesn’t hold (Big surprise!). In such cases, quadratic discriminant function analysis may be of value since it does not assume equality of covariances (Recall, however, that quadratic discriminant functions generally have lower reliability!).
Rerun the above analysis using the five variables PH SCONDUCT ELEVATION FLUSRATE BTSTOCK, with Quadratic and Complete estimation. How do the results compare to a simple linear discrimination using the same variables?
12. Recall that jackknifed cross-validation is not a particularly good way to assess model reliability. A better way is to use some of the data as a training set and the rest as a test set. In SYSTAT, the way to do this is to assign a random number (from zero to 1) to each case in the data file. When this number is less than, say, 0.66, the value 1.0 is stored in a new variable CASE_USE; if the random number is equal to or greater than 0.66, CASE_USE is set to zero. CASE_USE then determines the value of another variable WEIGHT. (To see precisely how this is done using the SYSTAT URN function, consult the command file for this laboratory.) In running a procedure, SYSTAT will then use only those cases which have a WEIGHT equal to 1.0. In this way, about 66% of the cases in the file are used to (in this example) generate the discriminant model (i.e., these are the “training” data). Use this data splitting procedure to evaluate the reliability of the linear and quadratic discriminant models derived above.
13. Using the knowledge gained from Laboratory 4 and the present one, can you produce a command file to do bootstrapped data-splitting for discriminant function analysis? (This takes some thinking about, but is not particularly difficult.)
Output and Command files
The command and output files for this lab are LAB8.SYC and LAB8.SYO.
Cluster analysis
Description
These laboratory exercises are designed to familiarize you with running cluster analysis in SYSTAT. Within SYSTAT, one can cluster cases (usually the objects), variables or both, depending on the ultimate goal. SYSTAT has procedures for producing hierarchical clusters as well as partitioned clusters (additive trees and k-means clusters).
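To make the idea of hierarchical (joining) clustering of cases concrete, here is a minimal sketch in Python. It is not SYSTAT code and is not taken from the lab command files. It assumes Euclidean distances on standardized variables and single linkage, and the five-plot data set and the function names are hypothetical, chosen only for illustration.

```python
import math
import statistics


def standardize(rows):
    """Rescale every column (variable) to mean 0 and standard deviation 1.

    Assumes each column has non-zero variance.
    """
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(rows):
    """Agglomerative clustering: repeatedly join the two closest clusters.

    Returns the joins in order, each as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(rows))]
    joins = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the distance between
                # the two nearest members, one from each cluster.
                d = min(euclidean(rows[p], rows[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        joins.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return joins


# Hypothetical toy data: 5 plots (rows) by 2 species (columns).
plots = [[1, 2], [2, 1], [1, 1], [9, 8], [8, 9]]
joins = single_linkage(standardize(plots))
for a, b, d in joins:
    print(a, "+", b, "joined at distance", round(d, 3))
```

Reading the joins from first to last corresponds to reading a cluster tree from its tips to its root: the last join, made at the largest distance, separates the two most distinct groups of cases. Other linkage methods differ only in how the distance between two clusters is defined (for example, complete linkage would use max in place of min).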
The data
PLANTS2.SYD shows the abundances of the 25 most abundant plant species on 17 plots from a grazed meadow in Steneryd Nature Reserve in Sweden. For each plot, the value for a particular species is the sum of cover values (range 0-5) over 9 subplots, so that a score of 45 indicates complete coverage of the entire plot.
Exercises
1. Open PLANT2.SYD. As indicated in the Lecture Notes (Ch. 10), when clustering objects it is important to make sure that the variables (in this case, species abundances) are all on the same scale. So before beginning, make sure you standardize the data. Then choose Statistics/Classification/Hierarchical clustering and enter all 25 plant species (columns in the data file) as the Variable(s), and enable Join/Rows. This will ensure that plots are clustered on the basis of their dissimilarities in the abundances of the 25 plant species. Make sure that Linkage is set to single and Distance to Euclidean. Then click on OK. What conclusions do you draw from the resulting tree? (Notice which plots are grouped together.) What does this suggest?
2. Still using row joining, try building trees based on a number of different joining methods (e.g. single, complete, average, etc.). What, if anything, differs among the trees built using these different methods? Is there one method which you feel gives the “best” results?
3. Rerun the above analyses, but instead of using Euclidean distances, use distances based on the correlation matrix. How do these results compare with those obtained using Euclidean distances? Overall, how many “major” clusters of plots do there appear to be?
4. Suppose that instead of being measures of coverage, the data in the matrix represented counts of individuals per plot. Do you think that constructing trees based on quantitative distance metrics (Euclidean, correlation, etc.) would be appropriate? Why or why not? If not, what do you think would be an appropriate distance metric to use? Try it.
Do the results change if you use this distance metric? (Note: single linkage will not work for distance metrics for count data, e.g. chi-square. Why not?)
5. In the above exercise, you tried various joining methods assuming that the data were in fact frequency data (which they are not). But inspection of the data matrix should immediately tell you that in this case, clustering using frequency metrics is unlikely to be very reliable. Why not?
6. In many practical applications, measured attributes of objects take only one of several values: in plant ecology, for example, a species may simply be scored as present (1) or absent (0). PLANT3.SYD is the same file as PLANT2.SYD, except that each species has been scored as either present or absent. Given the form of these data, what is an appropriate distance metric to use? Use these metrics to construct a tree for the 17 plots. How do the trees obtained compare with those obtained using quantitative metrics? Are the “major” clusters the same? What does this tell you?
7. From the above analyses it appears that there is one group of plots (1 - 5) which consistently forms a cluster regardless of the joining method used and regardless of the distance metric. Even if we coarse-grain the data and use presence/absence, these plots still form a tight cluster. This suggests that there are at least two major groups of plots, which in turn suggests that a reasonable starting place for k-means clustering is with 2 groups. Rerun the above analyses using Statistics/Classification/k Means Clustering, varying the number of groups and the distance metric. Based on these analyses, how many “major” groups do you think you have? Which species do you think are the most important for distinguishing among these major clusters?
Suppose you were to consider only the data given in the original data matrix: on the basis of this information alone, which species do you think would be more likely to be useful in distinguishing among plots, and why?
8. In all of the above exercises, our objective has been to cluster plots. But we can also cluster variables (i.e. species) by using Join/Column. Repeat the above exercises and study the relationships among the species. Which species show the closest relationships?
Output and Command files
The command and output files for this lab are LAB9.SYC and LAB9.SYO.
Canonical correlation analysis