SPSS for Windows


SPSS for Windows is a computer program for statistical analysis.

·        Introduction

 

·        Bivariate Correlations

·        Cox Regression Analysis

·        Crosstabs

·        Canonical Correlation

·        Curve Estimation

·        Analysis of Variance (ANOVA)

·        Descriptive

·        Discriminant Analysis

·        Distances

·        Factor Analysis

·        Frequencies

·        General Loglinear Analysis

·        GLM Multivariate

·        Hierarchical Cluster Analysis

·        Kaplan-Meier Survival Analysis

·        K-Means Cluster Analysis

·        Life Tables

·        Linear Regression

·        Logistic Regression

·        Logit Loglinear Analysis

·        Proximities

·        Spearman Correlation Coefficient

·        Variance

 

 



Introduction

SPSS for Windows is a computer program for statistical analysis. Most commands are accessible from the menus and dialog boxes. However, some commands and options are available only through the command language. The command language also allows you to save your jobs in a syntax file so that you can repeat your analysis at a later date or run it in an automated job with the Production Facility.

A syntax file is simply a text file that contains commands. While it is possible to open a syntax window and type in commands, it is easier if you let the software help you build your syntax file using one of the following methods:

1.      Pasting command syntax from dialog boxes

2.      Copying syntax from the output log

3.      Copying syntax from the journal file



Descriptive

The Descriptives procedure displays univariate summary statistics for several variables in a single table and calculates standardized values (z scores). Variables can be ordered by the size of their means (in ascending or descending order), alphabetically, or by the order in which you select the variables (the default).

When z scores are saved, they are added to the data in the Data Editor and are available for charts, data listings, and analyses. When variables are recorded in different units (for example, gross domestic product per capita and percentage literate), a z-score transformation places variables on a common scale for easier visual comparison.

Statistics. Sample size, mean, minimum, maximum, standard deviation, variance, range, sum, standard error of the mean, and kurtosis and skewness with their standard errors.

Descriptives Considerations

Data. Use numeric variables after you have screened them graphically for recording errors, outliers, and distributional anomalies. The Descriptives procedure is very efficient for large files (thousands of cases).

Assumptions. Most of the available statistics (including z scores) are based on normal theory and are appropriate for quantitative variables (interval- or ratio-level measurements) with symmetric distributions (avoid variables with unordered categories or skewed distributions). The distribution of z scores has the same shape as that of the original data; therefore, calculating z scores is not a remedy for problem data.
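The z-score transformation described above can be sketched in a few lines of Python. This is only an illustration, assuming the sample standard deviation (n - 1 denominator) is used; the variable names and data values are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(v - m) / s for v in values]

# Hypothetical data recorded in very different units
gdp_per_capita = [1200.0, 5400.0, 9800.0, 23000.0, 41000.0]
pct_literate = [35.0, 62.0, 80.0, 95.0, 99.0]

z_gdp = z_scores(gdp_per_capita)
z_lit = z_scores(pct_literate)

# Both variables are now on a common scale (mean 0, standard deviation 1),
# but the ordering and shape of each distribution are unchanged.
for original, z in zip(gdp_per_capita, z_gdp):
    print(f"{original:>9.1f}  {z:6.2f}")
```

Because the transformation is linear, a skewed variable stays skewed after standardization, which is why z scores are not a remedy for problem data.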

To Obtain Descriptive Statistics

From the menus choose: Analyze; Descriptive Statistics; Descriptives...

Select one or more variables.



Frequencies

The Frequencies procedure provides statistics and graphical displays that are useful for describing many types of variables. For a first look at your data, the Frequencies procedure is a good place to start.

For a frequency report and bar chart, you can arrange the distinct values in ascending or descending order or order the categories by their frequencies. The frequencies report can be suppressed when a variable has many distinct values. You can label charts with frequencies (the default) or percentages.

Statistics and plots: Frequency counts, percentages, cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum and maximum values, standard error of the mean, skewness and kurtosis (both with standard errors), quartiles, user-specified percentiles, bar charts, pie charts, and histograms.

Frequencies Considerations

Data. Use numeric codes or short strings to code categorical variables (nominal or ordinal level measurements).

Assumptions. The tabulations and percentages provide a useful description for data from any distribution, especially for variables with ordered or unordered categories. Most of the optional summary statistics, such as the mean and standard deviation, are based on normal theory and are appropriate for quantitative variables with symmetric distributions. Robust statistics, such as the median, quartiles, and percentiles, are appropriate for quantitative variables that may or may not meet the assumption of normality.

Frequencies Statistics

Percentile Values. Values of a quantitative variable that divide the ordered data into groups so that a certain percentage is above and another percentage is below. Quartiles (the 25th, 50th, and 75th percentiles) divide the observations into four groups of equal size. If you want an equal number of groups other than four, select Cut points for n equal groups. You can also specify individual percentiles (for example, the 95th percentile, the value below which 95% of the observations fall).

Central Tendency. Statistics that describe the location of the distribution include the mean, median, mode, and sum of all the values.

Dispersion. Statistics that measure the amount of variation or spread in the data include the standard deviation, variance, range, minimum, maximum, and standard error of the mean.

Distribution. Skewness and kurtosis are statistics that describe the shape and symmetry of the distribution. These statistics are displayed with their standard errors.

Values are group midpoints. If the values in your data are midpoints of groups (for example, ages of all people in their thirties are coded as 35), select this option to estimate the median and percentiles for the original, ungrouped data.
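As a rough illustration of the statistics listed above, the following Python sketch builds a frequency table (counts, percentages, cumulative percentages) and computes quartiles. The data are hypothetical, and the quartiles use the default method of Python's statistics.quantiles, which is an assumption here and may not match the percentile definition SPSS applies.

```python
from collections import Counter
from statistics import quantiles

def frequency_table(values):
    """Return rows of (value, count, percent, cumulative percent) in ascending order."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        pct = 100.0 * counts[value] / n
        cumulative += pct
        rows.append((value, counts[value], pct, cumulative))
    return rows

# Hypothetical ordinal ratings coded 1 to 5
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

table = frequency_table(ratings)
for value, count, pct, cum in table:
    print(f"{value}  {count:3d}  {pct:5.1f}  {cum:5.1f}")

# Quartiles divide the ordered observations into four groups of equal size
q1, q2, q3 = quantiles(ratings, n=4)
```

Passing a different n to quantiles gives cut points for n equal groups, which mirrors the "Cut points for n equal groups" option.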

Frequencies Charts

Chart Type. A pie chart displays the contribution of parts to a whole. Each slice of a pie chart corresponds to a group defined by a single grouping variable. A bar chart displays the count for each distinct value or category as a separate bar, allowing you to compare categories visually. A histogram also has bars, but they are plotted along an equal interval scale. The height of each bar is the count of values of a quantitative variable falling within the interval. A histogram shows the shape, center, and spread of the distribution. A normal curve superimposed on a histogram helps you judge whether the data are normally distributed.

To Obtain Frequencies and Statistics

·        From the menus choose: Analyze; Descriptive Statistics; Frequencies...

·        Select one or more categorical or quantitative variables.

·        Optionally, you can: Click Statistics for descriptive statistics for quantitative variables; Click Charts for bar charts, pie charts, and histograms; Click Format for the order in which results are displayed.



Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyse complex data sets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg’s D, Yule’s Y, Yule’s Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.
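The interval-data dissimilarity measures named above have simple textbook definitions, sketched here in Python. The automobile values (engine size, MPG, horsepower) are hypothetical, and the function names are illustrative rather than anything defined by SPSS.

```python
def squared_euclidean(x, y):
    """Sum of squared differences between the values of two items."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return squared_euclidean(x, y) ** 0.5

def block(x, y):
    """Block (city-block) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Chebychev distance: largest absolute difference on any one variable."""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is block, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def distance_matrix(cases, measure):
    """Distances between all pairs of cases under the given measure."""
    return [[measure(a, b) for b in cases] for a in cases]

# Hypothetical autos: (engine size in litres, MPG, horsepower)
autos = {"A": (1.6, 34.0, 110.0), "B": (2.0, 30.0, 150.0), "C": (5.0, 16.0, 400.0)}
names = list(autos)
matrix = distance_matrix([autos[n] for n in names], euclidean)
```

In this unstandardized form horsepower dominates the distances because it has the largest numeric range, so standardizing the variables first is worth considering.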

To Obtain a Distance Matrix

·        From the menus choose: Analyze, Correlate, Distances...

·        Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

·        Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.



K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centres if you know this information. You can select one of two methods for classifying cases, either updating cluster centres iteratively or classifying only. You can save cluster membership, distance information, and final cluster centres. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis of variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable’s contribution to the separation of the groups.

Example. Are there identifiable groups of buildings that attract people with the same demographic characteristics within the population? With k-means cluster analysis, you could cluster buildings (cases) into k homogeneous groups based on demographic characteristics.

K-Means Cluster Analysis Considerations

Efficiency. The main advantage of the K-Means Cluster Analysis procedure is that it is much faster than the Hierarchical Cluster Analysis procedure. The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including that used by the hierarchical clustering command. On the other hand, the hierarchical procedure allows much more flexibility in your cluster analysis: you can use any of a number of distance or similarity measures, including options for binary and count data, and you do not need to specify the number of clusters a priori. Once you have identified groups, you can build a model useful for identifying new cases using the Discriminant procedure. You can also use saved cluster membership information to explore other relationships in subsequent analyses, such as Crosstabs or GLM Univariate.

Data. Variables should be quantitative at the interval or ratio level. If your variables are binary or counts, use the Hierarchical Cluster Analysis procedure.

Assumptions. Distances are computed using simple Euclidean distance. If you want to use another distance or similarity measure, use the Hierarchical Cluster Analysis procedure. Scaling of variables is an important consideration--if your variables are measured on different scales (for example, one variable is expressed in dollars and another is expressed in years), your results may be misleading.
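The iterate-and-classify idea can be sketched in plain Python as follows. This is a generic textbook k-means under stated assumptions (user-supplied initial centres, simple Euclidean distance, a fixed iteration cap), not a reproduction of the SPSS algorithm; the data and the max_iter default are hypothetical.

```python
import math

def euclidean(a, b):
    """Simple Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(cases, centers):
    """Classify each case into the cluster with the nearest centre."""
    return [min(range(len(centers)), key=lambda k: euclidean(case, centers[k]))
            for case in cases]

def k_means(cases, initial_centers, max_iter=10):
    """Alternate between updating centres and reclassifying cases until stable."""
    centers = [list(c) for c in initial_centers]
    membership = assign(cases, centers)
    for _ in range(max_iter):
        new_centers = []
        for k in range(len(centers)):
            members = [c for c, m in zip(cases, membership) if m == k]
            if members:
                # New centre is the mean of the cases currently in the cluster
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep an empty cluster's centre
        new_membership = assign(cases, new_centers)
        centers = new_centers
        if new_membership == membership:
            break  # no case changed cluster
        membership = new_membership
    return centers, membership

# Hypothetical cases measured on two variables with comparable scales
cases = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, membership = k_means(cases, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Calling assign(cases, initial_centers) on its own corresponds to the classify-only idea, where centres are not updated. Only distances from each case to the k centres are computed, never the distances between all pairs of cases.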

To Obtain a K-Means Cluster Analysis

·        From the menus choose: Analyze, Classify, K-Means Cluster...

·        Select the variables to be used in the cluster analysis.

·        Specify the number of clusters. The number of clusters must be at least two and must not be greater than the number of cases in the data file.

·        Select either the Iterate and classify method or the Classify only method.

·        Optionally, you can select an identification variable to label cases.



Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyse raw variables or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of buildings that have similar demographic characteristics (such as ethnicity) within the study area? With hierarchical cluster analysis, you could cluster buildings (cases) into homogeneous groups based on demographic characteristics.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.
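To make the agglomeration schedule concrete, here is a minimal Python sketch of agglomerative clustering. It assumes single linkage (nearest neighbour) with Euclidean distance purely for brevity; the procedure offers other methods and measures, and the data are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cases):
    """Start with each case in its own cluster and merge until one is left.

    Returns the agglomeration schedule as (cluster kept, cluster absorbed, distance).
    """
    clusters = {i: [i] for i in range(len(cases))}
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Single linkage: distance between the closest pair of members
            d = min(euclidean(cases[i], cases[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        schedule.append((a, b, d))
    return schedule

# Hypothetical cases measured on two variables
cases = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
schedule = single_linkage(cases)
for stage, (a, b, d) in enumerate(schedule, start=1):
    print(f"stage {stage}: cluster {a} + cluster {b} at distance {d:.3f}")
```

A large jump in the distance column between two consecutive stages is one informal hint about how many clusters to keep.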

Hierarchical Cluster Analysis Considerations

Data. The variables can be quantitative, binary, or count data. Scaling of variables is an important issue--differences in scaling may affect your cluster solution(s). If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

Assumptions. The distance or similarity measures used should be appropriate for the data analyzed. Also, you should include all relevant variables in your analysis. Omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

Hierarchical Cluster Analysis Plots

Dendrogram. A dendrogram is a visual representation of the steps in a hierarchical clustering solution that shows the clusters being combined and the values of the distance coefficients at each step. Connected vertical lines designate joined cases. The dendrogram rescales the actual distances to numbers between 0 and 25, preserving the ratio of the distances between steps. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

The icicle plot (cluster) chart shows how cases are merged into clusters. At the bottom (right for horizontal plots), no cases have been merged; as you read up the chart (or right to left for horizontal plots), merged cases are indicated by an X or bar in the column between them, whereas separate clusters are indicated by white space between them.

To Obtain a Hierarchical Cluster Analysis

 

 



Proximities

The Proximities procedure can be used independently to generate distance or similarity scores, which can then be read by the Cluster procedure using command syntax. Once you have identified groups, you can determine which variables distinguish between them using the Discriminant procedure. If you know ahead of time how many clusters to look for, use k-means cluster analysis for a quicker solution.



Discriminant Analysis

Discriminant analysis is useful for situations where you want to build a predictive model of group membership based on observed characteristics of each case. The procedure generates a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases with measurements for the predictor variables but unknown group membership.

Note: The grouping variable can have more than two values. The codes for the grouping variable must be integers, however, and you need to specify their minimum and maximum values. Cases with values outside of these bounds are excluded from the analysis.

Example. On average, people in temperate zone countries consume more calories per day than those in the tropics, and a greater proportion of the people in the temperate zones are city dwellers. A researcher wants to combine this information in a function to determine how well an individual can discriminate between the two groups of countries. The researcher thinks that population size and economic information may also be important. Discriminant analysis allows you to estimate coefficients of the linear discriminant function, which looks like the right-hand side of a multiple linear regression equation. That is, using coefficients a, b, c, and d, the function is:

D = a * calories + b * urban + c * population + d * gross domestic product per capita

If these variables are useful for discriminating between the two climate zones, the values of D will differ for the temperate and tropic countries. If you use a stepwise variable selection method, you may find that you do not need to include all four variables in the function.
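The idea of a discriminant function can be illustrated with Fisher's classical two-group rule, sketched below in Python for two predictors. This is an assumption-laden illustration: the coefficients are proportional to the inverse pooled within-groups covariance matrix times the difference in group means, which is not necessarily scaled like the coefficients SPSS reports, and the country data are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(group_a, group_b):
    """Pooled within-groups covariance matrix for two predictors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[v / dof for v in row] for row in s]

def discriminant_coefficients(group_a, group_b):
    """Fisher's rule: inverse pooled covariance times the difference in means."""
    s = pooled_covariance(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def score(coefs, case):
    """Value of D for one case: a linear combination of its predictor values."""
    return sum(c * x for c, x in zip(coefs, case))

# Hypothetical countries: (daily calories in thousands, proportion urban)
temperate = [(3.2, 0.75), (3.4, 0.80), (3.0, 0.70), (3.5, 0.65)]
tropical = [(2.3, 0.40), (2.5, 0.35), (2.1, 0.50), (2.6, 0.45)]

coefs = discriminant_coefficients(temperate, tropical)
mean_d_temperate = sum(score(coefs, c) for c in temperate) / len(temperate)
mean_d_tropical = sum(score(coefs, c) for c in tropical) / len(tropical)
```

A new case with unknown group membership could then be scored with the same coefficients and assigned to the group whose mean D it lies closer to.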

Statistics. For each variable: means, standard deviations, univariate ANOVA. For each analysis: Box’s M, within-groups correlation matrix, within-groups covariance matrix, separate-groups covariance matrix, total covariance matrix. For each canonical discriminant function: eigenvalue, percentage of variance, canonical correlation, Wilks’ lambda, chi-square. For each step: prior probabilities, Fisher’s function coefficients, unstandardized function coefficients, Wilks’ lambda for each canonical function.

Discriminant Analysis Considerations

Data. The grouping variable must have a limited number of distinct categories, coded as integers. Independent variables that are nominal must be recoded to dummy or contrast variables.

Assumptions. Cases should be independent. Predictor variables should have a multivariate normal distribution, and within-group variance-covariance matrices should be equal across groups. Group membership is assumed to be mutually exclusive (that is, no case belongs to more than one group) and collectively exhaustive (that is, all cases are members of a group). The procedure is most effective when group membership is a truly categorical variable; if group membership is based on values of a continuous variable (for example, high IQ versus low IQ), you should consider using linear regression to take advantage of the richer information offered by the continuous variable itself.

To Obtain a Discriminant Analysis

·        From the menus choose: Analyze, Classify, Discriminant...

·        Select an integer-valued grouping variable and click Define Range to specify the categories of interest.

·        Select the independent, or predictor, variables. (If your grouping variable does not have integer values, Automatic Recode on the Transform menu will create one that does.)

·        Optionally, you can select cases with a selection variable.



Curve Estimation

The Curve Estimation procedure produces curve estimation regression statistics and related plots for 11 different curve estimation regression models. A separate model is produced for each dependent variable. You can also save predicted values, residuals, and prediction intervals as new variables.

Example. A fire insurance company conducts a study to relate the amount of damage in serious residential fires to the distance between the closest fire station and the residence. A scatterplot reveals that the relationship between fire damage and distance to the fire station is linear. You might fit a linear model to the data and check the validity of assumptions and the goodness of fit of the model.

To Obtain a Curve Estimation

From the menus choose: Analyze, Regression, Curve Estimation... Select one or more dependent variables. A separate model is produced for each dependent variable. Select an independent variable (either a variable in the working data file or Time).

Curve Estimation Models

You can choose one or more curve estimation regression models. To determine which model to use, plot your data. If your variables appear to be related linearly, use a simple linear regression model. When your variables are not linearly related, try transforming your data. When a transformation does not help, you may need a more complicated model. View a scatterplot of your data; if the plot resembles a mathematical function you recognise, fit your data to that type of model. For example, if your data resemble an exponential function, use an exponential model. The following models are available in the Curve Estimation procedure: linear, logarithmic, inverse, quadratic, cubic, power, compound, S-curve, logistic, growth, and exponential. If you are unsure which model best fits your data, try several models and select among them.
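Trying several models and comparing them can be sketched as follows in Python. The example fits a linear model and an exponential model to the same data and compares R squared. It assumes the exponential model has the form y = b0 * exp(b1 * x) and is fitted by least squares on ln(y); the data are hypothetical and noise-free.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    my = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data that follow an exponential curve exactly
x = [1, 2, 3, 4, 5, 6]
y = [2.0 * math.exp(0.5 * t) for t in x]

# Linear model: y = b0 + b1 * x
b0_lin, b1_lin = fit_line(x, y)
linear_fit = [b0_lin + b1_lin * t for t in x]

# Exponential model: ln(y) = ln(b0) + b1 * x, then transform back
ln_b0, b1_exp = fit_line(x, [math.log(v) for v in y])
exp_fit = [math.exp(ln_b0) * math.exp(b1_exp * t) for t in x]

print("linear R squared:     ", round(r_squared(y, linear_fit), 4))
print("exponential R squared:", round(r_squared(y, exp_fit), 4))
```

With real data the comparison is rarely this clear-cut, so a scatterplot of the data with each fitted curve remains the first check.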

In the Curve Estimation dialog box, click your right mouse button on a model to obtain the equation of the model.



Linear Regression

Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. Example. Is a person's location of residence related to ethnic group? A scatterplot indicates that these variables are linearly related. The number of new people coming to a particular location and its total population are also linearly related. These variables have a negative relationship: as the population increases, the average number of people coming to the place decreases. With linear regression, you can model the relationship of these variables. A good model can be used to predict how many people will come to the area.
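A minimal Python sketch of how such coefficients can be estimated is shown below. It assumes ordinary least squares solved through the normal equations, which is the textbook approach and not a description of SPSS internals; the predictors (population in thousands and a second, made-up predictor) and the outcome values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def linear_regression(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1.0 is the constant term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical areas: (population in thousands, second predictor)
predictors = [(1, 10), (2, 8), (3, 15), (4, 11), (5, 20), (6, 14)]
# Hypothetical outcome built from known coefficients so the fit can be checked
arrivals = [50.0 - 2.0 * a + 0.5 * b for a, b in predictors]

coefficients = linear_regression(predictors, arrivals)
predicted = coefficients[0] + coefficients[1] * 7 + coefficients[2] * 12
```

The negative coefficient on population reflects the negative relationship described in the example: larger populations go with fewer arrivals.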

To Obtain a Linear Regression Analysis

From the menus choose: Analyze, Regression, Linear... In the Linear Regression dialog box, select a numeric dependent variable. Select one or more numeric independent variables.



Logistic Regression

Logistic regression is useful for situations in which you want to be able to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model. Logistic regression is applicable to a broader range of research situations than discriminant analysis.

Example. Which demographic characteristics make people more likely to move from their residences? Given a sample of the population of a study area measured on four demographic characteristics (age, sex, marital status, and race), you could build a model using these four variables to predict whether a person moves. The model can then be used to derive estimates of the odds ratios for each factor to tell you, for example, how much more likely single people are to move than married people.
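The link between a logistic regression coefficient and an odds ratio can be sketched in Python. The example below fits a model with a single dichotomous predictor by Newton-Raphson and exponentiates the coefficient. The fitting routine is a generic textbook version, and the counts (30 of 50 single people moved, 15 of 50 married people moved) are hypothetical.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit log(p / (1 - p)) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical sample: x = 1 for single, 0 for married; y = 1 if the person moved
x = [1] * 50 + [0] * 50
y = [1] * 30 + [0] * 20 + [1] * 15 + [0] * 35

b0, b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)

# With one dichotomous predictor, the same odds ratio comes from the 2 x 2 table
table_odds_ratio = (30 / 20) / (15 / 35)
```

With several covariates, each exponentiated coefficient is read the same way, as the odds ratio for that variable with the others held fixed.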

To Obtain a Logistic Regression Analysis

From the menus choose: Analyze, Regression, Binary Logistic... Select one dichotomous dependent variable. This variable may be numeric or short string. Select one or more covariates. To include interaction terms, select all of the variables involved in the interaction and then select >a*b>. To enter variables in groups (blocks), select the covariates for a block, and click Next to specify a new block. Repeat until all blocks have been specified. Optionally, you can select cases for analysis. Click Select, choose a selection variable, and click Rule.



Kaplan-Meier Survival Analysis

There are many situations in which you would want to examine the distribution of times between two events, such as, in this study, length of residence (the time between arriving at and leaving a place). However, this kind of data usually includes some censored cases. Censored cases are cases for which the second event isn’t recorded (for example, people still living in the area at the end of the study). The Kaplan-Meier procedure is a method of estimating time-to-event models in the presence of censored cases. The Kaplan-Meier model is based on estimating conditional probabilities at each time point when an event occurs and taking the product limit of those probabilities to estimate the survival rate at each point in time (SPSS 9.0 for Windows).

Example. Does a new treatment for AIDS have any therapeutic benefit in extending life? You could conduct a study using two groups of AIDS patients, one receiving traditional therapy and the other receiving the experimental treatment. Constructing a Kaplan-Meier model from the data would allow you to compare overall survival rates between the two groups to determine whether the experimental treatment is an improvement over the traditional therapy. You can also plot the survival or hazard functions and compare them visually for more detailed information.
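The product-limit calculation can be sketched in Python as follows. This is a minimal illustration of the standard Kaplan-Meier estimator for one group; the times and status codes (1 = terminal event occurred, 0 = censored) are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated survival)] at each distinct event time.

    events[i] is 1 if the terminal event occurred at times[i], 0 if censored.
    """
    survival = 1.0
    curve = []
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        # Cases still under observation just before time t, including censored
        # cases whose recorded time is t or later
        at_risk = sum(1 for u in times if u >= t)
        occurred = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        # Conditional probability of surviving past t, multiplied into the product
        survival *= 1.0 - occurred / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up times (months) and status codes
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: estimated survival = {s:.3f}")
```

Censored cases never lower the curve themselves; they only shrink the number at risk at later event times. Running the function separately for each treatment group gives the curves to compare.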

To Obtain a Kaplan-Meier Survival Analysis

From the menus choose: Analyze, Survival, Kaplan-Meier... Select a time variable. Select a status variable to identify cases for which the terminal event has occurred. This variable can be numeric or short string. Then click Define Event.



Life Tables

There are many situations in which you would want to examine the distribution of times between two events, such as, in this study, length of residence (the time between arriving at and leaving a place). However, this kind of data usually includes some cases for which the second event isn’t recorded (for example, people still living in the place at the end of the study). This can happen for several reasons: for some cases, the event simply doesn’t occur before the end of the study; for other cases, we lose track of their status sometime before the end of the study; still other cases may be unable to continue for reasons unrelated to the study (such as a person becoming occupied elsewhere, for example with studies, and being absent). Collectively, such cases are known as censored cases, and they make this kind of study inappropriate for traditional techniques such as t tests or linear regression (SPSS 9.0 for Windows).

A statistical technique useful for this type of data is called a follow-up life table. The basic idea of the life table is to subdivide the period of observation into smaller time intervals. For each interval, all people who have been observed at least that long are used to calculate the probability of a terminal event occurring in that interval. The probabilities estimated from each of the intervals are then used to estimate the overall probability of the event occurring at different time points.
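The interval-by-interval calculation can be sketched in Python. The version below assumes the conventional actuarial adjustment, in which cases censored during an interval count as exposed for half of it; that adjustment is an assumption not stated in the text above, and the residence times are hypothetical.

```python
def life_table(times, events, width, n_intervals):
    """Follow-up life table with equal-width intervals.

    Each row holds: interval start, number entering, number withdrawn (censored),
    number exposed to risk, terminal events, proportion terminating,
    and cumulative proportion surviving at the end of the interval.
    """
    rows = []
    cumulative = 1.0
    for k in range(n_intervals):
        start, end = k * width, (k + 1) * width
        entering = sum(1 for t in times if t >= start)
        in_interval = [(t, e) for t, e in zip(times, events) if start <= t < end]
        terminal = sum(e for _, e in in_interval)
        withdrawn = len(in_interval) - terminal
        # Assumed adjustment: censored cases are at risk for half the interval
        exposed = entering - withdrawn / 2.0
        p_event = terminal / exposed if exposed > 0 else 0.0
        cumulative *= 1.0 - p_event
        rows.append((start, entering, withdrawn, exposed, terminal, p_event, cumulative))
    return rows

# Hypothetical lengths of residence in months; 1 = left the area, 0 = censored
times = [1, 3, 4, 7, 8, 10, 13, 15, 18, 20]
events = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

rows = life_table(times, events, width=6, n_intervals=4)
for row in rows:
    print(row)
```

The choice of interval width matters: narrow intervals follow the data closely but leave few cases in each, while wide intervals smooth over detail.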

To Create a Life Table

From the menus choose: Analyze, Survival, Life Tables... Select one numeric survival variable. Specify the time intervals to be examined. Select a status variable to define cases for which the terminal event has occurred. Click Define Event to specify the value of the status variable that indicates that an event occurred.



Cox Regression Analysis

Like Life Tables and Kaplan-Meier survival analysis, Cox Regression is a method for modelling time-to-event data in the presence of censored cases. However, Cox Regression allows you to include predictor variables (covariates) in your models. For example, you could construct a model of length of employment based on educational level and job category. Cox Regression will handle the censored cases correctly, and it will provide estimated coefficients for each of the covariates, allowing you to assess the impact of multiple covariates in the same model. You can also use Cox Regression to examine the effect of continuous covariates.

Example. Do men and women have different risks of developing lung cancer based on cigarette smoking? By constructing a Cox Regression model, with cigarette usage (cigarettes smoked per day) and gender entered as covariates, you can test hypotheses regarding the effects of gender and cigarette usage on time-to-onset for lung cancer.
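How a Cox coefficient is estimated while handling censored cases can be sketched in Python for the simplest situation: one covariate and no tied event times. The sketch maximizes the standard partial likelihood by Newton-Raphson; it is a textbook illustration rather than the SPSS implementation, and the data are hypothetical.

```python
import math

def score_and_information(beta, times, events, x):
    """First derivative of the partial log-likelihood and the information at beta."""
    gradient = 0.0
    information = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if not e_i:
            continue  # censored cases contribute only through the risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        total = sum(weights)
        mean_x = sum(w * x_j for w, x_j in zip(weights, risk)) / total
        mean_x2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk)) / total
        gradient += x_i - mean_x
        information += mean_x2 - mean_x ** 2
    return gradient, information

def cox_fit(times, events, x, iterations=20):
    """Estimate a single Cox regression coefficient by Newton-Raphson."""
    beta = 0.0
    for _ in range(iterations):
        gradient, information = score_and_information(beta, times, events, x)
        beta += gradient / information
    return beta

# Hypothetical data: x = 1 for one group, 0 for the other; 0 in events = censored
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 0, 1, 1, 0]
x = [1, 1, 0, 1, 0, 0, 1, 0]

beta = cox_fit(times, events, x)
hazard_ratio = math.exp(beta)  # relative risk of the event for x = 1 versus x = 0
```

A continuous covariate such as cigarettes smoked per day would enter the same way, with exp(beta) read as the hazard ratio per one-unit increase.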

To Obtain a Cox Regression Analysis

From the menus choose: Analyze, Survival, Cox Regression... Select a time variable. Select a status variable, and then click Define Event. Select variables to use as covariates. Optionally, you can compute separate models for different groups by defining a strata variable.



General Loglinear Analysis

The General Loglinear Analysis procedure analyses the frequency counts of observations falling into each cross-classification category in a crosstabulation or a contingency table. Each cross-classification in the table constitutes a cell, and each categorical variable is called a factor. The dependent variable is the number of cases (frequency) in a cell of the crosstabulation, and the explanatory variables are factors and covariates. This procedure estimates maximum likelihood parameters of hierarchical and non-hierarchical loglinear models using the Newton-Raphson method. Either a Poisson or a multinomial distribution can be analysed.

You can select up to 10 factors to define the cells of a table. A cell structure variable allows you to define structural zeros for incomplete tables, include an offset term in the model, fit a log-rate model, or implement the method of adjustment of marginal tables. Contrast variables allow computation of generalised log-odds ratios (GLOR).

SPSS automatically displays model information and goodness-of-fit statistics. You can also display a variety of statistics and plots or save residuals and predicted values in the working data file.

Example. Data from a report of automobile accidents in Florida are used to determine the relationship between wearing a seat belt and whether an injury was fatal or nonfatal. The odds ratio indicates significant evidence of a relationship.
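For a two-by-two table, the simplest loglinear model (independence of the two factors) has closed-form expected counts, so no iterative fitting is needed. The Python sketch below computes those expected counts, the likelihood-ratio goodness-of-fit statistic, and the odds ratio. The cell counts are hypothetical and are not the Florida accident data.

```python
import math

def independence_fit(table):
    """Expected counts under independence and the likelihood-ratio statistic G2."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    g2 = 2.0 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)
    return expected, g2

def odds_ratio(table):
    """Cross-product ratio of a 2 x 2 table."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Hypothetical counts. Rows: seat belt worn, not worn. Columns: fatal, nonfatal.
table = [[10, 90], [40, 60]]

expected, g2 = independence_fit(table)
ratio = odds_ratio(table)
print("expected counts:", expected)
print("G2:", round(g2, 2), " odds ratio:", round(ratio, 3))
```

A large G2 relative to its degrees of freedom (one for a 2 x 2 table) indicates that the independence model fits poorly, meaning the two factors are related.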

To Obtain a General Loglinear Analysis

From the menus choose: Analyze, Loglinear, General... In the General Loglinear Analysis dialog box, select up to 10 factor variables. Optionally, you can select cell covariates, select a cell structure variable to define structural zeros or include an offset term, and select a contrast variable.



Logit Loglinear Analysis

The Logit Loglinear Analysis procedure analyses the relationship between dependent (or response) variables and independent (or explanatory) variables. The dependent variables are always categorical, while the independent variables can be categorical (factors). Other independent variables, cell covariates, can be continuous, but they are not applied on a case-by-case basis. The weighted covariate mean for a cell is applied to that cell. The logarithm of the odds of the dependent variables is expressed as a linear combination of parameters. A multinomial distribution is automatically assumed; these models are sometimes called multinomial logit models. This procedure estimates parameters of logit loglinear models using the Newton-Raphson algorithm.

You can select from 1 to 10 dependent and factor variables combined. A cell structure variable allows you to define structural zeros for incomplete tables, include an offset term in the model, fit a log-rate model, or implement the method of adjustment of marginal tables. Contrast variables allow computation of generalised log-odds ratios (GLOR). The values of the contrast variable are the coefficients for the linear combination of the logs of the expected cell counts.
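As a rough illustration of the last point, a generalised log-odds ratio is just a weighted sum of the logs of the expected cell counts, with the contrast variable supplying the weights. The function name, the cell counts, and the contrast values below are all hypothetical.

```python
import math

def glor(expected_counts, contrast):
    """Linear combination of log expected cell counts, weighted by the contrast values."""
    return sum(c * math.log(m) for c, m in zip(contrast, expected_counts))

# Hypothetical expected counts for the four cells of a 2 x 2 table, listed row by row
expected_counts = [20.0, 80.0, 10.0, 190.0]
# Contrast values (1, -1, -1, 1) reproduce the ordinary log odds ratio
contrast = [1, -1, -1, 1]

value = glor(expected_counts, contrast)
print(f"GLOR = {value:.4f}, odds ratio = {math.exp(value):.2f}")
```

With these weights the result equals log((20 x 190) / (80 x 10)); other contrast values give other comparisons among cells.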

SPSS automatically displays model information and goodness-of-fit statistics. You can also display a variety of statistics and plots or save residuals and predicted values in the working data file.

Example. A study in Florida included 219 alligators. How does the alligators’ food type vary with their size and the four lakes in which they live? The study found that the odds of a smaller alligator preferring reptiles to fish are 0.70 times lower than for larger alligators; also, the odds of selecting primarily reptiles instead of fish were highest in Lake 3.

To Obtain a Logit Loglinear Analysis

From the menus choose: Analyse, Loglinear, Logit... In the Logit Loglinear Analysis dialog box, select one or more dependent variables and one or more factor variables. The total number of dependent and factor variables must be less than or equal to 10. Optionally, you can select cell covariates, select a cell structure variable to define structural zeros or include an offset term, and select one or more contrast variables.

 Top


Variance

A measure of dispersion around the mean, equal to the sum of squared deviations from the mean divided by one less than the number of cases. The variance is measured in units that are the square of those of the variable itself.

 Covariance

An unstandardized measure of association between two variables, equal to the sum of the cross-products of their deviations from the means divided by N-1.
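Both definitions can be written out directly. This is a minimal sketch with hypothetical data; the function and variable names are illustrative only.

```python
def variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical values for two variables measured on the same five cases
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(variance(x))       # variance of x
print(covariance(x, y))  # covariance of x and y
```

The covariance of a variable with itself is its variance, which is a quick way to check both functions against each other.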

 Top


Factor Analysis

Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. Factor analysis is often used in data reduction to identify a small number of factors that explain most of the variance observed in a much larger number of manifest variables. Factor analysis can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis (for example, to identify collinearity prior to performing a linear regression analysis).

The factor analysis procedure offers a high degree of flexibility:

Seven methods of factor extraction are available. Five methods of rotation are available, including direct oblimin and promax for nonorthogonal rotations. Three methods of computing factor scores are available, and scores can be saved as variables for further analysis.

Example. What underlying attitudes lead people to respond to the questions on a political survey as they do? Examining the correlations among the survey items reveals that there is significant overlap among various subgroups of items: questions about taxes tend to correlate with each other, questions about military issues correlate with each other, and so on. With factor analysis, you can investigate the number of underlying factors and, in many cases, you can identify what the factors represent conceptually. Additionally, you can compute factor scores for each respondent, which can then be used in subsequent analyses. For example, you might build a logistic regression model to predict voting behaviour based on factor scores.

To Obtain a Factor Analysis

From the menus choose: Analyse, Data Reduction, Factor... Select the variables for the factor analysis.

Top


Spearman Correlation Coefficient

 Commonly used nonparametric measure of correlation between two ordinal variables. For all of the cases, the values of each of the variables are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks.
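The definition translates almost word for word into code: rank each variable, then compute Pearson's coefficient on the ranks. In this sketch tied values share the mean of their ranks, which is a common convention assumed here; the data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y increases with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman(x, y), pearson(x, y))
```

Because y always increases with x, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient on the raw values is lower.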

Top


Bivariate Correlations

The Bivariate Correlations procedure computes Pearson’s correlation coefficient, Spearman’s rho, and Kendall’s tau-b with their significance levels. Correlations measure how variables or rank orders are related. Before calculating a correlation coefficient, screen your data for outliers (which can cause misleading results) and evidence of a linear relationship. Pearson’s correlation coefficient is a measure of linear association. Two variables can be perfectly related, but if the relationship is not linear, Pearson’s correlation coefficient is not an appropriate statistic for measuring their association.
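A small sketch makes the last point concrete. In the hypothetical data below, y is completely determined by x, yet Pearson's coefficient is zero because the relationship is not linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = x squared, a perfect but nonlinear relationship
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [value ** 2 for value in x]

r = pearson(x, y)
print(f"Pearson r = {r:.3f}")
```

A scatterplot of such data would show the curve immediately, which is why screening the data before computing the coefficient is worthwhile.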

Top


GLM Multivariate

The GLM (General Linear Model) Multivariate procedure provides regression analysis and analysis of variance for multiple dependent variables by one or more factor variables or covariates. The factor variables divide the population into groups. Using this general linear model procedure, you can test null hypotheses about the effects of factor variables on the means of various groupings of a joint distribution of dependent variables. You can investigate interactions between factors as well as the effects of individual factors. In addition, the effects of covariates and covariate interactions with factors can be included. For regression analysis, the independent (predictor) variables are specified as covariates.

Both balanced and unbalanced models can be tested. A design is balanced if each cell in the model contains the same number of cases. In a multivariate model, the sums of squares due to the effects in the model and error sums of squares are in matrix form rather than the scalar form found in univariate analysis. These matrices are called SSCP (sums-of-squares and cross-products) matrices. If more than one dependent variable is specified, the multivariate analysis of variance using Pillai’s trace, Wilks’ lambda, Hotelling’s trace, and Roy’s largest root criterion, each with an approximate F statistic, is provided, as well as the univariate analysis of variance for each dependent variable. In addition to testing hypotheses, GLM Multivariate produces estimates of parameters.

Commonly used a priori contrasts are available to perform hypothesis testing. Additionally, after an overall F test has shown significance, you can use post hoc tests to evaluate differences among specific means. Estimated marginal means give estimates of predicted mean values for the cells in the model, and profile plots (interaction plots) of these means allow you to visualize some of the relationships easily. The post hoc multiple comparison tests are performed for each dependent variable separately.

Residuals, predicted values, Cook’s distance, and leverage values can be saved as new variables in your data file for checking assumptions. Also available are a residual SSCP matrix, which is a square matrix of sums of squares and cross-products of residuals, a residual covariance matrix, which is the residual SSCP matrix divided by the degrees of freedom of the residuals, and the residual correlation matrix, which is the standardized form of the residual covariance matrix.
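The relationship among the three residual matrices can be sketched as follows. The residuals and the degrees of freedom are hypothetical inputs (in practice both come from the fitted model), and the function name is illustrative.

```python
import math

def residual_matrices(residuals, df):
    """Return (SSCP, covariance, correlation) matrices for a list of residual rows.

    residuals: one row per case, one column per dependent variable.
    df: residual degrees of freedom, supplied by the caller.
    """
    p = len(residuals[0])
    # SSCP: sums of squares on the diagonal, cross-products off the diagonal
    sscp = [[sum(row[j] * row[k] for row in residuals) for k in range(p)]
            for j in range(p)]
    # Covariance: SSCP divided by the residual degrees of freedom
    cov = [[sscp[j][k] / df for k in range(p)] for j in range(p)]
    # Correlation: covariance standardized by the two standard deviations
    corr = [[cov[j][k] / math.sqrt(cov[j][j] * cov[k][k]) for k in range(p)]
            for j in range(p)]
    return sscp, cov, corr

# Hypothetical residuals for four cases and two dependent variables
residuals = [[1.0, -1.0], [-1.0, 1.0], [2.0, 0.0], [-2.0, 0.0]]
sscp, cov, corr = residual_matrices(residuals, df=2)
print(sscp)
print(cov)
print(corr)
```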

WLS Weight allows you to specify a variable used to give observations different weights for a weighted least-squares (WLS) analysis, perhaps to compensate for different precision of measurement.

Top


Crosstabs

The Crosstabs procedure forms two-way and multiway tables and provides a variety of tests and measures of association for two-way tables. The structure of the table and whether categories are ordered determine what test or measure to use.

Crosstabs’ statistics and measures of association are computed for two-way tables only. If you specify a row, a column, and a layer factor (control variable), the Crosstabs procedure forms one panel of associated statistics and measures for each value of the layer factor (or a combination of values for two or more control variables). For example, if GENDER is a layer factor for a table of MARRIED (yes, no) against LIFE (is life exciting, routine, or dull), the results for a two-way table for the females are computed separately from those for the males and printed as panels following one another.

To Obtain Crosstabulations

From the menus choose: Analyse, Descriptive Statistics, Crosstabs... Select one or more row variables and one or more column variables.

Crosstabs Layers

If you select one or more layer variables, a separate crosstabulation is produced for each category of each layer variable (control variable). For example, if you have one row variable, one column variable, and one layer variable with two categories, you get a two-way table for each category of the layer variable. To make another layer of control variables, click Next. Subtables are produced for each combination of categories for each 1st-layer variable with each 2nd-layer variable and so on. If statistics and measures of association are requested, they apply to two-way subtables only.

Crosstabs: Related Procedures

To model the relationships between two or more categorical variables, use the General Loglinear procedure (available in the Advanced Models option) to fit a model to the cell frequencies. For defining intervals along a quantitative variable, use Recode on the Transform menu. For example, if you want to look at the relationship between salary and job satisfaction, and salary is recorded to the nearest dollar, use the Recode procedure to define intervals such as less than $20,000, $20,000 to $30,000, and so on.

Crosstabs Clustered Bar Charts

Display clustered bar charts. A clustered bar chart helps summarize your data for groups of cases. There is one cluster of bars for each value of the variable you specified under Rows. The variable that defines the bars within each cluster is the variable you specified under Columns. There is one set of differently colored or patterned bars for each value of this variable. If you specify more than one variable under Columns or Rows, a clustered bar chart is produced for each combination of two variables.

Crosstabs Cell Display

To help you uncover patterns in the data that contribute to a significant chi-square test, the Crosstabs procedure displays expected frequencies and three types of residuals (deviates) that measure the difference between observed and expected frequencies. Each cell of the table can contain any combination of counts, percentages, and residuals selected.

Counts. The number of cases actually observed and the number of cases expected if the row and column variables are independent of each other.

Percentages. The percentages can add up across the rows or down the columns. The percentages of the total number of cases represented in the table (one layer) are also available.

Residuals. Raw unstandardized residuals give the difference between the observed and expected values. Standardized and adjusted standardized residuals are also available.
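The cell quantities described above can be sketched for a two-way table as follows. The observed counts are hypothetical, and the standardized and adjusted standardized residuals use the usual textbook formulas, which are assumed here rather than taken from the SPSS documentation.

```python
import math

def cell_statistics(observed):
    """Expected counts and three kinds of residuals for each cell of a two-way table."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    cells = []
    for i, row in enumerate(observed):
        out_row = []
        for j, count in enumerate(row):
            # Count expected if row and column variables are independent
            expected = row_totals[i] * col_totals[j] / n
            raw = count - expected
            standardized = raw / math.sqrt(expected)
            adjusted = raw / math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append({
                "expected": expected,
                "raw": raw,
                "standardized": standardized,
                "adjusted": adjusted,
            })
        cells.append(out_row)
    return cells

# Hypothetical observed counts for a 2 x 2 table
observed = [[30, 10], [20, 40]]
cells = cell_statistics(observed)
for row in cells:
    print([round(cell["adjusted"], 2) for cell in row])
```

Cells with large residuals in absolute value are the ones that contribute most to a significant chi-square test.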

 Top


Analysis of Variance (ANOVA)

 Analysis of variance, or ANOVA, is a method of testing the null hypothesis that several group means are equal in the population, by comparing the sample variance estimated from the group means to that estimated within the groups.
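For a one-way design, the comparison reduces to an F ratio of the between-groups mean square to the within-groups mean square. The sketch below uses hypothetical data and an illustrative function name.

```python
def one_way_anova_f(groups):
    """F ratio: between-groups mean square divided by within-groups mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability of the group means around the grand mean
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variability of the cases around their own group mean
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data: three groups of three cases each
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio = one_way_anova_f(groups)
print(f"F = {f_ratio:.2f}")
```

A large F ratio indicates that the group means differ more than the within-group variability alone would suggest.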

Top


Canonical Correlation

 The canonical correlation for a discriminant function is the square root of the ratio of the between-groups sum of squares to the total sum of squares. Squared, it is the proportion of the total variability explained by differences between groups.
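Given the scores of each case on one discriminant function, the definition can be sketched as below. The scores are hypothetical and are simply taken as given; computing the discriminant function itself is outside the scope of this sketch.

```python
import math

def canonical_correlation(scores_by_group):
    """Square root of the between-groups sum of squares over the total sum of squares."""
    all_scores = [score for group in scores_by_group for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(len(group) * (sum(group) / len(group) - grand_mean) ** 2
                     for group in scores_by_group)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return math.sqrt(ss_between / ss_total)

# Hypothetical discriminant scores for three groups of three cases each
scores_by_group = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
r = canonical_correlation(scores_by_group)
print(f"canonical correlation = {r:.4f}, proportion explained = {r ** 2:.4f}")
```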

 

Top


More to come

 
