Business Analytics Class Participation
Monday 23 January 2012
Basic Difference Between Clustering Analysis and Discriminant analysis
Thursday 19 January 2012
Scree Plot
A Scree Plot is a simple graph that shows the fraction of total variance in the data as explained or represented by each principal component.
The principal component with the largest fractional contribution is labeled with the label name from the preferences file.
Such a plot, when read left to right across the abscissa, can often show a clear separation in fraction of total variance where the 'most important' components cease and the 'least important' components begin.
This point of separation is often called the 'elbow'.
This was something interesting to know for me:
In the PCA (Principal Component Analysis) literature, the plot is called a 'Scree' Plot because it often looks like a 'scree' slope, where rocks have fallen down and accumulated on the side of a mountain.
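To make the idea concrete, here is a minimal sketch of the quantity a scree plot displays. The eigenvalues are hypothetical numbers chosen only for illustration, and the plot is drawn as text, so the components run down the page instead of along the abscissa.

```python
# Hypothetical eigenvalues, one per principal component, largest first.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]

total = sum(eigenvalues)

# Fraction of total variance explained by each component.
fractions = [ev / total for ev in eigenvalues]

# Text-mode scree plot: one row per component, with a bar whose length
# is proportional to that component's fraction of the variance.
for number, fraction in enumerate(fractions, start=1):
    bar = "#" * round(fraction * 40)
    print(f"PC{number}: {fraction:6.3f} {bar}")
```

The 'elbow' is where the bars stop shrinking quickly and start to level off; in this made-up example that happens around the third or fourth component.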
Some tips on when to use a scree plot:
1) If there are fewer than 30 variables and communalities after extraction are greater than 0.7, OR if the sample size exceeds 250 and the average communality is greater than 0.6, then retain all factors having eigenvalues greater than 1.
2) If none of the above apply, we can use a scree plot when the sample size is considerably large, around 300 or more cases.
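The two tips can be written as a small decision function. This is only a sketch: the function name and return strings are made up for illustration, and it assumes that "communalities greater than 0.7" means every communality exceeds 0.7.

```python
from statistics import mean


def retention_method(sample_size, communalities):
    """Suggest a way to pick the number of factors, following the two tips.

    `communalities` holds one value per variable (after extraction).
    Assumption: tip 1 requires *all* communalities to exceed 0.7.
    """
    n_variables = len(communalities)
    if n_variables < 30 and all(c > 0.7 for c in communalities):
        return "eigenvalue > 1"
    if sample_size > 250 and mean(communalities) > 0.6:
        return "eigenvalue > 1"
    if sample_size >= 300:
        return "scree plot"
    return "no clear rule"


# Few variables, high communalities: tip 1 applies.
print(retention_method(100, [0.75, 0.80, 0.90]))
# Many variables, low communalities, large sample: tip 2 applies.
print(retention_method(400, [0.5] * 40))
# Neither tip applies.
print(retention_method(100, [0.5] * 40))
```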
Links:
http://www.statisticshell.com/docs/factor.pdf
http://www.improvedoutcomes.com/docs/WebSiteDocs/PCA/Creating_a_Scree_Plot.htm
Monday 16 January 2012
Factor Analysis with help of other statistical techniques
Factor Analysis - Components, PCA, Comparison of Factor Analysis and PCA
HELLO FRIENDS !!!
I hope you are all enjoying the blog and finding the posts valuable. Today, let us get more familiar with two new concepts: Factor Analysis and PCA.
Factor Analysis
Factor analysis is a collection of methods used to examine how underlying constructs influence the responses on a number of measured variables.
There are basically two types of factor analysis: exploratory and confirmatory.
1. Exploratory factor analysis (EFA) attempts to discover the nature of the constructs influencing a set of responses.
2. Confirmatory factor analysis (CFA) tests whether a specified set of constructs is influencing responses in a predicted way.
Both types of factor analyses are based on the Common Factor Model, illustrated in figure 1.1. This model proposes that each observed response (measure 1 through measure 5) is influenced partially by underlying common factors (factor 1 and factor 2) and partially by underlying unique factors (E1 through E5). The strength of the link between each factor and each measure varies, such that a given factor influences some measures more than others. This is the same basic model as is used for LISREL analyses.
Factor analyses are performed by examining the pattern of correlations (or covariances) between the observed measures. Measures that are highly correlated (either positively or negatively) are likely influenced by the same factors, while those that are relatively uncorrelated are likely influenced by different factors.
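A small simulation can show why correlated measures point to shared factors. The sketch below generates data from a common factor model with two factors and five measures. The loadings, the seed, and the choice of which measure depends on which factor are all hypothetical, picked only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n_cases = 2000

# Hypothetical loadings: (loading on factor 1, loading on factor 2).
# Measures 1-3 depend on factor 1, measures 4-5 on factor 2.
loadings = [(0.9, 0.0), (0.8, 0.0), (0.7, 0.0), (0.0, 0.9), (0.0, 0.8)]

measures = [[] for _ in loadings]
for _ in range(n_cases):
    f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)  # common factors
    for i, (l1, l2) in enumerate(loadings):
        # Unique factor scaled so each measure has variance close to 1.
        unique_sd = math.sqrt(1 - l1 ** 2 - l2 ** 2)
        measures[i].append(l1 * f1 + l2 * f2 + unique_sd * rng.gauss(0, 1))

same_factor = pearson(measures[0], measures[1])       # both on factor 1
different_factor = pearson(measures[0], measures[3])  # factor 1 vs factor 2
print(f"measure 1 vs measure 2: {same_factor:.2f}")
print(f"measure 1 vs measure 4: {different_factor:.2f}")
```

Measures driven by the same factor come out strongly correlated, while measures driven by different factors are close to uncorrelated. This is the pattern a factor analysis works backwards from.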
Exploratory Factor Analysis
Objectives:
The primary objectives of an EFA are to determine:
· The number of common factors influencing a set of measures.
· The strength of the relationship between each factor and each observed measure.
Some common uses of EFA are to:
· Identify the nature of the constructs underlying responses in a specific content area.
· Determine what sets of items “hang together” in a questionnaire.
· Demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
· Determine what features are most important when classifying a group of items.
· Generate “factor scores” representing values of the underlying constructs for use in other analyses.
Confirmatory Factor Analysis
Objectives
The primary objective of a CFA is to determine the ability of a predefined factor model to fit an observed set of data.
Some common uses of CFA are to:
· Establish the validity of a single factor model.
· Compare the ability of two different models to account for the same set of data.
· Test the significance of a specific factor loading.
· Test the relationship between two or more factor loadings.
· Test whether a set of factors are correlated or uncorrelated.
· Assess the convergent and discriminant validity of a set of measures.
Factor Analysis vs. Principal Component Analysis
· Exploratory factor analysis is often confused with principal component analysis (PCA), a similar statistical procedure. However, there are significant differences between the two: EFA and PCA will provide somewhat different results when applied to the same data.
· The purpose of PCA is to derive a relatively small number of components that can account for the variability found in a relatively large number of measures. This procedure, called data reduction, is typically performed when a researcher does not want to include all of the original measures in analyses but still wants to work with the information that they contain.
· Differences between EFA and PCA arise from the fact that the two are based on different models. An illustration of the PCA model is provided in figure 2.1. The first difference is that the direction of influence is reversed: EFA assumes that the measured responses are based on the underlying factors while in PCA the principal components are based on the measured responses. The second difference is that EFA assumes that the variance in the measured variables can be decomposed into that accounted for by common factors and that accounted for by unique factors. The principal components are defined simply as linear combinations of the measurements, and so will contain both common and unique variance.
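The two differences can be written compactly. The notation below is an assumption introduced here for illustration (it is not taken from the figures): x_i is the i-th measured variable, F_j a common factor, E_i a unique factor, lambda_ij a loading, C_k a principal component, and w_ki a weight.

```latex
% EFA (common factor model): each measure is generated by the factors.
x_i = \sum_{j=1}^{m} \lambda_{ij} F_j + E_i, \qquad i = 1, \dots, p

% With uncorrelated, unit-variance factors, the variance of each measure
% splits into a common part (the communality) and a unique part.
\operatorname{Var}(x_i) = \sum_{j=1}^{m} \lambda_{ij}^{2} + \operatorname{Var}(E_i)

% PCA: each component is built from the measures, so it mixes
% common and unique variance.
C_k = \sum_{i=1}^{p} w_{ki} \, x_i, \qquad k = 1, \dots, p
```

In the first equation the arrows run from factors to measures. In the last one they run from measures to components.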
In summary, you should use EFA when you are interested in making statements about the factors that are responsible for a set of observed responses, and you should use PCA when you are simply interested in performing data reduction.
Factor - Example and Mathematical Model
- Seven methods of factor extraction are available.
- Five methods of rotation are available, including direct oblimin and promax for non-orthogonal rotations.
- Three methods of computing factor scores are available, and scores can be saved as variables for further analysis.
Multiple orthogonal factors: After we have found the line on which the variance is maximal, there remains some variability around this line. In principal components analysis, after the first factor has been extracted, that is, after the first line has been drawn through the data, we continue and define another line that maximizes the remaining variability, and so on. In this manner, consecutive factors are extracted. Because each consecutive factor is defined to maximize the variability that is not captured by the preceding factor, consecutive factors are independent of each other. Put another way, consecutive factors are uncorrelated or orthogonal to each other.
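The simplest case, two standardized variables, can be checked by hand. For a positive correlation r, the first line is the direction (x + y) / sqrt(2) and the second is (x - y) / sqrt(2). The sketch below uses simulated data (the seed and coefficients are arbitrary assumptions) to confirm that the two components are uncorrelated and that their variances are 1 + r and 1 - r.

```python
import math
import random


def standardize(values):
    """Rescale to mean 0 and (population) variance 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]


def covariance(x, y):
    """Population covariance of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


rng = random.Random(1)
x_raw = [rng.gauss(0, 1) for _ in range(500)]
y_raw = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x_raw]  # correlated with x

x, y = standardize(x_raw), standardize(y_raw)
r = covariance(x, y)  # correlation, since both variables are standardized

# The two principal components of a 2 x 2 correlation matrix with r > 0.
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(x, y)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(x, y)]

var1 = covariance(pc1, pc1)
var2 = covariance(pc2, pc2)
cross = covariance(pc1, pc2)

print(f"r = {r:.3f}")
print(f"variance of component 1 = {var1:.3f} (1 + r)")
print(f"variance of component 2 = {var2:.3f} (1 - r)")
print(f"covariance between components = {cross:.1e}")
```

The two variances add up to 2, the number of variables, and the covariance between the components is zero up to rounding error.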
STATISTICA FACTOR ANALYSIS: Eigenvalues (factor.sta), Extraction: Principal components
Value | Eigenval | % total Variance | Cumul. Eigenval | Cumul. % |
---|---|---|---|---|
1 | 6.118369 | 61.18369 | 6.11837 | 61.1837 |
2 | 1.800682 | 18.00682 | 7.91905 | 79.1905 |
3 | .472888 | 4.72888 | 8.39194 | 83.9194 |
4 | .407996 | 4.07996 | 8.79993 | 87.9993 |
5 | .317222 | 3.17222 | 9.11716 | 91.1716 |
6 | .293300 | 2.93300 | 9.41046 | 94.1046 |
7 | .195808 | 1.95808 | 9.60626 | 96.0626 |
8 | .170431 | 1.70431 | 9.77670 | 97.7670 |
9 | .137970 | 1.37970 | 9.91467 | 99.1467 |
10 | .085334 | .85334 | 10.00000 | 100.0000 |
Eigenvalues: In the second column above, we find the variance on the new factors that were successively extracted. In the third column, these values are expressed as a percentage of the total variance, which in this example is 10 because each of the 10 standardized variables contributes a variance of 1. As we can see, factor 1 accounts for 61 percent of the variance, factor 2 for 18 percent, and so on. As expected, the sum of the eigenvalues is equal to the number of variables. The fourth column contains the cumulative variance extracted, and the fifth column expresses it as a cumulative percentage. The variances extracted by the factors are called the eigenvalues. The name comes from the computation: they are the eigenvalues of the correlation matrix.
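The other columns of the table follow directly from the eigenvalues. As a quick check, the sketch below takes the ten eigenvalues from the table and recomputes the percentage and cumulative columns. The variable names are my own.

```python
# Eigenvalues copied from the second column of the table.
eigenvalues = [6.118369, 1.800682, 0.472888, 0.407996, 0.317222,
               0.293300, 0.195808, 0.170431, 0.137970, 0.085334]

# Total variance: equals the number of variables (10) up to rounding.
total_variance = sum(eigenvalues)

rows = []
cumulative = 0.0
for number, ev in enumerate(eigenvalues, start=1):
    cumulative += ev
    rows.append((number,
                 ev,                                   # eigenvalue
                 100 * ev / total_variance,            # % total variance
                 cumulative,                           # cumulative eigenvalue
                 100 * cumulative / total_variance))   # cumulative %

print(f"total variance = {total_variance:.5f}")
for number, ev, pct, cum_ev, cum_pct in rows:
    print(f"{number:2d} {ev:9.6f} {pct:9.5f} {cum_ev:9.5f} {cum_pct:9.4f}")
```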
Which criterion to use: Both criteria, the Kaiser criterion (retain only factors with eigenvalues greater than 1) and the scree test, have been studied in detail (Browne, 1968; Cattell & Jaspers, 1967; Hakstian, Rogers, & Cattell, 1982; Linn, 1968; Tucker, Koopman & Linn, 1969). Theoretically, you can evaluate those criteria by generating random data based on a particular number of factors. You can then see whether the number of factors is accurately detected by those criteria. Using this general technique, the first method (Kaiser criterion) sometimes retains too many factors, while the second technique (scree test) sometimes retains too few; however, both do quite well under normal conditions, that is, when there are relatively few factors and many cases. In practice, an additional important aspect is the extent to which a solution is interpretable. Therefore, you usually examine several solutions with more or fewer factors, and choose the one that makes the best "sense." We will discuss this issue in the context of factor rotations below.
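Applied to the eigenvalue table above, the Kaiser criterion takes one line. The scree test is a visual judgment, so this sketch does not try to automate it. It only prints the drop between consecutive eigenvalues, which is what the eye looks for in the plot.

```python
# Eigenvalues copied from the table above.
eigenvalues = [6.118369, 1.800682, 0.472888, 0.407996, 0.317222,
               0.293300, 0.195808, 0.170431, 0.137970, 0.085334]

# Kaiser criterion: keep factors whose eigenvalue is greater than 1.
kaiser_count = sum(1 for ev in eigenvalues if ev > 1.0)
print(f"Kaiser criterion retains {kaiser_count} factors")

# Aid for reading the scree plot: the drop from each eigenvalue to the next.
# Large drops come before the elbow, small and similar drops after it.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
for number, drop in enumerate(drops, start=1):
    print(f"factor {number} -> {number + 1}: drop of {drop:.3f}")
```

Here the Kaiser criterion keeps two factors, and the drops become small and roughly equal from the third factor onward.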
Histogram & Box Plot
Histogram
A histogram is a graphical representation of the distribution of the data. It shows tabular frequencies as adjacent rectangles erected over discrete intervals known as bins, with the area of each rectangle equal to the frequency of the observations in its interval. The total area of the histogram is therefore equal to the number of data points.
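A short sketch of the binning step, using a small made-up data set and equal-width bins (both are assumptions for illustration). Each rectangle's height is the count divided by the bin width, so its area equals the count.

```python
# Hypothetical data set.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

bin_start = 0   # left edge of the first bin
bin_width = 2   # every bin covers [edge, edge + 2)
n_bins = 5      # bins cover the range [0, 10)

# Count how many values fall into each bin.
counts = [0] * n_bins
for value in data:
    index = int((value - bin_start) // bin_width)
    counts[index] += 1

# Height chosen so that area (height * width) equals the frequency.
heights = [count / bin_width for count in counts]
total_area = sum(height * bin_width for height in heights)

for i, count in enumerate(counts):
    left = bin_start + i * bin_width
    print(f"[{left:2d}, {left + bin_width:2d}): {'#' * count} ({count})")
print(f"total area = {total_area}, number of data points = {len(data)}")
```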
Box Plot
The box plot is a chart that graphically represents the five most important descriptive values for a data set, often called the five-number summary. It summarizes the following statistical measures:
· Median
· Upper & lower Quartiles
· Minimum & maximum data values
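These five values can be computed with Python's standard library. The data set below is hypothetical. Note that there are several conventions for computing quartiles; this sketch uses the "inclusive" method of `statistics.quantiles`, and other tools may give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical data set.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Three cut points that split the data into four equal parts.
lower_quartile, median, upper_quartile = quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), lower_quartile, median,
                       upper_quartile, max(data))

labels = ("minimum", "lower quartile", "median", "upper quartile", "maximum")
for label, value in zip(labels, five_number_summary):
    print(f"{label:>14}: {value}")
```

A box plot draws the box from the lower to the upper quartile, a line at the median, and whiskers out toward the minimum and maximum.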
Comparing histogram & box plots
· The data in a histogram is represented as bars, whose heights form the peaks of the distribution. This helps us to interpret the data and also shows the fluctuations. A box plot, in contrast, condenses the values into a few summary statistics, which smooths out those fluctuations and can make the distribution look roughly normal.
· A histogram is preferable to a box plot when there is very little variance among the observed frequencies. In the source article's example, the histogram shows that there is little variance across the groups of data; however, when the same data points are graphed on a box plot, the distribution looks roughly normal, with a high portion of the values falling below six.
· When there is moderate variation among the observed frequencies, the histogram looks ragged and non-symmetrical because of the way the data is grouped. However, when a box plot is used to graph the same data points, the chart can look symmetric, as if the data were perfectly normal, even though a box plot alone cannot establish normality.
Sources:
http://en.wikipedia.org/wiki/Box_plot
http://en.wikipedia.org/wiki/Histogram
http://www.netmba.com/statistics/plot/box/
http://www.brighthub.com/office/project-management/articles/58254.aspx#