Monday 23 January 2012

Basic Difference Between Clustering Analysis and Discriminant analysis

Hello everyone... it has been a long time since we discussed discriminant analysis in class.
Many of you may have realized that discriminant analysis categorizes people, which is very similar to clustering analysis, where people or objects with similar patterns are put into one group. Both can therefore be useful in targeting and segmentation.
But, as I said above, clustering is only similar to discriminant analysis. The basic difference between the two is that in discriminant analysis, to classify objects into two or more groups, one has to know the group membership of the cases used to derive the classification rule, whereas in clustering analysis one does not know who belongs to which group. In fact, sometimes one does not even know the number of groups.
For example, if you want to distinguish tolerant from non-tolerant machines using discriminant analysis, cases with known diagnoses (machines already known to be non-tolerant) must be available. Based on these known cases, we derive a rule for classifying machines as non-tolerant.
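To make the contrast concrete, here is a minimal Python sketch. The readings, the group names and all three helper functions are made up for illustration. The nearest-mean rule is only a simple stand-in for a real discriminant function, and the two-group split is a bare-bones version of k-means.

```python
# Hypothetical one-dimensional "tolerance" readings for a handful of machines.
labelled = [(1.0, "tolerant"), (1.2, "tolerant"), (0.9, "tolerant"),
            (3.1, "non-tolerant"), (2.8, "non-tolerant"), (3.3, "non-tolerant")]

def fit_rule(cases):
    """Discriminant-style step: membership is known, so learn one mean per group."""
    sums, counts = {}, {}
    for value, group in cases:
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: sums[group] / counts[group] for group in sums}

def classify(value, means):
    """Assign a new case to the group whose mean is closest."""
    return min(means, key=lambda group: abs(value - means[group]))

def two_means(values, iterations=10):
    """Clustering-style step: no labels, just split the values around two centres.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(near_lo) / len(near_lo), sum(near_hi) / len(near_hi)
    return sorted(near_lo), sorted(near_hi)

means = fit_rule(labelled)
print(classify(2.9, means))                 # rule learned from the known cases
unlabelled = [value for value, _ in labelled]
print(two_means(unlabelled))                # groups found without any labels
```

The first part needs the known memberships to build its rule; the second part never sees them and only recovers groups, not their names.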
I hope the above information is useful in understanding the basic difference between clustering analysis and discriminant analysis.
Thank you




Thursday 19 January 2012

Scree Plot

What is a scree plot?


A Scree Plot is a simple graph that shows the fraction of total variance in the data as explained or represented by each principal component.


The principal component with the largest fraction contribution is labeled with the label name from the preferences file.
When read left to right across the abscissa, such a plot can often show a clear separation in the fraction of total variance where the 'most important' components cease and the 'least important' components begin.
This point of separation is often called the 'elbow'.
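As a rough illustration, the fractions and the elbow can be computed from a list of eigenvalues. The eigenvalues below are made up, and the "largest bend" rule (largest second difference) is only one possible heuristic for locating the elbow, not a standard definition.

```python
def variance_fractions(eigenvalues):
    """Fraction of total variance represented by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

def elbow_index(fractions):
    """Heuristic: 1-based number of the component where the curve bends most sharply,
    i.e. where the second difference of the fractions is largest."""
    bends = [fractions[i - 1] - 2 * fractions[i] + fractions[i + 1]
             for i in range(1, len(fractions) - 1)]
    return bends.index(max(bends)) + 2

# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [4.0, 2.5, 0.6, 0.4, 0.3, 0.2]
fractions = variance_fractions(eigenvalues)
elbow = elbow_index(fractions)
print(fractions)
print("curve flattens at component", elbow)
```

With these numbers the curve flattens at the third component, so the first two would be read as the 'most important' ones.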


This was something interesting to know for me:
In the PCA (Principal Component Analysis) literature, the plot is called a 'Scree' Plot because it often looks like a 'scree' slope, where rocks have fallen down and accumulated on the side of a mountain.


Some tips on when to use a scree plot:


1) If there are fewer than 30 variables and communalities after extraction are greater than 0.7, OR if the sample size exceeds 250 and the average communality is greater than 0.6, then retain all factors having eigenvalues greater than 1.


2) If neither of the above applies, we can use the scree plot, provided the sample size is considerably large (around 300 or more cases).
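The two tips can be written down as a small decision function. This is only a sketch: the function name and its return strings are my own, and I am assuming that "communalities greater than 0.7" means every communality, which the tip itself does not spell out.

```python
def retention_advice(n_variables, communalities, sample_size):
    """Return which rule of thumb from the tips above applies (hypothetical helper)."""
    all_high = all(c > 0.7 for c in communalities)        # assumption: every communality
    average = sum(communalities) / len(communalities)
    if (n_variables < 30 and all_high) or (sample_size > 250 and average > 0.6):
        return "retain factors with eigenvalue > 1"
    if sample_size >= 300:
        return "use the scree plot"
    return "neither rule of thumb applies"

print(retention_advice(10, [0.80, 0.75, 0.90], 100))
print(retention_advice(40, [0.50, 0.40], 350))
```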


Links:
http://www.statisticshell.com/docs/factor.pdf
http://www.improvedoutcomes.com/docs/WebSiteDocs/PCA/Creating_a_Scree_Plot.htm

Monday 16 January 2012

Factor Analysis with help of other statistical techniques


Factor analysis is a statistical technique.
· The aim of factor analysis is to simplify a complex data set by representing the set of variables in terms of a smaller number of underlying (hypothetical or unobservable) variables, known as factors.
· To validate the factor analysis, its result needs to be compared with that of cluster analysis.
  o If cluster analysis does not challenge the result of factor analysis, it confirms that result.
  o Thus, factor analysis can be used to understand the financial position and performance of various industries in a more practical and time-saving manner.
There are many ways in which data can be analyzed for a reliable solution, but here I have selected only three.
Assume that the study is carried out for an industry.

1.    Correlation Study
With the help of the inter-correlation matrix, some variables are excluded if they show a very weak correlation (i.e. an absolute value below 0.5) with the other variables in the study. However, before elimination, domain knowledge is exercised to ensure that no important variable (financial ratio) is excluded.
2.    Multiple Regression Analysis
Factor analysis is conducted on the variables that remain after the correlation study, and it helps to create factors for analysis. Multiple regression analysis is then conducted, taking the factor scores of the different factors as dependent variables and the constituent variables of the respective factor as independent variables. The R-square (coefficient of determination) for each such regression is found to be very high, which signifies a strong regression relationship between the factors and their constituent variables. However, variables with a low t-value (i.e. < 2) and a correspondingly high p-value (i.e. > 0.05) are found in different factors; these are candidates for removal before the next step.
3.    Factor Analysis
Factor analysis is conducted once again on the variables remaining from the previous analysis. The Rotated Component Matrix is then produced. It is observed that the remaining variables have been categorized into factors. These factors account for about X% of the total variance, which can be considered for decision making.
To check whether the analysis is fit for use, cluster analysis is then carried out.
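The first step (correlation screening) can be sketched in a few lines of Python. The ratio names and numbers are invented, and `weakly_correlated` is a hypothetical helper. It only flags candidates; as noted above, domain knowledge should still decide whether a flagged ratio is really dropped.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def weakly_correlated(data, threshold=0.5):
    """Names of variables whose |r| with every other variable is below the threshold."""
    flagged = []
    for name, values in data.items():
        others = [pearson(values, other) for key, other in data.items() if key != name]
        if all(abs(r) < threshold for r in others):
            flagged.append(name)
    return flagged

# Hypothetical financial ratios for six firms.
ratios = {
    "current_ratio": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "quick_ratio":   [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "debt_equity":   [2.0, 1.0, 3.0, 3.0, 1.0, 2.0],
}
print(weakly_correlated(ratios))
```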

Factor Analysis - Components, PCA, Comparison of Factor Analysis and PCA




HELLO FRIENDS !!!

Hope you are all enjoying the blog and finding the updates valuable. Today, let us get more familiar with two new concepts: Factor Analysis and PCA.

Factor Analysis

Factor analysis is a collection of methods used to examine how underlying constructs influence the responses on a number of measured variables.

There are basically two types of factor analysis: exploratory and confirmatory.

1. Exploratory factor analysis (EFA) attempts to discover the nature of the constructs influencing a set of responses.

2. Confirmatory factor analysis (CFA) tests whether a specified set of constructs is influencing responses in a predicted way.

Both types of factor analyses are based on the Common Factor Model, illustrated in figure 1.1. This model proposes that each observed response (measure 1 through measure 5) is influenced partially by underlying common factors (factor 1 and factor 2) and partially by underlying unique factors (E1 through E5). The strength of the link between each factor and each measure varies, such that a given factor influences some measures more than others. This is the same basic model as is used for LISREL analyses.

Factor analyses are performed by examining the pattern of correlations (or covariances) between the observed measures. Measures that are highly correlated (either positively or negatively) are likely influenced by the same factors, while those that are relatively uncorrelated are likely influenced by different factors.
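A small simulation shows why correlations carry this information. It is a sketch under my own assumptions: two independent factors, a loading of 0.9 for every measure, and normally distributed unique parts. None of these numbers come from the model description above.

```python
import random
from math import sqrt

random.seed(0)
N = 2000
LOADING = 0.9
UNIQUE = sqrt(1 - LOADING ** 2)      # keeps each measure at unit variance

def measure(factor_scores):
    """One observed measure: common part (loading * factor) plus a unique part."""
    return [LOADING * f + UNIQUE * random.gauss(0, 1) for f in factor_scores]

def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

factor1 = [random.gauss(0, 1) for _ in range(N)]
factor2 = [random.gauss(0, 1) for _ in range(N)]
m1, m2 = measure(factor1), measure(factor1)   # both driven by factor 1
m3 = measure(factor2)                         # driven by factor 2

same_factor = corr(m1, m2)
different_factor = corr(m1, m3)
print(round(same_factor, 2), round(different_factor, 2))
```

Measures that share a factor come out strongly correlated (about 0.9 × 0.9 in theory), while measures driven by different factors are close to uncorrelated.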

Exploratory Factor Analysis

Objectives:

The primary objectives of an EFA are to determine:

· The number of common factors influencing a set of measures.

· The strength of the relationship between each factor and each observed measure.

Some common uses of EFA are to:

· Identify the nature of the constructs underlying responses in a specific content area.

· Determine what sets of items “hang together” in a questionnaire.

· Demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.

· Determine what features are most important when classifying a group of items.

· Generate “factor scores” representing values of the underlying constructs for use in other analyses.

Confirmatory Factor Analysis

Objectives

The primary objective of a CFA is to determine the ability of a predefined factor model to fit an observed set of data.

Some common uses of CFA are to:

· Establish the validity of a single factor model.

· Compare the ability of two different models to account for the same set of data.

· Test the significance of a specific factor loading.

· Test the relationship between two or more factor loadings.

· Test whether a set of factors are correlated or uncorrelated.

· Assess the convergent and discriminant validity of a set of measures.

Factor Analysis vs. Principal Component Analysis

· Exploratory factor analysis is often confused with principal component analysis (PCA), a similar statistical procedure. However, there are significant differences between the two: EFA and PCA will provide somewhat different results when applied to the same data.

· The purpose of PCA is to derive a relatively small number of components that can account for the variability found in a relatively large number of measures. This procedure, called data reduction, is typically performed when a researcher does not want to include all of the original measures in analyses but still wants to work with the information that they contain.

· Differences between EFA and PCA arise from the fact that the two are based on different models. An illustration of the PCA model is provided in figure 2.1. The first difference is that the direction of influence is reversed: EFA assumes that the measured responses are based on the underlying factors while in PCA the principal components are based on the measured responses. The second difference is that EFA assumes that the variance in the measured variables can be decomposed into that accounted for by common factors and that accounted for by unique factors. The principal components are defined simply as linear combinations of the measurements, and so will contain both common and unique variance.

In summary, you should use EFA when you are interested in making statements about the factors that are responsible for a set of observed responses, and you should use PCA when you are simply interested in performing data reduction.


Factor- Example and Mathematical Model



Overview

Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. Factor analysis is often used in data reduction to identify a small number of factors that explain most of the variance that is observed in a much larger number of manifest variables. Factor analysis can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis (for example, to identify collinearity prior to performing a linear regression analysis).
The factor analysis procedure offers a high degree of flexibility:
  • Seven methods of factor extraction are available.
  • Five methods of rotation are available, including direct oblimin and promax for non-orthogonal rotations.
  • Three methods of computing factor scores are available, and scores can be saved as variables for further analysis.


Example

Suppose a psychologist proposes a theory that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for the theory is sought in the examination scores from each of 10 different academic fields of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables.

The psychologist's theory may say that for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a linear combination of those two "factors". The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the theory to be the same for all intelligence level pairs, and are called "factor loadings" for this subject. For example, the theory may hold that the average student's aptitude in the field of amphibology is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}. The numbers 10 and 6 are the factor loadings associated with amphibology. Other academic subjects may have different factor loadings.

Two students having identical degrees of verbal intelligence and identical degrees of mathematical intelligence may have different aptitudes in amphibology because individual aptitudes differ from average aptitudes. That difference is called the "error", a statistical term that means the amount by which an individual differs from what is average for his or her levels of intelligence (see errors and residuals in statistics).

The observable data that go into factor analysis would be 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each student must be inferred from the data.


Mathematical model

In the example above, for i = 1, ..., 1,000 the ith student's scores are

$$
\begin{aligned}
x_{1,i} &= \mu_1 + \ell_{1,1} v_i + \ell_{1,2} m_i + \varepsilon_{1,i} \\
&\;\;\vdots \\
x_{10,i} &= \mu_{10} + \ell_{10,1} v_i + \ell_{10,2} m_i + \varepsilon_{10,i}
\end{aligned}
$$

where
·  $x_{k,i}$ is the ith student's score for the kth subject,

·  $\mu_k$ is the mean of the students' scores for the kth subject (assumed to be zero, for simplicity, in the example as described above, which would amount to a simple shift of the scale used),

·  $v_i$ is the ith student's "verbal intelligence",

·  $m_i$ is the ith student's "mathematical intelligence",

·  $\ell_{k,j}$ are the factor loadings for the kth subject, for j = 1, 2,

·  $\varepsilon_{k,i}$ is the difference between the ith student's score in the kth subject and the average score in the kth subject of all students whose levels of verbal and mathematical intelligence are the same as those of the ith student.


In matrix notation, we have

$$
X = \mu \, \mathbf{1}_{1 \times N} + L F + \varepsilon
$$

where
·  N is 1000 students, and $\mathbf{1}_{1 \times N}$ is a row vector of N ones, which repeats μ for every student,

·  X is a 10 × 1,000 matrix of observable random variables,

·  μ is a 10 × 1 column vector of unobservable constants (in this case "constants" are quantities not differing from one individual student to the next; and "random variables" are those assigned to individual students; the randomness arises from the random way in which the students are chosen),

·  L is a 10 × 2 matrix of factor loadings (unobservable constants, ten academic topics, each with two intelligence parameters that determine success in that topic),

·  F is a 2 × 1,000 matrix of unobservable random variables (two intelligence parameters for each of 1000 students),

·  ε is a 10 × 1,000 matrix of unobservable random variables.


Observe that doubling the scale on which "verbal intelligence" (the first component in each column of F) is measured, while simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of verbal intelligence is 1. Moreover, for similar reasons, no generality is lost by assuming the two factors are uncorrelated with each other. The "errors" ε are taken to be independent of each other. The variances of the "errors" associated with the 10 different subjects are not assumed to be equal.

Note that, since any rotation of a solution is also a solution, interpreting the factors is difficult. In this particular example, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence. Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence without an outside argument. The values of the loadings L, the averages μ, and the variances of the "errors" ε must be estimated given the observed data X and F (the assumption about the levels of the factors is fixed for a given F).
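The rescaling and rotation arguments can be written out in the notation of the model. This is only a sketch: the matrices D and Q and the symbol Ψ (the diagonal matrix of error variances) are introduced here for illustration, and the covariance line assumes uncorrelated, unit-variance factors with errors uncorrelated with F.

```latex
% Rescaling: D = diag(2, 1) doubles the verbal-intelligence scale,
% while D^{-1} halves the corresponding column of loadings.
L F = (L D^{-1})(D F)

% Rotation: for any orthogonal Q (Q Q^{\mathsf T} = I), the pair (LQ, Q^{\mathsf T} F)
% reproduces exactly the same product.
L F = (L Q)(Q^{\mathsf T} F)

% Implied covariance of one student's 10 scores: unchanged by the rotation.
\operatorname{Cov}(x) = L L^{\mathsf T} + \Psi = (L Q)(L Q)^{\mathsf T} + \Psi
```

Since L and LQ give the same covariance, the data alone cannot tell them apart, which is exactly why the factors need an outside argument to be interpreted.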



Hello,

A basic idea of Factor Analysis being used as a Data Reduction Method and a few steps

Suppose we conducted a study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.

Suppose we want to measure people's satisfaction with their lives. We design a satisfaction questionnaire with various items; among other things we ask our subjects how satisfied they are with their hobbies (item 1) and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. Given a high correlation between the two items, we can conclude that they are quite redundant.

Combining Two Variables into a Single Factor: You can summarize the correlation between two variables in a scatterplot. A regression line can then be fitted that represents the "best" summary of the linear relationship between the variables. If we could define a variable that would approximate the regression line in such a plot, then that variable would capture most of the "essence" of the two items. Subjects' single scores on that new factor, represented by the regression line, could then be used in future data analyses to represent that essence of the two items. In a sense we have reduced the two variables to one factor.
Principal Components Analysis: The example described above, combining two correlated variables into one factor, illustrates the basic idea of factor analysis, or of principal components analysis to be precise. If we extend the two-variable example to multiple variables, then the computations become more involved, but the basic principle of expressing two or more variables by a single factor remains the same.
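For just two standardized variables the arithmetic is small enough to do by hand: the 2 × 2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r, so the first component carries (1 + |r|) / 2 of the total variance. Here is a minimal sketch with made-up numbers; the heights and item responses are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def first_component_share(x, y):
    """Share of total variance on the first principal component of two
    standardized variables: larger eigenvalue (1 + |r|) over the total (2)."""
    return (1 + abs(pearson(x, y))) / 2

# Height in inches and the same height in centimeters: fully redundant.
inches = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
centimeters = [2.54 * h for h in inches]
height_share = first_component_share(inches, centimeters)

# Two hypothetical questionnaire items: correlated, but not identical.
item1 = [1, 2, 3, 4, 5, 4]
item2 = [2, 2, 3, 5, 4, 4]
item_share = first_component_share(item1, item2)

print(height_share, item_share)
```

The two height measures collapse onto a single factor completely, while the two items lose a little information when they are reduced to one.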
Extracting Principal Components: We do not want to go into the details about the computational aspects of principal components analysis here, which can be found elsewhere. However, basically, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space.
Generalizing to the Case of Multiple Variables: When there are more than two variables, we can think of them as defining a "space," just as two variables defined a plane. Thus, when we have three variables, we could plot a three-dimensional scatterplot and, again, we could fit a plane through the data.


Multiple orthogonal factors: After we have found the line on which the variance is maximal, there remains some variability around this line. In principal components analysis, after the first factor has been extracted, that is, after the first line has been drawn through the data, we continue and define another line that maximizes the remaining variability, and so on. In this manner, consecutive factors are extracted. Because each consecutive factor is defined to maximize the variability that is not captured by the preceding factor, consecutive factors are independent of each other. Put another way, consecutive factors are uncorrelated or orthogonal to each other.
How many Factors to Extract? So far, we are considering principal components analysis as a data reduction method, that is, as a method for reducing the number of variables. The question then is, how many factors do we want to extract? Note that as we extract consecutive factors, they account for less and less variability. The decision of when to stop extracting factors basically depends on when there is only very little "random" variability left. The nature of this decision is arbitrary; however, various guidelines have been developed.
Reviewing the Results of a Principal Components Analysis: Let us now look at some of the standard results from a principal components analysis. To reiterate, we are extracting factors that account for less and less variance. To simplify matters, we usually start with the correlation matrix, where the variances of all variables are equal to 1.0. Therefore, the total variance in that matrix is equal to the number of variables. For example, if we have 10 variables each with a variance of 1, then the total variability that can potentially be extracted is equal to 10 times 1. Suppose that in the satisfaction study introduced earlier we included 10 items to measure different aspects of satisfaction at home and at work. The variance accounted for by successive factors would be summarized as follows:

STATISTICA FACTOR ANALYSIS
Eigenvalues (factor.sta)
Extraction: Principal components

Value    Eigenval    % total Variance    Cumul. Eigenval    Cumul. %
  1      6.118369        61.18369            6.11837         61.1837
  2      1.800682        18.00682            7.91905         79.1905
  3       .472888         4.72888            8.39194         83.9194
  4       .407996         4.07996            8.79993         87.9993
  5       .317222         3.17222            9.11716         91.1716
  6       .293300         2.93300            9.41046         94.1046
  7       .195808         1.95808            9.60626         96.0626
  8       .170431         1.70431            9.77670         97.7670
  9       .137970         1.37970            9.91467         99.1467
 10       .085334          .85334           10.00000        100.0000


Eigenvalues: In the second column above, we find the variance on the new factors that were successively extracted. The variances extracted by the factors are called the eigenvalues; this name derives from the computational issues involved. In the third column, these values are expressed as a percent of the total variance (in this example, 10). As we can see, factor 1 accounts for 61 percent of the variance, factor 2 for 18 percent, and so on. As expected, the sum of the eigenvalues is equal to the number of variables. The fourth and fifth columns contain the cumulative variance extracted, as eigenvalues and as a percentage.
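The percentage and cumulative columns follow from the eigenvalues alone, which is easy to check. The sketch below simply redoes that arithmetic on the eigenvalues listed in the table; the variable names are mine.

```python
# Eigenvalues copied from the table above.
eigenvalues = [6.118369, 1.800682, 0.472888, 0.407996, 0.317222,
               0.293300, 0.195808, 0.170431, 0.137970, 0.085334]

total = sum(eigenvalues)                       # equals the number of variables (10)
percent = [100 * value / total for value in eigenvalues]

cumulative = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative.append(running)
cumulative_percent = [100 * c / total for c in cumulative]

for number, row in enumerate(zip(eigenvalues, percent, cumulative, cumulative_percent), 1):
    print(number, *(f"{x:10.4f}" for x in row))
```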
Eigenvalues and the Number-of-Factors Problem: Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain. As mentioned earlier, by its nature this is an arbitrary decision. However, there are some guidelines that are commonly used, and that, in practice, seem to yield the best results.
The Kaiser criterion: First, we can retain only factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much as the equivalent of one original variable, we drop it. This criterion was proposed by Kaiser (1960), and is probably the one most widely used. In our example above, using this criterion, we would retain 2 factors (principal components).
The scree test: A graphical method is the scree test first proposed by Cattell (1966). We can plot the eigenvalues shown above in a simple line plot.
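Both criteria can be tried on the eigenvalues from the table. The sketch below counts the eigenvalues above 1 (Kaiser) and prints a crude text version of the line plot, one bar per factor; the bar scaling is arbitrary and only meant to make the bend visible.

```python
# Eigenvalues copied from the table above.
eigenvalues = [6.118369, 1.800682, 0.472888, 0.407996, 0.317222,
               0.293300, 0.195808, 0.170431, 0.137970, 0.085334]

# Kaiser criterion: keep factors that extract more than one variable's worth of variance.
retained = [value for value in eigenvalues if value > 1.0]
print("factors retained by the Kaiser criterion:", len(retained))

# Text scree plot: bar length is proportional to the eigenvalue (8 characters per unit).
for number, value in enumerate(eigenvalues, start=1):
    bar = "#" * round(value * 8)
    print(f"{number:2d} {value:8.4f} {bar}")
```

In the printed bars the steep drop ends after the second or third factor, and everything to the right looks like the flat "scree" at the foot of the slope.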



Which criterion to use: Both criteria have been studied in detail (Browne, 1968; Cattell & Jaspers, 1967; Hakstian, Rogers, & Cattell, 1982; Linn, 1968; Tucker, Koopman & Linn, 1969). Theoretically, you can evaluate those criteria by generating random data based on a particular number of factors. You can then see whether the number of factors is accurately detected by those criteria. Using this general technique, the first method (Kaiser criterion) sometimes retains too many factors, while the second technique (scree test) sometimes retains too few; however, both do quite well under normal conditions, that is, when there are relatively few factors and many cases. In practice, an additional important aspect is the extent to which a solution is interpretable. Therefore, one usually examines several solutions with more or fewer factors, and chooses the one that makes the best "sense." We will discuss this issue in the context of factor rotations below.
Principal Factors Analysis: Before we continue to examine the different aspects of the typical output from a principal components analysis, let us now introduce principal factors analysis. Let us return to our satisfaction questionnaire example to conceive of another "mental model" for factor analysis. We can think of subjects' responses as being dependent on two components. First, there are some underlying common factors, such as the "satisfaction-with-hobbies" factor we looked at before. Each item measures some part of this common aspect of satisfaction. Second, each item also captures a unique aspect of satisfaction that is not addressed by any other item.
Communalities: If this model is correct, then we should not expect that the factors will extract all variance from our items; rather, only that proportion that is due to the common factors and shared by several items. In the language of factor analysis, the proportion of variance of a particular item that is due to common factors (shared with other items) is called communality. Therefore, an additional task facing us when applying this model is to estimate the communalities for each variable, that is, the proportion of variance that each item has in common with other items. The proportion of variance that is unique to each item is then the respective item's total variance minus the communality. A common starting point is to use the squared multiple correlation of an item with all other items as an estimate of the communality (refer to Multiple Regression for details about multiple regression). Some authors have suggested various iterative "post-solution improvements" to the initial multiple regression communality estimate; for example, the so-called MINRES method (minimum residual factor method; Harman & Jones, 1966) will try various modifications to the factor loadings with the goal of minimizing the residual (unexplained) sums of squares.
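In symbols, for standardized items (total variance 1), the split and the usual starting estimate look as follows. The notation is my own: h_i^2 for the communality of item i, u_i^2 for its unique variance, R for the item correlation matrix and r^{ii} for the ith diagonal element of its inverse.

```latex
% Variance of a standardized item splits into common and unique parts
1 = h_i^2 + u_i^2
\qquad\Longrightarrow\qquad
u_i^2 = 1 - h_i^2

% Initial estimate: squared multiple correlation of item i with all other items
h_i^2 \approx R_i^2 = 1 - \frac{1}{r^{ii}},
\qquad r^{ii} = \left( R^{-1} \right)_{ii}
```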

Histogram & Box Plot

Hello...

Histogram

A histogram is a graphical representation of the distribution of the data. It shows tabulated frequencies as adjacent rectangles erected over discrete intervals, which are known as bins. When the height of each rectangle is the frequency density (the frequency divided by the width of the bin), the total area of the histogram is equal to the number of data points.
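Here is a small sketch of the binning and of the area statement. The data, the bin start and the bin width are made up, and `histogram` is a hypothetical helper rather than a library function.

```python
def histogram(data, start, width, n_bins):
    """Count how many values fall into each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for value in data:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7]
width = 2
counts = histogram(data, start=0, width=width, n_bins=4)

# Rectangle height = frequency density, so height * width gives back the count.
heights = [count / width for count in counts]
area = sum(height * width for height in heights)

print(counts, area)
```

With bars drawn at these heights, the total area comes back to the number of observations, whatever bin width is chosen.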

Box Plot

The box plot is a chart that graphically represents the five most important descriptive values for a data set. It summarizes the following statistical measures:

· Median

· Upper & lower Quartiles

· Minimum & maximum data values
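These five values are easy to compute with Python's standard library. One caveat: there are several conventions for computing quartiles, and the "inclusive" method chosen here is just one of them, so other software may report slightly different quartiles for the same (made-up) data.

```python
import statistics

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile and maximum of a data set."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "minimum": ordered[0],
        "lower quartile": q1,
        "median": median,
        "upper quartile": q3,
        "maximum": ordered[-1],
    }

summary = five_number_summary([1, 2, 2, 3, 3, 3, 4, 4, 5, 7])
for name, value in summary.items():
    print(f"{name:15s} {value}")
```

The box of the plot spans the two quartiles, the line inside it marks the median, and the whiskers reach out towards the minimum and maximum.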

Comparing histogram & box plots

· The data in a histogram are represented as bars, whose peaks help us interpret the data and show its fluctuations. In a box plot, by contrast, the values are condensed into a few summary measures, so the fluctuations average one another out and the distribution can look roughly normal.

· A histogram is preferable to a box plot when there is very little variance among the observed frequencies. In the source article's example, the histogram shows that there is little variance across the groups of data; however, when the same data points are graphed on a box plot, the distribution looks roughly normal, with a high portion of the values falling below six.

· When there is moderate variation among the observed frequencies, the histogram looks ragged and non-symmetrical due to the way the data is grouped. However, when a box plot is used to graph the same data points, the chart can suggest a symmetric, normal-looking distribution.

Source:-

http://en.wikipedia.org/wiki/Box_plot

http://en.wikipedia.org/wiki/Histogram

http://www.netmba.com/statistics/plot/box/

http://www.brighthub.com/office/project-management/articles/58254.aspx#