Understanding Principal Component Analysis (PCA)

During machine learning practice or data analysis, when we get a new dataset, the first thing we can do is visualize the data to see how it is distributed and how the data points relate to one another. We can easily visualize a dataset with 2 or 3 variables by plotting a 2-dimensional (2D) or 3-dimensional (3D) figure. However, most datasets have far more variables, so it is important to know how to visualize high-dimensional data.

I would like to discuss two popular techniques for visualizing high-dimensional datasets: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). We will focus on PCA in this post and cover t-SNE in a later one.

What is PCA

PCA tries to find a smaller set of new variables (principal components, PCs) that capture most of the variation in the dataset. A simple example makes this easier to understand.

[Figure 1: pca_what]

We can observe from the top of Figure 1 that some dimensions may be more important than others. In this case, we can take 2D data and display it on a 1D graph without losing too much information. Both graphs say, "the important variation is left to right." Intuitively, a dimension with more variability explains more about what is happening in the data.
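To make the 2D-to-1D idea concrete, here is a minimal NumPy sketch. The toy data and all variable names are my own, not from the original post: it generates points whose variation mostly runs left to right, finds the direction of maximum variance, and projects the 2D points onto that 1D line.

```python
import numpy as np

# Toy 2-D data whose main variation runs "left to right" (hypothetical example).
rng = np.random.default_rng(0)
x = rng.normal(0, 3, 100)               # large spread along one direction
y = 0.5 * x + rng.normal(0, 0.5, 100)   # small spread around the trend
X = np.column_stack([x, y])

# Center the data, then find the direction of maximum variance.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]    # first principal component

# Project the 2-D points onto the 1-D line spanned by PC1.
scores = Xc @ pc1
print(f"PC1 captures {eigvals.max() / eigvals.sum():.1%} of the variance")
```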

We can also visualize data with more than 3 variables (in this case, 4 variables) in a 2D graph by using PCA…
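As a sketch of this idea, the snippet below uses scikit-learn's PCA on the classic iris dataset, which has four variables per sample. The choice of iris and the plotting details are my assumptions, not from the original post:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris has 4 variables per sample; PCA compresses them to 2 for plotting.
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # normalize each variable

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=iris.target, cmap="viridis", s=20)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} of variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} of variance)")
plt.title("4-variable iris data projected onto 2 principal components")
plt.show()
```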

To describe how PCA works, we will use two variables for simplicity; more variables follow the same procedure.

[Figure: pca_how]

How PCA works

  1. Normalization
  2. Best fitting (projection)
  3. Eigenvalues and eigenvectors
  4. Loading scores and the scree plot

(All four steps are illustrated in the sketch below.)
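Here is a minimal from-scratch sketch of these four steps. The function name, the five-variable random data, and the exact scaling choices are hypothetical, not from the original post:

```python
import numpy as np
import matplotlib.pyplot as plt

def pca_from_scratch(X):
    """PCA via eigendecomposition of the covariance matrix."""
    # 1. Normalization: center each variable and scale to unit variance.
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2-3. The eigenvectors of the covariance matrix are the best-fitting
    # directions (the PCs); each eigenvalue is the variance its PC captures.
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort PCs by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Xn @ eigvecs      # data projected onto the PCs
    loadings = eigvecs         # loading scores: each variable's weight per PC
    return scores, loadings, eigvals

# Hypothetical 5-variable data just to exercise the function.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
scores, loadings, eigvals = pca_from_scratch(X)

# 4. Scree plot: fraction of variance explained by each PC.
ratios = eigvals / eigvals.sum()
plt.bar(range(1, len(ratios) + 1), ratios)
plt.xlabel("Principal component")
plt.ylabel("Fraction of variance explained")
plt.title("Scree plot")
plt.show()
```

In practice, libraries usually compute PCA with the singular value decomposition rather than an explicit eigendecomposition, which is numerically more stable; the covariance route is used here only because it maps directly onto the steps listed above.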

Examples

To be continued…