Understanding Principal Component Analysis (PCA)

In machine learning or data analysis, a good first step with a new dataset is to visualize it, to see how the data are distributed and how the data points relate to one another. A dataset with two or three variables is easy to visualize with a 2-dimensional (2D) or 3-dimensional (3D) plot. However, most datasets have many more variables, so it is important to know how to visualize high-dimensional data.

I would like to discuss two popular techniques for visualizing high-dimensional datasets: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). This post focuses on PCA; t-SNE will be covered in a later post.

What is PCA

PCA tries to find a smaller set of new variables (principal components, PCs) that capture most of the variation in the dataset. A simple example makes this easier to understand.

[Figure 1: pca_what]

We can observe from the top of Figure 1 that some dimensions may be more important than others. In this case, we can take the 2-D data and display it on a 1-D graph without losing much information. Both graphs say, “the important variation is left to right”. Intuitively, a dimension with more variability explains more of what is happening in the data.
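To make the 2-D to 1-D idea concrete, here is a minimal sketch in plain Python. It is not taken from the figure: the toy data (wide spread left to right, narrow spread up and down) and the helper name `pca_2d` are assumptions made for illustration. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, which only works for two variables; for real datasets a library routine would be the usual choice.

```python
import math
import random

# Hypothetical toy data: 200 points spread widely along x and narrowly along y.
rng = random.Random(0)
points = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.5)) for _ in range(200)]


def pca_2d(points):
    """Return centered data, eigenvalues (largest first) and the two PC directions."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (half_trace + spread, half_trace - spread)

    # Angle of the direction with the largest variance (the first PC).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    return centered, eigenvalues, (pc1, pc2)


centered, eigenvalues, (pc1, pc2) = pca_2d(points)

# Project every 2-D point onto the first PC to get a 1-D representation.
scores_1d = [x * pc1[0] + y * pc1[1] for x, y in centered]

# Share of the total variance kept by the 1-D representation.
explained_ratio = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])

print("first PC direction:", pc1)
print("variance kept by the 1-D projection:", explained_ratio)
```

With data shaped like this, the first PC should point almost exactly along the horizontal axis, and the 1-D projection should keep nearly all of the variance, which is the "important variation is left to right" observation expressed in numbers.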