PCA on neural data

Posted on 05 July 2018 in Tutorials

In my previous tutorial PCA: from newbie to ninja I went through a bit of the theory behind PCA and showed how you can write a simple implementation in Python, using the famous Iris dataset as an example.

However, neural data generally has a more complex structure than the Iris dataset, because we have an additional dimension: time. Consequently, there are different ways in which we can use PCA to reduce its dimensionality. In this tutorial, I will present what I think are the main strategies to run PCA on neural data, discuss their strengths and weaknesses, and show you some cool ways to visualize the results. As an example, I will use a recording session of a population of neurons in mouse V1 obtained with chronic calcium imaging where the animals are presented with oriented drifting gratings.


Why PCA?

Admittedly, there are many fancier ways to do dimensionality reduction compared to vanilla PCA. Starting with the many flavors of PCA, such as Kernel PCA, we then have the whole suite of manifold learning algorithms (Isomap, Locally Linear Embedding, t-SNE, etc.), arriving to more modern tools like auto-encoders, and finally dimensionality reduction methods tailored specifically to neural data, such as Gaussian-Process Factor Analysis (GPFA), tensor component analysis (TCA), demixed principal component analysis (dPCA), sequential non-negative matrix factorization (seqNMF).

So why am I still talking about PCA? The truth is, whenever I am interested in performing dimensionality reduction on a new dataset, I generally start with PCA before moving on to more advanced methods. In my experience this is a good idea for a number of reasons:

  1. Chances are, you understand PCA better than something more advanced. This means you will be able to clearly interpret your results, and it also means more freedom to play around and experiment with it. Importantly, your audience will also most likely know what you are talking about, which is always a good place to start.
  2. You most likely will have access to a good implementation in your programming language of choice, which means it will be a matter of minutes to get it up and running (this may not always be the case for newer methods).
  3. There are virtually no parameters you need to tinker with, which makes the application very straightforward. With more advanced methods, you need to understand the impact of each parameter, and you may get different results for different settings. While this can be interesting for an extensive analysis, it may not be desirable if you are looking for a quick exploration of the data.
  4. PCA can serve as a good baseline to judge the advantages of using a fancier method. While it may certainly be the case that using a fancier method allows you to uncover properties of the data which PCA missed, it can also happen that the benefits of spending extra time and effort on a more advanced technique are not really worth it. So if you decide to move on to a fancier method, you are aware of the added value you are getting (or not) beyond simple PCA.

Neural data

When I say neural data, I specifically mean the case where record the activity of a population of neurons in parallel across a number of trials of potentially different types. Trials can consist in the presentation of different type of stimuli, and a behavioral response of the animal may or may not be required. In the example we will look at, mice are presented with visual stimuli consisting of drifting gratings, and the trial types correspond to the different orientations of these gratings. This tutorial is also restricted to the case where trials have roughly the same duration and stimulus-behavior sequence, so that we could average the activity of neurons across trials. In experiments with freely-moving animals and self-paced behavior, this is of course not the case, which makes the application of most dimensionality reduction methods a lot more tricky.

While of course these specifications, namely parallel recording, trial structure, and fixed within-trial structure, are not universal, they are very common in cognitive and systems neuroscience. Requiring fixed within-trial structure is probably the most restrictive condition, so perhaps I will dedicate a future tutorial specifically to this issue.

But why make a separate tutorial for PCA on neural data? Didn't I already cover PCA in my previous tutorial? Well, almost. With respect to the Iris dataset, the main added complexity here comes from the fact that neural data has an additional time dimension: for every trial, the activity of the neurons is recorded at several time points, so several samples (activity at a given time point) are nested within a single trial. This opens up a few possibility as to how we can come up with data matrix $X$ on which to fit PCA.

It is also important to distinguish between the two different operations we perform when we say that we 'apply' PCA. The first operation is where we fit the PCA decomposition. That is, we take the eigenvectors of the covariance matrix and construct the linear transformation $C$ of dimensions $N \times Q$ (number of features by number of components). In the second operation, we apply the linear transformation to reduce the dimensionality of a data matrix, by projecting (i.e. multiplying) the data matrix by the linear transformation $C$. As you can see, the matrix on which we fit PCA and the matrix we apply the transformation to do not need to coincide. All that matters is that they have the same number of features.

In the drawing below I illustrate what I found to be the most common ways to use PCA on neural data, what I call the trial-response form, trial-averaged form, trial-concatenated (or simply long) form, and the averaged-concatenated (or simply hybrid) form.

PCA on neural data: the different forms


The trial-response form

In the trial-response form, we reduce each trial to a single sample of shape $N \times 1$, where $N$ is the number of neurons we recorded. Then, we simply stack all these samples to produce a matrix $X_r$ of shape $N \times K$, where $K$ is the number of trials. There are several ways in which we can 'compress' the activity during the full trial to a single sample (the response). For instance, we can take the mean or maximum activity of each cell during the full trial (or during a certain epoch within the trial) as our response, and we can subtract a baseline from it (e.g. mean/maximum activity before stimulus) to focus our attention on the change in activity.


  • Simple and intuitive.


  • Disregards a lot of potentially interesting information by compressing the time dimension.
  • The choice of response measure may not be obvious, and may influence the result.

The trial-averaged form

In the trial-averaged form, we first average all trials of each type together, concatenate the trial averages, and then apply PCA to the resulting matrix.


  • Maintains a time dimension, allowing us to see the evolution of activity in the course of the trial.


  • Partially supervised.

The trial-concatenated (long) form

In the trial-concatenated form, we concatenate all trials, and apply PCA to the resulting matrix $X_l$, which has shape of shape $N \times TK$ (number of neurons by number of time points times number of trials).


  • Fully unsupervised.
  • Performs the dimensionality reduction at the level of single trials, so we get an idea of trial-to-trial variability in the reduction we obtain.


  • More sensitive to noise, potentially less able to pick up effects of interest.

The averaged-concatenated (hybrid) form

In the averaged-concatenated form, we fit PCA on the concatenated trial averages, like in the trial-averaged form, and then apply it to the matrix of concatenated trials, like in the trial-concatenated form.


  • The decomposition is more sensitive to the effect of interest (difference between trial types).
  • Performs the dimensionality reduction at the level of single trials.


  • Partially supervised.

A note on preprocessing

In my PCA tutorial I mentioned that we always want to center our data (i.e. subtract the mean from each feature) before running PCA. In the context of neural data, in my experience this is not always enough. In fact, what often happens, is that some neurons that fire a lot more than others, and they tend to monopolize the variance that PCA is trying to capture. Suppose we have a neuron which, after stimulus presentation, increases its firing from a baseline of 5 Hz to 6 Hz, and another one which increases firing from 50 Hz to 60 Hz. In both cases, the percentage change in firing is 20% ($100 \times \frac{6-5}{5}$ or $100 \times \frac{60-50}{50}$), however, PCA will not give the same importance to these neurons.

Let's code up a quick example. We begin by defining a function called center which accepts an array $X$ of shape $\#\,\mathrm{features} \times \#\,\mathrm{samples}$ and returns an array of the same shape where each row is mean-subtracted (has mean zero). This function is nothing more than a thin wrapper around Scikit-learn's StandardScaler which allows us to scale our arrays without always having to manually initialize the scaler and remember to transpose our array (the StandardScaler assumes that the features are on the columns, whereas in our data they are on the rows).

In [1]:
import numpy as np
from sklearn.preprocessing import StandardScaler

def center(X):
    # X: ndarray, shape (n_features, n_samples)
    ss = StandardScaler(with_mean=True, with_std=False)
    Xc = ss.fit_transform(X.T).T
    return Xc

X = np.array([[5., 6., 5.],
              [50., 60., 50.]]) # data array of shape n_neuron by n_time_points
X = center(X)
c = np.cov(X, rowvar=True) # covariance matrix
eig_vals, eig_vecs = np.linalg.eig(c)
srt = np.argsort(eig_vals)[::-1]
print('Sorted eigenvalues: \n{}'.format(eig_vals[srt]))
print('\nFirst eigenvector: \n{}'.format(eig_vecs[:, srt][:, 0]))
Sorted eigenvalues: 
[33.66666667  0.        ]

First eigenvector: 
[-0.09950372 -0.99503719]

As you can see from this example, the first component, which captures all of the variance (the second one has an eigenvalue of zero), assigns a weight of close to one to the second neuron, and virtually zero to the first (remember: the sign doesn't really matter). That is, PCA is 'listening' only to the second neuron, which due to its larger values is responsible for most of the overall variance. So potentially interesting modulations in the activity of the first neuron will almost inevitable be shadowed by the activity of the second one. While for certain types of data this may be absolutely fine, for the case of neural data this is problematic. We generally regard a 20% change in firing to be equally meaningful for neurons with both high and low baseline firing, and we would like our PCA to pay attention to all neurons, rather than focus only on a few high firing units.

To address this issue, we can (in addition to centering), scale the activity of our neurons such that it has unit variance, which is achieved by diving each neuron/feature by its standard deviation. This is what people call standardizing (I also refer to it as z-scoring, see Wikipedia). Using Scikit-learn's StandardScaler, we can essentially copy our previous function and set the with_std parameter to True. Let's copy the same example, but this time standardize our data:

In [2]:
def z_score(X):
    # X: ndarray, shape (n_features, n_samples)
    ss = StandardScaler(with_mean=True, with_std=True)
    Xz = ss.fit_transform(X.T).T
    return Xz

X = np.array([[5., 6., 5.],
              [50., 60., 50.]]) # data array of shape n_neuron by n_time_points
X = z_score(X)
c = np.cov(X, rowvar=True) # covariance matrix
eig_vals, eig_vecs = np.linalg.eig(c)
srt = np.argsort(eig_vals)[::-1]
print('Sorted eigenvalues: \n{}'.format(eig_vals[srt]))
print('\nFirst eigenvector: \n{}'.format(eig_vecs[:, srt][:, 0]))
Sorted eigenvalues: 
[3. 0.]

First eigenvector: 
[0.70710678 0.70710678]

Now the first eigenvector still captures all of the variance, but both neurons appear in it with the same weight (this makes sense, since once it is standardized, the activity of the two neurons is the same).

As an alternative to z-scoring, the features of the data can be scaled to all be within a certain range (commonly between 0 and 1), what is known as min-max scaling. In my experience, standardizing neural data always leads to better, more interpretable results than centering alone, whereas qualitative differences between z-scoring and min-max scaling are generally very small.

PCA on neural data: an example

The dataset

The dataset I will use as an example was recorded by Jorrit Montijn and Guido Meijer and is published in this Cell Reports paper. Neural activity is recorded with two-photon calcium imaging in mouse V1. Each trial consists of the presentation of a bidirectionally drifting square-wave grating with one of eight orientations. That is, for a given orientation, the grating will drift in one direction for 1.5 seconds, and the in the opposite direction for another 1.5 seconds. The sampling frequency of the recording is 12.7 Hz, and the inter-trial interval (ITI) is of 5 seconds. The measure of neural activity we use is the $dF/F$ traces, which measure the changes in fluorescence related to calcium transients.

In [5]:
import numpy as np
from orient.load_session import load_session
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy.ndimage.filters import gaussian_filter1d
import warnings
from matplotlib.lines import Line2D
from mpl_toolkits.mplot3d import Axes3D

frames_pre_stim = 30
frames_post_stim = 30

data, events = load_session(index=8, activity_measure='dFoF',

rec_freq     = 12.7
trial_type   = data['trial_type']
trial_types  = data['trial_types']
trials       = data['trial']
start_stim   = events['start_stim'] / rec_freq
end_stim     = events['end_stim'] / rec_freq
session_type = data['session_type']
time         = events['time'] / rec_freq
trial_size   = trials[0].shape[1]
Nneurons     = trials[0].shape[0]

# list of arrays containing the indices of trials of each type (t_type_ind[0] contains the
# indices of trials of type trial_types[0])
t_type_ind = [np.argwhere(np.array(trial_type) == t_type)[:, 0] for t_type in trial_types]
Loading session /media/pietro/bigdata/neuraldata/NaturalOrient/20170424xyt05_ses_43_orientat.mat

The format of this data is fairly simple:

  • trials is a list of $K$ Numpy arrays of shape $N \times T$ (number of neurons by number of time points).
  • trial_type is a list of length $K$ containing the type (i.e. the orientation) of every trial.
  • trial_types is a list containing the unique trial types (i.e. orientations) in ascending order.
In [6]:
print('Number of trials: {}'.format(len(trials)))
print('Types of trials (orientations): {}'.format(trial_types)) 
print('Dimensions of single trial array (# neurons by # time points): {}'.format(trials[0].shape))
print('Trial types (orientations): {}'.format(trial_types))
print('Trial type of the first 3 trials: {}'.format(trial_type[0:3]))
Number of trials: 80
Types of trials (orientations): [  0.   22.5  45.   67.5  90.  112.5 135.  157.5]
Dimensions of single trial array (# neurons by # time points): (159, 103)
Trial types (orientations): [  0.   22.5  45.   67.5  90.  112.5 135.  157.5]
Trial type of the first 3 trials: [157.5   0.   90. ]

If you organize your data in the same way, you will be able to use the code that I will present throughout the tutorial.

As a last preparatory step, I will define some plotting parameters and utility functions which to add the stimulus shading and legend to the plots.

In [7]:
shade_alpha      = 0.2
lines_alpha      = 0.8
pal = sns.color_palette('husl', 9)
%config InlineBackend.figure_format = 'svg'

def add_stim_to_plot(ax):
    ax.axvspan(start_stim, end_stim, alpha=shade_alpha,
    ax.axvline(start_stim, alpha=lines_alpha, color='gray', ls='--')
    ax.axvline(end_stim, alpha=lines_alpha, color='gray', ls='--')
def add_orientation_legend(ax):
    custom_lines = [Line2D([0], [0], color=pal[k], lw=4) for
                    k in range(len(trial_types))]
    labels = ['{}$^\circ$'.format(t) for t in trial_types]
    ax.legend(custom_lines, labels,
              frameon=False, loc='center left', bbox_to_anchor=(1, 0.5))

Trial-response PCA

Here, I take as my response the mean activity during stimulus presentation, so that for each trial I end up with a single sample, and stack all the samples together. By calling the shortcut method fit_transform, I first fit PCA on Xr, and then use the extracted components to project Xr.

Our projected data has as many dimensions as there are PCA components, so we can pick a few of the first dimensions and plot them against each other. Every dot in the scatter corresponds to a single trial.

In [8]:
Xr = np.vstack([t[:, frames_pre_stim:-frames_post_stim].mean(axis=1) for t in trials]).T

# or take the max
#Xr = np.vstack([t[:, frames_pre_stim:-frames_post_stim].max(axis=1) for t in trials]).T

# or the baseline-corrected mean
#Xr = np.vstack([t[:, frames_pre_stim:-frames_post_stim].mean(axis=1) - t[:, 0:frames_pre_stim].mean(axis=1) for t in trials]).T

Xr_sc = z_score(Xr)

pca = PCA(n_components=15)
Xp = pca.fit_transform(Xr_sc.T).T

projections = [(0, 1), (1, 2), (0, 2)]
fig, axes = plt.subplots(1, 3, figsize=[9, 3], sharey='row', sharex='row')
for ax, proj in zip(axes, projections):
    for t, t_type in enumerate(trial_types):
        x = Xp[proj[0], t_type_ind[t]]
        y = Xp[proj[1], t_type_ind[t]]
        ax.scatter(x, y, c=pal[t], s=25, alpha=0.8)
        ax.set_xlabel('PC {}'.format(proj[0]+1))
        ax.set_ylabel('PC {}'.format(proj[1]+1))
sns.despine(fig=fig, top=True, right=True)

It is clear that some of the PCA components are picking up differences between trial types (i.e. orientation). This is not surprising: we know that neurons in V1 are tuned to orientation, meaning that they respond preferentially to gratings inclined a certain angle, so we expect population responses to angles close to each other to be similar and close to each other in PCA space. This is true in particular for the second and third components (PC 2 and PC 3), whereas PC 1 seems to capture variance that is mostly not related to orientation tuning: as we move along the PC 1 axis we see no clear change in orientation.

Trial-averaged PCA

For the trial-averaged PCA, we first compute the trial averages for each trial type and concatenate them:

In [9]:
trial_averages = []
for ind in t_type_ind:
Xa = np.hstack(trial_averages)

We then fit and apply PCA to the resulting data matrix.

In [10]:
n_components = 15
Xa = z_score(Xa) #Xav_sc = center(Xav)
pca = PCA(n_components=n_components)
Xa_p = pca.fit_transform(Xa.T).T

And lastly we plot the data projected onto the first three components over time. To keep the example a bit simpler, I will only focus on the first three components in the tutorial, although of course it may be useful to consider more components as well, judging from their variance explained and their trajectories.

In [11]:
fig, axes = plt.subplots(1, 3, figsize=[10, 2.8], sharey='row')
for comp in range(3):
    ax = axes[comp]
    for kk, type in enumerate(trial_types):
        x = Xa_p[comp, kk * trial_size :(kk+1) * trial_size]
        x = gaussian_filter1d(x, sigma=3)
        ax.plot(time, x, c=pal[kk])
    ax.set_ylabel('PC {}'.format(comp+1))
axes[1].set_xlabel('Time (s)')
sns.despine(fig=fig, right=True, top=True)
plt.tight_layout(rect=[0, 0, 0.9, 1])

The first thing we notice is that differences in orientation are represented very well in the second and third components. The first component, on the other hand, appears to be mostly involved in a global transient (the trajectories for the different trial types are mostly overlapping). This is not so strange: while orientation tuning may be the characteristic of neural activity that we care about, it is quite likely that the main source of variance in the data is due to the fact that there is a stimulus at all. The sudden increase in luminance of the screen will elicit an overall increase in population activity, which is then modulated by the orientation. In my experience, it is not uncommon for the first or first few PCA components to capture changes in activity which do not depend on the specific trial type.

At this point, it can be interesting to ask which units PCA combines to obtain these components. Since every component is nothing more than a linear combination of the neurons in the population (a vector which assigns a weight to each neuron). So we can ask which units feature more prominently in a given component as a way to spot units with an interesting behavior1.

In [23]:
# find the indices of the three largest elements of the second eigenvector
units = np.abs(pca.components_[1, :].argsort())[::-1][0:3]

f, axes = plt.subplots(1, 3, figsize=[10, 2.8], sharey=False,
for ax, unit in zip(axes, units):
    ax.set_title('Neuron {}'.format(unit))
    for t, ind in enumerate(t_type_ind):
        x = np.array(trials)[ind][:, unit, :]
        sns.tsplot(x, time=time,
for ax in axes:
axes[1].set_xlabel('Time (s)')
sns.despine(fig=f, right=True, top=True)

Trial-concatenated PCA

For trial-concatenated PCA, we simply stack all the trials (the order doesn't matter).

In [11]:
Xl = np.hstack(trials)

We then standardize the resulting matrix Xl, and fit and apply PCA to it.

In [12]:
Xl = z_score(Xl)
pca = PCA(n_components=15)
Xl_p = pca.fit_transform(Xl.T).T

Our projected data Xl_p is in the form of a $Q \times TK$ array (number of components by number of time points times number of trials). To plot the components, I rearrange it into a dictionary:

In [13]:
gt = {comp : {t_type : [] for t_type in trial_types} for comp in range(n_components)}

for comp in range(n_components):
    for i, t_type in enumerate(trial_type):
        t = Xl_p[comp, trial_size * i: trial_size * (i + 1)]

for comp in range(n_components):
    for t_type in trial_types:
        gt[comp][t_type] = np.vstack(gt[comp][t_type])

Now, accessing the dictionary as gt[component][orientation] returns an array of all trials of the selected orientation projected on the selected component. We then plot the projections of each trial on the first three components.

In [14]:
f, axes = plt.subplots(1, 3, figsize=[10, 2.8], sharey=True, sharex=True)
for comp in range(3):
    ax = axes[comp]
    for t, t_type in enumerate(trial_types):
        sns.tsplot(gt[comp][t_type], time=time, ax=ax,
    ax.set_ylabel('PC {}'.format(comp+1))
axes[1].set_xlabel('Time (s)')
sns.despine(right=True, top=True)

Averaged-concatenated PCA

In principle, trial-concatenated PCA is the most unsupervised option, because PCA gets no information about the trial types that we have set up. It is only after we have reduced the dimensionality of the full data that we split trials up by type to compare them.

Another added benefit is that we perform dimensionality reduction at the level of single trials. This allows to reason about differences between individual trials which may reflect trial-to-trial variations in attention, arousal, or other variables. Furthermore, in the trial-averaged PCA we only project the trial averages, so we don't really know if the trajectories of neural activity corresponding to two different stimuli are actually different. In trial-concatenated PCA, by plotting confidence intervals, we can see to what extent the trajectories of different stimuli are overlapping, and get a measure of trial-to-trial variability.

While this sounds great, in my experience this decomposition can be quite affected by noise. In fact, for the second component (PC 2), we see huge confidence intervals, meaning that the projections of different trials are largely overlapping across trial types.

In the hybrid averaged-concatenated PCA, we fit PCA on trial averages, like in trial-averaged PCA, and project individual trials on the extracted principal components, like in trial-concatenated PCA. This allows for a more powerful (and more supervised) decomposition, more likely to retrieve a certain structure in the data the we are interested in, while at the same it preserves single trial information.

The preprocessing needs a bit of attention here: in fact, it is important that we use the same scaling transformation we apply to the trial averages for the individual trials (standardizing every trial independently would give us completely wonky results).

In [15]:
# fit PCA on trial averages
trial_averages = []
for ind in t_type_ind:
Xav = np.hstack(trial_averages)

ss = StandardScaler(with_mean=True, with_std=True)
Xav_sc = ss.fit_transform(Xav.T).T
pca = PCA(n_components=15) 
pca.fit(Xav_sc.T) # only call the fit method
PCA(copy=True, iterated_power='auto', n_components=15, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [16]:
projected_trials = []
for trial in trials:
    # scale every trial using the same scaling applied to the averages 
    trial = ss.transform(trial.T).T
    # project every trial using the pca fit on averages
    proj_trial = pca.transform(trial.T).T

Again, I do a bit of rearranging of the projected data:

In [17]:
gt = {comp: {t_type: [] for t_type in trial_types}
      for comp in range(n_components)}

for comp in range(n_components):
    for i, t_type in enumerate(data['trial_type']):
        t = projected_trials[i][comp, :]

And again plot the trials over time on the first three components:

In [18]:
f, axes = plt.subplots(1, 3, figsize=[10, 2.8], sharey=True, sharex=True)
for comp in range(3):
    ax = axes[comp]
    for t, t_type in enumerate(trial_types):
        sns.tsplot(gt[comp][t_type], time=time, ax=ax,
    ax.set_ylabel('PC {}'.format(comp+1))
axes[1].set_xlabel('Time (s)')
sns.despine(right=True, top=True)

Now, as you can see, the separation between trials of different orientations projected on the second (and third) component is much better than in the trial-concatenated PCA, meaning that these components are better at capturing the orientation differences. This is not surprising: we have deliberately biased PCA towards this structure in the data (by fitting on trials averaged by orientation). But reducing the dimensionality of individual trials allows us to see whether this structure actually holds up at the level of single trials, or whether it is just an artifact of averaging. The fact that the projections of individual trials are actually well separated by trial type in PCA space is a good indication that this structure is present in the data.


Armed with these tools, we can get creative on how we visualize our PCA decompositions.

Animated scatter

The first plot of this tutorial, for the trial-response PCA, was a scatter plot, where each dot represents a trial in the space of the first two components. But we can do better than that! I will use the hybrid averaged-concatenated PCA (but the long form would work as well) to plot a dot per trial at every time point.

To generate the animation, I use the Matplotlib's animation module. The main idea is that we set up a figure and an empty scatter plot. Then we write a function (animate, in this case), which will get called at every iteration, and will replace the x and y position of the scatter plot with the x and y position of every trial in a PCA subspace at a given point in time. The animation is generated by calling the FuncAnimation method, which takes as input a figure and the animation function (plus some other parameters).

In [19]:
# import the animation and the HTML module to create and render the animation
from matplotlib import animation 
from IPython.display import HTML

# smooth the single projected trials 
for i in range(len(projected_trials)):
    for c in range(projected_trials[0].shape[0]):
        projected_trials[i][c, :] = gaussian_filter1d(projected_trials[i][c, :], sigma=3)

# for every time point (imaging frame) get the position in PCA space of every trial
pca_frame = []
for t in range(trial_size):
    # projected data for all trials at time t 
    Xp = np.hstack([tr[:, None, t] for tr in projected_trials]).T
subspace = (1, 2) # pick the subspace given by the second and third components
# set up the figure
fig, ax = plt.subplots(1, 1, figsize=[6, 6]); plt.close()
ax.set_xlim(( -16, 16))
ax.set_ylim((-16, 16))
ax.set_xlabel('PC 2')
ax.set_xticks([-10, 0, 10])
ax.set_yticks([-10, 0, 10])
ax.set_ylabel('PC 3')
sns.despine(fig=fig, top=True, right=True)

# generate empty scatter plot to be filled by data at every time point
scatters = []
for t, t_type in enumerate(trial_types):
    scatter, = ax.plot([], [], 'o', lw=2, color=pal[t]);

# red dot to indicate when stimulus is being presented
stimdot, = ax.plot([], [], 'o', c='r', markersize=35, alpha=0.5)

# annotate with stimulus and time information
text     = ax.text(6.3, 9, 'Stimulus OFF \nt = {:.2f}'.format(time[0]), fontdict={'fontsize':14})

# this is the function to be called at every animation frame
def animate(i):
    for t, t_type in enumerate(trial_types):
        # find the x and y position of all trials of a given type
        x = pca_frame[i][t_type_ind[t], subspace[0]]
        y = pca_frame[i][t_type_ind[t], subspace[1]]
        # update the scatter
        scatters[t].set_data(x, y)
    # update stimulus and time annotation
    if (i > frames_pre_stim) and (i < (trial_size-frames_post_stim)):
        stimdot.set_data(10, 14)
        text.set_text('Stimulus ON \nt = {:.2f}'.format(time[i]))
        stimdot.set_data([], [])
        text.set_text('Stimulus OFF \nt = {:.2f}'.format(time[i]))
    return (scatter,)

# generate the animation
anim = animation.FuncAnimation(fig, animate, 
                               frames=len(pca_frame), interval=80, 

HTML(anim.to_jshtml()) # render animation

Once Loop Reflect