In this post I am going to show you more about theoretical and practical concepts related to exploratory data analysis (EDA). So far I only depicted a broad sketch of what this stage implies but now I will go a bit further to show you the initial steps of an important branch of EDA called Spectral Density Estimation (SDE).
Spectral density estimation is, as its name indicates, an estimation of the probability density function (pdf) of a time series in the frequency domain. This means that the purpose of the method is to extract the information that allows the characterization of the distribution of the data in an specific subdimension as is the frequency dimension (also know as spectral dimension. If you are curious you can track the origin of the word. Spectrum was used for frequencies of light that generates images, direct translation from Latin, but as optics techniques were applied to other fields involving waves the term was extrapolated). But what is special about that dimension? Well the truth is that is that to go to that dimension we need only a linear transformation from the time domain (the well known Fourier Transform) and what provides is a quick overview of the phenomena that happen periodically, that is, that happen with a certain frequency. In the long term this is a key concept for the model of the reality we want to have. Imagine the rotation of the planets, they have a characteristic frequency that could be estimated with this method and afterwards predicted to send satellites or other spacecraft with enough confidence and success.
Wait a minute! You are a neuroscientist and for planets it seems a reasonable technique but, what does it have to do with the brain! Let's face the question. The frequency domain it is interesting at many levels and in many scientific fields. In the case of the neurons we have some signals like the voltage of their membrane, the electromagnetic field that this voltage generate in the extracellular regions (let's say by now between neurons...), the BOLD (Blood-Oxygen-Level Dependent), etc. that we think are reflecting physiological phenomena that could be related with the cognition and behavior of the living beings. In simple words, what we think, feel, perceive and remember might be encoded and decoded based on the different frequencies or rhythms of the brain.
As data scientist one of the first things we have to face is to know our data set and the system under study. This will impose some constrains that seem logical in study and that will simplify the exploration of the data. Think about that as the boundary conditions of a ODE (ordinary differential equation), without them the solution problem is not possible. Then, as an initial assumption on the data within the spectral estimation, I am going to claim that the data under study will be an stationary stochastic process (SSP). The data we work with usually will be stochastic or random. As opposite to deterministic processes the stochastic ones assume that there is a level of randomness in the system that makes the signal could be described only in a probabilistic way, i.e. I can say that the signal has a property only with a certain degree of confidence or likelihood. The data we work with also is assumed to be stationary. This constrain is a strong one...rarely in nature we find anything stationary, as Heraclitus noticed long time ago πάντα χωρεῖ καὶ οὐδὲν μένει; all flows, nothing stays. But if we observed the nature for a period of time shorter than changing rate of the phenomena we observe we can likely have a good estimation of that phenomena. For instance, if I want to know the state of your brain during sleep it would be much better to sample it for 8 hours (the average time of sleep of humans) than observe it for 15 hours, because in the last case it is likely I get signals from the awake period. Another option to increase the stationarity of our signal is to observe the phenomena always with the same external conditions. If I observe not only for 8 hours but always starting at 23:00 in your bedroom it is likely I can get similar phenomena night after night and get a good estimator of your neural activity during sleep.
[..]as Heraclitus noticed long time ago πάντα χωρεῖ καὶ οὐδὲν μένει; all flows, nothing stays.
Once we know the signals we are dealing with it is time to get hands on the processing. Let's imagine we have our time series, which is a SSP, that looks like in the picture below in blue. This signal has been sampled with a sampling rate of 30 kHz and represents the breathing activity of a subject.
According to the Fourier theory it is known that the spectral distribution can be expressed as:
Where are the data points of our signal and is the representation of the signal in the spectral domain. So easy! Well, in deed I left appart many mathematical considerations on the convergence and other topics but in general terms the former equation describe the translation of the temporal domain to the spectral domain. The signal is what I called so far spectral density! And it is a deterministic function!
Did I lie? No. This is a mathematical construct and we know we can not reach or in the real world. Imagine that in our case we can have only the samples in the periods delimited with red bars in the previous figures and that are called windows. In deed does not matter how big could be the window in your case because it will be always smaller than the one defined by . In this case we have a lack of information of the underlying distribution and hence the signal is stochastic and the best we can do would be to get an estimator through the truncated version of the previous expression:
Having our estimator with our T being the period were we can observe and record the signal. Visually this could be summarized in the next figure.
From Fourier theory it is also known that:
This means that our temporal signal can be recovered from the frequency domain as the integral. Why the interval ? You are right this makes no sense but let me explain it to you. First thing you may complain about is why the interval is not as before with the time. Well here it comes another constrain or assumption on the data: usually is fruitful to consider the data bandlimited. This is not a terrible wrong assumption, it only reflects the fact that the information of the signal is in a specific range of frequencies and the others do not contribute to the system. Think about the rotation of the earth, the frequency is 1/T, with T=24 hours. Any other frequency will not have a special impact in the rotation of the earth and hence is bandlimited. Because the EDA is by definition exploratory we should not constrain this range too much but it makes a lot of sense to do it for the sake of the analysis. In the case of neuroscience think about the refractory period in which a neuron can fire. Beyond the frequency defined by this physiological phenomena makes no sense to consider further frequencies. Hence one can think as the frequencies lying on a range of ; and if we normalized by the range could be defined as . One last fact of the frequency domain is that is a periodic domain i.e. is a periodic function and hence the information lying in is the same as the one lying in the interval . Fair enough? Let's continue the end of the post is close.
So far we have that:
Combining the second in the expression of our estimator, , we have:
I know, this may be hard but believe me that if you manage to get the whole idea you will get a good understanding of the spectral estimation and a better data scientist.
So what we have is that our spectral density estimator is equal to the underlying spectral density (the ground truth) multiplied by a shifted version of a special function and integrated in the frequency domain. It seems apparent that in this case the similarity between our spectra density estimator and the real spectral distribution will depend to a large degree on this special function. This term is called Dirichlet kernel. Its shape depends on the parameter T, the duration of our observation, as you can see in the next figure.
Then, how this situation affects our estimator, i.e. how this situation affects our capacity of getting a confident representation of our data in the spectral domain? This question and other points will be clarified in the next post.
If you have any doubt about this post, please,ask freely in the comments. In the next days I will upload the python code I used to generated all the plots to my github and hence you can reproduce these examples.
Now, time to solve!