Time Series Classification using a Frequency Domain EM Algorithm

Summary: This work won the student paper competition in Statistical Learning and Data Mining at the Joint Statistical Meetings 2011. You can find “A Frequency Domain EM Algorithm for Time Series Classification with Applications to Spike Sorting and Macro-Economics” on arXiv; it is also published in Statistical Analysis and Data Mining (SAM).

Let’s say you have n time series and you want to classify them into groups of similar dynamic structure. For example, you have time series of per-capita income for all lower 48 US states and you want to classify them into groups. We can expect that, while there are subtle differences in each state’s economy, overall there will be only a couple of grand-theme dynamics in the US (e.g., the east coast and the mid-west probably have different economic dynamics). There are several ways to classify such time series (see the paper for references).

I introduce a nonparametric EM algorithm for time series classification by viewing the spectral density of a time series as a density on the unit circle and treating it just like a plain pdf. And what do we do to classify data in statistics/machine learning? We model the data as a mixture distribution and find the classes using an EM algorithm. That’s what I do too – but I use it on the spectral densities and periodograms rather than on the “true” multivariate pdf of the time series. Applying my methodology to the per-capita income time series we get 3 clusters, and a map of the US shows that these clusters also make sense geographically.



May the ForeC be with you: R package ForeCA v0.2.0

I just submitted a new, majorly improved ForeCA R package to CRAN. Motivated by a bug report on whiten(), I went ahead and rewrote and tested many of the main functions in the package; ForeCA is now shinier than ever.

For R users there isn’t a lot that will change (changelog): just use it as usual as foreca(X), where X is your multivariate (approximately) stationary time series (as a matrix, data.frame, or ts object in R).


ret <- ts(diff(log(EuStockMarkets)) * 100) 
mod <- foreca(ret, spectrum.control = list(method = "wosa"))

I will add a vignette in upcoming versions.

I did it the Lambert Way

Finally, after years of struggling to convince reviewers that the Lambert W function is indeed a useful mathematical function, I published a sequel to the original Lambert W paper: this time it’s about heavy tails – how to model them, but also how to remove them from data. The paper is entitled “The Lambert Way to Gaussianize heavy-tailed data with the inverse of Tukey’s h transformation as a special case”.

For those of you who know Tukey’s h distribution: heavy-tail Lambert W x F distributions are a generalization of it, and I show the explicit inverse (even though some reviewers – I think – don’t want to acknowledge this, because they have worked on it previously and deemed it impossible).

GMG goes Google

After my internship at Google NYC in the summer of 2011, I eventually decided to join Google full time at the beginning of 2013. I’m a statistician in the quantitative marketing (QM) team – a great mix of (mostly) PhD statisticians together with data analysts, machine learners (is that what they/we are called?), engineers, and business/marketing people. We work on an interesting range of projects at the intersection of sales & marketing and engineering, ranging from recommendation systems and time series prediction/classification to network analysis and longitudinal data analysis.

As part of a research team, we also have ‘publish papers’ on our agenda. You can find (part of) my past and some current work at research.google.com (expect a certain latency with what I am currently working on).

Disclaimer: While being at Google full time does of course limit the time I can spend on previous research (and software), I’m still finishing up papers and code implementations from my time at CMU. So if you are interested in these areas, keep coming back for updates – or just send me an email.

Escaping the blog-comment-destruction trap

Some clarifications to cool down the heat over economics-public health paper

Normally my work deals with topics that the general public would consider boring or put in the “Who cares?” section.

The poverty trap paper, however, generated quite a buzz on several sites – and in their comment sections.

In hindsight, it’s not too surprising for such a controversial topic as economics and health / development. Reading the comments showed me that Larry was right (again!) in his proposal for the future of publishing¹.

After reading those comments I think it’s necessary to re-iterate and clarify some points:

  • First and foremost, this work is a proof of concept. We wanted to show that a dynamic, deterministic model of economy and health can lead to poverty traps if certain (idealized) steps are taken. And we did.
  • This is a model. Thus, like every model, it makes simplifying assumptions – and we make a lot of very strong simplifying assumptions! Some of these assumptions may strike some readers as unrealistic, while others are less problematic. But it is always under these assumptions that the conclusions of the model must be evaluated.
    If someone does not agree with the assumptions, then they don’t have to agree with the conclusions. That’s valid! – No harm done.
  • The estimates we get out of our models, especially those related to costs and cost reduction, must not be taken literally. More than once have I read: “This work shows that drugs only need to cost 50 cents and then poor countries can get out of poverty.” No, this is not what we intend to say. We say that in our very simplified model, with several assumptions and estimates serving as input, the simulations spit out 50 cents as the value at which poverty traps form. However, we did not put all this time and effort into this work to show that drugs should cost 50 cents, but to show that poverty traps are in principle possible.
  • A commonly seen criticism / attempted debunking of our work is “correlation is not causation”. Of course we are aware of this misuse of observational data and statistical inference. However, our causal statements are made solely on the basis of a dynamic model of the economy and the health state, and how they evolve over time in an intertwined manner. The regression fits only parameterize the curves of sanitation and nutrition as a function of capital. I think most people agree that sanitation and nutrition increase with capital (d / d K s(K) > 0 and d / d K n(K) > 0). What we will not agree on is how exactly they are related:
    • If you know the functional form, please let us know— we were desperately trying to be enlightened. No success.
    • Rather than rolling a die and using ad-hoc / random parameters, we used data from the OECD and the UN to estimate a reasonable functional form. The fits have the agreed-upon property of monotonic increase as a function of capital, and that’s all we wanted (we also tried other curve shapes; the results didn’t change).
    • If you disagree with the functional form we use, feel free to use your own and run our models (we put very detailed instructions in the paper; we might even publish our code once cleaned and commented well).
  • Lots of development aid money does not reach the people in need. Correct. But for the point of the paper this does not make a difference. When we talk about cutting costs in half, we are aware that it might take more money put into the system for the 50% cost reduction to be achieved. Just view the development aid as effective development aid (i.e., the part that is not lost in the bureaucracy / management of its infrastructure or lost to corruption).

We will include these points in an updated, reviewed version of this paper (it will be updated on arxiv, and also for the reviewing process.)

  1. Researchers put their work on publicly available sites (such as arxiv.org), the whole world can discuss it (and suggest improvements), and then once the paper is corrected/reviewed thoroughly by the crowd, then publishers would approach the researcher and get them on board for a journal submission. This would lead to much improved research, papers, and work in general.

Productivity at its best: Santa Fe Summer School on Complexity

In the summer of 2012 I had the privilege to be part of the Santa Fe Summer School on Complexity. I learned a lot, ‘worked’ a lot (Ben, Oscar, and Laurent were very dedicated collaborators ;)), and made great friends.

I knew it would be an intense time with classes from dawn to — beautiful New Mexico — dusk; 5 days a week; for 4 weeks. Topics ranging from physics, chemistry, biology, economics, psychology, statistical inference and machine learning, … And on top of that we would work on group projects with a final presentation to the faculty and group.

If anyone had told me that I’d get a paper out of this, I would have said ‘no way, there’s too much going on there. And besides, these are mostly mathematicians, physicists, and epidemiologists … How would I write a paper with them?’

Well, it turned out that Oscar, Laurent, and Ben (math, physics, and epidemiology) and I (stats) worked not on one, but on 4 (!) papers. Although my paper was the first to go out for review, it’s the last one still in the reviewing / soon-to-be-published aether. The others have done their homework:

The work I was mainly responsible for is in submission and can be found on arxiv:

Three sentence summary

We show that, in a dynamic, deterministic model of capital accumulation and disease spread, poor countries can get stuck in a cycle of not having enough money (producing capital) to treat their people, which in turn leads to less labor due to sickness and even less capital. Rich countries, on the other hand, don’t have this problem, since their improved sanitation infrastructure and nutrition help to contain the disease at lower costs. As an exit strategy for a poor country, we show that development aid in the form of reduced drug and treatment costs (effectively injecting capital into the economy from outside) can get poor countries back on track to capital gains and improved health – and thus out of the poverty trap.

For a follow up post on the reaction to this paper in the public (outrage) see this post.

References on the Lambert W function

Here I gather a list of statistical and non-statistical references on the Lambert W function.


Links are in no particular order (yet). The number in parentheses gives the year of the version I could find; it’s not necessarily the current state of the linked work.

If any link is broken, please email me so I can fix it.



Thesis research machine learning lunch talk video

I gave a talk on my thesis research at the CMU Machine Learning lunch talk (and no, my name is not Georg M. Georg). It was a lot of fun and a great audience. The talk was recorded and is now available online at vimeo.com/53233543.

Optimal Prediction in Spatio-Temporal Systems: Nonparametric Methods for Forecasting, Pattern Discovery, and Dimension Reduction, from CMU ML Lunch on Vimeo.

This is not white noise! – Ceci n’est pas bruit blanc!

Disclaimer: This is an informal summary of Goerg (2012), “Testing for white noise against locally stationary alternatives”, Statistical Analysis & Data Mining, 5 (6). You can find a draft version on academia.edu. Since you have come to this site, I assume you know about time series modeling and prediction – so I spare you the technical details and formal definitions here.


This is not white noise!

Figure: “This is not white noise.” Rene Magritte’s 2 cents on time series residual checks (see MagrittePipe).

I am sure you have seen many of those residual-check figures in textbooks, papers, etc., and have also produced them yourselves. However, these plots can be misleading: they do not necessarily mean that the series is indeed white noise; in fact, they only show that it is white noise on average.

What does that mean? Well, what actually happens is that the sample estimate of the autocorrelation function (ACF), \widehat{\rho}(k), from the entire sequence x_1, \ldots, x_T converges to the temporal average of a possibly time-varying correlation structure \rho(k; u), where u = t/T \in [0,1] is time rescaled (from [0, T]) to the unit interval.
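In symbols: under local stationarity,

    \[\widehat{\rho}(k) \xrightarrow{T \to \infty} \int_0^1 \rho(k; u) \, du,\]

so a flat sample ACF only tells you that the time-averaged autocorrelation is zero at each lag.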

This means that when you fit a model, plot the sample ACF of the residuals, and all lags k \geq 1 are below the significance line, this only shows that your model removes autocorrelation on average; you still have to check whether your data has time-varying dynamics.


Let’s take an example from the paper. Consider the locally-stationary moving-average process of order 1 (LS-MA(1)):

    \[x_{t/T} = \varepsilon_{t/T} + \theta_1(u - 1/T) \varepsilon_{(t-1)/T}, \quad \varepsilon_{t/T} \sim N(0, \sigma(u)^2)\]


and \theta_1(u) has a time-varying structure over normalized time u = t/T \in [0,1], and \sigma(u)^2 = (1 + \theta_1(u)^2)^{-1} is the inverse of the usual MA(1) process variance. This choice of variance ensures that the process x_t has unit variance over time (if you want a different variance, just multiply the result by the square root of that target variance).
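To see why this choice of \sigma(u)^2 yields unit variance, plug it into the local MA(1) variance:

    \[\text{Var}(x_{t/T}) = \sigma(u)^2 \left(1 + \theta_1(u)^2\right) = \frac{1 + \theta_1(u)^2}{1 + \theta_1(u)^2} = 1.\]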

yy <- lsarma11.sim(n = 1000, theta = c(0.5, -1), sigma = "constant")
layout(matrix(1:2, ncol = 2))

Figure: Simulation of an LS-MA(1) with a time-varying \theta_1(u) such that it becomes a spurious white noise process.

The functional form of \theta(u) was parametrized by (0.5, -1), which means that it is a linear polynomial of u with coefficients

    \[\theta_1(u) = 0.5 - 1 \cdot u\]


Recall that the autocovariance function of an MA(1) process is non-zero only at lag k = 1, with \gamma(k = 1) = \sigma^2 \theta_1. Now let’s add the normalized time back in and get \gamma(k, u) = \sigma^2(u) \cdot \theta_1(u).

The important part here is that \sigma^2(u) \theta_1(u) integrates to 0 over time u \in [0, 1]: \int_0^1 \sigma^2(u) \theta_1(u) du = 0, since \sigma(u)^2 is an even function around u = 0.5 while \theta_1(u) is odd around u = 0.5, so their product is odd around 0.5 and its integral over [0, 1] vanishes. This means that this particular LS-MA(1) process has an autocovariance function that averages out to 0 for all k > 0. Using the terminology of Goerg (2012), this particular LS-MA(1) process is a spurious white noise process.
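For \theta_1(u) = 0.5 - u this can be checked directly. Substituting v = u - 1/2,

    \[\int_0^1 \sigma^2(u) \theta_1(u) \, du = \int_0^1 \frac{\theta_1(u)}{1 + \theta_1(u)^2} \, du = \int_{-1/2}^{1/2} \frac{-v}{1 + v^2} \, dv = 0,\]

since the integrand is odd in v.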

Testing for white noise and a proposal for a new white noise test

Why does that matter in practice? Well, it is spurious in the sense that the sample autocorrelation function (ACF) \widehat{\rho}(k) tends to the ACF of white noise, \rho_{WN}(k) = \mathbf{1}(k = 0), as T \rightarrow \infty. This means that for a large enough sample the sample ACF is indistinguishable from that of white noise. This is bad, especially if we use Portmanteau-style white noise tests like the Box-Ljung test.

Box.test(yy, lag = 5, type = "Ljung-Box")
##  Box-Ljung test
## data:  yy
## X-squared = 5.5603, df = 5, p-value = 0.3514

Using this popular white noise test we wouldn’t reject white noise (p-value: 0.3513767) – even though we know it is not white noise. This shows that the Ljung-Box test has very low power against locally stationary alternatives.

In the paper I propose a simple windowing approach to improve the power against such locally stationary alternatives: the locally stationary Ljung-Box test.

Box.test.ls(yy, lag = 5, K = 2, type = "Ljung-Box")
##  LS Ljung-Box test; Number of windows = 2; non-overlapping window
##  size = 500
## data:  yy
## X-squared = 47.813, df = 10, p-value = 6.715e-07

Here, even with just two windows (K = 2), we can detect the non-stationarity.
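For readers outside of R, the same effect can be reproduced with a few lines of Python (a sketch I wrote for illustration; the hand-rolled ljung_box helper is mine, not part of the paper’s code). It simulates the LS-MA(1) with \theta_1(u) = 0.5 - u and compares a global Ljung-Box test to tests on the two non-overlapping halves:

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, lag):
    """Ljung-Box Q statistic and p-value for lags 1..lag."""
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    rho = np.array([np.dot(xc[:n - k], xc[k:]) / denom
                    for k in range(1, lag + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, lag + 1)))
    return q, chi2.sf(q, df=lag)

rng = np.random.default_rng(1)
n = 5000
u = np.arange(n) / n
theta = 0.5 - u                            # theta_1(u) = 0.5 - u
sigma = np.sqrt(1.0 / (1.0 + theta ** 2))  # keeps Var(x_t) = 1 over time
eps = rng.standard_normal(n)

# LS-MA(1): x_t = sigma(u) eps_t + theta_1(u - 1/T) sigma(u - 1/T) eps_{t-1}
x = sigma * eps
x[1:] += theta[:-1] * sigma[:-1] * eps[:-1]

q_all, p_all = ljung_box(x, lag=5)                           # whole series
p_halves = [ljung_box(w, lag=5)[1] for w in np.split(x, 2)]  # two windows
```

With the local dependence positive in the first half and negative in the second, each half should reject decisively, while the pooled statistic averages the two regimes away.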

See the paper for details – and from now on, beware when someone shows you an ACF plot with a Ljung-Box test to justify that their model produces white noise residuals.

Note: the locally stationary deviation from white noise might not always matter for the application at hand. However, to understand the dynamics of the process in principle – and to get even better forecasts – you should care about it.

Update: It is true that this might look like a merely theoretically interesting case that is never seen in practice. I agree it is unlikely to see such spurious white noise in a real-world dataset directly, but you will see it in the residuals of almost any stationary model fit to such data. Without going into too much theoretical detail, just imagine a general time-varying model \theta(u) with a time-varying covariance function \gamma_{\theta}(k, u). If you fit a constant model with parameter \widehat{\theta}, this estimate will be close to the time average of the true model \theta(u); naturally, the residual dependence will be close to \theta(u) - \widehat{\theta}, which by construction of \widehat{\theta} integrates out to zero (approximately).

Let’s look at an example

library(data.table)  # fread()
library(fracdiff)    # fracdiff(), diffseries()

tree.ring <- ts(fread(file.path(data.path, "tree-rings.txt"))[, V1])
layout(matrix(1:4, ncol = 2))

mod.arfima <- fracdiff(tree.ring)
mod.arfima$d
## [1] 0.236507
arfima.res <- diffseries(tree.ring, mod.arfima$d)

Looks good, right? Well, the issue is that the residuals are spurious white noise. How do I know? First, I can test it:

Box.test(arfima.res, type = "Ljung-Box")
##  Box-Ljung test
## data:  arfima.res
## X-squared = 1.8757, df = 1, p-value = 0.1708
Box.test.ls(arfima.res, K = 4, type = "Ljung-Box")
##  LS Ljung-Box test; Number of windows = 4; non-overlapping window
##  size = 497
## data:  arfima.res
## X-squared = 39.361, df = 4, p-value = 5.867e-08

and second, we know from the literature that the tree ring data is in fact locally stationary fractional noise: see Goerg (2012) and Ferreira, Olea, and Palma (2013).



Simulating a locally stationary ARMA(1, 1) process:

lsarma11.sim <- function (n, phi = 0, theta = 0, sigma = 1, rand.gen = rnorm, 
    innov = rand.gen(n)) 
{
    u.d <- seq_len(n)/n
    # polynomial() from the 'polynom' package: coefficients -> function of u
    phi.u <- as.function(polynomial(phi))
    theta.u <- as.function(polynomial(theta))
    phi.t <- phi.u(u.d)
    theta.t <- theta.u(u.d)
    if (identical(sigma, "constant")) {
        # time-varying innovation sd that keeps the process variance at 1
        var.arma11 <- (1 + theta.t^2 - 2 * theta.t * phi.t)/(1 - phi.t^2)
        sigma.t <- sqrt(1/var.arma11)
    } else {
        sigma.u <- as.function(polynomial(sigma))
        sigma.t <- sigma.u(u.d)
    }
    xx <- innov
    for (ii in 2:n) {
        xx[ii] <- phi.t[ii] * xx[ii - 1] + sigma.t[ii] * innov[ii] + 
            theta.t[ii - 1] * sigma.t[ii - 1] * innov[ii - 1]
    }
    return(xx)
}
The locally stationary Box test:

Box.test.ls <- function (x, lag = 1, type = c("Ljung-Box", "Box-Pierce"), fitdf = 0, 
    K = 1, plot = FALSE, ...) 
{
    stopifnot(is.numeric(x), K >= 1, K == round(K), is.logical(plot))
    type <- match.arg(type)
    if (K >= length(x)) {
        stop("Number of windows must be smaller than the length of the time series.")
    }
    N.k <- floor(length(x)/K)
    if (K == 1) {
        return(Box.test(x, lag = lag, type = type, fitdf = fitdf))
    } else {
        METHOD <- paste0("LS ", type, " test; \n Number of windows = ", 
            K, "; non-overlapping window size = ", N.k)
        # rollapply() from the 'zoo' package: apply the standard test to
        # each non-overlapping window of length N.k
        stats.pvals <- rollapply(x, width = N.k, by = N.k, FUN = function(x.win) {
            out <- Box.test(x.win, lag = lag, type = type)
            return(c(out$statistic, out$p.value))
        }, fill = NA)
        stats.pvals <- stats.pvals[!is.na(stats.pvals[, 1]), ]
        colnames(stats.pvals) <- c("statistic", "p-value")
        if (plot) {
            plot(stats.pvals[, "p-value"], ylim = c(0, max(stats.pvals[, 
                "p-value"])), type = "h", lwd = 2, ylab = "p-value", 
                xlab = "Window")
            abline(0.05, 0, col = 4, lty = 4)
            abline(0, 0, lty = 1, col = "gray")
        }
        # the sum of the K independent chi-squared window statistics is
        # again chi-squared, with K * lag degrees of freedom
        STATISTIC <- sum(stats.pvals[, "statistic"])
        names(STATISTIC) <- "X-squared"
        df.total <- lag * K
        names(df.total) <- "df"
        PVAL <- 1 - pchisq(STATISTIC, df.total)
        return(structure(list(statistic = STATISTIC, parameter = df.total, 
            p.value = PVAL, method = METHOD, data.name = deparse(substitute(x))), 
            class = "htest"))
    }
}
Santa Fe Institute and cutting off turtle heads

I just arrived in Santa Fe yesterday for a 4-week (!) summer school on complex systems. Just one day has passed and it’s already impressive to see what the other participants are working on. Especially for a statistician, having so many people talk about their scientific problems and the issues they face with their experiments / simulations / data is fascinating; a great source of interesting problems to work on.

Participants are from a wide variety of fields; some obvious (given where complexity science started from) such as physics, computer science, economics/finance, evolutionary biology – other fields not so obvious but at least as interesting such as paleontology, tax evasion and fraud detection, literature/linguistics and – yes – statistics (I am the only statistician / machine learning person here).

And if, in the ‘student introduction’ session, one student starts with “I am a PhD student in neuroscience and we study neural signal transmission in turtles after cutting off their heads …” – I mean, if that’s just the first day, how great will the next 4 weeks be 😉