Towards the automatic annotation of large corpora of Classical music
Bernhard Niedermayer

Audio-to-score alignment is a technique in which the symbolic representation of a piece of music is synchronized with the audio signal of a recorded performance, so that the onset time of each individual note can be extracted and tempo curves can be computed. A standard algorithm for this alignment is Dynamic Time Warping. However, this method suffers from the shortcoming that notes which are simultaneous in the score will always be aligned to the same time frame within the audio signal.
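The alignment step can be illustrated with a minimal sketch of textbook Dynamic Time Warping. It assumes both the score and the audio have already been turned into per-frame feature vectors; the function name `dtw_path`, the Euclidean frame distance and the three-step pattern are illustrative choices, not details taken from the method described here.

```python
import numpy as np

def dtw_path(score_feats, audio_feats):
    """Align two feature sequences (rows = frames) with plain DTW.

    Returns a list of (score_frame, audio_frame) index pairs.
    """
    n, m = len(score_feats), len(audio_feats)
    # Euclidean distance between every score frame and every audio frame
    # (assumed distance measure).
    cost = np.linalg.norm(
        score_feats[:, None, :] - audio_feats[None, :, :], axis=2
    )
    # Accumulated cost with a padding row and column of infinity.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end of both sequences to their start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Because all notes of a chord share a single score frame, every pair on the path maps them to the same audio frame. This is the limitation that motivates a per-note refinement of the onsets.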

We have introduced two post-processing methods which not only overcome this flaw but also generally increase the accuracy of the extracted note onsets. The first one uses tone models that are trained in advance. The audio spectrum is then factorized using a dictionary of these models, such that the activation energy of each model over time is obtained. In cases where a note is played only once within a certain time span, the frame where the maximal increase in this activation energy occurs is a very accurate estimator for the onset time.
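A minimal sketch of this first step is given below. It assumes that the factorization is a non-negative one, V ≈ W H, with the dictionary W held fixed, and that the activations H are estimated with the standard multiplicative update for the Euclidean cost. The function names, the update rule and the iteration count are assumptions made for illustration only.

```python
import numpy as np

def activations(V, W, n_iter=200, eps=1e-9):
    """Estimate non-negative H with V ~ W @ H while W stays fixed.

    V: magnitude spectrogram, shape (bins, frames)
    W: dictionary of tone models, shape (bins, tones)
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost (assumed here).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def onset_from_activation(h):
    """Frame with the largest frame-to-frame increase in activation."""
    return int(np.argmax(np.diff(h, prepend=h[0])))
```

With a toy dictionary of two non-overlapping tone models and a spectrogram in which the tones enter at frames 3 and 6, `onset_from_activation` applied to the two rows of `activations(V, W)` recovers these two frames.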

However, there are ambiguous cases in which notes are repeated and no significant peak in the activation energy is found. In these cases a second revision is performed by investigating the energy increase in the frequency band corresponding to the note's fundamental frequency. Energy increases are weighted using a Gaussian window centered on the initial onset estimate, which takes the robustness of that estimate into account.
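The second revision can be sketched as follows, assuming the energy in the band around the note's fundamental frequency has already been extracted per frame. The function name `refine_onset` and the parameter `sigma_frames` (the width of the Gaussian window, in frames) are hypothetical; how the width is actually chosen is not specified here.

```python
import numpy as np

def refine_onset(band_energy, initial_frame, sigma_frames):
    """Pick the frame with the largest Gaussian-weighted energy increase.

    band_energy: per-frame energy in the band of the fundamental frequency
    initial_frame: onset estimate from the preceding alignment step
    sigma_frames: assumed window width, in frames
    """
    band_energy = np.asarray(band_energy, dtype=float)
    # Keep only increases; decays between repeated notes are ignored.
    increase = np.maximum(np.diff(band_energy, prepend=band_energy[0]), 0.0)
    frames = np.arange(len(band_energy))
    # Gaussian window centered on the initial onset estimate.
    weight = np.exp(-0.5 * ((frames - initial_frame) / sigma_frames) ** 2)
    return int(np.argmax(increase * weight))
```

For a repeated note with equally strong attacks, the window resolves the ambiguity in favor of the attack closest to the initial estimate.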

We have shown that this method results in more than 90% of all notes being aligned with a temporal deviation from the real onset time of less than 50 ms, and almost 50% of the notes with an error smaller than 10 ms. The test data consists of several Mozart sonatas, comprising more than 30,000 notes.
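The accuracy figures above are fractions of notes whose onset error stays below a tolerance. A minimal sketch of this measure follows, assuming estimated and reference onsets are given in seconds and are already matched note by note; the function name `fraction_within` is hypothetical.

```python
import numpy as np

def fraction_within(estimated, reference, tolerance_s):
    """Fraction of notes whose absolute onset error is below the tolerance."""
    err = np.abs(
        np.asarray(estimated, dtype=float) - np.asarray(reference, dtype=float)
    )
    return float(np.mean(err < tolerance_s))
```

Calling it once with `tolerance_s=0.05` and once with `tolerance_s=0.01` yields the two kinds of figures quoted above.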