Sunday, 14 August 2011

Music mining with SoX

SoX is a very powerful command line audio processing tool for Linux. In its wealth of capabilities it's a bit like ImageMagick for sound. Apart from generating and altering audio, it's also quite good at telling you what is inside a sound file.

To print metadata about an MP3 file you can do
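For example (song.mp3 here is just a placeholder file name):

sox --i song.mp3

This prints things like the channel count, sample rate, precision, duration, bit rate and any ID3 comments.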



SoX is also able to extract statistics from the actual audio information using the stats command:
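For example, again using a placeholder file name:

sox song.mp3 -n stats

(The stats effect writes its report to stderr rather than stdout, which is worth knowing if you want to capture it in a script.)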



The different parameters are thoroughly explained in "man sox". (The SoX manpage is actually a great little read; it covers lots of digital audio concepts.)

One annoying thing about SoX is its output format. It looks good to a human being, but to get it into a machine-readable form you have to massage it quite a bit. Look at the output table for the stats command, for example: the columns aren't cleanly delimited, so the second "r" in Crest factor sits in the same column as the "-" in DC offset and Min level. Also notice that Num samples is reported as 28.1M, not 28100000. To any statistics software that makes it a categorical attribute, not a numeric one.

In order to overcome some of these limitations I wrote a small shell script that takes a list of filenames from stdin and outputs a text file to stdout, where each line corresponds to one individual audio file, and the attributes that SoX extracts are separated by the pipe character.

If you use this script you'll still need to do a bit of manual processing afterwards, getting rid of lines that don't have enough values, etc.
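The rough shape of such a script looks something like this (just a sketch of the idea, not the exact script):

#!/bin/sh
# sketch: read file names from stdin and print one pipe-separated line per
# file, using the last field of each stats line as the value
while read -r file; do
    values=$(sox "$file" -n stats 2>&1 | awk '{ print $NF }' | paste -s -d'|' -)
    printf '%s|%s\n' "$file" "$values"
done

Each output line then starts with the file name, followed by the raw stats values.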

To try it out I processed around 2,000 MP3 files I had on my hard drive. The SoX processing took around 2 hours on an old dual core Macbook. Then I got rid of lines with missing values, brought it into R, and used a small function to transform attributes written in abbreviated or scientific notation (like 28.1M) into actual numbers.
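A minimal sketch of that kind of conversion (the function name here is just illustrative):

to.numeric <- function(x)
{
    # sketch: convert SoX-style abbreviated values ("28.1M", "156.2k") as well
    # as plain or scientific-notation numbers into numerics
    x <- as.character(x)
    mult <- rep(1, length(x))
    mult[grepl("k$", x)] <- 1e3
    mult[grepl("M$", x)] <- 1e6
    as.numeric(sub("[kM]$", "", x)) * mult
}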



After that I pulled the Artist ID3 tag out of the comments attribute and used J48 (a C4.5 implementation) from the RWeka package to build a decision tree model that predicts the artist. Without tweaking the J48 parameters very much, I got around 23% accuracy. That's obviously not a great number, but compared to a completely random prediction model, which managed an accuracy of about 0.7%, it's not all that bad.
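For what it's worth, the RWeka side of it only takes a couple of lines (this assumes a data frame called features holding the SoX attributes plus an artist factor column):

library(RWeka)

# "features" is assumed to be a data frame of SoX attributes plus an "artist" factor
model <- J48(artist ~ ., data = features)

# 10-fold cross-validated accuracy
evaluate_Weka_classifier(model, numFolds = 10)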

The audio features that SoX extracts may not be the most useful, but considering how fast and easy to use it is, I think it's definitely worth a go.

Friday, 15 April 2011

Automatic metre detection by autocorrelating IOI

Metre is so ingrained in our musical subconscious that it's harder to get it wrong than to get it right. Try dancing a foxtrot to a Viennese waltz, or leading a marching battalion to war with Bulgarian wedding music. But like all tasks that are completely intuitive to human beings, teaching a computer to detect the time signature is quite tricky.

So how do humans do it? A lot of it comes down to repetition. We tend to place our imaginary bar lines at the beginning of repeating segments. Also, the time between consecutive note onsets, the inter-onset interval (IOI), tends to matter more to our perception of segmentation than the actual pitches.

In 1992, Judith C. Brown came up with the ingenious idea of using autocorrelation of the IOI to determine metre. The approach is relatively straightforward, and although it has been expanded and improved on since, it is still very effective.

Implementing Brown's technique makes for quite a good case study in using R for musical analysis, so let's work through an example.

Let's say we have a time series vector where each discrete observation holds the inter-onset interval of a note triggered at that time frame. If no note is triggered there, the value is zero.

For example, consider the German folk song "Herr Bruder Zur Rechten" (number E1145 in the ESAC database):



The corresponding IOI time series would be:

ioi.seq <- c(1, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2)

where the minimum subdivision is quavers.

Next we calculate the full autocorrelation using the acf() function from the stats package.

ioi.acf <- as.numeric(acf(ioi.seq, lag.max = length(ioi.seq))$acf)

Looking at the ACF plot, we notice that peaks in the correlation occur every 6 quavers, i.e. every 3 crotchets, which matches the actual 3/4 time signature.



To automatically find peaks we use Brian Ripley's implementation of the S+ peaks function (I renamed it to s.peaks).
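If you don't have that function to hand, any simple sliding-window local-maximum detector will do as a stand-in (this is not Ripley's exact code):

s.peaks <- function(x, span = 3)
{
    # TRUE wherever x[i] is the maximum of a window of roughly the given
    # length centred on i
    half <- span %/% 2
    sapply(seq_along(x), function(i) {
        lo <- max(1, i - half)
        hi <- min(length(x), i + half)
        x[i] == max(x[lo:hi])
    })
}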

peaks <- s.peaks(ioi.acf, 6)

The second parameter to s.peaks specifies the window length, which is inversely proportional to the length of the minimum subdivision: working with crotchets the window length would be 3, with quavers 6, and with semiquavers 12.
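In other words, with the minimum subdivision expressed as a fraction of a crotchet (the variable names here are just illustrative):

window.len <- 3 / subdiv   # crotchets: 3/1 = 3, quavers: 3/0.5 = 6, semiquavers: 3/0.25 = 12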

Next, we find the distances between peaks.

peaks.indices <- which(peaks)
distances <- diff(peaks.indices)

Finally, we multiply the mode of the peak distances by the length of the subdivision (in crotchets) to get the extracted metre.

# there must be a better way of finding mode...
distances.mode <- as.numeric(names(which.max(table(distances))))
metre <- distances.mode * 0.5

As expected:

> metre
[1] 3

Now, this method doesn't always get it spot on. Sometimes it thinks that a melody in 2/4 is in 8/4, or that 3/4 is 1.5/4. As it turns out, finding the exact metre is very difficult. How would you tell a computer whether a piece is in 6/8 or 3/4? Or, even worse, 2/4 or 4/8? For the purposes of music informatics, it is often sufficient to determine whether the time signature is binary (2/4, 4/4, etc.) or ternary (3/4, 6/8, 9/8, etc.). Of course, this means that we have to discard all pieces in irregular time signatures, such as 5/4 or 7/8. Fortunately for us (but perhaps unfortunately for music in general), western composers have traditionally stuck to simple metres.
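One crude way of collapsing the extracted metre (in crotchets per bar) into those two classes, just as a sketch:

metre.class <- function(metre)
{
    # metres divisible by 3 count as ternary, other even metres as binary;
    # anything else (e.g. the 1.5 case above) is left undecided
    if(metre %% 3 == 0)
        return("ternary")
    if(metre %% 2 == 0)
        return("binary")
    NA
}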

In my previous post I talked about the Essen data set. One of the great things about it is that it has metre and key information encoded for all included melodies. Using this, we can test how good our metre detection algorithm actually is.

The results are promising. Out of 5120 melodies, we manage to get the correct result in 4148 of the cases, yielding an accuracy of roughly 81%.

The extract.metre() function resides on Github in the rmidi_tools repo.

Wednesday, 13 April 2011

The Essen folk music dataset as a MySQL database

The ESAC project is amazing. They have an archive of almost 6000, mostly German, folk melodies which they have encoded in a nice, consistent, computer-friendly format.

The melodies can be downloaded in MIDI format here, but I couldn't find the data in SQL format anywhere. So I wrote a little script that kneaded the dataset into a MySQL database.

I put the database dump, as well as the script (so you can see where bugs in the data come from), on Github.

Tuesday, 12 April 2011

Quantising monophonic MIDI melodies

Quantisation is a useful technique in its own right, but I personally tend to use it more often as a pre-processing tool. By applying quantisation we can enforce a minimum granularity, which helps keep a large dataset of melodies (such as the Essen dataset) consistent.

Quantisation should really be as simple as pie: just round the note start times to a minimum subdivision. However, monophony complicates things slightly. If two notes get rounded to the same start time, which one do you choose? Where does the other one go? In the implementation below I keep the note whose original start time is closest to the quantised start time. Any colliding notes get discarded.

Another question is whether the notes' durations should be quantised or if they should keep their original durations. If the original duration is maintained, one is quite likely to end up with overlapping notes, but if durations are quantised, one could lose rests. Since requirements might vary, I leave the choice to the user.

Finally, since many melodies begin with an upbeat, we need a way to offset the quantisation grid. This is particularly useful when dealing with quite aggressive quantisation (e.g. quantisation to the nearest crotchet).

Enough talk, here's the code:

# notes is the input matrix,
# subdiv the minimum subdivision (in crotchets),
# offset the quantisation offset (also in crotchets).
# preserve.duration determines whether to tie notes or
# preserve original durations.
quantise.notes <- function(notes, subdiv, offset = 0,
                           preserve.duration = TRUE)
{
    if(subdiv == 0)
        return(notes)

    # first, do "naive" quantising, by just rounding start times
    subdiv.ticks <- subdiv * midi.get.ppq()
    offset.ticks <- offset * midi.get.ppq()
    notes.quantised <- notes
    notes.quantised[, "start"] <-
        round((notes.quantised[, "start"] - offset.ticks) /
              subdiv.ticks) * subdiv.ticks + offset.ticks

    # find any collisions
    tbl <- sort(table(notes.quantised[, "start"]))
    collisions <- as.numeric(names(tbl[tbl > 1]))

    # if there are collisions, remove all notes that are not
    # the note closest to the quantised value
    if(length(collisions) > 0) {
        remove.indices <- integer(0)
        for(collision in collisions) {
            colliding.indices <- which(notes.quantised[, "start"] ==
                                       collision)
            distances <- abs(notes[colliding.indices, "start"] - collision)

            # if two colliding notes are equally close,
            # use the lower one
            closest.index <- colliding.indices[which.min(distances)]

            remove.indices <- c(remove.indices,
                                colliding.indices[colliding.indices !=
                                                  closest.index])
        }

        # remove colliding notes (drop = FALSE keeps the matrix shape
        # even if only one note is left)
        notes.quantised <- notes.quantised[-remove.indices, , drop = FALSE]
    }

    # if we don't preserve the original durations, we tie the notes
    if(!preserve.duration) {

        # order notes by start time
        notes.quantised <-
            notes.quantised[order(notes.quantised[, "start"]), , drop = FALSE]

        # tie all notes except the last note
        for(i in seq_len(nrow(notes.quantised) - 1)) {
            notes.quantised[i, "duration"] <-
                notes.quantised[i + 1, "start"] -
                notes.quantised[i, "start"]
        }

        # set the duration of the last note such that the duration of
        # the quantised melody is the same as the duration of the
        # original melody
        notes.quantised[nrow(notes.quantised), "duration"] <-
            notes[nrow(notes), "duration"] +
            notes[nrow(notes), "start"] -
            notes.quantised[nrow(notes.quantised), "start"]
    }

    return(notes.quantised)
}

The input to this function is a matrix with the column headers "start", "duration", "pitch" and "velocity". In the file hosted on Github I provide functions for converting to this format from RMidi matrices, and back again.
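As a quick, made-up example, here are three slightly sloppy notes being snapped to the nearest quaver (this assumes RMidi is loaded and that midi.get.ppq() reports 480 ticks per crotchet; adjust the numbers if yours differs):

# three slightly sloppy notes: start, duration, pitch, velocity
notes <- matrix(c( 10, 230, 60, 100,
                  250, 230, 62, 100,
                  500, 230, 64, 100),
                ncol = 4, byrow = TRUE,
                dimnames = list(NULL,
                                c("start", "duration", "pitch", "velocity")))

# snap the start times to the nearest quaver (0.5 crotchets)
quantise.notes(notes, subdiv = 0.5)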

Saturday, 9 April 2011

Automatic key detection in R

When you want to analyse a large corpus of melodies it's often vital that the pieces are all in the same key. For example, building a Markov model from melodies where some are in C and some in G# doesn't really make sense. So we need some way to automatically determine the key of a piece in order to transpose it to a pre-defined, "normal" scale.

Usually this is achieved by building a chroma from the melody in question. A chroma in this context is a vector of pitch occurrence frequencies with all pitches folded into a single octave, i.e. taken modulo 12. Once the chroma has been determined, you compare it to a major scale, such as C major, over all 12 transpositions (including the zero transposition), and find the transposition that is closest. That transposition is then your "delta", the amount by which you need to transpose the melody.

One problem with this approach is that not all melodies have the same number of notes. In fact, a melody in C might have just one or two B notes, and a melody in F might have just one B flat. Whatever distance metric you use, the chromas of these two melodies will be regarded as very similar. As a remedy, we can take the fourth root (which has a similar shape to the log function) of the relative frequencies in the chroma. Doing this pushes the frequencies of the occurring notes closer to 1, and thus further away from the non-occurring notes.

This is the R function that extracts the chroma from a melody, where the notes are defined as standard MIDI pitch values:

extract.chroma.melody <- function(melody)
{
    pitches <- melody %% 12
    chroma <- numeric(12)
    for(i in 0:11) {
        chroma[i + 1] <- sum(pitches == i)
    }
    chroma <- (chroma / max(chroma)) ^ (1/4)
    return(chroma)
}

This function returns a vector of the relative frequencies of the 12 semitones, starting from C (with standard MIDI numbering, pitch modulo 12 is 0 for C). Now we just need a way to compare that vector to the C major scale vector (again starting from C), [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]. Since the vector components are not dependent on each other, we use the Manhattan distance. The following function returns the best transposition from one chroma to another:

find.best.transpose.dist <- function(chroma.fixed, chroma.moving)
{
    best.dist <- +Inf
    best.transpose <- NA
    chroma.fixed <- as.numeric(chroma.fixed)
    chroma.moving <- as.numeric(chroma.moving)
    for(i in 0:11) {
        # optimised manhattan dist()
        new.dist <- sum(abs(chroma.fixed - shift(chroma.moving, -i)))
        if(new.dist < best.dist) {
            best.dist <- new.dist
            best.transpose <- i
        }
    }

    return(list(dist = best.dist, transpose = best.transpose))
}

(The shift function simply performs a circular shift of the input vector. That function is included in the Github repo.)
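If you want to run the snippet above without grabbing the repo, a minimal circular shift could look like this (here a positive n rotates to the right; the repo's version may use a different convention):

shift <- function(x, n)
{
    # circular shift: positive n rotates the vector to the right,
    # negative n to the left
    n <- n %% length(x)
    if(n == 0) return(x)
    c(x[(length(x) - n + 1):length(x)], x[1:(length(x) - n)])
}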

To get the closest major scale, i.e. the key of the melody, we just call the function with the major scale as the first argument and the chroma of the melody under scrutiny as the second argument:

major <- c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1)
chroma <- extract.chroma.melody(melody)
key <- find.best.transpose.dist(major, chroma)$transpose

This code is on Github, in the chroma.R file. Grab it and have a play if you want. There are some other functions in there for doing useful things with the extracted Manhattan distances, such as computing a distance matrix that can be used in clustering, for example. A bunch of functions for reading MIDI files into R are included in the RMidi package, also on Github.

Thursday, 7 April 2011

Updates on RMidi

I've been doing a bit of work on RMidi recently. I've added a function for reading MIDI files into R, midi.read.file(), and the ability to work with MIDI control events, implemented in midi.controller(). As an example use case, I've included a little piece of open source music in the /examples directory.

While I had my hands in the project, I decided to move from Google Code to Github. Github has been getting a lot of attention recently, so I figured I'd better follow the other sheep. I really like it though, and Google Code does look a bit pale in comparison.

So, what's the expression? Fork me on Github!

Sunday, 13 March 2011

Using the subsumption architecture for musical composition

I recently learned about the subsumption architecture, and I wanted to see if I could use it in music.

(If you want, you can jump to the end of the post for a sample of what the result can sound like, before reading the actual post.)

In AI, the subsumption architecture is a simple, reactive, multi-agent architecture. The idea is that you have a number of basic actions that can be triggered by a number of rules. The rules are ordered by precedence, so that if several actions could fire, only one gets executed. For example, say you're driving a car. One rule could be "if you see an obstacle in front of you, stop!". Another rule could be "if you're at a traffic light and it turns green, drive!". But if you're at the traffic light, it turns green, and an elderly lady has not finished crossing the road, you probably don't want to run her over. In that case the first rule takes precedence over the second, and so on.

The real power of the subsumption architecture comes when you have many agents operating together (i.e. a city with lots of cars). I remember reading somewhere that the system is modelled on ant colonies, where each member is quite stupid on its own, but put all these simple behaviours together in an orderly fashion and you have something that works quite well.

Now, let's apply this to music. Think of an agent as a part, hammering semiquavers, in a polyphonic piece. Say we want the parts to stay in their own "pitch bands", and we want to avoid a higher part dropping below a lower part (although we could allow this from time to time). Also, we don't want any part in our piece to go above or below given pitch boundaries (max pitch, min pitch). Based on these two goals we can define the following rules:

1a. If agent's pitch is above max pitch, give the agent a bias towards downward motion.
1b. If agent's pitch is below min pitch, give the agent a bias towards upward motion.
2a. If agent's pitch is equal to another agent's pitch (i.e. a collision has occurred), and the other agent is supposed to occupy a higher pitch band, give the agent a bias towards downward motion.
2b. If agent's pitch is equal to another agent's pitch, and the other agent is supposed to occupy a lower pitch band, give the agent a bias towards upward motion.
2c. If agent's pitch is equal to more than one other agent's pitch, give the agent a bias based on a majority vote.
3. Otherwise, move randomly, based on bias.

We can define our order of precedence as 1a < 1b < 2c < 2a < 2b < 3 (where < is read "takes precedence over").
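Just to make the rule set concrete, here's a sketch of what a single agent's decision step might look like in R (the agent representation, a list with pitch, band.rank and bias fields, is made up purely for illustration):

# purely illustrative: one agent's decision step, following the precedence
# 1a, 1b, 2c, 2a/2b, 3; an agent is a list with pitch, band.rank and bias fields
choose.bias <- function(agent, others, min.pitch, max.pitch)
{
    colliders <- Filter(function(o) o$pitch == agent$pitch, others)

    if(agent$pitch > max.pitch) return(-1)                   # 1a: too high, push down
    if(agent$pitch < min.pitch) return(+1)                   # 1b: too low, push up

    if(length(colliders) > 1) {                              # 2c: majority vote
        votes <- sapply(colliders, function(o)
            if(o$band.rank > agent$band.rank) -1 else +1)
        return(sign(sum(votes)))
    }
    if(length(colliders) == 1)                               # 2a / 2b: single collision
        return(if(colliders[[1]]$band.rank > agent$band.rank) -1 else +1)

    agent$bias                                               # 3: keep the current bias
}

The returned value is just the bias (+1 for upward, -1 for downward) that feeds into the random movement of rule 3.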

It is easy to see how these rules will meet the goals we set out to achieve. Of course, an agent could jump below a lower agent, but if we make the longest allowed jump fairly short, the agent won't get very far down before it bumps into another agent.

So what notes should we allow the agent to play? If we allow the full chromatic scale, it will all just be a mess. If we only allow a major pentatonic scale, it will sound pleasant but boring in the long run. Really, we would want to be able to change the scale. Rodney Brooks, the inventor of the subsumption architecture, gave an example of Mars rovers that look for samples of some Mars material, and communicate by dropping radioactive crumbs. As an homage to Brooks, I'm going to drop little gems containing musical scales on my "world". Agents can pick gems up (if close enough), and then hand them over to other agents if they collide. If two agents collide, both carrying gems, the one that got its gem more recently gets to give one away. I've simplified things a bit, so that gems never actually disappear, even if they are "picked up".

Using this technique, a scale can be gradually imposed on the piece over time, creating interesting overlaps of scales. During these overlaps, tension builds while the "orchestra" of agents keeps bouncing into each other, a tension which is released once all agents have adopted the same scale.

In the example piece I have created, I drop the following gems (times are in crotchets; radius is how far away an agent can be and still pick the gem up, so the greater the radius, the more likely the gem gets picked up):

1. Scale: C, D, E, G, time = 0, pitch = 70, radius = 4
2. Scale: C, D, E, F, A, time = 8, pitch = 100, radius = 3
3. Scale: D, E, G, B, time = 16, pitch = 60, radius = 2
4. Scale: C, E, G, A, B, time = 32, pitch = 100, radius = 5
5. Scale: D, E, F#, A, B, time = 48, pitch = 60, radius = 3
6. Scale: D, G, B, time = 56, pitch = 80, radius = 4
7. Scale: C, E, G, B, time = 64, pitch = 100, radius = 4
8. Scale: C, G, time = 72, pitch = 80, radius = 6
9. Scale: C, Eb, G, Ab, Bb, time = 88, pitch = 60, radius = 3
10. Scale: C, G, time = 104, pitch = 80, radius = 3
11. Scale: C, D, G, time = 112, pitch = 60, radius = 4

At the 128th crotchet I rewind my counter and drop the first gem again, etc. Below are two graphs showing the movement of four and ten agents, respectively, during the first 128 crotchets. Notice how agents bounce back when they collide.





The next two graphs show how the scales get distributed, using "gems", to the different agents. As you can imagine, it is less likely that all scales will find their way to all agents in the four agent scenario than with ten agents.





This is what these two pieces sound like: 10 agents, 4 agents

And finally, here are a couple of longer generations: 10 agents, 4 agents

If you're interested in the source code (in R), send me a message and I'll tidy it up.