Sunday, 14 August 2011

Music mining with SoX

SoX is a very powerful command-line audio processing tool that runs on Linux, macOS, and Windows. In its wealth of capabilities it's a bit like ImageMagick for sound. Apart from generating and altering audio, it's also quite good at telling you what is inside a sound file.

To print metadata about an MP3 file, you can use soxi, which ships with SoX:

$ soxi "Herbie Hancock/Thrust/01 Palm Grease.mp3"
Input File : 'Herbie Hancock/Thrust/01 Palm Grease.mp3'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:10:37.34 = 28106474 samples = 47800.1 CDDA sectors
File Size : 15.5M
Bit Rate : 195k
Sample Encoding: MPEG audio (layer I, II or III)
Comments :
Title=Palm Grease
Artist=Herbie Hancock
Album=Thrust
Tracknumber=01/04
Year=1974
Genre=Jazz


SoX can also extract statistics from the audio data itself using the stats effect (the -n tells SoX to discard the audio output instead of writing a file):

$ sox "Herbie Hancock/Thrust/01 Palm Grease.mp3" -n stats
             Overall     Left      Right
DC offset  -0.000033 -0.000033  0.000002
Min level  -1.000000 -1.000000 -1.000000
Max level   1.000000  1.000000  1.000000
Pk lev dB       0.00      0.00      0.00
RMS lev dB    -16.50    -16.64    -16.37
RMS Pk dB     -10.17    -10.40    -10.17
RMS Tr dB       -inf      -inf      -inf
Crest factor       -      6.79      6.59
Flat factor     3.23      2.63      3.65
Pk count          20        17        23
Bit-depth      29/29     29/29     29/29
Num samples    28.1M
Length s     637.335
Scale max   1.000000
Window s       0.050


The different parameters are thoroughly explained in "man sox". (The SoX manpage is actually a great little read; it covers lots of digital audio concepts.)

One annoying thing about SoX is its output format. It looks good to a human being, but getting it into a machine-readable format takes quite a bit of massaging. Look at the output table for the stats effect, for example. The second "r" in Crest factor sits in the same column as the "-" in the DC offset and Min level values, so the label runs into the value column and you can't simply split the table at a fixed character position. Also notice that Num samples is 28.1M, not 28106474. Most statistics software would read that as a categorical attribute, not a numeric one.

In order to overcome some of these limitations I wrote a small shell script that takes a list of filenames from stdin and outputs a text file to stdout, where each line corresponds to one individual audio file, and the attributes that SoX extracts are separated by the pipe character.
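The script itself isn't reproduced here, but a minimal sketch of the idea might look like the following. It is an illustration rather than the original script: the function names stats_to_record and stats_for_files are made up, it keeps only the Overall column of the stats table, and it ignores the soxi metadata. Note that the stats effect prints its table to stderr, hence the 2>&1.

```shell
#!/bin/sh
# Hypothetical sketch, not the original script from this post.

# Turn the table printed by `sox FILE -n stats` (read from stdin) into one
# pipe-separated record. For each row, the first field that starts with a
# digit or a minus sign is taken as the value (the "Overall" column); the
# header row has no such field and is skipped automatically.
stats_to_record() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[-0-9]/) {
                    printf "%s%s", sep, $i
                    sep = "|"
                    break
                }
            }
        }
        END { print "" }
    '
}

# Read filenames from stdin, one per line, and print
# "filename|value|value|..." for each of them.
stats_for_files() {
    while IFS= read -r f; do
        printf '%s|' "$f"
        # stats reports on stderr; </dev/null keeps sox away from our stdin
        sox "$f" -n stats 2>&1 </dev/null | stats_to_record
    done
}
```

It could be run as find . -name '*.mp3' | stats_for_files > stats.txt. File names that contain a pipe character would need extra handling.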

If you use this script you'll still need to do a bit of manual processing afterwards, getting rid of lines that don't have enough values, etc.
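Dropping the incomplete lines can be done with a line of awk. This is a minimal sketch, assuming the pipe-separated format described above; keep_complete_rows is a hypothetical helper name, and the expected field count is something you pick after looking at the data.

```shell
#!/bin/sh
# Keep only the records that have exactly N pipe-separated fields.
# Usage: keep_complete_rows N < input > output
keep_complete_rows() {
    awk -F'|' -v n="$1" 'NF == n'
}
```

For example, keep_complete_rows 16 < stats.txt > stats-clean.txt keeps only the lines with 16 fields (16 is just an example value).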

To try it out I processed around 2,000 MP3 files I had on my hard drive. The SoX processing took around 2 hours on an old dual-core MacBook. Then I got rid of lines with missing values, brought the data into R, and used the following function to turn attributes written with k, M, and G suffixes (such as 28.1M) into actual numbers:

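# Expand SoX's k/M/G suffixes into plain numbers, e.g. "28.1M" -> 28100000.
# (The name expand.suffixes is chosen here for illustration.)
expand.suffixes <- function(x) {
  k.indices <- grep('k', x)
  M.indices <- grep('M', x)
  G.indices <- grep('G', x)
  # strip the suffix, then scale the affected entries
  nums <- as.numeric(gsub("[kMG]", "", x))
  nums[k.indices] <- nums[k.indices] * 1000
  nums[M.indices] <- nums[M.indices] * 1000000
  nums[G.indices] <- nums[G.indices] * 1000000000
  nums
}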


After that I pulled the Artist ID3 tag out of the comments attribute and used J48 (Weka's implementation of C4.5) from the RWeka package to build a decision tree that predicts the artist from the SoX attributes. Without tweaking the parameters for J48 very much, I got around 23% accuracy. That's obviously not a great number, but compared with a completely random prediction model, which reached an accuracy of 0.7%, it's not all that bad.

The audio features that SoX extracts may not be the most useful, but considering how fast and easy to use it is, I think it's definitely worth a go.
