# rootnote: Music mining with SoX

SoX is a very powerful command-line audio processing tool for Linux, macOS and Windows. In its wealth of capabilities it's a bit like ImageMagick for sound. Apart from generating and altering audio, it's also quite good at telling you what is inside a sound file.

To print metadata about an MP3 file, run soxi:

	$ soxi "Herbie Hancock/Thrust/01 Palm Grease.mp3"

	Input File : 'Herbie Hancock/Thrust/01 Palm Grease.mp3'
	Channels : 2
	Sample Rate : 44100
	Precision : 16-bit
	Duration : 00:10:37.34 = 28106474 samples = 47800.1 CDDA sectors
	File Size : 15.5M
	Bit Rate : 195k
	Sample Encoding: MPEG audio (layer I, II or III)
	Comments :
	Title=Palm Grease
	Artist=Herbie Hancock
	Album=Thrust
	Tracknumber=01/04
	Year=1974
	Genre=Jazz
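
For scripting, soxi can also print a single value per call, which avoids parsing the report above. A small sketch, assuming only the `-D` (duration in seconds), `-r` (sample rate) and `-c` (channel count) options; the function name `track_info` is made up for illustration:

```shell
# track_info FILE
# Prints duration (seconds), sample rate and channel count of FILE,
# separated by the pipe character.
track_info() {
  printf '%s|%s|%s\n' "$(soxi -D "$1")" "$(soxi -r "$1")" "$(soxi -c "$1")"
}

# Usage: track_info "Herbie Hancock/Thrust/01 Palm Grease.mp3"
```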


SoX can also extract statistics from the actual audio data using the stats effect:

	$ sox "Herbie Hancock/Thrust/01 Palm Grease.mp3" -n stats
	             Overall     Left      Right
	DC offset  -0.000033 -0.000033  0.000002
	Min level  -1.000000 -1.000000 -1.000000
	Max level   1.000000  1.000000  1.000000
	Pk lev dB       0.00      0.00      0.00
	RMS lev dB    -16.50    -16.64    -16.37
	RMS Pk dB     -10.17    -10.40    -10.17
	RMS Tr dB       -inf      -inf      -inf
	Crest factor       -      6.79      6.59
	Flat factor     3.23      2.63      3.65
	Pk count          20        17        23
	Bit-depth      29/29     29/29     29/29
	Num samples    28.1M
	Length s     637.335
	Scale max   1.000000
	Window s       0.050


The different parameters are thoroughly explained in "man sox". (The SoX manpage is actually a great little read; it covers lots of digital audio concepts.)

One annoying thing about SoX is its output format. It looks good to a human being, but getting it into a machine-readable format takes quite a bit of massaging. Look at the stats output table, for example. The row labels vary in width, so the second "r" in Crest factor sits in the same column as the "-" in the DC offset and Min level values. Also notice that Num samples is printed as 28.1M, not 28100000. Most statistics software would read that as a categorical attribute, not a numeric one.

In order to overcome some of these limitations I wrote a small shell script that takes a list of filenames from stdin and outputs a text file to stdout, where each line corresponds to one individual audio file, and the attributes that SoX extracts are separated by the pipe character.
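
A minimal sketch of how such a script could look, not the script used for the results in this post. It assumes a POSIX shell and awk; the function name `stats_row` and the decision to keep only the first value of each row (the Overall column where there is one) are assumptions made for illustration. SoX writes the stats table to stderr, hence the `2>&1`.

```shell
#!/bin/sh
# Sketch: print one pipe-separated line per audio file.

# stats_row FILE
# Runs the SoX stats effect on FILE and prints:
#   FILE|value|value|...
# Only the first value of each row is kept.
stats_row() {
  # Start the line with the file name (no newline yet).
  printf '%s' "$1"
  # The stats table goes to stderr, so redirect it into the pipe.
  sox "$1" -n stats 2>&1 | awk '
    {
      # A label can be one to three words long, so look for the first
      # field that looks like a value instead of relying on position.
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^(-|-?inf|-?[0-9.]+[kMG]?|[0-9]+\/[0-9]+)$/) {
          row = row "|" $i
          break
        }
      }
    }
    END { print row }
  '
}

# Usage: find . -name "*.mp3" | while IFS= read -r f; do stats_row "$f"; done
```

File names that contain a pipe character, and any warnings SoX prints alongside the table, would still produce malformed lines with this approach.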

If you use this script you'll still need to do a bit of manual processing afterwards, getting rid of lines that don't have enough values, etc.
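
Part of that cleanup can be automated. A rough sketch, assuming every complete line should have the same number of pipe-separated fields and that none of them may be empty; the function name `complete_rows` and the field count passed to it are hypothetical:

```shell
# complete_rows N
# Reads pipe-separated lines on stdin and prints only those that have
# exactly N fields, none of them empty.
complete_rows() {
  awk -F'|' -v n="$1" '
    NF == n {
      for (i = 1; i <= NF; i++) if ($i == "") next
      print
    }
  '
}

# Usage: complete_rows 16 < raw.txt > clean.txt
```

The field count depends on how many attributes end up in each line, so the `16` and the file names in the usage line are placeholders.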

To try it out I processed around 2,000 MP3 files I had on my hard drive. The SoX processing took around 2 hours on an old dual-core MacBook. Then I got rid of lines with missing values, brought the data into R, and used the following function to turn attributes with a k, M or G suffix into actual numbers:

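	# Expand the k/M/G suffixes in SoX output (e.g. "28.1M") into plain numbers.
	function(x) {
	  # Positions of the values that carry each suffix
	  k.indices <- grep('k', x)
	  M.indices <- grep('M', x)
	  G.indices <- grep('G', x)
	  # Strip the suffix and convert the remainder to numeric
	  nums <- as.numeric(gsub("[kMG]", "", x))
	  # Scale each value by the multiplier its suffix stands for
	  nums[k.indices] <- nums[k.indices] * 1000
	  nums[M.indices] <- nums[M.indices] * 1000000
	  nums[G.indices] <- nums[G.indices] * 1000000000
	  nums
	}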
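
The same conversion could also be done before the data reaches R. A minimal sketch in POSIX shell and awk, assuming one value per line; the function name `expand_suffix` is hypothetical:

```shell
# expand_suffix
# Reads one value per line and replaces a trailing k, M or G suffix
# with the number it stands for; other lines pass through unchanged.
expand_suffix() {
  awk '
    /^-?[0-9.]+[kMG]$/ {
      suffix = substr($0, length($0))
      value = substr($0, 1, length($0) - 1)
      if (suffix == "k") value *= 1000
      else if (suffix == "M") value *= 1000000
      else value *= 1000000000
      # %.0f rounds to a whole number and avoids output like 2.81e+07
      printf "%.0f\n", value
      next
    }
    { print }
  '
}

# Usage: printf '28.1M\n195k\n' | expand_suffix
```

With this, `28.1M` becomes `28100000` and `195k` becomes `195000`, while a value such as `0.050` is left as it is.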


After that I pulled the Artist ID3 tag out of the comments attribute and used J48 (Weka's C4.5 implementation) from the RWeka package to build a decision tree model that predicts the artist. Without tweaking the J48 parameters very much, I got around 23% accuracy. That's obviously not a great number, but compared to a completely random prediction model, which reached an accuracy of 0.7%, it's not all that bad.

The audio features that SoX extracts may not be the most useful, but considering how fast and easy to use it is, I think it's definitely worth a go.

Sunday, 14 August 2011
