This is a story about refactoring code that isn’t prepared for it one bit. By “prepared” I mean having test cases, code written with dependency injection in mind, and so on. The original code has none of that; it is just code that works.
Let me explain what I have, what it does, and where it should end up. The purpose of this refactoring session isn’t better performance with the same functionality. It is to make the same algorithm, already proven in various tests, work on different data.
Before (a.k.a. now):
The component creates a PK_HASH from a sound file. PK_HASH is the code name for our latest tech, which crunches a whole audio file down to a few bytes; those bytes can later be compared with the bytes crunched from another file to tell you whether it’s the same sound file. PK stands for PlayKontrol, the brand name.
So, there are a few steps to produce the PK_HASH from a sound file:
- decode and read the file: the input is a file on disk (for example .mp3, .wma, .aac) and the output is PCM samples
- from any kind of PCM samples (stereo, mono, 8-bit, 16-bit), produce an array of shorts that must be rewindable
- hash the data and produce the PK_HASH files
Decode and read the file
The input is a file on disk in any streamable format (we produce the files with StreamSink; see the archive example here: http://access.streamsink.com/archive/). That means .mp3, .aac, .wma, .ogg, and whatnot.
Currently this is done by a simple component that uses DirectShow to create a graph for the audio file, renders that graph, and attaches a SampleGrabber filter to fetch the samples. The component goes from a file on disk to the whole PCM sample data in memory. That is feasible for a 5-minute file (5 x 60 x 4 x 44100 = 52,920,000 bytes, roughly 50 MB). It can work even for 1-hour files. However, you can *feel* that this approach IS wrong, especially when you know that the rest of the algorithm doesn’t need access to the whole PCM data at once.
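To put numbers on that feeling, here is a small Python sketch of the arithmetic (Python is used only for brevity; it is not the original code). It assumes the "4" in the formula above means 4 bytes per sample frame, i.e. 16-bit stereo at 44.1 kHz; the function name `pcm_bytes` is made up for this illustration.

```python
def pcm_bytes(seconds, sample_rate=44100, channels=2, bytes_per_sample=2):
    """Size in bytes of uncompressed PCM audio held entirely in memory.

    Defaults assume 16-bit stereo at 44.1 kHz (an assumption, see above).
    """
    return seconds * sample_rate * channels * bytes_per_sample


five_minutes = pcm_bytes(5 * 60)
one_hour = pcm_bytes(60 * 60)

# Print both sizes in MiB to compare a 5-minute file with a 1-hour file.
print("5 min: %.1f MiB" % (five_minutes / 2**20))
print("1 h:   %.1f MiB" % (one_hour / 2**20))
```

Under these assumptions the one-hour case is twelve times the five-minute one, which is why holding everything in memory stops scaling long before the algorithm itself does.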
Rewindable sample array
PCM samples are promoted to 16 bit (if needed), and the channels are downmixed to mono. Again, this is done in memory, so the result of the operation keeps 2 bytes in memory for every PCM sample.
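A rough Python sketch of this step, under two assumptions that the real component may not share: 8-bit input is unsigned PCM with silence at 128 (the WAV convention), and the downmix is a plain average of the interleaved channels. Both function names are hypothetical.

```python
from array import array


def promote_8bit_to_16bit(raw):
    """Convert unsigned 8-bit PCM (0..255, silence at 128) to signed 16-bit.

    Assumption: the 8-bit data follows the WAV convention.
    """
    return array("h", ((b - 128) << 8 for b in raw))


def downmix_to_mono(samples, channels):
    """Average interleaved channels into a single mono channel.

    Assumption: a plain average; floor division rounds toward minus infinity.
    """
    if channels == 1:
        return samples
    frames = len(samples) // channels
    return array(
        "h",
        (sum(samples[i * channels:(i + 1) * channels]) // channels
         for i in range(frames)),
    )


stereo = array("h", [100, 200, -100, -300])  # two frames: (L, R), (L, R)
mono = downmix_to_mono(stereo, channels=2)
print(mono.tolist())
```

The `array("h", ...)` type stores each sample as a 2-byte signed short, which matches the "2 bytes per PCM sample" cost described above.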
Hashing and creating the file
The hashing process needs a moving, overlapping window over the sample array, and since everything is in memory, that is a piece of cake for now. We take the data, process it, and write it into another byte array. Since the output is extremely dense, I won’t cry about memory at this point, but yes, it is written into memory first and then saved to a file on disk.
So that is how it works so far. It goes from the encoded audio file to PCM sample data in memory, downmixes that data in memory to one PCM channel, processes the mono PCM samples to obtain the PK_HASH, and then writes it to a file.
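For illustration, a moving, overlapping window over an in-memory array can be as small as the Python sketch below. The names `window_size` and `hop` are mine, and the actual sizes used by the PK_HASH algorithm are not given here, so the values are arbitrary.

```python
def overlapping_windows(samples, window_size, hop):
    """Yield successive windows of window_size samples, advancing by hop.

    With hop < window_size, consecutive windows overlap.
    A trailing partial window is dropped.
    """
    if not 0 < hop <= window_size:
        raise ValueError("hop must be between 1 and window_size")
    for start in range(0, len(samples) - window_size + 1, hop):
        yield samples[start:start + window_size]


for window in overlapping_windows(list(range(10)), window_size=4, hop=2):
    print(window)
```

This is trivial precisely because the whole array is addressable by index; the same access pattern is what has to survive once the samples arrive as a stream.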
So what do we actually need?
If you take a peek at the archive, you’ll find that every folder has audio files, and also a .hash file for every audio file present in the directory. Please note that not every directory is processed, only 20 of them, because processing is CPU-intensive and I have only a few PCs lying around to scrub the data. That will improve in the future. So, for crunching the archive, even the POC (proof of concept) is OK, as it serves its purpose. It goes through the archive and leaves processed PK_HASHes behind.
A process that runs in parallel waits for a PK_HASH file to be created, reads it, and matches it against the database. However, the next step has to be taken, and that is REALTIME processing.
To be able to process in REALTIME, the architecture goes something like this:
- StreamSink is attached to the network stream of any kind, and provides PCM sample output
- PCM sample output is downsampled and buffered
- hashing process uses buffered mono PCM samples and outputs results into the stream
- the PK_HASH stream is again buffered and the results are processed by the MATCHER process
StreamSink PCM decoding
StreamSink is the application that captures internet media streams. Thanks to a feature request from DigitalSyphon, it can also process any media stream and provide its PCM samples in real time, in the form of a Stream-derived class. So, that part of the process is covered completely.
Buffering PCM samples
Now a new component has to be created: something that can buffer PCM samples from the Stream and provide floating, overlapping window reads for the hashing process. With some thinking, I combined the inner workings of a typical circular buffer with something that can be used almost directly in the hasher process, by replacing only the class implementation.
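A minimal Python sketch of that idea, not the actual component: it assumes the stream delivers little-endian 16-bit mono PCM bytes, and it uses a `deque` with `maxlen` as the circular buffer, so only one window is ever held in memory. The class name and its parameters are invented for this example.

```python
import io
import struct
from collections import deque


class WindowedSampleBuffer:
    """Buffers 16-bit mono PCM from a byte stream and yields overlapping windows."""

    def __init__(self, stream, window_size, hop):
        if not 0 < hop <= window_size:
            raise ValueError("hop must be between 1 and window_size")
        self._stream = stream
        self._window_size = window_size
        self._hop = hop
        # A deque with maxlen drops the oldest items when new ones arrive,
        # which is the circular-buffer behaviour we want.
        self._buffer = deque(maxlen=window_size)

    def _read_samples(self, count):
        """Read up to count samples; returns fewer only at end of stream."""
        wanted = count * 2  # 2 bytes per 16-bit sample
        chunks = []
        while wanted > 0:
            chunk = self._stream.read(wanted)
            if not chunk:
                break
            chunks.append(chunk)
            wanted -= len(chunk)
        data = b"".join(chunks)
        usable = len(data) - len(data) % 2  # ignore a dangling odd byte
        return struct.unpack("<%dh" % (usable // 2), data[:usable])

    def windows(self):
        """Yield windows until the stream can no longer fill the next one."""
        needed = self._window_size
        while True:
            fresh = self._read_samples(needed)
            if len(fresh) < needed:
                return
            self._buffer.extend(fresh)
            yield list(self._buffer)
            needed = self._hop  # after the first window, only hop new samples


pcm = io.BytesIO(struct.pack("<10h", *range(10)))
for window in WindowedSampleBuffer(pcm, window_size=4, hop=2).windows():
    print(window)
```

The read loop is there because a network-backed stream may return fewer bytes than requested without being at its end; an in-memory stream like the `BytesIO` used here never does.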
Processing and creating PK_HASHes
The hasher process was reading the buffered MEMORY quasi-stream. However, it used a fairly simple interface to read the data, so luckily that interface could be extracted and reimplemented on top of buffered stream data. The output side of the class has to be rewritten as well, since it currently has no interface-able part to replace.
And so on. The later parts have to be implemented from scratch, so there is no refactoring story there.
I can call it a pyramid or a tower, because after a long time of procrastination (subconsciously processing the task at hand) I was able to put my hands on the keyboard and start. My premise was that everything has to be checked from the ground up, because right NOW I have an algorithm that produces the desired results, and since there are many steps involved, an error in a single step could be untraceable if I don’t check every step along the way.
I am kind of old-fashioned, so this paragraph won’t be very long. I use Visual Studio 2008, and for writing test code snippets I use NUnit as a launcher, so I don’t need a form or a console app to run the tests.
For dependency injection I tested Ninject, and it is great, but in this case it can’t help me, so I’ll do the implementation replacement “by hand”.
I’ll finish this post here, as this is the current state of affairs. I will keep you updated as the story develops, with new posts and fresh insights…
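One way to check every step along the way is to run the old and the new implementation of each stage on the same input and stop at the first stage whose outputs differ. The Python sketch below shows the idea with toy stages; the helper `first_divergence` and all stage functions are invented for this example and stand in for the real steps.

```python
def first_divergence(data, stages):
    """Return the name of the first stage where candidate and reference differ.

    stages is a list of (name, reference_fn, candidate_fn). Each candidate is
    fed the reference output of the previous stage, so one broken stage cannot
    hide or fake errors in the stages after it. Returns None if all match.
    """
    current = data
    for name, reference_fn, candidate_fn in stages:
        expected = reference_fn(current)
        actual = candidate_fn(current)
        if expected != actual:
            return name
        current = expected
    return None


# Toy stages standing in for the real ones.
def downmix_reference(frames):
    return [(left + right) // 2 for left, right in frames]


def downmix_candidate(frames):
    return [sum(frame) // 2 for frame in frames]


def peaks_reference(mono):
    return [max(mono[i:i + 2]) for i in range(len(mono) - 1)]


def peaks_broken(mono):
    # Deliberate bug: advances by 2 instead of 1, so windows are skipped.
    return [max(mono[i:i + 2]) for i in range(0, len(mono) - 1, 2)]


frames = [(0, 2), (4, 6), (8, 10), (1, 3)]
stages = [
    ("downmix", downmix_reference, downmix_candidate),
    ("peaks", peaks_reference, peaks_broken),
]
print(first_divergence(frames, stages))
```

In a test runner such as NUnit, each stage comparison would become its own test case, so a failure names the broken step directly.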