Although I don’t usually outline ongoing projects, I’ve decided to write up a non-technical summary of my current one so that future related technical posts will have some context.

Parameters of the Project

I’m working on a project whose goal is to transform any image into an original piece of music. I’ll list out some of the self-imposed parameters of the project.

  • Any image that can be represented as raw pixel data can be used as input.
  • Each image must generate the exact same piece of music each time it is provided to the transformation plugin. In other words, the transformation plugin is a pure function, with only the image data as its input.
  • The output of a transformation function will be both a standalone MIDI file and a WAV file generated with a soundfont specified by the transformation plugin.
  • Using random elements, even if they are seeded with a number generated from the image, is discouraged. (This one I’m not as confident about yet; take it as a loose guideline).

Beyond those rules, the field is pretty wide open. This project is basically about algorithmic composition, a field with a fair amount of prior research. The project has already evolved quite a bit since I started, and I anticipate it will continue to do so as I learn more.

At this point, I’m focusing on an iOS app as the carrier for this technology because the camera roll is an easily accessible data source for photos.


There are four primary modules I’ve planned for.

  1. Deconstructing the image data into unique and useful representations.
  2. Transforming the image data into an intermediate musical representation.
  3. Synthesizing the musical representation into a playable MIDI and/or audio form.
  4. UI, because I eventually want this system to be used by the masses.


Deconstructing

An image can be deconstructed into many forms. Its raw pixel data can be interpreted as grayscale, RGB, HSV, and other color spaces. These numbers can be normalized to a 0 to 1 floating point scale and used to make various micro-level decisions during the composition process. For example:

If the first pixel’s red value is greater than half its maximum value, add a kick drum to the second beat of the first measure. Otherwise, add it to the third beat of the first measure.
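As a sketch of that kind of micro-level decision, here’s roughly what it could look like; the function names and the event format are illustrative, not the app’s actual API:

```python
def normalize(value, max_value=255):
    """Map a raw channel value (0..max_value) onto a 0-1 float scale."""
    return value / max_value

def place_kick(first_pixel_rgb):
    """Pick the kick drum's beat in measure 1 from the first pixel's red value."""
    red = normalize(first_pixel_rgb[0])
    beat = 2 if red > 0.5 else 3
    return {"instrument": "kick", "measure": 1, "beat": beat}

print(place_kick((200, 40, 40)))  # red 200/255 ≈ 0.78 → beat 2
print(place_kick((90, 40, 40)))   # red 90/255 ≈ 0.35 → beat 3
```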

This raw color data can be manipulated further. We can average all pixels, average each row or column of data, or take the absolute difference between neighboring pixels.
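Those manipulations are simple enough to sketch directly, here on a toy grayscale image (none of this is the app’s actual code):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Toy 3x4 grayscale image: rows of 0-255 pixel values.
img = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]

overall = mean([p for row in img for p in row])   # average of all pixels
row_avgs = [mean(row) for row in img]             # one value per row
col_avgs = [mean(col) for col in zip(*img)]       # one value per column
# Absolute difference between horizontally adjacent pixels.
h_diffs = [[abs(a - b) for a, b in zip(row, row[1:])] for row in img]
```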

Higher-level image analysis can also be done. The number of faces in the image can be counted and used by the algorithm, as can the percentage of the image covered by faces.
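Assuming a face detector that returns bounding boxes (the boxes below are made up for illustration), deriving those two numbers is straightforward:

```python
def face_coverage(image_w, image_h, face_boxes):
    """Fraction of the image area covered by face bounding boxes.
    face_boxes: (x, y, w, h) tuples; assumed non-overlapping for simplicity."""
    face_area = sum(w * h for _, _, w, h in face_boxes)
    return face_area / (image_w * image_h)

boxes = [(100, 80, 50, 60), (300, 90, 40, 50)]  # hypothetical detector output
num_faces = len(boxes)
coverage = face_coverage(640, 480, boxes)
```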

There are dozens, maybe hundreds of useful transformations that can be done. My goal thus far has been to develop a base of deconstructors which can be expanded indefinitely later.


Transforming

The most creative part of the project is using the data we’ve deconstructed from the image to algorithmically generate unique compositions that hopefully sound pleasing to the ear.

The eventual goal behind the transformation step is that anyone (even non-programmers) will be able to write their own transformation plugin for use in the app. Someone with a hip-hop production background can write a hip-hop transformation plugin. Someone with a piano background can write a plugin that strictly generates piano compositions. Even those in the same genre will have different ideas of how they can use raw data to drive a decision engine, or make their own set of musical grammars. Users can choose between plugins much as they choose between Instagram filters.

The only requirement of a transformer is that it generates a MIDI-like representation with a few features removed and a few parameters added. Of course, MIDI itself is too low level to compose the sort of structured music we’re used to hearing. Thus, in tandem with writing the transformer, I’ve also been writing a simple DSL for composing. It’s still very much a work in progress and may only be useful for certain kinds of music. Keeping the required output format as generic as possible will allow other DSLs to be used.
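To make the idea of a MIDI-like intermediate representation concrete, here’s a minimal sketch of what a transformer’s output could look like. The field names and the phrase helper are hypothetical, not the project’s actual format; the key idea is that timing is expressed in beats rather than raw MIDI ticks:

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int        # MIDI note number, 0-127
    start: float      # position in beats from the start of the piece
    duration: float   # length in beats
    velocity: int = 96

def c_major_arpeggio(start_beat=0.0):
    """A transformer might emit phrases like this instead of raw MIDI events."""
    pitches = [60, 64, 67, 72]  # C4 E4 G4 C5
    return [Note(p, start_beat + i * 0.5, 0.5) for i, p in enumerate(pitches)]

phrase = c_major_arpeggio()
```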

I won’t dig too much into the creative process itself in this post.


Synthesizing

MIDI on iOS and macOS is still a bit overwhelming. There are at least a few overlapping frameworks of varying age, focus, documentation quality, and complexity, some of which are still under semi-active development. Not only are there Apple frameworks, but also many popular third-party frameworks that supplement them.

My focus is non-realtime processing, which in terms of tooling often takes a back seat to realtime MIDI, e.g. MIDI generated by keyboards.

There are a few tasks that our synthesizing module is responsible for:

  • Converting the intermediate representation from the transform plugin into Apple’s MIDI format.
  • Generating a standard MIDI file, playable by other music applications.
  • Playing the MIDI file through the speakers using a soundfont.
  • Generating a WAV or mp3 file using the MIDI file and soundfont.
  • Generating a movie file with the original image and the generated mp3 file for sharing purposes.

Each of these steps uses a different set of technologies and frameworks.
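To make the standard-MIDI-file step concrete, here’s a minimal sketch that writes a format-0 SMF containing a single quarter note, using only the Python standard library. The real pipeline would go through Apple’s frameworks; this just illustrates the file format itself:

```python
import struct

def vlq(n):
    """Encode n as a MIDI variable-length quantity (7 bits per byte)."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def single_note_midi(pitch=60, ticks=480, velocity=96):
    """Format-0 standard MIDI file with one quarter note on channel 0."""
    events = (
        vlq(0) + bytes([0x90, pitch, velocity])   # note on at t=0
        + vlq(ticks) + bytes([0x80, pitch, 0])    # note off one quarter later
        + vlq(0) + bytes([0xFF, 0x2F, 0x00])      # end-of-track meta event
    )
    # Header chunk: 6-byte payload = format 0, 1 track, ticks per quarter note.
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks)
    track = b"MTrk" + struct.pack(">I", len(events)) + events
    return header + track

data = single_note_midi()
# with open("note.mid", "wb") as f: f.write(data)
```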


UI

At the time of this writing, I haven’t tackled any of the UI yet. The most I’ve done is pop up a UIImagePickerController to facilitate my own transform plugin development.

My goal for a shippable UI is pretty simple: an image picker that allows easy sampling of different images from the camera roll. Once the user selects an image, a video containing the song playing over the image is produced for sharing. Alternatively, since they’ll be available anyway, the user can choose to export the mp3 or MIDI file.

Eventually, once I’ve made more plugins or commissioned them from others, I’d like to have an interface where you can easily preview your photos with each of the available filters.

This app is probably the most iceberg-y one that I’ve worked on; one where the bulk of the complexity is behind the scenes and the UI is shallow.

Wrap Up

Those are the basics of the four primary modules of this project: Deconstructing, Transforming, Synthesizing, and UI.

In future posts I’d like to talk about some of the problems I’ve encountered from each module.