This is a speculative project exploring the use of Pix2Pix in a render pipeline, using it as a tool for thinking about the render process within a conceptual framework. Pix2Pix is an image-to-image translation ML algorithm that was groundbreaking when it was released in 2016. Created by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros, it amazed the creative community with its simplicity of use and incredible learning ability. Artists have produced many examples since its release, from translating black-and-white photos into colour to the well-known edges-to-cats demo. I wanted to explore using Pix2Pix as part of a renderer solution, trying to teach it the translation from a mesh to an image of its potential objective partner.
Exploring renderers, we see a myriad of techniques for realising virtual objects as "real-ish" ones. Many renderers cut corners, especially real-time ones, focusing on basic rasterization, a relationship between vector graphics (shape) and the pixels they cover on the image, and using layers of rendering and post-processing to add depth and realism. For me this evokes a world formed from separate but converging layers, intertwined plateaus of how we as humans understand and comprehend visual phenomena, dividing it into fast processes and concepts.
Much of the cutting-edge rendering uses ray tracing, a method that imitates our understanding of the physics of light. The process in some ways inverts the way a camera or the eye catches light: instead of light from external sources coming in, rays fire out. Each pixel ejects beams that reflect and refract through objects and lights, compositing the pixel's colour as they travel. This reversal of the apparatus is done to reduce the amount of computation, cutting the unimaginable number of rays in a scene down to just the fraction that would actually have hit the lens.
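As a toy illustration of this reversal, here is a minimal sketch of firing one ray per pixel from the eye into the scene. It is not how any production ray tracer works (a single hard-coded sphere, no reflection or refraction, and the `trace` function name is invented), just the camera-outward idea in code:

```python
import numpy as np

def trace(width, height, sphere_center, sphere_radius):
    """Fire one ray per pixel from the eye (at the origin) into the scene,
    the reverse of how light actually reaches a real camera."""
    image = np.zeros((height, width))
    for y in range(height):
        for x in range(width):
            # Ray direction through this pixel on a virtual image plane at z = 1
            u = (x + 0.5) / width * 2 - 1
            v = (y + 0.5) / height * 2 - 1
            d = np.array([u, v, 1.0])
            d /= np.linalg.norm(d)
            # Ray-sphere intersection: solve t^2 + 2t(d.oc) + |oc|^2 - r^2 = 0
            oc = -sphere_center                      # eye is at the origin
            b = 2 * np.dot(oc, d)
            c = np.dot(oc, oc) - sphere_radius ** 2
            disc = b * b - 4 * c
            if disc >= 0:
                t = (-b - np.sqrt(disc)) / 2         # nearest hit along the ray
                if t > 0:
                    hit = t * d
                    n = (hit - sphere_center) / sphere_radius
                    # Shade by how squarely the surface faces the viewer
                    image[y, x] = max(0.0, -np.dot(n, d))
    return image
```

Even this crude version makes the inversion explicit: the viewer is the one "illuminating" the scene, pixel by pixel.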
Diagram of Ray Tracing
Again, an apparatus that imposes a skewed construction of the dynamics, placing the viewer as the illuminator of a space rather than following the opposing natural order, a participant acting within a much larger scene.
Diagram from p. 1 of the online chapter 'Real-Time Ray Tracing' (version 1.4) of Real-Time Rendering, Fourth Edition
With both techniques we see textures imposed on bodies, either mathematically conjured from various sets of noise or kept as static image textures, both confining an object to an imposed construct. Many of these texturing and rendering processes are fiddly, complex illusions, created through blind calculations that focus on fractions, single pixels of a bigger image.
The Pix2Pix framework is a very different construct. It has two major components feeding into one another: the generator, a U-Net, and the discriminator, a PatchGAN. The U-Net generator does the actual image translation, but its ease of training and quality of outcome comes through its symbiotic relationship with the PatchGAN, each testing and training the other. This framework is an amazing organism in itself, but in this project I am more interested in the way it sees and translates the image.
The U-Net generator takes the image through convolutions down to the point of almost nothing and back up again, with a contracting path and an expansive path. The contracting path breaks the image down into overlapping segmentations while each layer passes data through skip connections to the expansive path, which deconvolves the image back up to its original resolution. In a way it weaves the different qualities and attributes of the image back into itself: abstractly speaking, what it looks like as a silhouette, a segment, a texture. The discriminator similarly takes the image through convolutions, down to a single outcome or a square of predictions judging the verity of the generated and real images.
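The contracting and expansive paths can be sketched with Keras building blocks in the style of the TensorFlow Pix2Pix example. This is a hedged, two-level miniature for illustration, not the full eight-level generator; layer counts and filter sizes here are my own simplification:

```python
import tensorflow as tf

def downsample(filters, size):
    # Contracting-path block: a stride-2 convolution halves the resolution
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, size, strides=2, padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(),
    ])

def upsample(filters, size):
    # Expansive-path block: a transposed convolution doubles the resolution
    return tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(filters, size, strides=2, padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])

inputs = tf.keras.layers.Input(shape=[512, 512, 3])
d1 = downsample(64, 4)(inputs)    # 512 -> 256
d2 = downsample(128, 4)(d1)       # 256 -> 128
u1 = upsample(64, 4)(d2)          # 128 -> 256
# The skip connection: contracting-path features are woven back in on the way up
u1 = tf.keras.layers.Concatenate()([u1, d1])
outputs = tf.keras.layers.Conv2DTranspose(
    3, 4, strides=2, padding='same', activation='tanh')(u1)  # 256 -> 512
model = tf.keras.Model(inputs=inputs, outputs=outputs)
```

The `Concatenate` line is the "weaving" described above: what the image looked like on the way down is handed directly to the layer rebuilding it on the way up.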
The main difference I want to draw out between this and rendering/texturing techniques is that the generation of the image happens not through blind calculations but through a web of causality. The texture, form and overall outcome arise from many movements within the input field; speculatively speaking, this translation is similar to the way we conceive the objective world, understanding an object through a network of signifiers, perceptions and contexts which interweave to form our virtual representation of it.
I was originally thinking about how we could envisage a generator model as a pure renderer, taking raw mesh data (pre-rendered verts, faces and vertex colours) and using the model to imagine what it could look like, generating as a renderer does, from abstract arrays to comprehensible images. This was my ultimate goal, but unfortunately the complexities of that task are beyond my current Python and TensorFlow skills and understanding, so I have been focusing on creating a tool for the rendering pipeline.
Moving on from this realisation, I started to think about the possibilities of the project: how could I integrate this way of imagining objects and textures into a rendering pipeline? Looking at the work of Scott Eaton for inspiration, we can see the power that Pix2Pix has, generating realistic and complex fleshy forms and clay compositions. He uses his trained models to imagine movements that would be near impossible to generate otherwise, using the U-Net's motion of unweaving and reweaving to produce photorealistic imaginations.
Thinking about Eaton's work and the strengths of the Pix2Pix model, I decided to move into using it for texture generation, taking renders of simple meshes and re-rendering them with the complexities of their imagined real counterparts. Looking at successful Pix2Pix projects, I knew I would need to focus the aesthetic of the translation on something quite specific but complex, something easily recognisable but with space to be abstracted; for this I chose crystals, minerals and rocks. For me the visualisation of minerals also had a conceptual standpoint, as a constructor of the objective world, a natural building block: using ML as a new way to re-render our image of the natural world, re-weaving our depiction and constructs of it.
At first I primarily scraped online to build a dataset of crystals and minerals, building it up to roughly 8k images using resources like the Natural History Museum database. I created a basic preparation pipeline for processing them with an openFrameworks application I built. The application used ofxDelaunay to create a faux triangulated mesh built from the points of interest predicted by OpenCV. I used the basic TensorFlow Pix2Pix core example as a model.
Improving the Dataset
In a talk, Scott Eaton discusses how he refined his image set to produce such realistic outcomes. He dedicates a lot of time to creating immaculate training sets, standardising the lighting to a regular three-point set-up as well as restricting the clothing his models wear. This trains his Pix2Pix models to produce consistent photorealistic predictions with really complex behaviours, such as casting shadows on itself and maintaining a consistent light source within the scene. All of these elements would be very necessary to create a high-quality rendering tool.
Thinking about Eaton's process, I started searching for a more regular source of images for my dataset. Luckily I found a perfect video of minerals rotating 360°, all clearly presented on a white turntable with a white background. This provided me with 240 specimens and an almost unlimited number of stills to use for my dataset; I had struck gold, literally.
Returning to the preparation pipeline, I realised that it was not creating a language that was translatable to mesh; the faux mesh was too faux. So how could I create a language for the model, one that is realistic to the minerals' forms and the mesh inputs, whilst still providing space for imagination? I realised an amazing attribute of the dataset I had compiled: it consisted of 360° rotating videos of each mineral specimen, meaning I could push this data through a photogrammetry pipeline and recreate the specimens as virtual 3D meshes. I put together a pipeline for batch processing the images into meshes, all coded in Python. It follows these steps:
- First it extracts a sparse and then a dense point cloud from the images of each specimen with COLMAP, recreating the form and where the cameras were placed.
- It then creates a Delaunay mesh reconstruction of the point cloud with COLMAP and ports that into Houdini, a procedural 3D software. I created a node tree to refine the meshes, procedurally removing traces of the base plate, reducing the poly count and transferring the colour data from the point cloud to the blank mesh.
- Finally the entire COLMAP scene is imported into Blender with the refined mesh, and each camera is rendered out and paired with the original photo.
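The COLMAP half of the steps above can be sketched as a batch script driving the command-line tools. This follows COLMAP's standard sparse-to-dense-to-mesh workflow; the directory layout and function names are my assumptions, and exact flags may vary between COLMAP versions:

```python
import subprocess

def colmap_commands(image_dir, workspace):
    """The COLMAP CLI calls for one specimen, in order: features, matching,
    sparse mapping, undistortion, dense stereo, fusion, Delaunay meshing."""
    db = f"{workspace}/database.db"
    sparse = f"{workspace}/sparse"
    dense = f"{workspace}/dense"
    return [
        ["colmap", "feature_extractor", "--database_path", db,
         "--image_path", image_dir],
        ["colmap", "exhaustive_matcher", "--database_path", db],
        ["colmap", "mapper", "--database_path", db,
         "--image_path", image_dir, "--output_path", sparse],
        ["colmap", "image_undistorter", "--image_path", image_dir,
         "--input_path", f"{sparse}/0", "--output_path", dense],
        ["colmap", "patch_match_stereo", "--workspace_path", dense],
        ["colmap", "stereo_fusion", "--workspace_path", dense,
         "--output_path", f"{dense}/fused.ply"],
        ["colmap", "delaunay_mesher", "--input_path", dense,
         "--output_path", f"{dense}/mesh.ply"],
    ]

def reconstruct(image_dir, workspace):
    # Run each stage in order; a failed stage aborts this specimen
    for cmd in colmap_commands(image_dir, workspace):
        subprocess.run(cmd, check=True)
```

Looping `reconstruct` over the 240 specimen folders gives the batch behaviour; Houdini and Blender then pick up the resulting `mesh.ply` files.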
From here I started to test out different variations of shaders and mesh complexity (polycount) to see what worked best to represent form to a Pix2Pix model, what would create a good language for the generator to tune into and translate. See the results on the next page.
Above is a very simplified mesh with basic vertex colouring
Below is a more detailed mesh with a wireframe overlay
Running the newly prepared image set through my updated 512×512 Pix2Pix model, I started to get some really amazing results. I found that the simplified mesh with basic vertex colours gave great results, but a lot of the time the model did not seem to understand depth in the image, so it was not able to segment the different faces/layers of the mesh body, sometimes blurring textures between them or reconstructing the form in unwanted but still interesting ways.
To help the model understand the form more clearly, I added a wireframe skeleton which faded from white to black with depth from the camera. This definitely helped it start to understand the structure of the forms, but it still had flaws. When it came to using it for the proposed purpose of animation, there was a lot of slippage, where the texture would flow all over the body rather than holding its place.
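The depth-fade cue boils down to a simple mapping from camera distance to grayscale. A minimal sketch, assuming per-vertex shading and an invented `depth_fade` name (in the project this was achieved in the render itself, not computed like this):

```python
import numpy as np

def depth_fade(vertices, camera_pos, near=None, far=None):
    """Grayscale value per vertex: 1.0 (white) nearest the camera fading to
    0.0 (black) at the furthest point, giving the model a depth cue."""
    d = np.linalg.norm(vertices - camera_pos, axis=1)
    near = d.min() if near is None else near
    far = d.max() if far is None else far
    # Normalise distance into [0, 1] and invert so near = white
    return 1.0 - np.clip((d - near) / (far - near), 0.0, 1.0)
```

Baked onto the wireframe, the gradient encodes which edges sit in front of which, exactly the layering information the flat render was missing.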
Rendering with Pix2Pix
I implemented this pipeline in Blender, as you can interface with it fully through its Python console, easily loading a model saved in .h5 format and using just the U-Net generator, as that is all we need. The pipeline has these major elements:
- Renders out the scene as a basic mesh with vertex colours in Eevee, so it is really fast.
- Processes the images through the trained U-Net to generate the texture.
- Projects the generated image back onto the form as a texture.
- Manually sets the shader preferences for the optimal outcome; I used the Principled shader.
- Renders out in Cycles (ray tracing) with a script that swaps in the generated texture for each frame.
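The inference step in the middle of that pipeline is mostly bookkeeping around the generator: Pix2Pix works in [-1, 1], renders and textures live in 8-bit [0, 255]. A hedged sketch of that plumbing (function names are mine; `generator` stands for the U-Net loaded from the .h5 file, e.g. with `keras.models.load_model(..., compile=False)`):

```python
import numpy as np

def to_model_space(image):
    # Pix2Pix expects float inputs normalised to [-1, 1]
    return (image.astype(np.float32) / 127.5) - 1.0

def to_image_space(prediction):
    # Map the generator's tanh output back to displayable 8-bit pixels
    return np.rint(np.clip((prediction + 1.0) * 127.5, 0, 255)).astype(np.uint8)

def generate_texture(generator, render):
    """Run one Eevee render (H x W x 3, uint8) through the U-Net generator
    and return the imagined texture ready to project back onto the mesh."""
    batch = to_model_space(render)[np.newaxis, ...]  # add the batch dimension
    prediction = np.asarray(generator(batch))[0]     # Keras models are callable
    return to_image_space(prediction)
```

In Blender the result would be written into an image datablock and wired into the Principled shader's base colour; those `bpy` details are omitted here.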
The outcomes from this looked really interesting, creating flowing and merging patterns that have the very distinct textures of rocks and minerals, making the forms feel as if they are passing through time, forming and shedding layers of sediment. If streamlined and refined this could be a really powerful tool, but at the moment it relies on a slightly clunky IPython interface, which is easy enough to use but isn't as accessible as an in-GUI tool or addon.
The current 3-channel Pix2Pix model has problems distinguishing between layers within the mesh, what is on top of what, so projecting the generated textures sometimes flattened the models, blending limbs and folds into one another. To pronounce the forms more, I added Fresnel and displacement to the shader, which brought out the movements more clearly. This full combination amplifies the generated textures, creating an outcome that would be hard to produce through other means, allowing the user to simply paint the vertex colours of a mesh in Blender and generate it as a fully textured rock/mineral.
At the start of this project I set out to form a new way of rendering mesh objects, one grounded in an interwoven understanding of materiality and the movements of matter as we see it, aiming to invert the apparatus of ray tracing and other rendering techniques to realise a new way of imagining the objective world virtually; taking it from the point of the spectator and object to the point of an interwoven web of dependency, where spectator, perspective and object are all interrelated. Have I fully dismounted this structure and borne a new way of imaging our objective world, free of our human constructs? Not really, but pushing to see and imagine through tools like this, which can understand relations we could not begin to comprehend, is definitely a start. The outcomes show this through their realisation of minerals in a way that would be unimaginable otherwise, transforming the possibilities of a raw mesh form into an actuality, and a complex and rich one at that.
This project has been really rewarding: I have learnt new languages and processes and come out with a powerful tool that has produced beautiful results. I am a little disappointed that I couldn't get the 8-channel .EXR model working, but I think that was quite adventurous. I am excited to push this workflow further and spend time exploring its possibilities, developing its interaction and presentation, building off of the current concept and framework. Moving on, I have some ideas that I didn't have time to test out, including reversing the COLMAP translation to place these rocks back into the real world, but these will all be unravelled in time.