REAL-TIME VR STYLE TRANSFER
In this project, I implement a fast real-time style transfer network in Unity for VR experiences. The idea came from work published by Unity that applies real-time style transfer to traditional video games rendered to a flat 2D display. Unity offers a library called Barracuda for running neural network inference, which makes this possible. I thought it would be a novel effort to borrow these methods for rendering to a VR headset, something I had not seen published online.
There are likely several reasons why style transfer has not been explored for VR. First, VR requires a higher frame rate than flat-screen games for a comfortable experience, and it also requires rendering one image per eye, which means running the network twice per frame; together this is performance heavy. Second, visual artifacts are much more apparent to a VR user, so temporal consistency is important, and that is a known challenge for this method. Lastly, the style textures applied to the two stereoscopic images need to be spatially consistent so that they align and are perceived as one uniform image.
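To make the performance constraint concrete, here is a rough back-of-the-envelope sketch of the time available per network pass. The refresh rates are illustrative assumptions (I did not measure or fix a target rate here), and the calculation ignores all other rendering work, so the real budget is tighter.

```python
def per_inference_budget_ms(refresh_hz: float, views_per_frame: int = 2) -> float:
    """Upper bound on the time per network pass, in milliseconds.

    Assumes one network pass per eye and no other work in the frame.
    """
    frame_budget_ms = 1000.0 / refresh_hz
    return frame_budget_ms / views_per_frame


# Hypothetical refresh rates, chosen only for illustration.
for hz in (60, 72, 90):
    print(f"{hz} Hz: at most {per_inference_budget_ms(hz):.2f} ms per eye")
```

Even under these generous assumptions, each stylization pass has only a few milliseconds, which is why the network size matters so much later on.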
Working alone and within the short span of this project, my goal is to get a better understanding of this niche problem. I first implement a style transfer environment in VR, which also helps me understand the problems above first-hand. Then, I explore ways to improve the quality of the style transfer output by changing how the style transfer network is trained.
This project also hopes to have fun with the idea, exploring what this could mean for creating novel immersive experiences and interactions in the future.
For implementing style transfer in Unity, Unity published a blog post promoting its inference engine's ability to run style transfer in-engine. Another developer, Christian Mills, also writes on his blog about Barracuda and Unity and open-sources his projects. Mills' implementation of style transfer is what I build on for my VR implementation.
To train the fast style transfer networks, I use an implementation based on the original paper, Perceptual Losses for Real-Time Style Transfer and Super-Resolution. An implementation of this paper can be found in the official PyTorch examples repository.
Example of Fast Style Transfer in Unity from Christian Mills
To explore what style transfer could add to a VR experience, I trained multiple style transfer networks so that the user can cycle through different styles. On the right-hand controller, the trigger cycles through the styles and the inner grip turns style transfer on and off. This also helps when comparing different models and styles against each other. As a UI feature, the style image used to train the active model is attached to the palm of the user's right hand, so the user knows which style is currently applied and how it is affecting their experience.
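The controller logic described above amounts to a small piece of state. The sketch below shows that logic only; it is written in Python for illustration rather than as the Unity script, and the class, method, and style names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StyleSwitcher:
    """Tracks which trained style model is active and whether it is applied."""
    style_names: List[str]
    index: int = 0
    enabled: bool = True

    def on_trigger(self) -> str:
        """Trigger press: advance to the next style, wrapping around."""
        self.index = (self.index + 1) % len(self.style_names)
        return self.current_style()

    def on_grip(self) -> bool:
        """Inner grip press: toggle style transfer on or off."""
        self.enabled = not self.enabled
        return self.enabled

    def current_style(self) -> str:
        """Name of the style whose image would be shown on the palm."""
        return self.style_names[self.index]


# Placeholder style names, not the styles used in this project.
switcher = StyleSwitcher(["style_a", "style_b", "style_c"])
switcher.on_trigger()   # now on "style_b"
switcher.on_grip()      # style transfer off, selection is kept
```

Keeping the selected index separate from the on/off flag means toggling style transfer does not lose the user's place in the list.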
For the loss model, I use the VGG-19 model instead of the VGG-16 from the original paper. This pre-trained network will be useful for extracting the features when evaluating the content and style of the image.
The width of the layers in the model is parameterized by a filter tuple, which by default is (32, 64, 128). A network of this size is too slow to run inference in real time for VR. A smaller network with filters (8, 16, 32) seems to work well enough for Unity inference on my current setup.
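To get a feel for how much the filter tuple changes the network size, the sketch below counts trainable parameters. It assumes the layer layout of the transform network in the PyTorch fast neural style example (a 9x9 input convolution, two 3x3 downsampling convolutions, five residual blocks, two 3x3 upsampling convolutions, a 9x9 output convolution, and affine instance normalization after every convolution except the last). That layout is my assumption about the network, and the function names are made up for this example.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution."""
    return c_in * c_out * kernel * kernel + c_out


def norm_params(channels: int) -> int:
    """Scale and shift of an affine instance normalization layer."""
    return 2 * channels


def transform_net_params(filters=(32, 64, 128), n_residual: int = 5) -> int:
    """Parameter count under the assumed layer layout described above."""
    f1, f2, f3 = filters
    total = 0
    # Downsampling path
    total += conv_params(3, f1, 9) + norm_params(f1)
    total += conv_params(f1, f2, 3) + norm_params(f2)
    total += conv_params(f2, f3, 3) + norm_params(f3)
    # Residual blocks: two conv + norm pairs each
    total += n_residual * 2 * (conv_params(f3, f3, 3) + norm_params(f3))
    # Upsampling path (the final convolution has no normalization)
    total += conv_params(f3, f2, 3) + norm_params(f2)
    total += conv_params(f2, f1, 3) + norm_params(f1)
    total += conv_params(f1, 3, 9)
    return total


large = transform_net_params((32, 64, 128))
small = transform_net_params((8, 16, 32))
print(f"default: {large:,} params, small: {small:,} params, "
      f"ratio: {large / small:.1f}x")
```

Under these assumptions the smaller tuple has roughly 15 times fewer parameters. Parameter count is only a proxy for inference time, but it gives a sense of why the smaller network is so much cheaper to run.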
For training, I use the Adam optimizer with a learning rate of 0.001. The loss for both style and content features is evaluated with mean squared error (MSE). As training inputs, I use the COCO 2014 dataset, a collection of 82,783 images.
For weighting the Style and Content Loss the default weights used are:
Content Weight = 1e5
Style Weight = 1e10
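As a minimal sketch of how these two weights combine, the code below computes a content term and a style term with MSE and sums them. It uses plain Python lists instead of tensors and a single feature layer for readability; comparing Gram matrices for the style term follows the original paper, while the normalization and the helper names are assumptions of this example.

```python
CONTENT_WEIGHT = 1e5
STYLE_WEIGHT = 1e10


def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(rows):
    return [v for row in rows for v in row]


def gram_matrix(features):
    """features: C channels, each a flat list of H*W activations.

    Returns the C x C matrix of channel correlations, normalized by the
    total number of activations (an assumed normalization).
    """
    n = len(features) * len(features[0])
    return [[sum(x * y for x, y in zip(ci, cj)) / n for cj in features]
            for ci in features]


def total_loss(output_feats, content_feats, style_feats):
    """Weighted sum of the content and style terms for one feature layer."""
    content = mse(flatten(output_feats), flatten(content_feats))
    style = mse(flatten(gram_matrix(output_feats)),
                flatten(gram_matrix(style_feats)))
    return CONTENT_WEIGHT * content + STYLE_WEIGHT * style
```

The large gap between the two weights reflects that the Gram-matrix differences are numerically much smaller than the raw feature differences, so the style term needs a far larger multiplier to matter.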
Architecture from the original Fast Neural Style Transfer paper
When the base style transfer network is applied to stereoscopic pairs, the two eyes observe images whose styles do not spatially align; this is especially noticeable in styles with more textural output. With highly textural styles, your eyes become tired within seconds of viewing. With less textural styles the mismatch may not be drastically noticeable, but eye fatigue still occurs, just over a longer span of a few minutes.
To account for this, I propose my own adjustment to the losses used to update the image transform network, loosely based on the ideas of the Stereoscopic Neural Style Transfer paper. In short, that paper introduces a Disparity Loss along with an additional network that predicts the disparity between two stereo images. The loss penalizes the pixel-wise difference between the stylized stereo images in the overlapping regions determined by the predicted disparity.
In addition to the Content and Style Loss, I add a term that I call the Pseudo-Stereo Loss, which is calculated alongside the original two style transfer losses using the VGG-19 network. It is 'pseudo' because the stereo pair is only approximated by an overly simplistic warping of a single image from the COCO dataset, whereas the Disparity Loss is trained with synthetic stereoscopic pairs from the FlyingThings3D dataset. The warp is implemented using the perspective transform from the Torchvision library, stretching the left-most edge to create the warped left image and the right-most edge to create the warped right image. The stretch is parameterized as a 10% increase to the height of the input image.
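The warp can be described by where the four image corners end up. The sketch below computes those corner lists in the order Torchvision's functional perspective transform takes them (top-left, top-right, bottom-right, bottom-left). Splitting the 10% stretch evenly above and below the edge is an assumption of this example, and the function name is hypothetical.

```python
def pseudo_stereo_corners(width: int, height: int, stretch: float = 0.10):
    """Corner coordinates for the left and right pseudo-stereo warps.

    Returns (startpoints, left_endpoints, right_endpoints), each a list of
    four [x, y] pairs: top-left, top-right, bottom-right, bottom-left.
    """
    # Assumption: the edge grows by `stretch * height` in total, half on
    # top and half on the bottom.
    pad = round(height * stretch / 2)
    right, bottom = width - 1, height - 1

    startpoints = [[0, 0], [right, 0], [right, bottom], [0, bottom]]
    # Left view: only the left-most edge is stretched vertically.
    left_endpoints = [[0, -pad], [right, 0], [right, bottom], [0, bottom + pad]]
    # Right view: only the right-most edge is stretched vertically.
    right_endpoints = [[0, 0], [right, -pad], [right, bottom + pad], [0, bottom]]
    return startpoints, left_endpoints, right_endpoints


start, left_end, right_end = pseudo_stereo_corners(256, 256)
print(left_end)
print(right_end)
```

Each warped view would then come from passing `start` and the matching endpoint list to the perspective transform. The two views differ only in which vertical edge is stretched, which is a crude stand-in for the viewpoint shift between two eyes.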
The Pseudo-Stereo Loss is quite simple: it takes the warped pair of images and stylizes each one separately with the transform network. Features of the two stylized images are then extracted with the VGG-19 network and the MSE between them is calculated, much like the Content Loss. This loss is weighted and added as a third term to the total loss used to update the network. I try different weights to tune the Pseudo-Stereo Loss, basing them on the weight of the Content Loss.
To test the Pseudo-Stereo Loss, I visually compare several models trained on this particular style, chosen because the base network produced particularly strong visual artifacts with it:
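The description above can be sketched as follows. The `stylize` and `extract_features` arguments are hypothetical stand-ins for the transform network and the VGG feature extractor, and features are flat Python lists here rather than tensors, so this shows the structure of the loss and not the actual training code.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pseudo_stereo_loss(left_img, right_img, stylize, extract_features):
    """MSE between the features of the two separately stylized warped views."""
    left_feats = extract_features(stylize(left_img))
    right_feats = extract_features(stylize(right_img))
    return mse(left_feats, right_feats)


def combined_loss(content, style, pseudo_stereo,
                  content_weight=1e5, style_weight=1e10,
                  pseudo_stereo_weight=1e5):
    """Three-term weighted sum used to update the transform network."""
    return (content_weight * content
            + style_weight * style
            + pseudo_stereo_weight * pseudo_stereo)


# Toy usage with identity functions in place of the real networks.
identity = lambda x: x
ps = pseudo_stereo_loss([1.0, 2.0], [1.0, 4.0], identity, identity)
print(combined_loss(content=0.5, style=0.0, pseudo_stereo=ps,
                    pseudo_stereo_weight=1e4))
```

The default for `pseudo_stereo_weight` is set equal to the content weight only as a starting point, in line with basing the weight on that of the Content Loss.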
No Pseudo-Stereo Loss
Pseudo-Stereo Loss at 1e4
Pseudo-Stereo Loss at 1e5
Pseudo-Stereo Loss at 4e5
Comparing image to image, the models trained with the Pseudo-Stereo Loss weighted at 1e4 and 1e5 stylize noticeably more consistently than the model without it. That said, every weight greater than 1e5 that I tested made the style transfer more prone to flickering and instability. This is likely because the weight of this loss is too strong, because it exceeds the weight of the Content Loss, or both. Either way, having to tune three loss weights makes parameterizing these networks more difficult.
When the loss does improve the experience, it is because the frames are stylized more consistently by the network, which reduces eye strain. That said, it is far from perfect in the details, though it perhaps makes the network ever so slightly more temporally consistent as well.
Pseudo-Stereo Loss at 8e5
Pseudo-Stereo Loss at 1e6
Base network without the Pseudo-Stereo Loss
Network with Pseudo-Stereo Loss at 1e4
Distinct characteristics emerge across the networks depending on the weight of the Pseudo-Stereo Loss, at least for this particular style image and set of parameters. In the base network, the flickering textures often do not align, which tires the eyes. The 1e4 network still flickers a lot, but the flickering is better aligned spatially between the eyes, which makes a noticeable difference in the VR headset. The 1e5 network largely eliminates the flickering, so the two eyes see a much more similar output for the most part. That said, oddly, a number of prominent patterned textures still do not align, similar to the base model.
One interesting takeaway is that, despite the 1e5 network being imperfect, toning down the textural effects of the style image helps significantly for VR. Highly textured style images are also notably difficult when it comes to temporal consistency in video style transfer. It seems that the constraint imposed by the Pseudo-Stereo Loss can tone down the style's textural output without having to decrease the weight of the Style Loss.
Network with Pseudo-Stereo Loss at 1e5
In conclusion, I hope my project can help us speculate about the potential for style transfer to feature in video games and VR. We saw that introducing a new Pseudo-Stereo Loss did have an effect on the network's output, which at times might be preferable, but at higher weights it also led to temporal instability and other types of artifacts. Though there are technical limitations, I believe that if this area of research were given more attention, more sophisticated architectures could make this effect very promising, provided that proper computing resources are used. Personally, I felt the ‘wow’ effect of style transfer more strongly in VR than in 2D, simply due to its immersive and surreal nature - it truly is incredibly trippy, in a unique way that I would recommend anyone experience.
Something I tried to implement but could not get working was targeted style transfer, where style transfer is only rendered on select objects in the scene via a masking layer. I had this working in 2D, but unfortunately not in VR, as I had trouble translating the 2D mask to the stereo views. For practical applications, I think style transfer on an entire scene can be a lot, especially in VR. Restricting it to select objects allows for a more intentional application of style textures, which could be a really fun effect to play with. Imagine in VR, style transfer only activates once you touch an object - now wouldn't that be cool!