Documenting what drove the decision-making around x2 vs x4 in the upscaling process.
First, some bridge views.
The x4 starts to break down the buildings in this one. For example, the buildings' soft lines are preserved at x2, but at x4 we lose any semblance of the textures and just get broad, thick brush strokes of building. Seemingly, the wrong elements get emphasized.
This Times Square one with the billboards is so much fun to watch go through the upscaler: people start to turn into blobs and text gets garbled (an easy tell that an upscaler is at work in a video you are watching).
These inputs are 512×512 crops from a source image of 4096×4096 per eye (which is what the Canon outputs at its max 8K mode). Consider that when you are watching this video, this part of the image is roughly what fills your field of view, so a 512×512 slice of a 4096 frame is the original resolution we actually have to work with. Will the new Blackmagic camera mean that our original source resolution already comes out of the camera at 2x? What can the upscalers do from there?
The upscaler model takes the 512×512 image and turns it into 1024×1024 at 2x, or 2048×2048 at 4x.
For the Apple Vision Pro, 1024×1024 for this slice of the image should satisfy the screen's needs and fill the image, so each pixel the screen turns on and off is on par with the video pixel data?
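A quick sketch of that resolution math in Python (the sizes come from the numbers above; the non-overlapping tiling at the end is my simplification, not necessarily how the workflow actually slices the frame):

```python
# Resolution math from the paragraphs above.
SOURCE_PER_EYE = 4096   # Canon output per eye at its max 8K mode
TILE = 512              # the crop fed into the upscaler

for factor in (2, 4):
    out = TILE * factor
    print(f"x{factor}: {TILE}x{TILE} tile -> {out}x{out} "
          f"(full eye would be {SOURCE_PER_EYE * factor}x{SOURCE_PER_EYE * factor})")

# How many 512x512 crops cover one eye, assuming no overlap between tiles
tiles_per_eye = (SOURCE_PER_EYE // TILE) ** 2
print(f"{tiles_per_eye} tiles per eye")
```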
Anyways, lots of blabber. Take a look at these comparisons!
Finally, one I have attempted to go back and re-shoot to see if the focus could be better, but then we moved on in products and now it's no longer the Vision Pro. Oh well. Working from the original source, there is a considerable upgrade in the smoothness of the light that wraps around the Apple logo, while the Vision Pro text is obscured by the parking sign, whose text is just continuous brush strokes.
I think cleaning up the light really improves this image's video quality.
Between the x2 and x4 again … things start to lose their structure altogether. What was once a person's head turns into a wall painting? The reflective elevator interior sure confuses the upscaler as well.
In the end, the x2 upscale looks to be the one of choice.
Comfy workflow used for this blog post: nov-02-blog-post.json
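If you want to run that workflow headlessly instead of through the UI, something like this sketch should work against a local ComfyUI server, assuming the JSON was exported via "Save (API Format)" and the server is on its default port:

```python
import json
import urllib.request

# Load the workflow saved from ComfyUI. This needs to be the API-format
# export, not the regular UI save.
with open("nov-02-blog-post.json") as f:
    workflow = json.load(f)

# Queue it on a locally running ComfyUI instance (default port 8188).
payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # response includes the queued prompt_id
```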
The next thing I wanted to write down was some brief numbers for execution time. Each input frame is roughly 5 MB and the output of the upscale process is usually around 150 MB, meaning reading back from the GPU and writing to disk takes real time on top of running the upscale through the model itself.
With that in mind, a 5 MB frame on an A4000 takes about 100–120 seconds to process. Scaled up through Vast, I've had about 30 cards running in parallel ($4.50/hr for all 30), giving me about 30 frames every 100 seconds. For 60 fps video, that works out to roughly 200 seconds, about 3.3 minutes of processing per 1 second of video. Call it 5 minutes of processing per second of video at $4.50 an hour, which is about 12 seconds of video per hour. A 1-minute video takes 5 hours and costs $22.50.
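The same napkin math as a sketch, using the numbers above (the 5-minutes-per-video-second figure is the padded estimate that soaks up the transfer overhead):

```python
# Ballpark cost math for the 30x A4000 setup on Vast.
SEC_PER_FRAME = 110      # ~100-120 s per frame on one A4000
CARDS = 30
CLUSTER_RATE = 4.50      # $/hr for all 30 cards
FPS = 60

# 60 frames spread over 30 cards = 2 batches per second of video.
raw_sec = (FPS / CARDS) * SEC_PER_FRAME   # ~220 s of wall time
padded_sec = 5 * 60                       # call it 5 min with I/O overhead

hours_per_video_minute = padded_sec * 60 / 3600
print(f"raw: ~{raw_sec:.0f} s per second of video, padded: {padded_sec} s")
print(f"1 minute of video: {hours_per_video_minute:.0f} h, "
      f"${hours_per_video_minute * CLUSTER_RATE:.2f}")
```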
Just some ballpark figures.
I think the A4000 has given me the best yield. I have tested the A100 and RTX 4090 but find the A4000 yields the best throughput for the cost (the 4090 is $0.45/hr per card, the A100 is $1+/hr per card). The more powerful GPUs do speed up the upscaler model, but not the round trip of reading the newly generated bytes back off the card and writing them to disk. At A100 speeds, I spend more time transferring files back and forth over the network than the upscaler model spends actually running on the GPU.
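One hedged way to frame that comparison is dollars per frame, where the seconds-per-frame figure includes the network round trip. Only the A4000 time below is one I've measured; the other two are placeholders to fill in with real benchmarks:

```python
# $/frame = ($/hr) * (seconds per frame, including transfer) / 3600.
# Only the A4000 time comes from my runs; the others are placeholders.
cards = {
    "A4000":    {"usd_per_hr": 0.15, "sec_per_frame": 110},   # $4.50/hr / 30 cards
    "RTX 4090": {"usd_per_hr": 0.45, "sec_per_frame": None},  # measure before comparing
    "A100":     {"usd_per_hr": 1.00, "sec_per_frame": None},
}

for name, c in cards.items():
    if c["sec_per_frame"] is None:
        print(f"{name}: needs a measured sec/frame")
        continue
    print(f"{name}: ${c['usd_per_hr'] * c['sec_per_frame'] / 3600:.4f}/frame")
```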