This is an edited transcript taken from a series of lightning talks focused on modern content creation techniques during SIGGRAPH 2020. The full-length post originally appeared on our Medium page.
ALEX: Our goal is to share insider information around volumetric video and photogrammetry and how to get the best capture, whichever type of capture method you choose. There are some intrinsic things that we’ll mention that are specific to the MOD Tech Labs’ processing solution, but overall, our goal is to create opportunities across the spectrum. MOD’s intake is completely universal: we can process any volumetric data, photogrammetry data, and scan data. Our output is completely universal as well: .obj, .fbx, etc.
ALEX: I’m Alex Porter, CEO, and Co-Founder of MOD Tech Labs. This is our second startup in the tech space — we both come from XR, and this tool was created when Tim and I were running Underminer Studios. Ultimately, what we wanted to do was create the opportunity for massively scalable content creation. And what we’ve come to after three and a half years of working on this tool is a highly scalable cloud SaaS solution. I’m not going to go too deep into that, but that’s the frame of reference of where we’re coming from.
My background is in interior design and construction technology, and across the last few years, we’ve worked in entertainment, media, medical tools, and more. We’ve always done sort of back-end tools creation. Whether building an AR scanning tool or VR wheelchair driving experience, there’s an opportunity for folks to build up and scale their own interactions and content, immersively in VFX, geospatial, medical, or elsewhere.
We are a venture-backed startup based in Austin, Texas, and over the last few years, we have been awarded Top Innovator awards by Intel for 3 years running and the City of Austin Innovation Award in 2019. We are also part of the NVIDIA Inception program.
TIM: And I’m Tim Porter, Co-Founder, and CTO at MOD Tech Labs. I’ve spent 20 years in the video game, movie, and immersive media industries. In games, I was a technical artist and a pipeline technical director. I am the Chair of the Consumer Technology Association’s XR working group and serve as their Vice-chair of XR standards.
My perspective comes from the maker side: how can I create tools that can reach everyone? I am very fortunate to pick up new technologies and then use them very quickly. Previously, I made automated tools and toys for artists and device-specific optimization, and that leads to where MOD is today. That means taking technology that is really difficult to build or automate, or is very time-consuming, and distilling it down into something that is easy to use, quick, and doesn’t require infrastructure, which is something that small and mid-tier studios have a massive issue with.
ALEX: We’re gonna break down our suggestions into photogrammetry, scanning, and volumetric video for best capture practices.
ALEX: For your photogrammetry rig setup, there are a few essential things. Camera placement and camera focus are critical. Depending upon your floorspace and what you’re trying to capture, the still object is often placed in the center.
If you have multiple cameras or are using a single point-and-shoot, you’re definitely going to want to use something like a tripod to get professional quality, with even photographs from every angle and part that you’re trying to capture. That will help you get even more detail. For extremely detailed objects, you need even more photos. You always want to use, where possible, identical cameras and lenses. Some solutions can use multiple camera types, but ultimately, it’s a lot easier to solve if you have a singular type. The general way we like to say it is a 15-degree section between each of the cameras, which helps create those overlapping data points that you’ll want to make it really high-end.
TIM: Exactly. This is just a light concept when you’re thinking about any level of scanning. I’ve seen people do a lot more. I’ve seen things get away with less. But a general rule of thumb — especially if you’re building a static rig — is 15 degrees in each direction. 14.5 degrees, if you’re really being super precise and want to do something high-end, especially capturing faces. But really, once you get below that number, you start running into data that’s not really needed — a lot of newer systems will actually throw away that data.
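As a rough illustration of the 15-degree rule of thumb, the sketch below places cameras on a single horizontal ring around a centered subject. The function name, ring radius, and camera height are hypothetical values chosen for the example; a real rig would add more rings above and below.

```python
import math

def ring_camera_positions(spacing_deg=15.0, radius_m=2.0, height_m=1.5):
    """Return (angle_deg, x, y, z) for cameras evenly spaced on one ring.

    radius_m and height_m are illustrative assumptions, not values from the talk.
    """
    count = round(360.0 / spacing_deg)
    positions = []
    for i in range(count):
        angle = i * spacing_deg
        rad = math.radians(angle)
        positions.append(
            (angle, radius_m * math.cos(rad), radius_m * math.sin(rad), height_m)
        )
    return positions

ring = ring_camera_positions()
print(len(ring))  # 24 cameras on one ring at 15-degree spacing
```

At 15 degrees, one full ring already needs 24 cameras, which is part of why full rigs reach camera counts in the hundreds.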
When you’re talking about having individual assets, you can go a little bit more on the topology and the flow of the asset. If it’s something that you have a handheld camera for, you know, follow the edges and go, “Okay, well, this is an overlapping area. So, I need to get a couple extra in this area.” But with static rigs, you’re looking for good coverage for each one of these different things. When we’re talking about identical lenses, the reason is it’s easier for the machine learning algorithms to solve. They estimate each camera’s extrinsics and intrinsics (basically, where it is in three-dimensional space, the camera’s field of view, and a whole bunch of other weighty data points), and they do that over an aggregation of the images.
So, the more images that you have, especially if they all come from the same setup, the faster the calculation goes (the amount of time that you spend calculating that information goes down), and the quality goes higher because the same technical capture information carries through all of the different images. It ends up creating a higher quality result.
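To make the intrinsic and extrinsic idea concrete, here is a minimal pinhole-camera sketch. It assumes a camera with no rotation and no lens distortion looking down the +Z axis; the function names are hypothetical, and this is not a description of how any particular solver works.

```python
import math

def focal_length_px(image_width_px, horizontal_fov_deg):
    """Intrinsic: focal length in pixels derived from the horizontal field of view."""
    return (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)

def project(point_xyz, camera_xyz, focal_px, image_size):
    """Project a world point into an unrotated camera looking down +Z.

    camera_xyz is the extrinsic part (where the camera sits); focal_px and
    the image center are the intrinsic part.
    """
    x = point_xyz[0] - camera_xyz[0]
    y = point_xyz[1] - camera_xyz[1]
    z = point_xyz[2] - camera_xyz[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    return (cx + focal_px * x / z, cy + focal_px * y / z)

f = focal_length_px(1920, 90.0)
u, v = project((0.5, 0.0, 2.0), (0.0, 0.0, 0.0), f, (1920, 1080))
print(round(f), round(u), round(v))
```

When every camera shares the same lens, the field-of-view term is the same unknown for all of them, so there is less to estimate per image.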
Yes, I’ve seen tons of different ones. Many professional rigs use multiple, but you’re talking about rigs with typically 100-200 cameras, and there really is no replacement for that. If you instead did something closer to single lenses and cameras, you can get away with a little bit less, for sure.
And of course, you always want to make sure that you prefocus your cameras, that the subject is in-frame as much as possible, and that you have overlap with the frame. The general rule of thumb is three images per point — that’s pretty good.
ALEX: Some of the typical configurations are dome coverage, with the subject in the middle. If you’re doing full-body, you will see cylindrical styles of rig setup. It has a lot more to do with what you’re capturing and what your physical footprint is. There are some ways to use both of these to your own benefit in different situations.
That “three-shots-per-point” is really important. We recently had some folks submit photogrammetry, with massive whitespace — they were too far back from the object they were capturing. And when you have that, you’re going to miss a lot of the detail, and you’re going to miss a lot of those really fine points that need to overlap. So, getting that object in-frame as much as possible with the least amount of extraneous stuff in the background or outside of the object is really important.
Then, for each scene, you want to overlap by 40%. Again, a lot of that has to do with mapping those points of interest across all of the data.
TIM: And of course, these numbers are going to continue to update themselves. At one point in time, it was 60%, and you wanted six images apiece. Now, it’s getting down to three, and with things like view synthesis, those numbers are going down regularly. I’ve seen view synthesis shots that can do a full capture of an asset in under 30 images — all the way around — and it gets absolutely everything, and the quality is just phenomenal; crisp edges, shine and sheen, and things like that.
But once we fall back and talk about today’s technology (what everyone’s using, what goes on queues, and what people process with), it’s still three images right now. With 40% overlap, you can get away with a little bit less if you have really high-interest points without many surface divisions, basically a massive silhouette change. If you have something like an earring, you’re going to need more information in there, especially if it’s an intricate earring versus a stud and different things like that.
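The effect of the overlap percentage on shot count can be sketched with simple arithmetic. This assumes a straight sweep across a subject with a fixed frame footprint, which is a simplification; the function name and the sample measurements are hypothetical.

```python
import math

def shots_for_sweep(subject_extent, frame_extent, overlap):
    """Frames needed to cover subject_extent when neighbors share `overlap` of a frame.

    Extents use the same unit (for example meters); overlap is a fraction in [0, 1).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if subject_extent <= frame_extent:
        return 1
    step = frame_extent * (1.0 - overlap)  # new area covered by each extra frame
    # The small epsilon guards against floating-point noise on exact multiples.
    return 1 + math.ceil((subject_extent - frame_extent) / step - 1e-9)

for overlap in (0.6, 0.4):
    print(overlap, shots_for_sweep(3.0, 1.0, overlap))
```

For a 3-unit sweep with a 1-unit frame, moving from 60% to 40% overlap drops the count from six frames to five; intricate areas would still need extra shots on top of that.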
So, the hard part is the balancing act between them. If you have a static object with something that isn’t, like a very detailed shirt, you might end up needing more photos to make the images — information based on either lighting changes, shadow pole, or something like that — just depending on what’s going in there. Things like stark assets are challenging because it’s looking for interest points to map between them.
Shooting on a white background is difficult because you get either reflection or refraction bounce from the ground. The same if you’re wearing just an all-black shirt. You can get good shots on black shirts and things like that, but it’s just more difficult to solve all-in-all, and you’ll get much cleaner results if you provide something like a plaid. But on the other end, with a material like that, you end up running into different issues like making sure that the line stays straight and things like that. Anything that is camera-safe but still has some form to it is a good answer there.
ALEX: Scanners are more of a continuous roll rather than individual images. The goal is to maintain the integrity and level all-around. You want to stay at the same angle and travel across the object evenly. Again, a tripod is required to have that stability and that professional quality.
There are lots of ways that scanners are used: LIDAR, drones, etc. Tim’s going to talk more about RGBD. There’s no focus required for a scanner; typically, they have all those things intrinsically set up. You’ll want to fill the frame as much as possible with the subject. One common thing that we are seeing is the combination of scanner data with photogrammetry, so that’s a really great way to supercharge your data sets.
TIM: The one thing that you always want to pay attention to is each one of these scanners has a minimum and maximum distance — it’s good to be in that sweet spot. There are physical charts that are out there, like the 435 that comes from Intel. The RealSense camera has a two-foot distance on that — once you get outside of that, you start losing quality. But if you get inside that range, you start having packed information that causes reconstruction issues. You can get issues with warping or stippling that comes across once you’re too close. If you’re too far away, it’s like a stippling — but it’s more of a mountainous/spurious kind of visual that comes out of it.
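A simple pre-capture check for the sweet spot might look like the sketch below. The range limits here are placeholder numbers, not the specification of any real scanner; take the actual minimum and maximum from your device’s documentation.

```python
def classify_distance(distance_m, min_m, max_m):
    """Say whether a subject distance falls inside a scanner's working range."""
    if min_m >= max_m:
        raise ValueError("min_m must be smaller than max_m")
    if distance_m < min_m:
        return "too close"   # risk of warping or stippling
    if distance_m > max_m:
        return "too far"     # risk of noisy, spurious geometry
    return "in range"

# Hypothetical working range of 0.3 m to 1.0 m, for illustration only.
for d in (0.2, 0.6, 1.5):
    print(d, classify_distance(d, 0.3, 1.0))
```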
LiDAR (talking about most ground-based LiDAR) will scan the way it’s initially set up. FARO does a wonderful job of setting up their systems so that they do what they do. If you’re talking about plane-based scanning, make sure that you get about a 20% to 30% overlap so that when the data comes back, you can use that to clean up as you’re getting a flyby pass.
Drone technology has come a long way now that quadcopters can carry heavier and heavier assets. I’m starting to see a lot more data revolving around photogrammetry on top of the scan data, so you’re seeing a lot of time-of-flight scanners out there and not nearly as many structured scanners that are on drones. Although I have seen some RealSense scanning drones out there that mix with a DSLR, of sorts, and they provide some decent feedback — but it really depends on what you’re going for. If you have to get a drone that close, you got to have an excellent pilot — so there are many trade-offs.
One of the better solutions is sky-based LiDAR with ground-based photogrammetry. Combining those two provides both crisp edges that you receive from LiDAR and a lot of the fill-in information that you get from photogrammetry — that kind of “pray and spray” setup — especially when you have a large area that you need to cover versus the precision that just photogrammetry and ground-based LiDAR will end up getting you. So, if you have big things, the combination of the two does provide more filled results in a shorter amount of time and is more economical.
ALEX: Volumetric video rig setup is interesting and fun. The typical model right now for much of the volumetric video capture is a dedicated stage. We believe it is definitely a valuable way for some people to access it, but it may not be realistic for other folks. Part of the reason that we created our processing solution at MOD was to bring volumetric video to others that already are doing photogrammetric capture. They already understand the tenets of this, they have the equipment to do photogrammetric capture, and all they really need to do is a few calibration tweaks to be able to capture volumetric video.
These things are relatively similar to the photogrammetric capture set up with a few qualifiers here and there. For starters, the minimum of three cameras per 15-degree section, which is based on the tenets of photogrammetry — call it “videogrammetry” if you will. We are working to create that overlap in data and make sure that you get the most detail that you can to create that moving object.
With the camera focus, you really want the same focal length in each camera. Use the same type of focus on each camera and no autofocus, which definitely causes issues because all the cameras will do their own variations of autofocus. Then it’s harder to create that result where you’re combining them. A global shutter is preferred, and stay away from fisheye lenses. You don’t want to warp or have any of the cameras be individual; you’ll want them all to have the same setup, and if you have all the exact same cameras… even better. We have worked with everything from webcam rigs to DSLR rigs to bullet-time rigs and created opportunities to recalibrate those styles and bring them in for this opportunity.
TIM: Why do we talk about three sections? It really is the vertical overlap: you end up having a kicker section, a mid-section, and a facial section. You want to put at least a couple up over the top, and this is something I tend to see in almost every client rig that we get: they don’t really count on the top of the head coming out that good. They aren’t that worried about it because many of them wear skullcaps and then put the hair on afterward. Well, I can tell you, if you believe that you have the workforce to go ahead and put hair onto every single frame of volumetric video… you go ahead and do that, but that sounds like enjoyment well past the level of entertainment that I find.
You really do have to capture now — it’s all now, or you’re going to be in for a lot of pain later. So, doing things over the head and making sure that you do count the floor — I see this even with large professional volumetric rigs where they don’t do a lot of work in getting that separation between the ground and people’s feet. You’ll end up seeing these flat feet. I know not everybody wears a pair of Chuck Taylors, there is a sole on these things… but they go through and chop off the bottom of the feet. You end up needing to have more on the ground than people really think that you should so that you can go ahead and separate these individuals. It’s crucial.
One caveat when it comes to fisheye lenses — fisheye lenses are nasty. They’re nasty because what they do is actually stretch the image on the edges. Every camera does do this — very true. Whether you realize it or not, the reason why we understand how far away a point is in three-dimensional space is because of the determination that a camera provides onto a flat image. So, you have this flat image, and then we get closer towards the center of the image… as it gets further out towards the edge, every single image has a stretching — even a prime lens has a certain amount of stretch that comes out — it’s just stuck to that exact focal length and provides a much better result.
Fisheye lenses do that at a much higher rate. What ends up happening out of this is that you lose more viable information — even if you do a wonderful de-warp, which I have seen some de-warps where your eye will not see it. I can promise you a computer vision will see every minute difference between each one of these images. And when it’s trying to go around the entire circle, it will see those little bit of results and it will provide a little over here versus a little over here. That may seem like that’s not a lot, but when you’re talking about “a little over here” for every single frame, that makes the edges dance and dancing edges make people nauseous, and people don’t like being nauseous. This is something that we try not to do.
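The reason a near-perfect de-warp still hurts at the edges can be sketched with a one-coefficient radial distortion model. This is a common simplification, not a full fisheye model, and the coefficient values are made up for illustration.

```python
def distort_radius(r, k1):
    """Distorted radius for an ideal normalized radius r (0 = center, 1 = edge)."""
    return r * (1.0 + k1 * r * r)

def residual_shift(r, k1_true, k1_estimated):
    """Leftover position error at radius r when the de-warp coefficient is slightly off."""
    return abs(distort_radius(r, k1_true) - distort_radius(r, k1_estimated))

# Hypothetical lens with k1 = -0.30, de-warped with an estimate of -0.29.
for r in (0.0, 0.1, 0.5, 1.0):
    print(r, residual_shift(r, -0.30, -0.29))
```

In this model the leftover error grows with the cube of the radius, so a de-warp that looks perfect in the center can still leave the edges slightly off, and those small per-frame differences are what show up as dancing edges.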
There are several solutions that we are definitely working on regularly, involving machine learning algorithms that are obviously way smarter than the people that build them (i.e. me). This is something that will produce great results. I’ve seen a lot of good solutions at SIGGRAPH that have come up with how do we deal with fisheye lenses because sometimes you only have the space for a fisheye. Fisheyes are really wonderful at getting coverage. The problem is the coverage that they provide is not the coverage that you want.
ALEX: The rig coverage, again, is really similar to photogrammetry. Very typical, dome coverage — definitely want to make sure that you get the top of the head, as Tim mentioned… the feet and the head if you’re doing full body.
There are some technologies we’re experimenting with, and we’ve actually created a temporal illusion to recreate some missing data — we’ve had some datasets that did not have all of the ideal shots. That’s not always going to be feasible to be honest, depending upon what the subject is. So having that correct amount of coverage is very important for that fidelity. Especially on the face if you’re doing a bust, because the whole point of facial volumetric video is to get all the macro and micro-expressions — all that flushing, all of the little fine lines — the minute movements of our face that make us human.
So having cameras directed at all sides of the subject and making sure that you clearly get coverage for things like ears, hair, and the top and the back of the head, all those really interesting, weird places that are a little hidden.
The bust shots require a minimum of 210 degrees of record data. So it’s not a 180°, even though we typically call 180°, it’s really 210° because you do want to get that back of the ear. That’s a huge part of it.
TIM: So you get 210 degrees so you can chop it down to 180°. Go back to our rule of three… you need to add an extra 15 degrees on either side — call it a 210. Then you end up getting the area that you need because you’ll end up getting some warping and wobbling based on the fact that those areas only have one in certain points that are along there, or maybe two in certain other points, just based on the right-left north-south of the 180 degrees drop off that you actually need.
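The 210-degree figure is just the rule of three applied to a bust: 180 degrees of usable arc plus one 15-degree section of margin on each side. The sketch below works that out, reading “three cameras per 15-degree section” literally; the function name is hypothetical, and the camera total is a lower bound under that reading rather than a rig specification.

```python
def bust_capture_plan(usable_arc_deg=180.0, section_deg=15.0, cameras_per_section=3):
    """Return (capture_arc_deg, sections, minimum_cameras) for a bust rig."""
    capture_arc = usable_arc_deg + 2 * section_deg   # one margin section per side
    sections = round(capture_arc / section_deg)
    return capture_arc, sections, sections * cameras_per_section

print(bust_capture_plan())  # (210.0, 14, 42)
```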
When we’re talking about cameras on all sides of the subject, it’s the same — it’s photogrammetry — you want it going all the way around. Cylinders do a really good job at this. The biggest issue with cylinders is the concept that you have arms (underneath and above) and groin areas that are much further away, under the bottom of them, than there is from any camera point. So you end up seeing a lot of issues in those areas. If you can do a full-dome, you still run into those issues.
So, a lot of the things that I see always involve more cameras. When you get up to those 210° camera ranges, you’ll want them pointing upwards in certain areas. You have some of them that point in that cylindrical kind of pattern and then you have these ones that point up under the arms and other problem shots that are in there. If you’re smart about it, you point everybody in a single direction and then you get underarm and under groin shots so you end up getting quality results. It’s kind of a lot of fun trying to figure this out and get those good quality results.
ALEX: The other way that we’ve actually combated this internally, here at MOD, is we’ve created the opportunity to use the best of both worlds. So photogrammetry in an A pose or T pose for the body itself, and then that body can be rigged or you can put mocap on it. There are a lot of cool things you can do at that point, all kinds of animations or sequences. You can map it to an actual motion capture suit capture and then use volumetric for the bust. The other benefit there is that volumetric for the bust is a significantly smaller physical footprint, and that gives you the opportunity to do that high fidelity facial and body capture, then create a combination technology that maximizes your output.
TIM: Definitely. And let’s be honest, people know how to deal with mocap a lot better, and with the way that we’re doing all-in-one .fbx — where all the assets are in there at one time — it really is just a smart result for the output and for the use cases.
ALEX: We’re not going to go through all of this (pictured above) one-by-one — but this is also found in our capture guide. This is just a sort of an overview. General best practices. This pretty much goes across for almost all of them. Some of them are a little bit better — as you get into scanning, you’re going to capture those thin objects better, more effectively. Some of that shine and sheen will be less of a problem than it is with photogrammetry or volumetric video.
Again, on the camera specs, many of them are the same. If you use the same camera, it’s great. If you can minimize the extraneous features (the fisheye, the autofocus, the white balance) and really try to keep them the same across, it’s ideal.
One other difference between how the capture data we intake works is that our processing solution actually is most functional when you do not have a blue, green, or white screen/set behind you. It works much more efficiently when we have data points/points of interest behind and around the subject — whether it’s photogrammetric or volumetric video. We use machine learning and computer vision to actually do the camera calibration, background extraction, edge detection, and things like that intrinsic to understanding that trigonometry, the depth, and where the subject ends and the world begins.
TIM: One of the big things that people often miss is using bounce cards for flat lighting. What you end up getting out of that are white reflections and refractions in human skin. So, panel lighting is good as long as it’s not too hot. You can find some excellent ones on Amazon, a pair for $50 or less now on the low-end, and get good results, and they have battery packs in them. We like them. I’ve used them in a couple of shots, and they’re great, especially for traveling around and things like that.
DSLRs are perfect in a lot of different areas. They have a limitation when it comes to volumetric video — and some of them can be quite loud when you do individual image captures — although, if you are switching over to video, they still provide wonderful results. Something that’s in the middle is things like RX0s and things like that because they are specifically meant for small-use photos. Basically anything minus a GoPro because GoPro lenses are pretty rough, being fisheye and whatnot.
And then, of course, as we’ve covered before, no autofocus, and you definitely want to white balance your devices normally. We do have a solution now that automates a color passport. We also have one coming out right now that uses a neural network to create the same color tone across all the images. So, if you provide a white balanced image that you want everything to look like that’s in your list, then you can do that, so you don’t have to really worry about that.
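For a feel of what matching color tone across images means, here is a deliberately simple sketch that scales each image’s channel averages to match a reference image. It is a stand-in technique for illustration only; it is not the color passport automation or the neural network mentioned above, and the function names are hypothetical.

```python
def channel_means(pixels):
    """Average R, G, and B over a list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def match_white_balance(pixels, reference_pixels):
    """Scale each channel so its mean matches the reference image's mean."""
    src = channel_means(pixels)
    ref = channel_means(reference_pixels)
    gains = tuple(r / s if s else 1.0 for r, s in zip(ref, src))
    return [
        tuple(min(255, round(p[c] * gains[c])) for c in range(3))
        for p in pixels
    ]

reference = [(200, 200, 200), (100, 100, 100)]   # neutral reference image
warm_shot = [(220, 200, 160), (110, 100, 80)]    # same scene with a warm cast
balanced = match_white_balance(warm_shot, reference)
print(balanced)
```

A per-channel gain like this only works when the images show similar content, which is one reason white balancing at capture time remains the safer option.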
Then they do need to be processed, of course, afterward in our systems, but you can also do that on your own. It just really depends on where you’re going on that. But you definitely want your assets that go into processing to be white balanced — there’s just no replacement for the quality on that.
One thing that’s not necessarily on here but is well-known is the use of raw. In a lot of cases, you do get better results. In some cases, you don’t. And that depends on how you’re doing that.
ALEX: So, how does MOD fit in? Why is this important to us? It’s important to us because capture is not our specialty. We are very familiar with capture — we understand the best practices, help others capture more effectively, and increase their capabilities. But we are a processing solution.
We actually have distributed processing, so we are 98% faster in a lot of cases. We use automated systems: you drag and drop the imagery data into the project folder, it uploads on an encrypted system, goes directly to our private secure cloud, we process to your specifications, and we deliver it back to you.
Really, the entire premise of this is to open up the ecosystem and the capability for people to do more functional things without having to have the infrastructure and be able to minimize the specialized staff that you have to have for a lot of these processes, and really drop that massive overtime. A lot of the things that we’re doing and processing are really manual, time-intensive tasks that are best served by a machine.
ALEX: We have another resource aside from the capture guide and our website: if you scroll down on our homepage, the capture guide is available for you to download as a PDF.
We’re always here for you as a resource. We really are interested in sharing our knowledge and giving people more capabilities.
TIM: So thank you, everybody! You can find us at modtechlabs.com. Alex’s email is firstname.lastname@example.org, and you can reach me at email@example.com. Thank you all very much.