Off with their heads: Body and clothing segmentation in Spark AR

Implementing a filter with three different effects on the background, body and clothing.
Steve Maddenverse Instagram filters
Many of the augmented reality experiences that ROSE produces focus on adding new objects to, or new ways of interacting with, the existing world around us, allowing the user to interface with virtual extensions of a product, brand, or idea. Lately, however, we have seen renewed interest from brands in creating person-centric experiences, i.e., the selfie. Most recently, we delved into this world when working on the Steve Maddenverse campaign’s Instagram filters.
Of course, person-centric experiences are hardly a new idea. Selfie filters developed for Instagram and Snapchat abound, having exploded in popularity through the last five years. These filters can do anything from magically beautifying someone’s face to aging them, warping them into fearsome orcs and goblins, changing their hair or facial features or jewelry and accessories, or swapping them entirely with someone else’s. This, too, is a kind of augmented reality, and it has its own huge potential.  
An Instagram face swap filter. Credit to amankerstudio on Instagram, 2017.
Alongside that potential come several unique challenges, of which the main one is body tracking. An AR engine needs to identify what sections of the camera feed belong to a person as well as how and where they move — perhaps even tracking the position and orientation of individual body parts. And once we have that information, we can take it a step further to address an even more specific hurdle: segmentation.  
A body tracking algorithm in action. Credit to MediaPipe and Google AI Blog, 2020.

What is Segmentation?

Segmentation is the process of identifying a body part or real object within a camera feed and isolating it, creating a “cutout” that can be treated as an individual object for purposes like transformation, occlusion, localization of additional effects, and so on.

Types of Segmentation:

Hair Segmentation: Changing a user’s hairstyle requires precise segmentation of that user’s real hair so that it can be recolored, resized, or even removed from the rendered scene and replaced entirely without affecting other parts of the scene, such as the user’s face.
Body Segmentation: Allows the user’s background to be replaced without tools like a green screen, throwing the user into deep space, lush jungles, the Oval Office, or anywhere else you would like to superimpose your body outline.
Skin Segmentation: Identifies the user’s skin. This could power an experience in which a user wears virtual tattoos that stop at the boundaries of their clothes and move along with their tracked body parts, for an almost perfectly lifelike result.
Object Segmentation: Gives us the ability to perform occlusion so that AR objects can be partially hidden behind or beneath real ones, as they logically would be in reality, or even to “cut and paste” those real objects into virtual space.
Person, skin, and hair segmentation via Spark AR. Credit to Facebook, 2021.

Achieving Segmentation

How do we achieve segmentation? Approximating shapes from a database would never be even close to realistic. Identifying boundaries by color contrast is a no-go for people with hair or clothes that are close to their skin tone. Establishing a body position at experience start (“Strike a pose as per this outline:”) and then tracking changes over time is clunky and unreliable. We need something near-instantaneous that can recalibrate on the fly and has a wide margin of approximation for adjustment. We need something smarter!
Of course, then, the answer is artificial intelligence. These days, “AI” is more often than not a buzzword thrown around to mean everything and yet nothing at all, but in this case we have a practical application for a specific form of AI: neural networks. These are machine learning algorithms that can be trained to recognize shapes or perform operations on data. By learning from huge sets of data (for example, thousands and thousands of photos with and without people in them), neural networks have been trained to recognize hands, feet, faces, hair, horses, cars, and various other animate and inanimate entities…perfect for our use case.

Training a neural network to identify objects and remove backgrounds. Credit to Cyril Diagne, 2020.

All of this is not to say that segmentation is on the cutting edge of new technology. Spark AR, for example, has had segmentation capabilities for at least two years. However, it is a pretty recent update to the platform that allows users to use multiple classes of segmentation in a single effect, and you can read more about that update here. This new capability opens the door to a host of more complex effects, and so in this case study, we use multiple-class segmentation to apply separate effects to the user’s background, body (face, hair, and skin), and clothing.
Sketching out a triple segmentation filter. Credit to Eric Liang, 2021.
Each of these layers is easily accomplished on its own using a segmentation texture from the camera. For example, Spark AR provides a “Background” template that shows how to accomplish person segmentation and insert a background image. Breaking the template down, we see that this is accomplished by first creating two flat image rectangles that overlay and fill the device screen. The topmost of these will be the person, and the one underneath will feature the background image.
For the top layer (named “user” in the template), the extracted camera feed is used as a color texture. Beginners will observe that there’s no visible distinction from a blank front-facing camera project at this time. This is because the normal display is, for all practical purposes, exactly that: just a flat image rectangle that fills the screen and displays the camera feed. We’ve basically just doubled that in a way that we can tinker with and put our version on top, obscuring the “normal” display.
Next, a person segmentation texture is created and used as the alpha texture for the user rectangle. This sets the alpha value, which determines transparency, for all parts of the user rectangle outside of the identified person to 0, so that it is completely transparent and shows what is layered underneath it instead. Within the area that is an identified person, the camera feed continues to show through.
This shows us that the segmentation texture is actually made up of two binary areas: is and isn’t, without any information as to what that is/isn’t is actually referring to. Those familiar with image manipulation know this concept as “layer masking”. The camera feed is accessed twice per frame: once to determine that is/isn’t binary and create a texture map (practically, equivalent to a layer mask) recording that information, and once to check what color each pixel within that map should be. (Astute observers will note that it doesn’t matter in which order these checks occur.)
Finally, the template allows for any desired background image to be slotted in as the background rectangle’s color map. Voilà: person segmentation! We’ll replace the stock image with a bit of outer space for our aesthetic.  
Background segmentation using Spark AR’s template.
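To make the layer-mask idea concrete, here is a minimal sketch in plain Python of what the two stacked rectangles amount to per pixel. It is not Spark AR code: the names `composite_over`, `camera`, `person_mask`, and `space` are hypothetical, and the tiny 1x3 "frame" is made-up data chosen only to show the arithmetic.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[float]]       # rows of alpha values, 0.0-1.0


def composite_over(camera: Image, person_mask: Mask, background: Image) -> Image:
    """Show the camera feed where the mask is 1 and the background where it is 0."""
    out: Image = []
    for cam_row, mask_row, bg_row in zip(camera, person_mask, background):
        out_row = []
        for cam_px, alpha, bg_px in zip(cam_row, mask_row, bg_row):
            # Standard alpha blend, applied to each color channel.
            out_row.append(tuple(
                round(alpha * c + (1.0 - alpha) * b)
                for c, b in zip(cam_px, bg_px)
            ))
        out.append(out_row)
    return out


# Hypothetical 1x3 frame: only the middle pixel belongs to the person.
camera = [[(200, 180, 160), (210, 170, 150), (90, 90, 90)]]
person_mask = [[0.0, 1.0, 0.0]]
space = [[(5, 5, 30), (5, 5, 30), (5, 5, 30)]]

result = composite_over(camera, person_mask, space)
```

In this toy frame the middle pixel keeps its camera color and the other two take the background color, which is the same "is/isn't" behavior the alpha texture gives the user rectangle.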

Next step: adding separate effects to the body and the clothing. Problem: we don’t have a built-in “clothes” segmentation! We have “person”, “hair”, and “skin”, but nothing that will allow us to easily separate face and skin from clothes. Snapchat’s Lens Studio is nice enough to provide built-in “upper garment” segmentation, but Spark AR is not so forthcoming. We’ll have to get a little creative with the options available to us.
Quick thinkers may have already seen the simple mathematical solution. Our segmentation options are “person”, “hair”, and “skin”. Person minus hair and skin is…exactly what we’re looking for. By combining the hair and skin segmentation textures and subtracting the result from the person texture, we get the clothes left behind. Let’s get cracking on what exactly this looks like in patch form.
Demonstrating multiple segmentation.
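Outside the patch editor, the same subtraction can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than a transcription of the patch graph: the function name `clothing_mask` is hypothetical, the masks are made-up 1x4 strips, and combining hair and skin with `max` (a union) is one reasonable choice among several.

```python
from typing import List

Mask = List[List[float]]  # 1.0 = inside the segmented class, 0.0 = outside


def clothing_mask(person: Mask, hair: Mask, skin: Mask) -> Mask:
    """Approximate clothing as person minus (hair or skin), clamped to 0..1."""
    result: Mask = []
    for p_row, h_row, s_row in zip(person, hair, skin):
        row = []
        for p, h, s in zip(p_row, h_row, s_row):
            hair_or_skin = max(h, s)  # union of the two masks (an assumption)
            # Clamp so soft mask edges never produce negative alpha.
            row.append(min(1.0, max(0.0, p - hair_or_skin)))
        result.append(row)
    return result


# Hypothetical 1x4 strip of pixels: background, hair, skin, shirt.
person = [[0.0, 1.0, 1.0, 1.0]]
hair = [[0.0, 1.0, 0.0, 0.0]]
skin = [[0.0, 0.0, 1.0, 0.0]]

clothes = clothing_mask(person, hair, skin)
```

Only the last pixel, which is part of the person but neither hair nor skin, survives in `clothes`. The clamp matters with real segmentation textures, whose edges are soft and do not line up perfectly between classes.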

As a very basic implementation of the concept, it’s a little rough around the edges, but it gives us what we need. I implemented some tweaks for the sample screenshots, but they will not be covered in this case study, and I encourage you to explore, create, and refine your own solutions!
“EZ Segmentation” is a patch asset straight from the Spark AR library, and provides options for adding effects to either the foreground (body) or the background (clothes). It’s pretty easy to build each effect on its own and then pass the resulting texture into the matching slot. Here, we add a light glow gradient paired with a rippled lens flare to the foreground and a starry animation sequence to the background.
The filter in action.
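For readers who think in code rather than patches, the final three-way mix can be pictured as a weighted blend per pixel. The sketch below is hypothetical and says nothing about how the EZ Segmentation patch is built internally: `blend_layers` and the three effect colors are invented for illustration, and the mask values are assumed to lie between 0 and 1.

```python
from typing import Tuple

Color = Tuple[float, float, float]  # RGB, each 0.0-1.0


def blend_layers(person: float, clothes: float,
                 background_fx: Color, body_fx: Color, clothes_fx: Color) -> Color:
    """Mix three effect colors for one pixel using mask values as weights."""
    w_background = 1.0 - person
    w_clothes = min(clothes, person)  # clothing can only appear on the person
    w_body = person - w_clothes       # whatever is left of the person
    # The three weights always sum to 1, so the result stays in range.
    return tuple(
        w_background * bg + w_body * body + w_clothes * cl
        for bg, body, cl in zip(background_fx, body_fx, clothes_fx)
    )


# Made-up effect colors standing in for the three effect textures.
stars = (0.0, 0.0, 0.25)
glow = (1.0, 0.75, 0.5)
pattern = (0.5, 0.0, 0.5)

shirt_pixel = blend_layers(1.0, 1.0, stars, glow, pattern)
face_pixel = blend_layers(1.0, 0.0, stars, glow, pattern)
sky_pixel = blend_layers(0.0, 0.0, stars, glow, pattern)
```

Each fully classified pixel picks up exactly one effect, while pixels on a soft mask edge receive a proportional mix of their neighbors' effects, which is what keeps the seams between layers from looking jagged.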

You can already imagine the kinds of things we can do here with the power to animate designs on the user’s clothing. Inversely, we can leave the clothing untouched and add effects to the user’s skin, whether that be coloring it in à la Smurf or Hulk, or erasing it entirely for an “Invisible Man”-type filter. These suggestions are just a place to start, of course; multiple-class segmentation is powerful enough to open the door to a galaxy’s worth of potential. Show us what you can do!

Render streaming: taking AR to the next level

What’s the deal with AR, anyway?

XR technology is widely touted as having infinite potential to create new worlds. You can design scenes with towering skyscrapers, alien spacecraft, magical effects, undersea expanses, futuristic machinery, really anything your heart desires. Within those spaces, you can fly, throw, slash, burn, freeze, enchant, record, create, draw, and paint: any verb you can come up with. The only limit is your imagination!
Painting in VR with Mozilla’s A-Painter XR project. Credit: Mozilla 2018.

Sounds cool. What’s the problem?

Well, all of that is true, to a point. Despite all of our optimism about the potential of AR and VR, we find that we are still bound by the practical limitations of the hardware. One of the biggest obstacles to creating immersive, interactive, action-packed, high-fidelity XR experiences is that the machines used to run them just don’t have the juice to render them well. Or, if they do, they’re either high-end devices with a steep monetary barrier to entry, making them inaccessible, or too large to be portable and therefore not conducive to the free movement you would expect from an immersive experience.
That’s not to say that we can’t do cool things with our modern XR technology. We’re able to summon fashion shows in our living rooms, share cooperative creature-catching gaming experiences, alter our faces, clothing, and other aspects of our appearance, and much, much more. But it’s easy to imagine what we could do past our hardware limitations. Think of the depth, detail, and artistry boasted by popular open-world games on the market: The Elder Scrolls V: Skyrim, The Legend of Zelda: Breath of the Wild, No Man’s Sky, and Red Dead Redemption 2, just to name a few. Now imagine superimposing those kinds of experiences against the real world, augmenting our reality with endless new content: fantastic flora and fauna wandering our streets, digital store facades that overlay real ones, and information and quests waiting at landmarks and local institutions.
Promotional screenshot from The Legend of Zelda: Breath of the Wild. Credit: Nintendo 2020.
There are many possibilities outside of the gaming and entertainment sphere, too. Imagine taking a walking tour through the Roman Colosseum or Machu Picchu or the Great Wall of China in your own home, with every stone in as fine detail as you might see if you were really there. Or imagine browsing through a car dealership or furniture retailer’s inventory with the option of seeing each item in precise, true-to-life proportion and detail in whatever space you choose. We want to get to that level, obviously, but commercially available AR devices (i.e., typical smartphones) simply cannot support it. High-fidelity 3D models can be huge files with millions of faces and vertices. Large open worlds may have thousands of objects that require individual shadows, lighting, pathing, behavior, and other rendering considerations. User actions and interactions within a scene may require serious computational power. Without addressing these challenges and more, AR cannot live up to the wild potential of our imaginations.

So what can we do about it?

Enter render streaming. Realistically, modern AR devices can’t take care of all these issues…but desktop machines have more than enough horsepower. The proof is in the pudding: we see in the examples of open-world video games previously mentioned that we can very much create whole worlds from scratch and render them fluidly at high frame rates. So let’s outsource the work!
The process of render streaming starts with an XR application running on a machine with a much stronger GPU than a smartphone (at scale, a server, physical or cloud-based). Then, each processed, rendered frame of the experience, generated in real time, is sent to the display device (your smartphone). Any inputs from the display device, such as the camera feed and touch, gyroscope, and motion sensors, are transmitted back to the server to be processed in the XR application, and then the next updated frame is sent to the display device. It’s like on-demand video streaming, with an extra layer of input from the viewing device.
This frees the viewing device from actually having to handle the computational load. Its only responsibility now is to stream the graphics and audio, which modern devices are more than capable of doing efficiently. Even better, this streaming solution is browser-compatible through the WebRTC protocol, meaning that developers don’t need to worry about cross-platform compatibility, and users don’t need to download additional applications.
Diagram of render streaming process using Unreal Engine. Credit: Unreal Engine 2020.
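The round trip described above can be sketched as a toy simulation in plain Python. Nothing here touches a real network, WebRTC, Unity, or Unreal Engine: the classes `DeviceInput`, `Frame`, and `RenderServer` and their fields are hypothetical stand-ins, meant only to show which side owns the state and which side merely displays frames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceInput:
    """Inputs the display device sends upstream (hypothetical fields)."""
    touch: bool
    yaw_degrees: float


@dataclass
class Frame:
    """Stand-in for an encoded video frame sent downstream."""
    number: int
    description: str


class RenderServer:
    """Holds the application state and does the heavy rendering work."""

    def __init__(self) -> None:
        self.frame_number = 0
        self.objects_spawned = 0

    def step(self, device_input: DeviceInput) -> Frame:
        # Apply the device's input to the application state...
        if device_input.touch:
            self.objects_spawned += 1
        # ...then "render" the next frame from the device's point of view.
        self.frame_number += 1
        return Frame(
            number=self.frame_number,
            description=(f"view at yaw {device_input.yaw_degrees:.0f}, "
                         f"{self.objects_spawned} object(s)"),
        )


def run_session(inputs: List[DeviceInput]) -> List[Frame]:
    """Display-device loop: send input up, receive a finished frame back."""
    server = RenderServer()
    return [server.step(device_input) for device_input in inputs]


frames = run_session([
    DeviceInput(touch=False, yaw_degrees=0.0),
    DeviceInput(touch=True, yaw_degrees=15.0),
])
```

The display side never builds the scene; it only forwards inputs and shows whatever frame comes back. In a real deployment each call to `step` would be a network round trip, so every frame inherits that delay.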
There is just one problem: it takes time for input signals to travel from the display device to the server, be processed, and have the results transmitted back. Nor is this a new challenge; we have long struggled with the same latency issue in modern multiplayer video games and other network applications. For render streaming to become an attractive, widespread option, 5G network connectivity and speeds will be necessary to reduce latency to tolerable levels. Regardless, it would be wise for developers to get familiar with the technology. All the components are already at hand; not only is 5G availability increasing, but Unity and Unreal Engine have also released native support for render streaming, and cloud services catering to clients who want render streaming at scale are beginning to crop up. The future is already here; we just need to grab onto our screens and watch as the cloud renders the ride.