Implementing a filter with three different effects on the background, body and clothing.

 

Steve Maddenverse Instagram filters

Many of the augmented reality experiences that ROSE produces are focused on adding new objects to, or new ways of interacting with, the existing world around us, allowing the user to interface with virtual extensions of a product, brand, or idea. However, lately we have seen a renewed interest from brands in creating person-centric experiences, i.e. the selfie. Most recently, we delved into this world when working on the Steve Maddenverse campaign’s Instagram filters.

 

Of course, person-centric experiences are hardly a new idea. Selfie filters developed for Instagram and Snapchat abound, having exploded in popularity over the last five years. These filters can do anything from magically beautifying someone’s face to aging them, warping them into fearsome orcs and goblins, changing their hair, facial features, jewelry, or accessories, or swapping their face entirely with someone else’s. This, too, is a kind of augmented reality, and it has its own huge potential.

 

An Instagram face swap filter. Credit to amankerstudio on Instagram, 2017.

Alongside that potential come several unique challenges, of which the main one is body tracking. An AR engine needs to identify what sections of the camera feed belong to a person as well as how and where they move — perhaps even tracking the position and orientation of individual body parts. And once we have that information, we can take it a step further to address an even more specific hurdle: segmentation.

 

A body tracking algorithm in action. Credit to MediaPipe and Google AI Blog, 2020.

What is Segmentation?

Segmentation is the process of identifying a body part or real object within a camera feed and isolating it, creating a “cutout” that can be treated as an individual object for purposes like transformation, occlusion, localization of additional effects, and so on.
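To make that concrete, here is a minimal sketch in Python/NumPy, purely for illustration (Spark AR itself exposes segmentation as textures in its patch editor rather than as arrays). It assumes `frame` is a camera image and `mask` is a per-pixel map produced by some earlier segmentation step.

```python
import numpy as np

def cut_out(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Turn a segmentation mask into a standalone 'cutout'.

    `frame` is an H x W x 3 image and `mask` is an H x W array of 0s and 1s
    (1 = "this pixel belongs to the object we care about"). Pixels inside the
    mask keep their color and become fully opaque; everything else becomes
    fully transparent, so the result can be transformed, occluded, or
    composited like any standalone image layer.
    """
    alpha = (mask * 255).astype(np.uint8)   # mask becomes the alpha channel
    return np.dstack([frame, alpha])        # attach alpha to the RGB frame
```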

Types of Segmentation:

Hair Segmentation: Changing a user’s hairstyle requires precise segmentation of that user’s real hair so that it can be recolored, resized, or even removed from the rendered scene and replaced entirely without affecting other parts of the scene, such as the user’s face.

Body Segmentation: Allows for the user’s background to be replaced without tools like a green screen, throwing the user into deep space, lush jungles, the Oval Office, or anywhere else you would like to superimpose your body outline against.

Skin Segmentation: Skin segmentation identifies the user’s skin. This could power an experience in which a user wears virtual tattoos that stop at the boundaries of their clothes and move along with their tracked body parts — almost perfectly lifelike.

Object Segmentation: Gives us the ability to perform occlusion so that AR objects can be partially hidden behind or beneath real ones, as they logically would be in reality, or even to “cut and paste” those real objects into virtual space.

 

Person, skin, and hair segmentation via Spark AR. Credit to Facebook, 2021.

Achieving Segmentation

How do we achieve segmentation? Approximating shapes from a database would never be even close to realistic. Identifying boundaries by color contrast is a no-go for people whose hair or clothes are close to their skin tone. Establishing a body position at the start of the experience (“Strike a pose as per this outline:”) and then tracking changes over time is clunky and unreliable. We need something near-instantaneous that can recalibrate on the fly and tolerate a wide margin of error as the subject moves. We need something smarter!

Of course, then, the answer is artificial intelligence. These days, “AI” is more often than not a buzzword thrown around to mean everything and yet nothing at all, but in this case we have a practical application for a specific form of AI: neural networks. These are machine learning algorithms that can be trained to recognize shapes or perform operations on data. By taking huge sets of data (for example, thousands and thousands of photos with and without people in them) and comparing them, neural networks have been trained to recognize hands, feet, faces, hair, horses, cars, and various other animate and inanimate entities…perfect for our use case.
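If you want to see one of these trained networks in action outside of a filter platform, here is a rough sketch using MediaPipe’s Selfie Segmentation solution (the same library credited in the body tracking figure above) together with OpenCV. It is purely illustrative; it is not the model Spark AR runs under the hood, and it assumes the mediapipe and opencv-python packages are installed.

```python
import cv2
import mediapipe as mp

mp_selfie = mp.solutions.selfie_segmentation

cap = cv2.VideoCapture(0)  # default webcam
with mp_selfie.SelfieSegmentation(model_selection=1) as segmenter:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # The network expects RGB input and returns a per-pixel "person" score.
        results = segmenter.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        mask = results.segmentation_mask > 0.5  # threshold scores into a binary mask
        frame[~mask] = 0                        # black out everything that isn't a person
        cv2.imshow('person segmentation', frame)
        if cv2.waitKey(1) & 0xFF == 27:         # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()
```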

 

Training a neural network to identify objects and remove backgrounds. Credit to Cyril Diagne, 2020.

All of this is not to say that segmentation is on the razor’s edge of new technology. Spark AR, for example, has had segmentation capabilities for at least two years. However, a fairly recent update to the platform allows creators to use multiple classes of segmentation in a single effect, and you can read more about that update here. This new capability opens the door to a host of more complex effects, and so in this case study, we use multiple-class segmentation to apply separate effects to the user’s background, body (face, hair, and skin), and clothing.

Sketching out a triple segmentation filter. Credit to Eric Liang, 2021.

Each of these layers is easily accomplished on its own using a segmentation texture from the camera. For example, Spark AR provides a “Background” template that shows how to accomplish person segmentation and insert a background image. Breaking the template down, we see that this is accomplished by first creating two flat image rectangles that overlay and fill the device screen. The topmost of these will be the person, and the one underneath will feature the background image. For the top layer (named “user” in the template), the extracted camera feed is used as a color texture. Beginners will observe that there’s no visible distinction from a blank front-facing camera project at this time. This is because the normal display is, for all practical purposes, exactly that: just a flat image rectangle that fills the screen and displays the camera feed. We’ve basically just doubled that in a way that we can tinker with and put our version on top, obscuring the “normal” display.

Next, a person segmentation texture is created and used as the alpha texture for the user rectangle. This sets the alpha value, which determines transparency, to 0 for every part of the user rectangle outside of the identified person, so those areas become completely transparent and show whatever is layered underneath instead. Within the area identified as a person, the camera feed continues to show through. This tells us that the segmentation texture is really just two binary areas, “is” and “isn’t”, with no information about what that is/isn’t actually refers to. Those familiar with image manipulation will recognize this concept as “layer masking”. The camera feed is accessed twice per frame: once to determine that is/isn’t binary and create a texture map (practically, the equivalent of a layer mask) recording that information, and once to check what color each pixel within that map should be. (Astute observers will note that it doesn’t matter in which order these checks occur.)

Finally, the template allows for any desired background image to be slotted in as the background rectangle’s color map. Voilà: person segmentation! We’ll replace the stock image with a bit of outer space for our aesthetic.
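For those who like to see the underlying math, here is a conceptual sketch of what that layering amounts to each frame. It is plain Python/NumPy rather than anything from the patch editor, and it assumes the camera frame, background image, and person mask have already been sized to match.

```python
import numpy as np

# Conceptual equivalent of the "Background" template, evaluated per frame.
# `camera` and `background` are H x W x 3 float images; `person_mask` is an
# H x W array in [0, 1] from person segmentation.
def composite(camera, background, person_mask):
    alpha = person_mask[..., None]  # H x W x 1, so it broadcasts across RGB
    # Where alpha is 1, the "user" layer shows the camera feed (the person);
    # where it is 0, that layer is fully transparent and the background shows through.
    return alpha * camera + (1.0 - alpha) * background
```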

 

Background segmentation using Spark AR’s template.

Next step: adding effects to the user’s body and clothing. Problem: we don’t have a built-in “clothes” segmentation! We have “person”, “hair”, and “skin”, but nothing that lets us easily separate the face and skin from the clothes. Snapchat’s Lens Studio is nice enough to provide built-in “upper garment” segmentation, but Spark AR is not so forthcoming. We’ll have to get a little creative with the options available to us.

Quick thinkers may have already seen the simple mathematical solution. Our segmentation options are “person”, “hair”, and “skin”. Person minus hair and skin is…exactly what we’re looking for. By combining the hair and skin segmentation textures and subtracting that from the person texture, we get the clothes left behind. Let’s get cracking on what exactly this looks like in patch form.
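Before we jump into the patch editor, here is that arithmetic written out as plain pixel math, again a NumPy sketch with boolean masks standing in for the actual segmentation textures.

```python
import numpy as np

# clothes = person minus (hair plus skin), expressed with boolean masks.
# person_mask, hair_mask, and skin_mask are assumed to be H x W boolean arrays
# derived from the three built-in segmentation textures.
def clothes_mask(person_mask, hair_mask, skin_mask):
    not_clothes = hair_mask | skin_mask   # combine hair and skin into one mask
    return person_mask & ~not_clothes     # whatever is left of the person is clothing
```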

 

Demonstrating multiple segmentation.

As a very basic implementation of the concept, it’s a little rough around the edges, but it gives us what we need. I’ve made some additional tweaks for the sample screenshots, but they won’t be covered in this case study, and I encourage you to explore, create, and refine your own solutions!

“EZ Segmentation” is a patch asset straight from the Spark AR library, and it provides options for adding effects to either the foreground (body) or the background (clothes). It’s pretty easy to build each effect on its own and then pass its texture into the corresponding slot. Here, we add a light glow gradient paired with a rippled lens flare to the foreground and a starry animation sequence to the background.
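To tie the pipeline together, here is one last conceptual sketch of the per-frame layering. It is not a literal translation of the patch graph; starfield, body_glow, and clothes_anim are placeholder effect textures for illustration, not real Spark AR asset names.

```python
import numpy as np

# Conceptual per-frame layering of three effects. All inputs are H x W (x 3)
# float arrays in [0, 1]; the masks come from the segmentation steps above.
def apply_effects(camera, starfield, body_glow, clothes_anim,
                  person_mask, clothes_mask):
    person = person_mask[..., None].astype(float)
    clothes = clothes_mask[..., None].astype(float)
    body = np.clip(person - clothes, 0.0, 1.0)            # person minus clothes = face, hair, skin

    out = (1.0 - person) * starfield                      # background region: star scene
    out += body * np.clip(camera + body_glow, 0.0, 1.0)   # body region: camera feed plus additive glow
    out += clothes * clothes_anim                         # clothing region: animated design
    return out
```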

 

The filter in action.

You can already imagine the kinds of things we can do here with the power to animate designs on the user’s clothing. Conversely, we can leave the clothing untouched and add effects to the user’s skin, whether that be coloring it in à la Smurf or Hulk, or erasing it entirely for an “Invisible Man”-type filter. These suggestions are just a place to start, of course; multiple-class segmentation is powerful enough to open the door to a galaxy’s worth of potential. Show us what you can do!