Differentiating between pixels based on their depth

Hello everyone,

I’m a beginner in AR and Unity, and I’d like feedback on an approach I’m considering: does this method work, or is there a more efficient solution?

Goal:
I need to help users identify the facade of a building they are pointing their mobile device at and then draw contours on the front-facing facade.

My proposed approach is as follows:

  1. Detect the building using Semantic Segmentation:
    First, I’ll apply semantic segmentation to identify the pixels in the camera’s view that belong to a building.
  2. Estimate the distance (depth) for each pixel classified as a building:
    For each pixel labeled as building, I’ll compute or assign a depth value that represents its distance from the camera.
  3. Filter out building pixels based on depth:
    Since the user is interested in the front-facing facade, I will filter out any building pixels that are too far away from the camera, keeping only the closer pixels that likely belong to the facade directly in front of the user (see the sketch just after this list).
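
At a high level, that filtering step might look like the sketch below. This is plain C# over abstract per-pixel arrays; the names and parameters are placeholders, since I don’t yet know how to obtain these arrays from Unity in practice.

```csharp
// Sketch of step 3: keep only building pixels near the camera.
public static class FacadeFilterSketch
{
    // isBuilding: per-pixel result of semantic segmentation (step 1).
    // depthMeters: per-pixel depth estimate (step 2).
    // maxDistanceMeters: cutoff for what counts as the front facade (step 3).
    public static bool[] Filter(bool[] isBuilding, float[] depthMeters, float maxDistanceMeters)
    {
        var keep = new bool[isBuilding.Length];
        for (int i = 0; i < isBuilding.Length; i++)
        {
            // Keep only building pixels close enough to be the front facade.
            keep[i] = isBuilding[i] && depthMeters[i] <= maxDistanceMeters;
        }
        return keep;
    }
}
```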

In the image below, after applying semantic segmentation, I believe the pixels highlighted in yellow represent areas that are farther from the camera. By using depth maps, I can determine that these pixels likely don’t belong to the front-facing facade and can exclude them from further processing.

Question:
Does this approach seem reasonable for detecting and contouring the front-facing facade, or is there a better solution that would achieve the same goal more efficiently?

Hi there,

Your approach sounds reasonable for your intended use case. I would recommend implementing this and reviewing how it performs in the real world before making changes, optimizing, or seeking an alternative.

Kind regards,
Maverick L.

Hi Maverick,

Thank you for your response. I’m progressing in this direction, but I’m currently running into challenges with the further processing of the building pixels after segmentation.

Specifically, I need to access the raw pixels corresponding to the semantic class “building” so that I can filter them to retain only those on the front-facing facade, using depth information.

I’ve been researching XRCpuImage and Texture2D, but as I’m still a beginner in Unity, I’m struggling to understand how to access the raw pixel data effectively.

Could you please share any documentation, references, tutorials, or advice that could help clarify this process for me? Any guidance would be greatly appreciated.

Thank you for your assistance!

Best regards,

You’re very welcome! To access the “raw pixels” corresponding to a particular semantic class (or channel), you can query the AR Semantic Segmentation Manager for its texture with GetSemanticChannelTexture. For an example of how to do this, take a look at How to Query Semantics and Highlight Semantic Channels in our documentation. You will also want to use the depth texture to filter out the parts of the semantic channel texture that are too far away to belong to the front facade. For a guide on obtaining the depth texture, see How to Access and Display Depth Information as a Shader.
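
For illustration, here is a minimal sketch of querying both textures from a script. It assumes Lightship ARDK 3’s ARSemanticSegmentationManager (the namespace and exact GetSemanticChannelTexture signature may differ between versions) and AR Foundation’s AROcclusionManager for depth; treat the linked guides as the authoritative reference.

```csharp
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using Niantic.Lightship.AR.Semantics; // Namespace per ARDK 3; adjust for your version.

public class FacadeTextureQuery : MonoBehaviour
{
    // Assign these in the Inspector.
    [SerializeField] private ARSemanticSegmentationManager _semanticsManager;
    [SerializeField] private AROcclusionManager _occlusionManager;

    void Update()
    {
        // Confidence texture for the "building" channel.
        Texture2D buildingTexture =
            _semanticsManager.GetSemanticChannelTexture("building", out Matrix4x4 samplerMatrix);

        // Environment depth texture from AR Foundation's occlusion manager.
        Texture2D depthTexture = _occlusionManager.environmentDepthTexture;

        if (buildingTexture == null || depthTexture == null)
            return; // Textures may be unavailable for the first few frames.

        // ... pass both textures (and samplerMatrix) to a material,
        // or read them on the CPU as in the next snippet.
    }
}
```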

The main idea is to consider the depth information texture and the semantic channel texture together. You can examine the pixels of the semantic channel texture either through a multi-pass shader (in its frag function) or by writing a script that compares the pixels of the two textures using Texture2D’s GetPixelData method. To keep things as simple as possible, I would ensure both the semantic channel and depth textures are aligned and the same size before you begin looking at each individual pixel. Recall that the depth information texture is displayed as a grayscale image, where brighter pixels are closer than darker ones. If a valid pixel in the semantic channel texture is too dark in the depth texture, you would filter it out. What counts as “too dark” or “too far” is up to you and your expectations of how your application will be used.
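
To illustrate the script route, here is a rough sketch. It assumes both textures are already aligned, equal in size, CPU-readable, and stored in a single-channel float format (e.g. R32_SFloat); verify those assumptions for your ARDK and Unity versions. Note that when reading the raw depth texture in a script, you compare distances in meters directly rather than grayscale brightness.

```csharp
using Unity.Collections;
using UnityEngine;

public static class FacadeFilter
{
    // Returns a mask marking pixels that are "building" and near enough to
    // belong to the front facade. Assumes both textures have the same size
    // and a single-channel float format (e.g. R32_SFloat).
    public static bool[] BuildFacadeMask(
        Texture2D buildingConfidence,
        Texture2D depthMeters,
        float confidenceThreshold,
        float maxDistanceMeters)
    {
        // GetPixelData exposes the raw texel data without a copy.
        NativeArray<float> confidence = buildingConfidence.GetPixelData<float>(0);
        NativeArray<float> depth = depthMeters.GetPixelData<float>(0);

        var mask = new bool[confidence.Length];
        for (int i = 0; i < confidence.Length; i++)
        {
            // Keep confident building pixels that are close to the camera.
            mask[i] = confidence[i] >= confidenceThreshold
                      && depth[i] <= maxDistanceMeters;
        }
        return mask;
    }
}
```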

Kind regards,
Maverick L.
