
Object Detection

With over 200 new classes of objects, the Object Detection subsystem enhances Lightship's contextual awareness by creating semantically labeled 2D bounding boxes that update dynamically as real-world objects appear on-screen. For each bounding box, the subsystem processes the central square crop of the image, then makes an independent prediction for every subclass and returns the probability that the detected object belongs to each of them. Lightship Object Detection also provides a model card, included below, that documents the detection of people, human hands, and human faces.

Image with Bounding Boxes around Detected Objects

Basic Usage

By placing Lightship's ARObjectDetectionManager in a scene and subscribing to the ObjectDetectionsUpdated event, developers can receive real-time detection information in the form of XRDetectedObjects. You can also listen for the MetadataInitialized event to receive the list of object classes when the model becomes available to use.

The frame rate of the ARObjectDetectionManager can also be adjusted: lower it to save performance, or raise it to detect objects more frequently.

Image displaying ARObjectDetectionManager properties

Object Detection Categories

note

While the probability of each class is computed independently, the subclasses of each category are used to train the categorical classes they belong to. Because of this, objects can be detected as members of their categorical classes rather than their specific subclasses. For example, a French horn will return a high probability that the object is in the french horn, brass instrument, and musical instrument classes.
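
Because each class probability is independent, one detection can clear a confidence threshold for several related classes at once. The Python sketch below illustrates this with made-up numbers. The probabilities, the threshold value, and the function name `confident_classes` are hypothetical and are not part of the Lightship API.

```python
def confident_classes(probabilities, threshold=0.5):
    """Return class names whose independent probability meets the threshold,
    ordered from most to least confident."""
    hits = [(name, p) for name, p in probabilities.items() if p >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in hits]


# Hypothetical per-class probabilities for a single bounding box.
# They do not need to sum to 1, because each class is predicted independently.
french_horn_box = {
    "french horn": 0.91,
    "brass instrument": 0.88,
    "musical instrument": 0.93,
    "trumpet": 0.12,
}

print(confident_classes(french_horn_box))
```

In this sketch the subclass and both of its parent categories pass the threshold, while the sibling subclass does not. An application that wants only the most specific label would need its own rule for choosing among the returned classes.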

Category | Subclasses
Aircraft | airplane, helicopter, hot air balloon, parachute, rocket
Building features | door, door handle, window
Car | car, taxi
Outdoor furniture | barrel, bench, billboard, fire hydrant, flag, parking meter, sculpture, snowman, street light, traffic light, waste container
Vehicle | vehicle, bicycle, boat, bus, car, cart, motorcycle, taxi, train, truck, wheel, wheelchair
Water feature | fountain, swimming pool
Accessories | backpack, glasses, handbag, umbrella
Clothing | coat, dress, shirt, shorts, skirt, sock, suit, tie, trousers
Footwear | footwear, roller skates
Headwear | headwear, fedora
Musical instrument | accordion, brass instrument, drum, flute, guitar, piano, string instrument, violin
Brass instrument | french horn, saxophone, trombone, trumpet
String instrument | banjo, cello, harp, guitar, violin
Food | food, apple, banana, berry, bread, broccoli, cake, carrot, cheese, citrus, coconut, dessert, donut, egg, fast food, grape, hamburger, hot dog, ice cream, mushroom, pear, pizza, pumpkin, sandwich, sushi, tomato
Berry | berry, raspberry, strawberry
Citrus | citrus, grapefruit, lemon, lime, orange
Dessert | dessert, cake, donut, ice cream
Fast food | fast food, french fries, hot dog, pizza, hamburger, sandwich
Pumpkin | pumpkin, squash
Drink | drink, hot drink, juice
Hot drink | tea, coffee (recognized when in a cup)
Appliances | hair dryer, microwave, oven, refrigerator, toaster
Cooking pan | frying pan, pressure cooker, slow cooker, waffle iron, wok
Indoor furniture | furniture, bed, chair, christmas tree, couch, curtains, poster, shelves, storage cabinet, table
Jug | jug, teapot
Lamp | lamp, candle
Home features | bathtub, fireplace, sink, tap, toilet
Miscellaneous items | book, bottle, bowl, box, cannon, chopsticks, coin, cup, flowerpot, fork, knife, pen, pillow, plate, potted plant, scissors, skull, spoon, tin can, toothbrush, wine glass
Screen | screen, computer display, tablet, TV
Sports ball | sports ball, football, rugby ball, tennis ball
Sports equipment | baseball bat, baseball glove, frisbee, kite, paddle, skateboard, skis, snowboard, tennis racket
Tech | camera, clock, computer keyboard, computer mouse, headphones, microphone, phone, remote, watch
Toy | toy, doll, teddy bear
Animal | animal, alpaca, bear, big cat, bird, camel, cat, cow, crocodile, deer, dog, dolphin, elephant, fish, frog, giraffe, goldfish, hippopotamus, horse, jellyfish, kangaroo, panda, parrot, pig, polar bear, rabbit, reptile, rhinoceros, seal, sheep, shellfish, squirrel, turtle, water bird, whale, zebra
Big cat | cheetah, jaguar, leopard, lion, lynx, tiger
Bird | bird, parrot, water bird
Camelids | alpaca, camel, llama
Crocodile | crocodile, alligator
Deer | antelope, deer, moose
Flower | flower, rose, sunflower
Horse | donkey, horse, mule
Insect | insect, butterfly
Fish | fish, goldfish, jellyfish, manta ray, seahorse, shellfish, squid
Reptile | reptile, crocodile, turtle
Seal | seal, sea lion, walrus
Sheep | goat, sheep
Shellfish | crab, lobster, oyster, shrimp, snail, starfish
Turtle | turtle, sea turtle, tortoise
Water bird | duck, goose, swan
Person | person, human face, human hand

Person Detection Model Card v0.4

Model Details

  • Model last updated: 2024-02-29
  • Model version: v0.4
  • License: refer to the terms of service for Lightship.

Technical specifications

The object detection model returns a set of bounding boxes and, for each box, reports the probability that it contains a person, a human hand, or a human face.

Intended use

Intended use cases

  • Identifying people (more specifically, human hands or human faces) in an image.
  • Querying the presence or absence of people, human hands, or human faces in an image.

Permitted users

Augmented reality developers through Niantic Lightship.

Out-of-scope use cases

This model does not provide the capability to:

  • Track individuals
  • Identify or recognize individuals

Factors

The following factors apply to all object detection provided in the Lightship ARDK, including person detection:

  • Scale: objects / classes may not be detected if they are very far away from the camera.
  • Lighting: extreme light conditions may affect the overall performance.
  • Viewpoint: extreme camera views that have not been seen during training may lead to a miss in detection or a class confusion.
  • Occlusion: objects may not be detected if they are covered by other objects.
  • Motion blur: fast camera or object motion may degrade the performance of the model.
  • Flicker: there may be a ‘jittering’ effect between predictions of temporally adjacent frames.

For person detection specifically, based on known problems with computer vision technology, we identify potential relevant factors that include subgroups for:

  • Geographical region
  • Skin tone
  • Gender
  • Body posture: certain body configurations may be harder to predict due to appearing less often in the training corpus.
  • Other: age, fashion style, accessories, body alterations, etc.

Fairness evaluation

At Niantic, we strive for our technology to be inclusive and fair by following strict equality and fairness practices when building, evaluating, and deploying our models. We define person detection fairness as follows: a model makes fair predictions if it performs equally on images that depict a variety of the identified subgroups. The evaluation results focus on measuring the performance of the union of the human channels (person, human hand, and human face) on the first three main subgroups (geographical region, skin tone, and gender).

Instrumentation and dataset details

Our benchmark dataset comprises 5650 images captured around the world using the back camera of a smartphone, with these specifications:

  • Only one person per image is depicted.
  • Both indoor and outdoor environments.
  • Captured with a variety of devices.
  • No occlusions.

Images are labeled with the following attributes:

  • Geographical region: based on the UN geoscheme, with the European subregions merged into a single region and Micronesia, Polynesia, and Melanesia merged into another:
    • Northern Africa
    • Eastern Africa
    • Middle Africa
    • Southern Africa
    • Western Africa
    • Caribbean
    • Central America
    • South America
    • Northern America
    • Central Asia
    • Eastern Asia
    • South Eastern Asia
    • Southern Asia
    • Western Asia
    • Europe
    • Australia and New Zealand
    • Melanesia, Micronesia, and Polynesia
  • Skin tone: following the Fitzpatrick scale, images are annotated from subgroup 1 to 6. Skin tone is a self-reported value provided by the person in each image.
  • Gender: images are annotated with self-reported gender.

Metrics

The standard metric for evaluating object detection models, and the one we use, is Intersection over Union (IoU). It is computed as follows:

IoU = (area of overlap between the predicted and ground-truth boxes) / (area of union of the predicted and ground-truth boxes)

Reported IoUs are averages (mean IoU or mIoU) over images belonging to the referenced subgroup unless stated otherwise.
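
As a minimal sketch of this metric, the Python below computes IoU for two axis-aligned boxes and averages it over a set of image pairs. The `(x_min, y_min, x_max, y_max)` box layout and the function names `iou` and `mean_iou` are assumptions made for illustration; the model card does not specify how boxes are represented in the evaluation.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the doubly counted overlap.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over a non-empty list of (predicted, ground_truth) box pairs."""
    return sum(iou(pred, truth) for pred, truth in pairs) / len(pairs)


# Hypothetical boxes: one perfect prediction and one with partial overlap.
pairs = [
    ((0, 0, 2, 2), (0, 0, 2, 2)),
    ((0, 0, 2, 2), (1, 1, 3, 3)),
]
print(mean_iou(pairs))
```

The second pair overlaps in a 1 x 1 square out of a combined area of 7, so its IoU is 1/7, and the mean over both pairs is 4/7.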

Fairness criteria

A model is considered to be making unfair predictions if it yields a performance (mIoU) for a particular subgroup that is three standard deviation units or more from the mean across all subgroups.
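
The criterion can be sketched in a few lines of Python. The function name `unfair_subgroups` and the example scores are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text above does not say which estimator is used.

```python
from statistics import mean, stdev


def unfair_subgroups(miou_by_subgroup, k=3.0):
    """Return the subgroups whose mIoU lies k or more standard deviations
    from the mean across all subgroups."""
    values = list(miou_by_subgroup.values())
    center = mean(values)
    spread = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [
        name
        for name, value in miou_by_subgroup.items()
        if abs(value - center) >= k * spread
    ]


# Hypothetical scores: eleven subgroups at 80% mIoU and one far below.
scores = {f"group_{i}": 80.0 for i in range(11)}
scores["outlier"] = 60.0

print(unfair_subgroups(scores))
```

The check is symmetric: a subgroup is flagged whether it performs far below or far above the mean.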

Results

Geographical evaluation

Average performance across all 17 regions is 78.74% with a standard deviation of 1.22%. All regions exhibit a performance in the range of [76.92%, 82.17%]. The maximum difference between the mean and the worst performing region is 1.83%, within our fairness criterion threshold of 3 standard deviations (3x1.22% = 3.65%).

Region | mIoU | stdev | Number of images
Northern Africa | 78.26% | 15.04% | 301
Eastern Africa | 77.41% | 17.11% | 336
Middle Africa | 77.30% | 15.72% | 322
Southern Africa | 79.09% | 14.93% | 368
Western Africa | 79.04% | 13.26% | 364
Caribbean | 79.01% | 12.20% | 412
Central America | 79.44% | 13.79% | 415
South America | 78.39% | 14.21% | 397
Northern America | 79.09% | 13.00% | 335
Central Asia | 79.52% | 12.56% | 229
Eastern Asia | 77.60% | 15.37% | 346
South Eastern Asia | 77.86% | 14.86% | 333
Southern Asia | 79.34% | 12.15% | 353
Western Asia | 78.80% | 14.91% | 370
Europe | 79.40% | 13.14% | 320
Australia and New Zealand | 76.92% | 18.13% | 374
Melanesia, Micronesia and Polynesia | 82.17% | 11.08% | 75
Average (across all images) | 78.55% | 14.55% | 5650
Average (across regions) | 78.74% | 1.22% | -
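
The "across regions" row can be recomputed from the per-region mIoU values in the table. The Python sketch below does so under the assumption that the reported 1.22% is a sample standard deviation (dividing by n - 1); the variable names are illustrative only. Because the inputs are the rounded percentages from the table, the last digit of a derived figure may differ slightly from the published one.

```python
from statistics import mean, stdev

# Per-region mIoU values (in percent), copied from the table above.
REGION_MIOU = {
    "Northern Africa": 78.26, "Eastern Africa": 77.41, "Middle Africa": 77.30,
    "Southern Africa": 79.09, "Western Africa": 79.04, "Caribbean": 79.01,
    "Central America": 79.44, "South America": 78.39, "Northern America": 79.09,
    "Central Asia": 79.52, "Eastern Asia": 77.60, "South Eastern Asia": 77.86,
    "Southern Asia": 79.34, "Western Asia": 78.80, "Europe": 79.40,
    "Australia and New Zealand": 76.92,
    "Melanesia, Micronesia and Polynesia": 82.17,
}

values = list(REGION_MIOU.values())
across_regions = mean(values)   # unweighted mean over the 17 regions
spread = stdev(values)          # assumption: sample standard deviation
threshold = 3 * spread          # fairness threshold of 3 standard deviations
largest_gap = max(abs(v - across_regions) for v in values)

print(f"mean={across_regions:.2f}% stdev={spread:.2f}% "
      f"threshold={threshold:.2f}% largest gap={largest_gap:.2f}%")
```

With these inputs the mean and standard deviation round to the 78.74% and 1.22% shown in the table. The largest gap in either direction comes from the best-performing region rather than the worst, and it also stays below the threshold.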

Skin tone evaluation results

Average performance across all six skin tones is 78.58% with a standard deviation of 0.24%. All skin tone subgroups yield a performance in the range of [78.23%, 78.97%]. The maximum difference between the mean and the worst performing skin tone subgroup is 0.34%, within our fairness criterion threshold of 3 stdevs (3x0.24% = 0.71%).

Skin tone (Fitzpatrick scale) | mIoU | stdev | Number of images
1 | 78.59% | 12.00% | 247
2 | 78.49% | 14.59% | 1919
3 | 78.61% | 14.39% | 1463
4 | 78.23% | 16.52% | 457
5 | 78.97% | 13.60% | 706
6 | 78.56% | 14.67% | 858
Average (across all images) | 78.55% | 14.55% | 5650
Average (across skin tones) | 78.58% | 0.24% | -

Gender evaluation results

Average performance of all evaluated gender subgroups is 78.53% with a range [78.01%, 79.05%]. The difference between the average and the worst performing gender is 0.52%, within our fairness criterion threshold of 3 stdevs (3x0.74% = 2.22%).

Perceived gender | mIoU | stdev | Number of images
Female | 78.01% | 15.08% | 2585
Male | 79.05% | 13.96% | 3065
Average (across all images) | 78.55% | 14.55% | 5650
Average (across genders) | 78.53% | 0.74% | -

Ethical Considerations

  • Privacy: When the model is used in ARDK, inference runs only on-device and the image is not transferred off the user's device.
  • Human Life: This model is designed for entertainment purposes within an augmented reality application. It is not intended to be used for making human life-critical decisions.
  • Bias: Training datasets have not been audited for diversity and may present biases not surfaced by our benchmarks.

Caveats and Recommendations

  • Our annotated dataset only contains binary genders, which we include as male/female. Further data would be needed to evaluate across a spectrum of genders.
  • An ideal skin tone evaluation dataset would also include camera details and further environment details such as lighting and humidity. Furthermore, the Fitzpatrick scale is limited in that it does not represent the full spectrum of human skin tones.
  • This model card is based on the work of Mitchell, Margaret, et al. "Model cards for model reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency. 2019.