
Object Detection

With over 200 new classes of objects, the Object Detection subsystem enhances Lightship's contextual awareness capabilities by creating semantically labeled 2D bounding boxes that dynamically update as real-world objects appear on-screen. For each bounding box, the subsystem processes the central square crop of the image, makes an independent prediction for every class, and returns the probability that the detected object belongs to each class. This page also includes a model card that explains how detections were trained for a person, a human hand, or a human face.

Image with Bounding Boxes around Detected Objects

Basic Usage

By placing Lightship's ARObjectDetectionManager in a scene and subscribing to the ObjectDetectionsUpdated event, developers can receive real-time detection information in the form of XRDetectedObjects. You can also listen for the MetadataInitialized event to receive the list of object classes when the model becomes available to use.

The frame rate of the ARObjectDetectionManager can also be adjusted: lower it to reduce the performance cost, or raise it to detect objects more frequently.

Image displaying ARObjectDetectionManager properties

Object Detection Categories

There are 206 different categories that the neural network looks for inside a bounding box.

Category List
Category | Description
human_face | human face
human_hand | human hand
person | person, man, woman, boy, girl, human body
skull | skull
aircraft | aircraft, airplane, helicopter, rocket, parachute
bicycle | bicycle, stationary bicycle, unicycle
boat | boat, watercraft, barge, gondola, canoe, jet ski, submarine, personal flotation device
bus | bus
car | car, snowmobile, golf cart, tank, snowplow, ambulance, van, limousine, taxi, bus, truck
cart | cart
motorcycle | motorcycle
taxi | taxi
train | train
truck | truck
vehicle | vehicle, car, land vehicle, snowmobile, golf cart, tank, train, snowplow, ambulance, bicycle, unicycle
wheel | wheel, tire, bicycle wheel
wheelchair | wheelchair
bench | bench
billboard | billboard, scoreboard
christmas_tree | christmas_tree
door | door
door_handle | door_handle
fire_hydrant | fire_hydrant
flag | flag
parking_meter | parking_meter
poster | poster, picture frame
sculpture | sculpture, bust, bronze sculpture
street_light | street_light
traffic_light | traffic_light
traffic_sign | traffic_sign, stop sign
waste_container | waste_container, garbage bin, trash can
water_feature | water_feature, swimming_pool, jacuzzi, fountain
window | window (both indoor and outdoor)
backpack | backpack
clothing | clothing, sports uniform
coat | coat, jacket
dress | dress
fedora | fedora, sun hat, cowboy hat
footwear | footwear, roller skates, boot, high heels, sandal
glasses | glasses, sunglasses, goggles
handbag | handbag, briefcase, picnic basket, luggage and bags
headwear | headwear, hat, cowboy hat, fedora, sombrero, sun hat, swim cap, helmet, bicycle helmet, football helmet
roller_skates | roller_skates
shirt | shirt
shorts | shorts
skirt | skirt, miniskirt
sock | sock
suit | suit
suitcase | suitcase, briefcase
tie | tie
trousers | trousers, jeans
umbrella | umbrella
baseball_bat | baseball_bat
baseball_glove | baseball_glove
football | football (soccer)
frisbee | frisbee, flying disc
kite | kite
paddle | paddle
rugby_ball | rugby_ball
skateboard | skateboard
skis | skis, ski
snowboard | snowboard
sports_ball | sports_ball, ball, football, cricket ball, volleyball, tennis ball, rugby ball
surfboard | surfboard
tennis_ball | tennis_ball
tennis_racket | tennis_racket, table tennis racket, racket
accordion | accordion
brass_instrument | brass_instrument, french horn, saxophone, trombone, trumpet
drum | drum
flute | flute, harmonica, oboe
guitar | guitar
musical_instrument | musical_instrument, organ, banjo, cello, drum, french horn, guitar, harp, harpsichord, harmonica, oboe,
piano | piano, organ, harpsichord, musical keyboard
string_instrument | string_instrument, guitar, banjo, cello, harp, violin
violin | violin
apple | apple
banana | banana
berry | berry, strawberry, raspberry
broccoli | broccoli
carrot | carrot
citrus | citrus, orange, lemon, grapefruit
coconut | coconut
egg | egg
food | food, fast food, hot dog, french fries, waffle, pancake, burrito, snack, pretzel, popcorn, cookie,
grape | grape
mushroom | mushroom
pear | pear
pumpkin | pumpkin, squash
tomato | tomato
drink | drink, beer, cocktail, coffee, juice, tea, wine, bottle
hot_drink | hot_drink, tea, coffee
juice | juice
bread | bread
cake | cake, tart, muffin
cheese | cheese
dessert | dessert, ice cream, cake, dessert, muffin, doughnut, donut, bagel, cookie, biscuit, waffle, pancake,
donut | donut, doughnut, bagel, pretzel
fast_food | fast_food, hot_dog, french_fries, pizza, burrito, hamburger, sandwich
french_fries | french_fries
hamburger | hamburger
hot_dog | hot_dog
ice_cream | ice_cream
pizza | pizza
sandwich | sandwich, submarine sandwich, burrito
sushi | sushi
bed | bed, infant bed, dog bed
chair | chair, stool
couch | couch, sofa, studio couch, loveseat, sofa bed
furniture | furniture, chair, cabinetry, desk, wine rack, couch, sofa bed, loveseat, wardrobe, nightstand,
shelves | shelves, wine rack, bookcase, spice rack
storage_cabinet | storage_cabinet, wardrobe, cupboard, closet, cabinetry, filing cabinet, chest of drawers, bathroom cabinet
table | table, dining table, desk, table, coffee table, kitchen table, billiard table, countertop, nightstand,
bathtub | bathtub
fireplace | fireplace, wood-burning stove
microwave | microwave, microwave oven
oven | oven
refrigerator | refrigerator
screen | screen, tv, television, computer monitor, tablet computer
sink | sink
tap | tap, shower
toaster | toaster
toilet | toilet, bidet
balloon | balloon
barrel | barrel
book | book
bottle | bottle
bowl | bowl, mixing bowl
box | box
camera | camera, binoculars
candle | candle
cannon | cannon
chopsticks | chopsticks
clock | clock, wall clock, alarm clock
coin | coin
computer_keyboard | computer_keyboard, keyboard
computer_mouse | computer_mouse
cooking_pan | cooking_pan, frying pan, wok, waffle iron, slow cooker, pressure cooker
cup | cup, mug, coffee cup
curtain | curtain, window blind
doll | doll
flowerpot | flowerpot, vase
fork | fork
hair_dryer | hair_dryer
headphones | headphones
jug | jug, measuring cup, teapot, cocktail shaker, pitcher, beaker, kettle
knife | knife, kitchen knife, pizza cutter, chisel, dagger, sword
lamp | lamp, lantern, candle, light bulb, flashlight, torch, ceiling fan
laptop | laptop
microphone | microphone
pen | pen, pencil
phone | phone, telephone, cell phone, mobile phone, smartphone, corded phone, ipod
pillow | pillow
plate | plate, saucer, platter, cake stand
potted_plant | potted_plant, houseplant
remote | remote, remote control
scissors | scissors
snowman | snowman
spoon | spoon, ladle, spatula
teapot | teapot, kettle
teddy_bear | teddy_bear
tin_can | tin_can, cooking spray
toothbrush | toothbrush
toy | toy, doll, dice, flying disc, teddy bear
watch | watch
wine_glass | wine_glass
flower | flower
rose | rose
sunflower | sunflower
animal | animal, squid, shellfish, oyster, lobster, shrimp, crab, bird, magpie, woodpecker, blue jay, ostrich,
bird | bird, magpie, woodpecker, blue jay, ostrich, penguin, raven, chicken, eagle, owl, duck, canary, goose,
parrot | parrot
water_bird | water_bird, duck, goose, swan
butterfly | butterfly, moths and butterflies
insect | insect, tick, centipede, isopod, bee, beetle, ladybug, ant, moths and butterflies, caterpillar, butterfly
dolphin | dolphin
fish | fish, goldfish, shark, "Rays and Skates", seahorse, squid
goldfish | goldfish
jellyfish | jellyfish
seal | seal, sea lion, harbor seal, walrus
shellfish | shellfish, lobster, oyster, shrimp, crab, starfish, snail
whale | whale
alpaca | alpaca
bear | bear, brown bear
big_cat | big_cat, lynx, jaguar, tiger, lion, leopard, cheetah
camel | camel
cat | cat
cow | cow, bull, cattle
crocodile | crocodile, alligator
deer | deer, antelope
dog | dog
elephant | elephant
frog | frog
giraffe | giraffe
hippopotamus | hippopotamus
horse | horse, donkey, mule
kangaroo | kangaroo
panda | panda
pig | pig
polar_bear | polar_bear
rabbit | rabbit
reptile | reptile, lizard, snake, turtle, tortoise, sea turtle, crocodile, frog
rhinoceros | rhinoceros
sheep | sheep, goat
squirrel | squirrel
turtle | turtle, tortoise, sea turtle
zebra | zebra
Note

Some of these categories are also covered by another of the 206 categories; for example, "cat," "dog," and a few others fall under the "animal" category. The neural network makes an independent prediction for each of the 206 categories. For example, it will predict that the bounding box of a cat is both a "cat" and an "animal" with relatively high, but likely different, confidences, and there is no guarantee that one will always be predicted with higher confidence than the other. So if your application is looking for a specific type of object (in this case, either "cat" or "animal"), check the several most confident categorizations for each bounding box instead of only the single most confident one.
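The recommendation in the note can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the ARDK API: the function names, the dictionary of per-category confidences, the value of k, and the 0.4 threshold are all hypothetical and only stand in for whatever your application receives from the detection results.

```python
from typing import Dict, List, Set, Tuple


def top_categories(confidences: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (category, confidence) pairs with the highest confidence."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def matches_any(confidences: Dict[str, float], wanted: Set[str],
                k: int = 3, min_confidence: float = 0.4) -> bool:
    """True if any wanted category is among the top k with enough confidence."""
    return any(name in wanted and score >= min_confidence
               for name, score in top_categories(confidences, k))


# Hypothetical confidences for one bounding box drawn around a cat.
box = {"animal": 0.82, "cat": 0.77, "dog": 0.12, "teddy_bear": 0.05}

print(matches_any(box, {"cat"}, k=1))  # only the single best category is checked
print(matches_any(box, {"cat"}, k=3))  # the top three categories are checked
```

In this made-up box, "animal" narrowly outranks "cat", so a check that looks only at the single most confident category would miss the cat, while a check over the top three finds it.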

Super-categories List
Super-category | Covers
car | car, taxi
vehicle | vehicle, car, train, bicycle, taxi, motorcycle, bus, truck
footwear | footwear, roller skates
headwear | headwear, fedora
sports ball | sports ball, football, rugby ball, tennis ball
musical instrument | brass instrument, string instrument, piano, accordion, drum, flute
string instrument | string instrument, guitar, violin
food | food, apple, banana, berry, broccoli, carrot, citrus, coconut, egg, grape, pear, pumpkin, tomato, bread, cake, cheese, dessert, donut, fast food, hamburger, hot dog, ice cream, pizza, sandwich, sushi
drink | drink, hot drink, juice
dessert | dessert, cake, ice cream, donut
fast food | fast food, french fries, hot dog, pizza, hamburger, sandwich
furniture | furniture, bed, chair, couch, shelves, storage cabinet, table
jug | jug, teapot
lamp | lamp, candle
toy | toy, doll, teddy bear
flower | flower, rose, sunflower
animal | animal, bird, parrot, water bird, dolphin, fish, goldfish, jellyfish, seal, shellfish, whale, alpaca, bear, big cat, camel, cat, cow, crocodile, deer, dog, elephant, frog, giraffe, hippopotamus, horse, kangaroo, panda, pig, polar bear, rabbit, reptile, rhinoceros, sheep, squirrel, turtle, zebra
bird | bird, parrot, water bird
insect | insect, butterfly
fish | fish, goldfish
reptile | reptile, crocodile, frog, turtle
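One way an application might use the super-categories list is to expand a target such as "vehicle" into every category it covers, then accept a bounding box if any of them is confident enough. The sketch below is only an illustration in Python: the mapping copies three rows from the list, written with underscores on the assumption that they match the identifiers in the category list, and the function names and the 0.4 threshold are hypothetical.

```python
from typing import Dict, Set

# Three rows copied from the super-categories list; underscore spelling is an assumption.
SUPER_CATEGORIES: Dict[str, Set[str]] = {
    "car": {"car", "taxi"},
    "vehicle": {"vehicle", "car", "train", "bicycle", "taxi", "motorcycle", "bus", "truck"},
    "bird": {"bird", "parrot", "water_bird"},
}


def expand(category: str) -> Set[str]:
    """Return every category covered by `category`, or just itself if it covers nothing else."""
    return SUPER_CATEGORIES.get(category, {category})


def box_matches(confidences: Dict[str, float], target: str,
                min_confidence: float = 0.4) -> bool:
    """True if any category covered by `target` reaches the confidence threshold."""
    return any(confidences.get(name, 0.0) >= min_confidence for name in expand(target))


# Hypothetical confidences for one bounding box drawn around a taxi.
box = {"taxi": 0.71, "car": 0.55, "vehicle": 0.32}

print(box_matches(box, "vehicle"))  # matches through "taxi" and "car"
print(box_matches(box, "bird"))     # no covered category is present
```

Because each category is predicted independently, the super-category itself may score below the threshold (as "vehicle" does in this made-up box) while one of the categories it covers scores above it.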

Person Detection Model Card v0.4

Model Details

  • Model last updated: 2024-02-29
  • Model version: v0.4
  • License: refer to the terms of service for Lightship.

Technical specifications

The object detection model returns a set of bounding boxes and reports the probability that the box is a person, a human hand, or a human face.

Intended use

Intended use cases

  • Identifying people (more specifically, human hands or human faces) in an image.
  • Querying the presence or absence of people, human hands, or human faces in an image.

Permitted users

Augmented reality developers through Niantic Lightship.

Out-of-scope use cases

This model does not provide the capability to:

  • Track individuals
  • Identify or recognise individuals

Factors

The following factors apply to all object detection provided in the Lightship ARDK, including person detection:

  • Scale: objects / classes may not be detected if they are very far away from the camera.
  • Lighting: extreme light conditions may affect the overall performance.
  • Viewpoint: extreme camera views that have not been seen during training may lead to a miss in detection or a class confusion.
  • Occlusion: objects may not be detected if they are covered by other objects.
  • Motion blur: fast camera or object motion may degrade the performance of the model.
  • Flicker: there may be a ‘jittering’ effect between predictions of temporally adjacent frames.

For person detection specifically, and based on known problems with computer vision technology, we identify potentially relevant factors that include subgroups for:

  • Geographical region
  • Skin tone
  • Gender
  • Body posture: certain body configurations may be harder to predict due to appearing less often in the training corpus.
  • Other: age, fashion style, accessories, body alterations, etc.

Fairness evaluation

At Niantic, we strive for our technology to be inclusive and fair by following strict equality and fairness practices when building, evaluating, and deploying our models. We define person detection fairness as follows: a model makes fair predictions if it performs equally on images that depict a variety of the identified subgroups. The evaluation results focus on measuring the performance of the union of the human channels (person, human hand, and human face) on the first three main subgroups (geographical region, skin tone, and gender).

Instrumentation and dataset details

Our benchmark dataset comprises 5650 images captured around the world using the back camera of a smartphone, with these specifications:

  • Only one person per image is depicted.
  • Both indoor and outdoor environments.
  • Captured with a variety of devices.
  • No occlusions.

Images are labeled with the following attributes:

  • Geographical region: based on the UN geoscheme, with the European subregions merged into one group and Micronesia, Polynesia, and Melanesia merged into another:
    • Northern Africa
    • Eastern Africa
    • Middle Africa
    • Southern Africa
    • Western Africa
    • Caribbean
    • Central America
    • South America
    • Northern America
    • Central Asia
    • Eastern Asia
    • South Eastern Asia
    • Southern Asia
    • Western Asia
    • Europe
    • Australia and New Zealand
    • Melanesia, Micronesia, and Polynesia
  • Skin tone: following the Fitzpatrick scale, images are annotated from subgroup 1 to 6. Skin tone is a self-reported value provided by the person in each image.
  • Gender: images are annotated with self-reported gender.

Metrics

The standard metric for evaluating object detection models -- and the one we use -- is Intersection over Union (IoU). It is computed as follows:

IoU = (area of overlap between predicted and ground-truth boxes) / (area of union of predicted and ground-truth boxes)

Reported IoUs are averages (mean IoU or mIoU) over images belonging to the referenced subgroup unless stated otherwise.
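As an illustration of the metric, the sketch below computes IoU for axis-aligned boxes and averages it over a set of images. It is a minimal example rather than the evaluation code behind this model card: the (x_min, y_min, x_max, y_max) box layout, the function names, and the sample boxes are assumptions made for the example.

```python
from typing import Sequence, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(predicted: Box, ground_truth: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    overlap_w = min(predicted[2], ground_truth[2]) - max(predicted[0], ground_truth[0])
    overlap_h = min(predicted[3], ground_truth[3]) - max(predicted[1], ground_truth[1])
    overlap = max(0.0, overlap_w) * max(0.0, overlap_h)
    union = box_area(predicted) + box_area(ground_truth) - overlap
    return overlap / union if union > 0 else 0.0


def mean_iou(pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Mean IoU over a non-empty sequence of (predicted, ground-truth) pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)


# Made-up (predicted, ground-truth) pairs, one per image.
pairs = [
    ((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)),  # identical boxes
    ((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)),  # overlap area 1, union area 7
    ((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)),  # disjoint boxes
]
print(mean_iou(pairs))
```

The benchmark images depict one person each, so pairing one predicted box with one ground-truth box per image is a natural simplification here; images with several objects would need a matching step first.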

Fairness criteria

A model is considered to be making unfair predictions if it yields a performance (mIoU) for a particular subgroup that is three standard deviation units or more from the mean across all subgroups.
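The criterion can be checked mechanically. The sketch below applies it to the six skin tone mIoU values reported in the Results section of this model card. It assumes the spread is the sample standard deviation across subgroups, which is consistent with the reported figure of 0.24%; the function and variable names are hypothetical.

```python
import statistics
from typing import Dict, List


def fairness_report(miou_by_subgroup: Dict[str, float], num_stdevs: float = 3.0) -> dict:
    """Flag subgroups whose mIoU lies num_stdevs or more from the mean across subgroups."""
    values = list(miou_by_subgroup.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (assumption)
    threshold = num_stdevs * stdev
    flagged: List[str] = [
        name for name, value in miou_by_subgroup.items()
        if stdev > 0 and abs(value - mean) >= threshold
    ]
    return {"mean": mean, "stdev": stdev, "threshold": threshold, "flagged": flagged}


# mIoU (%) per Fitzpatrick subgroup, copied from the skin tone results table.
skin_tone_miou = {"1": 78.59, "2": 78.49, "3": 78.61, "4": 78.23, "5": 78.97, "6": 78.56}

report = fairness_report(skin_tone_miou)
print(report["flagged"])  # subgroups that fail the criterion, if any
```

With these values no subgroup is flagged: the largest deviation from the mean is well under three standard deviations.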

Results

Geographical evaluation

Average performance across all 17 regions is 78.74% with a standard deviation of 1.22%. All regions exhibit a performance in the range of [76.92%, 82.17%]. The maximum difference between the mean and the worst performing region is 1.83%, within our fairness criterion threshold of 3 standard deviations (3x1.22% = 3.65%).

Region | mIoU | stdev | Number of images
Northern Africa | 78.26% | 15.04% | 301
Eastern Africa | 77.41% | 17.11% | 336
Middle Africa | 77.30% | 15.72% | 322
Southern Africa | 79.09% | 14.93% | 368
Western Africa | 79.04% | 13.26% | 364
Caribbean | 79.01% | 12.20% | 412
Central America | 79.44% | 13.79% | 415
South America | 78.39% | 14.21% | 397
Northern America | 79.09% | 13.00% | 335
Central Asia | 79.52% | 12.56% | 229
Eastern Asia | 77.60% | 15.37% | 346
South Eastern Asia | 77.86% | 14.86% | 333
Southern Asia | 79.34% | 12.15% | 353
Western Asia | 78.80% | 14.91% | 370
Europe | 79.40% | 13.14% | 320
Australia and New Zealand | 76.92% | 18.13% | 374
Melanesia, Micronesia and Polynesia | 82.17% | 11.08% | 75
Average (across all images) | 78.55% | 14.55% | 5650
Average (across regions) | 78.74% | 1.22% | -

Skin tone evaluation results

Average performance across all six skin tones is 78.58% with a standard deviation of 0.24%. All skin tone subgroups yield a performance in the range of [78.23%, 78.97%]. The maximum difference between the mean and the worst performing skin tone subgroup is 0.34%, within our fairness criterion threshold of 3 stdevs (3x0.24% = 0.71%).

Skin tone (Fitzpatrick scale) | mIoU | stdev | Number of images
1 | 78.59% | 12.00% | 247
2 | 78.49% | 14.59% | 1919
3 | 78.61% | 14.39% | 1463
4 | 78.23% | 16.52% | 457
5 | 78.97% | 13.60% | 706
6 | 78.56% | 14.67% | 858
Average (across all images) | 78.55% | 14.55% | 5650
Average (across skin tones) | 78.58% | 0.24% | -

Gender evaluation results

Average performance of all evaluated gender subgroups is 78.53% with a range [78.01%, 79.05%]. The difference between the average and the worst performing gender is 0.52%, within our fairness criterion threshold of 3 stdevs (3x0.74% = 2.22%).

Perceived gender | mIoU | stdev | Number of images
Female | 78.01% | 15.08% | 2585
Male | 79.05% | 13.96% | 3065
Average (across all images) | 78.55% | 14.55% | 5650
Average (across genders) | 78.53% | 0.74% | -

Ethical Considerations

  • Privacy: When the model is used in ARDK, inference is only applied on-device and the image is not transferred off the user device.
  • Human Life: This model is designed for entertainment purposes within an augmented reality application. It is not intended to be used for making human life-critical decisions.
  • Bias: Training datasets have not been audited for diversity and may present biases not surfaced by our benchmarks.

Caveats and Recommendations

  • Our annotated dataset only contains binary genders, which we include as male/female. Further data would be needed to evaluate across a spectrum of genders.
  • An ideal skin tone evaluation dataset would additionally include camera details and more environment details, such as lighting and humidity. Furthermore, the Fitzpatrick scale has limitations, as it doesn't represent the full spectrum of human skin tones.
  • This model card is based on the work of Mitchell, Margaret, et al. "Model cards for model reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency. 2019.