Giving AI a Sense of Place: How Mapillary Imagery Powers Spatial Grounding

AI models are getting smarter, but most still have no real sense of place. They know coordinates and street names, not what a corner actually looks like, where a shop's front door is, or how to describe the building across the street. Zephr, a startup building spatial AI tools, has spent the past year working on that problem, using Mapillary street-level imagery as the foundation. This post looks at how Mapillary helps Zephr give AI a real understanding of the places people live, work, and walk through every day.
Sean Gorman
20 May 2026

Figure 1: Using Mapillary imagery to predict place entrances.

Hand an LLM a bounding box of POI records and it will struggle to tell you which coffee shop is across the street. Coordinates and category labels are a start, but they say nothing about what a place looks like, where its entrance is, or what it feels like to stand in front of it. That's the gap Zephr exists to close. Over the past year, the team has been building data structures that give AI models the kind of spatial intuition you only get from actually walking down the block, and Mapillary has been the foundation under all of it.

The gap between coordinates and conversation

Traditional geospatial data was built for human eyes. Tiles render beautifully at zoom 14, but the same data hands an LLM very little to reason with. It's structured for display, not conversation. Names and categories tell you a shop exists. They don't tell you it has a green awning, a hand-painted sign, and an entrance tucked behind a planter.

Zephr set out to build a new kind of spatial data structure from the ground up, one designed for AI consumption. They call them embedding tiles: z14 web mercator tiles that carry not just coordinates and names, but 768-dimensional semantic embeddings, LLM-generated descriptions, visual descriptions derived from Mapillary street-level imagery, predicted entrance coordinates, navigation context like facade bearing and co-tenants, and links back to the source Mapillary images that made it all possible.
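
To make the tiling concrete, here is a minimal sketch of how a latitude/longitude pair maps to a z14 tile index under the standard Web Mercator (slippy map) scheme. The function name is illustrative, and the post does not say how Zephr addresses its tiles internally, so treat this as the generic formula rather than Zephr's implementation.

```python
import math


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int = 14) -> tuple[int, int]:
    """Return the (x, y) index of the Web Mercator tile containing a point.

    Standard slippy-map formula; x grows eastward, y grows southward.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the antimeridian or near the poles stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


# Example: the tile containing a point in downtown Boulder (approximate coordinates).
tile_x, tile_y = latlon_to_tile(40.0176, -105.2797)
```

At zoom 14 the grid is 16,384 tiles on a side, so every POI falls into exactly one tile and a client only needs to fetch the few tiles around the user.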

The key insight is simple: Mapillary imagery turns a database record into something an AI model can actually describe in conversation.

From imagery to visual descriptions

Take a POI record for a coffee shop: name, address, coordinates, category. That's enough to list it in search results. Now add a visual description generated from Mapillary imagery:

"A narrow storefront with a hand-painted sign reading 'Boxcar Coffee' above a single glass door. Brick facade with a green awning. Outdoor seating with two small tables on the sidewalk."

Suddenly the model can tell you what to look for as you walk down the street. The video below shows the same idea at scale: visual search over 15 million Mapillary images.

Figure 2: Visual search of 15 million Mapillary images for abandoned buildings.

The Zephr pipeline runs in a few stages. First, images are associated with POIs using the projected camera view to check whether the storefront was actually captured. Then vision-language models read the facade and generate a visual description. Finally, semantic embeddings are computed over both the textual POI description and the visual one. The result: queries like "purple building with green awnings" or "pizza place with bike racks" return relevant results, no keyword matching required.
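
The first stage, checking whether a camera could have captured a storefront, can be sketched roughly as follows. This assumes each image comes with a camera position and compass heading, and it uses a hypothetical 90-degree horizontal field of view. The function names and the threshold are illustrative, not Zephr's actual association logic, which would also need to handle distance and occlusion.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2 (degrees clockwise from north)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def angle_diff_deg(a, b):
    """Smallest absolute difference between two compass angles, in [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def poi_in_view(cam_lat, cam_lon, cam_heading_deg, poi_lat, poi_lon, fov_deg=90.0):
    """True if the POI lies inside the camera's horizontal field of view.

    fov_deg is an assumed value; distance and occlusion are ignored here.
    """
    to_poi = bearing_deg(cam_lat, cam_lon, poi_lat, poi_lon)
    return angle_diff_deg(to_poi, cam_heading_deg) <= fov_deg / 2.0
```

Images that pass a check like this would then go on to the vision-language model; images that fail it are looking the wrong way and would only add noise to the description.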

Figure 3: Searching Mapillary-augmented POIs for "pizza with bike racks".

These visual descriptions and embeddings are baked into the tiles alongside the structured data, so every POI ships with enough context for a model to have a real conversation about it, powered by what Mapillary cameras actually saw at street level.

Relocalization: Mapillary imagery fixes the map

Visual descriptions are only half the story. Mapillary imagery also directly improves the positional accuracy of the underlying map.

Through work with the Overture Maps Place-Imagery Task Force, Zephr built a relocalization pipeline that uses sign detections in Mapillary imagery to correct POI positions. The pipeline detects business signs across multiple Mapillary images, clusters them by text and proximity, matches clusters to Overture POIs, and relocalizes each cluster onto the correct building facade using multi-view ray geometry.
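
The multi-view ray geometry step can be illustrated with a small least-squares intersection in 2D. This is a sketch under stated assumptions: camera positions are already in a local metric frame (x east, y north), each detection is reduced to a single compass bearing toward the sign, and rays are treated as infinite lines. The post does not publish Zephr's solver, so the function below is a generic textbook formulation.

```python
import math


def triangulate_2d(observations):
    """Least-squares intersection of bearing rays.

    observations: list of (x_m, y_m, bearing_deg) with bearings measured
    clockwise from north. Returns the (x, y) point minimizing the sum of
    squared perpendicular distances to all rays (treated as lines).
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y, bearing in observations:
        theta = math.radians(bearing)
        dx, dy = math.sin(theta), math.cos(theta)  # unit direction (east, north)
        # M = I - d d^T projects onto the direction perpendicular to the ray.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * x + m12 * y
        b2 += m12 * x + m22 * y
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-9:
        raise ValueError("rays are nearly parallel; need a wider baseline")
    # Solve the 2x2 normal equations A p = b with Cramer's rule.
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


# Two cameras 20 m apart, both looking at a sign located at (10, 10).
sign_xy = triangulate_2d([(0.0, 0.0, 45.0), (20.0, 0.0, 315.0)])
```

The near-parallel check is why baseline angle matters: when all views look at a sign from almost the same direction, the system is ill-conditioned and the estimated position can slide along the rays.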

The numbers speak for themselves. Measured against RTK GPS ground truth across test areas in Louisville, Boulder, and Denver, relocalized positions hit a median error of 2.16 meters, compared to 6.56 meters for baseline Overture POIs. That's a 67% reduction.

Figure 4: POI localization accuracy benchmarks using Mapillary imagery.

The pipeline uses a minimum of two unique Mapillary images per cluster, with a median of five contributing views and a median baseline angle of 27.5 degrees, which is plenty of geometry for robust triangulation. At scale, the localizer can geometrically process 15 million Mapillary images in under a day for about $100 of compute. That makes imagery-derived relocalization a practical path to better positional quality across Overture globally.
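
The headline figures in this section are easy to sanity-check. The short calculation below uses only the numbers quoted above. Since "under a day" is an upper bound on runtime, the throughput it derives is a lower bound, and the variable names are just for illustration.

```python
# Median positional error in meters, as quoted in the post.
BASELINE_ERROR_M = 6.56
RELOCALIZED_ERROR_M = 2.16

# Relative reduction in median error.
reduction = (BASELINE_ERROR_M - RELOCALIZED_ERROR_M) / BASELINE_ERROR_M

# Processing scale: 15 million images, under one day, about $100.
IMAGES = 15_000_000
BUDGET_USD = 100.0
SECONDS_PER_DAY = 24 * 60 * 60

# Finishing in under a day implies at least this sustained rate.
min_images_per_second = IMAGES / SECONDS_PER_DAY

# Approximate compute cost per million images.
usd_per_million_images = BUDGET_USD / (IMAGES / 1_000_000)

print(f"error reduction: {reduction:.1%}")
print(f"throughput: at least {min_images_per_second:.0f} images/s")
print(f"cost: about ${usd_per_million_images:.2f} per million images")
```

That works out to roughly 174 images per second and under seven dollars per million images, which is what makes a global run plausible.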

The tile pipeline: canonical data creation

All of this comes together in a canonical tile creation pipeline. Starting from raw Overture data and Mapillary imagery, the pipeline seeds a geographic area by querying Overture for places, buildings, addresses, and roads. Then it enriches each POI through a chain of steps: building assignment via address matching, facade-snap entrance prediction, OSM cross-referencing for independent confirmation, visual description generation from associated Mapillary images, and semantic embedding computation.
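
The post doesn't spell out how facade-snap entrance prediction works, so the following is only one plausible reading: take a POI position and snap it to the nearest point on the building footprint's boundary. It assumes coordinates are in a local metric frame and the footprint is a simple polygon given as a list of vertices. The function name is hypothetical.

```python
def snap_to_footprint(point, footprint):
    """Return the point on the polygon boundary closest to `point`.

    point: (x, y) in meters. footprint: list of (x, y) vertices in order;
    the last vertex is implicitly connected back to the first.
    """
    px, py = point
    best = None
    best_d2 = float("inf")
    n = len(footprint)
    for i in range(n):
        (ax, ay), (bx, by) = footprint[i], footprint[(i + 1) % n]
        ex, ey = bx - ax, by - ay
        seg_len2 = ex * ex + ey * ey
        if seg_len2 == 0.0:
            t = 0.0  # degenerate edge: both vertices coincide
        else:
            # Parameter of the orthogonal projection, clamped to the segment.
            t = max(0.0, min(1.0, ((px - ax) * ex + (py - ay) * ey) / seg_len2))
        qx, qy = ax + t * ex, ay + t * ey
        d2 = (px - qx) ** 2 + (py - qy) ** 2
        if d2 < best_d2:
            best_d2, best = d2, (qx, qy)
    return best


# A 10 m square building and a POI point sitting 3 m off its south wall.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
entrance_guess = snap_to_footprint((5.0, -3.0), square)
```

A real pipeline would presumably prefer the street-facing edge and use imagery to pick the door along it. The snap alone only guarantees the point lands on a wall rather than in the middle of the roof.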

Figure 5: The Zephr embedding tile pipeline.

The output is a set of z14 protobuf tiles, each holding hundreds of POIs with full navigation context. A single tile for downtown Boulder, for example, contains 993 POIs with:

  • Document embeddings for every POI
  • Visual embeddings wherever Mapillary coverage exists
  • Predicted entrance coordinates
  • Facade bearing and cardinal direction
  • Co-tenant information
  • Enclosing road segments

The tile format extends the familiar web mercator tiling scheme, but carries data structures optimized for AI consumption rather than visual rendering.
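
As a way to picture what one POI entry carries, here is a hypothetical in-memory record mirroring the fields listed above. The class and field names are invented for illustration and are not the actual protobuf schema, which lives in Zephr's embedding-tiles repository. The helper shows one common way to turn a facade bearing into a cardinal direction.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PoiRecord:
    """Illustrative stand-in for one POI in a tile (not the real schema)."""
    name: str
    lat: float
    lon: float
    document_embedding: List[float]
    # Present only where Mapillary coverage exists.
    visual_embedding: Optional[List[float]] = None
    entrance: Optional[Tuple[float, float]] = None  # predicted (lat, lon)
    facade_bearing_deg: Optional[float] = None
    co_tenants: List[str] = field(default_factory=list)
    road_segment_ids: List[str] = field(default_factory=list)
    source_image_ids: List[str] = field(default_factory=list)


def cardinal_direction(bearing_deg: float) -> str:
    """Map a compass bearing to one of eight cardinal/intercardinal labels."""
    names = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    # Each label covers a 45-degree sector centered on its nominal bearing.
    return names[int((bearing_deg % 360.0 + 22.5) // 45.0) % 8]


poi = PoiRecord(
    name="Example Cafe",  # made-up POI for illustration
    lat=40.0,
    lon=-105.0,
    document_embedding=[0.0] * 768,
    facade_bearing_deg=180.0,
)
```

Keeping the visual fields optional reflects the list above: every POI gets a document embedding, while the imagery-derived fields appear only where street-level coverage exists.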

Grounding AI in real time

Embedding tiles are the foundation for a real-time grounding service. The Zephr MCP (Model Context Protocol) server uses semantic embedding search over the pre-computed tiles to answer place queries. When a user asks "where is the nearest coffee shop?", the service runs a vector similarity search over the tile's embeddings, computes distances and directions relative to the user's current position and heading, and returns a concise natural-language response.
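
A toy version of that query path might look like the sketch below: rank POIs by cosine similarity to a query embedding, then phrase the top hit relative to the user's position and heading. Everything here is an assumption for illustration, including the dictionary layout, the 45-degree and 135-degree thresholds for "in front of you" and "behind you", and the three-dimensional toy vectors standing in for real embeddings. It is not the Zephr MCP server's code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0
CARDINALS = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = phi2 - phi1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def relative_phrase(rel_deg):
    """Describe a bearing relative to the user's heading (rel_deg in [-180, 180))."""
    if abs(rel_deg) <= 45.0:
        return "in front of you"
    if abs(rel_deg) >= 135.0:
        return "behind you"
    return "to your right" if rel_deg > 0 else "to your left"


def describe(poi, user_lat, user_lon, user_heading_deg):
    """Render one POI as a short plain-text line."""
    dist = haversine_m(user_lat, user_lon, poi["lat"], poi["lon"])
    brg = bearing_deg(user_lat, user_lon, poi["lat"], poi["lon"])
    rel = (brg - user_heading_deg + 180.0) % 360.0 - 180.0
    cardinal = CARDINALS[int((brg + 22.5) // 45.0) % 8]
    return (f"{poi['name']} ({poi['category']}): {dist:.0f}m, "
            f"{relative_phrase(rel)}, towards the {cardinal}.")


def answer(query_vec, pois, user_lat, user_lon, user_heading_deg, k=1):
    """Return text lines for the k POIs most similar to the query embedding."""
    ranked = sorted(pois, key=lambda p: cosine(query_vec, p["embedding"]), reverse=True)
    return [describe(p, user_lat, user_lon, user_heading_deg) for p in ranked[:k]]
```

Because the embeddings are pre-computed in the tile, the only work at query time is embedding the question, a handful of dot products, and some trigonometry, which is part of why the same approach can run on a phone.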

In token-efficiency testing against Google Maps, Mapbox, and other grounding services, the Zephr MCP came in at 38 tokens per place, compared to 118 for Mapbox MCP, 137 for Google Maps Places, and 675 for Google Grounding Lite MCP.
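
To put those per-place figures in terms of context-window budget, the snippet below scales them to a multi-place response and computes each service's ratio to Zephr. The token counts are the ones quoted above. The 20-place response size is an arbitrary example, and the helper name is made up.

```python
# Tokens per place, as reported in the post.
TOKENS_PER_PLACE = {
    "Zephr MCP": 38,
    "Mapbox MCP": 118,
    "Google Maps Places": 137,
    "Google Grounding Lite MCP": 675,
}


def context_cost(service: str, places: int) -> int:
    """Total tokens a response with `places` results would occupy."""
    return TOKENS_PER_PLACE[service] * places


# How many times more tokens each service uses per place than Zephr MCP.
ratios = {
    name: tokens / TOKENS_PER_PLACE["Zephr MCP"]
    for name, tokens in TOKENS_PER_PLACE.items()
}

for name in TOKENS_PER_PLACE:
    print(f"{name}: {context_cost(name, 20)} tokens for 20 places "
          f"({ratios[name]:.1f}x Zephr)")
```

For a 20-place answer, that is 760 tokens at the Zephr rate versus 13,500 at the most verbose one, a difference that matters most for small on-device models with tight context limits.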

Figure 6: Token analysis for place-based grounding services.

That efficiency comes from two deliberate design decisions: returning plain text instead of verbose JSON, and including only information useful for conversation, not opaque identifiers and tracking URLs. The response format is spatially aware in a way other services don't offer:

"The Laughing Goat (Coffee Shop): 23m, in front of you, towards the north. 1709 Pearl St, Boulder, CO 80302-5516 (Directly across the street from you. Across Pearl Street)."

On-device models: bringing grounding to the edge

The same Mapillary-powered grounding that runs in the cloud can also run entirely on a phone. That means a chatbot can answer place questions like "where's the nearest coffee shop?", "navigate me to the library", or "what's that building across the street?" without ever calling out to a server, with the imagery-derived context baked right into the local tiles.

Figure 7: Zephr's AI grounding service providing place reasoning for a chatbot.

Mapillary imagery isn't just feeding a backend somewhere. It's powering the spatial intuition behind real, conversational interactions, whether the model lives in the cloud or in a user's pocket.

Extending to world models: cross-view embeddings

Looking further out, Mapillary imagery unlocks something more ambitious than describing what's visible from the sidewalk. Zephr's cross-view embedding work (XVEE) learns to align ground-level Mapillary images with aerial and satellite views, producing a shared embedding space where street-level observations and overhead perspectives of the same location end up as neighbors in latent space.

The architecture uses a DINOv2 ViT-L/14 backbone to encode both street-level and aerial views, giving a geometrically co-registered understanding of a place. Trained with contrastive loss plus a geometric consistency loss that preserves facade bearing relationships, the model achieves 96.7% top-1 retrieval accuracy matching aerial chips to their corresponding street-level images. Matryoshka representation learning allows truncation from 512 dimensions down to 64 with minimal accuracy loss.
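
Two ideas in that paragraph are simple to show in miniature: Matryoshka-style truncation (keep the leading dimensions and re-normalize) and top-1 retrieval accuracy over paired aerial and street-level embeddings. The sketch assumes embeddings are plain lists of floats and that row i of one set matches row i of the other. It uses four-dimensional toy vectors and is not the XVEE evaluation code.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and L2-normalize the result."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)


def top1_accuracy(aerial, street, dims):
    """Fraction of aerial embeddings whose nearest street embedding is their pair.

    aerial[i] and street[i] are assumed to depict the same location.
    Similarity is the dot product of truncated, normalized vectors (cosine).
    """
    a = [truncate_embedding(v, dims) for v in aerial]
    s = [truncate_embedding(v, dims) for v in street]
    hits = 0
    for i, query in enumerate(a):
        scores = [sum(x * y for x, y in zip(query, cand)) for cand in s]
        best = max(range(len(s)), key=scores.__getitem__)
        hits += best == i
    return hits / len(a)


# Toy data: two locations, four dimensions, truncated to two.
aerial = [[1.0, 0.0, 0.1, 0.0], [0.0, 1.0, 0.0, 0.1]]
street = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
accuracy_at_2 = top1_accuracy(aerial, street, dims=2)
```

Truncation only works well when the model was trained for it, which is the point of Matryoshka representation learning: the leading dimensions are trained to carry most of the signal, so a 64-dimensional prefix remains a usable embedding on its own.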

Figure 8: Predicting Mapillary images of a POI from a 30cm aerial chip using cross-view embeddings.

The practical implication: an AI agent can reason about a building's appearance from any direction, even without Mapillary coverage from that specific viewpoint. Given a cross-view embedding and a target approach bearing, the system can predict what a facade looks like: "Two-story brick building with glass storefront and green awning, facing south onto Pearl Street." It's a first step toward grounding world models not through physics simulation, but through learned visual-geometric correspondences anchored in real Mapillary imagery.

The bigger picture

Every piece of this is built on the same foundation: Mapillary street-level imagery, linked to open geospatial data through Overture identifiers, processed through canonical pipelines into AI-native data structures. The progression looks like this:

  1. Mapillary imagery provides visual ground truth of the physical world.
  2. Vision-language models extract descriptions and detect signs, entrances, and amenities.
  3. Multi-view geometry relocalizes POIs to facade-accurate positions.
  4. Embedding models encode visual and semantic information into compact vectors.
  5. Tiles package everything into a format optimized for AI consumption.
  6. On-device models use those tiles for real-time spatial grounding.
  7. Cross-view and predictive architectures extend street-level observations into full 3D spatial understanding.

That's what it means to give AI a sense of place: not a single model or a single dataset, but a pipeline that turns street-level imagery into the spatial intelligence AI models need to have real conversations about the real world.

Zephr's open data and code are up on GitHub, including embedding-tiles, jepa-entrance, and cross-view-embeddings. The team is building this in the open because the infrastructure for spatial AI should be as accessible as the imagery it's built on. Access to the hosted services will be shared later this year. A lot of this is still R&D and not everything will make it to launch, but Mapillary has been a critical resource for exploring what's possible, and Zephr is just getting started.

Happy mapping!

Sean Gorman