Best Practices

This guide provides recommendations for storing, formatting, and sharing geo-embeddings with the community using cloud-native geospatial standards.

Storage

We recommend using blob storage for sharing geo-embeddings. The key requirement is HTTP access with range-read support. Options include:

  • Public buckets — Great if you can pay for egress costs
  • Requester-pays buckets — Enables public access with no egress cost to you

Compatible storage providers include Source Coop, AWS S3, Google Cloud Storage, Azure Blob Storage, and Hugging Face.

Hosting with Source Cooperative

To host data via Source Cooperative, fill out this intake form.

Cost Considerations

If blob storage from large providers like AWS or GCS is cost-prohibitive, check out the CNG Storage Guide for alternative options and cost comparisons.

Cloud Native Geospatial Formats

Choose your format based on how your geo-embedding outputs are gridded.

New to Cloud Native Formats?

Check out the Cloud Native Geospatial Formats Guide for an introduction to these formats.

Zarr

Recommended for regularly gridded data.

| Aspect | Specification |
| --- | --- |
| Coordinates | Time, Embedding, Y, X |
| Dimensions | Time, Embedding, Y, X |
| Time format | Integer (year) or datetime (timestamp, timedelta) |
| Compression | BLOSC with ZSTD |
| Sharding | Use Zarr's sharding codec |
| Conventions | geo-proj, spatial, embeddings-stac |

For multi-temporal embeddings:

| Aspect | Specification |
| --- | --- |
| Coordinates | timedelta, Embedding, Y, X |
| Dimensions | timedelta, Embedding, Y, X |

Zarr Recommendations

  • One store per CRS — e.g., one store for global datasets in EPSG:4326, or one store per UTM zone for regional datasets.
  • Chunk sizes — Aim for chunks < 1GB. Optimal chunking depends on file size and available compute resources.
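A quick way to sanity-check the chunk-size guidance is to compute the uncompressed size of a candidate chunk before writing the store. The layout and numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of a single chunk."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

# Hypothetical (Time, Embedding, Y, X) chunk of float32 embeddings
chunk = (1, 64, 512, 512)
size = chunk_nbytes(chunk, np.float32)

print(f"{size / 2**20:.0f} MiB per chunk")  # comfortably under the 1 GB target
```

Compression typically shrinks chunks further on disk, so the uncompressed figure is a conservative upper bound.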

Cloud Optimized GeoTIFFs (COGs)

An alternative for regularly gridded data. Zarr is preferred, but COGs are a viable option.

| Setting | Recommendation |
| --- | --- |
| Interleave | TILE (requires GDAL ≥ 3.11) |
| Predictor | Horizontal differencing (predictor=2) |
| Compression | ZSTD |
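Put together, the settings above map to COG creation options. The sketch below assembles a gdal_translate invocation without requiring GDAL to be installed; the filenames are placeholders, and INTERLEAVE=TILE only works with GDAL ≥ 3.11:

```python
# Creation options matching the table above
creation_options = {
    "COMPRESS": "ZSTD",
    "PREDICTOR": "2",        # horizontal differencing
    "INTERLEAVE": "TILE",    # requires GDAL >= 3.11
}

cmd = ["gdal_translate", "-of", "COG"]
for key, value in creation_options.items():
    cmd += ["-co", f"{key}={value}"]
cmd += ["embeddings.tif", "embeddings_cog.tif"]  # placeholder filenames

print(" ".join(cmd))
```

The same options can be passed through Rasterio or rio-cogeo if you prefer a Python API over the CLI.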

GeoParquet

Recommended for sparse or irregularly gridded data. These recommendations are based on the GeoParquet best practices guide.

| Setting | Recommendation |
| --- | --- |
| Spatial ordering | Hilbert curve or similar |
| Bbox column | Include with covering metadata |
| Compression | ZSTD |
| Row group size | ~128MB |
| Page size | Use case dependent; embedding size recommended for vector search |
| Metadata | Embed STAC asset metadata in file header (see below) |

Embedding STAC Metadata in GeoParquet

Embed STAC asset metadata directly in the GeoParquet file header using an emb key, similar to how GeoParquet uses the geo key. This approach only works when each file contains embeddings from a single STAC item. Alternatively, include a link to the STAC item in each row.

Example file with embedded STAC metadata

Reader Support

Some readers like GeoPandas don't expose custom metadata tags. Use PyArrow directly to read the emb metadata from file headers:

```python
import json
import pyarrow.parquet as pq

# Read the custom "emb" key from the Parquet schema metadata
parquet_file = pq.ParquetFile("embeddings.parquet")
emb_metadata = json.loads(parquet_file.schema_arrow.metadata[b"emb"])
```

Tooling

Zarr

COGs

  • GDAL — Geospatial Data Abstraction Library
  • Rasterio — Python interface to GDAL
  • rio-cogeo — COG creation and validation

GeoParquet

Data Provenance

Providing comprehensive metadata is highly encouraged. Include:

  • Data products used
  • Exact input imagery (preferably a STAC Item ID)
  • Processing pipeline details

Stay tuned for more examples. If you have thoughts, please reach out to contribute.