Best Practices
This guide provides recommendations for storing, formatting, and sharing geo-embeddings with the community using cloud-native geospatial standards.
Storage
We recommend using blob storage for sharing geo-embeddings. The key requirement is HTTP access with range-read support. Options include:
- Public buckets — Great if you can pay for egress costs
- Requester-pays buckets — Enables public access with no egress cost to you
Compatible storage providers include Source Coop, AWS S3, Google Cloud Storage, Azure Blob Storage, and Hugging Face.
Hosting with Source Cooperative
To host data via Source Cooperative, fill out this intake form.
Cost Considerations
If blob storage from large providers like AWS or GCS is cost-prohibitive, check out the CNG Storage Guide for alternative options and cost comparisons.
Cloud Native Geospatial Formats
Choose your format based on how your geo-embedding outputs are gridded.
New to Cloud Native Formats?
Check out the Cloud Native Geospatial Formats Guide for an introduction to these formats.
Zarr
Recommended for regularly gridded data.
| Aspect | Specification |
|---|---|
| Coordinates | Time, Embedding, Y, X |
| Dimensions | Time, Embedding, Y, X |
| Time format | Integer (year) or datetime (timestamp, timedelta) |
| Compression | BLOSC with ZSTD |
| Sharding | Use Zarr's sharding codec |
| Conventions | geo-proj, spatial, embeddings-stac |
For multi-temporal embeddings:
| Aspect | Specification |
|---|---|
| Coordinates | timedelta, Embedding, Y, X |
| Dimensions | timedelta, Embedding, Y, X |
Zarr Recommendations
- One store per CRS — e.g., one store for global datasets in EPSG:4326, or one store per UTM zone for regional datasets.
- Chunk sizes — Aim for chunks < 1GB. Optimal chunking depends on file size and available compute resources.
Cloud Optimized GeoTIFFs (COGs)
Alternative for regularly gridded data. Zarr is preferred, but COGs are a viable option.
| Setting | Recommendation |
|---|---|
| Interleave | TILE (requires GDAL ≥ 3.11) |
| Predictor | Horizontal differencing (predictor=2) |
| Compression | ZSTD |
GeoParquet
Recommended for sparse or irregularly gridded data. These recommendations are based on the GeoParquet best practices guide.
| Setting | Recommendation |
|---|---|
| Spatial ordering | Hilbert curve or similar |
| Bbox column | Include with covering metadata |
| Compression | ZSTD |
| Row group size | ~128MB |
| Page size | Use case dependent; embedding size recommended for vector search |
| Metadata | Embed STAC asset metadata in file header (see below) |
Embedding STAC Metadata in GeoParquet
Embed STAC asset metadata directly in the GeoParquet file header using an emb key, similar to how GeoParquet uses the geo key. This approach only works when each file contains embeddings from a single STAC item. Alternatively, include a link to the STAC item in each row.
Example file with embedded STAC metadata
Reader Support
Some readers like GeoPandas don't expose custom metadata tags. Use PyArrow directly to read the emb metadata from file headers:
```python
import json
import pyarrow.parquet as pq

# Read the custom "emb" metadata back out of the Parquet file header
parquet_file = pq.ParquetFile("embeddings.parquet")
emb_metadata = json.loads(parquet_file.schema_arrow.metadata[b"emb"])
```
Tooling
Zarr
- zarr-python — Python implementation
- zarrs — Rust implementation
COGs
- GDAL — Geospatial Data Abstraction Library
- Rasterio — Python interface to GDAL
- rio-cogeo — COG creation and validation
GeoParquet
- geoparquet-io — GeoParquet utilities
- GeoPandas — Geospatial data in Python
Data Provenance
Providing comprehensive metadata is highly encouraged. Include:
- Data products used
- Exact input imagery (preferably a STAC Item ID)
- Processing pipeline details
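The checklist above could be captured as a simple JSON record alongside your embeddings; the field names and values here are illustrative assumptions, not a standard:

```python
import json

# Hypothetical provenance record covering the three items above
provenance = {
    "data_products": ["sentinel-2-l2a"],
    "source_items": ["example-stac-item-id"],   # prefer exact STAC Item IDs
    "processing": {
        "model": "example-embedding-model",
        "version": "1.0.0",
        "patch_size": 256,
    },
}
print(json.dumps(provenance, indent=2))
```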
Stay tuned for more examples. If you have thoughts, please reach out to contribute.