Best Practices
This guide provides recommendations for storing, formatting, and sharing geo-embeddings with the community using cloud-native geospatial standards.
Storage
We recommend using blob storage for sharing geo-embeddings. The key requirement is HTTP access with range-read support. Options include:
- Public buckets — Great if you can pay for egress costs
- Requester-pays buckets — Enables public access with no egress cost to you
Compatible storage providers include Source Coop, AWS S3, Google Cloud Storage, Azure Blob Storage, and Hugging Face.
Hosting with Source Cooperative
To host data via Source Cooperative, fill out this intake form.
Cost Considerations
If blob storage from large providers like AWS or GCS is cost-prohibitive, check out the CNG Storage Guide for alternative options and cost comparisons.
Cloud Native Geospatial Formats
Choose your format based on how your geo-embedding outputs are gridded.
New to Cloud Native Formats?
Check out the Cloud Native Geospatial Formats Guide for an introduction to these formats.
Zarr
Recommended for regularly gridded data.
| Aspect | Specification |
|---|---|
| Coordinates | Time, Embedding, Y, X |
| Dimensions | Time, Embedding, Y, X |
| Time format | Integer (year) or datetime (timestamp, timedelta) |
| Compression | BLOSC with ZSTD |
| Sharding | Use Zarr's sharding codec |
| Conventions | geo-proj, spatial, embeddings-stac |
For multi-temporal embeddings:
| Aspect | Specification |
|---|---|
| Coordinates | timedelta, Embedding, Y, X |
| Dimensions | timedelta, Embedding, Y, X |
Zarr Recommendations
- One store per CRS — e.g., one store for global datasets in EPSG:4326, or one store per UTM zone for regional datasets.
- Chunk sizes — Aim for chunks < 1GB. Optimal chunking depends on file size and available compute resources.
Cloud Optimized GeoTIFFs (COGs)
Alternative for regularly gridded data. Zarr is preferred, but COGs are a viable option.
| Setting | Recommendation |
|---|---|
| Interleave | TILE (requires GDAL ≥ 3.11) |
| Predictor | Horizontal differencing (predictor=2) |
| Compression | ZSTD |
GeoParquet
Recommended for sparse or irregularly gridded data. These recommendations are based on the GeoParquet best practices guide.
| Setting | Recommendation |
|---|---|
| Spatial ordering | Hilbert curve or similar |
| Bbox column | Include with covering metadata |
| Compression | ZSTD |
| Row group size | ~128MB |
| Page size | Use case dependent; embedding size recommended for vector search |
| Metadata | Embed STAC asset metadata in file header (see below) |
Embedding STAC Metadata in GeoParquet
Embed STAC asset metadata directly in the GeoParquet file header using an emb key, similar to how GeoParquet uses the geo key. This approach only works when each file contains embeddings from a single STAC item. Alternatively, include a link to the STAC item in each row.
Example file with embedded STAC metadata
Reader Support
Some readers like GeoPandas don't expose custom metadata tags. Use PyArrow directly to read the emb metadata from file headers:
```python
import json
import pyarrow.parquet as pq

# Read the custom "emb" metadata back out of the Parquet file header
parquet_file = pq.ParquetFile("embeddings.parquet")
emb_metadata = json.loads(parquet_file.schema_arrow.metadata[b"emb"])
```
Tooling
Zarr
- zarr-python — Python implementation
- zarrs — Rust implementation
COGs
- GDAL — Geospatial Data Abstraction Library
- Rasterio — Python interface to GDAL
- rio-cogeo — COG creation and validation
GeoParquet
- geoparquet-io — GeoParquet utilities
- GeoPandas — Geospatial data in Python
Data Provenance
Providing comprehensive metadata is highly encouraged. Include:
- Data products used
- Exact input imagery (preferably a STAC Item ID)
- Processing pipeline details
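The checklist above could be captured as a simple JSON record alongside your embeddings; the field names and values here are illustrative assumptions, not a standard:

```python
import json

# Hypothetical provenance record covering the three items above
provenance = {
    "data_products": ["sentinel-2-l2a"],
    "source_items": ["example-stac-item-id"],   # prefer exact STAC Item IDs
    "processing": {
        "model": "example-embedding-model",
        "version": "1.0.0",
        "patch_size": 256,
    },
}
print(json.dumps(provenance, indent=2))
```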
Stay tuned for more examples. If you have thoughts, please reach out to contribute.