Accessing Alpha Earth Foundations (AEF) embeddings with xarray and zarr-python

This tutorial shows how to access, inspect, and fetch geospatial embedding data from a cloud-hosted GeoZarr dataset. We'll use Alpha Earth Foundations (AEF), a Google DeepMind project that produced planetary-scale satellite embeddings, but the same access patterns are broadly useful for other cloud-hosted GeoZarr datasets too.

What are geospatial embeddings?

Geospatial embeddings are compact numerical vectors that represent satellite imagery. A pretrained model encodes each image, patch, or pixel into a fixed-length vector that captures its visual and spectral content. These embeddings can be used for downstream tasks like land-cover classification, change detection, or similarity search without reprocessing the raw imagery.
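For intuition about the similarity-search use case: embedding vectors are commonly compared with cosine similarity. Here is a minimal stdlib-only sketch with made-up 4-dimensional vectors (the real AEF vectors are 64-dimensional); the vector values are invented for illustration, not AEF data:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for two similar pixels and one dissimilar pixel.
forest_a = [0.9, 0.1, 0.3, 0.2]
forest_b = [0.8, 0.2, 0.4, 0.1]
water = [-0.7, 0.6, -0.2, 0.5]

print(cosine_similarity(forest_a, forest_b))  # close to 1: similar pixels
print(cosine_similarity(forest_a, water))     # much lower: dissimilar pixels
```

Ranking all pixels by this score against a query vector is the core of embedding-based similarity search.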

The dataset: Alpha Earth Foundations

Alpha Earth Foundations (AEF) provides annual embeddings derived from multi-modal satellite time series from 2017-2025. The dataset is available in its original COG format in the Earth Engine Data Catalog, and LGND.ai has also reprocessed it into a single global mosaic. This tutorial uses that GeoZarr copy, which stores the data in a cloud-native, chunked array format designed for multi-dimensional geospatial data. The full collection is over 3 petabytes.

Where is it hosted?

The data used here lives on Source Cooperative (Source Coop), an open data hosting platform backed by S3-compatible cloud storage. Source Coop provides free, anonymous access to many geospatial datasets, so no account or API key is required to read the data.

Libraries

This tutorial uses:

  • xarray to open the Zarr store as a labeled Dataset
  • zarr-python to fetch small subregions directly from the underlying store
  • obstore to access the S3-hosted Zarr store efficiently
python
import xarray as xr

STORE = 's3://us-west-2.opendata.source.coop/tge-labs/aef-mosaic/'
python
import warnings
from zarr.errors import ZarrUserWarning

warnings.simplefilter('ignore', ZarrUserWarning)
ds = xr.open_zarr(STORE, storage_options={'anon': True}, consolidated=False)

Explore the xarray Dataset and DataArray

The full xarray Dataset is the top-level container. It shows the available variables, dimensions, and coordinates.

python
ds
<xarray.Dataset> Size: 3PB
Dimensions:     (time: 7, embedding: 64, y: 1859584, x: 4009984)
Coordinates:
  * time        (time) int32 28B 2018 2019 2020 2021 2022 2023 2024
  * y           (y) float64 15MB 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
  * x           (x) float64 32MB -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
Dimensions without coordinates: embedding
Data variables:
    embeddings  (time, embedding, y, x) int8 3PB ...

Variables in an xarray Dataset are DataArray objects: labeled n-dimensional arrays. This store has one main variable named embeddings.

python
ds.data_vars
Data variables:
    embeddings  (time, embedding, y, x) int8 3PB ...
python
ds['embeddings']
<xarray.DataArray 'embeddings' (time: 7, embedding: 64, y: 1859584, x: 4009984)> Size: 3PB
[3340692134821888 values with dtype=int8]
Coordinates:
  * time     (time) int32 28B 2018 2019 2020 2021 2022 2023 2024
  * y        (y) float64 15MB 83.69 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
  * x        (x) float64 32MB -180.0 -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
Dimensions without coordinates: embedding
Attributes:
    proj:code:               EPSG:4326
    spatial:dimensions:      ['y', 'x']
    spatial:transform:       [8.983111749910169e-05, 0.0, -180.0, 0.0, -8.983...
    spatial:transform_type:  affine
    spatial:bbox:            [-180.0, -83.36280346631479, 180.2213438735178, ...
    spatial:shape:           [1859584, 4009984]
    spatial:registration:    pixel
    embedding_dimensions:    ['A00', 'A01', 'A02', 'A03', 'A04', 'A05', 'A06'...
    zarr_conventions:        [{'uuid': 'f17cb550-5864-4468-aeb7-f3180cfb622f'...
    _zarrs:                  {'description': 'This array was created with zar...
python
{'dims': ds['embeddings'].dims, 'shape': ds['embeddings'].shape, 'dtype': str(ds['embeddings'].dtype)}
{'dims': ('time', 'embedding', 'y', 'x'),
 'shape': (7, 64, 1859584, 4009984),
 'dtype': 'int8'}

If you're coming from multispectral imagery, it helps to think of this as a raster cube whose values are learned features rather than physical measurements. Each (time, y, x) location stores a learned 64-dimensional feature vector, so the embedding axis is not like RGB or multispectral bands with direct semantic meaning per channel.
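To make the raster-cube picture concrete, here is a toy array with the same axis order; the values are random stand-ins (the real array is far too large to load whole), but the indexing pattern for pulling out one pixel's feature vector is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy cube with the same axis order as the AEF variable: (time, embedding, y, x).
cube = rng.integers(-128, 128, size=(7, 64, 32, 32), dtype=np.int8)

# The 64-dimensional learned feature vector for one (time, y, x) location:
vector = cube[0, :, 10, 20]
print(vector.shape)  # (64,)
```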

Dimensions are the axis names and their lengths. Here, time, embedding, y, and x define the positional structure of the array.

python
ds.dims
FrozenMappingWarningOnValuesAccess({'time': 7, 'embedding': 64, 'y': 1859584, 'x': 4009984})

Coordinates give semantic labels for positional axes. In this dataset, time, x, and y are explicit coordinate arrays.

python
ds.coords
Coordinates:
  * time     (time) int32 28B 2018 2019 2020 2021 2022 2023 2024
  * y        (y) float64 15MB 83.69 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
  * x        (x) float64 32MB -180.0 -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
python
ds.coords['time']
<xarray.DataArray 'time' (time: 7)> Size: 28B
array([2018, 2019, 2020, 2021, 2022, 2023, 2024], dtype=int32)
Coordinates:
  * time     (time) int32 28B 2018 2019 2020 2021 2022 2023 2024
Attributes:
    _zarrs:   {'description': 'This array was created with zarrs', 'repositor...
python
ds.coords['x']
<xarray.DataArray 'x' (x: 4009984)> Size: 32MB
array([-179.999955, -179.999865, -179.999775, ...,  180.221119,  180.221209,
        180.221299], shape=(4009984,))
Coordinates:
  * x        (x) float64 32MB -180.0 -180.0 -180.0 -180.0 ... 180.2 180.2 180.2
Attributes:
    _zarrs:   {'description': 'This array was created with zarrs', 'repositor...
python
ds.coords['y']
<xarray.DataArray 'y' (y: 1859584)> Size: 15MB
array([ 83.68566 ,  83.685571,  83.685481, ..., -83.362579, -83.362669,
       -83.362759], shape=(1859584,))
Coordinates:
  * y        (y) float64 15MB 83.69 83.69 83.69 83.69 ... -83.36 -83.36 -83.36
Attributes:
    _zarrs:   {'description': 'This array was created with zarrs', 'repositor...
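Because these are labeled coordinates, you can select by geographic value rather than array index. The following sketch uses a tiny synthetic 2-D dataset so it runs without touching the store; the same `.sel` pattern applies to the real `ds` (with `time` and `embedding` dimensions added), where the result stays lazy until loaded:

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in: descending y (like latitude) and ascending x.
toy = xr.Dataset(
    {"embeddings": (("y", "x"), np.arange(16).reshape(4, 4))},
    coords={"y": [40.0, 39.9, 39.8, 39.7], "x": [-113.2, -113.1, -113.0, -112.9]},
)

# Nearest-neighbour selection by geographic labels instead of array indices.
value = toy["embeddings"].sel(y=39.86, x=-113.04, method="nearest")
print(int(value))
```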

Inspect attributes and geospatial metadata

Dataset attrs are metadata attached to the dataset object itself.

python
ds.attrs
{}
python
{'proj:code': ds['embeddings'].attrs['proj:code'], 'zarr_conventions': ds['embeddings'].attrs['zarr_conventions']}
{'proj:code': 'EPSG:4326',
 'zarr_conventions': [{'uuid': 'f17cb550-5864-4468-aeb7-f3180cfb622f',
   'name': 'proj:',
   'description': 'Coordinate reference system information for geospatial data',
   'spec_url': 'https://github.com/zarr-experimental/geo-proj/blob/v1/README.md',
   'schema_url': 'https://raw.githubusercontent.com/zarr-experimental/geo-proj/refs/tags/v1/schema.json'},
  {'uuid': '689b58e2-cf7b-45e0-9fff-9cfc0883d6b4',
   'name': 'spatial:',
   'description': 'Spatial coordinate information',
   'spec_url': 'https://github.com/zarr-conventions/spatial/blob/v1/README.md',
   'schema_url': 'https://raw.githubusercontent.com/zarr-conventions/spatial/refs/tags/v1/schema.json'}]}

In this store, the dataset-level attrs are empty; the geospatial and Zarr-specific metadata lives on the embeddings DataArray instead. The embedding channels carry generic labels rather than band names, reflecting that each channel is a learned feature:

python
ds['embeddings'].attrs['embedding_dimensions'][:8]
['A00', 'A01', 'A02', 'A03', 'A04', 'A05', 'A06', 'A07']

The primary geospatial metadata is attached to the embeddings DataArray attrs.

python
ds['embeddings'].attrs.keys()
dict_keys(['proj:code', 'spatial:dimensions', 'spatial:transform', 'spatial:transform_type', 'spatial:bbox', 'spatial:shape', 'spatial:registration', 'embedding_dimensions', 'zarr_conventions', '_zarrs'])
python
ds['embeddings'].attrs['spatial:transform']
[8.983111749910169e-05,
 0.0,
 -180.0,
 0.0,
 -8.983111749910169e-05,
 83.68570533713473]
python
ds['embeddings'].attrs['spatial:bbox']
[-180.0, -83.36280346631479, 180.2213438735178, 83.68570533713473]
python
ds['embeddings'].attrs['spatial:shape']
[1859584, 4009984]

spatial:transform is the affine geotransform. It tells you pixel size, rotation terms, and the origin used to map array indices onto geographic space.
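Assuming the element order [dx, 0, x_origin, 0, dy, y_origin] seen in the output above, mapping array indices to geographic coordinates is simple arithmetic. This sketch hard-codes the values printed earlier:

```python
# Affine geotransform from the embeddings attrs (values printed above).
dx, _, x_origin, _, dy, y_origin = [
    8.983111749910169e-05, 0.0, -180.0,
    0.0, -8.983111749910169e-05, 83.68570533713473,
]

def index_to_lonlat(x_index, y_index):
    # Geographic coordinate of the upper-left corner of pixel (y_index, x_index).
    # The rotation terms are zero here, so they drop out of the formula.
    return x_origin + x_index * dx, y_origin + y_index * dy

print(index_to_lonlat(0, 0))  # → (-180.0, 83.68570533713473)
print(index_to_lonlat(4009984, 1859584))  # far corner of the array
```

Evaluating the far corner at the spatial:shape indices reproduces the other two spatial:bbox extremes, which is a quick consistency check on the metadata.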

Inspect data volume and try different access patterns for Zarr chunks and Zarr shards

Xarray exposes this array as 256 x 256 logical chunks, but the store is physically written as 4096 x 4096 shards. In raw int8 terms, one logical chunk is 4,194,304 bytes (~4 MiB), while one physical shard is 1,073,741,824 bytes (~1 GiB). A small bounding-box query can look lightweight at the API level, but if it intersects multiple chunks or shards it can result in much larger IO requests.

With this AEF store, xarray gives you labels and coordinate-aware slicing, while zarr-python gives you more low-level control over IO. On a sharded Zarr store, xarray.open_zarr(... chunks=...) will still end up reading shard-sized units during .compute(), .load(), or plotting, even when your requested window looks much smaller than a full shard.

python
import math

import zarr
from obstore.store import S3Store
from zarr.storage import ObjectStore
python
store = S3Store.from_url(STORE, skip_signature=True, region='us-west-2')
embeddings = zarr.open_group(store=ObjectStore(store, read_only=True), mode='r')['embeddings']

You can inspect chunking and sharding from metadata alone, without fetching array values.

python
{'shape': embeddings.shape, 'chunks': embeddings.chunks, 'shards': embeddings.shards}
{'shape': (7, 64, 1859584, 4009984),
 'chunks': (1, 64, 256, 256),
 'shards': (1, 64, 4096, 4096)}
python
{'chunk_nbytes': math.prod(embeddings.chunks) * embeddings.dtype.itemsize, 'shard_nbytes': math.prod(embeddings.shards) * embeddings.dtype.itemsize}
{'chunk_nbytes': 4194304, 'shard_nbytes': 1073741824}

A physical shard spans (time, embedding, y, x) = (1, 64, 4096, 4096). Because the dtype is int8, that is 1,073,741,824 raw bytes (~1 GiB) per shard before compression. A logical chunk spans (1, 64, 256, 256), which is 4,194,304 raw bytes (~4 MiB). We pick one representative longitude/latitude point and use the geotransform to identify the chunk and shard that contain it.

python
bbox_lon = -113.03
bbox_lat = 37.30
dx, _, x_origin, _, dy, y_origin = embeddings.attrs['spatial:transform']

x_index = math.floor((bbox_lon - x_origin) / dx)
y_index = math.floor((bbox_lat - y_origin) / dy)

x_chunk = x_index // embeddings.chunks[3]
y_chunk = y_index // embeddings.chunks[2]
x_shard = x_index // embeddings.shards[3]
y_shard = y_index // embeddings.shards[2]
python
x_chunk_start = x_chunk * embeddings.chunks[3]
y_chunk_start = y_chunk * embeddings.chunks[2]
x_shard_start = x_shard * embeddings.shards[3]
y_shard_start = y_shard * embeddings.shards[2]

This zarr-python example reads the single logical chunk that contains the example point.

python
bbox_chunk = embeddings[
    0,
    :,
    y_chunk_start:y_chunk_start + embeddings.chunks[2],
    x_chunk_start:x_chunk_start + embeddings.chunks[3],
]
bbox_chunk.shape
python
{'chunk_nbytes_loaded': bbox_chunk.nbytes, 'full_shard_nbytes': math.prod(embeddings.shards) * embeddings.dtype.itemsize}

This makes the size difference concrete after a real read: .nbytes reports the loaded chunk's in-memory size (~4 MiB), while a full shard remains ~1 GiB raw.
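Assuming the chunk and shard shapes reported above, the gap between the two is a fixed factor: each shard packs 16 x 16 chunks in the y/x plane.

```python
# int8 → 1 byte per value, so raw bytes equal the element count.
chunk_bytes = 1 * 64 * 256 * 256
shard_bytes = 1 * 64 * 4096 * 4096

print(chunk_bytes)                 # 4194304 (~4 MiB)
print(shard_bytes)                 # 1073741824 (~1 GiB)
print(shard_bytes // chunk_bytes)  # 256 chunks per shard, i.e. (4096/256)**2
```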

You can plot one embedding channel from that chunk with matplotlib.pyplot.imshow. A single embedding channel is not especially interpretable on its own, so follow-up tutorials can show how to use dimension reduction to make the learned feature space easier to reason about.

python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.imshow(bbox_chunk[0], cmap='viridis')
plt.colorbar()
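As a sketch of what that dimension reduction could look like, here is a PCA-style projection of 64 channels down to three via SVD. It runs on synthetic random data so it is self-contained; substituting bbox_chunk (cast to float) would produce an RGB-like composite of the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for bbox_chunk: (embedding, y, x) as float32.
chunk = rng.normal(size=(64, 256, 256)).astype(np.float32)

# Flatten to (pixels, channels), centre, and project onto the top 3 components.
flat = chunk.reshape(64, -1).T
flat = flat - flat.mean(axis=0)
_, _, vt = np.linalg.svd(flat, full_matrices=False)
rgb = (flat @ vt[:3].T).reshape(256, 256, 3)

# Rescale each component to [0, 1] so it can be displayed with plt.imshow.
rgb = (rgb - rgb.min(axis=(0, 1))) / (rgb.max(axis=(0, 1)) - rgb.min(axis=(0, 1)))
print(rgb.shape)
```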

Here is the equivalent xarray code to force a read of exactly one physical shard. The cell is left unexecuted on purpose: for this AEF array, one full shard is about 1 GiB raw before compression, even though the embedding=0 slice below would only materialize a 4096 x 4096 result in memory.

python
(
    ds['embeddings']
    .isel(
        time=0,
        embedding=0,
        y=slice(y_shard_start, y_shard_start + embeddings.shards[2]),
        x=slice(x_shard_start, x_shard_start + embeddings.shards[3]),
    )
    .load()
)