Optimizing Overture Maps Data Using Apache Sedona and GeoParquet

Nov 7, 2025
Managing very large volumes of map data efficiently has become essential for businesses, startups, and academic institutions working in geospatial analytics. Overture Maps, an extensive open map data collection, provides rich geospatial information, but processing it at scale can be challenging. This is where GeoParquet and Apache Sedona help. In this post, we'll explore how to use these tools to optimize Overture Maps data for high-performance spatial analytics.

Overture Maps Data Using Apache Sedona and GeoParquet (Created by Google Gemini)

Why Apache Sedona?

Built on top of Apache Spark, Apache Sedona (previously GeoSpark) is a distributed computing platform created especially for processing spatial data. Among its main benefits are:

Scalability: Scales across large clusters to process terabytes of geographic data.
Spatial Indexing: Supports QuadTree and R-Tree indexes for efficient query execution.
Spatial Queries: Offers high-level APIs for range queries, k-nearest neighbor (kNN) queries, and spatial joins.
Compatibility: Easily integrates with current big data pipelines by working with Spark DataFrames.

With Sedona, Overture Maps datasets can be queried and processed in parallel across a cluster, which can significantly reduce processing times compared to single-node alternatives.
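To make the query types listed above concrete, here is a minimal single-machine sketch of what a range query and a kNN query compute. It is plain Python over hypothetical sample points, not Sedona's API; Sedona distributes the same kind of logic across a cluster and uses indexes to avoid the full scan shown here.

```python
import heapq
import math

# Hypothetical sample data: (name, longitude, latitude)
PLACES = [
    ("cafe", -73.99, 40.73),
    ("museum", -73.96, 40.78),
    ("pier", -74.01, 40.70),
    ("park", -73.97, 40.77),
]


def range_query(points, min_x, min_y, max_x, max_y):
    """Return the points that fall inside the bounding box (edges included)."""
    return [p for p in points if min_x <= p[1] <= max_x and min_y <= p[2] <= max_y]


def knn_query(points, x, y, k):
    """Return the k points closest to (x, y) by planar Euclidean distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.hypot(p[1] - x, p[2] - y))


if __name__ == "__main__":
    print(range_query(PLACES, -74.0, 40.7, -73.9, 40.8))
    print(knn_query(PLACES, -73.97, 40.775, 2))
```

Both functions scan every point, which is exactly the cost that distributed execution and spatial indexing are meant to reduce.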

Introducing GeoParquet

GeoParquet is an open specification that extends the Apache Parquet columnar file format for geospatial data. It standardizes how geometries (such as POINT, LINESTRING, and POLYGON) are encoded in Parquet columns and described in file metadata, which enables high-performance analytics without compromising geospatial accuracy.

Advantages of GeoParquet

Columnar Storage: Speeds up analytical queries and lowers disk I/O.
Interoperability: Works with contemporary data processing frameworks such as DuckDB, Dask, and Apache Spark.
Spatial Metadata: Stores geometry type and CRS (Coordinate Reference System) information for reliable geospatial operations.

Organizations can significantly increase query speed and data storage efficiency by transforming raw Overture Maps data into GeoParquet.
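As a rough illustration of the encoding and metadata described in this section, the sketch below encodes a point as WKB (the binary geometry encoding GeoParquet uses) and builds a file-level "geo" metadata document. The helper names and the sample values are hypothetical, and the metadata is trimmed to a few fields; the CRS is left out on the assumption that readers then fall back to the specification's default.

```python
import json
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode the little-endian 2D point produced by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


def build_geo_metadata(column, geometry_types, bbox):
    """Build an illustrative 'geo' metadata document for one geometry column."""
    return {
        "version": "1.0.0",
        "primary_column": column,
        "columns": {
            column: {
                "encoding": "WKB",
                "geometry_types": geometry_types,
                "bbox": bbox,
            }
        },
    }


if __name__ == "__main__":
    wkb = point_to_wkb(-73.97, 40.77)
    print(len(wkb), wkb_to_point(wkb))
    meta = build_geo_metadata("geometry", ["Point"], [-74.0, 40.7, -73.9, 40.8])
    print(json.dumps(meta, indent=2))
```

In practice a GeoParquet writer produces both the WKB column and the metadata for you; the sketch only shows what ends up in the file.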

Step-by-Step: Optimizing Overture Maps with Sedona and GeoParquet

Setting Up Your Environment

To begin, make sure you have a Spark cluster with the Sedona JARs available, then install the Python packages:

pip install apache-sedona
pip install pyarrow pandas geopandas
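Before starting a Spark session, it can help to confirm which of these packages are actually installed in the active Python environment. The following is a small sketch using only the standard library; the package list is an assumption (it adds pyspark on the basis that Spark is driven from Python) and can be adjusted.

```python
from importlib import metadata

# Assumed list of distributions this walkthrough relies on.
REQUIRED = ["apache-sedona", "pyspark", "pyarrow", "pandas", "geopandas"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```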

Loading Overture Maps Data

Overture Maps publishes its official releases as GeoParquet, but extracts of the data often circulate in GeoJSON or Shapefile formats. Using Sedona, such files can be loaded into Spark DataFrames:

from sedona.spark import SedonaContext

# SedonaContext.builder() preconfigures Kryo serialization for Sedona types.
# Assumes the Sedona JARs are already on the Spark classpath.
config = (
    SedonaContext.builder()
    .appName("OvertureMapsOptimization")
    .getOrCreate()
)

# Registers Sedona's spatial types and ST_ functions on the session.
spark = SedonaContext.create(config)

# The "geojson" reader requires a recent Sedona release.
overture_df = spark.read.format("geojson").load("/data/overture_maps/roads.geojson")
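A single large FeatureCollection is one JSON document, which a distributed reader generally cannot split across workers. One common workaround is to rewrite the file as newline-delimited features before loading it. The sketch below does that with the standard library only; the function names are hypothetical, and whether your Sedona version needs this step (or a reader option instead) is an assumption to check against its documentation.

```python
import json


def feature_collection_to_lines(collection):
    """Yield one compact JSON string per feature in a GeoJSON FeatureCollection."""
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    for feature in collection.get("features", []):
        yield json.dumps(feature, separators=(",", ":"))


def convert_file(src_path, dst_path):
    """Rewrite a FeatureCollection file as newline-delimited GeoJSON features."""
    with open(src_path, encoding="utf-8") as src:
        collection = json.load(src)  # loads the whole file into memory
    count = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in feature_collection_to_lines(collection):
            dst.write(line + "\n")
            count += 1
    return count
```

Because `json.load` reads the entire file, this approach suits modest extracts rather than full planet-scale dumps.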

Spatial Indexing with Sedona

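Sedona SQL has no CREATE INDEX statement. Instead, it builds a spatial index on the fly when it executes a spatial join, and the index type is chosen through session configuration:

overture_df.createOrReplaceTempView("roads")

# Request a quadtree for the indexes Sedona builds during spatial joins.
spark.conf.set("sedona.global.index", "true")
spark.conf.set("sedona.global.indextype", "quadtree")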
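To show why a quadtree speeds up spatial lookups, here is a conceptual, single-machine sketch of a point quadtree. The `QuadTree` class is hypothetical and is not how Sedona implements its index; it only illustrates the pruning idea, where whole quadrants that do not overlap the query box are skipped.

```python
class QuadTree:
    """Minimal point quadtree over a fixed bounding box (illustrative only)."""

    def __init__(self, min_x, min_y, max_x, max_y, capacity=4, depth=0, max_depth=8):
        self.bounds = (min_x, min_y, max_x, max_y)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.points = []
        self.children = None  # four sub-quadrants once this node splits

    def _contains(self, x, y):
        min_x, min_y, max_x, max_y = self.bounds
        return min_x <= x <= max_x and min_y <= y <= max_y

    def _overlaps(self, box):
        min_x, min_y, max_x, max_y = self.bounds
        return not (box[2] < min_x or box[0] > max_x or box[3] < min_y or box[1] > max_y)

    def _split(self):
        min_x, min_y, max_x, max_y = self.bounds
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(min_x, min_y, mid_x, mid_y, *args),
            QuadTree(mid_x, min_y, max_x, mid_y, *args),
            QuadTree(min_x, mid_y, mid_x, max_y, *args),
            QuadTree(mid_x, mid_y, max_x, max_y, *args),
        ]
        # Push the points held by this node down into the new quadrants.
        old_points, self.points = self.points, []
        for px, py in old_points:
            any(child.insert(px, py) for child in self.children)

    def insert(self, x, y):
        """Insert a point; return False if it lies outside this node's bounds."""
        if not self._contains(x, y):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y))
                return True
            self._split()
        # any() stops at the first quadrant that accepts the point.
        return any(child.insert(x, y) for child in self.children)

    def query(self, box):
        """Return all points inside box = (min_x, min_y, max_x, max_y)."""
        if not self._overlaps(box):
            return []  # prune: nothing in this quadrant can match
        found = [
            (x, y) for x, y in self.points
            if box[0] <= x <= box[2] and box[1] <= y <= box[3]
        ]
        for child in self.children or []:
            found.extend(child.query(box))
        return found
```

A query for a small box touches only the quadrants that overlap it, so the number of points examined grows with the size of the query area rather than with the size of the dataset.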

Transforming Data to GeoParquet

Using Sedona's DataFrame writer, the dataset can be saved in GeoParquet format:

overture_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save("/data/overture_maps/optimized/roads.parquet")

Querying Optimized Data

Once the data is in GeoParquet format, queries can run considerably faster, because Spark reads only the columns a query needs:

optimized_df = spark.read.format("geoparquet").load("/data/overture_maps/optimized/roads.parquet")
optimized_df.createOrReplaceTempView("optimized_roads")

# Example: find all roads that lie entirely within a bounding box
result = spark.sql("""
    SELECT * FROM optimized_roads
    WHERE ST_Contains(ST_PolygonFromEnvelope(-74.0, 40.7, -73.9, 40.8), geometry)
""")
result.show()
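Note that the query uses ST_Contains, so a road that crosses the edge of the bounding box is excluded; ST_Intersects is the predicate to use if partially overlapping roads should be returned as well. The sketch below illustrates the difference in plain Python with hypothetical road coordinates. It simplifies boundary handling and uses an extent-overlap check as a coarse stand-in for intersection, so it is an approximation of the predicates, not a reimplementation of them.

```python
# Hypothetical roads as lists of (longitude, latitude) vertices.
ROADS = {
    "inside": [(-73.98, 40.72), (-73.95, 40.75)],
    "crossing": [(-73.95, 40.75), (-73.85, 40.75)],
    "outside": [(-73.80, 40.90), (-73.70, 40.95)],
}

QUERY_BOX = (-74.0, 40.7, -73.9, 40.8)  # min_x, min_y, max_x, max_y


def box_contains_line(box, line):
    """True if every vertex lies inside the box.

    Because the box is convex, this means the whole line is inside it.
    """
    min_x, min_y, max_x, max_y = box
    return all(min_x <= x <= max_x and min_y <= y <= max_y for x, y in line)


def box_overlaps_line_extent(box, line):
    """True if the line's own bounding box overlaps the query box (coarse filter)."""
    min_x, min_y, max_x, max_y = box
    xs = [x for x, _ in line]
    ys = [y for _, y in line]
    return not (max(xs) < min_x or min(xs) > max_x or max(ys) < min_y or min(ys) > max_y)


if __name__ == "__main__":
    for name, line in ROADS.items():
        print(name, box_contains_line(QUERY_BOX, line), box_overlaps_line_extent(QUERY_BOX, line))
```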

Improvements in Performance

Combining GeoParquet's columnar storage with Sedona's spatial indexing can deliver the following gains, although actual results depend on data layout, cluster size, and query shape:

Query Speed: For large datasets, query execution times can drop by a factor of five to ten.
Storage: Compared with raw GeoJSON, storage efficiency can improve by up to 60%.
Cost: In cloud systems, more efficient use of cluster resources lowers compute expenses.

With little overhead, this approach can support interactive geospatial analytics on terabyte-scale datasets.
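Since gains like these vary with the workload, it is worth measuring them on your own data. The following is a minimal timing sketch using only the standard library; the helper names are hypothetical, and the numbers in the usage note are arithmetic examples, not benchmark results.

```python
import statistics
import time


def median_runtime(fn, runs=5):
    """Return the median wall-clock seconds over several calls to fn()."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


def size_reduction_pct(original_bytes, optimized_bytes):
    """Percentage by which the optimized output is smaller than the original."""
    return 100.0 * (original_bytes - optimized_bytes) / original_bytes
```

For example, a query that takes 10 seconds on the raw data and 2 seconds on the optimized data gives `speedup(10.0, 2.0) == 5.0`, and a 1000-byte file that shrinks to 400 bytes gives `size_reduction_pct(1000, 400) == 60.0`. With Spark, pass an action such as `lambda: result.count()` to `median_runtime` so that the query is actually executed on each run.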

For businesses handling massive geospatial datasets, optimizing Overture Maps data with Apache Sedona and GeoParquet offers a potent, scalable solution. Analysts can execute complicated geographic queries more quickly and affordably by utilizing distributed computing, spatial indexing, and effective columnar storage.

This workflow provides a best-in-class method for high-performance geospatial analytics for data engineers, GIS experts, and developers working with large Overture Maps datasets.

For more information or any questions regarding Apache Sedona and GeoParquet, please don't hesitate to contact us at

Email: info@geowgs84.com

USA (HQ): (720) 702–4849

GeoWGS84AI

(A GeoWGS84 Corp Company)

https://www.geowgs84.ai

https://www.geowgs84.com/services/deep-learning-with-geospatial-data