Optimizing Overture Maps Data Using Apache Sedona and GeoParquet
- Nov 7, 2025
Managing enormous volumes of map data effectively has become essential for businesses, startups, and academic institutions in the current era of geospatial analytics. Overture Maps, an extensive open map dataset, provides rich geospatial information, but processing and using this data at scale can be challenging. This is where GeoParquet and Apache Sedona come in. In this post, we'll explore how to use these tools to optimize Overture Maps data for high-performance spatial analytics.

Why Apache Sedona?
Built on top of Apache Spark, Apache Sedona (previously GeoSpark) is a distributed computing platform created especially for processing spatial data. Among its main benefits are:
Scalability: Processes terabytes of geographic data across large clusters.
Spatial Indexing: Supports Hilbert Curve, QuadTree, and R-Tree indexes for efficient query execution.
Spatial Queries: Offers high-level APIs for range queries, k-nearest neighbor (kNN) queries, and spatial joins.
Compatibility: Easily integrates with current big data pipelines by working with Spark DataFrames.
With Sedona, Overture Maps datasets can be queried and processed in parallel across a cluster, which significantly reduces processing time compared with single-node alternatives.
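To make the idea of a distributed spatial join more concrete, here is a minimal pure-Python sketch. It uses a uniform one-degree grid as a stand-in for Sedona's own partitioning schemes, and the names `cell_of`, `cells_covering`, and `partitioned_join` are hypothetical helpers written for this illustration, not Sedona APIs. Each grid cell could be processed as an independent task, which is the property a cluster exploits.

```python
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (an arbitrary choice for this sketch)


def cell_of(x, y):
    """Return the grid cell that contains the point (x, y)."""
    return (math.floor(x / CELL), math.floor(y / CELL))


def cells_covering(box):
    """Return every grid cell touched by a (min_x, min_y, max_x, max_y) box."""
    min_x, min_y, max_x, max_y = box
    cx0, cy0 = cell_of(min_x, min_y)
    cx1, cy1 = cell_of(max_x, max_y)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def partitioned_join(points, boxes):
    """Join points to the boxes that contain them, one grid cell at a time."""
    point_parts = defaultdict(list)
    for pid, (x, y) in points.items():
        point_parts[cell_of(x, y)].append((pid, x, y))

    box_parts = defaultdict(list)
    for bid, box in boxes.items():
        # A box is copied into every cell it overlaps.
        for cell in cells_covering(box):
            box_parts[cell].append((bid, box))

    pairs = []
    # Each cell only needs its own points and boxes, so cells are independent.
    for cell, pts in point_parts.items():
        for bid, (min_x, min_y, max_x, max_y) in box_parts.get(cell, []):
            for pid, x, y in pts:
                if min_x <= x <= max_x and min_y <= y <= max_y:
                    pairs.append((pid, bid))
    return sorted(pairs)


# Made-up sample data for illustration only.
points = {"p1": (-73.95, 40.75), "p2": (-73.2, 40.9), "p3": (-122.4, 37.8)}
boxes = {"nyc": (-74.3, 40.4, -73.6, 41.0), "sf": (-122.6, 37.6, -122.2, 37.9)}
pairs = partitioned_join(points, boxes)
print(pairs)
```

Because every point lives in exactly one cell, each matching pair is found once, so no de-duplication step is needed in this simplified version.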
Introducing GeoParquet
GeoParquet is a specification that extends the Apache Parquet columnar file format for geospatial data. It defines how geometries (such as POINT, LINESTRING, and POLYGON) are encoded in a Parquet column and which spatial metadata accompanies them, enabling high-performance analytics without compromising geospatial accuracy.
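As a small illustration of geometry encoding, GeoParquet's WKB encoding stores each geometry as Well-Known Binary. The sketch below hand-encodes a single 2D point with Python's standard `struct` module; `point_to_wkb` and `wkb_to_point` are hypothetical helper names for this example, and in practice a library such as Sedona or Shapely handles this for you.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte for byte order (1 = little endian), a uint32 geometry
    type (1 = Point), then two float64 coordinates.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes):
    """Decode a 2D WKB point back into an (x, y) tuple."""
    byte_order = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(byte_order + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only decodes WKB points")
    return (x, y)


wkb = point_to_wkb(-73.95, 40.75)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

A point takes 21 bytes in this form, and the binary coordinates round-trip exactly, which is one reason a binary encoding is preferable to text for analytics.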
Advantages of GeoParquet
Columnar Storage: Speeds up analytical queries and lowers disk I/O.
Interoperability: Works with contemporary data processing frameworks such as DuckDB, Dask, and Apache Spark.
Spatial Metadata: Stores geometry type and CRS (Coordinate Reference System) information for reliable geospatial operations.
Organizations can significantly increase query speed and data storage efficiency by transforming raw Overture Maps data into GeoParquet.
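For a sense of what the spatial metadata looks like, GeoParquet keeps a JSON document under the `geo` key of the Parquet file's key-value metadata. The sketch below builds an example of that document and reads it back. The column name, geometry type, and bounding box are made-up values, and `describe_geo_metadata` is a hypothetical helper, not part of any library.

```python
import json

# Example "geo" metadata in the shape defined by the GeoParquet 1.0.0 spec.
# The values are illustrative, not taken from a real Overture file.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["LineString"],
            "bbox": [-74.0, 40.7, -73.9, 40.8],
        }
    },
}


def describe_geo_metadata(raw: str) -> dict:
    """Summarize the primary geometry column from a 'geo' metadata string."""
    meta = json.loads(raw)
    name = meta["primary_column"]
    column = meta["columns"][name]
    return {
        "column": name,
        "encoding": column["encoding"],
        "geometry_types": column["geometry_types"],
        # When "crs" is omitted, the spec defaults to OGC:CRS84 (lon/lat).
        "crs": column.get("crs", "OGC:CRS84"),
        "bbox": column.get("bbox"),
    }


info = describe_geo_metadata(json.dumps(geo_metadata))
print(info)
```

Readers use this metadata to find the geometry column, interpret its coordinates, and decide whether a file is relevant to a query.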
Step-by-Step: Optimizing Overture Maps with Sedona and GeoParquet
Setting Up Your Environment
To begin, make sure you have a Spark environment whose classpath includes the Sedona JARs, then install Sedona's Python package and the supporting libraries:
pip install apache-sedona
pip install pyarrow pandas geopandas
Loading Overture Maps Data
Overture Maps publishes its releases as GeoParquet files in cloud object storage. This walkthrough assumes you are starting from a GeoJSON extract of that data. With Sedona, such files can be loaded into Spark DataFrames:
from sedona.spark import SedonaContext
# SedonaContext.builder() preconfigures the Kryo serializer and Sedona's Kryo registrator
config = SedonaContext.builder() \
    .appName("OvertureMapsOptimization") \
    .getOrCreate()
# Registers Sedona's geometry type and ST_ functions on the session
spark = SedonaContext.create(config)
# The "geojson" reader requires a recent Sedona release. A single multi-line
# FeatureCollection may also need .option("multiLine", "true") and exploding the
# features array; check the Sedona documentation for your version.
overture_df = spark.read.format("geojson").load("/data/overture_maps/roads.geojson")
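To show what a loader has to work with, here is a small stdlib-only sketch that flattens a GeoJSON FeatureCollection into rows, one per feature. The sample feature, its `id` and `class` properties, and the `feature_rows` helper are all made up for this illustration, and the sketch only handles LineString geometries.

```python
import json

# A tiny, made-up GeoJSON document standing in for a roads extract.
sample = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"id": "road-1", "class": "residential"},
      "geometry": {
        "type": "LineString",
        "coordinates": [[-73.99, 40.71], [-73.97, 40.73], [-73.95, 40.75]]
      }
    }
  ]
}
"""


def feature_rows(text: str) -> list:
    """Flatten LineString features into dict rows with a bounding box."""
    collection = json.loads(text)
    rows = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "LineString":
            continue  # other geometry types nest coordinates differently
        xs = [lon for lon, lat in geometry["coordinates"]]
        ys = [lat for lon, lat in geometry["coordinates"]]
        row = dict(feature["properties"])
        row["geometry_type"] = geometry["type"]
        row["bbox"] = (min(xs), min(ys), max(xs), max(ys))
        rows.append(row)
    return rows


rows = feature_rows(sample)
print(rows)
```

The result is roughly the tabular shape you work with after loading: properties become columns and the geometry travels alongside them.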
Spatial Indexing with Sedona
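Sedona SQL does not provide a CREATE INDEX statement. Instead, Sedona can build an index on the fly during spatial joins (controlled by the sedona.global.index and sedona.global.indextype settings), or you can build one explicitly on a SpatialRDD created from the geometry column:
from sedona.core.enums import IndexType
from sedona.utils.adapter import Adapter
overture_df.createOrReplaceTempView("roads")
roads_rdd = Adapter.toSpatialRdd(overture_df, "geometry")  # DataFrame -> SpatialRDD
roads_rdd.analyze()  # compute the dataset's boundary and record count
roads_rdd.buildIndex(IndexType.QUADTREE, False)  # False: index the raw, unpartitioned RDD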
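To illustrate why a QuadTree index speeds up range queries, here is a minimal point quadtree in pure Python. It is a teaching sketch under simplifying assumptions (points only, a fixed node capacity of four, and a depth limit), not Sedona's implementation, and the `QuadTree` class here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def box_contains(box: Box, p: Point) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class QuadTree:
    MAX_DEPTH = 16  # stops endless splitting when many points coincide

    bounds: Box
    capacity: int = 4
    points: List[Point] = field(default_factory=list)
    children: List["QuadTree"] = field(default_factory=list)
    depth: int = 0

    def insert(self, p: Point) -> bool:
        if not box_contains(self.bounds, p):
            return False
        if not self.children:
            if len(self.points) < self.capacity or self.depth >= self.MAX_DEPTH:
                self.points.append(p)
                return True
            self._split()
        # any() stops at the first child that accepts the point.
        return any(child.insert(p) for child in self.children)

    def _split(self) -> None:
        min_x, min_y, max_x, max_y = self.bounds
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        quadrants = [
            (min_x, min_y, mid_x, mid_y), (mid_x, min_y, max_x, mid_y),
            (min_x, mid_y, mid_x, max_y), (mid_x, mid_y, max_x, max_y),
        ]
        self.children = [
            QuadTree(q, self.capacity, depth=self.depth + 1) for q in quadrants
        ]
        # Move the points held by this node down into the new children.
        for old in self.points:
            any(child.insert(old) for child in self.children)
        self.points = []

    def query(self, box: Box) -> List[Point]:
        # Whole subtrees are skipped when their bounds miss the query box.
        if not boxes_intersect(self.bounds, box):
            return []
        found = [p for p in self.points if box_contains(box, p)]
        for child in self.children:
            found.extend(child.query(box))
        return found


tree = QuadTree((-180.0, -90.0, 180.0, 90.0))
sample_points = [(-73.95, 40.75), (-73.99, 40.71), (-122.4, 37.8),
                 (2.35, 48.85), (139.7, 35.7), (-0.12, 51.5)]
for pt in sample_points:
    tree.insert(pt)
hits = sorted(tree.query((-74.0, 40.7, -73.9, 40.8)))
print(hits)
```

The speedup comes from the early return in `query`: quadrants that do not overlap the search box are never visited, so most of the data is never examined.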
Transforming Data to GeoParquet
With Sedona registered on the session, the DataFrame writer's "geoparquet" format saves the dataset as GeoParquet, including the spatial metadata (PyArrow-based tools such as GeoPandas offer a single-node alternative):
overture_df.write \
    .format("geoparquet") \
    .mode("overwrite") \
    .save("/data/overture_maps/optimized/roads.parquet")
Querying Optimized Data
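Queries over the GeoParquet copy are typically faster than over raw GeoJSON, because Spark reads only the columns a query needs. Read the data back with Sedona's "geoparquet" reader so the geometry column is restored as a geometry type rather than raw binary:
optimized_df = spark.read.format("geoparquet").load("/data/overture_maps/optimized/roads.parquet")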
optimized_df.createOrReplaceTempView("optimized_roads")
# Example: Find all roads within a bounding box
result = spark.sql("""
SELECT * FROM optimized_roads
WHERE ST_Contains(ST_PolygonFromEnvelope(-74.0, 40.7, -73.9, 40.8), geometry)
""")
result.show()
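Part of the benefit of bounding-box queries over GeoParquet comes from the per-file bounding boxes stored in the metadata, which let a reader skip files that cannot match. The sketch below shows that pruning idea in plain Python. The file names and bounding boxes are invented, `prune_files` is a hypothetical helper, and this is not how Sedona implements its filter pushdown internally.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def boxes_intersect(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prune_files(file_bboxes: Dict[str, Box], query: Box) -> List[str]:
    """Keep only the files whose bounding box overlaps the query envelope."""
    return [name for name, bbox in file_bboxes.items()
            if boxes_intersect(bbox, query)]


# Invented per-file bounding boxes, as a reader might collect from metadata.
file_bboxes = {
    "part-0000.parquet": (-74.3, 40.4, -73.6, 41.0),
    "part-0001.parquet": (-122.6, 37.6, -122.2, 37.9),
    "part-0002.parquet": (2.2, 48.8, 2.5, 48.9),
}

# Same envelope as the ST_PolygonFromEnvelope call in the query above.
query_envelope = (-74.0, 40.7, -73.9, 40.8)
to_read = prune_files(file_bboxes, query_envelope)
print(to_read)
```

Pruning works best when each file covers a compact area, so writing spatially nearby features into the same files tends to make this kind of skipping more effective.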
Improvements in Performance
Combining GeoParquet's columnar storage with Sedona's spatial indexing can deliver the following gains, though actual results depend on the dataset, query mix, and cluster configuration:
For large datasets, query execution times can drop by a factor of five to ten.
Compared with raw GeoJSON, storage efficiency can improve by up to 60%.
In cloud environments, lower cluster resource consumption reduces compute costs.
With little overhead, this approach supports near-real-time geospatial analytics on terabyte-scale datasets.
For businesses handling massive geospatial datasets, optimizing Overture Maps data with Apache Sedona and GeoParquet offers a potent, scalable solution. Analysts can execute complicated geographic queries more quickly and affordably by utilizing distributed computing, spatial indexing, and effective columnar storage.
This workflow provides a best-in-class method for high-performance geospatial analytics for data engineers, GIS experts, and developers working with large Overture Maps datasets.
For more information or any questions regarding Apache Sedona and GeoParquet, please don't hesitate to contact us at
Email: info@geowgs84.com
USA (HQ): (720) 702–4849
(A GeoWGS84 Corp Company)



