# Elasticsearch 在地理信息空间索引的探索和演进

2022/06/27 09:20

vivo 互联网服务器团队- Shuai Guangying

# 一、业务背景

LBS服务是当前互联网重要的一环，涉及餐饮、娱乐、打车、零售等场景。在这些场景中，有很重要的一项基础能力：搜索附近的POI。比如搜索附近的美食，搜索附近的电影院，搜索附近的专车，搜索附近的门店。例如：以某个坐标点为中心查询出1km半径范围的POI坐标，如下图所示：

Elasticsearch在地理位置信息检索上具备了毫秒级响应的能力，而毫秒级响应对于用户体验至关重要。上面的问题使用Elasticsearch，只需用到geo_distance查询就可以解决业务问题。使用Elasticsearch的查询语法如下：

GET /my_locations/_search
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": {
"geo_distance": {
"distance": "1km",
"pin.location": {
"lat": 40,
"lon": 116
}
}
}
}
}
}

# 二、背景知识

1.如何精确定位一个地址？

2. 如何计算两个地址距离？

// 代码摘自lucene-core-8.2.0， SloppyMath工具类

/**
* Returns the Haversine distance in meters between two points
* given the previous result from {@link #haversinSortKey(double, double, double, double)}
* @return distance in meters.
*/
public static double haversinMeters(double sortKey) {
}

/**
* Returns a sort key for distance. This is less expensive to compute than
* {@link #haversinMeters(double, double, double, double)}, but it always compares the same.
* This can be converted into an actual distance with {@link #haversinMeters(double)}, which
* effectively does the second half of the computation.
*/
public static double haversinSortKey(double lat1, double lon1, double lat2, double lon2) {
double x1 = lat1 * TO_RADIANS;
double x2 = lat2 * TO_RADIANS;
double h1 = 1 - cos(x1 - x2);
double h2 = 1 - cos((lon1 - lon2) * TO_RADIANS);
double h = h1 + cos(x1) * cos(x2) * h2;
// clobber crazy precision so subsequent rounding does not create ties.
return Double.longBitsToDouble(Double.doubleToRawLongBits(h) & 0xFFFFFFFFFFFFFFF8L);
}
// haversin
// TODO: remove these for java 9, they fixed Math.toDegrees()/toRadians() to work just like this.
public static final double TO_RADIANS = Math.PI / 180D;
public static final double TO_DEGREES = 180D / Math.PI;

// Earth's mean radius, in meters and kilometers; see http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf
private static final double TO_METERS = 6_371_008.7714D; // equatorial radius
private static final double TO_KILOMETERS = 6_371.0087714D; // equatorial radius

/**
* Returns the Haversine distance in meters between two points
* specified in decimal degrees (latitude/longitude).  This works correctly
* even if the dateline is between the two points.
* <p>
* Error is at most 4E-1 (40cm) from the actual haversine distance, but is typically
* much smaller for reasonable distances: around 1E-5 (0.01mm) for distances less than
* 1000km.
*
* @param lat1 Latitude of the first point.
* @param lon1 Longitude of the first point.
* @param lat2 Latitude of the second point.
* @param lon2 Longitude of the second point.
* @return distance in meters.
*/
public static double haversinMeters(double lat1, double lon1, double lat2, double lon2) {
return haversinMeters(haversinSortKey(lat1, lon1, lat2, lon2));
}

​​​​​​​3.如何方便在互联网分享经纬度坐标？

Geohash是2008-02-26由Gustavo Niemeyer在自己的个人博客上公布的算法服务。其初衷在于通过对经纬度的编码对外提供简短的URL标识地图位置，方便在电子邮件、论坛和网站中使用。

1. Geohash给地图上每个坐标提供了独一无二的ID，这个唯一ID就相当于给每个地理位置提供了一个身份证。唯一ID在数据库中应用场景非常丰富。

2. 在数据库中给坐标点提供了另一种存储方式，将二维的坐标点转化成为一维的字符串，对于一维数据就可以借助B树等索引来加速查询。

3. Geohash是一种前缀编码，位置相近的坐标点前缀相同。通过前缀提供了高性能的邻近位置POI查询，而邻近位置POI查询是LBS服务的核心能力。

Geohash是一种前缀编码，位置相近的坐标点前缀相同。Geohash编码长度不同，所覆盖区域范围不同。

• 暴力算法

• 二次筛选

1. 基于坐标中心点计算出geohash, 基于半径确定geohash前缀。

2. 通过Geohash前缀初筛出大致符合要求的坐标点(需要将中心点所在Geohash块周围8个Geohash块纳入初筛范围)。

3. 对于初筛结果使用Haversine公式进行二次筛选。

# 三、方案演进

Elasticsearch从2.0版本开始支持geo_distance查询，到当前已更新到7.14版本。

## 3.1 史前时代

Elasticsearch是基于Lucene构建的搜索引擎。Lucene最开始的设想是一个全文检索工具箱，即支持字符串检索，并没有考虑数值类型的处理。其核心思想非常简单，将文档分词后，为每个词构建一个term => array[docIds]的映射。

Lucene提供了一种适配方案RangeQuery。就是用枚举来模拟数值查询。简单来说：RangeQuery=BooleanQuery+TermQuery，所以限制查询是整数且区间最大不能超过1024。这种实现是可以说是非常鸡肋的，好在Lucene 2.9.0版本真正支持数值查询。

LUCENE-1470,LUCENE-1582,LUCENE-1602,LUCENE-1673,LUCENE-1701, LUCENE-1712

Added NumericRangeQuery and NumericRangeFilter, a fast alternative to RangeQuery/RangeFilter for numeric searches. They depend on a specific structure of terms in the index that can be created by indexing using the new NumericField or NumericTokenStream classes. NumericField can only be used for indexing and optionally stores the values as string representation in the doc store. Documents returned from IndexReader/IndexSearcher will return only the String value using the standard Fieldable interface. NumericFields can be sorted on and loaded into the FieldCache. (Uwe Schindler, Yonik Seeley, Mike McCandless)

## 3.2 Elasticsearch 2.0 版本

public static final class GeoPointFieldType extends MappedFieldType {

private MappedFieldType geohashFieldType;
private int geohashPrecision;
private boolean geohashPrefixEnabled;

private MappedFieldType latFieldType;
private MappedFieldType lonFieldType;

public GeoPointFieldType() {}
}

// 计算经纬度坐标+距离得到的矩形区域
// GeoDistance类
public static DistanceBoundingCheck distanceBoundingCheck(double sourceLatitude, double sourceLongitude, double distance, DistanceUnit unit) {
// angular distance in radians on a great circle
// assume worst-case: use the minor axis
double radDist = unit.toMeters(distance) / GeoUtils.EARTH_SEMI_MINOR_AXIS;

double minLon, maxLon;
if (minLat > MIN_LAT && maxLat < MAX_LAT) {
if (minLon < MIN_LON) minLon += 2d * Math.PI;
if (maxLon > MAX_LON) maxLon -= 2d * Math.PI;
} else {
// a pole is within the distance
minLat = Math.max(minLat, MIN_LAT);
maxLat = Math.min(maxLat, MAX_LAT);
minLon = MIN_LON;
maxLon = MAX_LON;
}

GeoPoint topLeft = new GeoPoint(Math.toDegrees(maxLat), Math.toDegrees(minLon));
GeoPoint bottomRight = new GeoPoint(Math.toDegrees(minLat), Math.toDegrees(maxLon));
if (minLon > maxLon) {
return new Meridian180DistanceBoundingCheck(topLeft, bottomRight);
}
return new SimpleDistanceBoundingCheck(topLeft, bottomRight);
}

public class IndexedGeoBoundingBoxQuery {

public static Query create(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
if (!fieldType.isLatLonEnabled()) {
throw new IllegalArgumentException("lat/lon is not enabled (indexed) for field [" + fieldType.names().fullName() + "], can't use indexed filter on it");
}
//checks to see if bounding box crosses 180 degrees
if (topLeft.lon() > bottomRight.lon()) {
return westGeoBoundingBoxFilter(topLeft, bottomRight, fieldType);
} else {
return eastGeoBoundingBoxFilter(topLeft, bottomRight, fieldType);
}
}

private static Query westGeoBoundingBoxFilter(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
BooleanQuery.Builder filter = new BooleanQuery.Builder();
filter.setMinimumNumberShouldMatch(1);
return new ConstantScoreQuery(filter.build());
}

private static Query eastGeoBoundingBoxFilter(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
BooleanQuery.Builder filter = new BooleanQuery.Builder();
return new ConstantScoreQuery(filter.build());
}
}

// GeoDistanceRangeQuery类的实现
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
final Weight boundingBoxWeight;
if (boundingBoxFilter != null) {
boundingBoxWeight = searcher.createNormalizedWeight(boundingBoxFilter, false);
} else {
boundingBoxWeight = null;
}
return new ConstantScoreWeight(this) {
@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
final DocIdSetIterator approximation;
if (boundingBoxWeight != null) {
approximation = boundingBoxWeight.scorer(context);
} else {
}
if (approximation == null) {
// if the approximation does not match anything, we're done
return null;
}
final TwoPhaseIterator twoPhaseIterator = new TwoPhaseIterator(approximation) {
@Override
public boolean matches() throws IOException {
final int doc = approximation.docID();
values.setDocument(doc);
final int length = values.count();
for (int i = 0; i < length; i++) {
GeoPoint point = values.valueAt(i);
if (distanceBoundingCheck.isWithin(point.lat(), point.lon())) {
double d = fixedSourceDistance.calculate(point.lat(), point.lon());
if (d >= inclusiveLowerPoint && d <= inclusiveUpperPoint) {
return true;
}
}
}
return false;
}
};
return new ConstantScoreScorer(this, score(), twoPhaseIterator);
}
};
}

1. 利用中心点坐标和半径确定矩形区域边界。

2. 利用Bool查询综合两个NumericRangeQuery查询，实现矩形区域初筛。

3. 利用Haversine公式计算中心点和矩形区域内每个坐标点距离，进行第二阶段过滤操作，筛选出最终符合条件的docId集合。

## 3.3 Elasticsearch 2.2 版本

ES2.0版本的实现有个问题， 就是没有很好解决二维组合条件查询的数据筛选。它是分别获取符合纬度范围条件的文档集合和符合经度范围条件的文档集合然后进行交集，初筛了太多无效的文档集合。

The region quadtree represents a partition of space in two dimensions by decomposing the region into four equal quadrants, subquadrants, and so on with each leaf node containing data corresponding to a specific subregion. Each node in the tree either has exactly four children, or has no children (a leaf node). The height of quadtrees that follow this decomposition strategy (i.e. subdividing subquadrants as long as there is interesting data in the subquadrant for which more refinement is desired) is sensitive to and dependent on the spatial distribution of interesting areas in the space being decomposed. The region quadtree is a type of trie.

morton编码：在理解ES的处理思路前，需要科普一个知识点，那就是morton编码。关于morton编码，跟geohash类似，是一种将二维数据按二进制位交叉编码成一维数据的一种网格编码，其用法和特点跟geohash也是类似的。对于64位的morton码，其经纬度定位精度范围控制到了厘米级别，对于地理位置场景而言，是非常非常高的精度了。

(取shift=27,36)

(取shift=36,45)

double centerLon = 116.433322;
double centerLat = 39.900255;
GeoRect geoRect = GeoUtils.circleToBBox(centerLon, centerLat, radiusMeters);
System.out.println( geoRect );

IndexGeoPointFieldData indexFieldData = parseContext.getForField(fieldType);
final Query query;
if (parseContext.indexVersionCreated().before(Version.V_2_2_0)) {
query = new GeoDistanceRangeQuery(point, null, distance, true, false, geoDistance, geoFieldType, indexFieldData, optimizeBbox);
} else {
query = new GeoPointDistanceQuery(indexFieldData.getFieldNames().indexName(), point.lon(), point.lat(), distance);
}

if (queryName != null) {
}

## 3.4 Elasticsearch 5.0 版本

• LUCENE-6825

This can be used for very fast 1D range filtering for numerics, removing the 8 byte (long/double) limit we have today, so e.g. we could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.

It can also be used for > 1D use cases, like 2D (lat/lon) and 3D (x/y/z with geo3d) geo shape intersection searches.

...

It should give sizable performance gains (smaller index, faster searching) over what we have today, and even over what auto-prefix with efficient numeric terms would do.

【优化内存查询】：BST(binary-search-tree) > Self-balanced BST > kd-tree。

【优化外存(硬盘)查询】：B-tree > K-D-B-tree > BKD tree。

kd-tree其实就是多维的BST。例如：

【数据存储】：BKD tree的核心思路是非常简单的，将N维点集合形成的矩形空间(southWest,northEast)递归分割成更小的矩形空间。跟常见的kd-tree不同，当分割到网格区域里面坐标点的数量小于一定数量(比如1024)就停止了。

## 3.5 后续发展

Geo查询能力的迭代和变迁，其实也是Elasticsearch作为一个数据库对数值查询能力的升级和优化，扩展产品的适用场景，让使用者打破对Elasticsearch只能做全文检索的偏见。从全文检索数据库扩展到分析型数据库，Elasticsearch还有很长的路要走。

3 评论
38 收藏
2