《DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node》论文粗读笔记
序号:
文献背景:NeurIPS2019
研究方法:DiskANN
存在问题:当前sota的ANNS的索引必须存储在内存中以获得high-recall search,但是这使得搜索expensive,并且limits the size of datasets。
解决方案:
——提出了一个graph based indexing and search system called DiskANN,能够index, store and search a billion point database on a single workstation with just 64 GB RAM and an inexpensive solid-state drive(SSD)。
创新点:
——提出了Vamana索引结构,具有相比于NSG和HNSW更小的diameter,从而minimize the number of sequential disk reads.
——Vamana索引在面对大规模数据集的时候能够生成small indices for overlapping partitions,并且能够轻易地merge into one index,提供同样的search performance as a single-shot index constructed for the entire dataset。因此,这使得DiskANN能够支持size超过内存容量的dataset。
——在构建DiskANN的时候,Vamana能够结合off-the-shelf的vector compression scheme such as product quantization(PQ)。因此,图索引和原本的向量数据存储在磁盘上,而将压缩后的向量数据存在内存中。
下一步/不足之处:
数据集:
SIFT1M
GIST1M
DEEP1M
DEEP1B
ANN_SIFT1B