Python简易搜索引擎原理及实现(二)查询处理
实现查询词条的与(AND)、或(OR)、与非(ANDNOT)操作,三种操作均通过Hash方法实现,即通过Hash表记录所需查询的词条在250篇文章中出现的次数:
AND操作即$hash[i] == len(opt)$, OR操作即$hash[i] == 1$, ANDNOT操作即$hash[i] == 0$
查询处理
对输入格式为:word1 OP word2 OP word3 …, OP: AND/OR/ANDNOT 的内容进行查询
查询并合并结果(merge)的思路如下图:
-
AND操作:分别查询出每个词汇对应的posting list(即文章ID 的list),然后对各结果之间做取交集的操作。
-
OR操作:分别查询出每个词汇对应的posting list(即文章ID 的list),然后对各结果之间做取并集的操作。
-
ANDNOT操作:分别查询出每个词汇对应的posting list(即文章ID 的list),取出那些存在于操作符左侧同时不存在于操作符右侧的文章ID。 举个例子来说就是 word1 ANDNOT word2,word1的结果是 1 -> 3 -> 8 -> 11,word2的结果是 2 -> 3 -> 5 -> 8 ,那么最终得到的ANDNOT结果就是 1 -> 11
def solve(instr, resultdict):
if ' ' not in instr:
instr = instr + ' AND ' + instr
firstsplit = instr.split()
if len(firstsplit) < 2:
return '格式错误, 多输入空格'
operation = firstsplit[1]
if operation not in ('AND', 'OR', 'ANDNOT') or '*' in instr:
return '操作代码错误'
if len(firstsplit) > 4:
former = solve(firstsplit[0] + ' ' + operation + ' ' + firstsplit[2], resultdict)
later = solve(firstsplit[4], resultdict)
myhash = [0]
ans = list()
for i in range(1, 250 + 1):
myhash.append(0)
for i in former:
myhash[i] = myhash[i] + 1
for i in later:
myhash[i] = myhash[i] + 1
if firstsplit[3] == 'AND':
for i in range(1, 250 + 1):
if myhash[i] == 2:
ans.append(i)
else:
for i in range(1, 250 + 1):
if myhash[i] == 1:
ans.append(i)
return ans
secondsplit = instr.split(operation)
myhash = [0]
for i in range(1, 250 + 1):
myhash.append(0)
optsize = len(secondsplit)
for items in secondsplit:
items = items.strip()
DOCIDs = resultdict.get(items)
if DOCIDs is None:
return '该词条不存在'
for docid, tf in DOCIDs.items():
if docid == 'df':
continue
myhash[docid] = myhash[docid] + 1
ans = list()
for i in range(1, 250 + 1):
if myhash[i] == optsize and operation == 'AND' or myhash[i] >= 1 and operation == 'OR' or operation == 'ANDNOT' and myhash[i]==0:
ans.append(i)
return ans