简易搜索引擎(二)

Python简易搜索引擎原理及实现(二)查询处理

Posted by CY on January 4, 2021

Python简易搜索引擎原理及实现(二)查询处理

实现查询词条的与(AND)、或(OR)、与非(ANDNOT)操作,三种操作均通过Hash方法实现,即通过Hash表记录所需查询的词条在250篇文章中出现的次数:

AND操作即$hash[i] == len(opt)$, OR操作即$hash[i] == 1$, ANDNOT操作即$hash[i] == 0$

查询处理

对输入格式为:word1 OP word2 OP word3 …, OP: AND/OR/ANDNOT 的内容进行查询

查询并合并结果(merge)的思路如下图:

image-20210401121445827

image-20210401121509183

  • AND操作:分别查询出每个词汇对应的posting list(即文章ID 的list),然后对各结果之间做取交集的操作。

  • OR操作:分别查询出每个词汇对应的posting list(即文章ID 的list),然后对各结果之间做取并集的操作。

  • ANDNOT操作:分别查询出每个词汇对应的posting list(即文章ID 的list),取出那些存在于操作符左侧同时不存在于操作符右侧的文章ID。 举个例子来说就是 word1 ANDNOT word2,word1的结果是 1 -> 3 -> 8 -> 11,word2的结果是 2 -> 3 -> 5 -> 8 ,那么最终得到的ANDNOT结果就是 1 -> 11

def solve(instr, resultdict):
    if ' ' not in instr:
        instr = instr + ' AND ' + instr
    firstsplit = instr.split()
    if len(firstsplit) < 2:
        return '格式错误, 多输入空格'
    operation = firstsplit[1]
    if operation not in ('AND', 'OR', 'ANDNOT') or '*' in instr:
        return '操作代码错误'
    if len(firstsplit) > 4:
        former = solve(firstsplit[0] + ' ' + operation + ' ' + firstsplit[2], resultdict)
        later = solve(firstsplit[4], resultdict)
        myhash = [0]
        ans = list()
        for i in range(1, 250 + 1):
            myhash.append(0)
        for i in former:
            myhash[i] = myhash[i] + 1
        for i in later:
            myhash[i] = myhash[i] + 1
        if firstsplit[3] == 'AND':
            for i in range(1, 250 + 1):
                if myhash[i] == 2:
                    ans.append(i)
        else:
            for i in range(1, 250 + 1):
                if myhash[i] == 1:
                    ans.append(i)
        return ans
    secondsplit = instr.split(operation)
    myhash = [0]
    for i in range(1, 250 + 1):
        myhash.append(0)
    optsize = len(secondsplit)
    for items in secondsplit:
        items = items.strip()
        DOCIDs = resultdict.get(items)
        if DOCIDs is None:
            return '该词条不存在'
        for docid, tf in DOCIDs.items():
            if docid == 'df':
                continue
            myhash[docid] = myhash[docid] + 1
    ans = list()
    for i in range(1, 250 + 1):
        if myhash[i] == optsize and operation == 'AND' or myhash[i] >= 1 and operation == 'OR' or operation == 'ANDNOT' and myhash[i]==0:
            ans.append(i)
    return ans