Information Retrieval and Data Mining | [Experiment] Ranked Retrieval Model

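Contents

📚Experiment Content
📚Related Concepts
📚Experiment Steps
- 🐇Tokenization and preprocessing
- 🐇Building the inverted index
- 🐇Computing the similarity between the query and each document
- 🐇Query preprocessing and the search function
  - 🔥Lexical analysis and normalization of the input text
  - 🔥Search function
- 🐇Debugging results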

📚Experiment Content

Building on Experiment 1, implement a basic ranked retrieval model.
- Input: a query (like Ron Weasley birthday)
- Output: Return the top K (e.g., K = 100) relevant tweets.
Use SMART notation: lnc.ltn
- Document: logarithmic tf (l as first character), no idf and cosine normalization
- Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
Improve the inverted index:
- Store the DF of each term in the dictionary.
- Store the TF of the term in each doc in the posting list, as pairs (docID, tf).
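
To make the improved index layout concrete, here is a minimal sketch, assuming the tweets are already tokenized. The function name `build_index`, the entry layout, and the toy `docs` are hypothetical illustrations, not taken from the experiment code. Each dictionary entry stores the term's DF next to a posting list of (docID, tf) pairs.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build {term: {"df": int, "postings": [(doc_id, tf), ...]}} from tokenized docs."""
    index = defaultdict(lambda: {"df": 0, "postings": []})
    for doc_id, tokens in docs.items():
        # Counter gives the raw term frequency of each term in this document
        for term, tf in Counter(tokens).items():
            entry = index[term]
            entry["df"] += 1                      # one more document contains the term
            entry["postings"].append((doc_id, tf))
    return dict(index)


# Hypothetical toy collection: doc_id -> list of tokens
docs = {
    "t1": ["ron", "weasley", "birthday", "ron"],
    "t2": ["harry", "birthday"],
}
index = build_index(docs)
print(index["birthday"])
```

Storing DF explicitly is one possible design; it could also be derived as the length of the posting list, at the cost of recomputing it on every lookup.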

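📚Related Concepts

See also: Information Retrieval and Data Mining | (5) Document Scoring, Term Weighting, and the Vector Space Model
Term frequency: the number of times term t occurs in a document.
Collection frequency: the number of times the term occurs in the whole document collection.
Document frequency: the number of documents that contain t.
Inverse document frequency: $idf_t = \log(N / df_t)$, where N is the total number of documents.
$\text{tf-idf}_{t,d}$: the product of a term's tf weight and its idf weight; with the logarithmic tf used in this experiment, $\text{tf-idf}_{t,d} = (1 + \log tf_{t,d}) \cdot \log(N / df_t)$.
Similarity: in this experiment, the score of document d for query q is the sum, over the terms they share, of the product of the query weight and the document weight, $score(q, d) = \sum_{t \in q \cap d} w_{t,q} \cdot w_{t,d}$.
Query weighting scheme: SMART notation ddd.qqq names the document scheme first and the query scheme second; this experiment uses lnc.ltn.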
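
The quantities defined above can be computed on a toy collection. The sketch below is only an illustration: the three tokenized documents and the helper `term_stats` are hypothetical, and idf is taken as log(N / df) with the natural logarithm, matching the `math.log` call used later in this experiment's code.

```python
import math
from collections import Counter

# Hypothetical toy collection of already-tokenized documents
docs = [
    ["ron", "weasley", "ron"],
    ["harry", "potter"],
    ["ron", "harry"],
]
N = len(docs)


def term_stats(term):
    """Return (per-document tf list, collection frequency, document frequency, idf)."""
    tfs = [Counter(doc)[term] for doc in docs]   # term frequency in each document
    cf = sum(tfs)                                # total occurrences in the collection
    df = sum(1 for tf in tfs if tf > 0)          # number of documents containing the term
    idf = math.log(N / df) if df else 0.0        # idf is set to 0 for unseen terms here
    return tfs, cf, df, idf


tfs, cf, df, idf = term_stats("ron")
print(tfs, cf, df, round(idf, 4))
```

Note that cf counts every occurrence while df counts each document once, so cf is never smaller than df.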

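📚Experiment Steps

🐇Tokenization and preprocessing

Convert the input tweet document to lowercase, so that all later processing is uniform and queries are case-insensitive.
Locate the key fields in the tweet document by searching for specific markers, and extract the tweetid and the tweet text.
Tokenize the extracted text and convert each word to its singular form.
Lemmatize the resulting token list, mainly to reduce verbs to their base form, and filter out the field names ["text", "tweetid"].
Append the remaining valid terms to the final result list and return it.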

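# Tokenization and preprocessing
# Requires: from textblob import TextBlob, Word; uselessTerm = ["text", "tweetid"]
def tokenize_tweet(document):
    # Lowercase everything so that queries are case-insensitive
    document = document.lower()
    # Locate the key fields in the tweet document by their markers.
    # The -1 and -3 offsets adjust for whether the surrounding quote and comma are included.
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Keep only the main information: the tweetid and the text
    document = document[a:b] + document[c:d]
    # Tokenize and convert each word to its singular form
    terms = TextBlob(document).words.singularize()
    # Lemmatize the tokens and keep only the terms that are not useless
    result = []
    for word in terms:
        # Convert the current token to a Word object
        expected_str = Word(word)
        # Reduce verbs to their base form
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Drop ["text", "tweetid"]; everything else goes into result
            result.append(expected_str)
    return result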
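
The index arithmetic is easier to follow on a concrete line. The sketch below isolates only the slicing step, without TextBlob. The sample line is hypothetical: it assumes each tweet is a JSON-like record in which the text field is followed by timeStr and the tweetId field is followed by errorCode, separated by a comma and a space. The real Tweets.txt layout may differ, and `extract_fields` is an illustrative name.

```python
def extract_fields(document):
    """Return the 'tweetid' and 'text' parts of a tweet line, as in the slicing step."""
    document = document.lower()
    a = document.index("tweetid") - 1    # back up onto the opening quote of "tweetid"
    b = document.index("errorcode") - 1  # stop just before the opening quote of "errorcode"
    c = document.index("text") - 1       # back up onto the opening quote of "text"
    d = document.index("timestr") - 3    # skip the quote, the space and the comma
    return document[a:b] + document[c:d]


# Hypothetical sample line (assumed field order and separators)
line = ('{"userName": "Alice", "clusterNo": 82, "text": "Ron Weasley birthday party", '
        '"timeStr": "Sun Mar 11", "tweetId": "12345", "errorCode": "200"}')
extracted = extract_fields(line)
print(extracted)
```

Because the tweetid slice comes first, the first token produced later is the tweet id, which is why the indexing step can read it from position 0.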

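🐇Building the inverted index

Store the TF of each term in each doc as pairs (docID, tf).

Note first that this step computes the weight of each term in each document using the lnc rule, i.e. logarithmic tf (l as first character), no idf and cosine normalization.
The procedure is as follows:
- Read the file. Each line is one tweet. Tokenize each line into words and store them in a list, line.
- Use a global variable N to count the number of tweet documents read.
- Take the tweetid from line and remove it from line.
- Create an empty dictionary tf to count how often each word occurs in the current document. Iterate over the words in line and update the count depending on whether the word is already a key of tf.
- Apply logarithmic tf to each count in tf: take the logarithm of the count and add 1, i.e. 1 + log(tf). (This is the l in lnc.)
- Normalize (the c in lnc, cosine normalization): iterate over the keys of tf to accumulate the sum of squared weights and derive the normalization factor, then iterate over the keys again and multiply each weight by that factor. The result is the final document weight of each term.
- Convert line to a set, unique_terms, and iterate over its words.
  - If the word is already a key of postings, update its entry by adding the tweetid and the weight.
  - If the word is not yet a key of postings, create the key and add the tweetid and the weight.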

Counting term frequencies

# Count term frequencies: the number of occurrences of each word in the current document
tf = {}
for word in line:
    if word in tf:
        tf[word] += 1
    else:
        tf[word] = 1

Logarithmic tf: $1+\log(tf_{t,d})$

# logarithmic tf (requires: import math)
for word in tf:
    tf[word] = 1 + math.log(tf[word])

Cosine normalization factor: $\frac{1}{\sqrt{{w_1}^2+{w_2}^2+...+{w_m}^2}}$

# cosine normalization
cosine = 0
for word in tf:
    cosine = cosine + tf[word] * tf[word]
cosine = 1.0 / math.sqrt(cosine)
for word in tf:
    tf[word] = tf[word] * cosine
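
The three snippets above can be combined into one small function and checked on a toy document. This is a minimal sketch, assuming the document is already tokenized; the name `lnc_weights` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def lnc_weights(tokens):
    """Return {term: weight} using logarithmic tf, no idf, cosine normalization (lnc)."""
    # l: logarithmic tf, 1 + log(tf)
    log_tf = {term: 1 + math.log(count) for term, count in Counter(tokens).items()}
    if not log_tf:
        return {}  # an empty document has no weights (and avoids dividing by zero)
    # c: divide by the Euclidean length of the weight vector
    norm = math.sqrt(sum(w * w for w in log_tf.values()))
    return {term: w / norm for term, w in log_tf.items()}


weights = lnc_weights(["ron", "ron", "weasley"])
for term, w in weights.items():
    print(f"{term}: {w:.4f}")
```

After normalization the squared weights of a document sum to 1, so long and short tweets become comparable when their weights are multiplied with the query weights.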

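🐇Computing the similarity between the query and each document

This step has two parts: first compute the weight of each query term, then compute the similarity (multiply the query weight and the document weight of each shared term and sum the products) and sort in descending order. Query weights use the ltn rule, i.e. logarithmic tf (l in leftmost column), idf (t in second column), no normalization.
The procedure is as follows:
- Iterate over the query term list, query, count the frequency of each term, and store the counts in tf.
- Iterate over the keys of tf (the query terms) and look up each term's document frequency df, i.e. the number of documents in its posting list in postings. If a term is not in postings, set df to the global variable N (the total number of documents), so that its idf, log(N / df), is 0.
- Compute the weight tf[word] = (math.log(tf[word]) + 1) * math.log(N / df), which corresponds to ltn (logarithmic tf, idf, no normalization).
- For each query term, check whether it exists in postings. If it does, iterate over its posting list (document ids and the corresponding term weights) and add the product of the document weight and the query term's tf-idf weight to that document's similarity score.
- Store the scores, sort them in descending order, and return the resulting ranked list.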
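```
# Requires: import math; the global index postings and document count N
def similarity(query):
    # Scores are local so that they do not accumulate across successive queries
    scores = {}
    tf = {}
    # Count term frequencies in the query
    for word in query:
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
    # Look up document frequencies and compute the ltn weights:
    # logarithmic tf (l in leftmost column), idf (t in second column), no normalization
    for word in tf:
        posting = postings.get(word)
        # Unknown terms (or empty posting lists) get df = N, so their idf is 0
        df = len(posting) if posting else N
        tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
    # Accumulate the similarity; iterating over distinct terms avoids
    # counting a repeated query word more than once
    for word in tf:
        for tid, weight in postings.get(word, {}).items():
            scores[tid] = scores.get(tid, 0.0) + weight * tf[word]
    # Sort by score (similarity) in descending order
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```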
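
To see the ltn weighting and the weighted sum on concrete numbers, here is a small self-contained sketch. The toy index, its document weights, the collection size N = 4, and the function name `score` are all made up for illustration; they are not output of the experiment.

```python
import math

# Hypothetical toy index: term -> {tweet_id: lnc document weight}
postings = {
    "ron": {"t1": 0.8, "t3": 0.6},
    "birthday": {"t1": 0.6, "t2": 1.0},
}
N = 4  # assumed total number of documents


def score(query_terms):
    """Rank documents by the sum of (ltn query weight * lnc document weight)."""
    counts = {}
    for term in query_terms:
        counts[term] = counts.get(term, 0) + 1
    scores = {}
    for term, tf in counts.items():
        docs = postings.get(term)
        if not docs:
            continue  # a term missing from the index contributes nothing
        # ltn: (1 + log tf) * log(N / df), no normalization
        w_q = (1 + math.log(tf)) * math.log(N / len(docs))
        for tid, w_d in docs.items():
            scores[tid] = scores.get(tid, 0.0) + w_q * w_d
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = score(["ron", "birthday", "hermione"])
for tid, s in ranking:
    print(f"{tid}: {s:.4f}")
```

Both indexed terms occur in 2 of the 4 documents, so each has idf log(2); t1 matches both terms and therefore ranks first, while the unknown term "hermione" is simply ignored.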

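🐇Query preprocessing and the search function

🔥Lexical analysis and normalization of the input text

# Requires: from textblob import TextBlob, Word
def token(doc):
    # Convert the input text to lowercase for uniform processing
    doc = doc.lower()
    # Split the text into individual terms and try to convert each to its singular form
    terms = TextBlob(doc).words.singularize()
    # Lemmatize the tokens and return them in the list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result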

🔥Search function

# Requires: import sys; from functools import reduce
def Union(sets):
    return reduce(set.union, sets)

def do_search():
    query = token(input("please input search query >> "))
    if query == []:
        sys.exit()
    # set() removes duplicate terms from the query term list
    unique_query = set(query)
    # Build the set of tweet ids for each query term and take their union with Union().
    # .get() is used so that unknown terms do not insert empty entries into the defaultdict.
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("Found " + str(len(relevant_tweetids)) + " relevant tweets in total!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        # Print at most the first 100 results
        for (tweetid, score) in scores[:100]:
            print(str(score) + ": " + tweetid)
        print("finished")
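
The search function sorts every score and then keeps the first 100. As a possible alternative, not used in the original code, the top K results can be selected with the standard library's `heapq.nlargest`, which avoids sorting the full score list when K is small. The helper name `top_k` and the sample scores below are hypothetical.

```python
import heapq


def top_k(scores, k=100):
    """Return the k highest-scoring (tweet_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Hypothetical similarity scores: tweet_id -> score
scores = {"t1": 0.97, "t2": 0.69, "t3": 0.42, "t4": 0.10}
best = top_k(scores, k=2)
print(best)
```

For a collection of this size the difference is unlikely to matter; the full sort used in the experiment is simpler and gives the same first K entries.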

🐇Debugging results


Final code

import sys
import math
from collections import defaultdict
from functools import reduce

from textblob import TextBlob
from textblob import Word

uselessTerm = ["text", "tweetid"]
# Inverted index: stores the weight of each term in each doc as pairs (docID, tf)
postings = defaultdict(dict)
# Number of documents N
N = 0


# Tokenization and preprocessing
def tokenize_tweet(document):
    # Lowercase everything so that queries are case-insensitive
    document = document.lower()
    # Locate the key fields in the tweet document by their markers.
    # The -1 and -3 offsets adjust for whether the surrounding quote and comma are included.
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Keep only the main information: the tweetid and the text
    document = document[a:b] + document[c:d]
    # Tokenize and convert each word to its singular form
    terms = TextBlob(document).words.singularize()
    # Lemmatize the tokens and keep only the terms that are not useless
    result = []
    for word in terms:
        # Convert the current token to a Word object
        expected_str = Word(word)
        # Reduce verbs to their base form
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Drop ["text", "tweetid"]; everything else goes into result
            result.append(expected_str)
    return result


# Build the inverted index, storing the weight of each term in each doc as pairs (docID, tf)
# lnc: logarithmic tf, no idf and cosine normalization
def get_postings():
    global postings, N
    # Read the file; each tweet is one element of lines
    with open(r"Tweets.txt") as content:
        lines = content.readlines()
    for line in lines:
        N += 1
        # Preprocess
        line = tokenize_tweet(line)
        # The first element of the token list is the tweetid of the document;
        # remove it so that it is not treated as a term
        tweetid = line.pop(0)
        # Count term frequencies: the number of occurrences of each word in the current document
        tf = {}
        for word in line:
            if word in tf:
                tf[word] += 1
            else:
                tf[word] = 1
        # Skip tweets with no terms, which would otherwise cause a division by zero below
        if not tf:
            continue
        # logarithmic tf
        for word in tf:
            tf[word] = 1 + math.log(tf[word])
        # cosine normalization
        cosine = 0
        for word in tf:
            cosine = cosine + tf[word] * tf[word]
        cosine = 1.0 / math.sqrt(cosine)
        for word in tf:
            tf[word] = tf[word] * cosine
        # The keys of tf are the unique terms of the document;
        # postings is a defaultdict, so missing terms are created automatically
        for key_word in tf:
            postings[key_word][tweetid] = tf[key_word]


# Query normalization
def token(doc):
    # Convert the input text to lowercase for uniform processing
    doc = doc.lower()
    # Split the text into individual terms and try to convert each to its singular form
    terms = TextBlob(doc).words.singularize()
    # Lemmatize the tokens and return them in the list result
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result


# Compute the similarity between the query and each document
def similarity(query):
    # Scores are local so that they do not accumulate across successive queries
    scores = {}
    tf = {}
    # Count term frequencies in the query
    for word in query:
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
    # Look up document frequencies and compute the ltn weights:
    # logarithmic tf (l in leftmost column), idf (t in second column), no normalization
    for word in tf:
        posting = postings.get(word)
        # Unknown terms (or empty posting lists) get df = N, so their idf is 0
        df = len(posting) if posting else N
        tf[word] = (math.log(tf[word]) + 1) * math.log(N / df)
    # Accumulate the similarity; iterating over distinct terms avoids
    # counting a repeated query word more than once
    for word in tf:
        for tid, weight in postings.get(word, {}).items():
            scores[tid] = scores.get(tid, 0.0) + weight * tf[word]
    # Sort by score (similarity) in descending order
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def Union(sets):
    return reduce(set.union, sets)


def do_search():
    query = token(input("please input search query >> "))
    if query == []:
        sys.exit()
    # set() removes duplicate terms from the query term list
    unique_query = set(query)
    # Build the set of tweet ids for each query term and take their union with Union().
    # .get() is used so that unknown terms do not insert empty entries into the defaultdict.
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("Found " + str(len(relevant_tweetids)) + " relevant tweets in total!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 100 tweets are:")
        scores = similarity(query)
        # Print at most the first 100 results
        for (tweetid, score) in scores[:100]:
            print(str(score) + ": " + tweetid)
        print("finished")


def main():
    get_postings()
    while True:
        do_search()


if __name__ == "__main__":
    main()