自己动手实现scikit库中的fit和transform方法

文本分析第一步要解决的是如何将文本非结构化信息转化为结构化信息，其中最关键的是特征抽取，我们使用scikit-learn库fit和tranform方法实现了文本数据的特征抽取。

但是对于fit和transform，大家可能还是有点迷糊。最近又将《Applied Text Analysis WIth Python》读了一遍（别惊讶，82页过一遍很快的。之前一直以为这本书82页，今天才发现这本书完整版是400多页。）我主要结合这本书代码和自己的理解，实现了fit和tranform算法，方便大家更好的理解文本分析特征抽取。

一、scikit库代码实例

fit方法作用：给文本数据建立词典的过程
transform方法作用：根据词典对所有的文本数据进行编码（转码）

1.1 我们先看看fit代码实例

corpus = ["Hey hey hey lets go get lunch today :)","Did you go home?","Hey!!! I need a favor"]from sklearn.feature_extraction.text import CountVectorizervectorize = CountVectorizer()#fit学会语料中的所有词语，构建词典vectorize.fit(corpus)#这里我们查看下“词典”，也就是特征集(11个特征词)print(vectorize.get_feature_names())
['did','favor','get','go','hey','home','lets','lunch','need','today','you']

1.2 transform实例

根据建立好的词典vectorize对corpus进行编码。这里为了便于观看理解，我们使用pandas处理下数据输出。

import pandas as pddtm = vectorize.transform(corpus)colums_name = vectorize.get_feature_names()series = dtm.toarray()print(pd.DataFrame(series, columns = colums_name ))

从上面的dataframe表中，行代表一个文档，列代表特征词。比如第1行，hey列的所对应的单元格值为3，说明corpus中第一个document（Hey hey hey lets go get lunch today :）出现了三次hey。

二、fit 与 transform算法实现

思路：

首先要对输入的文本数据能够分词（这里我们假设是英文吧）
对英文字符能够识别是否为符号，防止出现如“good_enough”这种中间含有非英文字符。
剔除停止词，如“a”、“ the”等
词干化
经过步骤1-4清洗，输出干净的词语列表数据。
基于词语列表，这里需要有一个容器存储每一个新出现的单词，构建出特征词词典。
根据建立好的词典，对输入的数据进行编码。

2.1 分词

这里我们直接使用nltk.tokenize库中的word_tokenize分词函数。

from nltk.tokenize import word_tokenizeword_tokenize("Today is a beatiful day!")
['Today', 'is', 'a', 'beatiful', 'day', '!']

我们看到上面结果有“！”，所以接下来我们要判断分词结果是否为单词。

2.2 标点符号判断

《Applied text analysis with python》一书中判别分词结果是否为符号代码为

def is_punct(token):return all(unicodedata.category(char).startswith('P') for char in token)

测试了下发现，category(符号)，结果为“Po”。

import unicodedata#这里以“！”做个测试
unicodedata.category('!')
Po

而all(data)函数是Python内置函数，当data内各个元素一致时返回True，否则返回False。

print(all([True, False]))
print(all([True, True]))
False
True

2.3 停止词

nltk提供了丰富的文本分析工具，停止词表全部为小写单词，所以判断前要先将token小写化。

def is_stopword(token):stopwords = nltk.corpus.stopwords.words('english')return token.lower() in stopwords

2.4 词干化

对单复数、不同时态、不同语态等异形词归并为一个统一词。这里有stem和lemmatize两种实现方法,下面我们分别看看算法。

2.4.1 stem

import nltkdef stem(token):stem = nltk.stem.SnowballStemmer('english')return stem.stem(token)

2.4.2 lemmatize

from nltk.corpus import wordnet as wnfrom nltk.stem.wordnet import WordNetLemmatizerdef lemmatize(token, pos_tag):lemmatizer = WordNetLemmatizer()tag = {'N': wn.NOUN,'V': wn.VERB,'R': wn.ADV,'J': wn.ADJ}.get(pos_tag[0])if tag:return lemmatizer.lemmatize(token.lower(), tag)else:return None    
print(stem('better'))
print(lemmatize('better', 'JJ'))
better
good

从中我们可以看出lemmatize更准确，对于小数据量的分析，为了力求精准我个人建议用lemmatize。

2.5 清洗数据

def clean(document):return [lemmatize(token, tag)  for (token, tag) in nltk.pos_tag(word_tokenize(document)) if not is_punct(token) and not is_stopword(token)]print(clean('He was a soldier 20 years ago!'))
['soldier', None, 'year', 'ago']

结果中出现None，这是不能允许的。原因应该是lemmatize函数。所以我们要加一个判断

def clean(document):return [lemmatize(token, tag) for (token, tag) in nltk.pos_tag(word_tokenize(document))if not is_punct(token) and not is_stopword(token) and lemmatize(token, tag)]print(clean('He was a soldier 20 years ago!'))
['soldier', 'year', 'ago']

2.6 构建词典-fit

我们需要将待分析的文本数据中抽取出所有的特征词，并将其存入一个词典列表中。思路：凡是新出现，不存在于词典列表vocab中，就将其加入到vocab中。

def fit(X, y=None):vocab = []for doc in X:for token in clean(doc):if token not in vocab:vocab.append(token)return vocabX = ["The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes",    "Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats","Wondering, she opened the door to the studio.\nHaha!good"]print(fit(X))
['elephant', 'sneeze', 'sight', 'potatoes.its', 'interesting', 'thing', 'potato', 'bat', 'see', 'echolocation', 'wondering', 'open', 'door', 'studio', 'haha', 'good']

词典已经构建好了。

2.7 对待分析文本数据编码-transform

根据构建好的词典列表，我们开始对文本数据进行转码。思路不难，只要对文档分词结果与词典列表一一分析，该特征词出现几次就为几。

def transform(documents):vacab = fit(documents)for doc in documents:result = []tokens = clean(doc)for va in vacab:result.append(tokens.count(va))yield resultdocuments = ["The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes","Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats","Wondering, she opened the door to the studio.\nHaha!good"]print(list(transform(documents)))
[[1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 1, 1, 0, 0, 0, 0, 3, 2, 1, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]

三、完整版

现在我们将上面的代码合并为TextExtractFeature类

import nltkimport unicodedatafrom collections import defaultdictfrom nltk.corpus import wordnet as wnfrom nltk.stem.wordnet import WordNetLemmatizerfrom nltk.tokenize import word_tokenizeclass TextExtractFeature(object):def __init__(self, language='english'):self.stopwords = set(nltk.corpus.stopwords.words(language))self.lemmatizer = WordNetLemmatizer()def is_punct(self, token):return all(unicodedata.category(char).startswith('P') for char in token)def is_stopword(self, token):return token.lower() in self.stopwords        def lemmatize(self, token, pos_tag):tag = {'N': wn.NOUN,'V': wn.VERB,'R': wn.ADV,'J': wn.ADJ}.get(pos_tag[0])if tag:return self.lemmatizer.lemmatize(token.lower(), tag)else:return None        def clean(self, document):return [self.lemmatize(token, tag).lower() for (token, tag) in nltk.pos_tag(word_tokenize(document)) if not self.is_punct(token) and not self.is_stopword(token) and self.lemmatize(token, tag)]def fit(self, X, y=None):self.y = yself.vocab = []self.feature_names = defaultdict(int)for doc in X:for token in self.clean(doc):if token not in self.vocab:self.feature_names[token] = len(self.vacab)self.vocab.append(token)def get_feature_names(self):return self.feature_names        def transform(self, documents):for idx,doc in enumerate(documents):result = []tokens = self.clean(doc)for va in self.vocab:result.append(tokens.count(va))if self.y:result.append(self.y[idx])yield result
documents = ["The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes","Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats","Wondering, she opened the door to the studio.\nHaha!good",]y = [1, 1, 1]tef = TextExtractFeature(language='english')#构建词典tef.fit(documents, y)#打印词典映射关系。即特征词print(tef.get_feature_names())for s in tef.transform(documents):print(s)
defaultdict(<class 'int'>, {'elephant': 0, 'sneeze': 1, 'sight': 2, 'potatoes.its': 3, 'interesting': 4, 'thing': 5, 'potato': 6, 'bats': 7, 'see': 8, 'echolocation': 9,  'bat': 10, 'wondering': 11,'open': 12,  'door': 13, 'studio': 14, 'haha': 15, 'good': 16})[1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0, 1, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

如何学习大模型

现在社会上大模型越来越普及了，已经有很多人都想往这里面扎，但是却找不到适合的方法去学习。

作为一名资深码农，初入大模型时也吃了很多亏，踩了无数坑。现在我想把我的经验和知识分享给你们，帮助你们学习AI大模型，能够解决你们学习中的困难。

我已将重要的AI大模型资料包括市面上AI大模型各大白皮书、AGI大模型系统学习路线、AI大模型视频教程、实战学习，等录播视频免费分享出来，需要的小伙伴可以扫取。

一、AGI大模型系统学习路线

很多人学习大模型的时候没有方向，东学一点西学一点，像只无头苍蝇乱撞，我下面分享的这个学习路线希望能够帮助到你们学习AI大模型。

在这里插入图片描述

二、AI大模型视频教程

在这里插入图片描述

三、AI大模型各大学习书籍

在这里插入图片描述

四、AI大模型各大场景实战案例

在这里插入图片描述

五、结束语

学习AI大模型是当前科技发展的趋势，它不仅能够为我们提供更多的机会和挑战，还能够让我们更好地理解和应用人工智能技术。通过学习AI大模型，我们可以深入了解深度学习、神经网络等核心概念，并将其应用于自然语言处理、计算机视觉、语音识别等领域。同时，掌握AI大模型还能够为我们的职业发展增添竞争力，成为未来技术领域的领导者。

再者，学习AI大模型也能为我们自己创造更多的价值，提供更多的岗位以及副业创收，让自己的生活更上一层楼。

因此，学习AI大模型是一项有前景且值得投入的时间和精力的重要选择。