Section 301: Lucene Tutorial - Lucene Indexing Files

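Indexing is the process of identifying documents and preparing them for search.

The following table lists the classes commonly used in the indexing process.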

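Class          Description
IndexWriter    Creates or updates the index during the indexing process.
Directory      Represents the storage location of the index.
Analyzer       Analyzes a document and extracts tokens (words) from its text.
Document       A virtual document made up of fields. The Analyzer can process a Document.
Field          The lowest unit of the indexing process. It represents a key-value pair, where the key identifies the value to be indexed.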
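To make the roles in the table more concrete, here is a toy model in plain Java. It is not Lucene code, and every name in it (ToyInvertedIndex, analyze, addDocument, search) is hypothetical. It only sketches the general idea of analyzing field text into tokens and recording which documents contain each token, on the assumption that a simple lowercase-and-split analyzer is enough for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of indexing. None of these names are Lucene APIs. */
final class ToyInvertedIndex {
    // "field:term" -> ids of the documents that contain the term in that field
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Plays the Analyzer role: lowercases the text and splits it into tokens. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Plays the IndexWriter role: a document is a map of field name to field text. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : analyze(field.getValue())) {
                // The field name is part of the key, so the same word in
                // two different fields is indexed separately.
                String key = field.getKey() + ":" + token;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents whose given field contains the term. */
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(Map.of("contents", "Lucene indexes text files"));
        index.addDocument(Map.of("contents", "Files are read as UTF-8 text"));
        System.out.println(index.search("contents", "text"));
        System.out.println(index.search("contents", "lucene"));
    }
}
```

In this sketch the map of field names to text stands in for a Document with its Fields, and the in-memory TreeMap stands in for the Directory where a real index would be stored.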

Example

The following code shows how to index text files with Lucene.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Date;

/**
 * Index all text files under a directory.
 * <p>
 * This is a command-line application demonstrating simple Lucene indexing.
 * Run it with no command-line arguments for usage information.
 */
public class Main {

  private Main() {}

  /** Index all text files under a directory. */
  public static void main(String[] args) {
    String usage = "java Main"
                 + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                 + "This indexes the documents in DOCS_PATH, creating a Lucene index "
                 + "in INDEX_PATH that can be searched with SearchFiles";
    String indexPath = "index";
    String docsPath = null;
    boolean create = true;
    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        indexPath = args[i + 1];
        i++;
      } else if ("-docs".equals(args[i])) {
        docsPath = args[i + 1];
        i++;
      } else if ("-update".equals(args[i])) {
        create = false;
      }
    }

    if (docsPath == null) {
      System.err.println("Usage: " + usage);
      System.exit(1);
    }

    final File docDir = new File(docsPath);
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath()
          + "' does not exist or is not readable, please check the path");
      System.exit(1);
    }

    Date start = new Date();
    try {
      System.out.println("Indexing to directory '" + indexPath + "'...");

      Directory dir = FSDirectory.open(new File(indexPath));
      // :Post-Release-Update-Version.LUCENE_XY:
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);

      if (create) {
        // Create a new index in the directory, removing any
        // previously indexed documents:
        iwc.setOpenMode(OpenMode.CREATE);
      } else {
        // Add new documents to an existing index:
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      }

      // Optional: for better indexing performance, if you
      // are indexing many documents, increase the RAM
      // buffer.  But if you do this, increase the max heap
      // size to the JVM (eg add -Xmx512m or -Xmx1g):
      //
      // iwc.setRAMBufferSizeMB(256.0);

      IndexWriter writer = new IndexWriter(dir, iwc);
      indexDocs(writer, docDir);

      // NOTE: if you want to maximize search performance,
      // you can optionally call forceMerge here.  This can be
      // a terribly costly operation, so generally it's only
      // worth it when your index is relatively static (ie
      // you're done adding documents to it):
      //
      // writer.forceMerge(1);

      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() + " total milliseconds");

    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass()
          + "\n with message: " + e.getMessage());
    }
  }

  /**
   * Indexes the given file using the given writer, or if a directory is given,
   * recurses over files and directories found under the given directory.
   *
   * NOTE: This method indexes one document per input file.  This is slow.  For good
   * throughput, put multiple documents into your input file(s).  An example of this is
   * in the benchmark module, which can create "line doc" files, one document per line,
   * using the
   * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
   * >WriteLineDocTask</a>.
   *
   * @param writer Writer to the index where the given file/dir info will be stored
   * @param file The file to index, or the directory to recurse into to find files to index
   * @throws IOException If there is a low-level I/O error
   */
  static void indexDocs(IndexWriter writer, File file) throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {

        FileInputStream fis;
        try {
          fis = new FileInputStream(file);
        } catch (FileNotFoundException fnfe) {
          // at least on windows, some temporary files raise this exception with an "access denied" message
          // checking if the file can be read doesn't help
          return;
        }

        try {

          // make a new, empty document
          Document doc = new Document();

          // Add the path of the file as a field named "path".  Use a
          // field that is indexed (i.e. searchable), but don't tokenize
          // the field into separate words and don't index term frequency
          // or positional information:
          Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
          doc.add(pathField);

          // Add the last modified date of the file a field named "modified".
          // Use a LongField that is indexed (i.e. efficiently filterable with
          // NumericRangeFilter).  This indexes to milli-second resolution, which
          // is often too fine.  You could instead create a number based on
          // year/month/day/hour/minutes/seconds, down the resolution you require.
          // For example the long value 2011021714 would mean
          // February 17, 2011, 2-3 PM.
          doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

          // Add the contents of the file to a field named "contents".  Specify a Reader,
          // so that the text of the file is tokenized and indexed, but not stored.
          // Note that FileReader expects the file to be in UTF-8 encoding.
          // If that's not the case searching for special characters will fail.
          doc.add(new TextField("contents",
              new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));

          if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
          } else {
            // Existing index (an old copy of this document may have been indexed) so
            // we use updateDocument instead to replace the old one matching the exact
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.getPath()), doc);
          }

        } finally {
          fis.close();
        }
      }
    }
  }
}
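The comment on the "modified" field in the code above suggests storing a coarser number than milliseconds, such as 2011021714 for February 17, 2011, 2-3 PM. A minimal sketch of that conversion in plain Java follows. The class name HourResolution and the method toHourResolution are hypothetical helpers, not part of Lucene, and the sketch assumes hour resolution and an explicitly chosen time zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Hypothetical helper, not part of Lucene. */
final class HourResolution {

    /** Encodes epoch milliseconds as the number yyyyMMddHH in the given zone. */
    static long toHourResolution(long epochMillis, ZoneId zone) {
        LocalDateTime t = LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), zone);
        return t.getYear() * 1_000_000L      // yyyy occupies the leading digits
             + t.getMonthValue() * 10_000L   // MM
             + t.getDayOfMonth() * 100L      // dd
             + t.getHour();                  // HH, 0 to 23
    }

    public static void main(String[] args) {
        // February 17, 2011, 14:35:10 UTC expressed as epoch milliseconds
        long millis = LocalDateTime.of(2011, 2, 17, 14, 35, 10)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        System.out.println(toHourResolution(millis, ZoneOffset.UTC)); // prints 2011021714
    }
}
```

A value like this could be passed to the LongField for "modified" in place of file.lastModified(). The trade-off is that all files modified within the same hour then share one value.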

