MG4J

洪捷
2023-12-01

MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java.

The main points of MG4J are:

    * Powerful indexing. Support for document collections and factories makes it possible to analyse, index, and query large document collections consistently, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
    * Efficiency. We do not provide meaningless data such as "we index x GiB per second" (with which configuration? which language? which data source?); we invite you to try it instead. MG4J can index the TREC GOV2 collection without effort (document factories are provided for this purpose) and scales to hundreds of millions of documents.
    * Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
    * Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
    * Virtual fields. MG4J supports virtual fields, that is, fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
    * Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It's up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
    * Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine that accesses your data directly. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
    * Distributed processing. Indices can be built for a collection split into several parts and combined later. Combination of indices allows non-contiguous indices, and even the same document can be split across different collections (e.g., when indexing anchor text).
    * Multithreading. Indices can be queried and scored concurrently.
    * Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.
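
To make the interval-semantics point above more concrete, here is a minimal, self-contained sketch of what "a list of intervals satisfying the query" can look like for a two-term conjunction inside one document. It does not use the MG4J API and is not MG4J's algorithm: the class `MinimalIntervals` and the method `and` are hypothetical names chosen for illustration, and the sketch assumes two distinct terms whose word positions are given as sorted arrays. It merges the two lists in one linear pass and reports every interval that contains both terms and has no smaller such interval inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: hypothetical names, not part of MG4J.
public class MinimalIntervals {

    /**
     * Returns the minimal intervals {left, right} that contain at least one
     * occurrence of each of two distinct terms. Both inputs must be sorted
     * arrays of word positions in the same document.
     * Runs in O(a.length + b.length) time.
     */
    static List<int[]> and(int[] a, int[] b) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        int prev = 0;              // position of the previous occurrence
        boolean prevFromA = false; // which list the previous occurrence came from
        boolean hasPrev = false;   // false until the first occurrence is read

        while (i < a.length || j < b.length) {
            // Take the smaller head of the two lists (a plain merge step).
            boolean takeA = j == b.length || (i < a.length && a[i] < b[j]);
            int pos = takeA ? a[i++] : b[j++];

            // Two neighbouring occurrences of different terms bound an
            // interval that cannot be shrunk without losing one term.
            if (hasPrev && prevFromA != takeA) {
                result.add(new int[] {prev, pos});
            }
            prev = pos;
            prevFromA = takeA;
            hasPrev = true;
        }
        return result;
    }

    public static void main(String[] args) {
        // First term at positions 1 and 5, second term at positions 3 and 9.
        for (int[] interval : and(new int[] {1, 5}, new int[] {3, 9})) {
            System.out.println("[" + interval[0] + ", " + interval[1] + "]");
        }
    }
}
```

A scorer built on such a list could, for example, prefer documents whose intervals are short, since short intervals mean the query terms occur close together; operators such as proximity restrictions can be read as filters on interval length.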

MG4J is free software distributed under the GNU Lesser General Public License.

Homepage: http://mg4j.dsi.unimi.it/
