Lucene Study Notes, Part 9: Lucene Query Objects (1)

forfuture1978


Category: Lucene Study Notes


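Besides its query syntax, Lucene lets you build query objects directly and search with them.

As the previous chapter on Lucene's query syntax showed, the query objects that correspond to a query-syntax construct are: BooleanQuery, FuzzyQuery, MatchAllDocsQuery, MultiTermQuery, MultiPhraseQuery, PhraseQuery, PrefixQuery, TermRangeQuery, TermQuery, and WildcardQuery.

Lucene also offers query objects that have no counterpart in the query syntax but implement more advanced features. This chapter focuses on those.

Their main class hierarchy is shown below; we go through them one by one.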

Query
- BoostingQuery
- CustomScoreQuery
- MoreLikeThisQuery
- MultiTermQuery
  - NumericRangeQuery<T>
  - TermRangeQuery
- SpanQuery
  - FieldMaskingSpanQuery
  - SpanFirstQuery
  - SpanNearQuery
    - PayloadNearQuery
  - SpanNotQuery
  - SpanOrQuery
  - SpanRegexQuery
  - SpanTermQuery
    - PayloadTermQuery
- FilteredQuery

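1. BoostingQuery

BoostingQuery has three member variables:

Query match: the query that every result must satisfy.
Query context: this query has no effect on which documents are returned; it only multiplies a document's score by boost when the document also matches context.
float boost: the factor applied to documents that match context.

The BoostingQuery constructor: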

public BoostingQuery(Query match, Query context, float boost) {
    this.match = match;
    this.context = (Query) context.clone();
    this.boost = boost;
    this.context.setBoost(0.0f);
}

The rewrite function of BoostingQuery:

public Query rewrite(IndexReader reader) throws IOException {
    BooleanQuery result = new BooleanQuery() {
        @Override
        public Similarity getSimilarity(Searcher searcher) {
            return new DefaultSimilarity() {
                @Override
                public float coord(int overlap, int max) {
                    switch (overlap) {
                        case 1:           // only the match clause matched
                            return 1.0f;
                        case 2:           // both match and context matched
                            return boost;
                        default:
                            return 0.0f;
                    }
                }
            };
        }
    };
    result.add(match, BooleanClause.Occur.MUST);
    result.add(context, BooleanClause.Occur.SHOULD);
    return result;
}

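As the implementation shows, BoostingQuery ultimately rewrites to a BooleanQuery. The first clause is the match query with MUST (required); the second is the context query with SHOULD (optional).

However, the analysis of the search process showed that even an optional clause contributes to the overall score.

That is why the constructor sets the boost of the context query to zero: whether or not a document matches context, the context clause itself adds nothing to the final score.

The rewrite function overrides coord in DefaultSimilarity. When a document matches only the match query, coord returns 1; when it matches both match and context, coord returns boost, so the final score is multiplied by boost.

Let us run an experiment.

Index the following files:

file01: apple other other other boy

file02: apple apple other other other

file03: apple apple apple other other

file04: apple apple apple apple other

For the following query (1):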

TermQuery must = new TermQuery(new Term("contents","apple"));
TermQuery context = new TermQuery(new Term("contents","boy"));
BoostingQuery query = new BoostingQuery(must, context, 1f);

or the following query (2):

TermQuery query = new TermQuery(new Term("contents","apple"));

both return the same results:

docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554
docid : 0 score : 0.33987468

Naturally, the more often a document contains apple, the higher it scores.

The two scores are computed differently, though. Running explain on query (2), the plain TermQuery, gives:

docid : 0 score : 0.33987468
0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
  1.0 = tf(termFreq(contents:apple)=1)
  0.7768564 = idf(docFreq=4, maxDocs=4)
  0.4375 = fieldNorm(field=contents, doc=0)

Running explain on query (1), the BoostingQuery, gives:

docid : 0 score : 0.33987468
0.33987468 = (MATCH) sum of:
  0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
    1.0 = tf(termFreq(contents:apple)=1)
    0.7768564 = idf(docFreq=4, maxDocs=4)
    0.4375 = fieldNorm(field=contents, doc=0)
  0.0 = (MATCH) weight(contents:boy^0.0 in 0), product of:
    0.0 = queryWeight(contents:boy^0.0), product of:
      0.0 = boost
      1.6931472 = idf(docFreq=1, maxDocs=4)
      1.2872392 = queryNorm
    0.74075186 = (MATCH) fieldWeight(contents:boy in 0), product of:
      1.0 = tf(termFreq(contents:boy)=1)
      1.6931472 = idf(docFreq=1, maxDocs=4)
      0.4375 = fieldNorm(field=contents, doc=0)

So in query (1) the boy clause is evaluated, but it contributes nothing because its boost is 0.

Now let us change boost so that the score of documents containing boy is multiplied by 10:

TermQuery must = new TermQuery(new Term("contents","apple"));
TermQuery context = new TermQuery(new Term("contents","boy"));
BoostingQuery query = new BoostingQuery(must, context, 10f);

The results are:

docid : 0 score : 3.398747
docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554

The scoring details from explain are:

docid : 0 score : 3.398747
3.398747 = (MATCH) product of:
  0.33987468 = (MATCH) sum of:
    0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
      1.0 = tf(termFreq(contents:apple)=1)
      0.7768564 = idf(docFreq=4, maxDocs=4)
      0.4375 = fieldNorm(field=contents, doc=0)
    0.0 = (MATCH) weight(contents:boy^0.0 in 0), product of:
      0.0 = queryWeight(contents:boy^0.0), product of:
        0.0 = boost
        1.6931472 = idf(docFreq=1, maxDocs=4)
        1.2872392 = queryNorm
      0.74075186 = (MATCH) fieldWeight(contents:boy in 0), product of:
        1.0 = tf(termFreq(contents:boy)=1)
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.4375 = fieldNorm(field=contents, doc=0)
  10.0 = coord(2/2)
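
The arithmetic in this explain output can be checked without Lucene. The following is only a plain-Java sketch of the scoring rule, not Lucene code: the class BoostingCoordSketch and its methods are hypothetical names, and the context clause is assumed to score exactly 0 because its boost was set to 0.

```java
/** Plain-Java sketch of how the overridden coord() scales a BoostingQuery score. */
final class BoostingCoordSketch {

    /** Mirrors the coord override: one matching clause gives 1, two give boost. */
    static float coord(int overlap, float boost) {
        switch (overlap) {
            case 1:
                return 1.0f;
            case 2:
                return boost;
            default:
                return 0.0f;
        }
    }

    /** Sum of the clause scores multiplied by coord. */
    static float score(float matchScore, boolean matchesContext, float boost) {
        float contextScore = 0.0f;              // assumption: context boost is 0, so it adds nothing
        int overlap = matchesContext ? 2 : 1;   // number of clauses the document matched
        return (matchScore + contextScore) * coord(overlap, boost);
    }

    public static void main(String[] args) {
        // tf * idf * fieldNorm for contents:apple in document 0, taken from the explain output
        float matchScore = 1.0f * 0.7768564f * 0.4375f;
        System.out.println(score(matchScore, true, 10f));   // document 0 also contains boy
        System.out.println(score(matchScore, false, 10f));  // a document without boy
    }
}
```

Documents that match only the apple clause keep their original score, which is why documents 1 to 3 are unchanged in the result list.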

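2. CustomScoreQuery

CustomScoreQuery has the following main member variables:

Query subQuery: the sub-query.
ValueSourceQuery[] valSrcQueries: additional value sources.

A ValueSourceQuery mainly holds a ValueSource valSrc member, which represents one value source.

During a search, ValueSourceQuery creates a ValueSourceWeight, which in turn creates a ValueSourceScorer. The score function of ValueSourceScorer is:

public float score() throws IOException {
    return qWeight * vals.floatVal(termDocs.doc());
}

Here vals = valSrc.getValues(reader) is of type DocValues, which returns a value for a given document number.

So CustomScoreQuery combines the sub-query score with the other value sources to produce the final score, and the formula can be customized. The default implementation is: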

public float customScore(int doc, float subQueryScore, float valSrcScores[]) {
    if (valSrcScores.length == 1) {
        return customScore(doc, subQueryScore, valSrcScores[0]);
    }
    if (valSrcScores.length == 0) {
        return customScore(doc, subQueryScore, 1);
    }
    float score = subQueryScore;
    for (int i = 0; i < valSrcScores.length; i++) {
        score *= valSrcScores[i];
    }
    return score;
}

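What kind of value source would typically affect a document's score?

Take an article's author, which might be kept in a Field. If we decide that articles by well-known authors should rank higher, we can let the value of that Field influence the score.

Reading a field value through the reader for every document number would be slow, however, so Lucene provides FieldCache to cache the values. FieldCache does not read from stored fields; it reads from indexed fields, so no Document object has to be built. This requires the indexed field to be un-analyzed, with exactly one token per document.

Hence FieldCacheSource extends ValueSource, and most value sources extend FieldCacheSource. Its most important function is:

public final DocValues getValues(IndexReader reader) throws IOException {
    return getCachedFieldValues(FieldCache.DEFAULT, field, reader);
}

Taking ByteFieldSource as an example, its getCachedFieldValues function is: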

public DocValues getCachedFieldValues (FieldCache cache, String field, IndexReader reader) throws IOException {
    final byte[] arr = cache.getBytes(reader, field, parser);
    return new DocValues() {
        @Override
        public float floatVal(int doc) {
            return (float) arr[doc];
        }
        @Override
        public int intVal(int doc) {
            return arr[doc];
        }
        @Override
        public String toString(int doc) {
            return description() + '=' + intVal(doc);
        }
        @Override
        Object getInnerArray() {
            return arr;
        }
    };
}

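In the end, DocValues yields a float value for each document number, and that value influences the score.

Returning to the author example: suppose each author is given a float rating stored in an indexed field. CustomScoreQuery can then fold that rating into the score.

FieldScoreQuery is one implementation of ValueSourceQuery.

For example, index the following files:

file01: apple other other other boy

file02: apple apple other other other

file03: apple apple apple other other

file04: apple apple apple apple other

During indexing, the "scorefield" field of file01 is indexed with "10", and the "scorefield" field of every other file with "1":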

Document doc = new Document();
doc.add(new Field("contents", new FileReader(file)));
if (file.getName().contains("01")) {
    doc.add(new Field("scorefield", "10", Field.Store.NO, Field.Index.NOT_ANALYZED));
} else {
    doc.add(new Field("scorefield", "1", Field.Store.NO, Field.Index.NOT_ANALYZED));
}
writer.addDocument(doc);

Against this index, the query TermQuery query = new TermQuery(new Term("contents", "apple")); returns:

docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554
docid : 0 score : 0.33987468

Naturally, documents containing "apple" more often score higher.

With CustomScoreQuery, however:

TermQuery subquery = new TermQuery(new Term("contents","apple"));
FieldScoreQuery scorefield = new FieldScoreQuery("scorefield", FieldScoreQuery.Type.BYTE);
CustomScoreQuery query = new CustomScoreQuery(subquery, scorefield);

the results become:

docid : 0 score : 1.6466033
docid : 3 score : 0.32932067
docid : 2 score : 0.28520006
docid : 1 score : 0.23286487

Document 0 clearly jumps to first place because its value-source score is 10.

With explain, the ordinary query scores document 0 at 0.33987468, as in the first result list above.

For CustomScoreQuery, the scoring details of document 0 are:

docid : 0 score : 1.6466033
1.6466033 = (MATCH) custom(contents:apple, byte(scorefield)), product of:
  1.6466033 = custom score: product of:
    0.20850874 = (MATCH) weight(contents:apple in 0), product of:
      0.6134871 = queryWeight(contents:apple), product of:
        0.7768564 = idf(docFreq=4, maxDocs=4)
        0.7897047 = queryNorm
      0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
        1.0 = tf(termFreq(contents:apple)=1)
        0.7768564 = idf(docFreq=4, maxDocs=4)
        0.4375 = fieldNorm(field=contents, doc=0)
    7.897047 = (MATCH) byte(scorefield), product of:
      10.0 = byte(scorefield)=10
      1.0 = boost
      0.7897047 = queryNorm
  1.0 = queryBoost
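
The custom score in this output is just the product of the sub-query score and the value-source score. Below is a small plain-Java sketch of that default combination, with the numbers copied from the explain output; CustomScoreSketch is a hypothetical class written for illustration and does not call Lucene.

```java
/** Plain-Java sketch of the default CustomScoreQuery combination (a product of scores). */
final class CustomScoreSketch {

    /** The sub-query score multiplied by every value-source score. */
    static float customScore(float subQueryScore, float[] valSrcScores) {
        float score = subQueryScore;
        for (float v : valSrcScores) {
            score *= v;
        }
        return score;
    }

    public static void main(String[] args) {
        float queryNorm = 0.7897047f;
        // queryWeight * fieldWeight for contents:apple in document 0
        float subQueryScore = 0.6134871f * 0.33987468f;
        // field value * boost * queryNorm for byte(scorefield)
        float valSrcScore = 10.0f * 1.0f * queryNorm;
        System.out.println(customScore(subQueryScore, new float[] {valSrcScore}));
    }
}
```

Because queryNorm appears in both factors, the absolute scores differ from those of the plain TermQuery, but the ordering among documents 1 to 3 is preserved.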

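3. MoreLikeThisQuery

Before analyzing MoreLikeThisQuery, we first introduce MoreLikeThis.

Search applications often need features such as "more similar articles" or "more related questions", that is, finding documents in the index that resemble the text of the current document.

MoreLikeThis implements this: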

IndexReader reader = IndexReader.open(……);
IndexSearcher searcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
Reader target = ... // an I/O Reader over the text of the current document
Query query = mlt.like(target); // build a query object from that text
Hits hits = searcher.search(query); // search for similar documents

The Query like(Reader r) function of MoreLikeThis is:

public Query like(Reader r) throws IOException {
    // Extract terms from the current document's text, then build a query object from them.
    return createQuery(retrieveTerms(r));
}

public PriorityQueue<Object[]> retrieveTerms(Reader r) throws IOException {
    Map<String,Int> words = new HashMap<String,Int>();
    // Extract terms field by field; the fields are set with void setFieldNames(String[] fieldNames).
    for (int i = 0; i < fieldNames.length; i++) {
        String fieldName = fieldNames[i];
        addTermFrequencies(r, words, fieldName);
    }
    // Put the extracted terms into a priority queue.
    return createQueue(words);
}

private void addTermFrequencies(Reader r, Map<String,Int> termFreqMap, String fieldName) throws IOException
{
    // Tokenize the text first; the analyzer is set with void setAnalyzer(Analyzer analyzer).
    TokenStream ts = analyzer.tokenStream(fieldName, r);
    int tokenCount = 0;
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    // Visit every token.
    while (ts.incrementToken()) {
        String word = termAtt.term();
        tokenCount++;
        // Stop once the number of tokens exceeds the limit set with void setMaxNumTokensParsed(int i).
        if (tokenCount > maxNumTokensParsed)
        {
            break;
        }
        // A word shorter than the minimum length, longer than the maximum length,
        // or contained in the stop-word set is a noise word.
        // Minimum length: void setMinWordLen(int minWordLen).
        // Maximum length: void setMaxWordLen(int maxWordLen).
        // Stop-word set: void setStopWords(Set<?> stopWords).
        if (isNoiseWord(word)) {
            continue;
        }
        // Count the term frequency tf.
        Int cnt = termFreqMap.get(word);
        if (cnt == null) {
            termFreqMap.put(word, new Int());
        }
        else {
            cnt.x++;
        }
    }
}

private PriorityQueue createQueue(Map<String,Int> words) throws IOException {
    // Build the priority queue from the collected terms and their frequencies.
    int numDocs = ir.numDocs();
    FreqQ res = new FreqQ(words.size()); // priority queue ordered by tf*idf
    Iterator<String> it = words.keySet().iterator();
    // Visit every word.
    while (it.hasNext()) {
        String word = it.next();
        int tf = words.get(word).x;
        // Skip words below the minimum term frequency, set with void setMinTermFreq(int minTermFreq).
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue;
        }
        // Across all fields, find the field that contains the word with the highest document frequency.
        String topField = fieldNames[0];
        int docFreq = 0;
        for (int i = 0; i < fieldNames.length; i++) {
            int freq = ir.docFreq(new Term(fieldNames[i], word));
            topField = (freq > docFreq) ? fieldNames[i] : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        // Skip words below the minimum document frequency, set with void setMinDocFreq(int minDocFreq).
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue;
        }
        // Skip words above the maximum document frequency, set with void setMaxDocFreq(int maxFreq).
        if (docFreq > maxDocFreq) {
            continue;
        }
        // Skip words that do not occur in the index at all.
        if (docFreq == 0) {
            continue;
        }
        // Compute the score tf*idf.
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        // Put an Object array into the queue; only the first three entries are used,
        // and the queue is ordered by the third one (score).
        res.insertWithOverflow(new Object[]{word, // the word
            topField,                    // the field
            Float.valueOf(score),        // the score
            Float.valueOf(idf),          // idf
            Integer.valueOf(docFreq),    // document frequency
            Integer.valueOf(tf)          // term frequency
        });
    }
    return res;
}

private Query createQuery(PriorityQueue q) {
    // The result is a boolean query.
    BooleanQuery query = new BooleanQuery();
    Object cur;
    int qterms = 0;
    float bestScore = 0;
    // Keep popping the highest-scoring word from the queue.
    while (((cur = q.pop()) != null)) {
        Object[] ar = (Object[]) cur;
        TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
        if (boost) {
            if (qterms == 0) {
                // The first word has the highest score; keep it as bestScore.
                bestScore = ((Float) ar[2]).floatValue();
            }
            float myScore = ((Float) ar[2]).floatValue();
            // Each word's score divided by the best score, times boostFactor, becomes the boost
            // of its TermQuery, so words that score higher in the source text weigh more in the query.
            tq.setBoost(boostFactor * myScore / bestScore);
        }
        try {
            query.add(tq, BooleanClause.Occur.SHOULD);
        }
        catch (BooleanQuery.TooManyClauses ignore) {
            break;
        }
        qterms++;
        // Stop at the maximum number of query terms, set with void setMaxQueryTerms(int maxQueryTerms).
        if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
            break;
        }
    }
    return query;
}
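
To make the term selection concrete, here is a plain-Java sketch of the same idea: rank terms by tf*idf, keep the best ones, and derive each boost from the best score. It is only an illustration under stated assumptions. MoreLikeThisSketch, selectTerms and demo are hypothetical names, the frequencies in demo are invented sample numbers, and idf uses the DefaultSimilarity formula ln(numDocs / (docFreq + 1)) + 1, which agrees with the idf values in the explain outputs of the earlier sections.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of MoreLikeThis term selection; it does not use Lucene. */
final class MoreLikeThisSketch {

    /** idf as in DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Returns term -> boost for the top maxQueryTerms terms, best term first. */
    static Map<String, Float> selectTerms(Map<String, Integer> termFreqs,
                                          Map<String, Integer> docFreqs,
                                          int numDocs, int maxQueryTerms, float boostFactor) {
        Map<String, Float> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer docFreq = docFreqs.get(e.getKey());
            if (docFreq == null || docFreq == 0) {
                continue; // the term does not occur in the index
            }
            scores.put(e.getKey(), e.getValue() * idf(docFreq, numDocs)); // tf * idf
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Float.compare(scores.get(b), scores.get(a))); // highest score first

        Map<String, Float> boosts = new LinkedHashMap<>();
        if (ranked.isEmpty()) {
            return boosts;
        }
        float bestScore = scores.get(ranked.get(0));
        for (String term : ranked) {
            if (maxQueryTerms > 0 && boosts.size() >= maxQueryTerms) {
                break;
            }
            boosts.put(term, boostFactor * scores.get(term) / bestScore);
        }
        return boosts;
    }

    /** Invented sample data: three terms in a 100-document index, keeping the top two. */
    static Map<String, Float> demo() {
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("the", 9);
        tf.put("query", 3);
        tf.put("lucene", 5);
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("the", 100);
        df.put("query", 10);
        df.put("lucene", 2);
        return selectTerms(tf, df, 100, 2, 1.0f);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the sample, the very common term is pushed out even though it has the highest term frequency, because its idf is low.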

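MoreLikeThisQuery is just a wrapper around MoreLikeThis. It holds the parameters MoreLikeThis needs and, during rewrite, calls MoreLikeThis.like to build the query object.

String likeText; the text of the current document
String[] moreLikeFields; the fields from which query terms are extracted
Analyzer analyzer; the analyzer
float percentTermsToMatch=0.3f; the clauses of the generated BooleanQuery are all SHOULD; this is the fraction of them that must match
int minTermFrequency=1; minimum term frequency
int maxQueryTerms=5; maximum number of query terms
Set<?> stopWords=null; stop-word set
int minDocFreq=-1; minimum document frequency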

public Query rewrite(IndexReader reader) throws IOException
{
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(moreLikeFields);
    mlt.setAnalyzer(analyzer);
    mlt.setMinTermFreq(minTermFrequency);
    if (minDocFreq >= 0)
    {
        mlt.setMinDocFreq(minDocFreq);
    }
    mlt.setMaxQueryTerms(maxQueryTerms);
    mlt.setStopWords(stopWords);
    BooleanQuery bq = (BooleanQuery) mlt.like(new ByteArrayInputStream(likeText.getBytes()));
    BooleanClause[] clauses = bq.getClauses();
    bq.setMinimumNumberShouldMatch((int)(clauses.length*percentTermsToMatch));
    return bq;
}
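
The last step truncates clauses.length * percentTermsToMatch to an int, which matters for small clause counts. A minimal sketch of that arithmetic follows; MinShouldMatchSketch is a hypothetical helper, not part of Lucene.

```java
/** Plain-Java sketch of how rewrite derives the minimum number of SHOULD clauses. */
final class MinShouldMatchSketch {

    /** Same expression as in rewrite: the product is truncated toward zero. */
    static int minimumShouldMatch(int clauseCount, float percentTermsToMatch) {
        return (int) (clauseCount * percentTermsToMatch);
    }

    public static void main(String[] args) {
        for (int clauses : new int[] {1, 5, 7}) {
            System.out.println(clauses + " clauses -> " + minimumShouldMatch(clauses, 0.3f));
        }
    }
}
```

With the default 0.3f and at most three clauses the value is 0. A BooleanQuery made only of SHOULD clauses still needs at least one of them to match, so the query does not degenerate into matching everything.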

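As an example, take the post at http://topic.csdn.net/u/20100501/09/64e41f24-e69a-40e3-9058-17487e4f311b.html?1469

We index the posts listed under its related questions together with some others, 20 documents in total.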

File indexDir = new File("TestMoreLikeThisQuery/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);

// Use the post "IT外企那点儿事" as likeText, read from a file.
StringBuffer contentBuffer = new StringBuffer();
BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream("TestMoreLikeThisQuery/IT外企那点儿事.txt"), "utf-8"));
String line = null;
while ((line = input.readLine()) != null) {
    contentBuffer.append(line);
}
input.close();
String content = contentBuffer.toString();

// Tokenize with the Chinese Academy of Sciences word segmenter.
MoreLikeThisQuery query = new MoreLikeThisQuery(content, new String[]{"contents"}, new MyAnalyzer(new ChineseAnalyzer()));

// Treat words that appear in at least 70% of the documents as stop words;
// a real application may use a different stop-word strategy.
query.setStopWords(getStopWords(reader));

// Only words that occur at least 5 times count as important.
query.setMinTermFrequency(5);

// Keep only one query term.
query.setMaxQueryTerms(1);

TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
    Document ldoc = reader.document(doc.doc);
    String title = ldoc.get("title");
    System.out.println(title);
}

static Set<String> getStopWords(IndexReader reader) throws IOException {
    HashSet<String> stop = new HashSet<String>();
    int numOfDocs = reader.numDocs();
    // A term is a stop word if it occurs in at least 70% of the documents.
    int stopThreshold = (int) (numOfDocs * 0.7f);
    TermEnum te = reader.terms();
    while (te.next()) {
        String text = te.term().text();
        if (te.docFreq() >= stopThreshold) {
            stop.add(text);
        }
    }
    return stop;
}
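
The stop-word rule itself is independent of Lucene: any term whose document frequency reaches a fixed share of the collection is dropped. A plain-Java sketch over an in-memory map follows; StopWordSketch and its sample frequencies are invented for illustration, with 20 documents as in the experiment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java sketch of stop-word selection by document frequency. */
final class StopWordSketch {

    /** Returns every term that occurs in at least ratio * numDocs documents. */
    static Set<String> stopWords(Map<String, Integer> docFreqs, int numDocs, float ratio) {
        int threshold = (int) (numDocs * ratio);
        Set<String> stop = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() >= threshold) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    /** Invented sample: document frequencies in a 20-document index. */
    static Set<String> demo() {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 19);
        docFreqs.put("company", 15);
        docFreqs.put("lucene", 3);
        return stopWords(docFreqs, 20, 0.7f);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```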

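The results (file titles translated from Chinese) are:

Uncovering Foreign Companies (Part 6): Foreign-Company Recruiting Has Unwritten Rules Too.txt

State-Owned Enterprise or Foreign Company, Please Help Me Decide.txt

Which English Textbook Suits People with Weak English.txt

Has Anyone Trained in Tarena's Foreign-Company Software Engineer Job Course.txt

Two Months of Job-Hunting While Employed: Interviewing at Countless Companies in Shenzhen.txt

A Novel That Might Change Your Life, "Zuo Dan": Deal-Closing Secrets of a Foreign-Company Sales Manager.txt

HR's Top Secrets: 20 Unwritten Rules Companies Will Never Tell You.txt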

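4. MultiTermQuery

Queries of this kind expand to one or more Terms. They mainly include FuzzyQuery, PrefixQuery, WildcardQuery, NumericRangeQuery<T>, and TermRangeQuery.

This chapter discusses the last two.

4.1. TermRangeQuery

In earlier versions of Lucene, the query object for range searches was RangeQuery, which only supported ranges over strings. When Lucene 2.9 added NumericRangeQuery for numeric ranges, the original RangeQuery became TermRangeQuery.

Its member variables are:

String lowerTerm; the lower bound string
String upperTerm; the upper bound string
boolean includeLower; whether the lower bound is included
boolean includeUpper; whether the upper bound is included
String field; the field
Collator collator; lets the user supply int compare(String source, String target) to decide what counts as greater and what counts as smaller

It provides the function FilteredTermEnum getEnum(IndexReader reader) to obtain all Terms in the range: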

protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    return new TermRangeTermEnum(reader, field, lowerTerm, upperTerm, includeLower, includeUpper, collator);
}

The next function of FilteredTermEnum, which advances to the next matching Term, is:

public boolean next() throws IOException {
    if (actualEnum == null) return false;
    currentTerm = null;
    while (currentTerm == null) {
        if (endEnum()) return false;
        if (actualEnum.next()) {
            Term term = actualEnum.term();
            if (termCompare(term)) {
                currentTerm = term;
                return true;
            }
        }
        else return false;
    }
    currentTerm = null;
    return false;
}

It calls termCompare to decide whether a Term lies inside the range. The termCompare of TermRangeTermEnum is:

protected boolean termCompare(Term term) {
    if (collator == null) {
        // No collator was set, so compare the strings directly.
        boolean checkLower = false;
        if (!includeLower)
            checkLower = true;
        if (term != null && term.field() == field) {
            if (!checkLower || null==lowerTermText || term.text().compareTo(lowerTermText) > 0) {
                checkLower = false;
                if (upperTermText != null) {
                    int compare = upperTermText.compareTo(term.text());
                    // Past the upper bound, or equal to an excluded upper bound: stop enumerating.
                    if ((compare < 0) ||
                        (!includeUpper && compare==0)) {
                        endEnum = true;
                        return false;
                    }
                }
                return true;
            }
        } else {
            endEnum = true;
            return false;
        }
        return false;
    } else {
        // A collator was set, so use it to compare the strings.
        if (term != null && term.field() == field) {
            if ((lowerTermText == null
                    || (includeLower
                        ? collator.compare(term.text(), lowerTermText) >= 0
                        : collator.compare(term.text(), lowerTermText) > 0))
                && (upperTermText == null
                    || (includeUpper
                        ? collator.compare(term.text(), upperTermText) <= 0
                        : collator.compare(term.text(), upperTermText) < 0))) {
                return true;
            }
            return false;
        }
        endEnum = true;
        return false;
    }
}
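
Stripped of the enumeration bookkeeping, the collator-free branch is a lexicographic bounds check with inclusive or exclusive ends. The sketch below is a simplified standalone predicate, not the Lucene method: TermRangeSketch and inRange are hypothetical names, and null is taken to mean an open bound. It also shows why string ranges behave badly for numbers written as text.

```java
/** Plain-Java sketch of a string range check with inclusive or exclusive bounds. */
final class TermRangeSketch {

    /** True if text lies between lower and upper under String.compareTo ordering. */
    static boolean inRange(String text, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        if (lower != null) {
            int c = text.compareTo(lower);
            if (c < 0 || (c == 0 && !includeLower)) {
                return false; // below the lower bound, or equal to an excluded bound
            }
        }
        if (upper != null) {
            int c = text.compareTo(upper);
            if (c > 0 || (c == 0 && !includeUpper)) {
                return false; // above the upper bound, or equal to an excluded bound
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inRange("b", "a", "c", true, true));   // an ordinary string range
        System.out.println(inRange("a", "a", "c", false, true));  // lower bound excluded
        System.out.println(inRange("10", "3", "6", true, false)); // "10" sorts before "3"
        System.out.println(inRange("35", "3", "6", true, false)); // "35" sorts between "3" and "6"
    }
}
```

The last two calls compare digits as characters, so the text "35" falls inside the range from "3" to "6" while "10" sorts before "3". Numeric ranges therefore need a different encoding.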

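From the earlier analysis of MultiTermQuery's rewrite, we know that TermRangeQuery may be rewritten into a BooleanQuery. If the range is too wide, or contains too many Terms, a TooManyClauses exception may be thrown.

An alternative is TermRangeFilter, which is not turned into a query object but filters the query results instead. It is covered in detail in the Filter section.

4.2. NumericRangeQuery

Lucene supports numeric ranges since version 2.9. To use this query, the field must be added with NumericField: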

document.add(new NumericField(name).setIntValue(value));

or with NumericTokenStream:

Field field = new Field(name, new NumericTokenStream(precisionStep).setIntValue(value));
field.setOmitNorms(true);
field.setOmitTermFreqAndPositions(true);
document.add(field);

Depending on the numeric type, a NumericRangeQuery is created with one of the following methods:

newDoubleRange(String, Double, Double, boolean, boolean)
newFloatRange(String, Float, Float, boolean, boolean)
newIntRange(String, Integer, Integer, boolean, boolean)
newLongRange(String, Long, Long, boolean, boolean)

public static NumericRangeQuery<Integer> newIntRange(final String field, Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive) {
    return new NumericRangeQuery<Integer>(field, NumericUtils.PRECISION_STEP_DEFAULT, 32, min, max, minInclusive, maxInclusive);
}

It provides the function FilteredTermEnum getEnum(IndexReader reader) to obtain all Terms in the range:

protected FilteredTermEnum getEnum(final IndexReader reader) throws IOException {
    return new NumericRangeTermEnum(reader);
}

The termCompare of NumericRangeTermEnum is:

protected boolean termCompare(Term term) {
    return (term.field() == field && term.text().compareTo(currentUpperBound) <= 0);
}

Another option is NumericRangeFilter, which is discussed in detail later.

As an example, we index ten documents with ids from 0 to 9:

Document doc = new Document();
doc.add(new Field("contents", new FileReader(file)));
String name = file.getName();
Integer id = Integer.parseInt(name);
doc.add(new NumericField("id").setIntValue(id));
writer.addDocument(doc);

At search time, build a NumericRangeQuery:

File indexDir = new File("TestNumericRangeQuery/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("id", 3, 6, true, false);
TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
    System.out.println("docid : " + doc.doc + " score : " + doc.score);
}

The results are:

docid : 3 score : 1.0
docid : 4 score : 1.0
docid : 5 score : 1.0
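
The query covers the half-open interval from 3 (included) to 6 (excluded), which is why id 6 is missing. The plain-Java sketch below models only these range semantics, not the trie-encoded terms NumericRangeQuery uses internally; NumericRangeSketch is a hypothetical class, and a null bound is treated as open-ended.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the inclusive/exclusive semantics of a numeric range. */
final class NumericRangeSketch {

    /** True if value lies within [min, max] after applying the inclusive flags. */
    static boolean inRange(int value, Integer min, Integer max,
                           boolean minInclusive, boolean maxInclusive) {
        if (min != null && (minInclusive ? value < min : value <= min)) {
            return false;
        }
        if (max != null && (maxInclusive ? value > max : value >= max)) {
            return false;
        }
        return true;
    }

    /** Collects the ids in 0..count-1 that fall inside the range. */
    static List<Integer> matchingIds(int count, Integer min, Integer max,
                                     boolean minInclusive, boolean maxInclusive) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < count; id++) {
            if (inRange(id, min, max, minInclusive, maxInclusive)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Same bounds and flags as newIntRange("id", 3, 6, true, false) over ids 0..9
        System.out.println(matchingIds(10, 3, 6, true, false));
    }
}
```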


2010-05-19 02:34