Mirror of https://github.com/sb745/NyaaV3.git (synced 2025-11-04 01:45:46 +02:00)
	ES: delimit words before ngram, optimize tokens (#487)
Before, a token like long.tokens.with.dots.or.dashes would get edge-ngrammed up to the ngram limit, so we'd get as far as long.tokens.wit, which would then be split, discarding "with.dots.or.dashes" completely. The fullword index would keep the complete long token, but without any ngramming, so incomplete searches (like "tokens") would not match it; only the full token would.

Now we split words before ngramming them, so the main index properly handles every word up to the ngram limit, and the fullword index still covers the longer words for non-ngram matching.

Also optimized away duplicate tokens in the indices (since we rely on boolean matching, not scoring, duplicates add nothing) to save a couple of megabytes of space.
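To make the ordering concrete, here is a minimal sketch in plain Python (no Elasticsearch needed) of the two filter orders. edge_ngram, word_delimit, and unique are toy stand-ins for the corresponding ES token filters, and the 1..15 ngram range is an assumption chosen to match the long.tokens.wit example above, not necessarily the project's actual settings.

NGRAM_MIN, NGRAM_MAX = 1, 15  # assumed limits; "long.tokens.wit" is 15 chars

def edge_ngram(tokens):
    """Toy edge_ngram filter: emit prefixes of each token up to NGRAM_MAX."""
    return [t[:n] for t in tokens
            for n in range(NGRAM_MIN, min(NGRAM_MAX, len(t)) + 1)]

def word_delimit(tokens):
    """Toy word_delimiter filter: split on dots and dashes only
    (the real filter splits on more character classes)."""
    return [part for t in tokens
            for part in t.replace("-", ".").split(".") if part]

def unique(tokens):
    """Toy unique filter: drop duplicates, keeping first occurrence."""
    return list(dict.fromkeys(tokens))

raw = ["long.tokens.with.dots.or.dashes"]

# Old order: ngram, then delimit. Everything past the ngram limit
# ("long.tokens.wit") is gone before the split ever happens.
old = unique(word_delimit(edge_ngram(raw)))

# New order: delimit, then ngram. Every word gets its own prefixes.
new = unique(edge_ngram(word_delimit(raw)))

print("with" in old, "with" in new)      # False True ("wit" was the cutoff)
print("dots" in old, "dots" in new)      # False True
print("dashes" in old, "dashes" in new)  # False True

In the old order a word like "tokens" could still surface when a long-enough prefix got split afterwards, but anything at or past the fifteenth character was unreachable through ngrams; delimiting first removes that cliff.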
This commit is contained in:

parent 8f4202c098
commit 59db958977

1 changed file with 3 additions and 1 deletion
@@ -20,9 +20,10 @@ settings:
         filter:
           - resolution
           - lowercase
-          - my_ngram
           - word_delimit
+          - my_ngram
           - trim_zero
+          - unique
       # For exact matching - simple lowercase + whitespace delimiter
       exact_analyzer:
         tokenizer: whitespace
@@ -40,6 +41,7 @@ settings:
           # Skip tokens shorter than N characters,
           # since they're already indexed in the main field
           - fullword_min
+          - unique
 
     filter:
       my_ngram:
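A quick way to sanity-check a filter-chain change like this is Elasticsearch's _analyze API. A hedged sketch follows; the index name nyaa and analyzer name my_analyzer are placeholders for illustration, so substitute whatever the real mapping defines.

import json
import requests  # assumes the requests package is installed

resp = requests.post(
    "http://localhost:9200/nyaa/_analyze",  # hypothetical index name
    headers={"Content-Type": "application/json"},
    data=json.dumps({
        "analyzer": "my_analyzer",  # hypothetical analyzer name
        "text": "long.tokens.with.dots.or.dashes",
    }),
)
tokens = [t["token"] for t in resp.json()["tokens"]]

# With the reordered chain, prefixes of every delimited word should be
# present, and the unique filter should emit each token only once.
assert "dashes" in tokens
assert len(tokens) == len(set(tokens))
print(tokens)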