Schemaless Solr highlighting with Tika

With a default schemaless core and content indexed through Tika, my searches came back without any highlighting. Two things needed fixing:

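  • Tika extracts text to the 'content' field. However, the schemaless config had guessed the type 'strings', which is not tokenized, so search terms never match individual words inside the text. Solution: replace the field definition with a tokenized, stored type through the Schema API:
      curl -X POST -H 'Content-type:application/json' \
        --data-binary '{ "replace-field":{ "name":"content", "type":"text_general", "stored":true } }' \
        http://localhost:8983/solr/aasv/schema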
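  • The content field was being mapped to the _text_ field, which is not stored, so there was no stored text to build highlight snippets from. Solution: edit the core's solrconfig.xml and comment out the mapping to _text_ (in the stock schemaless config this is probably the fmap.content default on the /update/extract handler).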
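
A quick way to check that both fixes took effect is to look at the field definition and then run a search with highlighting switched on. The helpers below are only a sketch: the function names are made up for illustration, the host and core name are the ones from the curl command above, and the search term is passed through without URL encoding, so it only suits simple single-word queries.

```shell
#!/bin/sh
# Base URL of the core; the default is the host and core used in this post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/aasv}"

# Build the Schema API URL that returns the definition of one field.
# (field_url is a hypothetical helper name, not part of Solr.)
field_url() {
  printf '%s/schema/fields/%s' "$SOLR_URL" "$1"
}

# Build a select URL that searches the content field with highlighting on.
# $1 is a single search term; it is NOT URL-encoded here.
highlight_url() {
  printf '%s/select?q=content:%s&hl=true&hl.fl=content&wt=json' "$SOLR_URL" "$1"
}

# Fetch helpers; these need a running Solr instance.
show_field()     { curl -s "$(field_url "$1")"; }
search_with_hl() { curl -s "$(highlight_url "$1")"; }
```

After sourcing this, `show_field content` should report the text_general type with stored set to true, and `search_with_hl someword` (with a word that occurs in an indexed document) should return a highlighting section alongside the normal results.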

After reindexing, the content field now highlights. However, it also contains all the document metadata, which I don't really want there. I'll have to dig further into Tika and the extraction configuration to see what I can do about that.
