How to prevent attachments from being stored in _source with Elasticsearch and Tire?

Tag: elasticsearch, attachment, tire Author: d845047722 Date: 2012-07-23

I've got some PDF attachments being indexed in Elasticsearch, using the Tire gem. It's all working great, but I'm going to have many GB of PDFs, and we will likely store the PDFs in S3 for access. Right now the base64-encoded PDFs are being stored in Elasticsearch _source, which will make the index huge. I want to have the attachments indexed, but not stored, and I haven't yet figured out the right incantation to put in Tire's "mapping" block to prevent it. The block is like this right now:

mapping do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end

I've tried some variations like:

indexes :attachment, :type => 'attachment', :_source => { :enabled => false }

And it looks nice when I run the tire:import rake task, but it doesn't seem to make a difference. Does anyone know A) if this is possible? and B) how to do it?

Thanks in advance.

Do you want to disable source completely or only exclude this particular field?
Preferably just exclude this one field, so that highlighting/etc will still be available on the other fields. I suppose I could store the specific fields where we want highlighting and disable source completely, but I'm not yet clear on what the overall effects of that would be.

Best Answer

The _source field settings contain a list of fields that should be excluded from the source. I would guess that in the case of Tire, something like this should do it:

mapping :_source => { :excludes => ['attachment'] } do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end
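
For reference, the mapping block above should correspond roughly to the following type mapping once it reaches Elasticsearch. This is only a sketch built with Ruby's standard library: the type name `document` is a placeholder that does not come from the question, and the exact JSON Tire generates may differ in detail.

```ruby
require 'json'

# Hypothetical type name; Tire would normally derive it from the model class.
TYPE_NAME = 'document'

# Type-level settings such as _source sit next to "properties",
# not inside an individual field definition.
MAPPING = {
  TYPE_NAME => {
    '_source' => { 'excludes' => ['attachment'] },
    'properties' => {
      'id'          => { 'type' => 'integer' },
      'title'       => { 'type' => 'string' },
      'last_update' => { 'type' => 'date' },
      'attachment'  => { 'type' => 'attachment' }
    }
  }
}

puts JSON.pretty_generate(MAPPING)
```

Seen this way, it would also explain why passing :_source as an option to indexes :attachment made no difference: _source is configured for the type as a whole rather than for a single field.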

comments:

Looks like that did it! Thanks very much for the answer -- hopefully this will get added to Tire's documentation, as it's a great option.

Other Answer1

@imotov's solution does not work for me. When I execute the curl command

curl -X GET "http://localhost:9200/user_files/user_file/_search?pretty=true" -d '{"query":{"query_string":{"query":"rspec"}}}'

I can still see the content of the attachment file included in the search results.

"_source" : {"user_file":{"id":5,"folder_id":1,"updated_at":"2012-08-16T11:32:41Z","attachment_file_size":179895,"attachment_updated_at":"2012-08-16T11:32:41Z","attachment_file_name":"hw4.pdf","attachment_content_type":"application/pdf","created_at":"2012-08-16T11:32:41Z","attachment_original":"JVBERi0xL .....

Here's my implementation:

include Tire::Model::Search
include Tire::Model::Callbacks

def self.search(folder, params)
  tire.search() do
    query { string params[:query], default_operator: "AND"} if params[:query].present?
    filter :term, folder_id: folder.id
    highlight :attachment_original, :options => {:tag => "<em>"}
  end
end

mapping :_source => { :excludes => ['attachment_original'] } do
  indexes :id, :type => 'integer'
  indexes :folder_id, :type => 'integer'
  indexes :attachment_file_name
  indexes :attachment_updated_at, :type => 'date'
  indexes :attachment_original, :type => 'attachment'
end

def to_indexed_json
  to_json(:methods => [:attachment_original])
end

def attachment_original
  if attachment_file_name.present?
    path_to_original = attachment.path
    # Read in binary mode so the PDF bytes are not altered before encoding.
    Base64.encode64(File.binread(path_to_original))
  end
end

comments:

This may sound obvious, but I just wanted to double-check: after adding the "excludes" you did delete the index and do a complete re-index? I ask because when I was testing I forgot to do that once and spent a couple of minutes before realizing it so it can't hurt to check. Your code looks correct, so...
Yes, I did run: rake environment tire:import CLASS='Article' FORCE=true to reindex. I've also removed highlight from tire.search() but it didn't help. I still see the attachment content included in _source :(
Hmm, I've just noticed that in the search results all the fields, including the ones that are not mapped, are included in _source. That's not supposed to happen, right? I think I'll post another question regarding this. Thanks!
hmm, I misunderstood how to_indexed_json works. See this question
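
For later readers: the _source shown in this answer nests every field under a user_file root key, while the mapping and the excludes list name top-level fields such as attachment_original. Below is a minimal sketch of a to_indexed_json that emits the fields at the top level instead. It assumes that this mismatch is what keeps the exclude from applying, which the thread does not confirm, and it uses a plain Struct (FakeUserFile, a hypothetical stand-in with a made-up raw_bytes member) in place of the real ActiveRecord model so that it runs on its own.

```ruby
require 'json'
require 'base64'

# Hypothetical stand-in for the ActiveRecord model in the answer above.
FakeUserFile = Struct.new(:id, :folder_id, :attachment_file_name, :raw_bytes) do
  # Base64-encode the file contents, or return nil when there is no attachment.
  def attachment_original
    return nil if attachment_file_name.to_s.empty?
    Base64.encode64(raw_bytes)
  end

  # Build the document explicitly so that field names sit at the top level
  # and line up with the names used in the mapping block.
  def to_indexed_json
    {
      'id'                   => id,
      'folder_id'            => folder_id,
      'attachment_file_name' => attachment_file_name,
      'attachment_original'  => attachment_original
    }.to_json
  end
end

doc = FakeUserFile.new(5, 1, 'hw4.pdf', '%PDF-1.4 example bytes')
puts doc.to_indexed_json
```

With the fields at the top level, the name in :excludes and the name in the indexed document refer to the same path, so it is worth deleting the index and re-importing after a change like this before checking _source again.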