Announcement
New blog: peichengnote

Current category: elasticsearch/logstash (4)


note 

timeout while indexing 1gb document

InquiringMind <brian.from.fl@gmail.com> May 16, 2013, 11:32 PM
Reply-To: elasticsearch@googlegroups.com
To: elasticsearch@googlegroups.com
By "we can say", you must mean you and your tapeworm. No one else is included in your conclusion.
 
By "document" you really mean "input data stream". In strict terms, an ElasticSearch "document" is a MySQL "row". You will never succeed in loading a 1 GB row into MySQL. But from your posts, I am guessing that MySQL has a tool that slurps one huge 1 GB input stream into the multiple rows it represents and loads them optimally. OK, ElasticSearch doesn't come with such a tool, but it comes with wonderful APIs that let you dream up and implement all manner of input streams. There are many third-party tools for pulling in data from many sources (rivers, they call them), and I wrote my own converters with proper bulk-load coding to push bulk data into ElasticSearch.
 
I can easily and successfully load a 3.1 GB "document" into ElasticSearch. Even on my laptop with decent CPU power but low end disk performance, I can load this 3.1 GB monster in just under 3 hours. The MacBook fans sound like a (quiet) jet engine, but the system is still surprisingly responsive during its efforts. And there are no memory issues, exceptions thrown, or any other issues at all. And note that this exact same 3.1 GB input "document" was loaded into MySQL in 8 hours on a production server with a proper disk array; ElasticSearch did the same job on my laptop and single slow disk in less than half the time.
 
And that 3.1 GB document is a gzip'd CSV file. Of course, I needed my Java skills to take the gunzip'd output (using gunzip -c to decompress to stdout but not on disk. Yay!), then convert that (probably about 7 or 8 GB by now) uncompressed CSV stream into the desired JSON stream, and then use the excellent examples as a model for my bulk loader that properly loaded that huge document into ElasticSearch.
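The pipeline described above (gunzip to stdout, then CSV to JSON, then the bulk API) can be sketched roughly as follows. This is a minimal illustration, not the author's actual converter (which was written in Java); the function name, index name, and batch size here are hypothetical.

```python
import csv
import json

def csv_to_bulk(lines, index_name, batch_size=1000):
    """Turn an iterable of CSV lines (e.g. piped from `gunzip -c`)
    into Elasticsearch bulk-API bodies: one action line plus one
    source line per row, yielded in batches of batch_size docs."""
    reader = csv.DictReader(lines)
    batch = []
    for row in reader:
        # Bulk format: an "index" action line, then the document itself.
        batch.append(json.dumps({"index": {"_index": index_name}}))
        batch.append(json.dumps(row))
        if len(batch) >= 2 * batch_size:
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"
```

Each yielded string is one POST body for the `_bulk` endpoint; streaming in batches is what keeps memory flat even for multi-GB inputs.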

peicheng posted on Pixnet | Comments (0) | Views ()


SolrCloud addresses the shortcomings of Solr's original distributed setup.
Previously, Solr had no distributed indexing feature: you had to split cores into shards manually,
and you had to know yourself which node each record file should be sent to for indexing; Solr did not manage any of these steps for you.
Moreover, when your cores became unbalanced, you had to fix that by hand, and there was no failover support, which could leave some indexes unreachable.

SolrCloud introduces ZooKeeper

The system brings in ZooKeeper, the coordination service commonly used in the Hadoop ecosystem, to handle failover and load balancing, making the whole search engine more robust.



Software versions:
elasticsearch-0.19.11-2.el6.x86_64.rpm
logstash-1.1.5-1.el6.x86_64.rpm
redis-2.4.6-rc2.x86_64.rpm

Use redis-2.4.7-1.x86_64.rpm instead to avoid this problem.


Too many active ES requests, blocking now. {:inflight_requests=>50, :max_inflight_requests=>50, :level=>:info, :file=>"/opt/logstash/lib/logstash-1.1.5-monolithic.jar!/logstash/outputs/elasticsearch.rb", :line=>"150", :method=>"receive"}
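The log line above is the elasticsearch output plugin's back-pressure: once 50 requests are in flight, the pipeline blocks until one completes. A rough sketch of that mechanism in Python (the plugin itself is Ruby; class and method names here are hypothetical):

```python
import threading

class InflightThrottle:
    """Block the producer once max_inflight requests are outstanding,
    mimicking the logstash output's 'blocking now' behaviour."""

    def __init__(self, max_inflight=50):
        self._slots = threading.Semaphore(max_inflight)

    def acquire(self):
        # Try to grab a slot without blocking first, so we can tell
        # the caller (and the log) whether we had to wait.
        blocked = not self._slots.acquire(blocking=False)
        if blocked:
            # This is where the plugin logs "Too many active ES
            # requests, blocking now" before waiting for a free slot.
            self._slots.acquire()
        return blocked

    def release(self):
        # Called when an ES request completes, freeing a slot.
        self._slots.release()
```

With the versions listed in the previous post, requests stopped completing, so `release` was never called and the pipeline stayed blocked at `inflight_requests=>50`.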