SOLR Performance and SolrJ (2) Compress and Post File

The idea is to compress the POST data before sending it to the SOLR indexer. Alternatively, we could use SolrCloud with sharding to spread the load and improve the bandwidth available to the indexer machines.

In the end, we decided to generate the XML file, compress it, and SCP the compressed file to the indexer machine. On the indexer machine, a monitor script unzips the file and posts it to SOLR on localhost. Because the post goes to localhost, it no longer uses network bandwidth.

I am using XMLWriter to generate the SOLR XML; the code is similar to this:
public function __construct($ioc)
{
    $this->ioc = $ioc;

    // Services resolved from the IoC container
    $logger = $this->ioc->getService("logger");
    $config = $this->ioc->getService("config");

    $this->xmlWriter = new \XMLWriter();
}

// Open the output file and start the root <update> element
public function addStart($file)
{
    $this->xmlWriter->openURI($file);
    $this->xmlWriter->setIndent(true);

    $this->xmlWriter->startElement('update');
}
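
The snippet above only opens the file and the root element. Here is a minimal sketch of how the rest of the document could be written with XMLWriter. The class name SolrXmlSketch and the methods addDoc and addEnd are hypothetical, not the original code, and the sketch writes every field as escaped text rather than CDATA.

```php
<?php
// Hypothetical helper, not the original class: writes a Solr <update> document.
class SolrXmlSketch
{
    private $xmlWriter;

    public function __construct()
    {
        $this->xmlWriter = new \XMLWriter();
    }

    // Open the output and the root <update> element.
    // Pass a file path, or null to buffer the XML in memory.
    public function addStart($file = null)
    {
        if ($file === null) {
            $this->xmlWriter->openMemory();
        } else {
            $this->xmlWriter->openURI($file);
        }
        $this->xmlWriter->setIndent(true);
        $this->xmlWriter->startElement('update');
    }

    // Write one <add><doc>...</doc></add> block from a name => value map.
    public function addDoc(array $fields)
    {
        $this->xmlWriter->startElement('add');
        $this->xmlWriter->startElement('doc');
        foreach ($fields as $name => $value) {
            $this->xmlWriter->startElement('field');
            $this->xmlWriter->writeAttribute('name', (string) $name);
            $this->xmlWriter->text((string) $value); // text() escapes &, < and >
            $this->xmlWriter->endElement(); // </field>
        }
        $this->xmlWriter->endElement(); // </doc>
        $this->xmlWriter->endElement(); // </add>
    }

    // Close <update> and flush. Returns the XML string when buffering in
    // memory, or the number of bytes written when writing to a file.
    public function addEnd()
    {
        $this->xmlWriter->endElement(); // </update>
        return $this->xmlWriter->flush();
    }
}
```

Calling addStart(), then addDoc() once per document, then addEnd() keeps memory use low for large feeds, because XMLWriter streams to the file instead of building a DOM.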

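Compress the file with gzip and SCP it to the target machine:
// gzip -f replaces the file with {$file_path}.gz, overwriting any existing archive
system("gzip -f {$file_path}");
system("scp -i /share/ec2.id -o StrictHostKeyChecking=no {$file_path}.gz ec2-user@{$ip}:" . $this->XML_FOLDER);
// Remove the local archive after the copy attempt
unlink($file_path . ".gz");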
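
If the file path or host can contain spaces or shell metacharacters, the interpolated command strings become fragile. Below is a rough sketch of an alternative, assuming the zlib extension is available: gzipFile compresses inside PHP, and buildScpCommand quotes each argument with escapeshellarg. Both function names and their parameters are hypothetical.

```php
<?php
// Hypothetical helper: gzip $path into "$path.gz" and remove the original,
// mirroring what `gzip -f` does. Returns the path of the archive.
function gzipFile($path)
{
    $in = fopen($path, 'rb');
    if ($in === false) {
        throw new RuntimeException("cannot read {$path}");
    }
    $out = gzopen($path . '.gz', 'wb9'); // 'w' overwrites an existing archive
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("cannot write {$path}.gz");
    }
    // Stream in 1 MiB chunks so large feeds do not need to fit in memory.
    while (!feof($in)) {
        gzwrite($out, fread($in, 1 << 20));
    }
    fclose($in);
    gzclose($out);
    unlink($path);
    return $path . '.gz';
}

// Hypothetical helper: build the scp command with every argument quoted.
function buildScpCommand($keyFile, $localFile, $user, $host, $remoteDir)
{
    return 'scp -i ' . escapeshellarg($keyFile)
        . ' -o StrictHostKeyChecking=no '
        . escapeshellarg($localFile) . ' '
        . escapeshellarg("{$user}@{$host}:{$remoteDir}");
}
```

The command returned by buildScpCommand can be passed to system() or exec() as before; checking the exit status before calling unlink avoids deleting an archive that was never copied.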

On the target machine, we watch the directory and run a curl POST request against the SOLR server for each file. PHP makes this easy.

    $delta_files = array();
    exec('ls -tr --time=ctime /mnt/ad_feed/*.gz 2>/dev/null', $delta_files);
    $delta_count = count($delta_files);
    if(DEBUG) echo "delta count: ".$delta_count."\n";
    if($delta_count == 0) continue;
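
The same listing can be done without shelling out to ls. This is only a sketch: listOldestFirst is a hypothetical name, and it sorts by filectime to match the `--time=ctime` ordering (oldest first) used above.

```php
<?php
// Hypothetical helper: return the files in $dir matching $pattern,
// ordered oldest first by inode change time (ctime), like `ls -tr --time=ctime`.
function listOldestFirst($dir, $pattern = '*.gz')
{
    $files = glob(rtrim($dir, '/') . '/' . $pattern);
    if ($files === false) {
        return array(); // glob() reports an error as false
    }
    usort($files, function ($a, $b) {
        $cmp = filectime($a) - filectime($b);
        // Fall back to the name so files with equal ctime sort predictably.
        return $cmp !== 0 ? $cmp : strcmp($a, $b);
    });
    return $files;
}
```

An empty array here corresponds to the `$delta_count == 0` case in the loop.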

Check how many curl processes are already running:
    $curl_processes = array();
    exec('ps -ef | grep "curl --fail http://localhost:8983/job/update -d @/mnt/ad_feed/" | grep -v grep', $curl_processes);
    $curl_count = count($curl_processes);
    if(DEBUG) echo "curl count: ".$curl_count."\n";
    if($curl_count >= MAX_PROCS) continue;

Run the command in the background so that exec returns immediately and several processes can run in parallel:
$curl_command = "php delta_curl.php $cur_file > /dev/null 2>&1 &"; //parallel processes
exec($curl_command);
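
The throttling in the snippets above boils down to "never run more than MAX_PROCS workers". A small sketch of that rule as pure functions, which are easier to test than the loop itself; pickFilesToLaunch and buildWorkerCommand are hypothetical names, and launching several files per pass is an assumption rather than what the original loop necessarily does.

```php
<?php
// Hypothetical helper: choose which pending files to hand to new workers,
// given how many workers are already running and the allowed maximum.
function pickFilesToLaunch(array $pendingFiles, $runningCount, $maxProcs)
{
    $freeSlots = max(0, $maxProcs - $runningCount);
    return array_slice($pendingFiles, 0, $freeSlots);
}

// Hypothetical helper: build the background command for one file.
function buildWorkerCommand($file)
{
    // Output is discarded and "&" detaches the worker, so exec() returns at once.
    return 'php delta_curl.php ' . escapeshellarg($file) . ' > /dev/null 2>&1 &';
}
```

Each command from buildWorkerCommand can then be passed to exec() exactly as in the line above.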

Post the XML file from the worker script, which receives the file path as its first argument:
exec("curl --fail http://localhost:8983/job/update -d @{$argv[1]} -H Content-type:application/xml", $output, $status);

if(0 != $status)
{
    send_delta_alert($argv[1]);
}
unlink($argv[1]);
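
The unzip step mentioned earlier is not shown in the snippets. Here is a sketch of how it could look, assuming the zlib extension; gunzipFile and buildPostCommand are hypothetical names. The sketch also swaps `-d` for `--data-binary`, because curl strips carriage returns and newlines from a file passed with `-d @file`. Note that the ps/grep check above matches on `-d @`, so it would need the same change.

```php
<?php
// Hypothetical helper: decompress a ".gz" file next to itself and
// return the path of the decompressed file.
function gunzipFile($gzPath)
{
    $xmlPath = preg_replace('/\.gz$/', '', $gzPath);
    if ($xmlPath === $gzPath) {
        throw new InvalidArgumentException("not a .gz file: {$gzPath}");
    }
    $in = gzopen($gzPath, 'rb');
    if ($in === false) {
        throw new RuntimeException("cannot read {$gzPath}");
    }
    $out = fopen($xmlPath, 'wb');
    if ($out === false) {
        gzclose($in);
        throw new RuntimeException("cannot write {$xmlPath}");
    }
    // Stream in 1 MiB chunks so large feeds do not need to fit in memory.
    while (!gzeof($in)) {
        fwrite($out, gzread($in, 1 << 20));
    }
    gzclose($in);
    fclose($out);
    return $xmlPath;
}

// Hypothetical helper: build the curl POST command with quoted arguments.
function buildPostCommand($url, $xmlPath)
{
    return 'curl --fail ' . escapeshellarg($url)
        . ' --data-binary ' . escapeshellarg('@' . $xmlPath)
        . ' -H ' . escapeshellarg('Content-type:application/xml');
}
```

The resulting command can be run with exec($cmd, $output, $status), keeping the same status check and alert as in the worker script.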

A sample of the XML format is as follows:
<update>
<delete>
  <id>2136083108</id>
  <id>2136083113</id>
  <id>2136083114</id>
</delete>
<add>
  <doc>
   <field name="id">2136083xx</field>
   <field name="customer_id">2xx</field>
   <field name="pool_id">20xx</field>
   <field name="source_id">23xx</field>
   <field name="campaign_id">3xxx</field>
   <field name="segment_id">0</field>
   <field name="job_reference">468-1239-xxxx4</field>
   <field name="title"><![CDATA[CDL-A xxxxx ]]></field>
   <field name="url"><![CDATA[http://www.xxxxxx]]></field>
   <field name="company_id">11xxx7</field>
   <field name="company">Hub xxxxx</field>
   <field name="title_com">CDL-xxxx</field>
   <field name="campaign_com">3396xxx</field>
   <field name="zipcode">3xxxx</field>
   <field name="cities">Atlanta,GA</field>
   <field name="jlocation">33.8444,-84.4741</field>
   <field name="state_id">11</field>
   <field name="cpc">125</field>
   <field name="reg_cpc">130</field>
   <field name="qq_multiplier">0</field>
   <field name="j2c_apply">0</field>
   <field name="created">2016-09-02T06:02:42Z</field>
   <field name="posted">2016-09-02T06:02:42Z</field>
   <field name="experience">2</field>
   <field name="salary">150</field>
   <field name="education">2</field>
   <field name="jobtype">1</field>
   <field name="quality_score">60</field>
   <field name="boost_factor">20.81</field>
   <field name="industry">20</field>
   <field name="industries">20</field>
   <field name="paused">false</field>
   <field name="email"></field>
   <field name="srcseg_id">23xx</field>
   <field name="srccamp_id">23xxx</field>
   <field name="top_spot_type">7</field>
   <field name="top_spot_industries">20</field>
   <field name="is_ad">2</field>
   <field name="daily_capped">0</field>
   <field name="mobile_friendly">1</field>
   <field name="excluded_company">false</field>
  </doc>
</add>
</update>
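
Before posting, it can help to check that a generated file is well formed and contains what you expect. A rough sketch using SimpleXML, which loads the whole file into memory and so only suits small files or spot checks; summarizeUpdateXml is a hypothetical name.

```php
<?php
// Hypothetical helper: count the delete ids and add docs in an <update> document.
function summarizeUpdateXml($xmlString)
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false || $xml->getName() !== 'update') {
        throw new InvalidArgumentException('not a well-formed <update> document');
    }
    $deletes = 0;
    foreach ($xml->delete as $delete) {
        $deletes += count($delete->id); // one <id> per document to delete
    }
    $docs = 0;
    foreach ($xml->add as $add) {
        $docs += count($add->doc); // one <doc> per document to index
    }
    return array('deletes' => $deletes, 'docs' => $docs);
}
```

For the sample above, this would report three deletes and one doc.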

Reposted from sillycat.iteye.com/blog/2363224