ElasticSearch (5): JD (Jingdong) Search with SpringBoot + ES + Jsoup



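1. Feature overview

A Jsoup crawler scrapes product information from the JD mall and stores it in ElasticSearch. The application then serves full-text search over that data through HTTP requests and highlights the matched keywords in the results.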

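2. Tools

Jsoup: jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.

httpclient: HttpClient began as a subproject of Apache Jakarta Commons (it is now part of Apache HttpComponents). It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol and supports the protocol's latest versions and recommendations.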

3. Steps

3.1 Create a SpringBoot project

3.2 Select the integration packages to include

3.3 Add the jar dependencies the project needs (watch for version conflicts between SpringBoot and ES)

Version compatibility

| Spring Data Release Train | Spring Data Elasticsearch | Elasticsearch | Spring Framework | Spring Boot |
|---|---|---|---|---|
| 2021.2 (Raj) | 4.4.x | 7.17.4 | 5.3.x | 2.7.x |
| 2021.1 (Q) | 4.3.x | 7.15.2 | 5.3.x | 2.6.x |
| 2021.0 (Pascal) | 4.2.x[1] | 7.12.0 | 5.3.x | 2.5.x |
| 2020.0 (Ockham)[1] | 4.1.x[1] | 7.9.3 | 5.3.2 | 2.4.x |
| Neumann[1] | 4.0.x[1] | 7.6.2 | 5.2.12 | 2.3.x |
| Moore[1] | 3.2.x[1] | 6.8.12 | 5.2.12 | 2.2.x |
| Lovelace[1] | 3.1.x[1] | 6.2.2 | 5.1.19 | 2.1.x |
| Kay[1] | 3.0.x[1] | 5.5.0 | 5.0.13 | 2.0.x |
| Ingalls[1] | 2.1.x[1] | 2.4.0 | 4.3.25 | 1.5.x |
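
To make the table easier to check against a project's pom.xml, its rows can be turned into a small lookup. This is only an illustrative sketch: the class name EsVersionLookup and the method esVersionFor are hypothetical, and the version pairs are simply copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EsVersionLookup {

    // Spring Boot release line -> Elasticsearch version, copied from the table above.
    private static final Map<String, String> BOOT_TO_ES = new LinkedHashMap<>();

    static {
        BOOT_TO_ES.put("2.7", "7.17.4");
        BOOT_TO_ES.put("2.6", "7.15.2");
        BOOT_TO_ES.put("2.5", "7.12.0");
        BOOT_TO_ES.put("2.4", "7.9.3");
        BOOT_TO_ES.put("2.3", "7.6.2");
        BOOT_TO_ES.put("2.2", "6.8.12");
        BOOT_TO_ES.put("2.1", "6.2.2");
        BOOT_TO_ES.put("2.0", "5.5.0");
        BOOT_TO_ES.put("1.5", "2.4.0");
    }

    // Returns the ES version listed for a Boot version such as "2.6.3",
    // or null when the release line is not in the table.
    static String esVersionFor(String bootVersion) {
        String[] parts = bootVersion.split("\\.");
        if (parts.length < 2) {
            return null;
        }
        return BOOT_TO_ES.get(parts[0] + "." + parts[1]);
    }

    public static void main(String[] args) {
        System.out.println("Boot 2.6.3 pairs with ES " + esVersionFor("2.6.3"));
    }
}
```

A null result means the table has no row for that release line, so the compatibility has to be checked elsewhere.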

The following Maven dependencies are required:

<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>

<!-- jsoup parses web pages (tika would be the choice for parsing video) -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.4.6</version>
</dependency>

<!-- HttpClient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
</dependency>

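3.4 Write the ES client configuration class ElasticSearchClientConfig (so that Spring manages the client)

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticSearchClientConfig {

  // Expose one shared client as a Spring bean; ES is expected at 127.0.0.1:9200
  @Bean
  public RestHighLevelClient restHighLevelClient() {
    return new RestHighLevelClient(
      RestClient.builder(
        new HttpHost("127.0.0.1", 9200)));
  }
}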

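3.5 Write the crawler utility class HtmlParseUtil

import cn.hutool.core.net.url.UrlBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// HTML parsing utility class
public class HtmlParseUtil {

  // Quick manual check: crawl two pages of results for "洗衣机" (washing machine)
  public static void main(String[] args) throws IOException {
    List<Content> list = HtmlParseUtil.parseJDSearchKeyByPage("洗衣机", 2);
    System.out.println(list.size());
  }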

  // Crawl result pages 1..page for the keyword and merge them into one list
  public static List<Content> parseJDSearchKeyByPage(String key, int page) throws IOException {
    List<Content> list = new ArrayList<>();
    for (int i = 1; i <= page; i++) {
      List<Content> itemList = HtmlParseUtil.parseJDSearchKey(key, i);
      list.addAll(itemList);
    }
    return list;
  }

  public static List<Content> parseJDSearchKey(String key, int page) throws IOException {
    // Build the URL path and query parameters
    String url = UrlBuilder.create()
      .setScheme("https")
      .setHost("search.jd.com")
      .addPath("Search")
      .addQuery("keyword", key)
      .addQuery("enc", "utf-8")
      .addQuery("page", String.valueOf(2 * page - 1)) // logical page n is sent as page=2n-1
      .build();

    URL url1 = new URL(url);
    HttpURLConnection httpConn = (HttpURLConnection) url1.openConnection();
    httpConn.setRequestMethod("GET");
    // Send browser-like request headers so that JD's anti-crawler checks are less likely to block the request
    httpConn.setRequestProperty("authority", "search.jd.com");
    httpConn.setRequestProperty("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
    httpConn.setRequestProperty("accept-language", "zh-CN,zh;q=0.9");
    httpConn.setRequestProperty("cache-control", "max-age=0");
    httpConn.setRequestProperty("cookie", "__jdv=122270672|direct|-|none|-|1657610731752; __jdu=1657610731752947367087; pinId=zrLGvhk9izSm009P6x9LOw; pin=apple_ggUEIRS; unick=apple_ggUEIRS; ceshi3.com=000; _tp=70MDtYz0RbaKAAA4iyM%2FQQ%3D%3D; _pst=apple_ggUEIRS; shshshfpb=daS4RVr0Yk9w65Hio31lN-g; shshshfpa=03fd05de-1795-e1be-7faa-dbe1342ebbcd-1657504705; rkv=1.0; areaId=12; ipLoc-djd=12-988-0-0; TrackID=1xjK9942JTH1cA13hCy9lpjoF4VUsywFztnHXMZa8fMqdod6dnvsJBqV2ZD7UVJXPOj_9eOcIbRSs8MdtE1dIc4M7Ie1oRPm-h1ZW-hdOnb9Gtb_DRX3_JGb_ZkJexJcQ; qrsc=3; PCSYCityID=CN_320000_320500_0; user-key=93bcac49-c4f4-4018-8b25-0766e0c16eda; cn=0; shshshfp=fc6aabe0109953d6062026a77f8bb1e5; __jda=122270672.1657610731752947367087.1657610732.1657610732.1657610732.1; __jdb=122270672.12.1657610731752947367087|1.1657610732; __jdc=122270672; shshshsID=fcfca37eb1dce4e7ebabf041ed253e70_6_1657612610164; thor=D83906BED82DBCAAD56166802034A7EB66575CF409BC09A49AFAF3487B79FEB995355C1A9063238C46E44EDF6CFED6A8324081B64A2FC4E00045BBAB6836FB7D4A6F24F6FBF97FE1F6A3014B93F3032242CB6FE9BF9D997B81005B34FA33DC1505BFB42E7DA2FE2D5991823CAEC187EE28A13F59C3698528BFD659FBAB4CFF16650B12DA4813475B5BF6F26CFCF2C198; 3AB9D23F7A4B3C9B=4YK7NHSJLWRZZ3CXJ4A22DRHHX7TAZBRBGGHDONJODT3TACJJJ65IS72HOSU4LFNHG6ZV3WAFDYORHCEBRJYYI6ZL4");
    httpConn.setRequestProperty("sec-ch-ua", "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"");
    httpConn.setRequestProperty("sec-ch-ua-mobile", "?0");
    httpConn.setRequestProperty("sec-ch-ua-platform", "\"macOS\"");
    httpConn.setRequestProperty("sec-fetch-dest", "document");
    httpConn.setRequestProperty("sec-fetch-mode", "navigate");
    httpConn.setRequestProperty("sec-fetch-site", "none");
    httpConn.setRequestProperty("sec-fetch-user", "?1");
    httpConn.setRequestProperty("upgrade-insecure-requests", "1");
    httpConn.setRequestProperty("user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");

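    InputStream responseStream = httpConn.getResponseCode() / 100 == 2
      ? httpConn.getInputStream()
      : httpConn.getErrorStream();
    if (responseStream == null) {
      return new ArrayList<>();
    }
    // Read the whole body as UTF-8 (the platform default charset may differ), then close the stream
    String response;
    try (Scanner s = new Scanner(responseStream, "UTF-8").useDelimiter("\\A")) {
      response = s.hasNext() ? s.next() : "";
    }
    Document document = Jsoup.parse(response);
    // Alternative: let Jsoup fetch the page itself
    //        Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36").cookie("wlfstk_smdl","4jxg7p5cy2jz7afp41rull7hc3y9mkjr").timeout(30000).get();
    Element j_goodsList = document.getElementById("J_goodsList");
    // No product list in the page (for example, the request was blocked): return an empty list
    if (j_goodsList == null) {
      return new ArrayList<>();
    }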

    Element gl_warp = j_goodsList.getElementsByClass("gl-warp").get(0);

    ArrayList<Content> contents = new ArrayList<>();
    for (Element child : gl_warp.children()) {
      // The image URL is stored in the lazy-loading attribute, not in src
      String img = child.getElementsByTag("img").eq(0).attr("data-lazy-img");
      String price = child.getElementsByClass("p-price").eq(0).text();
      String name = child.getElementsByClass("p-name").eq(0).text();
      Content content = new Content();
      content.setImg(img);
      content.setTitle(name);
      content.setPrice(price);
      contents.add(content);
    }
    return contents;
  }
}
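
The URL that parseJDSearchKey requests can be reproduced with the JDK alone, which makes the page mapping easier to see. The sketch below is an approximation rather than the article's code: JdSearchUrlSketch and buildSearchUrl are hypothetical names, and URLEncoder stands in for Hutool's UrlBuilder, so the exact encoding of special characters may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class JdSearchUrlSketch {

    // Builds the search URL for a logical result page (1, 2, 3, ...).
    static String buildSearchUrl(String keyword, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            // Same mapping as the crawler: logical page n is sent as page=2n-1.
            int jdPage = 2 * page - 1;
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&enc=utf-8&page=" + jdPage;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("tv set", 2));
    }
}
```

Calling parseJDSearchKeyByPage(key, 2) therefore requests page=1 and page=3.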

3.6 Write the front-end page index.html

<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">

  <head>
    <meta charset="utf-8"/>
    <title>ES JD Clone in Practice</title>
    <link rel="stylesheet" th:href="@{/css/style.css}"/>

  </head>

  <body class="pg">
    <div class="page" id="app">
      <div id="mallPage" class=" mallist tmall- page-not-market ">

        <!-- Header search -->
        <div id="header" class=" header-list-app">
          <div class="headerLayout">
            <div class="headerCon ">
              <!-- Logo-->
              <h1 id="mallLogo">
                <img th:src="@{/images/jdlogo.png}" alt="">
              </h1>

              <div class="header-extra">

                <!-- Search -->
                <div id="mallSearch" class="mall-search">
                  <form name="searchTop" class="mallSearch-form clearfix">
                    <fieldset>
                      <legend>Tmall search</legend>
                      <div class="mallSearch-input clearfix">
                        <div class="s-combobox" id="s-combobox-685">
                          <div class="s-combobox-input-wrap">
                            <input v-model="keyword" type="text" autocomplete="off" value="dd" id="mq"
                                   class="s-combobox-input" aria-haspopup="true"
                                   >
                          </div>
                        </div>
                        <button type="submit" id="searchbtn"   @click.prevent="searchKey">Search</button>
                      </div>
                    </fieldset>
                  </form>
                  <ul class="relKeyTop">
                    <li><a>Java</a></li>
                    <li><a>Front end</a></li>
                    <li><a>Linux</a></li>
                    <li><a>Big data</a></li>
                    <li><a>Finance</a></li>
                  </ul>
                </div>
              </div>
            </div>
          </div>
        </div>

        <!-- Product detail page -->
        <div id="content">
          <div class="main">
            <!-- Brand categories -->
            <form class="navAttrsForm">
              <div class="attrs j_NavAttrs" style="display:block">
                <div class="brandAttr j_nav_brand">
                  <div class="j_Brand attr">
                    <div class="attrKey">
                      Brand
                    </div>
                    <div class="attrValues">
                      <ul class="av-collapse row-2">
                        <li><a href="#">  </a></li>
                        <li><a href="#"> Java </a></li>
                      </ul>
                    </div>
                  </div>
                </div>
              </div>
            </form>

            <!-- Sort options -->
            <div class="filter clearfix">
              <a class="fSort fSort-cur">Overall<i class="f-ico-arrow-d"></i></a>
              <a class="fSort">Popularity<i class="f-ico-arrow-d"></i></a>
              <a class="fSort">New<i class="f-ico-arrow-d"></i></a>
              <a class="fSort">Sales<i class="f-ico-arrow-d"></i></a>
              <a class="fSort">Price<i class="f-ico-triangle-mt"></i><i class="f-ico-triangle-mb"></i></a>
            </div>

            <!-- Product details -->
            <div class="view grid-nosku">

              <!--                    <div class="product">-->
              <!--                        <div class="product-iWrap">-->
              <!--                            &lt;!&ndash;Product cover&ndash;&gt;-->
              <!--                            <div class="productImg-wrap">-->
              <!--                                <a class="productImg">-->
              <!--                                    <img src="https://img.alicdn.com/bao/uploaded/i1/3899981502/O1CN01q1uVx21MxxSZs8TVn_!!0-item_pic.jpg">-->
              <!--                                </a>-->
              <!--                            </div>-->
              <!--                            &lt;!&ndash;Price&ndash;&gt;-->
              <!--                            <p class="productPrice">-->
              <!--                                <em><b>¥</b>2590.00</em>-->
              <!--                            </p>-->
              <!--                            &lt;!&ndash;Title&ndash;&gt;-->
              <!--                            <p class="productTitle">-->
              <!--                                <a> dkny autumn solid-color A-line lace dd dress, same style as in stores </a>-->
              <!--                            </p>-->
              <!--                            &lt;!&ndash; Shop name &ndash;&gt;-->
              <!--                            <div class="productShop">-->
              <!--                                <span>Shop: Java </span>-->
              <!--                            </div>-->
              <!--                            &lt;!&ndash; Transaction info &ndash;&gt;-->
              <!--                            <p class="productStatus">-->
              <!--                                <span>Monthly sales <em>999</em></span>-->
              <!--                                <span>Reviews <a>3</a></span>-->
              <!--                            </p>-->
              <!--                        </div>-->
              <!--                    </div>-->

              <div class="product" v-for="(item,index) in result" :key="index+item">
                <div class="product-iWrap">
                  <!-- Product cover -->
                  <div class="productImg-wrap">
                    <a class="productImg">
                      <img :src="'http:'+item.img">
                    </a>
                  </div>
                  <!-- Price -->
                  <p class="productPrice">
                    <!--                                <em><b>¥</b>2590.00</em>-->
                    <em>{{item.price}}</em>
                  </p>
                  <!-- Title -->
                  <p class="productTitle">
                    <a v-html="item.title">  </a>
                    <!--                                <a> {{item.title}} </a>-->
                  </p>
                  <!-- Shop name -->
                  <div class="productShop">
                    <span>Shop: Java </span>
                  </div>
                  <!-- Transaction info -->
                  <p class="productStatus">
                    <span>Monthly sales <em>999</em></span>
                    <span>Reviews <a>3</a></span>
                  </p>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>

    <script th:src="@{/js/jquery.min.js}"></script>
    <script th:src="@{/js/axios.min.js}"></script>
    <script th:src="@{/js/vue.min.js}"></script>
    <script>
      new Vue({
        el: "#app",
        data: {
          keyword: "",
          result: []
        },
        methods: {
          // POST the keyword to the search endpoint and render the returned hits
          async searchKey() {
            let keyword = this.keyword;
            console.log(keyword);

            let res = await axios.post("ES/Search", {
              keyword,
              pageSize: 20,
              pageNo: 1
            })
            console.log(res);
            if (res != null && res != undefined) {
              // alert("search succeeded")
              this.result = res.data;
            }
          }
        }
      })

    </script>
  </body>
</html>

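3.7 Create the product POJO class Content

import lombok.Data;

// One crawled product; Lombok's @Data generates the getters, setters, equals, hashCode and toString
@Data
public class Content {
    private String img;
    private String title;
    private String price;
}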
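
For readers who have not used Lombok, the following hand-written class shows roughly what @Data contributes. It is an approximation under a hypothetical name (ContentPlain), not Lombok's actual generated output; in particular the hash values and the toString prefix will differ.

```java
import java.util.Objects;

// Hand-written approximation of Content with @Data expanded (hypothetical class name).
class ContentPlain {
    private String img;
    private String title;
    private String price;

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    // Two products are equal when all three fields are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ContentPlain)) return false;
        ContentPlain other = (ContentPlain) o;
        return Objects.equals(img, other.img)
                && Objects.equals(title, other.title)
                && Objects.equals(price, other.price);
    }

    @Override
    public int hashCode() {
        return Objects.hash(img, title, price);
    }

    @Override
    public String toString() {
        return "ContentPlain(img=" + img + ", title=" + title + ", price=" + price + ")";
    }

    public static void main(String[] args) {
        ContentPlain c = new ContentPlain();
        c.setTitle("demo");
        System.out.println(c);
    }
}
```

The getters and setters are the part the rest of the project relies on: the crawler fills the object through the setters, and the JSON conversion reads it back through the getters.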

3.8 Write the crawl-and-sync logic

/** Controller layer **/
@Slf4j
@RestController
@RequestMapping("/ES")
public class ESController {

    @Resource
    EsDataSearchService esDataSearchService;

    /**
     * Crawl products for the keyword and import them into ES.
     * @param keyword search keyword to crawl
     * @return true if the bulk request reported no failures
     * @throws Exception if crawling or indexing fails
     */
    @GetMapping("/data/{keyword}")
    public boolean SynchronizeData(@PathVariable("keyword") String keyword) throws Exception {
        return esDataSearchService.SynchronizeData(keyword);
    }
}

/** Service layer **/
@Service
public class EsDataSearchServiceImpl implements EsDataSearchService {

  @Resource
  RestHighLevelClient restHighLevelClient;

  @Override
  public boolean SynchronizeData(String keyword) throws Exception {
    List<Content> contents = HtmlParseUtil.parseJDSearchKeyByPage(keyword, 2);
    // Nothing crawled: a bulk request with no actions fails validation, so stop here
    if (contents.isEmpty()) {
      return false;
    }
    // Create the bulk request
    BulkRequest jd_goods = new BulkRequest();
    jd_goods.timeout("2m");

    // Add one index request per crawled product
    for (Content content : contents) {
      jd_goods.add(
        new IndexRequest("jd_goods")
        .source(JSON.toJSONString(content), XContentType.JSON)
      );
    }
    // Send the bulk request
    BulkResponse response = restHighLevelClient.bulk(jd_goods, RequestOptions.DEFAULT);
    return !response.hasFailures();
  }
}

Note: the crawled data is collected into a list and then written to ES in a single bulk request.
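
The article sends everything in one bulk request, which is fine for two pages of results. If a crawl grows much larger, one possible refinement is to split the list into fixed-size batches and send one bulk request per batch. The helper below is a sketch of that splitting step only; BulkBatchSketch, partition and the batch size are hypothetical and not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkBatchSketch {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            titles.add("product-" + i);
        }
        // Each inner list would become one bulk request.
        for (List<String> batch : partition(titles, 2)) {
            System.out.println(batch);
        }
    }
}
```

An empty input yields no batches at all, so a loop over the result never sends an empty bulk request.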

3.9 Write the search endpoint

/** Controller layer **/
@PostMapping("/Search")
public List<Content> SearchData(@RequestBody SearchObject searchObject) {
  return esDataSearchService.SearchData(searchObject, true);
}
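
The SearchObject class itself is not listed in the article. Judging from the getters the service calls and from the JSON body the page posts (keyword, pageSize, pageNo), it presumably looks something like the sketch below. The class name SearchObjectSketch, the default values and the offset helper are assumptions added for illustration.

```java
// Hypothetical shape of the request body; field names follow the JSON posted by index.html.
class SearchObjectSketch {
    private String keyword;
    private int pageNo = 1;     // assumed default: first page
    private int pageSize = 20;  // assumed default: matches the value the page sends

    public String getKeyword() { return keyword; }
    public void setKeyword(String keyword) { this.keyword = keyword; }
    public int getPageNo() { return pageNo; }
    public void setPageNo(int pageNo) { this.pageNo = pageNo; }
    public int getPageSize() { return pageSize; }
    public void setPageSize(int pageSize) { this.pageSize = pageSize; }

    // Index of the first hit on the requested page: (pageNo - 1) * pageSize.
    // Page numbers below 1 are treated as page 1 so the offset is never negative.
    public int offset() {
        return (Math.max(pageNo, 1) - 1) * pageSize;
    }

    public static void main(String[] args) {
        SearchObjectSketch request = new SearchObjectSketch();
        request.setKeyword("demo");
        request.setPageNo(3);
        System.out.println("first hit index: " + request.offset());
    }
}
```

The offset is the value the service passes to the search builder's from setting, and pageSize becomes its size.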


/** Service layer **/
@SneakyThrows
@Override
public List<Content> SearchData(SearchObject searchObject, boolean flag) {
  SearchRequest request = new SearchRequest();
  request.indices("jd_goods");
  SearchSourceBuilder builder = new SearchSourceBuilder();

  // Pagination
  builder.from((searchObject.getPageNo() - 1) * searchObject.getPageSize());
  builder.size(searchObject.getPageSize());

  HighlightBuilder highlightBuilder = new HighlightBuilder();
  // Highlight the title even when the match came from another field
  highlightBuilder.requireFieldMatch(false);
  highlightBuilder.preTags("<span style='color:red;'>");
  highlightBuilder.postTags("</span>");
  highlightBuilder.field("title");

  builder.highlighter(highlightBuilder);

  // Phrase match on the title, which also works for Chinese keywords.
  // A term query would only hit when the keyword equals an indexed token exactly,
  // and a bool query without clauses matches every document, so the must clause is required.
  BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
  boolQueryBuilder.must(QueryBuilders.matchPhraseQuery("title", searchObject.getKeyword()));
  builder.query(boolQueryBuilder);
  builder.timeout(new TimeValue(60, TimeUnit.SECONDS));

  request.source(builder);

  // Execute the search
  SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);

  // Collect the results
  List<Content> res = new ArrayList<>();
  SearchHits hits = response.getHits();
  for (SearchHit hit : hits.getHits()) {
    Content content = JSON.parseObject(hit.getSourceAsString(), Content.class);
    Map<String, HighlightField> highlightFields = hit.getHighlightFields();
    HighlightField title = highlightFields.get("title");
    if (title != null) {
      // Replace the stored title with the highlighted fragments joined together
      Text[] fragments = title.fragments();
      StringBuilder str = new StringBuilder();
      for (Text fragment : fragments) {
        str.append(fragment);
      }
      content.setTitle(str.toString());
    }
    res.add(content);
  }

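  // Nothing found on the first attempt: crawl the keyword into ES, then search once more
  if (res.size() == 0 && flag) {
    this.SynchronizeData(searchObject.getKeyword());
    // Wait 1s: newly indexed documents become searchable only after the next index refresh (1s by default)
    Thread.sleep(1000);
    res = this.SearchData(searchObject, false);
  }
  return res;
}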
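
To see what the highlighted title looks like by the time it reaches the page, the effect can be imitated on the client side. The sketch below is only an imitation: ES highlights on the server according to how the field was analyzed, whereas this hypothetical HighlightSketch.highlight helper does a plain substring replacement with the same pre and post tags.

```java
final class HighlightSketch {

    // Wraps every literal occurrence of keyword in title with the given tags.
    static String highlight(String title, String keyword, String preTag, String postTag) {
        if (title == null || keyword == null || keyword.isEmpty()) {
            return title;
        }
        return title.replace(keyword, preTag + keyword + postTag);
    }

    public static void main(String[] args) {
        String html = highlight(
                "Haier washing machine",
                "washing",
                "<span style='color:red;'>",
                "</span>");
        // This string is what the page renders through v-html.
        System.out.println(html);
    }
}
```

Because the title now contains markup, index.html binds it with v-html rather than with text interpolation.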

3.10 Start the project and open it in a browser on the port the application starts on (remember to start the ES service first)
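
Since the application cannot search or index anything unless ES is running, a quick connectivity check before startup can save some debugging. The following is a minimal sketch using only the JDK; EsPortCheck and isListening are hypothetical names, and 127.0.0.1:9200 is the address configured in ElasticSearchClientConfig. It only confirms that something accepts TCP connections on the port, not that it is actually Elasticsearch.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class EsPortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing reachable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean up = isListening("127.0.0.1", 9200, 1000);
        System.out.println(up ? "Port 9200 is open" : "Nothing is listening on port 9200");
    }
}
```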


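4. Summary

Elasticsearch is a distributed, highly scalable, near-real-time search and data analytics engine. It makes it easy to search, analyze, and explore large volumes of data.

It can also serve as a real-time data store, and it scales well: a cluster can grow to hundreds of servers and handle petabytes of data.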


Reposted from blog.csdn.net/qq_27331467/article/details/125862341