Screening Tmall Supermarket coupon products with Java (no crawler involved)

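    Why is no crawler involved? Tmall Supermarket (Maochao) coupons apply only to a limited set of products, and all of them are listed on a single page, so there is nothing to crawl. This post is therefore mainly a refresher on extracting data with regular expressions.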

[Image: screenshot of the coupons]

    These coupons are basically useless, because the eligible products have all had their prices raised first.



    However, for the 618 sale Tmall is also handing out shopping vouchers (not the coupons in the picture above), and I want to buy some milk. So let's see whether these marked-up products end up discounted once a Tmall shopping voucher is added on top.

[Image: screenshot of the shopping voucher]


    Here's the problem: there are too many eligible products, more than 1,000 of them. Find the milk with Ctrl+F?


    Do you see the "..."? Some titles are too long and get cut off. If "milk" is in the part hidden behind the "...", you have no idea where it is. The fix is to disable overflow:hidden in the CSS style.

With that change the titles are shown in full, and the page can be searched.

However, there are still too many items, and checking them one by one is tedious. It seemed better to extract them all and get some practice at the same time.


1. Download the HTML source of the web page.
2. Study the structure of the HTML and extract the data with regular expressions.

   ① Observe the web page:

                    <div class="mui-chaoshi-item mui-chaoshi-item-column columnCount-5" data-tag="item" data-itemid="567987627569">
	                <a class="mui-chaoshi-item-column-inner" href="//detail.tmall.com/item.htm?id=567987627569" target="_blank" data-itemid="567987627569">
					<div class="img-wrapper"><img class="item-img " src="//img.alicdn.com/bao/uploaded/i2/725677994/TB12vbCmStYBeNjSspaXXaOOFXa_!!0-item_pic.jpg_190x190Q50s50.jpg_.webp" alt=""> <img class="soldout-mark" src="//img.alicdn.com/tps/i2/TB1BYYIHpXXXXcEXXXXZ6GBKFXX-150-150.png" style="display:none"></div>
					<div class="item-main">
						<div class="item-info">
							<div class="item-title">Sagacity/尚贤火鸡脆饼178g*2罐(特辣+中辣)网红饼干</div>
						</div>
						<div class="item-imp">
							<div class="imp-main">
								<div class="item-price"> <b class="promotion-price"><span class="mui-price normal red"><b class="mui-price-rmb">¥</b><span class="mui-price-integer">19</span><span class="mui-price-decimal">.9</span></span>
									</b>
								</div>
							</div> <button class="cart j_AddCart" data-itemid="567987627569" data-pic="//img.alicdn.com/bao/uploaded/i2/725677994/TB12vbCmStYBeNjSspaXXaOOFXa_!!0-item_pic.jpg" data-stardandtype="" data-token=""></button> </div>
					</div>
				</a>
			</div>

    Each item is a div. The attributes we need to obtain: itemid (item id), itemTitle (item name), itemPrice (price).

    Note that the integer part and the decimal part of the price are stored separately: the integer part in <span class="mui-price-integer"> and the decimal part in <span class="mui-price-decimal">. For a whole-yuan price such as ¥20.00, there is no decimal span at all.
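
Since the price arrives in two pieces, it has to be joined before it can be compared or sorted. The following is a minimal sketch of one way to do that; the class name PriceParts and the method toPrice are made up for illustration and are not part of the original program, which simply concatenates the two strings when writing the CSV.

```java
import java.math.BigDecimal;

final class PriceParts {
    /**
     * Joins the text of the integer span (e.g. "19") with the text of the
     * decimal span (e.g. ".9"). Pass "" or null when the decimal span is absent.
     */
    static BigDecimal toPrice(String integerPart, String decimalPart) {
        String decimals = (decimalPart == null) ? "" : decimalPart.trim();
        return new BigDecimal(integerPart.trim() + decimals);
    }

    public static void main(String[] args) {
        System.out.println(toPrice("19", ".9")); // first sample item: 19.9
        System.out.println(toPrice("69", ""));   // item without a decimal span: 69
    }
}
```

BigDecimal is used here rather than double so that prices such as 19.9 keep their exact decimal value.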

Now look at the following div, which is a sold-out product. On-sale and sold-out items differ in the class attribute: a sold-out item has an extra soldout class on its outer div.

			<div class="mui-chaoshi-item mui-chaoshi-item-column columnCount-5 soldout" data-tag="item" data-itemid="521997816724">
				<a class="mui-chaoshi-item-column-inner" href="//detail.tmall.com/item.htm?id=521997816724" target="_blank" data-itemid="521997816724">
					<div class="img-wrapper"><img class="item-img " data-ks-lazyload="//img.alicdn.com/bao/uploaded/i2/725677994/TB1IeCbq_dYBeNkSmLyXXXfnVXa_!!0-item_pic.jpg" src="//g.alicdn.com/s.gif" alt=""> <img class="soldout-mark" src="//img.alicdn.com/tps/i2/TB1BYYIHpXXXXcEXXXXZ6GBKFXX-150-150.png" style="display:none"></div>
					<div class="item-main">
						<div class="item-info">
							<div class="item-title">GuyLian吉利莲比利时进口金贝壳夹心巧克力礼盒装送女友生日礼物</div>
						</div>
						<div class="item-imp">
							<div class="imp-main">
								<div class="item-price"> <b class="promotion-price"><span class="mui-price normal red"><b class="mui-price-rmb">¥</b><span class="mui-price-integer">69</span></span>
									</b>
								</div>
							</div> <button class="cart j_AddCart" data-itemid="521997816724" data-pic="//img.alicdn.com/bao/uploaded/i2/725677994/TB1IeCbq_dYBeNkSmLyXXXfnVXa_!!0-item_pic.jpg" data-stardandtype="" data-token=""></button> </div>
					</div>
				</a>
			</div>
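
The split used in the main function further down consumes the opening tag of each item, so the soldout class is no longer visible afterwards. If the sold-out state matters, one option is to match the opening tag with a capture group instead. The sketch below assumes that every item uses exactly the class list seen in the two samples (including columnCount-5); SoldOutScan and soldOutIds are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SoldOutScan {
    // Opening tag of an item div: group 1 is " soldout" when present, group 2 is the item id.
    private static final Pattern ITEM_OPEN = Pattern.compile(
        "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5( soldout)?\""
        + " data-tag=\"item\" data-itemid=\"(\\d+)\"");

    /** Returns the ids of all items whose outer div carries the soldout class. */
    static List<String> soldOutIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = ITEM_OPEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) { // the optional group matched, so the item is sold out
                ids.add(m.group(2));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // The two opening tags from the samples above.
        String html =
            "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5\""
            + " data-tag=\"item\" data-itemid=\"567987627569\">"
            + "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5 soldout\""
            + " data-tag=\"item\" data-itemid=\"521997816724\">";
        System.out.println(soldOutIds(html)); // [521997816724]
    }
}
```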

    ② Process the HTML with Java

        a. Read the file as a character stream

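	/**
	 * Reads the HTML page into a string. Line breaks are dropped,
	 * so the whole page ends up on a single line.
	 * Requires java.io.* and java.nio.charset.StandardCharsets.
	 * @param path path of the saved HTML file
	 * @return the page content
	 */
	public StringBuilder getHtml(String path) {
		StringBuilder old = new StringBuilder();
		// try-with-resources closes the reader even if opening or reading fails
		try (BufferedReader br = new BufferedReader(
				new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
			String line;
			while ((line = br.readLine()) != null) {
				old.append(line);
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
		return old;
	}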
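
On Java 11 or later, the same step can be written more compactly with Files.readString. This is only an alternative sketch, not the code used in this post; HtmlLoader and load are illustrative names. The line breaks are removed explicitly to mirror what the readLine loop does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlLoader {
    /** Reads the whole file as UTF-8 and removes line breaks, like the readLine loop. */
    static String load(Path path) throws IOException {
        String html = Files.readString(path, StandardCharsets.UTF_8); // Java 11+
        return html.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a small temporary file instead of the real page.
        Path tmp = Files.createTempFile("coupon", ".html");
        Files.writeString(tmp, "<div>\n  milk\n</div>\n", StandardCharsets.UTF_8);
        System.out.println(load(tmp)); // <div>  milk</div>
        Files.delete(tmp);
    }
}
```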

    b. Use regular expressions to process each div

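	/**
	 * Extracts the useful attributes from old[] into a List<String[]>.
	 * @param old the string array produced by the split
	 * @return one String[4] per item: id, title, integer part and decimal part of the price
	 */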
	public List<String[]> fetchAttribute(String[] old){
		List<String[]> list = new ArrayList<>();
		String angle="<.*?>";
		Pattern patternId = Pattern.compile("target=\"_blank\" data-itemid=\"\\d+");    // matches the item ID
		Pattern patternTitle = Pattern.compile("<div class=\"item-title\">.*?</div>");    // matches the item title
		Pattern patternInteger = Pattern.compile("<span class=\"mui-price-integer\">.*?</span>");    // matches the integer part of the price
		Pattern patternDecimal = Pattern.compile("<span class=\"mui-price-decimal\">.*?</span>");    // matches the decimal part of the price
		
		for(int i=0;i<old.length;i++) {
			String oldNow=old[i];
			Matcher matchId = patternId.matcher(oldNow);
			String[] attribute=null;
			if(matchId.find()) {
				attribute=new String[4];
				attribute[0]=matchId.group().replaceAll("target=\"_blank\" data-itemid=\"", "");
			}else {
				continue;
			}
			Matcher matchTitle = patternTitle.matcher(oldNow);
			if(matchTitle.find()) {
				attribute[1]=matchTitle.group().replaceAll(angle, "");
			}else {
				attribute[1]="";
			}
			Matcher matchInteger = patternInteger.matcher(oldNow);
			if(matchInteger.find()) {
				attribute[2]=matchInteger.group().replaceAll(angle, "");
			}else {
				attribute[2]="";
			}
			Matcher matchDecimal = patternDecimal.matcher(oldNow);
			if(matchDecimal.find()) {
				attribute[3]=matchDecimal.group().replaceAll(angle, "");
			}else {
				attribute[3]="";
			}
			list.add(attribute);
		}
		
		return list;
	}
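
The method above matches the whole tag and then strips the markup with replaceAll. An alternative is to put a capture group around the wanted text so that no second pass is needed. The sketch below shows that variant under a few assumptions: GroupExtract and firstGroup are hypothetical names, and the ID pattern is simplified to any data-itemid attribute. Unlike the replaceAll version, it would not strip tags nested inside a title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExtract {
    static final Pattern ID = Pattern.compile("data-itemid=\"(\\d+)\"");
    static final Pattern TITLE = Pattern.compile("<div class=\"item-title\">(.*?)</div>");
    static final Pattern INTEGER = Pattern.compile("<span class=\"mui-price-integer\">(.*?)</span>");
    static final Pattern DECIMAL = Pattern.compile("<span class=\"mui-price-decimal\">(.*?)</span>");

    /** Returns the text of capture group 1 for the first match, or "" if there is none. */
    static String firstGroup(Pattern pattern, String block) {
        Matcher m = pattern.matcher(block);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // A shortened item block; the title text is a placeholder.
        String block = "<a target=\"_blank\" data-itemid=\"567987627569\">"
            + "<div class=\"item-title\">Sample title</div>"
            + "<span class=\"mui-price-integer\">19</span>"
            + "<span class=\"mui-price-decimal\">.9</span></a>";
        System.out.println(firstGroup(ID, block));      // 567987627569
        System.out.println(firstGroup(TITLE, block));   // Sample title
        System.out.println(firstGroup(INTEGER, block)
            + firstGroup(DECIMAL, block));              // 19.9
    }
}
```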

    c. Save the result as a CSV file; it is not stored in a database here.

	/**
	 * Saves the extracted attributes as a CSV file.
	 * Requires java.io.* and java.util.List.
	 * @param path path of the CSV file to write
	 * @param attribute rows of {id, title, integer part of price, decimal part of price}
	 */
	public void saveInCsv(String path, List<String[]> attribute) {
		// FileWriter creates the file if needed and uses the platform default charset;
		// try-with-resources closes (and thereby flushes) the writer even on failure
		try (BufferedWriter bw = new BufferedWriter(new FileWriter(path))) {
			bw.write("商品ID,商品名称,价格"); // header row: item ID, item name, price
			bw.newLine();
			for (String[] temp : attribute) {
				// titles are written unquoted, so a comma inside a title shifts the columns
				bw.write(temp[0] + "," + temp[1] + "," + temp[2] + temp[3]);
				bw.newLine();
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
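
Product titles are free text, so a title containing a comma would break the column layout of the CSV. A common remedy is to quote such fields in the way RFC 4180 describes: wrap the field in double quotes and double any quote inside it. The helper below is a sketch of that idea and is not part of the original program; CsvField and escape are illustrative names, and the sample title is invented.

```java
final class CsvField {
    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
            || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String row = String.join(",", "1", escape("Milk 250ml,12 packs"), "39.9");
        System.out.println(row); // 1,"Milk 250ml,12 packs",39.9
    }
}
```

In saveInCsv, the title column would then be written as CsvField.escape(temp[1]) instead of temp[1].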

    main function:

		String path = "coupon.html";
		FetchFromHtml ffh = new FetchFromHtml();
		String old=ffh.getHtml(path).toString();
		
		String splitRegex="<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5( soldout)?\" data-tag=\"item\"";
		String[] spliters=old.split(splitRegex);
		List<String[]> list2=ffh.fetchAttribute(spliters);
		System.out.println(list2.size());
		ffh.saveInCsv("out.csv", list2);

The program prints 1305, so 1,305 items were extracted in total.
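
With all items in a list, the original goal of finding the milk becomes a simple filter on the title column. The sketch below assumes the row layout produced by fetchAttribute (title at index 1); KeywordFilter and byTitle are hypothetical names, and the sample rows are invented English stand-ins. On the real page the keyword would be the Chinese word for milk, 牛奶.

```java
import java.util.ArrayList;
import java.util.List;

final class KeywordFilter {
    /** Keeps the rows whose title (index 1) contains the keyword. */
    static List<String[]> byTitle(List<String[]> rows, String keyword) {
        List<String[]> hits = new ArrayList<>();
        for (String[] row : rows) {
            if (row[1].contains(keyword)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample rows: {id, title, integer part, decimal part}.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "Pure milk 250ml*12", "39", ".9"});
        rows.add(new String[] {"2", "Chocolate gift box", "69", ""});
        for (String[] row : byTitle(rows, "milk")) {
            System.out.println(row[0] + " " + row[1] + " " + row[2] + row[3]);
        }
        // prints: 1 Pure milk 250ml*12 39.9
    }
}
```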


HTML file download: https://pan.baidu.com/s/1sPnEUi-MxWtDGrg6Y_MVtg (password: 1bnj)


Origin blog.csdn.net/ever_who/article/details/80781394