A Hands-On Web Crawler in Golang

Foreword

Crawling has traditionally been Python's strong suit. I studied scrapy early on and wrote some simple crawler scripts, but then I got interested in golang and decided to write a crawler to practice. Since I'm new to golang, please point out any mistakes you find.

General idea

  • Since many pages are now rendered dynamically, use WebDriver to drive Chrome (or another browser) to render the page first, and then grab the data. (I started with PhantomJS, but it is no longer maintained and its performance was poor.)
  • Crawlers generally run on Linux systems, so use Chrome's headless mode.
  • After the data is scraped, save it to a CSV file and send it out by email. (A minimal sketch of how the pieces fit together follows this list.)
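
A rough sketch of how the pieces fit together (the function names are the ones defined later in this article, and the package-level service, webDriver and csvFile variables are introduced with them):

func main() {
	SetupWriter()          // create the CSV file first
	StartChrome()          // start chromedriver and open the target page
	defer service.Stop()   // stop chromedriver on exit
	defer webDriver.Quit() // close the browser
	defer csvFile.Close()  // close the CSV file
	StartCrawler()         // scrape the data, write the CSV, email the result
}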

Shortcomings

  • Because every page has to be rendered, crawling is much slower; even with image loading disabled, the speed is not ideal.
  • Since I'm just starting out, I haven't added any concurrency, for fear of blowing up the memory.
  • The data is written to a file rather than a database, which is not a long-term solution.

Required libraries

  • github.com/tebeka/selenium
    • A golang port of Selenium; it implements most of Selenium's functionality.
  • gopkg.in/gomail.v2
    • The library used to send email. It hasn't been updated in a long time, but it is good enough.

Download the dependencies

  • I originally planned to manage dependencies with dep, but it turned out to have quite a few pitfalls.
  • So download the dependencies via go get:
go get github.com/tebeka/selenium
go get gopkg.in/gomail.v2
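
If you are on a modules-based toolchain (Go 1.11+), the equivalent is (the module path below is just a placeholder):

go mod init example.com/crawler
go get github.com/tebeka/selenium
go get gopkg.in/gomail.v2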

Code

  • Start chromedriver, which drives the Chrome browser.
// StartChrome starts Chrome in headless mode
func StartChrome() {
	opts := []selenium.ServiceOption{}
	caps := selenium.Capabilities{
		"browserName": "chrome",
	}

	// Disable image loading to speed up rendering
	imagCaps := map[string]interface{}{
		"profile.managed_default_content_settings.images": 2,
	}

	chromeCaps := chrome.Capabilities{
		Prefs: imagCaps,
		Path:  "",
		Args: []string{
			"--headless", // run Chrome in headless mode
			"--no-sandbox",
			"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7", // spoof the user agent to avoid anti-crawling measures
		},
	}
	caps.AddChrome(chromeCaps)
	// Start chromedriver; the port number can be customized
	service, err = selenium.NewChromeDriverService("/opt/google/chrome/chromedriver", 9515, opts...)
	if err != nil {
		log.Printf("Error starting the ChromeDriver server: %v", err)
	}
	// Bring up the Chrome browser
	webDriver, err = selenium.NewRemote(caps, fmt.Sprintf("http://localhost:%d/wd/hub", 9515))
	if err != nil {
		panic(err)
	}
	// A quirk of the target site: without this cookie, Linux clients are served
	// the mobile page. Every site has its own policy and needs its own handling.
	webDriver.AddCookie(&selenium.Cookie{
		Name:  "defaultJumpDomain",
		Value: "www",
	})
	// Navigate to the target site
	err = webDriver.Get(urlBeijing)
	if err != nil {
		panic(fmt.Sprintf("Failed to load page: %s\n", err))
	}
	log.Println(webDriver.Title())
}

The code above starts Chrome and navigates to the target site, ready for the data scraping that follows.
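
The snippets in this article use several package-level variables without showing their declarations. Reconstructed, they and the imports would look roughly like this:

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strings"
	"time"

	"github.com/tebeka/selenium"
	"github.com/tebeka/selenium/chrome"
	"gopkg.in/gomail.v2"
)

var (
	service   *selenium.Service    // chromedriver process handle
	webDriver selenium.WebDriver   // browser session
	csvFile   *os.File             // output file
	writer    *csv.Writer          // CSV writer over csvFile
	dateTime  string               // timestamp used as the file name
	err       error
)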

  • Initialize the CSV file where the scraped data will be stored.
// SetupWriter initializes the CSV file
func SetupWriter() {
	dateTime = time.Now().Format("2006-01-02 15:04:05") // the layout string is Go's fixed reference time (Mon Jan 2 15:04:05 MST 2006), not an arbitrary date
	os.Mkdir("data", os.ModePerm)
	csvFile, err = os.Create(fmt.Sprintf("data/%s.csv", dateTime)) // assign to the package-level csvFile (":=" would shadow it and break the deferred Close later)
	if err != nil {
		panic(err)
	}
	csvFile.WriteString("\xEF\xBB\xBF") // write a UTF-8 BOM so Excel renders the Chinese headers correctly
	writer = csv.NewWriter(csvFile)
	writer.Write([]string{"车型", "行驶里程", "首次上牌", "价格", "所在地", "门店"})
}

Data scraping

This part is the core business logic. Every website is scraped a little differently, but the idea is the same: locate elements via XPath, CSS selectors, class names, tag names, and so on, and then read their content. The selenium API covers most of the operations you need; from the selenium source code you can see that the core API consists of WebDriver and WebElement. Below is the process I used to scrape Beijing used car listings from Che168 (二手车之家); for other sites, adapt the same process.

  • Open the Che168 site in the Safari browser and grab the URL of the Beijing used car listing page:
const urlBeijing = "https://www.che168.com/beijing/list/#pvareaid=104646"
  • Right-click the page and choose "Inspect Element" to enter developer mode, where you can see all of the page's data:
<ul class="fn-clear certification-list" id="viewlist_ul">

Hover over that line in the inspector, right-click it, and choose Copy → Copy XPath to get the XPath of the element:

//*[@id="viewlist_ul"]

Then, in code:

listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")

This gives you the WebElement object for that HTML node. It's easy to see that this is the parent container of all the listings; to pull out the actual data you still need to locate each child element, which you can identify in developer mode.
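
One caveat: since the page is rendered dynamically, the container may not exist yet at the moment FindElement runs. A minimal sketch of guarding against that with WaitWithTimeout (available in recent versions of tebeka/selenium; the 10-second timeout is an arbitrary choice):

// Poll until the list container appears, or give up after 10 seconds
err := webDriver.WaitWithTimeout(func(wd selenium.WebDriver) (bool, error) {
	_, err := wd.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")
	return err == nil, nil
}, 10*time.Second)
if err != nil {
	log.Println("list container did not appear:", err)
}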

Developer tools show that each listing element has the class carinfo. Because there are multiple elements with this class, use:

lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")

This returns the collection of all listing elements. To extract the data from each one, traverse the collection:

for i := 0; i < len(lists); i++ {
	var urlElem selenium.WebElement
	if pageIndex == 1 {
		urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
	} else {
		urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
	}
	if err != nil {
		break
	}
	// Some of the data lives on a detail page, so we need to navigate to it
	url, err := urlElem.GetAttribute("href")
	if err != nil {
		break
	}
	webDriver.Get(url)
	title, _ := webDriver.Title()
	log.Printf("Current page title: %s\n", title)
	// Get the car model
	modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
	var model string
	if err != nil {
		log.Println(err)
		model = "暂无"
	} else {
		model, _ = modelElem.Text()
	}
	log.Printf("model=[%s]\n", model)

	...

	// Write the row to the CSV
	writer.Write([]string{model, miles, date, price, position, store})
	writer.Flush()
	webDriver.Back() // go back to the listing page and repeat for the next item
}

The full source is below. I'm a beginner, so please go easy on me~~

// StartCrawler starts scraping the data
func StartCrawler() {
	log.Println("Start Crawling at ", time.Now().Format("2006-01-02 15:04:05"))
	pageIndex := 0
	for {
		listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")
		if err != nil {
			panic(err)
		}
		lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")
		if err != nil {
			panic(err)
		}
		log.Println("Number of listings:", len(lists))
		pageIndex++
		log.Printf("Scraping page %d...\n", pageIndex)
		for i := 0; i < len(lists); i++ {
			var urlElem selenium.WebElement
			if pageIndex == 1 {
				urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
			} else {
				urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
			}
			if err != nil {
				break
			}
			url, err := urlElem.GetAttribute("href")
			if err != nil {
				break
			}
			webDriver.Get(url)
			title, _ := webDriver.Title()
			log.Printf("Current page title: %s\n", title)

			modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
			var model string
			if err != nil {
				log.Println(err)
				model = "暂无"
			} else {
				model, _ = modelElem.Text()
			}
			log.Printf("model=[%s]\n", model)

			priceElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[2]/div/ins")
			var price string
			if err != nil {
				log.Println(err)
				price = "暂无"
			} else {
				price, _ = priceElem.Text()
				price = fmt.Sprintf("%s万", price)
			}
			log.Printf("price=[%s]\n", price)

			milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[1]/span")
			var miles string
			if err != nil {
				log.Println(err)
				milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[1]/span")
				if err != nil {
					log.Println(err)
					miles = "暂无"
				} else {
					miles, _ = milesElem.Text()
				}
			} else {
				miles, _ = milesElem.Text()
			}
			log.Printf("miles=[%s]\n", miles)

			timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[2]/span")
			var date string
			if err != nil {
				log.Println(err)
				timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[2]/span")
				if err != nil {
					log.Println(err)
					date = "暂无"
				} else {
					date, _ = timeElem.Text()
				}
			} else {
				date, _ = timeElem.Text()
			}
			log.Printf("time=[%s]\n", date)

			positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[4]/span")
			var position string
			if err != nil {
				log.Println(err)
				positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[4]/span")
				if err != nil {
					log.Println(err)
					position = "暂无"
				} else {
					position, _ = positionElem.Text()
				}
			} else {
				position, _ = positionElem.Text()
			}
			log.Printf("position=[%s]\n", position)

			storeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/div/div/div")
			var store string
			if err != nil {
				log.Println(err)
				store = "暂无"
			} else {
				store, _ = storeElem.Text()
				store = strings.Replace(store, "商家|", "", -1)
				if strings.Contains(store, "金牌店铺") {
					store = strings.Replace(store, "金牌店铺", "", -1)
				}
			}
			log.Printf("store=[%s]\n", store)
			writer.Write([]string{model, miles, date, price, position, store})
			writer.Flush()
			webDriver.Back()
		}
		log.Printf("Page %d scraped, moving on to the next page...\n", pageIndex)
		nextButton, err := webDriver.FindElement(selenium.ByClassName, "page-item-next")
		if err != nil {
			log.Println("All data has been scraped!")
			break
		}
		nextButton.Click()
	}
	log.Println("Crawling Finished at ", time.Now().Format("2006-01-02 15:04:05"))
	sendResult(dateTime)
}
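
A side note, not part of the original code: the same "try one XPath, fall back to another, default to 暂无" pattern repeats for every field above. A small helper along these lines would shrink StartCrawler considerably:

// findText tries each XPath in turn and returns the text of the first
// element found, falling back to "暂无" (not available) when nothing matches.
func findText(wd selenium.WebDriver, xpaths ...string) string {
	for _, xp := range xpaths {
		if elem, err := wd.FindElement(selenium.ByXPATH, xp); err == nil {
			if text, err := elem.Text(); err == nil {
				return text
			}
		}
	}
	return "暂无"
}

With it, the mileage lookup becomes a single line:

miles := findText(webDriver, "/html/body/div[5]/div[2]/div[4]/ul/li[1]/span", "/html/body/div[5]/div[2]/div[3]/ul/li[1]/span")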

  • Send the result by email.

The complete function is shown below; it is fairly straightforward, so I won't go over it in detail.

func sendResult(fileName string) {
	email := gomail.NewMessage()
	email.SetAddressHeader("From", "re**[email protected]", "张**")
	email.SetHeader("To", email.FormatAddress("li**[email protected]", "李**"))
	email.SetHeader("Cc", email.FormatAddress("zhang**[email protected]", "张**"))
	email.SetHeader("Subject", "二手车之家-北京-二手车信息")
	email.SetBody("text/plain;charset=UTF-8", "本周抓取到的二手车信息数据,请注意查收!\n")
	email.Attach(fmt.Sprintf("data/%s.csv", fileName))

	dialer := &gomail.Dialer{
		Host:     "smtp.163.com",
		Port:     25,
		Username: ${your_email},    // replace with your own email address
		Password: ${smtp_password}, // your SMTP server password
		SSL:      false,
	}
	if err := dialer.DialAndSend(email); err != nil {
		log.Println("Failed to send the email! err: ", err)
		return
	}
	log.Println("Email sent successfully!")
}
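
Note that many cloud hosts block outbound port 25. If that applies to your environment (an assumption, so check your provider), gomail also supports implicit SSL on port 465:

dialer := &gomail.Dialer{
	Host:     "smtp.163.com",
	Port:     465,
	Username: ${your_email},
	Password: ${smtp_password},
	SSL:      true, // implicit SSL instead of plain SMTP
}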

  • Finally, remember to release resources (these defers belong in main, as in the sketch near the top of this article):
defer service.Stop()   // stop chromedriver
defer webDriver.Quit() // close the browser
defer csvFile.Close()  // close the file stream

Summary

  • I'm just starting to learn golang and used this crawler project as practice. The code is fairly rough, with no real engineering behind it; I hope it doesn't mislead anyone.
  • There are few other golang crawler projects to learn from, so this write-up also includes some of my own findings; I hope it can help others.
  • Finally, a recommendation: Pholcus, a crawler framework written by a Go expert, is powerful and currently one of the more complete frameworks available.
