Python Monitor Water Falls(4) Crawler and Scrapy
Create a virtual env
> python3 -m venv ./pythonenv
Use that ENV
> source ./pythonenv/bin/activate
> pip install scrapy
> pip install scrapyd
Check version
> scrapy --version
Scrapy 1.5.0 - project: scrapy_clawer
> scrapyd --version
twistd (the Twisted daemon) 17.9.0
Copyright (c) 2001-2016 Twisted Matrix Laboratories.
See LICENSE for details.
> pip install selenium
Install PhantomJS
http://phantomjs.org/download.html
Download the zip file and place it in the working directory
Check the version
> phantomjs --version
2.1.1
Warning Message:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
As of Mar 21, 2018, there are no new PhantomJS releases.
Solution:
Run headless Chrome instead
https://intoli.com/blog/running-selenium-with-headless-chrome/
Install on macOS
> brew install chromedriver
Check if it is there
> chromedriver
Starting ChromeDriver 2.36.540469 (1881fd7f8641508feb5166b7cae561d87723cfa8) on port 9515
Only local connections are allowed.
Change the Python code as follows, and it works again:
from selenium import webdriver

options = webdriver.ChromeOptions()
#options.binary_location = '/usr/bin/google-chrome-unstable'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
browser = webdriver.Chrome(chrome_options=options)

browser.get('https://hydromet.lcra.org/riverreport')
tables = browser.find_elements_by_css_selector('table.table-condensed')
tbody = tables[5].find_element_by_tag_name("tbody")
for row in tbody.find_elements_by_tag_name("tr"):
    cells = row.find_elements_by_tag_name("td")
    if cells[0].text == 'Marble Falls (Starcke)':
        print(cells[1].text)
browser.quit()
Run everything with Scrapy and check that it is working well.
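The row-filtering logic in the Selenium code above can be exercised without a browser. Here is a minimal sketch using only the standard library's html.parser; the table markup below is a hypothetical stand-in for the real river-report page, and the helper names are my own:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

def flow_for(html, site="Marble Falls (Starcke)"):
    """Return the second cell of the row whose first cell matches `site`."""
    parser = TableRows()
    parser.feed(html)
    for cells in parser.rows:
        if cells and cells[0] == site:
            return cells[1]
    return None

# Hypothetical sample markup in the shape of the page's table
sample = """<table><tbody>
<tr><td>Buchanan (Buchanan)</td><td>0 cfs</td></tr>
<tr><td>Marble Falls (Starcke)</td><td>1,234 cfs</td></tr>
</tbody></table>"""
print(flow_for(sample))  # → 1,234 cfs
```

This makes the cell-matching logic easy to unit-test before pointing the real browser at the live page.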
Another way to create ENV
> virtualenv env
> . env/bin/activate
Check the version of a Python module (`pip show` is built into pip)
> pip install selenium
> pip show selenium | grep Version
Version: 3.11.0
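The same version check can also be done from inside Python. A minimal sketch using pkg_resources (which ships with setuptools); the helper name is my own:

```python
import pkg_resources  # runtime API that ships with setuptools

def module_version(name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None

print(module_version("selenium"))  # e.g. "3.11.0" if installed, else None
```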
I have a file named requirements.txt
selenium==3.11.0
I can run
> pip install -r requirements.txt
Here is how to generate the requirements.txt
> pip freeze > requirements.txt
How to run the spider locally
> scrapy crawl quotes
Prepare the Deployment ENV
> pip install scrapyd
> pip install scrapyd-client
Start the Server
> scrapyd
Deploy the spider
> scrapyd-deploy
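scrapyd-deploy reads its target from the project's scrapy.cfg; a minimal sketch of the [deploy] section it expects (the project name here is an assumption, matching the `default` project used in the curl commands below):

```ini
[deploy]
url = http://localhost:6800/
project = default
```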
List the Projects and spiders
curl http://localhost:6800/listprojects.json
curl http://localhost:6800/listspiders.json?project=default
curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
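The schedule.json call above returns a small JSON reply. A minimal sketch of driving the same endpoint from Python with only the standard library; the helper names are my own, and the parsing function is demonstrated on a sample reply rather than a live server:

```python
import json
import urllib.parse
import urllib.request

def schedule_spider(project, spider, base_url="http://localhost:6800"):
    """POST to scrapyd's schedule.json endpoint; return the parsed JSON reply."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    with urllib.request.urlopen(base_url + "/schedule.json", data=data) as resp:
        return json.loads(resp.read().decode())

def job_id(raw_reply):
    """Extract the job id from a schedule.json reply string, or None on error."""
    reply = json.loads(raw_reply)
    return reply.get("jobid") if reply.get("status") == "ok" else None

# A sample reply in the shape scrapyd returns:
print(job_id('{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511"}'))
# → 6487ec79947edab326d6db28a2d86511
```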
Investigate the Requests
https://requestbin.fullcontact.com/
https://hookbin.com/
A lot of details are in the project monitor-water
References:
mysql
http://sillycat.iteye.com/blog/2393787
haproxy
http://sillycat.iteye.com/blog/2066118
http://sillycat.iteye.com/blog/1055846
http://sillycat.iteye.com/blog/562645
https://intoli.com/blog/running-selenium-with-headless-chrome/
Reposted from sillycat.iteye.com/blog/2414301