Python Monitor Water Falls(4) Crawler and Scrapy
Create a virtual env
> python3 -m venv ./pythonenv
Use that ENV
> source ./pythonenv/bin/activate
> pip install scrapy
> pip install scrapyd
Check version
> scrapy --version
Scrapy 1.5.0 - project: scrapy_clawer
> scrapyd --version
twistd (the Twisted daemon) 17.9.0
Copyright (c) 2001-2016 Twisted Matrix Laboratories.
See LICENSE for details.
> pip install selenium
Install PhantomJS
http://phantomjs.org/download.html
Download the zip file and place it in the working directory
Check the version
> phantomjs --version
2.1.1
Warning Message:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
As of Mar 21, 2018, there are no new PhantomJS releases.
Solution:
Run headless Chrome instead
https://intoli.com/blog/running-selenium-with-headless-chrome/
Install on macOS
> brew install chromedriver
Check if it is there
> chromedriver
Starting ChromeDriver 2.36.540469 (1881fd7f8641508feb5166b7cae561d87723cfa8) on port 9515
Only local connections are allowed.
Change the Python code as follows, and it works again:
from selenium import webdriver

options = webdriver.ChromeOptions()
#options.binary_location = '/usr/bin/google-chrome-unstable'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
browser = webdriver.Chrome(chrome_options=options)

browser.get('https://hydromet.lcra.org/riverreport')
tables = browser.find_elements_by_css_selector('table.table-condensed')
tbody = tables[5].find_element_by_tag_name("tbody")
for row in tbody.find_elements_by_tag_name("tr"):
    cells = row.find_elements_by_tag_name("td")
    if cells[0].text == 'Marble Falls (Starcke)':
        print(cells[1].text)
browser.quit()
Run everything with Scrapy and check that it is working well.
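The row-filtering logic in the Selenium code above can be exercised without a browser. Here is a minimal sketch using only the standard library's html.parser; the table markup below is a hypothetical stand-in for the real river-report page, and the helper names are my own:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

def flow_for(html, site="Marble Falls (Starcke)"):
    """Return the second cell of the row whose first cell matches `site`."""
    parser = TableRows()
    parser.feed(html)
    for cells in parser.rows:
        if cells and cells[0] == site:
            return cells[1]
    return None

# Hypothetical sample markup in the shape of the page's table
sample = """<table><tbody>
<tr><td>Buchanan (Buchanan)</td><td>0 cfs</td></tr>
<tr><td>Marble Falls (Starcke)</td><td>1,234 cfs</td></tr>
</tbody></table>"""
print(flow_for(sample))  # → 1,234 cfs
```

This makes the cell-matching logic easy to unit-test before pointing the real browser at the live page.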
Another way to create ENV
> virtualenv env
> . env/bin/activate
Check the version of a Python module (`pip show` is built into pip)
> pip install selenium
> pip show selenium | grep Version
Version: 3.11.0
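The same version check can also be done from inside Python. A minimal sketch using pkg_resources (which ships with setuptools); the helper name is my own:

```python
import pkg_resources  # runtime API that ships with setuptools

def module_version(name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None

print(module_version("selenium"))  # e.g. "3.11.0" if installed, else None
```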
I have a file named requirements.txt
selenium==3.11.0
I can run
> pip install -r requirements.txt
Here is how to generate the requirements.txt
> pip freeze > requirements.txt
How to run the spider locally
> scrapy crawl quotes
Prepare the Deployment ENV
> pip install scrapyd
> pip install scrapyd-client
Start the Server
> scrapyd
Deploy the spider
> scrapyd-deploy
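scrapyd-deploy reads its target from the project's scrapy.cfg; a minimal sketch of the [deploy] section it expects (the project name here is an assumption, matching the `default` project used in the curl commands below):

```ini
[deploy]
url = http://localhost:6800/
project = default
```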
List the Projects and spiders
curl http://localhost:6800/listprojects.json
curl http://localhost:6800/listspiders.json?project=default
curl http://localhost:6800/schedule.json -d project=default -d spider=quotes
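The schedule.json call above returns a small JSON reply. A minimal sketch of driving the same endpoint from Python with only the standard library; the helper names are my own, and the parsing function is demonstrated on a sample reply rather than a live server:

```python
import json
import urllib.parse
import urllib.request

def schedule_spider(project, spider, base_url="http://localhost:6800"):
    """POST to scrapyd's schedule.json endpoint; return the parsed JSON reply."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    with urllib.request.urlopen(base_url + "/schedule.json", data=data) as resp:
        return json.loads(resp.read().decode())

def job_id(raw_reply):
    """Extract the job id from a schedule.json reply string, or None on error."""
    reply = json.loads(raw_reply)
    return reply.get("jobid") if reply.get("status") == "ok" else None

# A sample reply in the shape scrapyd returns:
print(job_id('{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511"}'))
# → 6487ec79947edab326d6db28a2d86511
```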
Investigate the Requests
https://requestbin.fullcontact.com/
https://hookbin.com/
A lot of details are in the project monitor-water
References:
mysql
http://sillycat.iteye.com/blog/2393787
haproxy
http://sillycat.iteye.com/blog/2066118
http://sillycat.iteye.com/blog/1055846
http://sillycat.iteye.com/blog/562645
https://intoli.com/blog/running-selenium-with-headless-chrome/
Reposted from sillycat.iteye.com/blog/2414301