2018 Scrapy Environment Enhance(3)Docker ENV

Set Up Scrapy Ubuntu DEV
>sudo apt-get install -qy python python-dev python-distribute python-pip ipython
>sudo apt-get install -qy firefox xvfb
>sudo apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
> sudo apt-get install python3-venv
> sudo apt-get install python3-dev
> sudo apt install unzip
> sudo apt-get install libxi6 libgconf-2-4
> sudo apt-get install libnss3 libgconf-2-4
> sudo apt-get install chromium-browser

If needed, configure git to remember the username and password for a while
> git config credential.helper 'cache --timeout=300000'

Create the virtual ENV and activate it
> python3 -m venv ./env
> source ./env/bin/activate

> pip install --upgrade pip
> pip install selenium pyvirtualdisplay
> pip install boto3
> pip install beautifulsoup4 requests

Install Twisted
> wget http://twistedmatrix.com/Releases/Twisted/17.9/Twisted-17.9.0.tar.bz2
> tar xjf Twisted-17.9.0.tar.bz2
> cd Twisted-17.9.0
> python setup.py install

> pip install lxml scrapy scrapyjs
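To sanity-check the scrapy install, a minimal spider sketch can be run directly with scrapy runspider (the file name and the target site are just placeholders here):
> cat check_spider.py
from scrapy import Spider

class CheckSpider(Spider):
    # minimal spider that fetches one page and yields its title
    name = 'check'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        yield {'title': response.css('title::text').extract_first()}

> scrapy runspider check_spider.py -o titles.json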

Install Browser and Driver
> wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
> unzip chromedriver_linux64.zip
> chmod a+x chromedriver
> sudo mv chromedriver /usr/local/bin/

> chromedriver --version
ChromeDriver 2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7)

> chromium-browser -version
Chromium 65.0.3325.181 Built on Ubuntu , running on Ubuntu 16.04
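A quick sketch to confirm selenium can drive the headless chromium through the virtual display (check_browser.py is a hypothetical file name, the target URL is just a placeholder):
> cat check_browser.py
from pyvirtualdisplay import Display
from selenium import webdriver

# run the browser inside a virtual X display provided by xvfb
display = Display(visible=0, size=(1024, 768))
display.start()

options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/chromium-browser'  # the chromium-browser installed above
driver = webdriver.Chrome(chrome_options=options)      # picks up chromedriver from /usr/local/bin
driver.get('http://icanhazip.com/')
print(driver.page_source)

driver.quit()
display.stop()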

Setup Tor Network Proxy
> sudo apt-get install tor
> sudo apt-get install netcat
> sudo apt-get install curl
> sudo apt-get install privoxy

Check my Local IP
> curl http://icanhazip.com/
52.14.197.xxx

Set Up Tor
> tor --hash-password prxxxxxxxx
16:01D5D02xxxxxxxxxxxxxxxxxxxxxxxxxxx

> cat /etc/tor/torrc
ControlPort 9051

> cat /etc/tor/torrcpassword
HashedControlPassword 16:01D5D02EFA3D6A5xxxxxxxxxxxxxxxxxxx
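Usually both directives live in the same /etc/tor/torrc; a minimal sketch of that file for this setup:
ControlPort 9051
HashedControlPassword 16:01D5D02EFA3D6A5xxxxxxxxxxxxxxxxxxx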

Start Tor
> sudo service tor start

Verify it changes my IP
> torify curl http://icanhazip.com/
192.36.27.4

This command does not work here
> echo -e 'AUTHENTICATE "pricemonitor1234"\r\nsignal NEWNYM\r\nQUIT' | nc 127.0.0.1 9051

Try using Python to change the IP
> pip install stem

> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from stem import Signal
>>> from stem.control import Controller
>>> with Controller.from_port(port=9051) as controller:
...     controller.authenticate()
...     controller.signal(Signal.NEWNYM)
...

That should work if the permissions are right.
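Since torrc configures a HashedControlPassword, another option is to pass the plain password to authenticate(); the password below is just the redacted placeholder hashed earlier:
>>> from stem import Signal
>>> from stem.control import Controller
>>> with Controller.from_port(port=9051) as controller:
...     controller.authenticate(password='prxxxxxxxx')
...     controller.signal(Signal.NEWNYM)
...
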
Configure the Proxy
> cat /etc/privoxy/config
forward-socks5t / 127.0.0.1:9050 .

Start the Service
> sudo service privoxy start

Verify the IP
> curl -x 127.0.0.1:8118 http://icanhazip.com/
185.220.101.6

Verify with the requests API
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>> import requests
>>> response = requests.get('http://icanhazip.com/', proxies={'http': '127.0.0.1:8118'})
>>> response.text.strip()
'185.220.101.6'
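Putting the pieces together, a small rotation sketch: ask tor for a new circuit over the control port, then go out through privoxy (the password and the fixed 10 second wait are assumptions):
> cat rotate_ip.py
import time
import requests
from stem import Signal
from stem.control import Controller

PROXIES = {'http': '127.0.0.1:8118'}  # privoxy, forwarding to tor

def new_identity():
    # ask tor for a new circuit; the password matches HashedControlPassword in torrc
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='prxxxxxxxx')
        controller.signal(Signal.NEWNYM)

for _ in range(3):
    new_identity()
    time.sleep(10)  # give tor a moment to build the new circuit
    print(requests.get('http://icanhazip.com/', proxies=PROXIES).text.strip())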

Think About the Docker Application
Dockerfile
#Run the scrapyd server side

#Prepare the OS
FROM            ubuntu:16.04
MAINTAINER      Carl Luo <[email protected]>

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update
RUN apt-get -qqy dist-upgrade

#Prepare the dependencies
RUN apt-get install -qy python3 python3-dev python-distribute python3-pip ipython
RUN apt-get install -qy firefox xvfb
RUN pip3 install selenium pyvirtualdisplay
RUN pip3 install boto3 beautifulsoup4 requests
RUN apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
RUN pip3 install lxml scrapy scrapyjs
RUN pip3 install --upgrade pip
RUN apt-get install -qy python3-venv
RUN apt-get install -qy libxi6 libgconf-2-4 libnss3 libgconf-2-4
RUN apt-get install -qy chromium-browser
RUN apt-get install -qy wget unzip git

#add tool
ADD install/chromedriver /usr/local/bin/
RUN pip install scrapyd

#copy the config
RUN mkdir -p /tool/scrapyd/
ADD conf/scrapyd.conf /tool/scrapyd/

#set up the app
EXPOSE  6801
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]

Makefile
IMAGE=sillycat/public
TAG=ubuntu-scrapy-1.0
NAME=ubuntu-scrapy-1.0

docker-context:

build: docker-context
    docker build -t $(IMAGE):$(TAG) .

run:
    docker run -d -p 6801:6801 --name $(NAME) $(IMAGE):$(TAG)

debug:
    docker run -p 6801:6801 --name $(NAME) -ti $(IMAGE):$(TAG) /bin/bash

clean:
    docker stop ${NAME}
    docker rm ${NAME}

logs:
    docker logs ${NAME}

publish:
    docker push ${IMAGE}
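Typical usage, assuming the context above is in place:
> make build
> make run
> make logs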

start.sh
#!/bin/sh -ex

#start the service
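#scrapyd picks up ./scrapyd.conf from the working directory, so change to /tool/scrapyd/ first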
cd /tool/scrapyd/
scrapyd

Configuration in conf/scrapyd.conf
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 100
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 20
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6801
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
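Once the container is running, jobs can be scheduled through the scrapyd JSON API on port 6801 (the project and spider names are placeholders for whatever gets deployed later):
>>> import requests
>>> requests.get('http://localhost:6801/daemonstatus.json').json()
>>> requests.post('http://localhost:6801/schedule.json', data={'project': 'myproject', 'spider': 'myspider'}).json()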

References:
http://sillycat.iteye.com/blog/2418353
http://sillycat.iteye.com/blog/2418229
