Crawler project walkthrough, Case II: locating page elements, crawling them, and simple data processing (a summary)

1. scrapy shell [URL to crawl]
It gives you very intuitive feedback on whether the elements you want to locate can actually be found.
2. After the shell opens, run a statement such as:
response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract()
and check whether the return value shows that the element was located.
The effect of yield: similar to return (it hands the value back to Scrapy without ending the callback).
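For illustration, a quick sanity check of such an XPath in the scrapy shell might look like the sketch below (the URL and XPath expressions are the ones used later in this project; the exact output depends on the live page):
--------------------------------------------------------------------------------
# In a terminal:
#   scrapy shell "http://pycs.greedyai.com/"
# Inside the shell, `response` is already bound to the downloaded page:
names = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract()
print(names)        # a non-empty list means the XPath locates the elements
intros = response.xpath("//*[@class=\"intro\"]/text()").extract()
print(len(intros))  # how many intro text nodes were matched
--------------------------------------------------------------------------------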

The overall process is as follows:
1. cd Part6 (go into the directory for the next project)
scrapy startproject [name1]
cd [name1]
scrapy genspider stock (the spider name) [address]
2. In the stock.py file: first open the target page, find the data that needs to be scraped, and click "Copy XPath". After obtaining the XPath, enter scrapy-shell mode from the console with scrapy shell [URL to crawl], then try statements such as response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract() to locate the data. Once the data can be found, write the code in the stock.py file; the code is as follows:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
import re
from stock_spider.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        # collect every link on the page and follow it to the detail parser
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        stock_item = StockItem()
        # board members' names
        stock_item["names"] = self.get_tc(response)
        # gender information
        stock_item["sexes"] = self.get_sex(response)
        # age information
        stock_item["ages"] = self.get_age(response)
        # stock (ticker) codes
        stock_item["codes"] = self.get_code(response)
        # positions held
        stock_item["leaders"] = self.get_leader(response, len(stock_item["names"]))
        # hand the item off to the storage logic
        yield stock_item

    def get_tc(self, response):
        tc_names = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract()
        return tc_names

    def get_sex(self, response):
        # //*[@id=\"ml_001\"]/table/tbody/tr[2]/td[1]/div/table/thead/tr[2]/td[1]
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        sex_list = []
        for info in infos:
            try:
                sex = re.findall("[男|女]", info)[0]
                sex_list.append(sex)
            except IndexError:
                continue
        return sex_list


    def get_age(self, response):
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        age_list = []
        for info in infos:
            try:
                age = re.findall(r"\d+", info)[0]
                age_list.append(age)
            except IndexError:
                continue
        return age_list

    def get_code(self, response):
        infos = response.xpath('/html/body/div[3]/div[1]/div[2]/div[1]/h1/a/@title').extract()
        code_list = []
        for info in infos:
            try:
                code = re.findall(r"\d+", info)[0]
                code_list.append(code)
            except IndexError:
                continue
        return code_list

    def get_leader(self, response, length):
        tc_leaders = response.xpath("//*[@class=\"tl\"]/text()").extract()
        tc_leaders = tc_leaders[0:length]
        return tc_leaders
--------------------------------------------------------------------------------
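As a side note, the regular expressions in get_sex and get_age above simply take the first gender character and the first run of digits out of each intro string; a minimal, self-contained illustration (the intro sentence is made up) would be:
--------------------------------------------------------------------------------
import re

info = "张某某：男，45岁，本科学历。"   # made-up example of an intro string
sex = re.findall("[男|女]", info)[0]   # -> "男"
age = re.findall(r"\d+", info)[0]      # -> "45"
print(sex, age)
--------------------------------------------------------------------------------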
3. After the spider is written, run it from main.py; the code is as follows:
--------------------------------------------------------------------------------
from scrapy.cmdline import execute
import sys
import os

# written for debugging
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Exec ( "Scrapy", "crawl", "tonghuashun");
# Execute ([ "Scrapy", "crawl", "tonghuashun "]);
# Execute ([" Scrapy "," crawl ","tonghuashun"]);
execute(["scrapy","crawl","stock"]);
# Before two parameters are fixed, the last parameter is the name of your own creation

--------------------------------------------------------------------------------
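As an aside, main.py with execute(...) is just a convenience for debugging from the IDE; running scrapy crawl stock from the project root in a terminal does the same thing, and the same effect can also be had programmatically with Scrapy's CrawlerProcess. A sketch (not part of the original project; run it from inside the project directory so scrapy.cfg and settings.py are found):
--------------------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# run the 'stock' spider with the project's settings applied
process = CrawlerProcess(get_project_settings())
process.crawl("stock")
process.start()   # blocks until the crawl finishes
--------------------------------------------------------------------------------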
4. After main.py is written, next write the following code in items.py. Its purpose is to connect items.py with stock.py; items.py specifies which fields of data to crawl.
--------------------------------------------------------------------------------
import scrapy


class StockSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class StockItem(scrapy.Item):
    names = scrapy.Field()
    sexes = scrapy.Field()
    ages = scrapy.Field()
    codes = scrapy.Field()
    leaders = scrapy.Field()
Note: the field names here must stay consistent with the keys used in stock.py.
--------------------------------------------------------------------------------
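To make the connection concrete, once StockItem is imported into stock.py it behaves much like a dictionary; a tiny sketch of filling and reading one back (the values are made up):
--------------------------------------------------------------------------------
from stock_spider.items import StockItem

item = StockItem()
item["names"] = ["张三", "李四"]    # made-up values; keys must match the Field names
item["ages"] = ["45", "52"]
print(item["names"], dict(item))   # items can be read like a dict / converted to one
--------------------------------------------------------------------------------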
5. Next, write the methods in pipelines.py. The main role of pipelines.py is to specify how the scraped data is handled.
--------------------------------------------------------------------------------
class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class StockPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
--------------------------------------------------------------------------------
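The StockPipeline above only prints the item; if you wanted the "file storage" mentioned in parse_detail, a pipeline along the following lines could be added (a sketch; the class name and output file are made up, and it would also need its own entry in ITEM_PIPELINES):
--------------------------------------------------------------------------------
import json

class StockWriteToFilePipeline(object):
    # hypothetical pipeline: append each item to a JSON-lines file
    def open_spider(self, spider):
        self.file = open("stock_items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
--------------------------------------------------------------------------------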
6. Finally, open ITEM_PIPELINES in settings.py and add the classes you wrote yourself. The numbers are priorities: pipelines with lower values run earlier. The code is as follows:
--------------------------------------------------------------------------------
ITEM_PIPELINES = {
    'stock_spider.pipelines.StockSpiderPipeline': 300,
    'stock_spider.pipelines.StockPipeline': 1
}
--------------------------------------------------------------------------------

Summary: some problems were encountered in the process of writing the code, for example:

1. Exception-handling statements had to be added to the code.

2. The item class defined in items.py must be imported into the stock.py file; all of the scraped information is then packed into that class.

3. response has to be passed into every helper method.

4. In some cases what is crawled is the title attribute of a page element.

5. In general the XPath ends with /text().

6. The helper methods use return; at the very end, stock_item is handed back with yield.

 

Source: www.cnblogs.com/jxxgg/p/11666844.html