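About web crawlers:
The World Wide Web holds countless pages and a vast amount of information on nearly every subject. Often, whether for data analysis or for a product requirement, we need to pull the content we care about out of particular sites. Even in the 21st century we still have only two hands and one pair of eyes, so visiting every page, reading it, and copying and pasting by hand is not realistic. What we need is a program that fetches pages automatically and extracts content according to rules we specify. That program is a crawler.
This article walks through a practical example: scraping the 360 video site (360kan.com) in Java. The technique looks complicated at first but turns out to be simple in practice.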
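As the screenshot above shows, the 360 video site has four main channels: TV series, movies, variety shows, and animation. Their URLs follow a regular pattern, respectively:
1. https://www.360kan.com/dianshi/list
2. https://www.360kan.com/dianying/list
3. https://www.360kan.com/zongyi/list
4. https://www.360kan.com/dongman/list
Next, we use these URLs to fetch the content we need.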
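The URL pattern can be captured in a small helper. The following is only a sketch: the class name `ChannelUrls` and the method `listUrl` are hypothetical, the query string is the one used by the Servlet later in this article, and the site may have changed its layout since this was written.

```java
import java.util.Map;

// Hypothetical helper, not part of the original project.
class ChannelUrls {
    // Maps the front-end "type" parameter to the path segment used by 360kan.
    private static final Map<String, String> SLUGS = Map.of(
            "tvplay", "dianshi",
            "movie", "dianying",
            "variety", "zongyi",
            "cartoon", "dongman");

    // Builds the list-page URL for a channel and a 1-based page number.
    static String listUrl(String type, int page) {
        // Map.of rejects null keys, so guard before the lookup.
        String slug = (type == null) ? null : SLUGS.get(type);
        if (slug == null) {
            throw new IllegalArgumentException("unknown channel type: " + type);
        }
        int safePage = Math.max(1, page); // page numbers below 1 fall back to 1
        return "https://www.360kan.com/" + slug
                + "/list.php?rank=rankhot&cat=all&area=all&year=all&pageno=" + safePage;
    }

    public static void main(String[] args) {
        System.out.println(listUrl("movie", 2));
    }
}
```

Keeping the mapping in one table means a new channel needs only one extra entry, and an unknown `type` fails loudly instead of producing a URL that contains `null`.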
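The tools I used are the MyEclipse IDE and the Apache Tomcat server. Installation and configuration are not covered here; a web search will turn up the details. The one thing you must do is bind the installed Tomcat to MyEclipse so that the web application can run on the server. Start by creating a new project and adding jsoup.jar, a commonly used Java scraping library, to it. Then write two Java web Servlets, a home page index.jsp, and a playback page player.jsp.
(1) The home page code is as follows: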
<div class="form">
<form action="/LongVideos/Search" method="post" target="ifr">
<input type="text" placeholder="尽情搜吧" name="v_name">
<input type="submit" value="搜搜">
</form>
</div>
<%
//accept only page numbers 1..24; anything else falls back to page 1
String p=request.getParameter("pageId");
int pageNo=1;
try{
int n=Integer.parseInt(p);
if(n>=1&&n<=24){
pageNo=n;
}
}catch(NumberFormatException e){
//missing or non-numeric pageId: keep the default
}
p=String.valueOf(pageNo);
System.out.print(p);
%>
<div class="tab" align="center">
<a href="/LongVideos/SpideImg?type=tvplay&pageId=<%=p %>" target="ifr">电视剧</a>
<a href="/LongVideos/SpideImg?type=movie&pageId=<%=p %>" target="ifr">电影</a>
<a href="/LongVideos/SpideImg?type=variety&pageId=<%=p %>" target="ifr">综艺</a>
<a href="/LongVideos/SpideImg?type=cartoon&pageId=<%=p %>" target="ifr">动漫</a>
</div>
<iframe src="/LongVideos/SpideImg?type=movie&pageId=<%=p %>" name="ifr" width="100%" height="100%"></iframe>
<div class="page" align="center">
<%for(int j=1;j<=24;j++){
String pageIndex="<a href=\"/LongVideos/index.jsp?pageId="+j+"\">"+j+"</a>";
out.write(pageIndex);
//System.out.println(pageIndex);
} %>
</div>
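When run, the page looks like this:
(2) The code for the two Servlets is as follows.
[1] SpideServlet: this Servlet fetches the video listings from the site and displays them.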
String type=request.getParameter("type");
String page=request.getParameter("pageId");
if(type==null){
type="movie";//same default channel as the iframe on the home page
}
if(page==null){
page="1";//default to the first page when pageId is missing
}
//map the front-end channel name to the path segment used by 360kan
if(type.equals("tvplay")){
type="dianshi";
}else if(type.equals("movie")){
type="dianying";
}else if(type.equals("variety")){
type="zongyi";
}else if(type.equals("cartoon")){
type="dongman";
}
String main="https://www.360kan.com";
//list page sorted by popularity, with all categories, regions and years
String url="https://www.360kan.com/"+type+"/list.php?rank=rankhot&cat=all&area=all&year=all&pageno="+page;
Document doc=Jsoup.connect(url).get();
Elements plays=doc.getElementsByClass("js-tongjic");
PrintWriter out=response.getWriter();
out.println("<head>");
out.println("<title>Long Bro影院欢迎你</title>");
out.println("<link rel=\"stylesheet\" type=\"text/css\" href=\"css/index.css\"> ");
out.println("</head>");
//iterate over every matched list item
for(Element play: plays){
String p=play.toString();
//pull out the href, the image src, and from the spans the year, title, score and cast
String[] s=p.split(">");
String hr=s[0]+">";
String hre=main+hr.substring(hr.indexOf("href=")+6, hr.indexOf("\">"));
String sr=s[2]+">";
String src=sr.substring(sr.indexOf("src=")+5, sr.indexOf("\">"));
String year,name,score,actor;
if(p.contains("付费")){
//items marked "付费" (paid): every field sits two segments later than in the free layout
year=s[6].substring(0,4);
name=s[11].split("<")[0];
score=s[13].split("<")[0];
actor=s[16].split("<")[0];
}else{
year=s[4].substring(0, 4);
name=s[9].split("<")[0];
score=s[11].split("<")[0];
actor=s[14].split("<")[0];
}
if(name.length()>11){
name=name.substring(0,11)+"...";
}
if(actor.length()>11){
actor=actor.substring(0, 11)+"...";
}
if(score.equals("")){
score="暂无";
}
if(year.length()!=4){
year="暂无";
}
if(actor.equals("")){
actor="暂无";
}
//fetch the detail page at hre to pick out the play link
Document docu=Jsoup.connect(hre).get();
Elements btns=docu.getElementsByClass("s-cover");//the play link sits inside the element with this class
String sb=btns.toString();
//pu (player URL): the play link of this video
String pu=sb.substring(sb.indexOf("href=\"")+6,sb.indexOf("\" class=\""));
//other details (synopsis, episode list and so on) could be scraped here and passed to the player page;
//instead only hre is passed on, and player.jsp does that scraping itself
out.write(" <div class='whole'><a href=\"/LongVideos/player.jsp?type="+type+"&url="+hre+"&href="+pu+"\" target='_blank'>"
+ "<img src='"+src+"' title='"+name+"' alt='"+src+"'>"
+ "<em>"+name+"</em><br><em>"+actor+"</em><br><em>年份:"+year+"</em><br>"
+ "<em>评分:"+score+"</em></a></div>");
}
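One thing to watch in the link written by this Servlet: `hre` and `pu` are full URLs that can themselves contain `?`, `&` and `=`, so concatenating them straight into the query string can cut the values short when player.jsp reads them with `request.getParameter`. A minimal sketch of encoding each value first is shown here. The class `PlayerLink` and its methods are hypothetical names, and the `URLEncoder.encode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original project.
class PlayerLink {
    // Percent-encodes one query-string value as UTF-8.
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Builds the player.jsp link with every parameter value encoded.
    static String build(String type, String detailUrl, String playUrl) {
        return "/LongVideos/player.jsp?type=" + enc(type)
                + "&url=" + enc(detailUrl)
                + "&href=" + enc(playUrl);
    }

    public static void main(String[] args) {
        // Example values only; real ones come from the scraped pages.
        System.out.println(build("dianshi",
                "https://www.360kan.com/tv/x.html",
                "http://v.qq.com/a?b=c&d=e"));
    }
}
```

The servlet container decodes parameter values automatically, so player.jsp can keep calling `request.getParameter("href")` unchanged.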
[2] SearchServlet: this Servlet implements the video search feature.
package servlet;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLEncoder;
import java.util.ArrayList;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import bean.SimInfo;
/**
* Searches 360kan for a keyword and renders the matching titles.
* @author Long Bro
*/
public class Search extends HttpServlet {
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
request.setCharacterEncoding("UTF-8");//decode request parameters as UTF-8
response.setCharacterEncoding("UTF-8");//encode the response body as UTF-8
response.setContentType("text/html;charset=UTF-8");//tell the browser to read the page as UTF-8
String kw=request.getParameter("v_name");
System.out.println(kw);
PrintWriter out=response.getWriter();
out.println("<html><head>");
out.println("<title>Long Bro-Video搜搜</title>");
out.println("<link rel=\"stylesheet\" type=\"text/css\" href=\"css/index.css\"> ");
out.println("<style type=\"text/css\">");
out.print("body{margin-left:60px;}");
out.println("</style>");
out.println("</head>");
ArrayList<SimInfo> sis=searchInfo(kw,response);
out.write("<body>");
for(int i=0;i<sis.size();i++){
SimInfo si=sis.get(i);
String url=si.getUrl();
//fetch the detail page to find the play link
Document docu=Jsoup.connect(url).get();
Elements btns=docu.getElementsByClass("s-cover");//the play link sits inside the element with this class
String sb=btns.toString();
//pu (player URL): the play link of this video
String pu=sb.substring(sb.indexOf("href=\"")+6,sb.indexOf("\" class=\""));
System.out.println(pu);
String img=si.getImg();
String name=si.getVname();
String type=si.getType();
System.out.println(url+" "+name);
String typ=null;
if(type.equals("电视剧")){
typ="dianshi";
}else if(type.equals("电影")){
typ="dianying";
}else if(type.equals("综艺")){
typ="zongyi";
}else if(type.equals("动漫")){
typ="dongman";
}
out.write("<a href='/LongVideos/player.jsp?type="+typ+"&url="+url+"&href="+pu+"' target='_blank'><div class='whole'>"
+ "<img src='"+img+"' alt='"+name+"' title='"+name+"'><br><em>片名:"+name+"</em><br><br><em>类型:"+type+"</em></div></a>");
}
out.write("</body></html>");
}
public void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
doGet(request, response);
}
public static ArrayList<SimInfo> searchInfo(String kw,HttpServletResponse response)throws IOException{
kw=URLEncoder.encode(kw,"utf-8");//URL-encode the keyword (typically Chinese characters)
String url="https://so.360kan.com/index.php?kw="+kw;
System.out.println(url);
Document doc=Jsoup.connect(url).get();
Elements es=doc.getElementsByClass("js-longitem");
ArrayList<SimInfo> sis=new ArrayList<SimInfo>();
for(Element s:es){
System.out.println("----------------------------------------------------");
String ss=s.toString();
//URL of the detail page
String uu=ss.substring(ss.indexOf("<a href=\"")+9,ss.indexOf("\" class=\"g-playicon") );
//poster image
String img=ss.substring(ss.indexOf("<img src=\"")+10,ss.indexOf("\" alt=\""));
//title
String name=ss.substring(ss.indexOf("title=\"")+7,ss.indexOf("\" data-logger=\"ctype"));
//channel name; +24 skips the 23-character opening tag plus one more character, and the end marker is "]</span>"
String type=ss.substring(ss.indexOf("<span class=\"playtype\">")+24,ss.indexOf("]</span>"));
// System.out.println(type);
SimInfo si=new SimInfo();
si.setUrl(uu);
si.setImg(img);
si.setVname(name);
si.setType(type);
sis.add(si);
}
return sis;
}
}
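The `Search` Servlet imports `bean.SimInfo`, which this article does not list. Judging only from the getters and setters called above, a minimal version might look like the sketch below. The field names are inferred from those calls and the real class may well differ.

```java
// Hypothetical reconstruction of the bean used by Search.
// In the project it would be declared public in bean/SimInfo.java
// (package "bean", per the import); the package line is left out
// here so the sketch compiles on its own.
class SimInfo {
    private String url;   // URL of the detail page
    private String img;   // poster image URL
    private String vname; // video title
    private String type;  // channel name as shown on the search page

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }

    public String getVname() { return vname; }
    public void setVname(String vname) { this.vname = vname; }

    public String getType() { return type; }
    public void setType(String type) { this.type = type; }

    public static void main(String[] args) {
        SimInfo si = new SimInfo();
        si.setVname("Example title"); // placeholder data
        si.setType("电影");
        System.out.println(si.getVname() + " / " + si.getType());
    }
}
```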
(3) The code for player.jsp is as follows. It calls the beac video-parsing API (beaacc.com in the iframe) to resolve and play the video.
<%@page import="org.jsoup.nodes.Element"%>
<%@page import="org.jsoup.select.Elements"%>
<%@page import="org.jsoup.Jsoup"%>
<%@page import="org.jsoup.nodes.Document"%>
<%@ page language="java" import="java.util.*" pageEncoding="utf-8"%>
<%
String path = request.getContextPath();
String basePath = request.getScheme()+"://"+request.getServerName()+":"+request.getServerPort()+path+"/";
request.setCharacterEncoding("UTF-8");//decode request parameters as UTF-8
response.setCharacterEncoding("UTF-8");//encode the response body as UTF-8
response.setContentType("text/html;charset=UTF-8");//tell the browser to read the page as UTF-8
String type=request.getParameter("type");
String hre=request.getParameter("href");//play link of this video
String url=request.getParameter("url");
System.out.println(url);
%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<base href="<%=basePath%>">
<title>正在播放中......</title>
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta name="keywords" content="keyword1,keyword2,keyword3">
<meta name="description" content="This is my page">
<link rel="stylesheet" type="text/css" href="css/index.css">
</head>
<body>
<iframe src='http://beaacc.com/api.php?url=<%=hre %>' class="player" width='800px' height="500px" align="left"></iframe><br>
<!-- http://api.baiyug.cn/vip/?url=-->
<%
if("dianshi".equals(type)||"dongman".equals(type)){
System.out.println(type);
Document doc=Jsoup.connect(url).get();
Elements s=doc.getElementsByClass("s-top-list-ji");//episode list
String ss=s.toString();
System.out.println(ss);
String se="";
if(ss.contains("display:none")){//many episodes: the list is collapsed
se=ss.substring(ss.indexOf("display:none")+15,ss.indexOf("收起</a>")-63);
}else if(ss.contains("<i class=\"ico-yugao\">")){//trailers are present
se=ss.substring(ss.indexOf("js-tab")+14,ss.indexOf("<i class=\"ico-yugao\">"));
}else{//not collapsed and no trailers
se=ss.substring(ss.indexOf("js-tab")+14);
}
System.out.println(se);
String[] hs=se.split("href=\"");
String u="";
out.write("<div class='episode'>");
for(int i=1;i<hs.length;i++){
//after the split, hs[i] starts with the episode URL, e.g. http://v.qq.com/x/cover/uutv6yv0c4jn95h/d002608a679.html?ptag=360kan.tv.free"> 40 <i class="ico-new"></i> </a>
u=hs[i].substring(0,hs[i].indexOf("\">"));
System.out.println(u);
System.out.println("-------------------------------");
out.write("<a href='/LongVideos/player.jsp?type="+type+"&url="+url+"&href="+u+"'>"+i+"</a> ");
//eight episodes per row
if(i%8==0){
out.write("<br>");
}
}
out.write("</div><br>");
//title of the video being played
Elements info=doc.getElementsByClass("s-top-info");
String sinfo=info.toString();
System.out.println(sinfo);
String title=sinfo.substring(sinfo.indexOf("<h1>")+4,sinfo.indexOf("</h1>"));
out.println("<br>正在播放:<em class='title'>"+title+"</em><br>");
if(type.equals("dianshi")){
Element s8=doc.getElementById("js-desc-switch");
String[] r=s8.toString().split("<span>");
//genre
String style=r[1].substring(10, r[1].indexOf("</p>"));
//year
String year=r[2].substring(10,r[2].indexOf("</p>"));
//region
String area=r[3].substring(10,r[3].indexOf("</p>"));
//director: lD holds the href of the director link (not used below)
String lD=r[4].substring(r[4].indexOf("href")+6, r[4].indexOf("\">"));
String director=r[4].substring(r[4].indexOf("\">")+2, r[4].indexOf("</a>"));
out.println("类型:<des>"+style+"</des> 年份:<des>"+year+"</des> 地区:<des>"+area+"</des> 导演:<des>"+director+"</des><br><br>");
}
//synopsis
String desc=sinfo.substring(sinfo.indexOf("<p class=\"item-desc js-close-wrap\" style=\"display:none;\">")+73,sinfo.indexOf("<a href=\"#\" class=\"js-close btn\">"));
out.println("简介:"+desc);
}
else if("dianying".equals(type)||"zongyi".equals(type)){
Document doc=Jsoup.connect(url).get();
Elements s=doc.getElementsByClass("top-info");//movies
String ss=s.toString();
out.write("<div class='episode'>");
//title
String title=ss.substring(ss.indexOf("<h1>")+4,ss.indexOf("</h1>"));
if(type.equals("dianying")){
//score
String score=ss.substring(ss.indexOf("<span class=\"s\">")+16,ss.indexOf("</span>"));
if(score.length()>6){
out.println("<br>正在播放:<em class='title'>"+title+"</em><br><br>");
}else{
out.println("<br>正在播放:<em class='title'>"+title+"</em> "+score+"<br><br>");
}
Element s8=doc.getElementById("js-desc-switch");
String[] r=s8.toString().split("<span>");
//year
String year=r[1].substring(10, r[1].indexOf("</p>"));
//region
String area=r[2].substring(10,r[2].indexOf("</p>"));
//cast
String actor=r[3].substring(10,r[3].indexOf("</p>"));
//director: lD holds the href of the director link (not used below)
String lD=r[4].substring(r[4].indexOf("href")+6, r[4].indexOf("\">"));
String director=r[4].substring(r[4].indexOf("\">")+2, r[4].indexOf("</a>"));
out.println("年份:<des>"+year+"</des> 地区:<des>"+area+"</des> 导演:<des>"+director+"</des><br><br>");
}
//synopsis
String desc=ss.substring(ss.indexOf("<p class=\"item-desc js-close-wrap\" style=\"display:none;\">")+73,ss.indexOf("<a href=\"#\" class=\"js-close btn\">"));
out.println("简介:<span>"+desc+"</span>");
out.write("</div>");
}
%>
</body>
</html>
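Most of the extraction in this article has the shape `substring(indexOf(a)+n, indexOf(b))`. That throws a `StringIndexOutOfBoundsException` as soon as one marker is missing, for instance after the site changes its markup. A small helper that returns a fallback value instead is sketched below. The names `TextExtract` and `between` are hypothetical. Unlike the code above, it searches for the end marker only after the start marker.

```java
// Hypothetical helper, not part of the original project.
class TextExtract {
    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end after it, or fallback if either marker is missing.
     */
    static String between(String s, String start, String end, String fallback) {
        if (s == null) {
            return fallback;
        }
        int from = s.indexOf(start);
        if (from < 0) {
            return fallback;
        }
        from += start.length();          // skip past the start marker itself
        int to = s.indexOf(end, from);   // look for the end marker after it
        if (to < 0) {
            return fallback;
        }
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // Made-up fragment in the style of the pages scraped above.
        String html = "<div class=\"s-top-info\"><h1>Example Title</h1></div>";
        System.out.println(between(html, "<h1>", "</h1>", "n/a"));
        System.out.println(between(html, "<h2>", "</h2>", "n/a"));
    }
}
```

With such a helper, a line like the title extraction could become `between(sinfo, "<h1>", "</h1>", "暂无")`, matching the "暂无" fallback the list Servlet already uses for empty fields.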
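The result looks like this:
With just the four files above, a simple video-scraping site is complete. The interface is fairly plain, so please bear with it. The next step is to use CSS, JavaScript, and similar techniques to add extra features and polish your video site.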