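It has been a long time since I last wrote a crawler. Now that I have learned some JS, I want to go back and fill in the hole I left with jandan.net (煎蛋网). This post explains how jandan.net "encrypts" its image links.
In the page's HTML there is no image URL to be found, but there is an onload event. A quick Baidu search gives the following:
The onload event fires as soon as a page or image has finished loading.
So once the page has finished loading, the onload handler runs. Let's look at the function bound to this event, jandan_load_img: press F12 to open the developer tools, then Ctrl+Shift+F to bring up global search, as shown in the figure below.
Then look at the body of this function: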
function jandan_load_img(b) {
    var d = $(b);
    var f = d.next("span.img-hash");
    var e = f.text();
    f.remove();
    var c = jdy5ugL2GX9bFlWsf4C709D7Qloik7yWq6(e, "Xcvbx6YJq7BcAEdUaNWbdeYheXOtVY6T");
    var a = $('<a href="' + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3") + '" target="_blank" class="view_img_link">[查看原图]</a>');
    d.before(a);
    d.before("<br>");
    d.removeAttr("onload");
    d.attr("src", location.protocol + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.gif)/, "$1thumb180$3"));
    if (/\.gif$/.test(c)) {
        d.attr("org_src", location.protocol + c);
        b.onload = function() {
            add_img_loading_mask(this, load_sina_gif)
        }
    }
}
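The two regex replacements in jandan_load_img can be tried out in Python. This is only a rough sketch: the patterns are transcribed from the JS above, while the helper names to_large and to_thumb, the host wx1 and the size segment mw600 are made up for illustration.

```python
import re

# Patterns transcribed from jandan_load_img. re.ASCII keeps \w limited to
# [A-Za-z0-9_], which is how \w behaves in JavaScript.
LARGE_RE = re.compile(r"(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))", re.ASCII)
THUMB_RE = re.compile(r"(//\w+\.sinaimg\.cn/)(\w+)(/.+\.gif)", re.ASCII)


def to_large(url):
    # Swap the size segment for "large" (the "[查看原图]" link target).
    # count=1 mirrors JS String.replace with a non-global regex.
    return LARGE_RE.sub(r"\g<1>large\g<3>", url, count=1)


def to_thumb(url):
    # Only .gif URLs match; they get the "thumb180" size segment.
    return THUMB_RE.sub(r"\g<1>thumb180\g<3>", url, count=1)


# Hypothetical sample URLs, not taken from the site.
print(to_large("//wx1.sinaimg.cn/mw600/example.jpg"))
print(to_thumb("//wx1.sinaimg.cn/mw600/example.gif"))
```

A URL that does not match a pattern is returned unchanged, so a .jpg passes through to_thumb untouched.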
The most important line is this one:
var a = $('<a href="' + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3") + '" target="_blank" class="view_img_link">[查看原图]</a>');
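This is the JS that builds the link. It uses a variable c, which the code above shows is produced by the function jdy5ugL2GX9bFlWsf4C709D7Qloik7yWq6. Use global search again to find that function's source: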
var jdy5ugL2GX9bFlWsf4C709D7Qloik7yWq6 = function(n, t, e) {
    var f = "DECODE";
    var t = t ? t : "";
    var e = e ? e : 0;
    var r = 4;
    t = md5(t);
    var d = n;
    var p = md5(t.substr(0, 16));
    var o = md5(t.substr(16, 16));
    if (r) {
        if (f == "DECODE") {
            var m = n.substr(0, r)
        }
    } else {
        var m = ""
    }
    var c = p + md5(p + m);
    var l;
    if (f == "DECODE") {
        n = n.substr(r);
        l = base64_decode(n)
    }
    var k = new Array(256);
    for (var h = 0; h < 256; h++) {
        k[h] = h
    }
    var b = new Array();
    for (var h = 0; h < 256; h++) {
        b[h] = c.charCodeAt(h % c.length)
    }
    for (var g = h = 0; h < 256; h++) {
        g = (g + k[h] + b[h]) % 256;
        tmp = k[h];
        k[h] = k[g];
        k[g] = tmp
    }
    var u = "";
    l = l.split("");
    for (var q = g = h = 0; h < l.length; h++) {
        q = (q + 1) % 256;
        g = (g + k[q]) % 256;
        tmp = k[q];
        k[q] = k[g];
        k[g] = tmp;
        u += chr(ord(l[h]) ^ (k[(k[q] + k[g]) % 256]))
    }
    if (f == "DECODE") {
        if ((u.substr(0, 10) == 0 || u.substr(0, 10) - time() > 0) && u.substr(10, 16) == md5(u.substr(26) + o).substr(0, 16)) {
            u = u.substr(26)
        } else {
            u = ""
        }
        u = base64_decode(d)
    }
    return u
};
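It looks complicated, but it is actually very simple. Read it from the bottom up: the function returns u. Where does u come from? From the statement u = base64_decode(d), which overwrites whatever the preceding loops computed. And what is d? It is set by var d = n, that is, the untouched first argument, which is the img-hash text that jandan_load_img reads from the page.
So the picture is now clear: take the encoded string from the HTML and Base64-decode it once to get the link.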
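As a quick offline check of that conclusion, here is a minimal sketch using only the standard library. The HTML snippet is an assumption inferred from d.next("span.img-hash") in the JS, not copied from the site, and the sample URL and the names decode_hash and extract_urls are hypothetical.

```python
import base64
import re

# Assumed markup: each <img> is followed by <span class="img-hash">...</span>.
HASH_RE = re.compile(r'<span class="img-hash">([^<]+)</span>')


def decode_hash(img_hash):
    # The whole "decryption" reduces to a single Base64 decode.
    return base64.b64decode(img_hash).decode("utf-8")


def extract_urls(html):
    # Decode every img-hash found in the HTML text, in document order.
    return [decode_hash(h.strip()) for h in HASH_RE.findall(html)]


# Build a hypothetical hash locally so the example runs without network access.
sample_url = "//wx1.sinaimg.cn/mw600/example.jpg"
sample_hash = base64.b64encode(sample_url.encode("utf-8")).decode("ascii")
sample_html = (
    '<img onload="jandan_load_img(this)" />'
    '<span class="img-hash">' + sample_hash + "</span>"
)

print(extract_urls(sample_html))
```

The decoded value is protocol-relative (it starts with "//"), which is why the JS prepends location.protocol before using it.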
The Python code is as follows:
import base64

import requests
from bs4 import BeautifulSoup


def base64_decode(encode_code):
    # The img-hash text is plain Base64; b64decode returns bytes.
    return base64.b64decode(encode_code)


def decodingUrl(url_decode):
    # The decoded text looks like "//host/path"; drop the leading "//".
    return base64_decode(url_decode).decode('unicode-escape')[2:]


def get_all_decodemsg(response):
    # Collect the text of the node right after each <img> (the span.img-hash).
    li = []
    html = BeautifulSoup(response.text, 'html.parser')
    for data in html.find_all("img"):
        try:
            text = data.next_sibling.string
        except AttributeError:
            # This <img> has no next sibling, so there is nothing to decode.
            continue
        # Skip empty or whitespace-only siblings.
        if text and text.strip():
            li.append(text.strip())
    return li


if __name__ == "__main__":
    for i in range(1, 38):
        demo_url = "http://jandan.net/ooxx/page-%d" % i
        print(demo_url, end="\n\n\n")
        response = requests.get(demo_url)
        for msg in get_all_decodemsg(response):
            url = "http://" + decodingUrl(msg)
            print(url)
            r = requests.get(url)
            # Save under the file name taken from the last URL segment.
            with open(r"C:\Users\asus\Desktop\testss\%s" % url.split("/")[-1], "wb") as f:
                f.write(r.content)
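The save path in the script is hard-coded for one Windows machine. If you want something more portable, one possible approach (a sketch only; the helper name local_path_for and the directory name are hypothetical) is to derive the file name from the URL path and join it with a directory of your choice:

```python
import os
import posixpath
from urllib.parse import urlparse


def local_path_for(url, out_dir):
    # Take the last segment of the URL path, ignoring any query string,
    # and place it inside out_dir using the local OS path separator.
    name = posixpath.basename(urlparse(url).path)
    return os.path.join(out_dir, name)


# Hypothetical URL and output directory, for illustration only.
print(local_path_for("http://wx1.sinaimg.cn/mw600/example.jpg", "jandan_images"))
```

The output directory still has to exist before writing; os.makedirs(out_dir, exist_ok=True) takes care of that.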
Screenshot of the run
The downloaded images