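Contents
I. The First Roadblock
II. Cracking the Encryption
1. Detailed analysis
2. Analyzing the JS code
III. Converting to Python Scraper Code
IV. Full Code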
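On a whim I wanted to try the AI image generation that has been popular lately, so I set out to collect some images to train a model on.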
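I. The First Roadblock
I picked a foreign image site at random to scrape some pictures from, pressed F12, and inspected the elements:
To my surprise, the image links take the form jpg.txt, while the first image is in data:xxx form. Capturing the traffic in the F12 network panel shows that the jpg request returns one very long string. Given the telltale two equals signs at the end (Base64 padding), I judged it to be Base64-encoded ciphertext:
Once decrypted, it becomes the encoded image data, which can be viewed directly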
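The "two trailing equals signs" heuristic can be checked a little more rigorously. The sketch below is only an illustration, not something taken from the site: it tests whether a response body is valid Base64 and whether the decoded length is a multiple of 8 bytes, which is what block-cipher output would look like (8 is the DES block size; 16-byte AES blocks also pass). The function name and the sample value are made up for the example.

```python
import base64
import binascii

BLOCK_SIZE = 8  # assumption: smallest common block size (DES); AES output also passes


def looks_like_block_ciphertext(text: str) -> bool:
    """Return True if text is strict Base64 whose decoded length is a non-zero multiple of 8."""
    try:
        raw = base64.b64decode(text.strip(), validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) > 0 and len(raw) % BLOCK_SIZE == 0


# Hypothetical sample: 16 arbitrary bytes, Base64-encoded (ends with "==")
sample = base64.b64encode(bytes(range(16))).decode("ascii")
print(sample[-2:], looks_like_block_ciphertext(sample))
```

A positive result does not prove the data is encrypted; it only says the shape is consistent with Base64-wrapped block-cipher output.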
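II. Cracking the Encryption
Preliminary analysis: the site first shows a default image as a placeholder, then takes the http://xxx.jpg.txt ciphertext URL from the image's src, fetches it, and decrypts it to get the image's binary encoding. It then prepends the data:image/jpg;base64, prefix and replaces the value of the src attribute, at which point the real image appears on the page.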
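To make the data:image/jpg;base64, form concrete, here is a minimal sketch of building such a data URI from raw image bytes and turning it back into bytes. The helper names are hypothetical and the four sample bytes are just the usual JPEG header bytes, not data from the site.

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/jpg") -> str:
    """Wrap raw image bytes in a Base64 data URI, as a browser accepts in img src."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")


def from_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a Base64 data URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a Base64 data URI")
    return base64.b64decode(payload)


jpeg_magic = b"\xff\xd8\xff\xe0"  # sample bytes only, not a complete image
uri = to_data_uri(jpeg_magic)
print(uri)
print(from_data_uri(uri) == jpeg_magic)
```

Everything after the first comma is plain Base64, so recovering the image only requires splitting on that comma and decoding.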
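Based on the analysis so far, the key to breaking the encryption is getting hold of the decryption function. Fine, a puzzle first thing in the morning; the more it resists, the more motivated I get.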
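1. Detailed analysis
First look at the front-end node elements to see whether any keywords can serve as leads
Nothing useful there, except one class name, lazy, that might help
Next, look at all the js files requested while the page loads
Skimming through them, the ones that might be useful are:
crypto-js.js
encrypt.min.js
These two are presumably encryption functions (in fact I first set breakpoints to check whether the images are encrypted with them, and it turned out they are not, so I won't go into that here)
LazyLoad.js
detail.js
These may be related to content display. Since LazyLoad matches the node's class name, it is the first thing to analyze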
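The same shortlisting can be done outside the browser. The following is a rough sketch, assuming the page HTML has already been saved to a string: it collects every script src with the standard-library HTML parser and keeps the ones whose names contain a few keywords. The keyword list and the paths in the sample HTML are assumptions for illustration, not the site's real paths.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def list_scripts(html: str) -> list:
    parser = ScriptCollector()
    parser.feed(html)
    return parser.sources


# Assumed keywords, chosen to match the file names listed above
KEYWORDS = ("crypt", "lazy", "detail")


def shortlist(sources: list) -> list:
    return [s for s in sources if any(k in s.lower() for k in KEYWORDS)]


# Hypothetical page fragment
sample_html = """
<script src="/js/jquery.min.js"></script>
<script src="/js/crypto-js.js"></script>
<script src="/js/LazyLoad.js"></script>
<script>var inline = 1;</script>
"""
print(shortlist(list_scripts(sample_html)))
```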
2. Analyzing the JS code
Right-click to open the file
Here is the code:
!function(t, e, r, i) {var o = t(e);t.fn.lazyload = function(n) {var a, f = this, l = {threshold: 0,failure_limit: 0,event: "scroll",effect: "show",container: e,data_attribute: "original",skip_invisible: !0,appear: null,load: null,prefix: "prefix",placeholder: "/assets/images/default/loading/470x666.jpg",host: image_url};function d() {var e = 0;f.each((function() {var r = t(this);if (!l.skip_invisible || r.is(":visible"))if (t.abovethetop(this, l) || t.leftofbegin(this, l));else if (t.belowthefold(this, l) || t.rightoffold(this, l)) {if (++e > l.failure_limit)return !1} else r.trigger("appear"),e = 0}))}return n && (i !== n.failurelimit && (n.failure_limit = n.failurelimit,delete n.failurelimit),i !== n.effectspeed && (n.effect_speed = n.effectspeed,delete n.effectspeed),t.extend(l, n)),a = l.container === i || l.container === e ? o : t(l.container),0 === l.event.indexOf("scroll") && a.bind(l.event, (function() {return d()})),this.each((function() {var e = this, r = t(e), o = r.attr("data-" + l.prefix), n = r.attr("data-placeholder"), a = r.attr("data-" + l.data_attribute);a = o ? o + a : a;let d = r.attr("data-aes");if ((r.attr("src") === i || !1 === r.attr("src")) && r.is("img")) {let i = t(e).attr("data-loading");t(e).attr("data-no-loading") || (i ? r.attr("src", i) : n ? r.attr("src", img_host + n) : r.attr("src", img_host + l.placeholder))}r.one("appear", (function() {if (!this.loaded) {if (l.appear) {var i = f.length;l.appear.call(e, i, l)}"true" == d && a.indexOf(".txt") > -1 ? (e.loaded = !0,t.ajax({url: l.host + a,type: "get",success: function(t) {let e = desDecrypt(t);r.is("img") ? r.attr("src", e) : r.css({"background-image": "url('" + e + "') !important"})},error: function(t) {r.attr("load-status", "error")}})) : t("<img />").bind("load", (function() {var i = r.attr("data-" + l.prefix), o = r.attr("data-" + l.data_attribute);o = i ? i + o : o,o = l.host + o,r.hide(),r.is("img") ?
r.attr("src", o) : r.css({"background-image": "url('" + o + "')"}),r[l.effect](l.effect_speed),e.loaded = !0;var n = t.grep(f, (function(t) {return !t.loaded}));if (f = t(n),l.load) {var a = f.length;l.load.call(e, a, l)}})).attr("src", a)}})),0 !== l.event.indexOf("scroll") && r.bind(l.event, (function() {e.loaded || r.trigger("appear")}))})),o.bind("resize", (function() {d()})),/(?:iphone|ipod|ipad).*os 5/gi.test(navigator.appVersion) && o.bind("pageshow", (function(e) {e.originalEvent && e.originalEvent.persisted && f.each((function() {t(this).trigger("appear")}))})),t(r).ready((function() {d()})),this},t.belowthefold = function(r, n) {return (n.container === i || n.container === e ? (e.innerHeight ? e.innerHeight : o.height()) + o.scrollTop() : t(n.container).offset().top + t(n.container).height()) <= t(r).offset().top - n.threshold},t.rightoffold = function(r, n) {return (n.container === i || n.container === e ? o.width() + o.scrollLeft() : t(n.container).offset().left + t(n.container).width()) <= t(r).offset().left - n.threshold},t.abovethetop = function(r, n) {return (n.container === i || n.container === e ? o.scrollTop() : t(n.container).offset().top) >= t(r).offset().top + n.threshold + t(r).height()},t.leftofbegin = function(r, n) {return (n.container === i || n.container === e ? 
o.scrollLeft() : t(n.container).offset().left) >= t(r).offset().left + n.threshold + t(r).width()},t.inviewport = function(e, r) {return !(t.rightoffold(e, r) || t.leftofbegin(e, r) || t.belowthefold(e, r) || t.abovethetop(e, r))},t.extend(t.expr[":"], {"below-the-fold": function(e) {return t.belowthefold(e, {threshold: 0})},"above-the-top": function(e) {return !t.belowthefold(e, {threshold: 0})},"right-of-screen": function(e) {return t.rightoffold(e, {threshold: 0})},"left-of-screen": function(e) {return !t.rightoffold(e, {threshold: 0})},"in-viewport": function(e) {return t.inviewport(e, {threshold: 0})},"above-the-fold": function(e) {return !t.belowthefold(e, {threshold: 0})},"right-of-fold": function(e) {return t.rightoffold(e, {threshold: 0})},"left-of-fold": function(e) {return !t.rightoffold(e, {threshold: 0})}})
}($ || jQuery, window, document);
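Reading through the code turns up the following suspicious items: 1. a check for whether the path contains .txt; 2. a call to a desDecrypt function; 3. handling of the src attribute: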
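The branch that matters in the minified code is the one that tests data-aes and .txt before calling desDecrypt. Below is a small Python sketch of that decision, written from my reading of the JS above: it takes an img element's attributes and returns the URL to fetch plus a flag saying whether the response needs decrypting. The function name and the host value are hypothetical; on the real page the host comes from the image_url global.

```python
from typing import Dict, Tuple


def resolve_image(attrs: Dict[str, str], host: str) -> Tuple[str, bool]:
    """Mirror the lazyload branch: build the URL and decide if the body is ciphertext."""
    prefix = attrs.get("data-prefix") or ""           # l.prefix is "prefix" -> data-prefix
    path = prefix + attrs.get("data-original", "")    # l.data_attribute is "original"
    # JS condition: "true" == d && a.indexOf(".txt") > -1
    encrypted = attrs.get("data-aes") == "true" and ".txt" in path
    return host + path, encrypted


HOST = "http://img.example"  # assumption, stands in for the page's image_url

print(resolve_image({"data-original": "/a/1.jpg.txt", "data-aes": "true"}, HOST))
print(resolve_image({"data-original": "/a/2.jpg"}, HOST))
```

Note that the JS test is a substring check (indexOf), so a path only has to contain .txt, not end with it.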
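Set a breakpoint at desDecrypt. Sure enough, it is hit when an image loads. Print the variables along the way:
Try calling the function directly to decrypt:
This should be the function. Jumping to its definition leads to another js file: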
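III. Converting to Python Scraper Code
With the target function found, hand it straight to ChatGPT to convert into Python code:
Hm, it also needs a key. Set a breakpoint and grab it:
Try it:
Ask ChatGPT how to save the image locally: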
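If the decrypted text is already a data URI whose payload is a complete image file, saving it does not strictly need an image library. The sketch below is an alternative I am suggesting rather than what ChatGPT produced: it splits off the prefix, Base64-decodes the payload, and writes the bytes as-is. The function name, the demo URI, and the output path are made up for the example.

```python
import base64
import tempfile
from pathlib import Path


def save_data_uri(data_uri: str, path) -> int:
    """Decode the Base64 payload of a data URI and write it to path; return bytes written."""
    header, sep, payload = data_uri.partition(",")
    if not sep:
        raise ValueError("no comma found, not a data URI")
    image_bytes = base64.b64decode(payload)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    path.write_bytes(image_bytes)
    return len(image_bytes)


# Demo with 4 sample bytes (a JPEG header only, not a viewable image)
demo_uri = "data:image/jpg;base64," + base64.b64encode(b"\xff\xd8\xff\xe0").decode("ascii")
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "demo.jpg"
    written = save_data_uri(demo_uri, target)
    saved = target.read_bytes()
print(written)
```

Writing the bytes directly keeps the file exactly as the server sent it, whereas opening and re-saving through an image library re-encodes it.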
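Done. That completes the cracking process; what remains is to add it to the scraper code. Have ChatGPT generate sample code once more, downloading with 20 processes:
IV. Full Code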
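As a side note on the parallel part: downloads are mostly waiting on the network, so a thread pool is a reasonable swap for a process pool. The sketch below uses concurrent.futures.ThreadPoolExecutor instead of the 20 processes used in this post, and a stand-in worker that does no network access; both fake_download and run_parallel are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str):
    """Stand-in for a real download function; returns the URL and a dummy size."""
    return url, len(url)


def run_parallel(worker, urls, workers: int = 20):
    """Run worker over urls with a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, urls))


urls = [f"http://xxx/{i}.jpg" for i in range(5)]
results = run_parallel(fake_download, urls)
print(len(results))
```

In a real run, fake_download would be replaced by the function that fetches, decrypts, and saves one image.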
import base64
import logging
import multiprocessing
import os
import time
from io import BytesIO

import requests
from Crypto.Cipher import DES
from Crypto.Util.Padding import unpad
from PIL import Image

logging.basicConfig(filename="request.log")


def des_decrypt(data):
    # Decryption function: Base64-decode, DES-CBC decrypt, then strip the padding
    key = b'jeH3O1VX'  # key
    iv = b'nHnsU4cX'  # initialization vector
    cipher = DES.new(key, DES.MODE_CBC, iv)
    ciphertext = base64.b64decode(data)
    decrypted = cipher.decrypt(ciphertext)
    unpadded = unpad(decrypted, DES.block_size)
    return unpadded.decode('utf-8')


# Download, decrypt, and save a single image
def download_image(url):
    # Send the request
    try:
        response = requests.get(url, timeout=10)
    except Exception as e:
        logging.error(f"[ERROR] get url [{url}] Failed! error:[{e}]")
        return
    print(f"[INFO] get url [{url}] Success!")

    # Decrypt the response content
    try:
        content = des_decrypt(response.content)
    except Exception as e:
        logging.error(f"[ERROR] decrypt url content [{url}] Failed! error:[{e}]")
        return

    # Save the image to a file under the data directory
    os.makedirs("data", exist_ok=True)
    download_path = os.path.join("data", f"{url.split('/')[3]}_{int(time.time())}.jpg")
    try:
        # Extract the image data from the Base64-encoded data URI
        image_data = content.split(',')[1]
        image = Image.open(BytesIO(base64.b64decode(image_data)))
        # Save the image
        image.save(download_path)
    except Exception as e:
        logging.error(f"[ERROR] save image [{url}] to [{download_path}] Failed! error:[{e}]")
        return
    print(f"[INFO] save image [{url}] to [{download_path}] success!")


if __name__ == '__main__':
    # The links are brute-force enumerated URLs of the encrypted image data,
    # so there is no need to request the web pages again
    url_list = [f"http://xxx/{i}.jpg" for i in range(10000)]
    # Start 20 processes
    pool = multiprocessing.Pool(processes=20)
    # Fetch the URLs in parallel
    pool.map(download_image, url_list)
    # Shut down the process pool
    pool.close()
    pool.join()