Showing 113 changed files with 4,346 additions and 28 deletions.
File renamed without changes.
@@ -0,0 +1,22 @@
# JA3 Solutions

### How to recognize JA3 detection:
If the response contains the text **"Just a moment..."**, it is almost certain that the request was rejected because of its JA3 (TLS) fingerprint
![img.png](img.png)

### Check your browser's JA3 fingerprint: https://tls.peet.ws/api/clean
![img_2.png](img_2.png)

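For context, a JA3 fingerprint is the MD5 hash of five fields taken from the TLS ClientHello: the TLS version, cipher suites, extensions, elliptic curves, and point formats. Fields are joined with commas and the values inside a field with hyphens. The sketch below shows only that string-and-hash step; the numeric values are illustrative placeholders, not the fingerprint of any particular browser.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: fields joined by ',', values by '-'."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in group) for group in groups]
    return ",".join(fields)


def ja3_hash(ja3):
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values only (771 is the decimal code for TLS 1.2).
example = ja3_string(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
print(example)
print(ja3_hash(example))
```

Because the hash depends on the exact order of ciphers and extensions, a plain Python HTTP client and a real browser generally produce different fingerprints even when their headers are identical.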
## Option 1:
#### Reference: https://zhuanlan.zhihu.com/p/601474166
#### This option uses curl_cffi, a community-modified requests-style library
pip install curl_cffi -i https://pypi.tuna.tsinghua.edu.cn/simple
#### Compare the output of the fingerprint-impersonating library with the stock library:
![img_1.png](img_1.png)

## Option 2:
#### The second option is likely to be slower: deploy a browser-like service on Linux (for example, a browser running inside a Docker container)

#### For reference, see an earlier blog post of mine: https://blog.csdn.net/m0_61720747/article/details/133993502?spm=1001.2014.3001.5502

@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/2/6 17:15
# @Author : Harvey
# @File : ja3_demo.py

import requests
# curl_cffi ships a requests-compatible module; alias it so it does not shadow requests
from curl_cffi import requests as cffi_requests

URL = "https://www.globalspec.com/productfinder/data_acquisition_signal_conditioning"

# Using the JA3-impersonating library
a = cffi_requests.get(URL, impersonate="chrome101")
print(a.text)

print("\n=====================\n")
# Using the stock requests library
b = requests.get(URL)
print(b.text)

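The comparison in the demo above is done by eye. A minimal sketch of automating it, assuming the only signal is the "Just a moment..." marker described in the README (the helper names here are hypothetical, not part of the repository):

```python
CHALLENGE_MARKER = "Just a moment..."


def looks_blocked(body):
    """Return True when the response body contains the challenge-page marker."""
    return CHALLENGE_MARKER in body


def summarize(bodies):
    """Map each client name to 'blocked' or 'ok' based on its response body."""
    return {name: ("blocked" if looks_blocked(body) else "ok")
            for name, body in bodies.items()}


# Stand-in bodies; in the demo these would be a.text and b.text.
report = summarize({
    "curl_cffi": "<html><title>Product finder</title></html>",
    "requests": "<html><title>Just a moment...</title></html>",
})
print(report)
```

A marker check like this is only a heuristic: a site can change its challenge page at any time, so a status-code check alongside it is usually worthwhile.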
@@ -0,0 +1,81 @@
## Parameter reverse-engineering techniques and solutions
#### 1. RPC remote invocation
#### 2. JA3 solutions
#### 3. selenium (the fallback when nothing else works)

## Quick-location tips for scrapers

### I. Common search keywords for encryption functions:

1. MD5:
Search keywords: 1732584193, 271733879, 1732584194, 271733878, md5
These constants appear in hand-written (native) MD5 source code

2. SHA1:
Search keywords: 1732584193, 271733879, 1732584194, 271733878, 1009589776
These constants appear in hand-written SHA1 source code

3. Base64:
Search keyword: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=
Often used together with other encryption functions

4. AES:
Search keywords: crypto, AES, encrypt
Often used together with other encryption functions

5. DES:
Search keywords: crypto, DES, encrypt, mode, padding
See the crypto official website

6. RSA:
Search keywords: setPublicKey, rsa
See the jsencrypt official website

7. websocket:
Search keywords: onopen, onmessage, send, WebSocket
The protocols are ws and wss, analogous to http and https

8. JS encoding:
Search keywords: encodeURI, encodeURIComponent, btoa, escape
The first two are the most common

9. Exported encryption functions:
Search keywords: module.exports, exports
The usual way encryption functions are exported

10. FORM data:
Search keywords: password, pwd, sign, userid. The values may or may not be encrypted. Append a colon or an equals sign to the keyword, or prefix it with a dot, for example pwd:, pwd=, pwd =, .pwd
Search for the key whose value is encrypted in the form's key-value pairs. The form is submitted with POST, and the right keywords differ from form to form

11. Hexadecimal:
Search keywords: 0123456789ABCDEF, 0123456789abcdef

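The numeric MD5 and SHA1 keywords in the list above are the algorithms' initial state words, which JavaScript implementations often write as signed 32-bit decimals (which is why searching for the digits without the sign works). The sketch below derives them from the well-known hex constants and runs a naive substring scan over a piece of JS source. The `KEYWORDS` table is a shortened, hypothetical subset of the list, and plain substring matching will produce false positives on real code.

```python
def to_signed32(value):
    """Interpret an unsigned 32-bit value the way JS bitwise operators do."""
    return value - (1 << 32) if value & 0x80000000 else value


def magnitude(value):
    """Decimal digits of the signed 32-bit form, without the sign."""
    return str(abs(to_signed32(value)))


MD5_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)
SHA1_EXTRA = 0xC3D2E1F0  # fifth state word, present in SHA1 but not in MD5

# Shortened keyword table for illustration only.
KEYWORDS = {
    "MD5": [magnitude(v) for v in MD5_INIT] + ["md5"],
    "SHA1": [magnitude(SHA1_EXTRA)],
    "RSA": ["setPublicKey"],
    "WebSocket": ["onopen", "onmessage", "WebSocket"],
}


def scan(js_source):
    """Return {algorithm: [matched keywords]} for every keyword found."""
    hits = {}
    for algorithm, words in KEYWORDS.items():
        found = [word for word in words if word in js_source]
        if found:
            hits[algorithm] = found
    return hits


print(KEYWORDS["MD5"])
print(scan("var a = 1732584193, b = -271733879; enc.setPublicKey(k);"))
```

Since SHA1 reuses the four MD5 state words, a SHA1 implementation will also match the MD5 entry; the extra constant 1009589776 is what distinguishes the two.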
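### II. Overview of the main encryption and decryption algorithms:

1. Symmetric encryption: encryption and decryption use the same key (AES, DES, 3DES)

2. Asymmetric encryption: encryption and decryption use different keys, usually a public key and a private key, which must be used as a matching pair or the ciphertext cannot be opened (RSA, DSA, ECC)

3. Hash algorithms: also called hash functions. They are one-way and irreversible, so a digest cannot be decrypted back into its input (MD5, SHA1, HMAC)

4. Base64: an encoding rather than encryption, usually used to turn binary data into printable characters suitable for transmission. It is reversible. The encoded data is a string made of the characters A-Z, a-z, 0-9, + and / (26 + 26 + 10 + 1 + 1 = 64 characters), plus "=" as the padding character, which makes 65 in total (HTTPS, HTTP + SSL layer)
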
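The difference between a one-way hash and a reversible encoding can be checked with the Python standard library alone. This is a small illustration, using "123456" as an arbitrary sample input; the 16-character MD5 form that some sites use is taken here to be the middle 16 characters of the 32-character digest.

```python
import base64
import hashlib

message = b"123456"

# Hashes: fixed-length digests, no way to decode them back to the input.
md5_hex = hashlib.md5(message).hexdigest()        # 32 hex characters
md5_16 = md5_hex[8:24]                            # 16-character truncated form
sha1_hex = hashlib.sha1(message).hexdigest()      # 40 hex characters
sha256_hex = hashlib.sha256(message).hexdigest()  # 64 hex characters

# Base64: an encoding, so decoding returns the original bytes.
encoded = base64.b64encode(b"ab")   # 2 input bytes need one "=" of padding
decoded = base64.b64decode(encoded)

print(len(md5_hex), len(md5_16), len(sha1_hex), len(sha256_hex))
print(encoded, decoded)
```

Digest length is often the quickest hint about which hash a site uses: 32 hex characters suggests MD5, 40 suggests SHA1, and 64 suggests SHA-256.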
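### III. Common ciphertext formats:
1. MD5: usually seen as 32 hexadecimal characters, or as a 16-character truncation (the middle 16 characters of the 32-character digest). A 40-character digest is normally SHA1 rather than MD5

MD5 of 123456 (the 16-character form starts with 49; the 32-character form starts with e10 or E10):
49BA59ABBE56E057 E10ADC3949BA59ABBE56E057F20F883E

2. SHA1: 40 hexadecimal characters. Digests of 64 or 128 characters normally come from SHA-256 or SHA-512 rather than SHA1
SHA1 of 123456 (40 characters, starts with 7c):
7c4a8d09ca3762af61e59520943dc26494f8941b
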
3. AES: data must be a string; if it is an object, convert it with JSON.stringify(data) first:
var CryptoJS = require("crypto-js");
var data = 'my message';

The secret key:

var secret = 'secret key 123';
// Encrypt
var ciphertext = CryptoJS.AES.encrypt(data, secret).toString();
// Decrypt
var bytes = CryptoJS.AES.decrypt(ciphertext, 'secret key 123');
var originalText = bytes.toString(CryptoJS.enc.Utf8);
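
When `CryptoJS.AES.encrypt` is given a string passphrase as in the snippet above, it does not use the string as the key directly. To my understanding it derives a 256-bit key and a 128-bit IV from the passphrase and a random 8-byte salt with the OpenSSL-compatible MD5 key derivation, and the Base64 output starts with the bytes `Salted__` followed by the salt. The Python sketch below reproduces only that derivation and the ciphertext layout, using the standard library; the function names are hypothetical, and the actual AES decryption would need a third-party library that is not shown here.

```python
import base64
import hashlib


def evp_bytes_to_key(passphrase, salt, key_len=32, iv_len=16):
    """OpenSSL-style EVP_BytesToKey with MD5 and a single iteration."""
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        block = hashlib.md5(block + passphrase + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]


def split_salted_ciphertext(b64_text):
    """Split 'Salted__' + 8-byte salt + ciphertext; returns (salt, ciphertext)."""
    raw = base64.b64decode(b64_text)
    if raw[:8] != b"Salted__":
        raise ValueError("not an OpenSSL-style salted ciphertext")
    return raw[8:16], raw[16:]


# Stand-in ciphertext built by hand, only to exercise the parsing step.
sample = base64.b64encode(b"Salted__" + b"12345678" + b"ciphertext-bytes").decode()
salt, body = split_salted_ciphertext(sample)
key, iv = evp_bytes_to_key(b"secret key 123", salt)
print(sample[:10], len(key), len(iv))
```

This also explains a handy search hint: Base64 ciphertexts produced this way begin with `U2FsdGVkX1`, the encoding of `Salted__`.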
@@ -1,22 +1,21 @@

Prerequisite: the Tampermonkey userscript must be running in the web page!

Set the port:

Edit sekiro.port=6001 in conf/config.properties to change the port. It must match the port used in the Tampermonkey script

Steps before running:

1. Open the target website in the browser

2. Update the userscript header in Tampermonkey, e.g. // @match https://www.taobao.com/*

3. Change the JS code to be executed. The default is:

var result = document.cookie;
resolve(result);
4. Start the RPC service and leave it running: bin/sekiro.bat

5. Refresh the browser and check for messages such as "sekiro: begin of connect to wsURL: xxx". If they appear, the connection succeeded

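
A mismatch between the port in conf/config.properties and the port in the userscript is an easy mistake to make in the setup above. The sketch below checks the two against each other. It assumes the userscript contains a literal `ws://host:port/...` URL; that URL shape and the helper names are assumptions for illustration, not something taken from Sekiro's documentation.

```python
import re


def read_sekiro_port(properties_text):
    """Return the integer value of sekiro.port, or None when it is absent."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "sekiro.port":
            return int(value.strip())
    return None


def websocket_ports(userscript_text):
    """Collect every explicit port found in ws:// or wss:// URLs."""
    matches = re.findall(r"wss?://[^/\s:'\"]+:(\d+)", userscript_text)
    return {int(port) for port in matches}


def ports_match(properties_text, userscript_text):
    """True when the configured port appears in one of the script's ws URLs."""
    port = read_sekiro_port(properties_text)
    return port is not None and port in websocket_ports(userscript_text)


config = "# sekiro config\nsekiro.port=6001\n"
script = 'var url = "ws://127.0.0.1:6001/register";  // hypothetical URL'
print(ports_match(config, script))
```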
This file was deleted.
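@@ -0,0 +1,18 @@
## Douyin comment-section scraper

### Site anti-scraping change log:
2024-02-06:
1. When not logged in, the cookie now carries a ttwid parameter; the site used to check s_v_web_id and now checks ttwid
2. Paging is not allowed when not logged in
3. After login the request carries sessionid and sessionid_ss, and X-Bogus can be omitted

#### Usage:
#### Just fill in the video id and the number of pages to crawl (the total page count can also be derived from the response data with a small change)
![img_1.png](img_1.png)

![img.png](img.png)
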
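The usage note mentions that the total page count could be derived from the response instead of being typed in. A minimal sketch of that calculation, assuming the response exposes a total comment count (the idea of a `total` field is an assumption here, not confirmed by the source):

```python
import math


def pages_needed(total_comments, per_page=20):
    """Number of requests needed to read total_comments at per_page each."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_comments / per_page)


# Hypothetical total taken from a first response.
print(pages_needed(101))
```

The scraper could then loop over `range(1, pages_needed(total) + 1)` instead of a hard-coded page limit, still stopping early when a page comes back empty.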
@@ -0,0 +1,193 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/1/17 18:53
# @Author : Harvey
# @File : run_new.py
import datetime
import json
from urllib.parse import urlencode
import requests
import execjs
import pandas as pd

'''
Anti-scraping change log:
2024-02-06:
1. When not logged in, the cookie now carries a ttwid parameter; the site used to check s_v_web_id and now checks ttwid
2. Paging is not allowed when not logged in
3. After login the request carries sessionid and sessionid_ss, and X-Bogus can be omitted
'''

def get_xb_data(urlform):
    # Load the JS file that implements the X-Bogus signing routine
    # (assumes the file is UTF-8 encoded)
    filename_js = 'get_XBogus2.js'
    with open(filename_js, mode='r', encoding='utf-8') as f:
        pw_js = f.read()
    js1 = execjs.compile(pw_js)
    print('********* Generating X-Bogus *********')
    xb_data = js1.call('get_xb', urlform)  # sign the query string
    print(xb_data)
    return xb_data


def get_headers():
    # Updated 2024-01-17: the site used to check s_v_web_id and now checks ttwid
    # Cookie when not logged in:
    # cookies = {
    #     "ttwid": "1%7CvSGoO5ZPPEgIIWpNoRr0YCmMrWzQACoN1hNhqxBsexQ%7C1705490555%7C9d5ecaae8cae4fc1ef856f7de5f31a991a4d9a459b738888c7525070d1e806d0",
    #     # "s_v_web_id": "verify_lqw9zg13_3aUMuPsb_dPTm_4QA0_8Msc_wKAv0zy96U0j",
    # }
    # Cookie after login:
    cookies = {
        "sessionid": "xxxxxxxxxxxx",
        "sessionid_ss": "xxxxxxxxxxxx",
    }
    headers = {
        "referer": "https://www.douyin.com/",
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    return cookies, headers


def get_cursor(page):
    '''
    :param page: page number, starting from 1
    :return: cursor is the offset of the first comment (starting at 0); count is the number of comments per request (maximum 50)
    '''
    count = 20
    cursor = (page - 1) * count  # e.g. page 1 -> (1 - 1) * 20 = 0
    return cursor, count


def dispose_params(params_data):
    '''
    Append "X-Bogus" to the query string and build the request URL
    :param params_data: query parameters without X-Bogus
    :return: the full request URL
    '''
    # urlencode escapes "=" inside values as %3D; restore it before signing
    params_ = urlencode(params_data).replace('%3D', '=')
    xb_data = get_xb_data(params_)
    params = params_ + '&X-Bogus=' + xb_data
    url = 'https://www.douyin.com/aweme/v1/web/comment/list/?' + params
    return url


def get_url(page, aweme_id):
    '''
    "aweme_id": "7284507907465481512",  # video id
    "cursor": "0",  # offset of the first comment
    "count": "50",  # comments per request, at most 50
    msToken: fixed value
    "X-Bogus": generated by signing the params that do not yet contain "X-Bogus"
    :return: the request URL, with X-Bogus appended
    '''
    cursor, count = get_cursor(page)
    params_data = {
        "device_platform": "webapp",
        "aid": "6383",
        "channel": "channel_pc_web",
        "aweme_id": aweme_id,
        "cursor": cursor,
        "count": count,
        "item_type": "0",
        "insert_ids": "",
        "whale_cut_token": "",
        "cut_version": "1",
        "rcFT": "",
        "pc_client_type": "1",
        "version_code": "170400",
        "version_name": "17.4.0",
        "cookie_enabled": "true",
        "screen_width": "1920",
        "screen_height": "1080",
        "browser_language": "zh-CN",
        "browser_platform": "Win32",
        "browser_name": "Chrome",
        "browser_version": "120.0.0.0",
        "browser_online": "true",
        "engine_name": "Blink",
        "engine_version": "120.0.0.0",
        "os_name": "Windows",
        "os_version": "10",
        "cpu_core_num": "6",
        "device_memory": "8",
        "platform": "PC",
        "downlink": "10",
        "effective_type": "4g",
        "round_trip_time": "50",
        "webid": "7325026101045102121",
        "msToken": "",
        # "X-Bogus": "DFSzswVLhOJANcLLti0PqvB9Piz9"
    }
    url = dispose_params(params_data)
    print(url)
    return url


def dispose_comments(comments, data_list):
    for comment_one in comments:
        # ============ User profile fields ===========
        nickname = comment_one['user']['nickname']  # user name
        user_id = comment_one['user']['short_id']  # user id
        signature = comment_one['user']['signature']  # profile signature
        head_image = comment_one['user']['avatar_medium']['url_list'][0]  # medium-size avatar
        user_url_ = comment_one['user']['sec_uid']
        user_url = 'https://www.douyin.com/user/' + user_url_
        # ============ User profile fields ===========
        pinlun_cid = comment_one['cid']  # comment id, used later to request the replies to this comment

        text_data = comment_one['text']  # comment text
        digg_count = comment_one['digg_count']  # like count
        reply_comment_total = comment_one['reply_comment_total']  # reply count
        create_time_ts = comment_one['create_time']  # comment time as a Unix timestamp
        create_time = datetime.datetime.fromtimestamp(create_time_ts)
        # ip_label is not always present; fall back to a placeholder
        ip_label = comment_one.get('ip_label', "unknown ip")

        print(f"user: {nickname} , user id: {user_id} , ip location: {ip_label} , comment: {text_data} , "
              f"replies: {reply_comment_total} , likes: {digg_count} , posted at: {create_time}")
        data_one_dict = {"user id": user_id, "user name": nickname, "user url": user_url, "avatar": head_image, "ip location": ip_label,
                         "comment": text_data, "replies": reply_comment_total, "likes": digg_count, "signature": signature, "posted at": create_time}

        image_list = comment_one.get('image_list')  # non-empty when the comment has an image
        if image_list:
            PL_image = image_list[0]['origin_url']['url_list'][0]
            data_one_dict['PL_image'] = PL_image
        data_list.append(data_one_dict)
    return data_list


def run():
    aweme_id_list = ["7306459845480254754"]
    for aweme_id in aweme_id_list:
        file_name = 'data/' + aweme_id + 'asda.xlsx'
        cookies, headers = get_headers()
        max_page = 5  # fetch 5 pages (up to 5 * 20 = 100 comments)
        data_list = []
        for page in range(1, max_page + 1):
            print(f"============= Fetching page {page} =============")
            url = get_url(page, aweme_id)
            response = requests.get(url, headers=headers, cookies=cookies)
            print(response.text)
            data = response.json()  # parse the body once and reuse it
            if data.get("status_msg") == "blocked":
                break
            comments = data.get('comments')
            if comments:
                print(f"This page has {len(comments)} comments")
                data_list = dispose_comments(comments, data_list)
            else:
                break

        df_data = pd.DataFrame(data_list)  # list of dicts -> DataFrame
        print(df_data)
        df_data.to_excel(file_name, index=False)


if __name__ == '__main__':
    run()

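The paging arithmetic and the `%3D` replacement in the script can be tried in isolation, without the JS signer or any network access. This is a minimal sketch using only three of the query parameters; the helper names are hypothetical and the real request needs the full parameter set plus X-Bogus.

```python
from urllib.parse import urlencode


def page_to_cursor(page, count=20):
    """Translate a 1-based page number into a 0-based comment offset."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * count


def build_query(aweme_id, page, count=20):
    """Build the unsigned query string for one page of comments."""
    params = {
        "aweme_id": aweme_id,
        "cursor": page_to_cursor(page, count),
        "count": count,
    }
    return urlencode(params)


print(build_query("7306459845480254754", 2))

# urlencode escapes "=" inside a value, which is what the script undoes
# with .replace('%3D', '=') before signing.
print(urlencode({"k": "a=b"}))
```

Keeping the cursor calculation in one small function makes it harder for the offset and the page size to drift apart when `count` is changed.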