Add Douyin scraper and various anti-scraping bypass solutions
wu50416 committed Feb 6, 2024
1 parent 3316d88 commit 016e421
Showing 113 changed files with 4,346 additions and 28 deletions.
File renamed without changes.
22 changes: 22 additions & 0 deletions AAA-反爬破解方案/Ja3/README.md
@@ -0,0 +1,22 @@
# JA3 Solutions

### JA3 symptoms:
If the response contains the string **"Just a moment..."**, the request has almost certainly been blocked by JA3 fingerprint detection.
![img.png](img.png)
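
The check described above can be sketched in a few lines of Python. The helper name `looks_ja3_blocked` is hypothetical, and matching on a single marker string is only a heuristic, not a guarantee that JA3 detection was the cause:

```python
def looks_ja3_blocked(body: str) -> bool:
    """Heuristic: the challenge page contains the marker text 'Just a moment...'."""
    return "Just a moment..." in body


if __name__ == "__main__":
    blocked_html = "<html><head><title>Just a moment...</title></head></html>"
    normal_html = "<html><head><title>Product Finder</title></head></html>"
    print(looks_ja3_blocked(blocked_html))  # True
    print(looks_ja3_blocked(normal_html))   # False
```
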

### Check your browser's JA3 fingerprint: https://tls.peet.ws/api/clean
![img_2.png](img_2.png)
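
For context, a JA3 fingerprint is the MD5 hash of a string built from five fields of the TLS ClientHello (TLS version, cipher suites, extensions, elliptic curves, point formats). Fields are joined by commas and the values inside a field by hyphens. The sketch below only illustrates that composition; the function name and the sample values are illustrative and were not captured from a real browser:

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and return (ja3_string, md5_hex)."""
    fields = [
        str(version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Made-up sample values, for illustration only
    text, digest = ja3_hash(771, [4865, 4866], [0, 10], [29, 23], [0])
    print(text)
    print(digest)
```

Because the hash depends on the order and content of these fields, a Python HTTP client and a real browser normally produce different fingerprints even when they send identical headers.
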

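## Solution 1:
#### Reference: https://zhuanlan.zhihu.com/p/601474166
#### This uses curl_cffi, a third-party library with a requests-style API that can impersonate browser TLS fingerprints
pip install curl_cffi -i https://pypi.tuna.tsinghua.edu.cn/simple
#### Compare the fingerprint-impersonating library with the stock requests library:
![img_1.png](img_1.png)

## Solution 2:
#### The second approach is likely to be slower: deploy a browser-like service on Linux (use Docker to install a container with a built-in browser)

#### For details, see a blog post I wrote earlier: https://blog.csdn.net/m0_61720747/article/details/133993502?spm=1001.2014.3001.5502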


Binary file added AAA-反爬破解方案/Ja3/img.png
Binary file added AAA-反爬破解方案/Ja3/img_1.png
Binary file added AAA-反爬破解方案/Ja3/img_2.png
17 changes: 17 additions & 0 deletions AAA-反爬破解方案/Ja3/ja3_demo.py
@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/2/6 17:15
# @Author : Harvey
# @File : ja3_demo.py

# curl_cffi impersonates a browser's JA3 fingerprint; alias it so that it
# does not get shadowed by the stock requests import below
from curl_cffi import requests as curl_requests
# Stock requests library, for comparison
import requests

URL = "https://www.globalspec.com/productfinder/data_acquisition_signal_conditioning"

# Request with the JA3-impersonating library
a = curl_requests.get(URL, impersonate="chrome101")
print(a.text)

print("\n=====================\n")
# Same request with the stock requests library
b = requests.get(URL)
print(b.text)

81 changes: 81 additions & 0 deletions AAA-反爬破解方案/README.md
@@ -0,0 +1,81 @@
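## Parameter reverse-engineering techniques and solutions
#### 1. RPC (remote procedure call)
#### 2. JA3 solutions
#### 3. selenium (the fallback when nothing else can be cracked)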



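## Tips for quickly locating code when scraping

### I. Common search keywords for encryption functions, with notes:

1. MD5:
Search keywords: 1732584193, 271733879, 1732584194, 271733878, md5
These constants come from the source of a native MD5 implementation

2. SHA1:
Search keywords: 1732584193, 271733879, 1732584194, 271733878, 1009589776
These constants come from the source of a native SHA1 implementation

3. Base64:
Search keywords: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=
Often combined with other encryption functions

4. AES:
Search keywords: crypto, AES, encrypt
Often combined with other encryption functions

5. DES:
Search keywords: crypto, DES, encrypt, mode, padding
See the official crypto website

6. RSA:
Search keywords: setPublicKey, rsa
See the official jsencrypt website

7. WebSocket:
Search keywords: onopen, onmessage, send, WebSocket
The ws and wss protocols are analogous to http and https

8. JS encoding:
Search keywords: encodeURI, encodeURIComponent, btoa, escape
The first two are the most common

9. Exported encryption functions:
Search keywords: module.exports, exports
The usual way encryption functions are exported

10. Form data:
Search keywords: password, pwd, sign, userid. Whether or not the value is encrypted, append a colon or an equals sign to the keyword, or prepend a dot, for example pwd:, pwd=, pwd =, .pwd
Search for the keys whose values are encrypted in the form's key-value pairs. The form is submitted via POST, and the keywords differ from form to form

11. Hexadecimal:
Search keywords: 0123456789ABCDEF, 0123456789abcdef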
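
The keyword list above can be turned into a quick scanner for a downloaded JS file. This is a minimal sketch: the keyword table is a subset of the list, while the function name `scan_js` and the grouping into categories are illustrative choices. A hit only suggests where to set a breakpoint and does not prove which algorithm is used.

```python
# Keywords taken from the list above; the grouping is illustrative.
KEYWORDS = {
    "MD5": ["1732584193", "271733879", "1732584194", "271733878", "md5"],
    "SHA1": ["1009589776"],
    "AES/DES": ["crypto", "encrypt", "padding"],
    "RSA": ["setPublicKey", "rsa"],
    "WebSocket": ["onopen", "onmessage", "WebSocket"],
    "JS encoding": ["encodeURIComponent", "encodeURI", "btoa", "escape"],
    "Hex table": ["0123456789ABCDEF", "0123456789abcdef"],
}


def scan_js(source: str) -> dict:
    """Return {category: [keywords found]} for every keyword present in source."""
    hits = {}
    for name, words in KEYWORDS.items():
        found = [w for w in words if w in source]
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    sample = "var a = 1732584193, b = -271733879; function hex_md5(s) {}"
    for name, found in scan_js(sample).items():
        print(name, found)
```

Plain substring matching is used on purpose: in real JS sources the constants often appear with a leading minus sign (signed 32-bit values), and a substring search still finds them.
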

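### II. Overview of the main encryption and decryption algorithms:

1. Symmetric encryption: encryption and decryption use the same key (AES, DES, 3DES)

2. Asymmetric encryption: encryption and decryption use different keys. There are usually two keys, a public key and a private key, and they must be used as a matched pair, otherwise the ciphertext cannot be decrypted (RSA, DSA, ECC)

3. Hash algorithms: also called hash functions. They are one-way and irreversible, so a digest cannot be decrypted (MD5, SHA1, HMAC)

4. Base64: an encoding rather than encryption, typically used to turn binary data into printable characters suitable for transmission. It is reversible. The encoded data is a string made up of A-Z, a-z, 0-9, + and /, 64 characters in total (26 + 26 + 10 + 1 + 1 = 64); strictly there are 65, because "=" is the padding character (HTTPS, HTTP + SSL layer)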
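
To illustrate that Base64 is reversible and uses "=" as padding, here is a short standard-library sketch in Python (the sample inputs are arbitrary):

```python
import base64

raw = b"123456"
encoded = base64.b64encode(raw)      # bytes -> Base64 text
decoded = base64.b64decode(encoded)  # Base64 text -> original bytes

print(encoded.decode())  # MTIzNDU2
print(decoded == raw)    # True

# Input whose length is not a multiple of 3 is padded with "="
print(base64.b64encode(b"a").decode())  # YQ==
```
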

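### III. Common formats of encrypted output:
1. MD5 digests are usually 16 or 32 hex characters (a 40-character digest is SHA1, not MD5)

MD5 of 123456 (the 16-character form starts with 49, the 32-character form with e10 or E10):
49BA59ABBE56E057 E10ADC3949BA59ABBE56E057F20F883E

2. SHA1 digests are 40 hex characters (64 and 128 hex characters indicate SHA-256 and SHA-512 instead)
SHA1 of 123456 (40 characters, starts with 7c):
7c4a8d09ca3762af61e59520943dc26494f8941b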
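
The reference digests above can be reproduced with Python's `hashlib`, which is handy for checking whether a parameter captured from a request is a plain MD5 or SHA1 of a known input. A minimal sketch:

```python
import hashlib

md5_32 = hashlib.md5(b"123456").hexdigest()
md5_16 = md5_32[8:24]  # the 16-character form is the middle of the 32-character digest
sha1_40 = hashlib.sha1(b"123456").hexdigest()

print(md5_32)   # e10adc3949ba59abbe56e057f20f883e
print(md5_16)   # 49ba59abbe56e057
print(sha1_40)  # 7c4a8d09ca3762af61e59520943dc26494f8941b
```
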

3. AES, where data is a string (if it is an object, convert it with JSON.stringify(data)):
var CryptoJS = require("crypto-js");
var data = 'my message';

The secret key:

var secret = 'secret key 123';
// Encrypt
var ciphertext = CryptoJS.AES.encrypt(data, secret).toString();
// Decrypt
var bytes = CryptoJS.AES.decrypt(ciphertext, 'secret key 123');
var originalText = bytes.toString(CryptoJS.enc.Utf8);
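
When porting a call like this to Python, it helps to know how CryptoJS treats a string secret. To my understanding it behaves like OpenSSL's passphrase mode: a 256-bit key and a 128-bit IV are derived from the passphrase and a random 8-byte salt with an MD5-based KDF (EVP_BytesToKey, one iteration), and the output is prefixed with `Salted__` plus the salt. The sketch below covers only that key derivation using the standard library. The function name is illustrative, and the AES step itself would need a third-party library, which is left out here.

```python
import base64
import hashlib
import os


def evp_bytes_to_key(passphrase: bytes, salt: bytes, key_len: int = 32, iv_len: int = 16):
    """OpenSSL-style EVP_BytesToKey with MD5 and a single iteration."""
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        # Each round hashes the previous block together with passphrase and salt
        block = hashlib.md5(block + passphrase + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]


if __name__ == "__main__":
    salt = os.urandom(8)
    key, iv = evp_bytes_to_key(b"secret key 123", salt)
    print(len(key), len(iv))  # 32 16
    # Ciphertexts in this format start with b"Salted__" + salt,
    # so their Base64 text begins with "U2FsdGVkX1".
    print(base64.b64encode(b"Salted__" + salt).decode()[:10])
```

A ciphertext that starts with `U2FsdGVkX1` is therefore a strong hint that a site uses this passphrase mode.
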
43 changes: 21 additions & 22 deletions RPC/README.md → AAA-反爬破解方案/RPC/README.md
@@ -1,22 +1,21 @@

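Prerequisite: the Tampermonkey userscript must be running in the web page!

Setting the port:

Edit sekiro.port=6001 in conf/config.properties to change the port, and keep it identical to the port in the Tampermonkey script

Steps before running:

1. Open the target website in the browser

2. Edit the userscript header in Tampermonkey, e.g. // @match https://www.taobao.com/*

3. Edit the JS code to be executed. The default is:

var result = document.cookie;

resolve(result);
4. Start the RPC server and leave it running: bin/sekiro.bat

5. Refresh the browser and check for a message such as sekiro: begin of connect to wsURL: xxx. If it appears, the connection succeeded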
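
A port mismatch between conf/config.properties and the userscript is an easy mistake to make in the steps above. The sketch below reads `sekiro.port` from properties text so it can be compared with the port used in the script. It assumes plain `key=value` lines; the helper name `read_property` and the sample values are hypothetical.

```python
def read_property(text: str, key: str, default=None):
    """Return the value for key from Java-style 'key=value' properties text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return default


if __name__ == "__main__":
    sample = "# sekiro config\nsekiro.port=6001\n"
    port = int(read_property(sample, "sekiro.port"))
    script_port = 6001  # port written in the Tampermonkey script (assumed value)
    print("ports match" if port == script_port else "port mismatch")
```
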

@@ -2,6 +2,8 @@

## Please do not send requests too frequently!!

Get the latest JS environment: https://github.com/requireCool/stealth.min.js?tab=readme-ov-file

Part of the code is shown below:

![image](https://github.com/wu50416/spider_projects/assets/103317042/c0bc5a70-4c57-438c-a81b-4dc5236c9c81)
6 changes: 0 additions & 6 deletions wbh_word/说明文档.txt

This file was deleted.

18 changes: 18 additions & 0 deletions 抖音/README.md
@@ -0,0 +1,18 @@
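## Douyin comment scraper

### Anti-scraping update notes:
2024-02-06:
1. When not logged in, the cookie now carries a new ttwid parameter; the site used to check s_v_web_id and now checks ttwid
2. Paging is not allowed when not logged in
3. After login the cookie carries sessionid and sessionid_ss, and X-Bogus is no longer required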
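
The notes above boil down to a choice between two cookie sets. A minimal sketch of that choice, where the helper name `build_cookies` is hypothetical and the behaviour described in the comments is taken from the notes rather than verified independently:

```python
def build_cookies(sessionid=None, sessionid_ss=None, ttwid=None):
    """Pick the cookie set for a request, following the update notes above."""
    if sessionid and sessionid_ss:
        # Logged in: paging works and X-Bogus is reported to be optional
        return {"sessionid": sessionid, "sessionid_ss": sessionid_ss}
    if ttwid:
        # Not logged in: only the first page is available
        return {"ttwid": ttwid}
    raise ValueError("need either sessionid + sessionid_ss or ttwid")


if __name__ == "__main__":
    print(build_cookies(sessionid="xxxx", sessionid_ss="xxxx"))
    print(build_cookies(ttwid="xxxx"))
```
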


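#### Usage:
#### Just fill in the video id and the number of pages to scrape (the total page count can also be derived from the response data with a small change)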
![img_1.png](img_1.png)

![img.png](img.png)
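
Paging works by converting a page number into a `cursor` offset. The sketch below shows that arithmetic and how a total page count could be derived once the total number of comments is known. The function names are illustrative, and where the total comes from in the response is an assumption that should be checked against the actual JSON:

```python
import math


def page_to_cursor(page: int, count: int = 20) -> int:
    """Pages start at 1; cursor is the zero-based index of the first comment."""
    return (page - 1) * count


def total_pages(total_comments: int, count: int = 20) -> int:
    """Number of requests needed to cover total_comments comments."""
    return math.ceil(total_comments / count)


if __name__ == "__main__":
    print(page_to_cursor(1))  # 0
    print(page_to_cursor(3))  # 40
    print(total_pages(95))    # 5
```
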




193 changes: 193 additions & 0 deletions 抖音/douyin_run.py
@@ -0,0 +1,193 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/1/17 18:53
# @Author : Harvey
# @File : run_new.py
import datetime
import json
from urllib.parse import urlencode
import requests
import execjs
import pandas as pd

'''
Anti-scraping update notes:
2024-02-06:
1. When not logged in, the cookie now carries a new ttwid parameter; the site used to check s_v_web_id and now checks ttwid
2. Paging is not allowed when not logged in
3. After login the cookie carries sessionid and sessionid_ss, and X-Bogus is no longer required
'''

def get_xb_data(urlform):
    filename_js = 'get_XBogus2.js'
    # The with block closes the file, so no explicit close() is needed
    with open(filename_js, mode='r') as f:
        pw_js = f.read()
    js1 = execjs.compile(pw_js)
    print('********* Generating X-Bogus *********')
    xb_data = js1.call('get_xb', urlform)  # compute the X-Bogus value
    print(xb_data)
    return xb_data


def get_headers():
    # Updated 2024-01-17: the site used to check s_v_web_id and now checks ttwid
    # Cookie when not logged in:
    # cookies = {
    #     "ttwid": "1%7CvSGoO5ZPPEgIIWpNoRr0YCmMrWzQACoN1hNhqxBsexQ%7C1705490555%7C9d5ecaae8cae4fc1ef856f7de5f31a991a4d9a459b738888c7525070d1e806d0",
    #     # "s_v_web_id": "verify_lqw9zg13_3aUMuPsb_dPTm_4QA0_8Msc_wKAv0zy96U0j",
    # }
    # Cookie after login:
    cookies = {
        "sessionid": "xxxxxxxxxxxx",
        "sessionid_ss": "xxxxxxxxxxxx",
    }
    headers = {
        "referer": "https://www.douyin.com/",
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    return cookies, headers


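def get_cursor(page):
    '''
    :param page: page number, starting from 1
    :return: cursor is the position of the first comment (starting at 0); count is the number of comments per request (max 50)
    '''
    count = 20
    cursor = (page - 1) * count  # e.g. page 1 -> (1 - 1) * 20 = 0
    return cursor, count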


def dispose_params(params_data):
    '''
    Append "X-Bogus" to the query string and build the request url
    :param params_data: dict of query parameters without X-Bogus
    :return: the full request url
    '''
    params_ = urlencode(params_data).replace('%3D', '=')  # restore any '=' that urlencode escaped inside values
    xb_data = get_xb_data(params_)
    params = params_ + '&X-Bogus=' + xb_data
    url = 'https://www.douyin.com/aweme/v1/web/comment/list/?' + params
    return url


def get_url(page, aweme_id):
    '''
    "aweme_id": "7284507907465481512",  # video id
    "cursor": "0",  # position of the first comment
    "count": "50",  # comments per request, at most 50
    msToken: fixed value
    "X-Bogus": generated from the query string without "X-Bogus"
    :return: the full request url, including X-Bogus
    '''
    cursor, count = get_cursor(page)
    params_data = {
        "device_platform": "webapp",
        "aid": "6383",
        "channel": "channel_pc_web",
        "aweme_id": aweme_id,
        "cursor": cursor,
        "count": count,
        "item_type": "0",
        "insert_ids": "",
        "whale_cut_token": "",
        "cut_version": "1",
        "rcFT": "",
        "pc_client_type": "1",
        "version_code": "170400",
        "version_name": "17.4.0",
        "cookie_enabled": "true",
        "screen_width": "1920",
        "screen_height": "1080",
        "browser_language": "zh-CN",
        "browser_platform": "Win32",
        "browser_name": "Chrome",
        "browser_version": "120.0.0.0",
        "browser_online": "true",
        "engine_name": "Blink",
        "engine_version": "120.0.0.0",
        "os_name": "Windows",
        "os_version": "10",
        "cpu_core_num": "6",
        "device_memory": "8",
        "platform": "PC",
        "downlink": "10",
        "effective_type": "4g",
        "round_trip_time": "50",
        "webid": "7325026101045102121",
        "msToken": "",
        # "X-Bogus": "DFSzswVLhOJANcLLti0PqvB9Piz9"
    }
    url = dispose_params(params_data)
    print(url)
    return url


def dispose_comments(comments, data_list):
    for comment_one in comments:
        # ============ User profile fields ===========
        nickname = comment_one['user']['nickname']  # user name
        user_id = comment_one['user']['short_id']  # user id
        signature = comment_one['user']['signature']  # user bio
        head_image = comment_one['user']['avatar_medium']['url_list'][0]  # medium-size avatar
        user_url_ = comment_one['user']['sec_uid']
        user_url = 'https://www.douyin.com/user/' + user_url_
        # ============ End of user profile fields ===========
        pinlun_cid = comment_one['cid']  # comment_id, needed later to expand the replies

        text_data = comment_one['text']  # comment text
        digg_count = comment_one['digg_count']  # like count
        reply_comment_total = comment_one['reply_comment_total']  # reply count
        create_time_str = comment_one['create_time']  # comment time as a Unix timestamp
        create_time = datetime.datetime.fromtimestamp(create_time_str)
        # ip_label may be missing, so fall back to a placeholder
        ip_label = comment_one.get('ip_label', "unknown ip")

        print(f"user name: {nickname} user id: {user_id} , ip location: {ip_label} , comment: {text_data} , "
              f"replies: {reply_comment_total} , likes: {digg_count} , comment time: {create_time}")
        data_one_dict = {"user id": user_id, "user name": nickname, "user url": user_url, "avatar": head_image, "ip location": ip_label,
                         "comment": text_data, "replies": reply_comment_total, "likes": digg_count, "bio": signature, "comment time": create_time}

        PL_image_bool = comment_one.get('image_list')  # whether the comment has an image
        if PL_image_bool:
            PL_image = PL_image_bool[0]['origin_url']['url_list'][0]
            data_one_dict['PL_image'] = PL_image
        data_list.append(data_one_dict)
    return data_list


def run():
    aweme_id_list = ["7306459845480254754"]
    for aweme_id in aweme_id_list:
        file_name = 'data/' + aweme_id + 'asda.xlsx'  # the data/ directory must already exist
        cookies, headers = get_headers()
        max_page = 5  # fetch 5 pages (5 * 20 = 100 comments at most)
        data_list = []
        for page in range(1, max_page + 1):
            print(f"============= Fetching page {page} =============")
            url = get_url(page, aweme_id)
            response = requests.get(url, headers=headers, cookies=cookies)
            print(response.text)
            result = response.json()  # parse the body once and reuse it
            if result.get("status_msg") == "blocked":
                break
            comments = result.get('comments')
            if comments:
                print(f"Got {len(comments)} comments on this page")
                data_list = dispose_comments(comments, data_list)
            else:
                break

        df_data = pd.DataFrame(data_list)  # list of dicts -> DataFrame
        print(df_data)
        df_data.to_excel(file_name, index=False)


if __name__ == '__main__':
    run()


