BERT4ETH

This is the repository for the code and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web Conference (WWW) 2023.

Here you can find our slides.

Getting Started

Requirements

  • Python >= 3.6.1
  • NumPy >= 1.12.1
  • TensorFlow >= 1.4.0

Preprocess dataset

Step 1: Download dataset from Google Drive.

Step 2: Unzip the dataset under the directory "BERT4ETH/Data/"

cd BERT4ETH/Data; # Labels are already included
unzip ...;

The unzipped dataset is large (more than 15 GB).

If you only want to run the basic BERT4ETH model, there is no need to download the ERC-20 log dataset.

Note that the advanced features (in/out separation and ERC-20 logs) make the model noticeably less efficient.

Step 3: Transaction Sequence Generation

cd Model/bert4eth;
python gen_seq.py --phisher=True \
                  --deanon=True \
                  --mev=True \
                  --dup=True \
                  --dataset=1M \
                  --bizdate=bert4eth_1M_min3_dup
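
Conceptually, this step groups raw transactions by account and orders each account's transactions chronologically, keeping only accounts with enough activity. A minimal sketch of the idea, with illustrative field names rather than the repo's actual schema:

from collections import defaultdict

def build_sequences(transactions, min_len=3):
    """Group transactions by account and sort each group chronologically.

    `transactions` is an iterable of (account, timestamp, tx) tuples
    (illustrative schema); accounts with fewer than `min_len`
    transactions are dropped.
    """
    seqs = defaultdict(list)
    for account, timestamp, tx in transactions:
        seqs[account].append((timestamp, tx))
    return {
        acct: [tx for _, tx in sorted(items, key=lambda p: p[0])]
        for acct, items in seqs.items()
        if len(items) >= min_len
    }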

Pre-training

Step 0: Model Configuration

The configuration file is "Model/BERT4ETH/bert_config.json":

{
  "attention_probs_dropout_prob": 0.2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.2,
  "hidden_size": 64,
  "initializer_range": 0.02,
  "intermediate_size": 64,
  "max_position_embeddings": 50,
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 3000000
}
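
The config describes a compact two-layer Transformer whose vocabulary covers up to 3,000,000 distinct addresses. It is plain JSON, so you can inspect it with standard tooling (assuming you run this from the repository root):

import json

# Load the model configuration shipped with the repo.
with open("Model/BERT4ETH/bert_config.json") as f:
    config = json.load(f)

print(config["num_hidden_layers"])  # 2 Transformer layers
print(config["hidden_size"])        # 64-dimensional hidden states
print(config["vocab_size"])         # up to 3,000,000 addresses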

Step 1: Pre-train Sequence Generation

python gen_pretrain_data.py --source_bizdate=bert4eth_1M_min3_dup \
                            --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                            --max_seq_length=100 \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8 \
                            --do_eval=False
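
Analogous to BERT's masked language modeling, each address sequence is duplicated dupe_factor times with a different random mask each time, and masked_lm_prob controls the fraction of positions hidden for prediction. A minimal sketch of this masking idea (not the repo's exact implementation):

import random

MASK = "[MASK]"

def mask_sequence(seq, masked_lm_prob=0.8, rng=random):
    """Mask each position with probability masked_lm_prob.

    Returns the masked sequence and the (position, original address)
    prediction targets.
    """
    masked, targets = [], []
    for i, addr in enumerate(seq):
        if rng.random() < masked_lm_prob:
            masked.append(MASK)
            targets.append((i, addr))
        else:
            masked.append(addr)
    return masked, targets

# dupe_factor=10: mask the same sequence ten times with different masks.
instances = [mask_sequence(["0xaa..", "0xbb..", "0xcc.."]) for _ in range(10)]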

Step 2: Pre-train BERT4ETH Model

python run_pretrain.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                       --max_seq_length=100 \
                       --checkpointDir=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --num_warmup_steps=100 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --init_seed=1234
                       
Parameter Description

bizdate                 The signature of this experiment run.
max_seq_length          The maximum sequence length of BERT4ETH.
masked_lm_prob          The probability of masking an address.
epoch                   Number of training epochs, default = 5.
batch_size              Batch size, default = 256.
learning_rate           Learning rate for the Adam optimizer, default = 1e-4.
num_train_steps         The maximum number of training steps, default = 1000000.
num_warmup_steps        The number of warm-up steps, default = 100.
save_checkpoints_steps  How often (in steps) to save a checkpoint, default = 8000.
neg_strategy            Strategy for negative sampling, default = zip; options: uniform, zip, freq (see the sketch after this table).
neg_share               Whether to enable the in-batch sharing strategy, default = True.
neg_sample_num          The number of negative samples per batch, default = 5000.
do_eval                 Whether to run evaluation during training, default = False.
checkpointDir           The directory in which to save checkpoints.
init_seed               The initial random seed, default = 1234.
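
The zip strategy draws negatives from a Zipf-like (rank-frequency) distribution over the address vocabulary, so frequently seen addresses are sampled more often than rare ones. A hedged sketch of such a sampler (the exact distribution used by the repo may differ):

import numpy as np

def zipf_negatives(vocab_size, num_samples, seed=1234):
    """Sample negative address ids with probability proportional to 1/rank,
    assuming ids are ordered from most to least frequent (an assumption)."""
    rng = np.random.default_rng(seed)
    probs = 1.0 / np.arange(1, vocab_size + 1)
    probs /= probs.sum()
    return rng.choice(vocab_size, size=num_samples, replace=True, p=probs)

# With neg_share=True, one pool of 5000 negatives is shared across the batch.
negatives = zipf_negatives(vocab_size=3_000_000, num_samples=5000)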

Step 3: Output Representation

python run_embed.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                    --init_checkpoint=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000/model_104000 \
                    --max_seq_length=100 \
                    --neg_sample_num=5000 \
                    --neg_strategy=zip \
                    --neg_share=True
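
This step exports one fixed vector per account; the downstream tasks below treat these vectors as frozen features. A hypothetical usage sketch, assuming the embeddings and their addresses were exported as NumPy arrays (the file names here are illustrative, not the script's actual output paths):

import numpy as np

# Illustrative file names; adapt them to the actual output of run_embed.py.
embeddings = np.load("embeddings.npy")                   # (num_accounts, 64)
addresses = np.load("addresses.npy", allow_pickle=True)  # (num_accounts,)

addr_to_vec = dict(zip(addresses, embeddings))
print(len(addr_to_vec), "account representations of dim", embeddings.shape[1])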

Testing on the account representations

Phishing Account Detection

cd BERT4ETH/Model;
python run_phishing_detection.py --algo=bert4eth \
                                 --model_index=XXX
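
Conceptually, this evaluation fits a lightweight classifier on the frozen account embeddings against the phishing labels. One plausible setup, sketched with scikit-learn and placeholder data (not the repo's actual evaluation code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholders: real inputs are the exported embeddings and phishing labels.
X = np.random.randn(1000, 64)            # account embeddings
y = np.random.randint(0, 2, size=1000)   # 1 = phishing, 0 = normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))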

De-anonymization (ENS dataset)

cd BERT4ETH/Model;
python run_dean_ENS.py --metric=euclidean \
                       --algo=bert4eth \
                       --model_index=XXX

De-anonymization (Tornado Cash)

cd BERT4ETH/Model;
python run_dean_Tornado.py --metric=euclidean \
                           --algo=bert4eth \
                           --model_index=XXX
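
Both de-anonymization tasks amount to nearest-neighbor retrieval: two addresses are predicted to belong to the same entity when their embeddings are close under the chosen metric (here --metric=euclidean). A minimal retrieval sketch with placeholder embeddings:

import numpy as np

def nearest_neighbors(queries, candidates, k=1):
    """Return the indices of the k candidates closest to each query
    under euclidean distance."""
    # Pairwise distance matrix of shape (num_queries, num_candidates).
    dists = np.linalg.norm(queries[:, None, :] - candidates[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

queries = np.random.randn(5, 64)       # placeholder query embeddings
candidates = np.random.randn(100, 64)  # placeholder candidate embeddings
print(nearest_neighbors(queries, candidates, k=3))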

Fine-tuning for phishing account detection

cd BERT4ETH/Model;
python gen_finetune_phisher_data.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                                    --source_bizdate=bert4eth_1M_min3_dup \
                                    --max_seq_length=100
cd BERT4ETH/Model/BERT4ETH;
python run_finetune_phisher.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                               --max_seq_length=100 \
                               --checkpointDir=tmp
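
Fine-tuning places a binary classification head on top of the pre-trained representation and trains end-to-end on the phishing labels. Since the repo targets TensorFlow 1.x, here is a minimal sketch of such a head (illustrative only, not the repo's actual graph):

import tensorflow as tf  # TensorFlow 1.x, as listed in Requirements

hidden_size = 64  # matches hidden_size in bert_config.json

# Pooled BERT4ETH output for each account and its phishing label.
pooled_output = tf.placeholder(tf.float32, [None, hidden_size])
labels = tf.placeholder(tf.float32, [None])  # 1 = phishing, 0 = normal

logits = tf.squeeze(tf.layers.dense(pooled_output, 1), axis=-1)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)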

Citation

If you find this repository useful, please give us a star and cite our paper : ) Thank you!

@article{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  journal={arXiv preprint arXiv:2303.18138},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.
