BERT4ETH

This is the repository for the code and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web Conference (WWW) 2023.

Here you can find our slides.

Getting Started

Requirements

  • Python >= 3.6.1
  • NumPy >= 1.12.1
  • TensorFlow >= 1.4.0

Preprocess dataset

Step 1: Download dataset from Google Drive.

Step 2: Unzip the dataset under the directory "BERT4ETH/Data/"

cd BERT4ETH/Data; # Labels are already included
unzip ...;

The unzipped dataset is large (more than 15 GB).

If you only want to run the basic BERT4ETH model, there is no need to download the ERC-20 log dataset.

Note that the advanced features (in/out separation and ERC-20 logs) make the model noticeably less efficient.

Step 3: Transaction Sequence Generation

cd Model/bert4eth;
python gen_seq.py --phisher=True \
                  --deanon=True \
                  --mev=True \
                  --dup=True \
                  --dataset=1M \
                  --bizdate=bert4eth_1M_min3_dup
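
Conceptually, this step groups raw transactions by account and orders each account's transactions chronologically, keeping only accounts with enough activity. A minimal sketch of the idea, with illustrative field names rather than the repo's actual schema:

from collections import defaultdict

def build_sequences(transactions, min_len=3):
    """Group transactions by account and sort each group chronologically.

    `transactions` is an iterable of (account, timestamp, tx) tuples
    (illustrative schema); accounts with fewer than `min_len`
    transactions are dropped.
    """
    seqs = defaultdict(list)
    for account, timestamp, tx in transactions:
        seqs[account].append((timestamp, tx))
    return {
        acct: [tx for _, tx in sorted(items, key=lambda p: p[0])]
        for acct, items in seqs.items()
        if len(items) >= min_len
    }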

Pre-training

Step 0: Model Configuration

The configuration file is "Model/BERT4ETH/bert_config.json":

{
  "attention_probs_dropout_prob": 0.2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.2,
  "hidden_size": 64,
  "initializer_range": 0.02,
  "intermediate_size": 64,
  "max_position_embeddings": 50,
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 3000000
}
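
The config describes a compact two-layer Transformer whose vocabulary covers up to 3,000,000 distinct addresses. It is plain JSON, so you can inspect it with standard tooling (assuming you run this from the repository root):

import json

# Load the model configuration shipped with the repo.
with open("Model/BERT4ETH/bert_config.json") as f:
    config = json.load(f)

print(config["num_hidden_layers"])  # 2 Transformer layers
print(config["hidden_size"])        # 64-dimensional hidden states
print(config["vocab_size"])         # up to 3,000,000 addresses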

Step 1: Pre-train Sequence Generation

python gen_pretrain_data.py --source_bizdate=bert4eth_1M_min3_dup \
                            --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                            --max_seq_length=100 \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8 \
                            --do_eval=False
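
Analogous to BERT's masked language modeling, each address sequence is duplicated dupe_factor times with a different random mask each time, and masked_lm_prob controls the fraction of positions hidden for prediction. A minimal sketch of this masking idea (not the repo's exact implementation):

import random

MASK = "[MASK]"

def mask_sequence(seq, masked_lm_prob=0.8, rng=random):
    """Mask each position with probability masked_lm_prob.

    Returns the masked sequence and the (position, original address)
    prediction targets.
    """
    masked, targets = [], []
    for i, addr in enumerate(seq):
        if rng.random() < masked_lm_prob:
            masked.append(MASK)
            targets.append((i, addr))
        else:
            masked.append(addr)
    return masked, targets

# dupe_factor=10: mask the same sequence ten times with different masks.
instances = [mask_sequence(["0xaa..", "0xbb..", "0xcc.."]) for _ in range(10)]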

Step 2: Pre-train BERT4ETH Model

python run_pretrain.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                       --max_seq_length=100 \
                       --checkpointDir=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --num_warmup_steps=100 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --init_seed=1234
                       
Parameter Description

bizdate                 The signature of this experiment run.
max_seq_length          The maximum sequence length of BERT4ETH.
masked_lm_prob          The probability of masking an address.
epoch                   Number of training epochs, default = 5.
batch_size              Batch size, default = 256.
learning_rate           Learning rate for the Adam optimizer, default = 1e-4.
num_train_steps         The maximum number of training steps, default = 1000000.
num_warmup_steps        The number of warm-up steps, default = 100.
save_checkpoints_steps  How often (in steps) to save a checkpoint, default = 8000.
neg_strategy            Strategy for negative sampling, default = zip; options: uniform, zip, freq (see the sketch after this table).
neg_share               Whether to enable the in-batch sharing strategy, default = True.
neg_sample_num          The number of negative samples per batch, default = 5000.
do_eval                 Whether to run evaluation during training, default = False.
checkpointDir           The directory in which to save checkpoints.
init_seed               The initial random seed, default = 1234.
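
The zip strategy draws negatives from a Zipf-like (rank-frequency) distribution over the address vocabulary, so frequently seen addresses are sampled more often than rare ones. A hedged sketch of such a sampler (the exact distribution used by the repo may differ):

import numpy as np

def zipf_negatives(vocab_size, num_samples, seed=1234):
    """Sample negative address ids with probability proportional to 1/rank,
    assuming ids are ordered from most to least frequent (an assumption)."""
    rng = np.random.default_rng(seed)
    probs = 1.0 / np.arange(1, vocab_size + 1)
    probs /= probs.sum()
    return rng.choice(vocab_size, size=num_samples, replace=True, p=probs)

# With neg_share=True, one pool of 5000 negatives is shared across the batch.
negatives = zipf_negatives(vocab_size=3_000_000, num_samples=5000)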

Step 3: Output Representation

python run_embed.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                    --init_checkpoint=bert4eth_1M_min3_dup_seq100_mask80_shared_zipfan5000/model_104000 \
                    --max_seq_length=100 \
                    --neg_sample_num=5000 \
                    --neg_strategy=zip \
                    --neg_share=True
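
This step exports one fixed vector per account; the downstream tasks below treat these vectors as frozen features. A hypothetical usage sketch, assuming the embeddings and their addresses were exported as NumPy arrays (the file names here are illustrative, not the script's actual output paths):

import numpy as np

# Illustrative file names; adapt them to the actual output of run_embed.py.
embeddings = np.load("embeddings.npy")                   # (num_accounts, 64)
addresses = np.load("addresses.npy", allow_pickle=True)  # (num_accounts,)

addr_to_vec = dict(zip(addresses, embeddings))
print(len(addr_to_vec), "account representations of dim", embeddings.shape[1])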

Testing on the account representations

Phishing Account Detection

cd BERT4ETH/Model;
python run_phishing_detection.py --algo=bert4eth \
                                 --model_index=XXX
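
Conceptually, this evaluation fits a lightweight classifier on the frozen account embeddings against the phishing labels. One plausible setup, sketched with scikit-learn and placeholder data (not the repo's actual evaluation code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholders: real inputs are the exported embeddings and phishing labels.
X = np.random.randn(1000, 64)            # account embeddings
y = np.random.randint(0, 2, size=1000)   # 1 = phishing, 0 = normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))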

De-anonymization (ENS dataset)

cd BERT4ETH/Model;
python run_dean_ENS.py --metric=euclidean \
                       --algo=bert4eth \
                       --model_index=XXX

De-anonymization (Tornado Cash)

cd BERT4ETH/Model;
python run_dean_Tornado.py --metric=euclidean \
                           --algo=bert4eth \
                           --model_index=XXX
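
Both de-anonymization tasks amount to nearest-neighbor retrieval: two addresses are predicted to belong to the same entity when their embeddings are close under the chosen metric (here --metric=euclidean). A minimal retrieval sketch with placeholder embeddings:

import numpy as np

def nearest_neighbors(queries, candidates, k=1):
    """Return the indices of the k candidates closest to each query
    under euclidean distance."""
    # Pairwise distance matrix of shape (num_queries, num_candidates).
    dists = np.linalg.norm(queries[:, None, :] - candidates[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

queries = np.random.randn(5, 64)       # placeholder query embeddings
candidates = np.random.randn(100, 64)  # placeholder candidate embeddings
print(nearest_neighbors(queries, candidates, k=3))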

Fine-tuning for phishing account detection

cd BERT4ETH/Model;
python gen_finetune_phisher_data.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                                    --source_bizdate=bert4eth_1M_min3_dup \
                                    --max_seq_length=100
cd BERT4ETH/Model/BERT4ETH;
python run_finetune_phisher.py --bizdate=bert4eth_1M_min3_dup_seq100_mask80 \
                               --max_seq_length=100 \
                               --checkpointDir=tmp
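
Fine-tuning places a binary classification head on top of the pre-trained representation and trains end-to-end on the phishing labels. Since the repo targets TensorFlow 1.x, here is a minimal sketch of such a head (illustrative only, not the repo's actual graph):

import tensorflow as tf  # TensorFlow 1.x, as listed in Requirements

hidden_size = 64  # matches hidden_size in bert_config.json

# Pooled BERT4ETH output for each account and its phishing label.
pooled_output = tf.placeholder(tf.float32, [None, hidden_size])
labels = tf.placeholder(tf.float32, [None])  # 1 = phishing, 0 = normal

logits = tf.squeeze(tf.layers.dense(pooled_output, 1), axis=-1)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)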

Citation

If you find this repository useful, please give us a star and cite our paper : ) Thank you!

@article{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  journal={arXiv preprint arXiv:2303.18138},
  year={2023}
}

Q&A

If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.
