This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

[DataCap Application] Commoncrawl(3/3) #2302

Open
1 of 2 tasks
nicelove666 opened this issue Jan 2, 2024 · 66 comments

Comments

@nicelove666

nicelove666 commented Jan 2, 2024

Data Owner Name

Commoncrawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

2.5PiB

Number of replicas to store

6

Weekly allocation of DataCap requested

1PiB
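The requested total follows from the single-copy size and the replica count given above. A minimal sketch of that arithmetic, assuming binary units (1 PiB = 1024 TiB) and a constant weekly rate; the variable names are illustrative only:

```python
# Figures are taken from the application form; names are illustrative.
TIB_PER_PIB = 1024

single_copy_pib = 2.5   # "Expected size of single dataset (one copy)"
replicas = 6            # "Number of replicas to store"
weekly_pib = 1.0        # "Weekly allocation of DataCap requested"

# Total DataCap is one copy multiplied by the number of replicas.
total_pib = single_copy_pib * replicas

# Time to use the full request if the weekly rate is sustained.
weeks_to_onboard = total_pib / weekly_pib

print(f"total DataCap: {total_pib} PiB ({total_pib * TIB_PER_PIB:.0f} TiB)")
print(f"weeks at requested rate: {weeks_to_onboard:.0f}")
```

This gives 15 PiB in total, which matches the amount requested, and roughly 15 weeks of onboarding at 1 PiB per week.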

On-chain address for first allocation

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

  • Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2287
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

China

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

We use a script to package the files, originally stored on an nginx file server, into tar files. Each tar file is kept to around 17-30G. Each tar file is then converted into a CAR file. After the conversion is completed, a record of the CAR file and the metadata of the source files are stored in our local system for later lookup.
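The packaging step described here could look roughly like the following. This is only a sketch under stated assumptions, not the applicant's actual script: the function names, the `batch-NNNNNN.tar` naming, and the `manifest.json` layout are all hypothetical, and the CAR conversion is left to a separate tool that is not shown.

```python
import json
import tarfile
from pathlib import Path

GIB = 1024 ** 3


def plan_batches(files, max_bytes):
    """Greedily group (path, size) pairs so each batch stays within max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_batches(source_dir, out_dir, max_bytes=30 * GIB):
    """Write one tar per batch and a manifest mapping tars to source files."""
    source_dir, out_dir = Path(source_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted((p, p.stat().st_size) for p in source_dir.rglob("*") if p.is_file())
    manifest = []
    for index, batch in enumerate(plan_batches(files, max_bytes)):
        tar_path = out_dir / f"batch-{index:06d}.tar"  # hypothetical naming
        with tarfile.open(tar_path, "w") as tar:
            for path, _ in batch:
                tar.add(path, arcname=str(path.relative_to(source_dir)))
        manifest.append({
            "tar": tar_path.name,
            "bytes": sum(size for _, size in batch),
            "files": [str(path.relative_to(source_dir)) for path, _ in batch],
        })
    # The manifest stands in for the "local system" record; the CAR file
    # reference would be added after conversion.
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The greedy grouping only enforces the upper bound (30G by default). Keeping every tar above the stated 17G lower bound would need an extra pass, for example merging a small final batch into its predecessor.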

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

This website has a lot of data. As far as I know, no one has systematically stored all of it on the Filecoin network.

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

@Sunnyiscoming
Collaborator

Please provide the ID, city, country, and organization of each SP here.

@nicelove666
Author

Provider | Location | SP Entity or Personal
f02199203 | Inner Mongolia | Richard
f02223170 | HK | tianyou
f02831201 | GuangDong | Juwu Mine
f02824157 | BeiJing | zhongchuangyun

These are the SPs we cooperate with. Around January 15th, we will add 5-7 SPs from Japan, Vietnam and Hong Kong. Once they are online, we will list them. Thank you.

@Sunnyiscoming
Collaborator

Hello, per filecoin-project/notary-governance#922 for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant, and please also add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by the Fil+ Governance team to confirm its validity, and then the application will be allowed to move forward for additional notary review.

@Sunnyiscoming
Collaborator

SP List provided:
[{"providerID":"f02199203","City":"InnerMongolia","Country":"China","SPOrg":"Richard"},
{"providerID":"f02223170","City":"HK","Country":"China","SPOrg":"tianyou"},
{"providerID":"f02831201","City":"GuangDong","Country":"China","SPOrg":"JuwuMine"},
{"providerID":"f02824157","City":"BeiJing","Country":"China","SPOrg":"zhongchuangyun"}]
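An SP list in this shape can be loaded and checked with a few lines of Python. The sketch below assumes each entry uses key/value pairs throughout (`"SPOrg":"Richard"`) and has no trailing comma, which strict JSON parsers require. The `validate_sp_list` helper and the `f0<digits>` provider ID pattern are assumptions for illustration, not part of any Fil+ tooling.

```python
import json
import re

SP_LIST_JSON = """
[{"providerID":"f02199203","City":"InnerMongolia","Country":"China","SPOrg":"Richard"},
 {"providerID":"f02223170","City":"HK","Country":"China","SPOrg":"tianyou"},
 {"providerID":"f02831201","City":"GuangDong","Country":"China","SPOrg":"JuwuMine"},
 {"providerID":"f02824157","City":"BeiJing","Country":"China","SPOrg":"zhongchuangyun"}]
"""

REQUIRED_KEYS = {"providerID", "City", "Country", "SPOrg"}
# Assumed shape of a provider ID as it appears in this thread: "f0" plus digits.
PROVIDER_ID = re.compile(r"^f0\d+$")


def validate_sp_list(text):
    """Parse the list and report entries with missing keys or odd provider IDs."""
    entries = json.loads(text)  # raises json.JSONDecodeError on malformed input
    problems = []
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing {sorted(missing)}")
        elif not PROVIDER_ID.match(entry["providerID"]):
            problems.append(f"entry {index}: unexpected providerID {entry['providerID']!r}")
    return entries, problems


entries, problems = validate_sp_list(SP_LIST_JSON)
print(len(entries), "entries;", len(problems), "problems")
```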

@nicelove666
Author

[Screenshot: WX20240109-112150@2x]

We submitted it, thank you.

@nicelove666
Author

nicelove666 commented Jan 11, 2024

https://www.ipqualityscore.com/user/search is a public, well-known and unbiased geolocation detection tool. I paid to check the SPs we cooperate with, and the results show that their address locations are real:

f02199203 116.136.130.130
f02824157 116.172.66.38
f02824140 116.172.66.38
f02841613 210.209.77.161
f02831202 14.29.124.50
f0122215 119.167.140.136

Detection method: find the IP address corresponding to the SP in Boost, enter that IP address, and view the detection results.
If an SP uses a VPN, the detection score may be greater than 70. A detection score of 0 means no fraud was detected, which indicates that the SP's address is honest.
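The check described above can be organized with a small helper. This sketch does not call the detection service. It parses the provider/IP pairs quoted in this comment, groups providers that share an IP address, and interprets a score using the two thresholds stated here (0 and above 70). The function names and the "inconclusive" label for scores in between are assumptions.

```python
from collections import defaultdict

# Provider ID / IP pairs as listed in the comment.
PAIRS = (
    "f02199203 116.136.130.130 f02824157 116.172.66.38 "
    "f02824140 116.172.66.38 f02841613 210.209.77.161 "
    "f02831202 14.29.124.50 f0122215 119.167.140.136"
)


def parse_pairs(text):
    """Turn 'id ip id ip ...' into a {provider_id: ip} mapping."""
    tokens = text.split()
    return dict(zip(tokens[0::2], tokens[1::2]))


def providers_by_ip(provider_ips):
    """Group provider IDs by IP address to spot shared endpoints."""
    grouped = defaultdict(list)
    for provider, ip in provider_ips.items():
        grouped[ip].append(provider)
    return dict(grouped)


def interpret_score(score):
    """Map a detection score to a label using the thresholds from the comment."""
    if score == 0:
        return "no fraud detected"
    if score > 70:
        return "possible VPN"
    return "inconclusive"  # assumption: the comment does not cover this range


provider_ips = parse_pairs(PAIRS)
shared = {ip: ids for ip, ids in providers_by_ip(provider_ips).items() if len(ids) > 1}
print(shared)
```

With the pairs above, the only shared address is 116.172.66.38, used by f02824157 and f02824140.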

@nicelove666
Author

nicelove666 commented Jan 11, 2024

[Screenshots: WX20240111-145600@2x, WX20240111-145455@2x]

@nicelove666
Author

nicelove666 commented Jan 11, 2024

[Screenshots: WX20240111-144512@2x, WX20240111-145015@2x, WX20240111-145537@2x]

@nicelove666
Author

Can you help us move forward? Thank you. @Sunnyiscoming

@nicelove666
Author

nicelove666 commented Jan 16, 2024

We applied two weeks ago, but the application still has not been approved. In the meantime, the SPs we cooperate with have changed, so here is the updated list:

f02199203 Richard Nei Mongol(Inner Mongolia) 116.136.130.130
WX20240116-160041@2x

f02824157 zhongchuangyun GuangDong 116.172.66.38
f02824140 zhongchuangyun GuangDong 116.172.66.38
2@2x

f02831202 Juwu Mine GuangDong 14.29.124.50
WX20240116-160408@2x

f0122215 SuSuanYun ShanDong 119.167.140.136
WX20240116-160454@2x

@nicelove666
Author

These results clearly show the address location of each SP. They demonstrate that the SPs we cooperate with are honest, and we hope to get your approval. @Sunnyiscoming @Filplus-govteam @galen-mcandrew @Kevin-FF-USA @clriesco

@nicelove666
Author

Please tell me what else I need to do. @Sunnyiscoming


Deleting comment

@Sunnyiscoming does not have permission to post this comment.

Please contact the assignee of this issue.

Sunnyiscoming self-assigned this Jan 17, 2024
@Sunnyiscoming
Collaborator

Datacap Request Trigger

Total DataCap requested

15PiB

Expected weekly DataCap usage rate

1PiB

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy


DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c
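The 512TiB figure can be sanity-checked against the totals in the request trigger. The sketch below assumes the first tranche is the lesser of 5% of the total request and 50% of the weekly rate. That rule is an assumption about the Fil+ large-dataset process and is not stated anywhere in this thread; the function name is hypothetical.

```python
TIB_PER_PIB = 1024


def first_tranche_tib(total_pib, weekly_pib):
    """Return the first allocation in TiB under the ASSUMED tranche rule:
    the lesser of 5% of the total request and 50% of the weekly rate."""
    return min(0.05 * total_pib, 0.5 * weekly_pib) * TIB_PER_PIB


# Figures from the request trigger: 15PiB total, 1PiB per week.
print(first_tranche_tib(total_pib=15, weekly_pib=1))
```

Under that assumption, 5% of 15 PiB is 768 TiB and 50% of 1 PiB is 512 TiB, so the smaller value is consistent with the 512TiB requested here.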


Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

@DaYouGroup

checker:manualTrigger


DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 41.42% of deals are for data replicated across fewer than 4 storage providers.

Deal Data Shared with other Clients [2]

✔️ No CID sharing has been observed.

Full report

Click here to view the CID Checker report.
Click here to view the Retrieval Dashboard.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
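A replication figure like the one in this report can be approximated from a list of deals. The sketch below is not the checker's implementation. It assumes each deal is a `(piece_cid, provider_id)` pair and counts deals rather than bytes, and the real report may weight by deal size. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def low_replication_share(deals, min_providers=4):
    """Percentage of deals whose piece is stored with fewer than
    min_providers distinct storage providers."""
    deals = list(deals)
    if not deals:
        return 0.0
    providers_per_piece = defaultdict(set)
    for piece, provider in deals:
        providers_per_piece[piece].add(provider)
    low = sum(1 for piece, _ in deals if len(providers_per_piece[piece]) < min_providers)
    return 100.0 * low / len(deals)


# Hypothetical sample: piece "A" reaches 4 providers, piece "B" only 2.
sample = [
    ("A", "sp1"), ("A", "sp2"), ("A", "sp3"), ("A", "sp4"),
    ("B", "sp1"), ("B", "sp2"),
]
print(f"{low_replication_share(sample):.2f}% of deals are under-replicated")
```

In the sample, 2 of 6 deals belong to a piece with fewer than 4 providers, so the share is about a third.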

@DaYouGroup

Willing to back this round based on past performance, but please pay attention to the "Deal Data Replication" issue.


Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebq57fww6jsnjqds2lsnchvfkr6rmpsvsvjlgxvfk2cbfdrxklutm

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

2.00PiB

Signer Address

f1nwjsd2mc6hu4qrwnmd6ukrfkuu4h5fhs7u3exii

Id

199294f7-50fd-438d-8155-aab522e5756a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebq57fww6jsnjqds2lsnchvfkr6rmpsvsvjlgxvfk2cbfdrxklutm

@nicelove666
Author

checker:manualTrigger


DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 34.15% of deals are for data replicated across fewer than 4 storage providers.

Deal Data Shared with other Clients [2]

✔️ No CID sharing has been observed.

Full report

Click here to view the CID Checker report.
Click here to view the Retrieval Dashboard.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...


DataCap Allocation requested

Request number 9

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

2PiB

Id

cb61f784-f39d-4919-a802-428497de7512

@nicelove666
Author

Waiting for v5 onboarding
