DataCite Commons: WuDaoCorpora Text

3.8K Views 3.1K Downloads

WuDaoCorpora Text is a large Chinese pretraining corpus constructed by the Beijing Academy of Artificial Intelligence (BAAI). The total data volume of the dataset has exceeded 5 TB, including 200 GB of open data. Compared with other pretraining corpora, WuDaoCorpora Text has the following advantages.
1) During data collection, we classify the quality of web pages according to the proportion of words in each page and the integrity of its DOM tree, and collect data only from high-quality pages to ensure the quality of the corpus.
2) Through data cooperation with other institutions and web page crawling, the dataset covers a wide range of Chinese text types, including news, comments, encyclopedias, forums, blogs, academic papers, etc.
3) The dataset uses more than 20 cleaning rules to obtain the final corpus from 100 TB of original web page data. In the cleaning process, special attention is paid to removing private information to avoid the risk of privacy disclosure.
4) The dataset contains 50+ data tags, such as education and law, which make it convenient for users to extract domain-specific data for model training in that field.
Please obey the following agreement if you use our dataset: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
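As an illustration of how the data tags might be used to extract domain-specific text, here is a minimal sketch. It assumes each record is a JSON object whose tag is stored in a field named `dataType` and whose text sits in `title` and `content`; these field names, the English tag values, and the helper `filter_by_tag` are hypothetical and should be checked against the files actually distributed with the dataset.

```python
import json


def filter_by_tag(records, wanted_tags, tag_field="dataType"):
    """Yield only the records whose tag is one of `wanted_tags`.

    `tag_field` is an assumed field name, not confirmed by the dataset record.
    """
    wanted = set(wanted_tags)
    for record in records:
        if record.get(tag_field) in wanted:
            yield record


# Tiny in-memory stand-in for one corpus file (hypothetical records and tags).
SAMPLE_JSON = """
[
  {"dataType": "education", "title": "Sample A", "content": "..."},
  {"dataType": "law", "title": "Sample B", "content": "..."},
  {"dataType": "news", "title": "Sample C", "content": "..."}
]
"""

records = json.loads(SAMPLE_JSON)
selected = list(filter_by_tag(records, {"education", "law"}))
selected_titles = [r["title"] for r in selected]
```

With real data, `records` would instead be loaded from a downloaded corpus file, and the tag values would be whatever labels the dataset actually uses.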

Version V1 of Dataset published 2022 in ScienceDB

Dataset · Chinese

https://doi.org/10.57760/sciencedb.o00126.00004

3,830 views reported since publication in 2022.

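Zhao Xue	Beijing Academy of Artificial Intelligence
Hanyu Zhao	Beijing Academy of Artificial Intelligence
Sha Yuan	Beijing Academy of Artificial Intelligence
Yequan Wang	Beijing Academy of Artificial Intelligence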
