WuDaoCorpora Text is a large-scale Chinese pretraining corpus constructed by the Beijing Academy of Artificial Intelligence (BAAI). The total data volume of the dataset exceeds 5 TB, of which 200 GB is released as open data.

Compared with other pretraining corpora, WuDaoCorpora Text has the following advantages.

1) During data collection, we classify the quality of web pages according to the proportion of words in each page and the integrity of its DOM tree, and we collect data only from high-quality pages to ensure corpus quality.
2) Through data cooperation with other institutions and web page crawling, the dataset covers a wide range of Chinese text types, including news, comments, encyclopedias, forums, blogs, academic papers, etc.
3) The dataset applies more than 20 cleaning rules to obtain the final corpus from 100 TB of original web page data. In the cleaning process, special attention is paid to removing private information to avoid the risk of privacy disclosure.
4) The dataset contains 50+ data tags, such as education and law, which make it convenient for users to extract domain-specific data for training models in that field.

Please obey the following agreement if you use our dataset: https://data.baai.ac.cn/resources/agreement/BAAIDataAgreement.pdf
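The tag-based extraction of domain-specific data mentioned above could look roughly like the following sketch. It is illustrative only: the description does not specify the file layout, so the code assumes one JSON object per line, a hypothetical tag field named `dataType`, and a hypothetical text field named `content`. Adjust these names to match the files you actually download.

```python
import json
from typing import Iterable, Iterator


def filter_by_tag(
    lines: Iterable[str],
    tag: str,
    tag_field: str = "dataType",  # assumed field name, not confirmed by the dataset page
) -> Iterator[dict]:
    """Yield the records whose tag field equals `tag`.

    `lines` is any iterable of JSON strings, for example an open file
    that stores one record per line (an assumed layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(tag_field) == tag:
            yield record


# Tiny in-memory sample standing in for a downloaded file (made-up records).
sample_lines = [
    '{"dataType": "education", "content": "text A"}',
    '{"dataType": "laws", "content": "text B"}',
    "",
    '{"dataType": "education", "content": "text C"}',
]

education_records = list(filter_by_tag(sample_lines, "education"))
print(len(education_records))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample_lines`, so a large file is streamed rather than loaded into memory at once.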
3,830 views reported since publication in 2022.