Paper notes: LSHTC: A Benchmark for Large-Scale Text Classification (2015)

2022-05-05

For more about LSHTC, see the official website.

Contents

title
abstract
dataset
LSHTC dataset overview
LSHTC1
LSHTC2
LSHTC3 & LSHTC4
Evaluation methods
References (papers on the algorithms mentioned in the paper)

title

LSHTC: A Benchmark for Large-Scale Text Classification

abstract

LSHTC is a series of challenges which aims to assess the performance of classification systems in large-scale classification with a large number of classes (up to hundreds of thousands). This paper describes the datasets that have been released along the LSHTC series. The paper details the construction of the datasets and the design of the tracks as well as the evaluation measures that we implemented and a quick overview of the results. All of these datasets are available online and runs may still be submitted on the online server of the challenges.

dataset

1 http://www.bioasq.org
2 http://www.image-net.org/challenges/LSVRC/2014/
3 http://research.microsoft.com/en-us/um/people/manik/events/xc13/
4 http://lshtc.iit.demokritos.gr/WSDM_WS
5 http://lshtc.iit.demokritos.gr/
6 http://dbpedia.org/About
7 http://www.dmoz.org/

LSHTC dataset overview

LSHTC1

The tracks of the first year of the challenge were based on the DMOZ dataset (tree hierarchy) using only single-label instances. The challenge was split into 4 tracks, which were composed of different combinations of Content and Description vectors. Since both types of vectors were used in this challenge, only the intersection of the two sets of instances was used (we used only instances which had both a Content and a Description vector).
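The instance selection described for LSHTC1 amounts to a set intersection over document identifiers. A minimal sketch, assuming each vector set is a dict keyed by a hypothetical document ID and each sparse vector is a {feature_id: count} dict; the function name and the toy data are illustrative and do not reflect the actual LSHTC file layout:

```python
def intersect_instances(content, description):
    """Keep only documents that have both a Content and a Description vector."""
    # dict key views support set operations, so & yields the shared IDs.
    shared_ids = sorted(content.keys() & description.keys())
    return {doc_id: (content[doc_id], description[doc_id]) for doc_id in shared_ids}


# Hypothetical toy data: sparse vectors as {feature_id: count} dicts.
content_vectors = {"d1": {3: 2, 7: 1}, "d2": {5: 4}, "d3": {1: 1}}
description_vectors = {"d2": {5: 1}, "d3": {2: 3}, "d4": {9: 1}}

paired = intersect_instances(content_vectors, description_vectors)
print(sorted(paired))  # ['d2', 'd3']
```

Documents d1 and d4 are dropped because each has only one of the two vector types.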

LSHTC2

During LSHTC2, we used multi-label instances and added non-tree hierarchies. Instead of using, for DMOZ, the intersection between the instances of Content and Description vectors, we decided to keep one of them. We kept the Content vectors, since they did not require a human annotator in order to be created. Since we decided to move to multi-label classification, we used all the Content vectors that we had.

LSHTC3 & LSHTC4

The two DBpedia datasets were also used, as Track 1, during the third iteration of the LSHTC challenges (LSHTC3). The only addition was regarding the Medium DBpedia dataset, where we also provided the original text of the instances, without being pre-processed. During LSHTC4, only the Large DBpedia dataset was used for the first track, called "Very Large Supervised Learning", which was evaluated at Kaggle (http://www.kaggle.com/).

Evaluation methods

During the classification tracks of all LSHTC challenges, we used two types of measures in order to evaluate the participating systems, flat and hierarchical.
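To make the flat versus hierarchical distinction concrete, here is a minimal single-label sketch. These notes do not list which specific measures the paper used, so the choices below are assumptions for illustration: accuracy and macro-averaged F1 stand in for flat measures, and the mean tree distance between the true and the predicted class stands in for a hierarchical measure. The toy hierarchy and all function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Flat measure: fraction of exactly correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Flat measure: per-class F1, averaged with equal weight per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    labels = set(y_true) | set(y_pred)
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges between two nodes via their lowest common ancestor."""
    steps_from_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for steps_from_a, n in enumerate(path_to_root(a, parent)):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    raise ValueError("nodes are not in the same tree")


def mean_tree_error(y_true, y_pred, parent):
    """Hierarchical measure: average tree distance between truth and prediction."""
    total = sum(tree_distance(t, p, parent) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Hypothetical toy hierarchy given as child -> parent.
parent = {"sports": "root", "arts": "root",
          "soccer": "sports", "tennis": "sports", "music": "arts"}
y_true = ["soccer", "tennis", "music", "soccer"]
y_pred = ["soccer", "soccer", "soccer", "music"]

print(accuracy(y_true, y_pred))                # 0.25
print(round(macro_f1(y_true, y_pred), 4))      # 0.1333
print(mean_tree_error(y_true, y_pred, parent)) # 2.5
```

The flat measures treat every mistake the same way, while the tree distance penalizes confusing tennis with soccer (siblings, distance 2) less than confusing music with soccer (different branches, distance 4).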

Best results:

References (papers on the algorithms mentioned in the paper)

[1] Christophe Brouard. Echo at the LSHTC Pascal Challenge 2. PASCAL Workshop on Large-Scale Hierarchical Classification, ECML/PKDD 2011, pages 49-57, 2011.
[2] Xiaogang Han, Shaohua Li, and Zhiqi Shen. A k-NN method for large scale hierarchical text classification at LSHTC3. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[3] Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and Ion Androutsopoulos. Evaluation measures for hierarchical classification: a unified view and novel approaches. Data Mining and Knowledge Discovery, pages 1-46, 2014.
[4] Dong-Hyun Lee. Multi-stage Rocchio classification for large-scale multilabeled text data. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[5] Xiao-Lin Wang, Hai Zhao, and Bao-Liang Lu. A meta-top-down method for large-scale hierarchical classification. Knowledge and Data Engineering, IEEE Transactions on, 26(3):500-513, March 2014.
[6] Omid Madani and Jian Huang. Large-scale many-class prediction via flat techniques. In Large-Scale Hierarchical Classification Workshop of ECIR, 2010.
[7] Youdong Miao and Xipeng Qiu. Hierarchical centroid-based classifier for large scale text classification. Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge, 18, 2009.
[8] Antti Puurula and Albert Bifet. Ensembles of sparse multinomial classifiers for scalable text classification. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[9] Yutaka Sasaki and Davy Weissenbacher. TTI's system for the LSHTC3 challenge. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[10] Grigorios Tsoumakas and Ioannis Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In Machine Learning: ECML 2007, volume 4701 of Lecture Notes in Computer Science, pages 406-417. 2007.
[11] Xiao-Lin Wang, Hai Zhao, and Bao-Liang Lu. Enhance k-nearest neighbour algorithm for large-scale multi-labeled hierarchical classification. PASCAL Workshop on Large-Scale Hierarchical Classification, ECML/PKDD 2011, pages 58-67, 2011.
[12] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 42-49. ACM Press, 1999.
