LSHTC: A Benchmark for Large-Scale Text Classification


LSHTC is a series of challenges that aims to assess the performance of classification systems in large-scale settings with a large number of classes (up to hundreds of thousands). This paper describes the datasets that have been released over the course of the LSHTC series. It details the construction of the datasets and the design of the tracks, as well as the evaluation measures that we implemented, and gives a quick overview of the results. All of these datasets are available online, and runs may still be submitted on the online server of the challenges.


The tracks of the first year of the challenge were based on the DMOZ dataset (tree hierarchy) and used only single-label instances. The challenge was split into 4 tracks, each composed of a different combination of Content and Description vectors. Since both types of vectors were used, only the intersection of the two sets of instances was kept, that is, only instances that had both a Content and a Description vector.
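The intersection step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each vector collection is a dictionary keyed by instance id; the function name `instances_with_both` and the toy sparse vectors are hypothetical and do not reflect the actual format of the released files.

```python
def instances_with_both(content, description):
    """Keep only the instance ids present in both vector collections."""
    # dict.keys() views support set intersection directly.
    shared_ids = content.keys() & description.keys()
    return {i: (content[i], description[i]) for i in sorted(shared_ids)}


# Toy data (hypothetical): instance id -> sparse feature vector.
content = {1: {"w3": 2}, 2: {"w5": 1}, 3: {"w1": 4}}
description = {2: {"w9": 1}, 3: {"w2": 1}, 4: {"w7": 3}}

both = instances_with_both(content, description)
print(sorted(both))
```

Instances 1 and 4 are dropped here because each has only one of the two vector types.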


During LSHTC2, we used multi-label instances and added non-tree hierarchies. For DMOZ, instead of using the intersection between the instances with Content and Description vectors, we decided to keep only one of the two vector types. We kept the Content vectors, since they did not require a human annotator in order to be created. Since we decided to move to multi-label classification, we used all the Content vectors that we had.


The two DBpedia datasets were also used, as Track 1, during the third iteration of the LSHTC challenges (LSHTC3). The only addition concerned the Medium DBpedia dataset, where we also provided the original text of the instances, without pre-processing. During LSHTC4, only the Large DBpedia dataset was used, for the first track, called "Very Large Supervised Learning", which was evaluated at Kaggle (http://www.kaggle.com/).


During the classification tracks of all LSHTC challenges, we used two types of measures to evaluate the participating systems: flat and hierarchical.
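To make the distinction between the two families concrete, the sketch below contrasts a flat measure (plain accuracy) with one common style of hierarchical measure, in which each label is expanded with its ancestors before precision and recall are computed. This is an illustration under stated assumptions, not necessarily the exact measures used in the challenges: it assumes single-label predictions and a tree hierarchy given as a child-to-parent dictionary, and the category names are invented.

```python
def ancestors(label, parent):
    """Return the label together with all of its ancestors in the tree."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent.get(label)
    return out


def flat_accuracy(true, pred):
    """Flat measure: a prediction is either exactly right or wrong."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def hierarchical_f1(true, pred, parent):
    """Hierarchical measure: give partial credit for shared ancestors."""
    overlap = n_true = n_pred = 0
    for t, p in zip(true, pred):
        anc_t, anc_p = ancestors(t, parent), ancestors(p, parent)
        overlap += len(anc_t & anc_p)
        n_true += len(anc_t)
        n_pred += len(anc_p)
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_true
    return 2 * precision * recall / (precision + recall)


# Toy hierarchy (hypothetical): child -> parent, top-level nodes map to None.
parent = {"arts": None, "science": None,
          "music": "arts", "film": "arts", "physics": "science"}
true = ["music", "physics"]
pred = ["film", "physics"]

flat = flat_accuracy(true, pred)
hier = hierarchical_f1(true, pred, parent)
print(flat, hier)
```

In this toy case the flat measure counts the "film" prediction as a complete error, while the hierarchical measure gives it partial credit because "film" and "music" share the parent "arts".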



