Hamshahri Corpus

  1. Version 1.0

  2. Version 2.0

  3. See also

  4. References

  5. External links

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the DBRG group[1] of the University of Tehran. Later, a team headed by Ale Ahmad[2] built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

This corpus was created by crawling online news articles from Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.
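The HTML processing step can be illustrated with a minimal sketch. It is not the pipeline the corpus authors used, which the article does not describe; it only shows one common way to reduce a crawled page to plain text with Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML page, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real news crawler would also need to remove navigation menus, advertisements, and other page furniture, which requires site-specific rules that this sketch does not attempt.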

Version 1.0

The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. The size of the documents varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.

The corpus is available in several formats for download:[2]

  • Tagged Text: 560 MB
  • In SQL Server 2000 Tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:

  • More News: 323,616 Text Stories in 3,206 XML files (a file for each day)
  • Increased Time Span: From 22 June 1996 to 13 May 2007
  • Bigger in Size: 1.42 GB uncompressed
  • Standard Container: Unicode XML
  • Included Images: images have been extracted from the news and preserved (available in an additional package), which makes it suitable for Image Retrieval tasks.
  • Categorized News: the news stories have been categorized semi-automatically (appropriate for Text Categorization and Classification tasks).

The corpus is available for download in XML format.
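Because Version 2.0 ships as one XML file per day with categorized stories, a reader might tally stories per category as sketched below. The article does not document the XML schema, so the element names `DOC`, `CAT`, and `TEXT` are assumptions exposed as parameters, and the functions `iter_stories` and `count_by_category` are hypothetical. Check the tag names against the actual files before use.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def iter_stories(xml_text, doc_tag="DOC", cat_tag="CAT", text_tag="TEXT"):
    """Yield (category, body) pairs from one day's XML file.

    The default tag names are assumptions, not a documented schema.
    """
    root = ET.fromstring(xml_text)
    for doc in root.iter(doc_tag):
        category = (doc.findtext(cat_tag) or "").strip()
        text_elem = doc.find(text_tag)
        # Join nested text nodes in case the body contains child elements.
        body = "".join(text_elem.itertext()).strip() if text_elem is not None else ""
        yield category, body


def count_by_category(xml_text, **tag_names):
    """Count the stories in one file, grouped by category label."""
    return Counter(category for category, _ in iter_stories(xml_text, **tag_names))
```

For the full collection, the same function could be applied to each daily file in turn and the resulting counters summed.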

See also

  • Bijankhan Corpus
  • Persian Today Corpus
  • Text corpus
  • Information Retrieval

References

1. DBRG News Database Research Group
2. Hamshahri Database Research Group

External links

  • Hamshahri Corpus Homepage
  • irBlogs Collection Homepage
