Heritrix

  1. Projects using Heritrix

  2. Arc files

      Tools for processing Arc files  

  3. Command-line tools

  4. See also

  5. References

  6. External links

{{Infobox software
| name = Heritrix
| logo = Heritrix logo.png
| screenshot = Heritrix-screenshot.png
| screenshot size = 250px
| caption = Screenshot of Heritrix Admin Console.
| developer =
| latest_release_version = 3.2.0
| latest_release_date = {{release date|2014|01|10}}
| operating_system = Linux/Unix-like/Windows (unsupported)
| programming_language = Java
| genre = Web crawler
| license = Apache License
| website = {{URL |crawler.archive.org }}
}}

Heritrix is a web crawler designed for web archiving. It was written in Java by the Internet Archive and is available under a free software license. Its main interface is accessed through a web browser, and a command-line tool can optionally be used to initiate crawls.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved by employees of the Internet Archive and other interested parties.

Heritrix was not the main crawler used to crawl content for the Internet Archive's web collection for many years.[1] The largest contributor to the collection, as of 2011, is Alexa Internet.[1] Alexa crawls the web for its own purposes,[1] using a crawler named ia_archiver. Alexa then donates the material to the Internet Archive.[1] The Internet Archive itself did some of its own crawling using Heritrix, but only on a smaller scale.[1]

Starting in 2008, the Internet Archive began making performance improvements so that it could do its own wide-scale crawling, and it now collects most of its content itself.[2]

Projects using Heritrix

A number of organizations and national libraries use Heritrix, among them:

  • Austrian National Library, Web Archiving
  • Bibliotheca Alexandrina's Internet Archive
  • Bibliothèque nationale de France
  • British Library
  • California Digital Library's Web Archiving Service
  • CiteSeerX
  • Documenting Internet2
  • Internet Memory Foundation
  • Library and Archives Canada
  • Library of Congress[3]
  • National and University Library of Iceland
  • National Library of Finland
  • National Library of New Zealand
  • National Library of the Netherlands (Koninklijke Bibliotheek)[4]
  • Netarkivet.dk
  • Smithsonian Institution Archives
  • National Library of Israel

Arc files

By default, older versions of Heritrix stored the web resources they crawled in an Arc file. This file format is wholly unrelated to the ARC compression and archiving file format.

The Internet Archive has used the Arc format since 1996 to store its web archives. More recent versions of Heritrix save by default in the WARC file format, which is similar to Arc but more precisely specified and more flexible. Heritrix can also be configured to store files in a directory layout similar to that of the Wget crawler, which uses the URL to name the directory and filename of each resource.
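As a rough illustration of a URL-derived directory layout, the sketch below maps a URL to a relative path made of the host name followed by the URL path. The function name `mirror_path`, the `index.html` default, and the `unknown-host` fallback are assumptions made for this example. They are not taken from Heritrix or Wget, whose actual naming rules are more involved.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def mirror_path(url: str) -> PurePosixPath:
    """Map a URL to a relative host/path location (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: directory-style URLs are stored under a default file name.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # The port and query string are ignored here, and the path is not
    # sanitized; a real mirror writer would have to handle both.
    host = parts.hostname or "unknown-host"
    return PurePosixPath(host) / path.lstrip("/")
```

Under these assumptions, the example URL `http://foo.edu:80/hello.html` would be stored at `foo.edu/hello.html`.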

An Arc file stores multiple archived resources in a single file in order to avoid managing a large number of small files. The file consists of a sequence of URL records, each with a header containing metadata about how the resource was requested, followed by the HTTP header and the response. Arc files range between 100 and 600 MB in size.

Example:

filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
HTTP/1.1 200 OK
Date: Thu, 22 Jun 2006 19:01:15 GMT
Server: Apache
Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
Content-Length: 30
Content-Type: text/html

Hello World!!!
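The URL record header in the example is a single space-separated line whose fields are named by the "URL IP-address Archive-date Content-type Archive-length" line. A minimal Python sketch of parsing such a line is shown below. The class and function names are hypothetical, and the sketch assumes exactly the five-field layout of the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int  # number of bytes in the record body


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one URL record header line of the five-field form shown above."""
    # Split from the right so that the URL field is kept intact.
    url, ip, date, ctype, length = line.rstrip("\r\n").rsplit(" ", 4)
    return ArcRecordHeader(
        url=url,
        ip_address=ip,
        # Archive-date is a 14-digit timestamp: YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=ctype,
        archive_length=int(length),
    )
```

Applied to the header line of the hello.html record, this yields an archive length of 187 and an archive date of 4 November 1996, 14:21:03.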

Tools for processing Arc files

Heritrix includes a command-line tool called arcreader, which can be used to extract the contents of an Arc file. Run against an Arc file, it can list all the URLs and metadata stored in that file (in [https://archive.org/web/researcher/cdx_legend.php CDX] format).

It can also extract a single record given the offset at which the record starts. In the above example, the hello.html record starts at offset 140.
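The offset of 140 follows from the record layout in the example: the first header line occupies 63 bytes including its newline, its body is 76 bytes long, and one newline separates records. The sketch below walks an Arc file held in memory and yields each record with its starting offset. It is not how arcreader is implemented. It assumes an uncompressed file whose header lines end in the body length, and the function name is hypothetical.

```python
from typing import Iterator, Tuple


def iter_arc_records(data: bytes) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (offset, header_line, body) for each record in uncompressed Arc data."""
    pos = 0
    while pos < len(data):
        # Skip the newline that separates consecutive records.
        if data[pos:pos + 1] == b"\n":
            pos += 1
            continue
        start = pos
        end = data.index(b"\n", pos)  # end of the header line
        header = data[start:end].decode("utf-8", errors="replace")
        # The last header field is the body length in bytes.
        length = int(header.rsplit(" ", 1)[1])
        body_start = end + 1
        yield start, header, data[body_start:body_start + length]
        pos = body_start + length
```

To extract a single resource, a caller could iterate until the yielded offset matches the one it is looking for and then write out that record's body.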

Other tools:

  • [https://web.archive.org/web/20060111160619/http://wiki.lib.umn.edu/DI2/HowToCrawl Arc processing tools]
  • WERA (Web ARchive Access)

Command-line tools

Heritrix comes with several command-line tools:

  • htmlextractor - displays the links Heritrix would extract for a given URL
  • hoppath.pl - recreates the hop path (path of links) to the specified URL from a completed crawl
  • manifest_bundle.pl - bundles up all resources referenced by a crawl manifest file into an uncompressed or compressed tarball
  • cmdline-jmxclient - enables command-line control of Heritrix
  • arcreader - extracts contents of ARC files (see above)

Further tools are available as part of the Internet Archive's warctools project.[5]

See also

  • Internet Archive
  • National Digital Information Infrastructure and Preservation Program
  • Web crawler

References

{{CCBYSASource
| sourcepath = http://webmasters.stackexchange.com/a/690/21219
| sourcearticle = Re: Control over the Internet Archive besides just “Disallow /”?
| revision = 531730721}}
1. ^ Kris (September 6, 2011). "Re: Control over the Internet Archive besides just "Disallow /"?". Pro Webmasters Stack Exchange. Stack Exchange, Inc. http://webmasters.stackexchange.com/a/690/21219. Retrieved January 7, 2013.
2. ^ "Wayback Machine: Now with 240,000,000,000 URLs - Internet Archive Blogs". blog.archive.org. http://blog.archive.org/2013/01/09/updated-wayback. Retrieved 11 September 2017.
3. ^ "About - Web Archiving (Library of Congress)". www.loc.gov. https://www.loc.gov/webarchiving/technical.html. Retrieved 2017-10-29.
4. ^ "Technische aspecten bij webarchivering - Koninklijke Bibliotheek". www.kb.nl. http://www.kb.nl/organisatie/onderzoek-expertise/e-depot-duurzame-opslag/webarchivering/technische-aspecten-bij-webarchivering. Retrieved 11 September 2017.
5. ^ "warctools". 25 August 2017. https://github.com/internetarchive/warctools. Retrieved 11 September 2017 via GitHub.
  1. Burner, M. (1997). "Crawling towards eternity – building an archive of the World Wide Web". Web Techniques 2 (5). http://www.webtechniques.com/archives/1997/05/burner/. Archived at https://web.archive.org/web/20080101070319/http://www.webtechniques.com/archives/1997/05/burner/ on January 1, 2008.
  2. Mohr, G., Kimpton, M., Stack, M., Ranitovic, I. (2004). "Introduction to Heritrix, an archival quality web crawler". Proceedings of the 4th International Web Archiving Workshop (IWAW’04). http://www.iwaw.net/04/Mohr.pdf
  3. Sigurðsson, K. (2005). "Incremental crawling with Heritrix". Proceedings of the 5th International Web Archiving Workshop (IWAW’05). http://www.iwaw.net/05/papers/iwaw05-sigurdsson.pdf

External links

Tools by Internet Archive:

  • [https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Heritrix - official wiki]
  • NutchWAX - search web archive collections
  • Wayback (Open source Wayback Machine) - search and navigate web archive collections using NutchWax

Links to related tools:

  • [https://archive.org/web/researcher/ArcFileFormat.php Arc file format]
  • How to run Heritrix in Windows
  • WERA (Web ARchive Access) - search and navigate web archive collections using NutchWAX