Apache Nutch
{{Infobox software
| name = Apache Nutch
| logo =
| screenshot = NutchScreenshot.png
| screenshot size = 250px
| caption = Nutch Web Interface Search
| collapsible = yes
| developer = Apache Software Foundation
| latest release version = 1.14 and 2.3.1
| latest release date = {{release date|2017|12|23}}
| status = Active
| programming language = Java
| operating system = Cross-platform
| genre = Web Crawler
| license = Apache License 2.0
| website = {{url|//nutch.apache.org}}
| Documentation = {{url|https://wiki.apache.org/nutch/FrontPage#What_is_Apache_Nutch.3F}}
}}

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.

The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]

In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.[2]

While it was once a goal of the Nutch project to release a global large-scale web search engine, that is no longer the case.{{Citation needed|date=October 2015}}

Release history
Advantages

Nutch has the following advantages over a simple fetcher:[6]
Scalability

IBM Research studied the performance[7] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[8] It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used in, for example, TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[9]

Related projects
Search engines built with Nutch
See also

{{Portal|Free and open-source software}}
References

1. Nutch News
2. {{Cite web |title=Common Crawl’s Move to Nutch – Common Crawl – Blog |url=http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/ |website=blog.commoncrawl.org |accessdate=2015-10-14}}
3. {{cite web |url=http://nutch.apache.org/#22-january-2015-nutch-23-release |title=Nutch 2.3 Release |publisher=The Apache Software Foundation |date=22 January 2015 |website=Apache Nutch News |access-date=18 January 2016}}
4. {{cite web |url=https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12327187 |title=Nutch 1.10 Release Notes |publisher=The Apache Software Foundation |date=6 May 2015 |website=ASF JIRA |access-date=18 January 2016}}
5. {{cite web |url=https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12329358 |title=Nutch 1.11 Release Notes |publisher=The Apache Software Foundation |date=7 December 2015 |website=ASF JIRA |access-date=18 January 2016}}
6. {{cite web |last=Siren |first=Sami |title=Using Nutch with Solr |url=https://lucidworks.com/blog/2009/03/09/nutch-solr/ |date=9 March 2009 |website=Lucidworks.com |accessdate=18 January 2016}}
7. Scalability of the Nutch search engine
8. Base Operating System Provisioning and Bringup for a Commercial Supercomputer {{webarchive |url=https://web.archive.org/web/20081203064621/http://weather.ou.edu/~apw/projects/cso/prov_paper.pdf |date=December 3, 2008}}
9. The Sapphire Web Crawler - Crawl Statistics. Boston.lti.cs.cmu.edu (2008-10-01). Retrieved on 2013-07-21.
10. {{cite web |url=https://creativecommons.org/weblog/entry/4388 |date=2004-09-03 |title=Our Updated Search |publisher=Creative Commons}}
11. {{cite web |url=https://creativecommons.org/press-releases/entry/5064 |title=Creative Commons Unique Search Tool Now Integrated into Firefox 1.0 |date=2004-11-22 |publisher=Creative Commons |deadurl=yes |archiveurl=https://web.archive.org/web/20100107065707/http://creativecommons.org/press-releases/entry/5064 |archivedate=2010-01-07}}
12. {{cite web |url=https://creativecommons.org/weblog/entry/6002 |date=2006-08-02 |title=New CC search UI |publisher=Creative Commons}}
13. Where can I get the source code for Wikia Search?
14. Update on Wikia – doing more of what’s working

Bibliography

{{Refbegin}}
* {{cite book |first1=J |last1=Shoberg |title=Building Search Applications with Lucene and Nutch |publisher=Apress |edition=1st |page=350 |date=October 26, 2006 |isbn=978-1-59059-687-6 |url=http://www.apress.com/book/view/9781590596876}}
{{Refend}}

External links
Categories: Internet search engines | Free search engine software | Java (programming language) libraries | Cross-platform free software | Free web crawlers