
URL normalization

  1. Normalization process

     Normalizations that preserve semantics
     Normalizations that usually preserve semantics
     Normalizations that change semantics

  2. Normalization based on URL lists

  3. See also

  4. References

Not to be confused with URL canonicalization.

URL normalization is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized URL so it is possible to determine if two syntactically different URLs may be equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.

Normalization process

There are several types of normalization that may be performed. Some of them always preserve semantics, while others may not.

Normalizations that preserve semantics

The following normalizations are described in RFC 3986 [1] to result in equivalent URLs:

  • Converting the scheme and host to lower case. The scheme and host components of the URL are case-insensitive. Most normalizers will convert them to lowercase. Example:

HTTP://www.Example.com/ → http://www.example.com/

  • Capitalizing letters in escape sequences. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive; for consistency, normalizers should use uppercase hexadecimal digits. Example:

http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b

  • Decoding percent-encoded octets of unreserved characters. For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.[2] Example:

http://www.example.com/%7Eusername/ → http://www.example.com/~username/

  • Removing the default port. The default port (port 80 for the “http” scheme) may be removed from (or added to) a URL. Example:

http://www.example.com:80/bar.html → http://www.example.com/bar.html
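The four semantics-preserving steps above can be sketched in Python using only the standard library. This is a minimal illustration rather than a complete RFC 3986 normalizer: the function name `normalize_safe`, the helper `_fix_escape`, and the `DEFAULT_PORTS` table are assumptions made for this example, and percent-encoding inside the host is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986, Section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumed default-port table for this sketch; extend it for other schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escaped unreserved character; otherwise uppercase the hex digits."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_safe(url):
    """Apply only the normalizations that preserve semantics."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # SplitResult.hostname is already lowercased and has IPv6 brackets stripped.
    host = parts.hostname or ""
    if ":" in host:
        host = "[" + host + "]"

    # Keep the port only when it differs from the scheme's default.
    port = ""
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        port = ":" + str(parts.port)

    # Preserve any userinfo ("user:password@") exactly as written.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host + port

    path = PCT_TRIPLET.sub(_fix_escape, parts.path)
    query = PCT_TRIPLET.sub(_fix_escape, parts.query)
    fragment = PCT_TRIPLET.sub(_fix_escape, parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))
```

Escapes of reserved characters such as "%2F" are deliberately not decoded, since doing so could change how the URL is parsed.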

Normalizations that usually preserve semantics

For http and https URLs, the following normalizations listed in RFC 3986 may result in equivalent URLs, but are not guaranteed to by the standards:

  • Adding a trailing slash. Directories are indicated with a trailing slash, which should be included in URLs. Example:

http://www.example.com/alice → http://www.example.com/alice/

However, there is no way to know if a URL path component represents a directory or not. RFC 3986 notes that if the former URL redirects to the latter URL, then that is an indication that they are equivalent.

  • Removing dot-segments. The segments “..” and “.” can be removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm). Example:

http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

However, if a removed ".." component, e.g. "b/..", is a symlink to a directory with a different parent, eliding "b/.." will result in a different path and URL.[3] In rare cases, depending on the web server, this may even be true for the root directory (e.g. "//www.example.com/.." may not be equivalent to "//www.example.com/").
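The dot-segment removal mentioned above can be sketched as follows. The code follows the steps of the algorithm in RFC 3986, Section 5.2.4, and operates on the path component only; the function name `remove_dot_segments` is chosen for this example. It works purely on the text of the path and therefore cannot detect the symlink situation just described.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a URL path (after RFC 3986, 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../x" becomes "/x"
            if output:
                output.pop()             # drop the previous segment, if any
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

Applied to the path of the example above, `remove_dot_segments("/../a/b/../c/./d.html")` yields `"/a/c/d.html"`.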

Normalizations that change semantics

Applying the following normalizations results in a semantically different URL, although it may refer to the same resource:

  • Removing directory index. Default directory indexes are generally not needed in URLs. Examples:

http://www.example.com/default.asp → http://www.example.com/

http://www.example.com/a/index.html → http://www.example.com/a/

  • Removing the fragment. The fragment component of a URL is never seen by the server and can sometimes be removed. Example:

http://www.example.com/bar.html#section1 → http://www.example.com/bar.html

However, AJAX applications frequently use the value in the fragment.

  • Replacing IP with domain name. Check if the IP address maps to a domain name. Example:

http://208.77.188.166/ → http://www.example.com/

The reverse replacement is rarely safe due to virtual web servers.

  • Limiting protocols. URLs that differ only in their application layer protocol can be mapped to a single protocol. For example, the “https” scheme could be replaced with “http”. Example:

https://www.example.com/ → http://www.example.com/

  • Removing duplicate slashes. Paths which include two adjacent slashes could be converted to one. Example:

http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html

  • Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://example.com/ and http://www.example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URLs redirects to the other and normalize all URLs appropriately. Example:

http://www.example.com/ → http://example.com/

  • Sorting the query parameters. Some web pages use more than one query parameter in the URL. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URL. Example:

http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en

However, the order of parameters in a URL may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.[4]

  • Removing unused query variables. A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:

http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123

Note that a parameter without a value is not necessarily an unused parameter.

  • Removing default query parameters. A default value in the query string may render identically whether it is there or not. Example:

http://www.example.com/display?id=&sort=ascending → http://www.example.com/display

  • Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example:

http://www.example.com/display? → http://www.example.com/display
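Several of the query-related normalizations above (sorting parameters, removing unused parameters, dropping an empty "?", and optionally removing the fragment) can be sketched together in Python. The function name `normalize_query` and its `drop` and `drop_fragment` parameters are assumptions made for this illustration; which parameters count as unused is site-specific knowledge the caller has to supply. The sort is stable and compares names only, so repeated parameters keep their relative order.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_query(url, drop=frozenset(), drop_fragment=False):
    """Sort query fields by name, remove the names in `drop`, omit an empty "?"."""
    parts = urlsplit(url)

    # Work on the raw "name=value" fields so their encoding is left untouched.
    fields = [f for f in parts.query.split("&") if f]
    fields = [f for f in fields if f.partition("=")[0] not in drop]

    # Stable sort on the parameter name only: repeated names keep their order.
    fields.sort(key=lambda f: f.partition("=")[0])
    query = "&".join(fields)

    fragment = "" if drop_fragment else parts.fragment
    # urlunsplit omits "?" and "#" when the query or fragment is empty.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))
```

For instance, `normalize_query("http://www.example.com/display?id=123&fakefoo=fakebar", drop={"fakefoo"})` returns `"http://www.example.com/display?id=123"`. Because these steps can change semantics, a crawler would typically enable them only for sites where they are known to be safe.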

Normalization based on URL lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL

http://example.com/story?id=xyz

appears in a crawl log several times along with

http://example.com/story_xyz

we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
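Once such an equivalence has been inferred, it can be applied as a site-specific rewrite rule. The following Python sketch hard-codes one hypothetical rule for the example above; the names `STORY_RULE` and `apply_site_rule` are invented for this illustration, and in practice the rules would be derived from the crawl log rather than written by hand.

```python
import re

# Hypothetical rule, assumed to have been inferred from a crawl log:
# "http://example.com/story?id=<x>" is equivalent to "http://example.com/story_<x>".
STORY_RULE = re.compile(r"^http://example\.com/story\?id=([^&#]+)$")


def apply_site_rule(url):
    """Rewrite URLs matching the rule to the chosen form; leave others unchanged."""
    return STORY_RULE.sub(r"http://example.com/story_\1", url)
```

Here `apply_site_rule("http://example.com/story?id=xyz")` gives `"http://example.com/story_xyz"`, while URLs that do not match the pattern are returned as they are.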

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.

See also

  • Uniform Resource Locator
  • Fragment identifier
  • Web crawler

References

1. ^RFC 3986, Section 6: Normalization and Comparison
2. ^RFC 3986, Section 2.3.: Unreserved Characters
3. ^"Secure Coding in C and C++". Securecoding.cert.org. https://www.securecoding.cert.org/confluence/download/attachments/26017980/08+File+System+Vulnerabilities.pdf. Retrieved 2013-08-24.
4. ^"jQuery 1.4 $.param demystified". Ben Alman. 2009-12-20. http://benalman.com/news/2009/12/jquery-14-param-demystified/. Retrieved 2013-08-24.
  • RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
  • Sang Ho Lee; Sung Jin Kim & Seok Hoo Hong (2005). "On URL normalization". Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005). pp. 1076–1085. Archived from the original (http://dblab.ssu.ac.kr/publication/LeKi05a.pdf) at https://web.archive.org/web/20060918115757/http://dblab.ssu.ac.kr/publication/LeKi05a.pdf on 2006-09-18.
  • Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2006). "Do not crawl in the dust: different URLs with similar text". Proceedings of the 15th international conference on World Wide Web. pp. 1015–1016. http://www2006.org/programme/item.php?id=p20
  • Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2007). "Do not crawl in the dust: different URLs with similar text". Proceedings of the 16th international conference on World Wide Web. pp. 111–120. http://www2007.org/paper194.php
