Class URL2

java.lang.Object
it.unimi.dsi.big.webgraph.webbase.URL2
All Implemented Interfaces:
Serializable, Comparable<URL2>

public final class URL2 extends Object implements Serializable, Comparable<URL2>
A reimplementation of URL better tailored to our needs. This class performs some normalization on URL names, etc. In particular, it strips references.
  • Constructor Details

    • URL2

      public URL2(String spec)
      Creates a URL object from the String representation.

      This constructor is equivalent to a call to the two-argument constructor with a null first argument.

      Parameters:
      spec - the String to parse as a URL.
    • URL2

      public URL2(URL2 context, String spec)
      Creates a URL by parsing the given spec within a specified context. The new URL is created from the given context URL and the spec argument as described in RFC2396 "Uniform Resource Identifiers : Generic Syntax" :
                <scheme>://<authority><path>?<query>#<fragment>
       
      The reference is parsed into the scheme, authority, path, query and fragment parts. If the path component is empty and the scheme, authority, and query components are undefined, then the new URL is a reference to the current document. Otherwise, the fragment and query parts present in the spec are used in the new URL.

      If the scheme component is defined in the given spec and does not match the scheme of the context, then the new URL is created as an absolute URL based on the spec alone. Otherwise the scheme component is inherited from the context URL.

      If the authority component is present in the spec, then the spec is treated as absolute and the spec authority and path replace the context authority and path. If the authority component is absent in the spec, then the authority of the new URL is inherited from the context.

      If the spec's path component begins with a slash character "/", then the path is treated as absolute and the spec path replaces the context path. Otherwise the path is treated as a relative path and is appended to the context path. The path is canonicalized through the removal of directory changes made by occurrences of ".." and ".".

      For a more detailed description of URL parsing, refer to RFC2396.

      NOTE: some sanitization is now performed on paths and queries. In particular, "//" sequences are collapsed in paths, and "/" is %-encoded in queries.
      Parameters:
      context - the context in which to parse the specification.
      spec - the String to parse as a URL.
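
      The resolution rules above can be tried out with the standard java.net.URI class, which also follows RFC 2396. This is only a sketch: URI is a stand-in for URL2, the class name ResolveSketch and the example.com addresses are made up for illustration, and the extra sanitization mentioned in the NOTE is not reproduced. Corner cases may behave differently in URL2.

```java
import java.net.URI;

// Sketch only: java.net.URI stands in for URL2 to illustrate RFC 2396 resolution.
final class ResolveSketch {

    /** Resolves spec against context and returns the resulting URL as a string. */
    static String resolve(String context, String spec) {
        // URI.resolve() merges relative paths and removes "." and ".." segments.
        return URI.create(context).resolve(spec).toString();
    }

    public static void main(String[] args) {
        String context = "http://example.com/a/b/c.html"; // hypothetical context URL

        // Relative path: appended to the context path, then canonicalized.
        System.out.println(resolve(context, "../d.html"));
        // Path starting with "/": replaces the context path.
        System.out.println(resolve(context, "/x/y"));
        // Authority present: replaces the context authority and path.
        System.out.println(resolve(context, "//other.org/z"));
        // Different scheme: the spec is used on its own.
        System.out.println(resolve(context, "ftp://host/f"));
    }
}
```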
  • Method Details

    • normalizeURLFragment

      public static String normalizeURLFragment(CharsetEncoder UTF8Encoder, String fragment) throws CharacterCodingException
      Normalizes a URL fragment.

      This method returns the normalization of its argument. All characters that are illegal are first UTF-8 encoded, and then represented with the %-notation.

      Parameters:
      UTF8Encoder - a (possibly cached) UTF-8 encoder.
      fragment - a URL fragment (possibly null).
      Returns:
      the normalized version.
      Throws:
      CharacterCodingException
    • normalizeURLFragment

      public static String normalizeURLFragment(String fragment) throws CharacterCodingException
      Normalizes a URL fragment.

      This method returns the normalization of its argument. All characters that are illegal are first UTF-8 encoded, and then represented with the %-notation.

      Parameters:
      fragment - a URL fragment (possibly null).
      Returns:
      the normalized version.
      Throws:
      CharacterCodingException
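
      The encode-then-escape idea behind both overloads can be sketched as follows. This is not the library's code. The class name FragmentNormalizerSketch, the set of characters treated as legal, the uppercase hexadecimal digits, and returning null for a null argument are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: illustrates UTF-8 encoding followed by %-notation.
final class FragmentNormalizerSketch {
    // Assumption: ASCII letters, digits and these marks are left untouched.
    private static final String LEGAL_MARKS = "-_.!~*'()/?:@&=+$,;%";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private static boolean isLegal(int cp) {
        if (cp >= 128) return false;
        boolean alnum = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || (cp >= '0' && cp <= '9');
        return alnum || LEGAL_MARKS.indexOf(cp) >= 0;
    }

    static String normalize(CharsetEncoder utf8Encoder, String fragment) throws CharacterCodingException {
        if (fragment == null) return null; // assumption: null is passed through
        StringBuilder out = new StringBuilder(fragment.length());
        int i = 0;
        while (i < fragment.length()) {
            int cp = fragment.codePointAt(i);
            int len = Character.charCount(cp);
            if (isLegal(cp)) {
                out.append((char) cp);
            } else {
                // Encode the whole code point to UTF-8, then escape each byte.
                ByteBuffer bytes = utf8Encoder.encode(CharBuffer.wrap(fragment, i, i + len));
                while (bytes.hasRemaining()) {
                    int b = bytes.get() & 0xFF;
                    out.append('%').append(HEX[b >> 4]).append(HEX[b & 0xF]);
                }
            }
            i += len;
        }
        return out.toString();
    }

    // Convenience overload that creates its own encoder.
    static String normalize(String fragment) throws CharacterCodingException {
        return normalize(StandardCharsets.UTF_8.newEncoder(), fragment);
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(normalize("caf\u00e9 au lait")); // prints caf%C3%A9%20au%20lait
    }
}
```

      Passing a cached encoder, as the two-argument form allows, avoids creating a new CharsetEncoder for every fragment when many URLs are processed.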
    • parseURL

      protected void parseURL(URL2 u, String spec, int start, int limit)
      Parses the string representation of a URL into a URL object.

      If there is any inherited context, then it has already been copied into the URL argument.

      The parseURL method of URLStreamHandler parses the string representation as if it were an http specification. Most URL protocol families have a similar parsing. A stream protocol handler for a protocol that has a different syntax must override this routine.

      Parameters:
      u - the URL to receive the result of parsing the spec.
      spec - the String representing the URL that must be parsed.
      start - the character index at which to begin parsing. This is just past the ':' (if there is one) that terminates the protocol name.
      limit - the character position to stop parsing at. This is the end of the string or the position of the "#" character, if present. All information after the sharp sign indicates an anchor.
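
      To make the meaning of start and limit concrete, the sketch below computes both indices for a given spec. It is a simplified illustration, not the logic used by the constructors: the class name ParseBoundsSketch and the scheme check are assumptions.

```java
// Sketch only: derives the start and limit arguments described for parseURL.
final class ParseBoundsSketch {

    /** Returns {start, limit} for the given spec. */
    static int[] bounds(String spec) {
        // limit: position of '#', or the end of the string when there is none.
        int hash = spec.indexOf('#');
        int limit = hash >= 0 ? hash : spec.length();

        // start: just past the ':' that ends the protocol name, if there is one.
        int start = 0;
        int colon = spec.indexOf(':');
        if (colon > 0 && colon < limit && looksLikeScheme(spec, colon)) {
            start = colon + 1;
        }
        return new int[] { start, limit };
    }

    // Assumption: a scheme is a letter followed by letters, digits, '+', '-' or '.'.
    private static boolean looksLikeScheme(String s, int end) {
        if (!Character.isLetter(s.charAt(0))) return false;
        for (int i = 1; i < end; i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '+' && c != '-' && c != '.') return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String spec = "http://example.com/a#top"; // hypothetical spec
        int[] b = bounds(spec);
        System.out.println(spec.substring(b[0], b[1])); // prints //example.com/a
    }
}
```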
    • set

      protected void set(String protocol, String host, int port, String authority, String userInfo, String path, String query, String ref)
      Sets the specified 8 fields of the URL. This is not a public method so that only URLStreamHandlers can modify URL fields. URLs are otherwise constant.
      Parameters:
      protocol - the name of the protocol to use
      host - the name of the host
      port - the port number on the host
      authority - the authority part for the url
      userInfo - the username and password
      path - the file on the host
      query - the query part of this URL
      ref - the internal reference in the URL
    • isValid

      public boolean isValid()
    • getQuery

      public String getQuery()
      Returns the query part of this URL.
      Returns:
      the query part of this URL.
    • getPath

      public String getPath()
      Returns the path part of this URL.
      Returns:
      the path part of this URL.
    • getUserInfo

      public String getUserInfo()
      Returns the userInfo part of this URL.
      Returns:
      the userInfo part of this URL.
    • getAuthority

      public String getAuthority()
      Returns the authority part of this URL.
      Returns:
      the authority part of this URL.
    • getPort

      public int getPort()
      Returns the port number of this URL. Returns -1 if the port is not set.
      Returns:
      the port number
    • getProtocol

      public String getProtocol()
      Returns the protocol name of this URL.
      Returns:
      the protocol of this URL.
    • getScheme

      public String getScheme()
      An alias for getProtocol().
      Returns:
      the protocol of this URL.
    • getHost

      public String getHost()
      Returns the host name of this URL, if applicable.
      Returns:
      the host name of this URL.
    • getFile

      public String getFile()
      Returns the file name of this URL.
      Returns:
      the file name of this URL.
    • getFileExtension

      public String getFileExtension()
      Returns the file name extension of this URL. For example, if the file name is index.html, then html is returned. If no valid extension is found, null is returned.
      Returns:
      the file name extension of this URL.
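
      A minimal sketch of this behavior is shown below. The documentation does not define what counts as a valid extension, so the rule used here (a non-empty run of letters and digits after the last dot of the last path segment) is an assumption, as is the class name FileExtensionSketch.

```java
// Sketch only: one plausible reading of "file name extension".
final class FileExtensionSketch {

    /** Returns the extension of the last path segment, or null if there is none. */
    static String extensionOf(String path) {
        if (path == null) return null;
        // Keep only the part after the last '/'.
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) return null;
        String ext = name.substring(dot + 1);
        // Assumption: a valid extension contains only letters and digits.
        for (int i = 0; i < ext.length(); i++) {
            if (!Character.isLetterOrDigit(ext.charAt(i))) return null;
        }
        return ext;
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("/docs/index.html")); // prints html
        System.out.println(extensionOf("/docs/"));           // prints null
    }
}
```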
    • getRef

      public String getRef()
      Returns the anchor (also known as the "reference") of this URL.
      Returns:
      the anchor (also known as the "reference") of this URL.
    • getFragment

      public String getFragment()
      An alias for getRef().
      Returns:
      the anchor (also known as the "reference") of this URL.
    • equals

      public boolean equals(Object obj)
      Compares two URLs. The result is true if and only if the argument is not null and is a URL object that represents the same URL as this object. Two URL objects are equal if they have the same protocol and reference the same host, the same port number on the host, and the same file and anchor on the host.
      Overrides:
      equals in class Object
      Parameters:
      obj - the URL to compare against.
      Returns:
      true if the objects are the same; false otherwise.
    • compareTo

      public int compareTo(URL2 url)
      Specified by:
      compareTo in interface Comparable<URL2>
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • hashCode64

      public long hashCode64()
    • toString

      public String toString()
      Constructs a string representation of this URL.
      Overrides:
      toString in class Object
      Returns:
      a string representation of this object.
    • getDomain

      public String getDomain()
      Extracts the domain name of this URL. This is useful for avoiding correlated links. The method works by considering the right-most, most-significant and non-common suffix of the URL. Examples:

      http://www.ox.ac.uk/ returns ox.ac.uk
      http://something.somethingelse.web.com/ returns somethingelse.web.com
      http://www.microsoft.com/ returns microsoft.com
      http://www.dsi.unimi.it/ returns unimi.it

      Returns:
      a String indicating the domain name.
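
      One way to reproduce the examples above is to walk the host name from right to left, skipping labels that are considered common, and to keep everything from the first non-common label onward. The sketch below does this with a small, made-up list of common labels. The actual list and algorithm used by getDomain() are not documented here, so the class name DomainSketch and the COMMON set are assumptions.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Sketch only: a heuristic that matches the documented examples.
final class DomainSketch {
    // Assumption: an illustrative list of "common" labels, not the library's list.
    private static final Set<String> COMMON =
            Set.of("com", "org", "net", "edu", "gov", "ac", "co", "web", "uk", "it");

    /** Returns the right-most suffix of host that starts with a non-common label. */
    static String domainOf(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int i = labels.length - 1;
        // Skip common labels from the right, but never go past the first label.
        while (i > 0 && COMMON.contains(labels[i])) i--;
        return String.join(".", Arrays.copyOfRange(labels, i, labels.length));
    }

    public static void main(String[] args) {
        System.out.println(domainOf("www.ox.ac.uk"));                    // prints ox.ac.uk
        System.out.println(domainOf("something.somethingelse.web.com")); // prints somethingelse.web.com
        System.out.println(domainOf("www.microsoft.com"));               // prints microsoft.com
        System.out.println(domainOf("www.dsi.unimi.it"));                // prints unimi.it
    }
}
```

      Note that the sketch takes a host name rather than a full URL, so the host would first have to be extracted, for instance with getHost().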