Class BuildHostMap

java.lang.Object
it.unimi.dsi.big.webgraph.BuildHostMap

public class BuildHostMap extends Object
A class computing host-related data given a list of URLs (usually, the URLs of the nodes of a web graph). All processing is performed by the static utility method run(BufferedReader, PrintStream, DataOutputStream, DataOutputStream, boolean, ProgressLogger).

Warning: the main method of this class writes the host list to standard output and also performs some logging, so make sure that logging is not directed to standard output.

Author:
Sebastiano Vigna
  • Field Details

    • DOTTED_ADDRESS

      public static final Pattern DOTTED_ADDRESS
  • Constructor Details

    • BuildHostMap

      public BuildHostMap()
  • Method Details

    • run

      public static void run(BufferedReader br, PrintStream hosts, DataOutputStream mapDos, DataOutputStream countDos, boolean topPrivateDomain, it.unimi.dsi.logging.ProgressLogger pl) throws IOException, URISyntaxException
      Reads URLs and writes the list of hosts (or, optionally, top private domains), together with a map from URLs to hosts and a count for each host.

      Warning: presently, this method uses an Object2IntOpenHashMap to store the map from host names to host indices. Thus, it cannot handle more than ≈700 million hosts.

      Parameters:
      br - the buffered reader returning the list of URLs.
      hosts - the print stream where hosts will be printed.
      mapDos - the data output stream where the map from URLs to hosts will be written (one long per URL).
      countDos - the data output stream where the host counts will be written (one long per host).
      topPrivateDomain - if true, URLs are mapped to top private domains, computed with InternetDomainName.topPrivateDomain(), rather than to hosts.
      pl - a progress logger, or null.
      Throws:
      IOException
      URISyntaxException
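
      The three outputs described above can be illustrated with a minimal sketch that uses only the standard library. It is not the implementation of this class: the class name HostMapSketch is hypothetical, a plain LinkedHashMap stands in for Object2IntOpenHashMap, and the sketch assumes that host indices are assigned in order of first appearance and that each count is the number of URLs mapped to that host. Top private domains and progress logging are left out.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of it.unimi.dsi.big.webgraph.
public class HostMapSketch {

    /**
     * Reads one URL per line, prints each host the first time it is seen,
     * writes one long per URL (the index of its host) to mapDos and one
     * long per host (assumed here to be its number of URLs) to countDos.
     */
    public static void run(BufferedReader br, PrintStream hosts,
            DataOutputStream mapDos, DataOutputStream countDos)
            throws IOException, URISyntaxException {
        Map<String, Integer> hostIndex = new LinkedHashMap<>();
        List<long[]> counts = new ArrayList<>();

        String line;
        while ((line = br.readLine()) != null) {
            String host = new URI(line).getHost();
            if (host == null) throw new URISyntaxException(line, "URL has no host");

            Integer known = hostIndex.get(host);
            int index;
            if (known == null) {
                // Assumption: indices follow the order of first appearance.
                index = hostIndex.size();
                hostIndex.put(host, index);
                counts.add(new long[1]);
                hosts.println(host);
            } else {
                index = known;
            }
            counts.get(index)[0]++;
            mapDos.writeLong(index); // one long per URL
        }

        for (long[] c : counts) countDos.writeLong(c[0]); // one long per host
        hosts.flush();
        mapDos.flush();
        countDos.flush();
    }

    public static void main(String[] args) throws Exception {
        String urls = "http://a.example/\nhttp://b.example/x\nhttp://a.example/y\n";
        ByteArrayOutputStream map = new ByteArrayOutputStream();
        ByteArrayOutputStream count = new ByteArrayOutputStream();
        // Hosts go to standard output; the two binary outputs are kept in memory.
        run(new BufferedReader(new StringReader(urls)), System.out,
                new DataOutputStream(map), new DataOutputStream(count));
    }
}
```

      Because both binary outputs are plain sequences of longs, they can be read back with DataInputStream.readLong(), one value per URL for the map and one value per host for the counts.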
    • main

      public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, URISyntaxException
      Throws:
      IOException
      com.martiansoftware.jsap.JSAPException
      URISyntaxException