Class BuildHostMap


  • public class BuildHostMap
    extends java.lang.Object
    Computes host-related data from a list of URLs (usually, the URLs of the nodes of a web graph). All processing is performed by the static utility method run(BufferedReader, PrintStream, DataOutputStream, DataOutputStream, boolean, ProgressLogger).

    Warning: the main method of this class writes the host list to standard output, but it also does some logging. Make sure that logging is not directed to standard output, or log lines will end up mixed with the host list.

    Author:
    Sebastiano Vigna
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.util.regex.Pattern DOTTED_ADDRESS  
    • Constructor Summary

      Constructors 
      Constructor Description
      BuildHostMap()  
    • Method Summary

      Methods
      Modifier and Type Method Description
      static void main(java.lang.String[] arg)
      static void run(java.io.BufferedReader br, java.io.PrintStream hosts, java.io.DataOutputStream mapDos, java.io.DataOutputStream countDos, boolean topPrivateDomain, it.unimi.dsi.logging.ProgressLogger pl)
      Reads URLs and writes hosts (or, if requested, top private domains), together with a map from URLs to hosts and the host counts.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DOTTED_ADDRESS

        public static final java.util.regex.Pattern DOTTED_ADDRESS
    • Constructor Detail

      • BuildHostMap

        public BuildHostMap()
    • Method Detail

      • run

        public static void run(java.io.BufferedReader br,
                               java.io.PrintStream hosts,
                               java.io.DataOutputStream mapDos,
                               java.io.DataOutputStream countDos,
                               boolean topPrivateDomain,
                               it.unimi.dsi.logging.ProgressLogger pl)
                        throws java.io.IOException,
                               java.net.URISyntaxException
        Reads URLs and writes hosts (or, if topPrivateDomain is true, top private domains), together with a map from URLs to hosts and the host counts.

        Warning: presently, this method uses an Object2IntOpenHashMap to store the map from host names to host indices. Thus, it cannot handle more than ≈700 million hosts.

        Parameters:
        br - the buffered reader returning the list of URLs.
        hosts - the print stream where hosts will be printed.
        mapDos - the data output stream where the map from URLs to hosts will be written (one long per URL).
        countDos - the data output stream where the host counts will be written (one long per host).
        topPrivateDomain - if true, URLs are mapped to top private domains (using InternetDomainName.topPrivateDomain()) rather than to hosts.
        pl - a progress logger, or null.
        Throws:
        java.io.IOException
        java.net.URISyntaxException
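
        The data flow described above can be illustrated with a minimal, JDK-only sketch. It is not the implementation of this class: HostMapSketch is a hypothetical name, the topPrivateDomain and pl parameters are left out, a plain HashMap replaces Object2IntOpenHashMap, and it assumes that hosts are numbered in order of first appearance and that each host count is the number of URLs mapped to that host.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the BuildHostMap implementation.
public class HostMapSketch {

    /** Reads one URL per line; writes distinct hosts, one long per URL, one long per host. */
    public static void run(BufferedReader br, PrintStream hosts,
                           DataOutputStream mapDos, DataOutputStream countDos)
            throws IOException, URISyntaxException {
        Map<String, Integer> hostIndex = new HashMap<>(); // host name -> host index
        List<Long> counts = new ArrayList<>();            // host index -> number of URLs
        String line;
        while ((line = br.readLine()) != null) {
            String host = new URI(line).getHost();
            if (host == null) throw new URISyntaxException(line, "URL has no host");
            Integer known = hostIndex.get(host);
            int index;
            if (known == null) {
                // Assumption: hosts are numbered in order of first appearance.
                index = hostIndex.size();
                hostIndex.put(host, index);
                counts.add(0L);
                hosts.println(host); // each distinct host is printed once
            } else {
                index = known;
            }
            counts.set(index, counts.get(index) + 1);
            mapDos.writeLong(index); // one long per URL: the index of its host
        }
        // Assumption: the count of a host is the number of URLs mapped to it.
        for (long count : counts) countDos.writeLong(count);
        hosts.flush();
        mapDos.flush();
        countDos.flush();
    }

    public static void main(String[] args) throws Exception {
        String urls = "http://a.example/x\nhttp://b.example/\nhttp://a.example/y\n";
        ByteArrayOutputStream map = new ByteArrayOutputStream();
        ByteArrayOutputStream counts = new ByteArrayOutputStream();
        // Hosts go to standard output; the summary goes to standard error.
        run(new BufferedReader(new StringReader(urls)), System.out,
            new DataOutputStream(map), new DataOutputStream(counts));
        System.err.println(map.size() / Long.BYTES + " map entries, "
            + counts.size() / Long.BYTES + " host counts");
    }
}
```

        In this sketch the host list is the only thing written to standard output and the summary goes to standard error, so the two streams can be redirected independently.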
      • main

        public static void main(java.lang.String[] arg)
                         throws java.io.IOException,
                                com.martiansoftware.jsap.JSAPException,
                                java.net.URISyntaxException
        Throws:
        java.io.IOException
        com.martiansoftware.jsap.JSAPException
        java.net.URISyntaxException