Class RemoveMinorityScriptsTextFilter

  • All Implemented Interfaces:
    TextFilter

    public class RemoveMinorityScriptsTextFilter
    extends java.lang.Object
    implements TextFilter
    Removes text written in scripts other than the dominant script of the text. TODO: Japanese (three scripts) and Korean (two scripts) get no special handling yet; their scripts should be counted together and kept.
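The idea of counting characters per script and then dropping the minority scripts can be illustrated with a minimal sketch. This is not the class's actual source: the class name ScriptCountSketch and the method bodies are assumptions. They are modeled on the private method signatures listed on this page and built only on the standard Character.UnicodeScript.of(int) lookup.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the real implementation of RemoveMinorityScriptsTextFilter.
class ScriptCountSketch {

    /** Counts how many code points of the text belong to each Unicode script. */
    static Map<UnicodeScript, Long> countByScript(CharSequence text) {
        Map<UnicodeScript, Long> counts = new EnumMap<>(UnicodeScript.class);
        // Iterate over code points (not chars) so supplementary characters count once.
        text.codePoints().forEach(cp -> counts.merge(UnicodeScript.of(cp), 1L, Long::sum));
        return counts;
    }

    /** Returns the text without the code points whose script is in toRemove. */
    static String remove(CharSequence text, Set<UnicodeScript> toRemove) {
        StringBuilder kept = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !toRemove.contains(UnicodeScript.of(cp)))
            .forEach(kept::appendCodePoint);
        return kept.toString();
    }
}
```

In this sketch, spaces and punctuation are reported as UnicodeScript.COMMON, so they survive unless COMMON itself is placed in the removal set.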
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private double threshold  
    • Method Summary

      Modifier and Type Method Description
      private java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> countByScript​(java.lang.CharSequence text)  
      java.lang.String filter​(java.lang.CharSequence text)  
      private long findMost​(java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> counts)  
      static RemoveMinorityScriptsTextFilter forThreshold​(double threshold)
      If a script accounts for less than this fraction of the content relative to the most used script, its text is removed.
      private void increment​(java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> counter, java.lang.Character.UnicodeScript unicodeScript)  
      private java.lang.String remove​(java.lang.CharSequence text, java.util.Set<java.lang.Character.UnicodeScript> toRemove)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • threshold

        private final double threshold
    • Constructor Detail

      • RemoveMinorityScriptsTextFilter

        private RemoveMinorityScriptsTextFilter​(double threshold)
    • Method Detail

      • forThreshold

        public static RemoveMinorityScriptsTextFilter forThreshold​(double threshold)
        If a script accounts for less than this fraction of the content relative to the most used script, its text is removed. Example: Latin 10%, Cyrillic 80%, Common 10% (punctuation and similar). Latin's 10 is compared against Cyrillic's 80, giving a ratio of 0.125.
        Parameters:
        threshold - a value between 0 and 1; the suggested value is 0.3. A script whose ratio is smaller than the threshold is removed; a script whose ratio equals the threshold is kept.
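The threshold comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the class's source: the class name ThresholdSketch and the helper belowThreshold are hypothetical, and leaving COMMON and INHERITED out of both the maximum and the removal set is a guess at how punctuation is handled.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the threshold rule; names and COMMON handling are assumptions.
class ThresholdSketch {

    /** Assumption: COMMON and INHERITED are neutral and never compared or removed. */
    private static boolean isNeutral(UnicodeScript script) {
        return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
    }

    /** Largest count among the non-neutral scripts, or 0 if there are none. */
    static long findMost(Map<UnicodeScript, Long> counts) {
        long most = 0;
        for (Map.Entry<UnicodeScript, Long> e : counts.entrySet()) {
            if (!isNeutral(e.getKey())) {
                most = Math.max(most, e.getValue());
            }
        }
        return most;
    }

    /** Scripts whose count divided by the largest count is strictly below the threshold. */
    static Set<UnicodeScript> belowThreshold(Map<UnicodeScript, Long> counts, double threshold) {
        Set<UnicodeScript> toRemove = EnumSet.noneOf(UnicodeScript.class);
        long most = findMost(counts);
        if (most == 0) {
            return toRemove; // nothing to compare against
        }
        for (Map.Entry<UnicodeScript, Long> e : counts.entrySet()) {
            if (isNeutral(e.getKey())) {
                continue;
            }
            double ratio = (double) e.getValue() / most;
            if (ratio < threshold) { // strictly smaller: an equal ratio is kept
                toRemove.add(e.getKey());
            }
        }
        return toRemove;
    }
}
```

With counts of Latin 10, Cyrillic 80 and Common 10, the Latin ratio is 10 / 80 = 0.125. A threshold of 0.3 therefore marks Latin for removal, while a threshold of exactly 0.125 keeps it.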
      • filter

        public java.lang.String filter​(java.lang.CharSequence text)
        Specified by:
        filter in interface TextFilter
      • remove

        private java.lang.String remove​(java.lang.CharSequence text,
                                        java.util.Set<java.lang.Character.UnicodeScript> toRemove)
      • findMost

        private long findMost​(java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> counts)
      • countByScript

        private java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> countByScript​(java.lang.CharSequence text)
      • increment

        private void increment​(java.util.Map<java.lang.Character.UnicodeScript,​java.lang.Long> counter,
                               java.lang.Character.UnicodeScript unicodeScript)