Package com.optimaize.langdetect.text
Class RemoveMinorityScriptsTextFilter
java.lang.Object
  com.optimaize.langdetect.text.RemoveMinorityScriptsTextFilter
- All Implemented Interfaces:
TextFilter
public class RemoveMinorityScriptsTextFilter extends java.lang.Object implements TextFilter
Removes text written in scripts other than the dominant script of the text. TODO: there is no special handling for Japanese (3 scripts) and Korean (2 scripts); their scripts should be counted together and kept.
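To find the dominant script, the characters of the text have to be tallied per Unicode script first. The following is a minimal sketch of such a tally, not the library's actual code: the class name ScriptCountSketch is hypothetical, and counting by code point (and including the Common script in the tally) is an assumption, since this page does not document how the private countByScript method works.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration; not part of com.optimaize.langdetect.
class ScriptCountSketch {

    // Tallies the code points of the text by their Unicode script.
    // Assumption: every code point is counted, including Common (spaces, punctuation).
    static Map<UnicodeScript, Long> countByScript(CharSequence text) {
        Map<UnicodeScript, Long> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp ->
                counts.merge(UnicodeScript.of(cp), 1L, Long::sum));
        return counts;
    }

    public static void main(String[] args) {
        // A Cyrillic word (six letters, written with escapes) followed by " hi!".
        String mixed = "\u041f\u0440\u0438\u0432\u0435\u0442 hi!";
        // The map holds one entry each for COMMON, LATIN and CYRILLIC.
        System.out.println(countByScript(mixed));
    }
}
```

In this sample, Cyrillic has the highest count and would be the dominant script; Latin is the minority candidate.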
-
-
Field Summary
Fields
Modifier and Type    Field
private double       threshold
-
Constructor Summary
Constructors
Modifier    Constructor
private     RemoveMinorityScriptsTextFilter(double threshold)
-
Method Summary
Methods (Modifier and Type, Method, Description)
private java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long>  countByScript(java.lang.CharSequence text)
java.lang.String  filter(java.lang.CharSequence text)
private long  findMost(java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long> counts)
static RemoveMinorityScriptsTextFilter  forThreshold(double threshold)
    If a script has less than this fraction of content compared to the most used one, its text is removed.
private void  increment(java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long> counter, java.lang.Character.UnicodeScript unicodeScript)
private java.lang.String  remove(java.lang.CharSequence text, java.util.Set<java.lang.Character.UnicodeScript> toRemove)
-
-
-
Method Detail
-
forThreshold
public static RemoveMinorityScriptsTextFilter forThreshold(double threshold)
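If a script has less than this fraction of content compared to the most used one, its text is removed. Example: Latin 10%, Cyrillic 80%, Common 10% (punctuation and the like). The 10% for Latin is then compared to the 80% for Cyrillic, giving a ratio of 10/80 = 0.125.
- Parameters:
threshold - A value from 0 to 1; the suggested value is 0.3. A script whose ratio is smaller than the threshold is removed; a script whose ratio equals the threshold remains.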
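The threshold rule can be sketched as follows, using the numbers from the example. This is an illustration under stated assumptions, not the library's implementation: the class ThresholdSketch and the method scriptsToRemove are hypothetical names, and the Common script is simply left out of the input because this page does not say how it is treated.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of com.optimaize.langdetect.
class ThresholdSketch {

    // Returns the scripts whose count, relative to the most used script,
    // is strictly below the threshold. A ratio equal to the threshold is kept.
    static Set<UnicodeScript> scriptsToRemove(Map<UnicodeScript, Long> counts, double threshold) {
        long most = 0;
        for (long c : counts.values()) {
            most = Math.max(most, c);
        }
        Set<UnicodeScript> toRemove = EnumSet.noneOf(UnicodeScript.class);
        if (most == 0) {
            return toRemove; // nothing counted, nothing to remove
        }
        for (Map.Entry<UnicodeScript, Long> e : counts.entrySet()) {
            double ratio = (double) e.getValue() / most;
            if (ratio < threshold) {
                toRemove.add(e.getKey());
            }
        }
        return toRemove;
    }

    public static void main(String[] args) {
        Map<UnicodeScript, Long> counts = new EnumMap<>(UnicodeScript.class);
        counts.put(UnicodeScript.LATIN, 10L);
        counts.put(UnicodeScript.CYRILLIC, 80L);
        // Latin: 10 / 80 = 0.125, which is below 0.3, so Latin is removed.
        System.out.println(scriptsToRemove(counts, 0.3)); // prints [LATIN]
    }
}
```

With the real class, the equivalent configuration would be obtained through forThreshold(0.3) and applied through filter(text).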
-
filter
public java.lang.String filter(java.lang.CharSequence text)
- Specified by:
filter in interface TextFilter
-
remove
private java.lang.String remove(java.lang.CharSequence text, java.util.Set<java.lang.Character.UnicodeScript> toRemove)
-
findMost
private long findMost(java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long> counts)
-
countByScript
private java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long> countByScript(java.lang.CharSequence text)
-
increment
private void increment(java.util.Map<java.lang.Character.UnicodeScript,java.lang.Long> counter, java.lang.Character.UnicodeScript unicodeScript)
-
-