Character frequencies in Czech text, useful mainly for frequency analysis.
Character | Frequency (%) |
---|---|
a | 6.2193% |
á | 2.2355% |
b | 1.5582% |
c | 1.6067% |
č | 0.9490% |
d | 3.6019% |
ď | 0.0222% |
e | 7.6952% |
é | 1.3346% |
ě | 1.6453% |
f | 0.2732% |
g | 0.2729% |
h | 1.2712% |
ch | 1.1709% |
i | 4.3528% |
í | 3.2699% |
j | 2.1194% |
k | 3.7367% |
l | 3.8424% |
m | 3.2267% |
n | 6.5353% |
ň | 0.0814% |
o | 8.6664% |
ó | 0.0313% |
p | 3.4127% |
q | 0.0013% |
r | 3.6970% |
ř | 1.2166% |
s | 4.5160% |
š | 0.8052% |
t | 5.7268% |
ť | 0.0426% |
u | 3.1443% |
ú | 0.1031% |
ů | 0.6948% |
v | 4.6616% |
w | 0.0088% |
x | 0.0755% |
y | 1.9093% |
ý | 1.0721% |
z | 2.1987% |
ž | 0.9952% |
Figure: Character frequencies in Czech text (%)
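As a rough illustration of how the table can be used in frequency analysis, the sketch below measures how far the letter distribution of a text is from the values listed above. It is a minimal example under stated assumptions, not a reference method: the class name `CzechFrequencyDistance`, the method `distance`, the choice of a sum of squared differences, and the restriction to the ten most frequent letters are all illustrative. The percentages in the table add up to 100%, which suggests they were computed over letters only, so the sketch ignores spaces and punctuation. For simplicity it counts `c` and `h` individually rather than as the digraph `ch`.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class CzechFrequencyDistance {
    // Ten most frequent letters from the table above, in percent.
    static final Map<Character, Double> EXPECTED = new LinkedHashMap<>();
    static {
        EXPECTED.put('o', 8.6664);
        EXPECTED.put('e', 7.6952);
        EXPECTED.put('n', 6.5353);
        EXPECTED.put('a', 6.2193);
        EXPECTED.put('t', 5.7268);
        EXPECTED.put('v', 4.6616);
        EXPECTED.put('s', 4.5160);
        EXPECTED.put('i', 4.3528);
        EXPECTED.put('l', 3.8424);
        EXPECTED.put('k', 3.7367);
    }

    /**
     * Sum of squared differences between observed and expected percentages,
     * taken over the letters in EXPECTED only. Lower means "more Czech-like".
     * Assumption: percentages are relative to the number of letters in the text.
     */
    static double distance(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        int letters = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetter(c)) {
                counts.merge(c, 1, Integer::sum);
                letters++;
            }
        }
        if (letters == 0) {
            return Double.POSITIVE_INFINITY; // no letters, nothing to compare
        }
        double sum = 0.0;
        for (Map.Entry<Character, Double> e : EXPECTED.entrySet()) {
            double observed = 100.0 * counts.getOrDefault(e.getKey(), 0) / letters;
            double diff = observed - e.getValue();
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        String czech = "Dnes je venku velmi pěkné počasí, a tak jsme se vydali na procházku do lesa.";
        System.out.println("Czech sentence: " + distance(czech));
        System.out.println("Junk text:      " + distance("qqqq wwww xxxx"));
    }
}
```

Short texts give noisy results, so a score like this is only meaningful for comparing candidates of similar length, for example the possible decryptions of a substitution cipher.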
Bigrams
ST, PR, SK, CH, DN, TR
Trigrams
PRO, UNI, OST, STA, ANI, OVA, YCH, STI, PRI, PRE, OJE, REN, IST, STR, EHO, TER, RED, ICH
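Lists like the bigrams and trigrams above can be produced by counting letter n-grams in a corpus. The following is a minimal sketch, assuming that the text is upper-cased and stripped of diacritics first (the lists above contain only unaccented capitals, but the source does not say how they were computed) and that n-grams do not span word boundaries. The class name `NGramCounter` and its methods are hypothetical.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NGramCounter {
    /** Strips diacritics (via Unicode NFD decomposition) and upper-cases the text. */
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toUpperCase(Locale.ROOT);
    }

    /** Counts all n-grams of length n inside each word of the text. */
    static Map<String, Integer> count(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter, so n-grams stay within words.
        for (String word : normalize(text).split("[^\\p{L}]+")) {
            for (int i = 0; i + n <= word.length(); i++) {
                counts.merge(word.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("strom most", 2));
        System.out.println(count("přístroj", 3));
    }
}
```

Sorting the resulting map by value in descending order gives a ranking comparable to the lists above.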
Code
```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

// The wrapper class name is illustrative; only the method comes from the original.
public class CharacterFrequency {
    /**
     * Prints the frequency of each character in the file (in percent);
     * newline characters are ignored.
     * @param source   source file
     * @param encoding file encoding
     */
    public static void count(File source, String encoding) throws IOException {
        Map<Character, Integer> occurrences = new TreeMap<>();
        int counter = 0; // total number of characters read
        // try-with-resources closes the reader when done
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(source), encoding))) {
            String s;
            while ((s = reader.readLine()) != null) {
                for (int i = 0; i < s.length(); i++) {
                    counter++;
                    occurrences.merge(s.charAt(i), 1, Integer::sum);
                }
            }
        }
        for (Map.Entry<Character, Integer> entry : occurrences.entrySet()) {
            System.out.println(entry.getKey() + ": "
                    + (entry.getValue() / (double) counter * 100));
        }
    }
}
```
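The `count` method above counts every Java `char`, including spaces and punctuation, distinguishes upper and lower case, and cannot produce the `ch` row of the table. The sketch below is one possible variant that counts letters only, folds case, and treats `ch` as a single letter. Whether the table was computed exactly this way is an assumption (it is consistent with the listed percentages adding up to 100%). The class name `CzechLetterFrequency` and the method `frequencies` are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CzechLetterFrequency {
    /**
     * Returns the percentage of each letter in the text, treating "ch" as one letter.
     * Non-letters are skipped and case is ignored.
     */
    static Map<String, Double> frequencies(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        int total = 0;
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (!Character.isLetter(c)) {
                continue;
            }
            String letter = String.valueOf(c);
            if (c == 'c' && i + 1 < lower.length() && lower.charAt(i + 1) == 'h') {
                letter = "ch";
                i++; // consume the 'h' so it is not counted on its own
            }
            counts.merge(letter, 1, Integer::sum);
            total++;
        }
        // TreeMap orders keys by code point, which is not Czech alphabetical order.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : frequencies("Chata u lesa").entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```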
References
- KRÁLÍK, Jan. Czech Alphabet. The Czech Language [online]. 2001 [cited 2012-09-18]. Available from: http://www.czech-language.cz/alphabet/alph-prehled.html