Information Retrieval Systems
INSYS 300
Lecture Notes by Dr. Xia Lin
College of Information Science and Technology
Drexel University

English Letter Usage Statistics

Sample text: _A Tale of Two Cities_ by Charles Dickens (with Roman numeral chapter numbers removed).

Total letter count = 586747 (hyphens and apostrophes included)

Letter use frequencies (count, then percent of the total, truncated to one decimal place):

E: 72881 12.4%

T: 52397 8.9%

A: 47072 8.0%

O: 45116 7.6%

N: 41316 7.0%

I: 39710 6.7%

H: 38334 6.5%

S: 36770 6.2%

R: 35946 6.1%

D: 27487 4.6%

L: 21479 3.6%

U: 16218 2.7%

M: 14928 2.5%

W: 13835 2.3%

C: 13223 2.2%

F: 13152 2.2%

G: 12121 2.0%

Y: 11849 2.0%

P: 9452 1.6%

B: 8163 1.3%

V: 5044 0.8%

K: 4631 0.7%

-: 2327 0.3%

': 1168 0.1%

Q: 655 0.1%

X: 637 0.1%

J: 623 0.1%

Z: 213 0.0%
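
A table like the one above can be produced with a short script. The following is a minimal sketch, not the program used for these notes: the function names letter_frequencies and truncated_percent are hypothetical, and it assumes that case is ignored and that only A-Z, the hyphen, and the apostrophe are counted. The percentages are truncated rather than rounded because that matches the table (for example, O is 45116 / 586747, about 7.69%, and is listed as 7.6%).

```python
import string
from collections import Counter

# Assumption: the symbols counted are A-Z plus hyphen and apostrophe.
COUNTED = set(string.ascii_letters + "-'")

def truncated_percent(count, total):
    """Percent of total, truncated (not rounded) to one decimal place."""
    return count * 1000 // total / 10

def letter_frequencies(text):
    """Return (total, rows), where each row is (symbol, count, percent).

    Rows are sorted from most to least frequent. Case is ignored.
    """
    counts = Counter(ch.upper() for ch in text if ch in COUNTED)
    total = sum(counts.values())
    rows = [(sym, n, truncated_percent(n, total))
            for sym, n in counts.most_common()]
    return total, rows

if __name__ == "__main__":
    total, rows = letter_frequencies("It was the best of times")
    print("Total letter count =", total)
    for sym, n, pct in rows:
        print(f"{sym}: {n} {pct}%")
```

To run it on a full text, read the file into a string and pass it to letter_frequencies; removing chapter numerals, as was done for the sample, would be a separate preprocessing step.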

Doubled letter frequencies (count, then percent of all doubled pairs):

LL: 2979 20.6%

EE: 2146 14.8%

SS: 2128 14.7%

OO: 2064 14.3%

TT: 1169 8.1%

RR: 1068 7.4%

--: 701 4.8%

PP: 628 4.3%

FF: 430 2.9%

NN: 301 2.0%

CC: 243 1.6%

MM: 207 1.4%

DD: 201 1.3%

GG: 99 0.6%

BB: 41 0.2%

ZZ: 13 0.0%

AA: 2 0.0%

HH: 1 0.0%
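
Doubled letters can be counted in a similar way. This is a sketch under stated assumptions, not the method used for the figures above: the function name doubled_frequencies is hypothetical, case is ignored, only A-Z and the hyphen are considered, and every adjacent identical pair is counted, so a run of three identical characters contributes two pairs. The notes do not say how such runs were handled, so counts from this sketch may differ slightly from the table.

```python
import string
from collections import Counter

# Assumption: pairs are counted for A-Z and the hyphen only.
PAIR_SYMBOLS = set(string.ascii_uppercase + "-")

def doubled_frequencies(text):
    """Return (total, rows), where each row is (pair, count, percent).

    Percent is the pair's share of all doubled pairs found.
    Overlapping pairs are counted (an assumption): "AAA" yields two "AA".
    """
    counts = Counter()
    for a, b in zip(text, text[1:]):
        a, b = a.upper(), b.upper()
        if a == b and a in PAIR_SYMBOLS:
            counts[a + b] += 1
    total = sum(counts.values())
    rows = [(pair, n, 100.0 * n / total) for pair, n in counts.most_common()]
    return total, rows

if __name__ == "__main__":
    total, rows = doubled_frequencies("Hello--well, too")
    for pair, n, pct in rows:
        print(f"{pair}: {n} {pct:.1f}%")
```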