256 ANDREWS UNIVERSITY SEMINARY STUDIES 59 (FALL 2021) The next step is to determine the dictionary for the corpus, which is the list of unique words across all the documents. For the simplified corpus above, we get as the dictionary: Sp = {elpnvn, Beg, xal, x0plog, TATNP, OU, XAPLS, XPLOTOS, ATO, EY, inools, wera, mas, dyanaw, adbapaia, év, 6, mvelual In table 2, we show, for each document, the tuples (w, f) where w indicates a word in the dictionary and findicates the frequency of that word in a particular document. Table 2. Frequency count for the tokens in the simplified corpus. 1 Corinthians 1:3 Ephesians 6:24 ~~ Philemon 6:24 (elpnvn, 1), (xapts, 1), (xUprog, 1), (Beds, 1), (év, 1), (xapts, 1), (xal, 2), (¢yw, 1), (xproTos, 1), (bprog, 1), (xbprog, 1), (noobs, 1), (matrp, 1), (etd, 1), (aU, 1), (xpi, 1), (még, 1), (nett, 1), (xproTés, 1), (6, 3), (6, 3), (émo, 1), (xproTos, 1), (mvedpa, 1) (av, 1), (Gyamaw, 1), (noob, 1), (Gpbapaia, 1), (gyw, 1) (noobs, 1) The TF-IDF model is so named because the calculation of the matrix element, which will be explained later, comprises two steps: (1) calculating the term frequency of a token, and (2) calculating the inverse document frequency of a token. The simplest choice for the term frequency is the raw count of a token per document." In this paper, the term frequency of a token (Wi) in document (d;) is defined as: ( Nw, d; tf (wi, d;) L(d) where: 1. N(w, d;) is the number of times that token 7 appears in document J. 2. L{d;} is the length of the document. The length is the number of tokens in the document and thus, in our case, simply the word count. The term frequency is both “token” and “document” dependent, and it is a local parameter. The inverse document frequency part of the matrix element aims to increase the contribution of words that are rare and decrease the contribution of common words. It is defined as: 7 L. H. Peter, “A Statistical Approach to Mechanized Encoding and Searching of Literary Information,” IBM Journal of Research and Development 1.4 (1975): 309-317.