Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a document relative to a collection of documents (the corpus). It is also a method for vectorizing unstructured text into structured tabular data.

TF-IDF plays an important role in tasks such as text mining, information retrieval (including the retrieval step of RAG pipelines), and document classification.

Term Frequency (TF)

Measures the frequency of a term $T$ within a document $D$.

$\mathrm{Freq}_T(D)$: number of times that a term $T$ appears in a document $D$.
$\mathrm{TotalTerms}(D)$: total number of terms in a document $D$.

$$\mathrm{TF}(T, D) = \frac{\mathrm{Freq}_T(D)}{\mathrm{TotalTerms}(D)}$$
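The TF formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the document has already been split into tokens (here by whitespace, which is a simplification). The function name `term_frequency` and the sample sentence are hypothetical.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """TF(T, D) = Freq_T(D) / TotalTerms(D) for a tokenized document."""
    if not document_tokens:
        # Assumption: an empty document gives every term a frequency of 0.
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Hypothetical document, tokenized naively on whitespace.
doc = "the cat sat on the mat".split()
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 tokens
tf_dog = term_frequency("dog", doc)  # term absent from the document
```

Because TF divides by the document length, long and short documents become comparable: a term that makes up a third of a short note scores the same as one that makes up a third of a long report.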

Inverse Document Frequency (IDF)

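Measures the rarity of a term across the corpus $C$. This part penalizes words that are common across all documents within the collection.

$N$: total number of documents in the collection $C$.
$\mathrm{DocFreq}(T)$: number of documents in which $T$ is present.

$$\mathrm{IDF}(T, C) = \ln\left(\frac{N}{\mathrm{DocFreq}(T)}\right)$$

A common variant adds 1 to the result, so that terms occurring in every document (for which the logarithm is zero) are not discarded entirely:

$$\mathrm{IDF}(T, C) = \ln\left(\frac{N}{\mathrm{DocFreq}(T)}\right) + 1$$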
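Both IDF formulas above can be sketched as one function with a switch for the added constant. This is an illustrative sketch, not a reference implementation. The name `inverse_document_frequency`, the `add_one` flag, and the three-document corpus are assumptions made for the example.

```python
import math


def inverse_document_frequency(term, corpus, add_one=False):
    """IDF(T, C) = ln(N / DocFreq(T)), optionally with 1 added to the result."""
    n_docs = len(corpus)
    # DocFreq(T): number of documents that contain the term at least once.
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        # The unsmoothed formula is undefined for a term seen in no document.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    idf = math.log(n_docs / doc_freq)
    return idf + 1.0 if add_one else idf


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]
idf_the = inverse_document_frequency("the", corpus)  # in all 3 documents
idf_cat = inverse_document_frequency("cat", corpus)  # in 1 of 3 documents
idf_the_plus_one = inverse_document_frequency("the", corpus, add_one=True)
```

In this sketch a term present in every document gets an IDF of exactly 0 under the first formula and 1 under the second, while a term found in a single document gets $\ln 3$ under the first.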

Smooth IDF

Smooth IDF adds one to both the document count and each document frequency, as if an extra document containing every term in the collection exactly once had been seen. This prevents division by zero, which would otherwise occur for a term that appears in no document.

$$\mathrm{IDF}(T, C) = \ln\left(\frac{N + 1}{\mathrm{DocFreq}(T) + 1}\right) + 1$$
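The smoothed formula can be sketched the same way. The function name `smooth_idf` and the sample corpus are hypothetical. The point of the example is that a term absent from every document still receives a finite weight.

```python
import math


def smooth_idf(term, corpus):
    """Smooth IDF: ln((N + 1) / (DocFreq(T) + 1)) + 1."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    # The +1 in the denominator keeps the ratio defined even if doc_freq is 0.
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]
idf_unseen = smooth_idf("fish", corpus)  # DocFreq = 0, still well defined
idf_common = smooth_idf("the", corpus)   # DocFreq = N, so the log term is 0
```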

Combining TF-IDF

Multiplying TF and IDF gives the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.

$$\mathrm{TF\text{-}IDF}(T, D, C) = \mathrm{TF}(T, D) \times \mathrm{IDF}(T, C)$$
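To show how the product turns raw text into a table, the sketch below builds a small TF-IDF matrix with one row per document and one column per vocabulary term. It assumes whitespace tokenization and the plain $\ln(N / \mathrm{DocFreq}(T))$ form of IDF. The function name `tf_idf_matrix` and the corpus are hypothetical, and library implementations may differ in details such as smoothing or normalization.

```python
import math
from collections import Counter


def tf_idf_matrix(corpus):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF of vocabulary[j] in corpus[i]."""
    n_docs = len(corpus)
    vocabulary = sorted({term for doc in corpus for term in doc})
    # DocFreq(T): count each term once per document that contains it.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    # Plain IDF (an assumption of this sketch): ln(N / DocFreq(T)).
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    rows = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        # TF(T, D) * IDF(T, C); an empty document yields a row of zeros.
        rows.append(
            [(counts[term] / total) * idf[term] if total else 0.0 for term in vocabulary]
        )
    return vocabulary, rows


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird sang".split(),
]
vocabulary, rows = tf_idf_matrix(corpus)
scores_doc0 = dict(zip(vocabulary, rows[0]))  # scores for the first document
```

In this toy corpus "the" appears in every document, so its score is 0 everywhere even though it is the most frequent token, while "cat", which appears only in the first document, receives the highest weight in that row alongside "mat".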
