Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a document relative to a collection of documents (also called the corpus). It is a method for vectorizing unstructured text data into structured tabular data.

TF-IDF plays an important role in tasks such as text mining, information retrieval (including the retrieval step of retrieval-augmented generation, RAG), and document classification.

Term Frequency (TF)

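Measures the frequency of a term $t$ within a document $d$.

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

  • $f_{t,d}$: number of times that the term $t$ appears in the document $d$.
  • $\sum_{t' \in d} f_{t',d}$: total number of terms in the document $d$.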
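
As a minimal sketch of the term-frequency ratio described above (count of a term divided by the total number of terms in the document), the following Python snippet computes TF for every term in one document. The function name `term_frequency` and the lowercase, whitespace-based tokenization are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return count(term) / total_terms for every term in the document."""
    # Assumption: lowercasing and splitting on whitespace stands in for
    # a real tokenizer.
    tokens = document.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
# "the" occurs 2 times out of 6 terms; every other term occurs once.
print(tf["the"], tf["cat"])
```

Because each count is divided by the document length, the TF values of one document sum to 1, which makes long and short documents comparable.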

Inverse Document Frequency (IDF)

Measures the rarity of a term $t$ across the corpus $D$. This part penalizes words that are common across the documents in the collection.

$$\mathrm{idf}(t, D) = \log\frac{N}{\mathrm{df}(t)}$$

  • $N$: total number of documents in the collection $D$.
  • $\mathrm{df}(t)$: number of documents where $t$ is present.
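
The sketch below illustrates the idea with the common unsmoothed form, the logarithm of the total number of documents divided by the number of documents containing the term. The natural logarithm, the function name `inverse_document_frequency`, and the whitespace tokenization are assumptions made for the example.

```python
import math


def inverse_document_frequency(corpus: list[str]) -> dict[str, float]:
    """Return log(N / df) for every term that appears in the corpus."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for document in corpus:
        # set() so a term is counted once per document, however often it repeats.
        for term in set(document.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


corpus = ["the cat sat", "the dog barked", "the cat purred"]
idf = inverse_document_frequency(corpus)
# "the" is in all 3 documents, "cat" in 2, "dog" in 1.
print(idf["the"], idf["cat"], idf["dog"])
```

A term present in every document gets log(1) = 0, so it contributes nothing, while the rarest terms receive the largest weights.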

Smooth IDF

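Smooth IDF adds one to the document frequencies, as if an extra document had been seen that contains every term in the collection exactly once. With $N$ the total number of documents and $\mathrm{df}(t)$ the number of documents containing the term $t$:

$$\mathrm{idf}_{\text{smooth}}(t) = \log\frac{1 + N}{1 + \mathrm{df}(t)}$$

This approach prevents division by zero when a term appears in no document of the collection ($\mathrm{df}(t) = 0$).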
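
To make the effect concrete, here is a small sketch that assumes the smoothed form log((1 + N) / (1 + df)), where N is the number of documents and df is the number of documents containing the term. Some libraries add a further constant on top of this. The helper name `smooth_idf` is hypothetical.

```python
import math


def smooth_idf(term: str, corpus: list[str]) -> float:
    """Return log((1 + N) / (1 + df)) for a single term."""
    n_docs = len(corpus)
    # Number of documents whose whitespace tokens include the term.
    df = sum(1 for document in corpus if term in document.lower().split())
    return math.log((1 + n_docs) / (1 + df))


corpus = ["the cat sat", "the dog barked", "the cat purred"]
unseen = smooth_idf("bird", corpus)  # df = 0: unsmoothed IDF would divide by zero
common = smooth_idf("the", corpus)   # df = 3: log(4 / 4)
print(unseen, common)
```

The unseen term "bird" gets a finite weight instead of raising an error, which matters when scoring queries or new documents that contain out-of-vocabulary words.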

Combining TF-IDF

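Multiplying TF and IDF gives the TF-IDF score of a term $t$ in a document $d$ of the corpus $D$:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

The higher the score, the more relevant the term is to that particular document: it appears often in the document and rarely in the rest of the corpus.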
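
Putting the pieces together, the following sketch turns a small corpus into a table with one row per document and one column per vocabulary term. It assumes whitespace tokenization, a natural logarithm, and the unsmoothed IDF log(N / df); the function name `tfidf_matrix` is illustrative.

```python
import math
from collections import Counter


def tfidf_matrix(corpus: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, rows), where rows[i][j] is the TF-IDF score of
    vocabulary[j] in document i."""
    tokenized = [document.lower().split() for document in corpus]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Guard against empty documents; otherwise score is TF times IDF.
        rows.append(
            [(counts[term] / total) * idf[term] if total else 0.0
             for term in vocabulary]
        )
    return vocabulary, rows


corpus = ["the cat sat", "the dog barked", "the cat purred"]
vocabulary, rows = tfidf_matrix(corpus)
for document, row in zip(corpus, rows):
    print(document, [round(score, 3) for score in row])
```

In this toy corpus, "the" scores zero in every row because it occurs in all documents, while a term unique to one document, such as "sat", scores higher than "cat", which is shared by two.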