What does TF-IDF do in Python? [With Examples]

TF-IDF (term frequency-inverse document frequency) is a numerical statistic commonly used to measure how relevant a term is to a document within a corpus.

How do you calculate TF-IDF in Python?

Here’s an example of how to calculate TF-IDF using Python’s scikit-learn library:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "I love coding",
    "Coding is fun",
    "Coding is my passion",
    "I enjoy programming"
]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (terms); get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF values for each term in each document
for doc_index, doc in enumerate(documents):
    print("Document:", doc)
    for term_index, term in enumerate(feature_names):
        tfidf_value = tfidf_matrix[doc_index, term_index]
        if tfidf_value > 0:
            print("   Term:", term, "   TF-IDF:", tfidf_value)

In this example, we have a list of documents represented by strings. The TfidfVectorizer class is used to convert these documents into a matrix of TF-IDF features. We fit and transform the documents using vectorizer.fit_transform(documents).

After that, the TF-IDF scores are available in tfidf_matrix. We can also get the feature names (terms) from the vectorizer: use get_feature_names_out() in scikit-learn 1.0 and later, since the older get_feature_names() was removed in version 1.2. Then, we iterate over each document and term and print every non-zero TF-IDF value.

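The TF-IDF value quantifies the importance of a term within a document. A value is high when the term occurs often in that document but in few other documents of the corpus, so higher values indicate that a term is more characteristic of the document.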

What does TfidfVectorizer mean in Python?

The TfidfVectorizer class, provided by the scikit-learn library (sklearn), performs TF-IDF vectorization: it converts a collection of text documents into a matrix of TF-IDF features.

TF-IDF vectorization combines two quantities: term frequency (TF) and inverse document frequency (IDF).

Term Frequency (TF) denotes the frequency of a term (word) within a document. It is computed by counting the occurrences of a term in a specific document.
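As a rough illustration of term frequency, the sketch below counts terms with the standard library only. The helper name term_frequency and the simple whitespace tokenization are assumptions made for this example; they are not how scikit-learn tokenizes text.

```python
from collections import Counter


def term_frequency(document):
    """Count how often each lowercase, whitespace-separated token occurs."""
    tokens = document.lower().split()
    return Counter(tokens)


tf = term_frequency("Coding is fun and coding is my passion")
print(tf["coding"])  # 2
print(tf["fun"])     # 1
```

A Counter returns 0 for terms that never occur, which is convenient when looking up words that are missing from a document.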

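Inverse Document Frequency (IDF) measures how rare, and therefore how informative, a term is across the entire document corpus. In its basic form it is the logarithm of the total number of documents divided by the number of documents that contain the term, so terms appearing in almost every document get a low IDF. With its default settings, scikit-learn uses a smoothed variant of this formula.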
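The following sketch computes IDF by hand for the example documents used earlier. The function names are hypothetical. idf_textbook follows the basic log(N / df) definition, while idf_smoothed follows the smoothed formula ln((1 + N) / (1 + df)) + 1 that scikit-learn documents for its default smooth_idf=True setting.

```python
import math

documents = [
    "I love coding",
    "Coding is fun",
    "Coding is my passion",
    "I enjoy programming",
]


def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc.lower().split())


def idf_textbook(term, docs):
    # Basic definition: log(N / df). Fails if no document contains the term.
    return math.log(len(docs) / document_frequency(term, docs))


def idf_smoothed(term, docs):
    # Smoothed variant: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + len(docs)) / (1 + document_frequency(term, docs))) + 1


for term in ("coding", "fun"):
    print(term, round(idf_textbook(term, documents), 3),
          round(idf_smoothed(term, documents), 3))
```

"coding" appears in three of the four documents and "fun" in only one, so "fun" receives the higher IDF under both formulas.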

The TfidfVectorizer class takes care of these calculations and performs the following tasks:

  • Tokenization: It breaks down the input text into individual words or terms.
  • Counting: It counts the frequency of each term in each document.
  • TF-IDF Calculation: It calculates the TF-IDF score for each term in each document using the formula: TF-IDF = (term frequency in a document) * (inverse document frequency of the term)
  • Normalization: It normalizes the TF-IDF scores to have a unit norm, which can be useful for comparing documents.
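The four steps above can be sketched in plain Python. This is a simplified approximation under stated assumptions, not scikit-learn's actual implementation: it lowercases the text, keeps tokens of two or more word characters (mirroring the default token pattern), uses the smoothed IDF formula and applies L2 normalization. All variable and function names are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "I love coding",
    "Coding is fun",
    "Coding is my passion",
    "I enjoy programming",
]

# 1. Tokenization: lowercase, keep tokens with at least two word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in documents]

# 2. Counting: term frequency per document, plus a sorted vocabulary
counts = [Counter(tokens) for tokens in tokenized]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# 3. TF-IDF calculation with the smoothed IDF: ln((1 + N) / (1 + df)) + 1
n_docs = len(documents)
doc_freq = {term: sum(1 for c in counts if term in c) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary}
raw = [{term: c[term] * idf[term] for term in c} for c in counts]


# 4. Normalization: scale each document's scores to unit L2 norm
def l2_normalize(weights):
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {term: v / norm for term, v in weights.items()}


tfidf = [l2_normalize(weights) for weights in raw]

for doc, weights in zip(documents, tfidf):
    print(doc, {term: round(v, 3) for term, v in sorted(weights.items())})
```

Because "I" is a single character, it is dropped during tokenization, which is why it never shows up as a term in the output.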

The resulting output of the TfidfVectorizer is a matrix where each row represents a document, and each column represents a term (word).

The values in the matrix correspond to the TF-IDF scores of the terms in the documents.

By using the TfidfVectorizer, you can convert a collection of documents into a numerical representation that can be used as input for machine learning tasks such as clustering, classification, or information retrieval.
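As one possible illustration of the information retrieval case, the sketch below ranks the example documents against a query by cosine similarity. The query string is made up for this example, and cosine similarity is just one common choice of ranking function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love coding",
    "Coding is fun",
    "Coding is my passion",
    "I enjoy programming",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Hypothetical query; transform() reuses the vocabulary learned above
query = "I enjoy coding"
query_vector = vectorizer.transform([query])

# One similarity score per document
scores = cosine_similarity(query_vector, tfidf_matrix).ravel()
best = int(scores.argmax())

for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
print("Best match:", documents[best])
```

Note that the query goes through transform(), not fit_transform(), so it is scored with the IDF values learned from the original documents.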

  • Dmytro Iliushko

    I am a middle python software engineer with a bachelor's degree in Software Engineering from Kharkiv National Aerospace University. My expertise lies in Python, Django, Flask, Docker, REST API, Odoo development, relational databases, and web development. I am passionate about creating efficient and scalable software solutions that drive innovation in the industry.
