
Content-Based Recommender System Built Using a Vector Space Model

Question

A content-based recommender system is built using a vector space model. Given document vector $A = (1, 0, 1)$, compute the cosine similarity with other document vectors and rank the recommendations.

Computing cosine similarity in a vector space model for recommendations.

Similarity Formula

cos(θ) = (A · B) / (||A|| · ||B||) — higher values = more similar

Step 1: Recall Cosine Similarity

Cosine similarity between vectors $A$ and $B$ is:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

where $A \cdot B$ is the dot product and $\|A\|$ is the Euclidean norm. Values range from $-1$ (opposite direction) to $1$ (identical direction).

Why Cosine Similarity?

In information retrieval, cosine similarity measures the angle between document vectors regardless of their magnitude. A long document and a short document on the same topic receive a high score because only the **direction** of the vector matters, not its length.
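To make the formula concrete, here is a minimal pure-Python sketch of cosine similarity (the function name `cosine_similarity` is illustrative, not from any particular library):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scaling a vector does not change the score: direction matters, not length.
print(cosine_similarity((1, 0, 1), (3, 0, 3)))  # 1.0
```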

Step 2: Compute the Norm of A

Given query vector $A = (1, 0, 1)$:

$$\|A\| = \sqrt{1^2 + 0^2 + 1^2} = \sqrt{2} \approx 1.414$$
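As a quick numerical check, the same computation in Python:

```python
import math

A = (1, 0, 1)
print(math.sqrt(sum(x * x for x in A)))  # 1.4142135623730951, i.e. √2
```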

Step 3: Compute Dot Products

For each candidate document vector $B_i$, the dot product simplifies because the middle component of $A$ is zero:

$$A \cdot B_i = 1 \cdot b_1 + 0 \cdot b_2 + 1 \cdot b_3 = b_1 + b_3$$
| Document | Vector B | A·B | ‖B‖ | cos(θ) |
|----------|----------|-----|-----|--------|
| Doc 1 | (1, 1, 0) | 1 | √2 | 0.500 |
| Doc 2 | (1, 0, 1) | 2 | √2 | 1.000 |
| Doc 3 | (0, 1, 1) | 1 | √2 | 0.500 |
| Doc 4 | (1, 1, 1) | 2 | √3 | 0.816 |
| Doc 5 | (0, 0, 1) | 1 | 1 | 0.707 |
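The table values can be reproduced by looping over the candidates, reusing the `cosine_similarity` sketch from Step 1 (the document vectors below simply restate the table):

```python
A = (1, 0, 1)
docs = {
    "Doc 1": (1, 1, 0),
    "Doc 2": (1, 0, 1),
    "Doc 3": (0, 1, 1),
    "Doc 4": (1, 1, 1),
    "Doc 5": (0, 0, 1),
}

for name, B in docs.items():
    dot = sum(x * y for x, y in zip(A, B))
    print(f"{name}: A·B = {dot}, cos = {cosine_similarity(A, B):.3f}")
# Doc 1: A·B = 1, cos = 0.500
# Doc 2: A·B = 2, cos = 1.000
# Doc 3: A·B = 1, cos = 0.500
# Doc 4: A·B = 2, cos = 0.816
# Doc 5: A·B = 1, cos = 0.707
```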

Step 4: Rank by Similarity

Ranking documents by descending cosine similarity gives the recommendation order:
1st: Doc 2 (1, 0, 1), cos = 1.000
2nd: Doc 4 (1, 1, 1), cos = 0.816
3rd: Doc 5 (0, 0, 1), cos = 0.707
4th (tie): Doc 1 (1, 1, 0), cos = 0.500
4th (tie): Doc 3 (0, 1, 1), cos = 0.500

Top recommendation: Doc 2
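Continuing the same sketch, the ranking is a single `sorted` call; Python's sort is stable, so the tied documents (Doc 1 and Doc 3) keep their original order:

```python
ranked = sorted(docs.items(),
                key=lambda kv: cosine_similarity(A, kv[1]),
                reverse=True)
for rank, (name, vec) in enumerate(ranked, start=1):
    print(f"{rank}. {name} {vec}")
# 1. Doc 2 (1, 0, 1)   <- top recommendation
# 2. Doc 4 (1, 1, 1)
# 3. Doc 5 (0, 0, 1)
# 4. Doc 1 (1, 1, 0)
# 5. Doc 3 (0, 1, 1)
```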

Step 5: Key Concepts

TF-IDF Weighting

In practice, vector components use TF-IDF weights (Term Frequency × Inverse Document Frequency) rather than binary 0/1 values. This gives higher weight to terms that are distinctive to a document.
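As a hedged sketch of what this looks like in practice, scikit-learn's `TfidfVectorizer` builds TF-IDF document vectors directly (the three-document corpus here is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "vector space models for document retrieval",   # the "query" document
    "vector space retrieval and ranking",
    "neural networks for image recognition",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# Cosine similarity of the first document against all documents;
# the topically related second document scores far above the third.
print(cosine_similarity(tfidf[0], tfidf))
```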

Sparsity Optimization

Real document vectors have thousands of dimensions (one per vocabulary term) but are extremely sparse. Inverted indices skip zero-valued dimensions, making cosine similarity fast even for large vocabularies.
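A toy illustration of the idea, assuming binary weights and sparse vectors stored as dicts keyed by term index (real systems use far more optimized inverted-index structures):

```python
# Sparse query vector: only nonzero dimensions are stored.
A = {0: 1.0, 2: 1.0}  # corresponds to (1, 0, 1)

# Inverted index: term index -> postings list of (doc id, weight).
index = {
    0: [("Doc 1", 1.0), ("Doc 2", 1.0), ("Doc 4", 1.0)],
    1: [("Doc 1", 1.0), ("Doc 3", 1.0), ("Doc 4", 1.0)],
    2: [("Doc 2", 1.0), ("Doc 3", 1.0), ("Doc 4", 1.0), ("Doc 5", 1.0)],
}

# Accumulate dot products by walking only the query's nonzero terms;
# documents sharing no terms with the query are never touched.
scores = {}
for term, q_weight in A.items():
    for doc, d_weight in index.get(term, []):
        scores[doc] = scores.get(doc, 0.0) + q_weight * d_weight
print(scores)  # matches the A·B column; dividing by the norms gives cos(θ)
```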

Quiz

Test your understanding with these questions.

1. What is the Euclidean norm of vector $A = (1, 0, 1)$?
2. Cosine similarity ranges between which values?