sql - How to create similarity matrix in Google BigQuery like pdist in MATLAB? -
in matlab , python (scipy), there function (pdist) return pairwise distances between every row of given matrix.
so table in bigquery:
a = user1 | 0 0 | user2 | 0 3 | user3 | 4 0 |
should return
user1 user2 user3 dist = user1 | 0 3 4 | user2 | 3 0 5 | user3 | 4 5 0 |
or variant (perhaps without diagonal , upper or lower half of matrix since redundant.)
the pairs columns acceptable (the approach (my guess far) use self join, not sure how iterate on columns - for example have ~3000 columns). solution like:
dist = |user1 user2 3 | |user1 user3 4 | |user2 user3 5 |
also distance metric between users, don't wan't euclidean distance example here, general distance. 1 such distance
sum(min(user1_d, user2_d) / diff(user1_d - user2_d)) d dimensions between 2 users.
has found google bigquery solution this?
there 2 answers:
- you can wth cross join, , either build query programatically compute distance function, or use user-defined functions feature in closed alpha release.
please judicious using cross join large datasets. cross joins n^2 work, , can generate n^2 output. cross join of 2 smallish tables million rows each generates trillion rows of output. cross joining 2 billion row tables generates sestillion rows of output.
if do cross joins of large datasets, should try pre-aggregate or filter reduce number of rows need joined.