sql - How to create similarity matrix in Google BigQuery like pdist in MATLAB? -


in matlab , python (scipy), there function (pdist) return pairwise distances between every row of given matrix.

so table in bigquery:

a = user1 | 0 0 |     user2 | 0 3 |     user3 | 4 0 | 

should return

              user1 user2 user3 dist = user1 |  0      3     4   |         user2 |  3      0     5   |         user3 |  4      5     0   |  

or variant (perhaps without diagonal , upper or lower half of matrix since redundant.)

the pairs columns acceptable (the approach (my guess far) use self join, not sure how iterate on columns - for example have ~3000 columns). solution like:

dist = |user1  user2  3 |         |user1  user3  4 |         |user2  user3  5 | 

also distance metric between users, don't wan't euclidean distance example here, general distance. 1 such distance

sum(min(user1_d, user2_d) / diff(user1_d - user2_d)) d dimensions between 2 users. 

has found google bigquery solution this?

there 2 answers:

  1. you can wth cross join, , either build query programatically compute distance function, or use user-defined functions feature in closed alpha release.
  2. please judicious using cross join large datasets. cross joins n^2 work, , can generate n^2 output. cross join of 2 smallish tables million rows each generates trillion rows of output. cross joining 2 billion row tables generates sestillion rows of output.

    if do cross joins of large datasets, should try pre-aggregate or filter reduce number of rows need joined.


Popular posts from this blog