python - Scraped data into two MongoDB collections--now how do I compare the results?


Complete MongoDB/database noob here, so any tips are appreciated. I scraped data using Scrapy straight into a locally-hosted MongoDB server. I want to compare the "price" data in one collection with the "price" data in the other collection; the "name" field is the same across both collections. What's the best way of doing this?

Sloppy screenshot of the data here: [image]

Unfortunately, you can't directly compare between two collections in Mongo without peppering in some fancy JavaScript.

Here's an example of how to accomplish that: https://stackoverflow.com/a/9240952/4760274

Since you're using Scrapy, and seemingly aren't comfortable with crazy MongoDB internals, it's easy enough to whip up a Python script for the evaluation:

    import pymongo

    # connect to the locally-hosted MongoDB instance
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn['databasename']

    # walk the first collection and look up the matching document in the second
    for item in db.collection1.find():
        _id = item['_id']
        item2 = db.collection2.find_one({'_id': _id})
        print("{}: {}, {}: {}, diff: {}, a>b?: {}".format(
            item['name'], item['price'], item2['name'],
            item2['price'], item['price'] - item2['price'],
            item['price'] > item2['price']))

Finally, you can modify your Scrapy modules to insert both into the same collection, tweaking the field names to identify the distinct values from the different sources and letting Mongo coalesce them. With everything in a single collection you can run a simpler query to compare prices:

    db.unified_collection.find({$where: "this.price1 > this.price2"})

(This doesn't give you the difference in a single query the way a SQL query could.)
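If you do want the difference in one query, the aggregation framework can compute it instead of $where; a minimal pymongo sketch, assuming the price1/price2 field names from the query above and the same hypothetical database name as the earlier script:

    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    db = client['databasename']

    # project the two prices and their difference, then keep rows where price1 > price2
    pipeline = [
        {'$match': {'price1': {'$exists': True}, 'price2': {'$exists': True}}},
        {'$project': {
            'name': 1,
            'price1': 1,
            'price2': 1,
            'diff': {'$subtract': ['$price1', '$price2']},
        }},
        {'$match': {'diff': {'$gt': 0}}},
    ]

    for doc in db.unified_collection.aggregate(pipeline):
        print("{name}: {price1} vs {price2}, diff: {diff}".format(**doc))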

Edit: the port must be an int :)

Update: it's wise to note that the comparison above^ assumes you're setting the id yourself rather than using Mongo's auto-generated _id (which it appears you may be using). Since that _id is randomly generated, there's no relation between two otherwise identical entities. In order to match them with either approach mentioned above (the script, or having separate crawlers use the same data model), you'll need something to qualify uniqueness on in order to make a sane comparison between the two sources.
From the image of the data, the safest bet looks like the "name" field, but if there's even a slight amount of variance you're going to get insufficient results. Whether you're iterating through the two collections and comparing, or coalescing them into one, you'll need a rule to clean and match the names (regex, soundex, other string manipulation tricks). If you do this on the crawler/model side, you'd need to make the unified collection unique on that field, and a hash of the cleaned name would make a good candidate value (so the original values stay intact).
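A rough sketch of that idea, assuming hypothetical clean_name/merge_item helpers and the same local database as above; the hash of the cleaned name becomes the _id, so both crawlers upsert into the same document while the original name is kept:

    import hashlib
    import re

    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    db = client['databasename']

    def clean_name(name):
        # lowercase, drop punctuation, collapse whitespace
        cleaned = re.sub(r'[^a-z0-9 ]', '', name.lower())
        return re.sub(r'\s+', ' ', cleaned).strip()

    def merge_item(item, price_field):
        # price_field is e.g. 'price1' or 'price2', depending on which crawler ran
        key = hashlib.md5(clean_name(item['name']).encode('utf-8')).hexdigest()
        db.unified_collection.update_one(
            {'_id': key},
            {'$set': {'name': item['name'], price_field: item['price']}},
            upsert=True,
        )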

Another option is SQL, which is useful for the kind of analytic tests you're doing, but you'd again face the problem of how to relate the records (or better, how to manipulate them so they relate), plus the holdups of schema changes/migrations (and the lack of ability to store miscellaneous data as it becomes available).

