python - Scraped data into two MongoDB collections -- now how do I compare the results?
Complete MongoDB/database noob here, so any tip is appreciated. I scraped data using Scrapy straight into a locally-hosted MongoDB server. I want to compare the "price" data in one collection with the "price7" data in the other collection. The "name" field is the same across both collections. What's the best way of doing this?
Sloppy screenshot of the data here: [screenshot omitted]
Unfortunately, you can't compare directly between two collections in Mongo without peppering in some fancy JavaScript.
Here's an example of how to accomplish that: https://stackoverflow.com/a/9240952/4760274
Since you're using Scrapy, and seemingly aren't comfortable with crazy MongoDB internals, it's easy enough to whip up a Python script for the evaluation:
import pymongo

conn = pymongo.MongoClient('localhost', 27017)
db = conn['databasename']

for item in db.collection1.find():
    # look up the document with the same _id in the second collection
    item2 = db.collection2.find_one({'_id': item['_id']})
    print("{}: {}, {}: {}, diff: {}, a>b?: {}".format(
        item['name'], item['price'],
        item2['name'], item2['price'],
        item['price'] - item2['price'],
        item['price'] > item2['price']))
Finally, you can modify your Scrapy modules to insert both into the same collection, tweaking the field names to identify the distinct values coming from the different sources and letting Mongo coalesce them (a rough pipeline sketch follows below); with everything in a single collection you can run a simpler query to compare the prices:
db.unified_collection.find({$where: "this.price1 > this.price2"})
(This doesn't let you get the difference in a single query the way a SQL query could.)
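As a rough illustration of the coalescing approach (the pipeline class, the spider names 'site1'/'site2', and the 'databasename'/'unified_collection' names are all hypothetical stand-ins for your own project's names), a pymongo-backed Scrapy item pipeline might look something like this:

import pymongo

class UnifiedPricePipeline(object):
    # hypothetical mapping: spider name -> price field in the unified document
    PRICE_FIELDS = {'site1': 'price1', 'site2': 'price2'}

    def open_spider(self, spider):
        self.db = pymongo.MongoClient('localhost', 27017)['databasename']

    def process_item(self, item, spider):
        field = self.PRICE_FIELDS[spider.name]
        # upsert keyed on the item's name so both sources land in the same document
        self.db.unified_collection.update_one(
            {'_id': item['name']},
            {'$set': {'name': item['name'], field: item['price']}},
            upsert=True)
        return item

With both spiders feeding one collection this way, the $where query above works as-is.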
Edit: the port must be an int :)
Update: it's wise to note that the comparison above^ assumes you're setting the id yourself rather than using Mongo's generated _id (which it appears you may be using); since that is randomly generated, there's no relation between two otherwise identical entities. In order to match them with either approach mentioned above (the script, or having the separate crawlers use the same data model), you'll need something to qualify uniqueness on so you can make a sane comparison between the two sources.
From your image of the data, it looks like the safest bet is the "name" field, but if there's even a slight amount of variance you're going to get insufficient results. Whether you're iterating through the two collections and comparing, or coalescing them, you'll need a rule to clean and compare/match the names (regex, soundex, or other string manipulation tricks). If this is done on the crawler/model side, you'd need to make the unified collection unique on that field, and a hash of the cleaned names would make a good candidate value (so you can keep the original values intact).
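For example, as a sketch only (the cleaning rule below is a crude, hypothetical one and would need to reflect the actual variance in your names):

import hashlib
import re

def name_key(name):
    # crude cleaning rule: lowercase, drop punctuation, collapse whitespace
    cleaned = re.sub(r'\s+', ' ', re.sub(r'[^a-z0-9 ]', '', name.lower())).strip()
    # the hash of the cleaned name becomes the unique key;
    # the original name stays in the document untouched
    return hashlib.md5(cleaned.encode('utf-8')).hexdigest()

# both of these produce the same key, so the two sources coalesce into one document
assert name_key('Widget, Deluxe ') == name_key('widget deluxe')

In the pipeline sketch above you would then key the upsert on name_key(item['name']) instead of the raw name.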
Another option is SQL, which is useful for the kind of analytic tests you're doing, but again you face the problem of how to relate the data (or rather, how to manipulate it so it relates), plus the holdups of schema changes/migrations (and the lack of ability to store whatever misc. data is available).