One question to ask yourself: does all the data you want to track “belong” to users who would authorize your application for access? Or could it belong to any possible user on Twitter, with what gets tracked depending on your users’ interests?
Site Streams beta is difficult to get into right now, as it’s constrained.
The public streaming API allows you to track tweets “around” specific users who haven’t necessarily given your app access to work on their behalf (this is the follow parameter). It also lets you track specific terms, which could be provided by your end users (this is the track parameter).
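As a rough illustration of the follow and track options described above, here is a minimal Python sketch that builds the form parameters for a filtered public stream request. The endpoint URL, the helper name build_filter_params, and the sample IDs and terms are assumptions for illustration only; check the current streaming documentation for the exact endpoint and limits. It does not open a connection or handle OAuth signing.

```python
from urllib.parse import urlencode

# Assumed endpoint for the filtered public stream; verify against current docs.
FILTER_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_filter_params(user_ids=(), terms=()):
    """Build the form parameters for a filtered stream request (hypothetical helper).

    follow takes a comma-separated list of numeric user IDs;
    track takes a comma-separated list of terms or phrases.
    """
    params = {}
    if user_ids:
        params["follow"] = ",".join(str(u) for u in user_ids)
    if terms:
        params["track"] = ",".join(terms)
    if not params:
        raise ValueError("provide at least one user ID or term")
    return params


# Sample values only: two user IDs to follow, two terms supplied by end users.
params = build_filter_params(user_ids=[12, 783214], terms=["python", "streaming api"])
body = urlencode(params)  # form-encoded body for the POST request
```

The resulting body would be sent in an OAuth-signed POST to the endpoint, and the connection held open to read tweets as they arrive.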
The public streams, though, have a 1% cap: if the total volume of tweets matching your filters at a given moment exceeds 1% of the total volume of the firehose at that moment, you’ll only get the matching tweets up to that 1% cap, and you won’t necessarily know which tweets you missed during that period.
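To make the cap concrete, here is a small arithmetic sketch in Python. The function name delivered_under_cap and the volumes used are made up for illustration; real firehose volume varies from moment to moment and is not something your app can observe directly.

```python
def delivered_under_cap(matching, firehose_total, cap_percent=1):
    """Return (delivered, missed) for one moment in time (hypothetical helper).

    matching       -- tweets matching your follow/track filters in that moment
    firehose_total -- total tweets in the firehose in that same moment
    cap_percent    -- share of the firehose a public stream may deliver
    """
    cap = firehose_total * cap_percent // 100  # integer math avoids float rounding
    delivered = min(matching, cap)
    return delivered, matching - delivered


# Illustrative numbers: with 10,000 tweets in the firehose, the cap is 100.
busy = delivered_under_cap(matching=250, firehose_total=10_000)
quiet = delivered_under_cap(matching=40, firehose_total=10_000)
```

In the busy case 250 tweets match but only 100 fit under the cap, so 150 are dropped; in the quiet case all 40 matching tweets are delivered.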
For the kind of volume you’re thinking of, you might find it wiser to go to a certified partner provider of streaming data, such as Topsy, Gnip, or DataSift.
See [node:14935] for some more discussion you may find useful.