Review of Angel
The goal of the project is to develop an understanding of what is happening in a community. I am doing this by leveraging Twitter, which is moving in the direction of geo-location. Similar services are emerging all the time - Almost.at is today's example - but this one has to be open source and durable. The project can be broken up into a series of parts:
- Outreach, including invitations, a website, blog posts such as this one, statements, source code on GitHub and things of that nature.
- Basic infrastructure issues such as the Ruby on Rails foundation, the data model, aggregation of content, clustering, full text search, parts of speech tagging, geolocation and semantic analysis.
- A simple user interface including the search display, a map, login, administration, the marking up of interesting messages and match-making.
- Rich visualization using tools such as Processing, Papervision3D and native iPhone applications.
- Metrics to reflect on the project's impact, showing total happiness in a community and the like.
In my research so far I've found some interesting links related to many of these pieces. If you look at this map of the Twitterverse of related applications, or this mind map of Twitter, you will see that entire businesses exist around solving just one fragment of this problem.
One of the core goals is to aggregate from Twitter, Facebook and other sources on the web. Currently for aggregation I am using John Nunemaker's Twitter Gem. Twitter has serious rate limits, so it is only part of the answer. I've just started to try other services such as Yahoo's YQL service, and I've also been considering pinging FriendFeed, although I think I'll just stick with YQL. I will also have to ask Twitter to whitelist me for more queries soon.
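The rate-limit problem can be pictured as a request budget that decides when to fall back to another source. This is only a minimal sketch in Python, not the project's Ruby code: the class name, the `pick_source` helper, the limit and the window are all hypothetical, and no real Twitter or YQL quota is implied.

```python
from collections import deque


class RequestBudget:
    """Sliding-window budget: allow at most `limit` calls per `window_seconds`.

    Hypothetical helper for illustration; the numbers are made up.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self, now):
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False


def pick_source(twitter_budget, now):
    """Use Twitter while budget remains, otherwise some fallback source."""
    return "twitter" if twitter_budget.try_acquire(now) else "fallback"
```

Passing `now` in explicitly (rather than reading the clock inside the class) keeps the sketch easy to test; a real poller would presumably pass `time.time()`.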
For text analysis I have a few different options. My goal is to tear sentences apart into subject, noun and verb, and to pull out hashtags, URLs and geo-location information. Basically I am trying to extract as much meaning from the text fragments as possible as a prelude to further analysis. I even want to track whether a message is a reply to some other message.
For starters I put a traditional full text search engine directly into the application, based on the Write an Internet Search Engine in 200 lines of Ruby Code blog post. This is the classic high school implementation: stem your terms, score them by the inverse of their frequency and the like, and it does a great job. I saw another similar effort entitled Eigenclass as well. Beyond this there are many more industrial solutions such as Ferret, Sphinx, acts_as_solr and Sunspot. Note that articles on using Solr mostly completely suck - it is SPECTACULAR how bad the documentation is - it makes me want to write a service to rewrite the web. Here is one article on Solr that does not totally suck. Also look here for something with more meat: QuarkRuby.
For parts of speech tagging I've been looking at the Ruby Linguistics Framework. The Shalmaneser engine seemed good as well. However the best one seems to be entagger.
For brute force geo-location I am now using the Yahoo PlaceMaker engine, and I'm also still using the MetaCarta Query Parser API.
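The simplest part of the extraction described above (hashtags, URLs and reply detection) can be sketched with a few regular expressions. This is an illustrative Python sketch, not the project's Ruby code; the function name and the returned field names are assumptions, and it treats a message that starts with an @mention as a reply, which is only a heuristic.

```python
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w+)")
REPLY = re.compile(r"\s*@(\w+)")


def extract_entities(text):
    """Pull hashtags, URLs, mentions and a likely reply target out of a message."""
    urls = URL.findall(text)
    # Remove URLs first so a "#fragment" inside a link is not counted as a hashtag.
    stripped = URL.sub(" ", text)
    reply = REPLY.match(stripped)
    return {
        "hashtags": [h.lower() for h in HASHTAG.findall(stripped)],
        # Trim punctuation that tends to trail a URL at the end of a sentence.
        "urls": [u.rstrip(".,;:!?)") for u in urls],
        "mentions": MENTION.findall(stripped),
        "reply_to": reply.group(1) if reply else None,
    }
```

Parts of speech tagging and geo-location are deliberately left out here, since those depend on the external engines mentioned above.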
Clustering is turning out to be the real hot-spot and I'm relying on it even for deciding what content to aggregate - especially because of the limits on polling Twitter. Andrew Turner pointed out Nexxus for examples of how this is done. On Wikipedia you can also see a few good articles on clustering techniques, such as this clustering overview. Of course my plan is to use Carrot2, since one of my collaborators at Meedan, Dawid Weiss, is a key contributor to this effort. Here is one of many links to people discussing this topic on the net. This stuff is all a cesspool of half completed works, works that are described but have no implementations, theory and the like. Some of the interesting work, such as ProteoLens, is not clearly documented down to the source code level, so it cannot be a candidate for use. Many things also have pretty pictures, such as Tools to visualize your Facebook network, but are end user apps - not real tools.
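To make the idea of clustering short messages concrete, here is a deliberately naive stand-in: a greedy single-pass grouping by Jaccard similarity of word sets. It is a Python sketch for illustration only, it is not how Carrot2 works, and the function names and the 0.3 threshold are arbitrary assumptions.

```python
import re


def tokens(text):
    """Lowercased word set, dropping very short words as a crude stop-word filter."""
    return {w for w in re.findall(r"[a-z0-9#']+", text.lower()) if len(w) > 2}


def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(messages, threshold=0.3):
    """Greedy single pass: join the most similar existing cluster or start a new one."""
    clusters = []  # each cluster: {"tokens": set, "members": [str]}
    for msg in messages:
        toks = tokens(msg)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = jaccard(toks, c["tokens"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(msg)
            best["tokens"] |= toks  # the cluster vocabulary grows as it absorbs messages
        else:
            clusters.append({"tokens": set(toks), "members": [msg]})
    return [c["members"] for c in clusters]
```

The result depends on message order and the vocabulary of a cluster only ever grows, which is exactly the kind of weakness a real clustering engine is meant to avoid.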
How is all this stuff going to be used? I am not sure what will end up happening. As I have been moving through the work I have found that the way I am phrasing what I want to do keeps changing. But I do have a current plan. My current big picture strategy is this:
- A participant will arrive at the website and will see a search box and a map. They can 1) move the map around to set a focus 2) enter search terms 3) enter a list of participants into the search term box 4) enter a URL into the search term box 5) enter hash tags into the search term box 6) enter a location into the search term box.
- A query to my site may result in queries to Twitter. I don't have a lot of caching - I will try to cache, but I think that I can't anticipate the range of data that users are going to want, and anyway the Twitter firehose is too big to hold.
- A query will have a geography and "a social network filter". I will filter the query on a set of users so that I only accept results from that given social network. Ideally these are users that you have specified, or are users who are friends of yours, or failing that, are users that are known to be "good people" in that geography. I want to have all queries be based on a ground truth understanding of a neighborhood - so being able to specify users is key - and if you do not specify them then I will have to use a variety of other approaches to try to get some initial users. For example, if you are really interested in activity in Iran, then your query would probably be anchored on one or two users who are in Iran. I will spread the query out so that friends and friends of friends are included in the query results, but ultimately it is filtered by that anchor - it is not a free-for-all because there is too much spam. Here is one project that shows how to follow local Twitter users. Tweepz also lets you find users by geography. I will copy similar functionality to help seed my geolocal queries if required.
- A query can also have ordinary search terms and the like. I am not sure what to do if I filter by some users and get no results. I guess I will have to do a naked search query and allow any results through. Another option is to fall back on users that I know are good users in the region, if there are any.
- If you don't put in a search term, then what you are going to get back is a series of posts scored by importance in that network. I score by two factors. First is velocity - if there is a fast trending URL or if there is a fast trending hashtag then it will be scored up and more visible. Second is magic phrases. If I see phrases like "I need" or "I want" or "I have" then I score those up as well.
- As a user you have some interaction possibilities. If you click on a post then that post can become the anchor for further queries. If you drag a post into your "save me" bucket then the post is scored up and will show up in further queries. I think scores are subjective, particular to you and your social network.
- Finally you will be able to connect two people together, so you can bookmark people and then connect them to other people that you find.
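The anchored filtering and the two scoring factors from the list above (velocity and magic phrases) can be sketched together. This is a rough Python illustration, not the project's Ruby code; the data shapes (a `friends` dictionary and posts as dictionaries with `user`, `time` and `text`), the one-hour window and the weight of 2 on magic phrases are all assumptions.

```python
from collections import Counter, deque
import re

MAGIC_PHRASES = ("i need", "i want", "i have")
TAG_OR_URL = re.compile(r"#\w+|https?://\S+")


def trend_terms(text):
    """Hashtags (case-folded) and URLs (as written) found in a post."""
    return {t.lower() if t.startswith("#") else t for t in TAG_OR_URL.findall(text)}


def network_around(anchors, friends, depth=2):
    """Anchor users plus everyone within `depth` friendship hops of them."""
    seen = set(anchors)
    queue = deque((user, 0) for user in anchors)
    while queue:
        user, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond friends of friends
        for friend in friends.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return seen


def rank_posts(posts, anchors, friends, now, window=3600):
    """Keep posts from the anchored network, then sort by velocity and magic phrases."""
    allowed = network_around(anchors, friends)
    kept = [p for p in posts if p["user"] in allowed]

    # Velocity: how many recent posts in the network carry each hashtag or URL.
    recent = Counter()
    for p in kept:
        if now - p["time"] <= window:
            recent.update(trend_terms(p["text"]))

    def score(p):
        velocity = sum(recent[t] for t in trend_terms(p["text"]))
        magic = sum(1 for phrase in MAGIC_PHRASES if phrase in p["text"].lower())
        return velocity + 2 * magic  # the weight of 2 is arbitrary

    return sorted(kept, key=score, reverse=True)
```

Because posts from outside the anchored network are dropped before scoring, a spammer using a trending hashtag neither appears in the results nor inflates the velocity count.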
End goal applications of all this are finding what is going viral (for example, see Visualizing retweets by Dan Zarrella) and separating real issues in a neighborhood from spam. The participation will be circular - people can score up issues to make them more visible.
For rich visualization I am looking at either Processing or Papervision 3D. However I am still stuck on the earlier stages, so I have to try to close those out before focusing here.
