Thursday, December 11, 2014

Active English Vocabulary

How Big Data can help us learn

http://www.paulhelmick.com/wp-content/uploads/2014/04/number-of-users-on-social-networks-infographic-clip.png

Twitter was the second biggest social network in the world in 2012. Since then many similar networks have appeared and proved the unquestionable strength of both the networks and their developers. As time goes by, humankind produces more data than ever before. A few years ago data scientists started to discuss a huge boom called "Big Data". The term refers to data sets larger than conventional databases based on relational algebra can easily handle.





That was the time when awareness of non-relational (NoSQL) databases began to grow. The idea of MapReduce emerged between 2002 and 2004, and Apache Hadoop was created the following year: a framework for handling large data sets, such as collections of JSON files that do not follow an exact schema.

About two years ago the University of Washington launched an online course on Coursera called Introduction to Data Science, which is pretty popular nowadays. I enrolled last summer and earned a certificate. In this class we went over the fundamentals of Python, relational algebra, NoSQL, Hadoop and MapReduce, Pig, Hive, CouchDB and so on.

That is where the idea of bringing something similar to the university I currently attend came from. And so the code was born.

I started working on the partial projects we had been coding during the class, mostly written in Python. Many of them are intended to process big data downloaded as a live stream from Twitter. I wanted to create something useful and analytical using big data and the knowledge of how to process it that I gained from the University of Washington course.

__Fine, so... how does it actually work?__

To access the live stream, you will need to install the oauth2 library so you can properly authenticate.

  1. Create a Twitter account if you do not already have one.
  2. Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
  3. Click "Create New App"
  4. Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
  5. On the next page, click the "API Keys" tab along the top, then scroll all the way down until you see the section "Your Access Token"
  6. Click the button "Create My Access Token". You can read more about OAuth authorization.
  7. You will now copy four values into twitterstream.py. These values are your "API Key", your "API secret", your "Access token" and your "Access token secret". All four should now be visible on the API Keys page. (You may see "API Key" referred to as "Consumer key" in some places in the code or on the web; they are synonyms.) Open twitterstream.py and set the variables corresponding to the API key, API secret, access token, and access token secret. You will see code like the following:
    api_key = "<Enter api key>"
    api_secret = "<Enter api secret>"
    access_token_key = "<Enter your access token key here>"
    access_token_secret = "<Enter your access token secret here>"
    
    
  8. Run the following and make sure you see data flowing and that no errors occur.
    $ python twitterstream.py > output.txt
    This command redirects the output to a file. Stop the program with Ctrl-C, but wait at least 3 minutes for data to accumulate. Keep the file output.txt for the duration of the assignment; we will be reusing it in later problems. Don't use someone else's file; we will check for uniqueness in other parts of the assignment.
  9. If you wish, modify the file to use the Twitter Search API to search for specific terms. For example, to search for the term "microsoft", you can pass the following URL to the twitterreq function:
    https://api.twitter.com/1.1/search/tweets.json?q=microsoft
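
The search URL from step 9 can also be built programmatically, so that terms containing spaces or special characters are escaped correctly. What follows is only a minimal sketch in Python 3 using the standard library; the helper name `build_search_url` is my own illustration and is not part of twitterstream.py.

```python
from urllib.parse import urlencode

# Search endpoint taken from step 9 above.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(term):
    """Return the search URL for a term (hypothetical helper, not in twitterstream.py)."""
    # urlencode escapes spaces and special characters in the query string.
    return SEARCH_ENDPOINT + "?" + urlencode({"q": term})


if __name__ == "__main__":
    print(build_search_url("microsoft"))
    print(build_search_url("big data"))
```

The resulting string is what you would pass to the twitterreq function in place of the hand-written URL.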
    
     
You can easily create plots like this one and find out when people are more likely to share their chocolate with you:


I collected loads of data and decided to help people around me by deliberately teaching them mainly the words that make up the active English vocabulary, which is about 2000 words [see reference].

I created an application, AEV (active_english_vocabulary), that processes as much current data as you'd like (and your machine can handle) and creates a list of the most commonly used words right now.
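
The core idea, counting which words occur most often in a collection of tweets, can be sketched in a few lines. This is only an illustration and not the actual AEV code: the function name `most_common_words`, the sample tweets, and the very simple regular-expression tokenizer are all assumptions of mine.

```python
import re
from collections import Counter

# Very simple tokenizer: runs of lowercase ASCII letters and apostrophes.
# This is an assumption; real tweets also contain hashtags, mentions and
# URLs that would need more careful handling.
WORD_PATTERN = re.compile(r"[a-z']+")


def most_common_words(texts, limit=2000):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    return counts.most_common(limit)


if __name__ == "__main__":
    # Invented sample data, used only to show the call.
    sample = [
        "I love chocolate",
        "Chocolate is what I love most",
    ]
    for word, count in most_common_words(sample, limit=3):
        print(word, count)
```

The default limit of 2000 mirrors the size of the active vocabulary mentioned above; any other cut-off works the same way.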

The code of this application is on my GitHub page and is free to fork. I used Tkinter (see documentation), a cross-platform GUI toolkit for Python. I also embedded HTML links into the results of my analysis; each link points to an English-Slovak dictionary with a query for that particular word.

Knowing how to process big data means that we are able to reach far more conclusions than people could before these modern technologies were available. That is why I decided to write my Bachelor thesis about Apache Hadoop, and why I encourage you to learn new things and not to be afraid of databases.

Twitter works as a microblog for every user: users can write status messages visible to the public or to their followers. Such a message is called a "tweet". (I know you all know it, it's just for my teachers.)

__How are Tweets represented as data?__

For more information, see the official documentation. Basically, a tweet is a structure of nested lists and dictionaries, similar to a JSON document.
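
To make this concrete, here is a small sketch of how one tweet could be loaded and inspected in Python. The sample is heavily shortened and invented for illustration; the field names (`text`, `lang`, `user`, `entities`) follow the Twitter REST API v1.1 tweet object, but a real tweet carries many more keys.

```python
import json

# A shortened, invented tweet; real tweets from the stream have many more fields.
raw_line = """
{"created_at": "Thu Dec 11 10:15:00 +0000 2014",
 "text": "Learning #bigdata with Python",
 "lang": "en",
 "user": {"screen_name": "example_user"},
 "entities": {"hashtags": [{"text": "bigdata"}]}}
"""

tweet = json.loads(raw_line)  # JSON object -> Python dict

# Nested dictionaries and lists are accessed with ordinary indexing.
author = tweet["user"]["screen_name"]
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]

print(author)
print(tweet["text"])
print(hashtags)
```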

I'm adding links to online courses for further information about the technologies I mentioned.
Python class
DataScience class

http://nealcaren.web.unc.edu/files/2012/04/pizza_json.png

Example of a JSON document enclosed x)