Interested in Big-data Research but not sure where to get the data?
Many established organizations working in big data have well-established facilities, such as large clusters of machines and enough staff to maintain them, that let them do big-data research with ease. But what about a start-up or an independent researcher? How can an individual get started with new big-data ideas, especially with limited hardware? Many of the students who attend our big-data classes have this big question before them:
“Is it possible to do big-data research with just my laptop? I do not have access to hundreds of machines.”
The answer is: yes. In a nutshell, work or research in big data is not really just about ‘large data sets’. Sure, you would need hundreds of machines if you want to store terabytes of data and start analysing “all” of it. But that is only part of the story, and there is more that can be (and needs to be) done even before one starts working on terabytes of data. In other words,
Not all big-data research requires really large volumes of data.
Wait a minute, what did we say? Volume?
That’s right. Volume: One of the famous Vs of big-data, the other Vs being: Velocity and Variety.
- Big data, as many define it, has Volume, Velocity and Variety attributes. Except for Volume, which requires massive infrastructure, research on Velocity and Variety problems needs no more than a traditional lab setting (at least during the research phase).
- Big data in itself is useless if not “applied” somewhere – which is the 4th V, typically referred to as Value-add. These ‘application’ problems are, most of the time, just regular problems that got bigger with ‘scaled up/out’ complexity. So research on these problems, once again, needs no more than a traditional lab setting.
- When both of the above scenarios fail for your particular research case, there are still the free-tier small instances of AWS and Azure, which should satisfy any reasonably well-defined big-data problem.
Velocity and Variety, as the names imply, refer to the ‘speed’ of the data arrival/processing and ‘nature/type’ of the data being processed, respectively.
For example, a sensor generating data from a fast-moving vehicle (reporting the speed of the vehicle every second), or traffic arriving at Google search, are good examples of the Velocity of data. The biggest challenge in such cases is how to achieve low latency and high throughput: even a single slow component in the processing pipeline can bring down the whole system. This is where stream processing can help.
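To make the stream-processing idea concrete, here is a minimal sketch in Python (the speed readings are invented for the example): each reading is processed as it arrives, and only a small window is held in memory, so latency stays low no matter how long the stream runs.

```python
from collections import deque

def speed_stream():
    """Hypothetical per-second speed readings (km/h) from a vehicle sensor."""
    for speed in [62, 64, 63, 70, 75, 73, 68]:
        yield speed

def rolling_average(stream, window=3):
    """Process each reading the moment it arrives, keeping only a small
    window in memory -- the essence of stream processing: the full data
    set is never materialized."""
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

averages = list(rolling_average(speed_stream()))
print(averages[0])  # first average is just the first reading: 62.0
```

The same structure scales from a toy list to an unbounded feed; frameworks like Apache Storm or Spark Streaming apply this pattern across many machines.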
Multiple sensors generating different ‘types’ of data from the same fast-moving vehicle (reporting the engine temperature, an audio feed of tail vibration, a video feed from a camera mounted on the front) are an example of the Variety of data. The biggest challenge in this case is dealing with all the different types (audio, video, key-value pairs, time series, etc.) in a unified way. This is where schema-less databases and NoSQL techniques can help.
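As a toy illustration of the schema-less approach (the records and field names below are invented for the example), a document store keeps every record as a JSON-like document, so heterogeneous sensor data can be queried uniformly even though each record has a different shape:

```python
# Hypothetical readings from different sensors on the same vehicle --
# each record has its own fields, as in a schema-less document store.
records = [
    {"sensor": "engine", "type": "key-value", "temp_c": 92},
    {"sensor": "tail", "type": "audio", "sample_rate": 44100, "frames": 1024},
    {"sensor": "front-cam", "type": "video", "fps": 30, "codec": "h264"},
]

def find(docs, **criteria):
    """One query mechanism works across all record types, despite the
    differing fields -- no fixed schema is ever declared."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

video_records = find(records, type="video")
```

Real schema-less stores such as MongoDB or CouchDB apply the same idea at scale, but the unifying principle is the same: documents, not rigid tables.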
Loosely speaking, social media data is a very good example of both Velocity and Variety, since the data arrives at a very high speed (thousands of tweets per second) and with high variety (text, links, images, video and so on). Processing social media data to derive the Value-add, the 4th V, is an example of the ‘scaled up/out’ complexity of the problem.
To understand this correctly, consider a simple tweet of text and doing NLP (text mining) over it. It is a pretty simple, straightforward CPU-bound process. Now, instead of one tweet, imagine 1 million tweets to process. The fundamental problem (digesting a tweet with NLP) has not become any more complex (in terms of algorithmic complexity) – it has just got bigger. So, instead of using one computer, if you use 1 million computers in parallel, you should be able to solve it as ‘easily’ as the initial problem. (If you really want to understand this, contrast it with solving a 2-color problem vs a 3-color problem in graph coloring – where just scaling out and spawning new computers will not make the problem any simpler or easier.)
In other words, these big-data Value-add problems have been around for decades (e.g. estimating the number of people who are going to sign up for your discount deal) – but their ‘scale’ got bigger in recent years (that is, instead of dealing with 1,000 customers, they now deal with 1 million customers) – which is perfectly solvable just by adding more computers to work in parallel. (This is not the case with certain fundamental algorithmic problems, where adding more computers would not reduce the complexity of the problem in any significant manner.)
So, to do research on this kind of Value-add problem, one does not really need 100 computers. Instead, you can start with 1 computer, solve a small ‘instance’ of the problem, and put it in production on 1,000 machines running in parallel – and the chances are you are already solving the ‘bigger’ instance of the same problem. (Well, there are certain restrictions that make this invalid, but in most of the cases where big data is applied in industry today, this holds.)
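A minimal sketch of this ‘just add machines’ property, using Python’s multiprocessing (the tweets and the word-count stand-in for a real NLP step are invented for the example):

```python
from multiprocessing import Pool

def word_count(tweet):
    """The per-tweet work: a stand-in for a real NLP step. Its cost does
    not change no matter how many tweets exist in total."""
    return len(tweet.split())

# Hypothetical sample; in production this would be millions of tweets.
tweets = ["big data is fun", "scaling out is easy", "just add machines"]

if __name__ == "__main__":
    # Each worker runs the same, unchanged algorithm on its share of the
    # data -- scaling out adds workers, not algorithmic complexity.
    with Pool(processes=2) as pool:
        counts = pool.map(word_count, tweets)
    print(counts)  # [4, 4, 3]
```

Swapping `Pool(processes=2)` for a cluster of 1,000 machines (via Hadoop, Spark, or similar) changes the throughput, not the algorithm – which is exactly why the small instance you solve on your laptop carries over.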
Having said that, there is still a small problem remaining: where do you obtain sample data even for these small instances of the problem?
Well, that’s easy. If the problem is ‘obtaining data’, there are different techniques available (ranging from traditional web scraping to creating pseudo data sets oneself) that vary with the domain.
Especially in natural language processing (NLP) (and all its peers like sentiment analysis, trend analysis, intent analysis, etc.), social media data is very popular these days as a data source. There are both commercial data aggregators such as Radian6 (some of our projects use this), as well as free web-scraping tools such as Apache Nutch (basically a crawler, but it can be customized to retrieve only the portion you want).
And then there are web-archiving facilities such as the Wayback Machine at archive.org (with a custom API) that provide access to roughly 80 terabytes of archived web data, covering an impressive 2,273,840,159 unique URLs crawled across 29,032,069 hosts. Since it is accessible for free, you cannot get a better deal than this for your big-data research on web data.
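For instance, the Wayback Machine exposes an ‘availability’ JSON API that returns the closest archived snapshot of a given URL. The sketch below builds such a query and parses a canned response in the API’s documented shape (no network call is made here, and the snapshot values are illustrative):

```python
import json
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build a request URL for the Wayback Machine availability API.
    The optional timestamp (YYYYMMDD) asks for the snapshot closest
    to that date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

# A canned response in the API's documented shape; a real request
# would return JSON like this.
sample = json.loads("""{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612",
      "status": "200"
    }
  }
}""")

snapshot_url = sample["archived_snapshots"]["closest"]["url"]
```

From here, a small crawler could fetch each snapshot URL and feed the pages into whatever NLP or analytics pipeline you are prototyping.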
How about data sets for other domains? Sure, here are a few freely accessible data sets that can get you started with your big-data research. The list is by no means exhaustive – it is just a starting point to get you going on your big-data work.
- KDnuggets has a good collection of links to different datasets: http://www.kdnuggets.com/datasets/index.html
- Google Public Data Directory: http://www.google.com/publicdata/directory
- Windows Azure Data Market: https://datamarket.azure.com/browse/data
- Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html
- Machine Learning public contributed data sets: http://mldata.org/repository/data/
- United Nations Data: http://data.un.org/
Finally, if you would like to get started with big-data learning, or need help in formulating your organization’s big-data offerings, My-Classes would be happy to help you with online e-Classes on big data and predictive analytics, and by providing access to world-class consultants who have more than a decade of experience in delivering solutions around large-scale distributed architectures, scalable designs, and big-data analytics.
Also, some of our big-data and data-science practice exams may help you boost your career with certifications. Check ’em out.