What is Big Data?

Big Data is a buzzword that few people can agree on. So if someone starts going on about Big Data, ask them “What is your definition of Big Data?”

Here’s mine.

From a technical standpoint, Big Data is an extension of data warehousing and business intelligence. Data warehousing has always used large data sets, and data mining has always been an advanced analytical method for business intelligence. These technologies have been around since the 1990s and are well-developed and mature.

So, what is Big Data, truly? Early adopters like Yahoo were trying to apply data warehousing techniques to social media data sets, which turned out to be considerably larger than most previous data sets. They solved their storage issues by mimicking Google’s approach of using large server farms to store the data. There is now a packaged approach for this called Hadoop. Hadoop spreads the data across the servers and manages them for you, while the data is processed with Google’s MapReduce technique.
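To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample posts are made up for illustration. It only shows the shape of the technique: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key. A real framework does this between
    the map and reduce steps, across many machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data standing in for social media posts.
posts = ["big data is big", "data is data"]
counts = word_count(posts)
print(counts)
```

Because each map call looks at only one record and each reduce call looks at only one key, the work can be split across many servers, which is the property a cluster takes advantage of.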

Because this social media data was difficult to store in a traditional relational database (RDBMS), early adopters turned to alternatives. These databases are now typically called NoSQL databases, but the term is too vague to be useful. The types of databases are many and varied (document stores, key-value stores, column-family stores, and graph databases, for example), so find out the name and type of your database, and always use them in a discussion to avoid confusion.
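One way to see why this kind of data sits awkwardly in fixed-schema tables is the rough sketch below, in plain Python with no database involved. The two sample posts and their field names are invented for illustration. Each post carries a different set of fields, which a document-style record can hold as-is, while a single fixed table needs a column for every field that ever appears and leaves the rest empty.

```python
# Hypothetical posts: each record has its own set of fields.
posts = [
    {"user": "alice", "text": "Loving the new phone", "hashtags": ["phone"]},
    {"user": "bob", "text": "Check this out", "link": "http://example.com", "likes": 12},
]


def fields_used(records):
    """Collect every field name that appears in any record."""
    return sorted({field for record in records for field in record})


def to_row(record, columns):
    """Flatten one record into a fixed-width row, using None for missing fields."""
    return [record.get(column) for column in columns]


columns = fields_used(posts)
rows = [to_row(post, columns) for post in posts]

# Count the cells that had to be left empty to fit the fixed layout.
empty_cells = sum(cell is None for row in rows for cell in row)

print(columns)
for row in rows:
    print(row)
print("empty cells:", empty_cells)
```

With only two posts the waste is small, but the number of distinct fields tends to grow with the variety of the data, and that is the pressure that pushed early adopters toward other storage models.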

The type of analysis done on this data is typical of business intelligence: basic reporting, probability, statistics, and data mining. Although these techniques are not new, people skilled in them are scarce. Big Data projects often require analysts with advanced skill sets along with the creativity to think outside the box.

Who is using Big Data? Primarily the retail and telco sectors, with some new adopters in the financial and health sectors.

In summary, Big Data uses a tool like Hadoop to store and process data, uses a non-traditional database like MongoDB to give structure to the data, and uses advanced analytical techniques like data mining to make sense of the data.