Category Archives: Statistics & Probability

What is Big Data?

Big Data is a buzzword that few people can agree on. So if someone starts spewing on about Big Data, ask them “What is your definition of Big Data?”

Here’s mine.

From a technical standpoint, Big Data is an extension of data warehousing and business intelligence. Data warehousing has always used large data sets, and data mining has always been an advanced analytical method for business intelligence. These technologies have been around since the 1990s and are well-developed and mature.

So, what is Big Data, truly? Early adopters like Yahoo were trying to apply data warehousing techniques to social media data sets, which turned out to be considerably larger than most previous data sets. They solved their storage issues by mimicking Google’s approach of using large server farms to store the data. There is now a packaged approach for this called Hadoop. Hadoop manages the cluster of servers and processes the data with an implementation of Google’s MapReduce technique.
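
To make the MapReduce idea concrete, here is a minimal single-machine sketch in R of the classic word-count example. It is not Hadoop code: the sample documents and variable names are made up for illustration, and a real cluster would run the map and reduce steps on many servers in parallel.

```r
# Toy input: each string stands in for one document (hypothetical data).
docs <- c("big data is big", "data about data")

# Map step: each document independently emits (word, 1) pairs.
mapped <- lapply(docs, function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
})

# Shuffle step: gather all pairs so they can be grouped by key.
pairs <- do.call(rbind, mapped)

# Reduce step: sum the values for each key.
counts <- tapply(pairs$value, pairs$key, sum)
print(counts)
```

Because the map step looks at one document at a time and the reduce step looks at one key at a time, both can be spread across machines, which is the property Hadoop exploits.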

Because this social media data was difficult to store in a traditional relational database (RDBMS), early adopters turned to alternatives. These databases are now typically called NoSQL databases, but the term is too vague to be useful. The types are many and varied (document, key-value, column-family, and graph stores, for example), so find out the name and type of your database and always use them in a discussion to avoid confusion.

The type of analysis done on this data is typical of business intelligence: basic reporting, probability, statistics, and data mining. Although these techniques are not new, the labor force for them is scarce. Big Data projects often require analysts with advanced skill sets along with additional creative skills to work outside the box.

Who is using Big Data? Primarily the retail and telco sectors, with some new adopters in the financial and health sectors.

In summary, Big Data uses a tool like Hadoop to store and process data, uses a non-traditional database like MongoDB to give structure to the data, and uses advanced analytical techniques like data mining to make sense of the data.

 

 

Hans Rosling on TED

Stats that reshape your worldview.

You’ve never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called “developing world.”

In Hans Rosling’s hands, data sings. Global trends in health and economics come to vivid life. And the big picture of global development—with some surprisingly good news—snaps into sharp focus.

Example Bayesian Inference using R and OpenBUGS

  1. Open the R console
    Go to File/Change dir and set the working directory for R.
  2. Use your favorite editor to save the following code as schools.bug in your R working directory.
    model {
      for (j in 1:J){
        y[j] ~ dnorm (theta[j], tau.y[j])
        theta[j] ~ dnorm (mu.theta, tau.theta)
        tau.y[j] <- pow(sigma.y[j], -2)
      }
      mu.theta ~ dnorm (0.0, 1.0E-6)
      tau.theta <- pow(sigma.theta, -2)
      sigma.theta ~ dunif (0, 1000)
    }
  3. Save this data as schools.dat in your R working directory.
    school estimate sd
    A  28  15
    B   8  10
    C  -3  16
    D   7  11
    E  -1   9
    F   1  11
    G  18  10
    H  12  18
  4. Go back to the R console and type the following commands.
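    # Read the data and pull out the pieces the model needs
    schools <- read.table ("schools.dat", header=TRUE)
    J <- nrow(schools)
    y <- schools$estimate
    sigma.y <- schools$sd
    # bugs() looks these up by name in the workspace
    data <- list ("J", "y", "sigma.y")
    # Random starting values for each chain
    inits <- function(){list(theta=rnorm(J,0,100),mu.theta=rnorm(1,0,100),sigma.theta=runif(1,0,100))}
    parameters <- c("theta", "mu.theta", "sigma.theta")
    # Assumption: bugs() is provided by R2WinBUGS, which calls BRugs when program="openbugs";
    # the original listing loaded only BRugs
    library ("R2WinBUGS")
    library ("BRugs")

    schools.sim <- bugs (data, inits, parameters, "schools.bug", n.chains=3, n.iter=1000, program="openbugs")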
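
For readers new to BUGS notation, the model in schools.bug can be written out as below. This is one reading of the code rather than something stated in the original example. Note that dnorm in BUGS takes a precision (1/variance) as its second argument, which is why the code converts each standard deviation with pow(sigma, -2). The normal distributions here are written with variances, so the precision 1.0E-6 on mu.theta becomes a variance of 10^6.

```latex
\begin{align*}
y_j &\sim \mathcal{N}\!\left(\theta_j,\ \sigma_{y,j}^{2}\right), \qquad j = 1, \dots, J \\
\theta_j &\sim \mathcal{N}\!\left(\mu_\theta,\ \sigma_\theta^{2}\right) \\
\mu_\theta &\sim \mathcal{N}\!\left(0,\ 10^{6}\right) \\
\sigma_\theta &\sim \mathrm{Uniform}(0,\ 1000)
\end{align*}
```

Here y_j and sigma_{y,j} are the estimate and sd columns of schools.dat, and J is the number of schools (8 in this data set).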

 

From an example at columbia.edu.

Bayesian Inference Using R

Here are the steps I took to create an environment for doing Bayesian Inference.

  1. Download and install R
    Find a mirror near you, download the executable, then run the install.
  2. Get familiar with the R environment
    Read the manual.
  3. Download and install OpenBUGS
    Go to OpenBUGS, pick an OS, download, and install.
  4. Go back to R and install the coda package (you will need this for BRugs)
    Open the R console and type install.packages("coda").
    Choose a mirror, then finish install.
    Type library() at the console to verify coda is in the list of installed packages.
  5. Go back to R and install the BRugs package (you will need this to reference OpenBUGS from R)
    Open the R console and type install.packages("BRugs").
    Type library(BRugs) at the console to open BRugs.
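
Once the steps above are done, a quick check like the following can confirm that both packages are visible to R. It is only a sketch: the variable names are arbitrary, and it reports missing packages rather than installing anything.

```r
# Packages installed in steps 4 and 5.
needed <- c("coda", "BRugs")

# requireNamespace() returns TRUE if a package's namespace can be loaded,
# without attaching it to the search path.
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages().
missing_pkgs <- needed[!available]

if (length(missing_pkgs) == 0) {
  message("coda and BRugs are both installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```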