Big Data Tutorial

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Why Is Big Data Important?

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable

Cost Reductions
Time reductions
New product development and optimized offerings
Smart decision making.

When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

Determining root causes of failures, issues, and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.

Types of Big Data

Big Data could be of three types:

Structured
Semi-Structured
Unstructured

Structured

The data that can be stored and processed in a fixed format is called as Structured Data. Data stored in a relational database management system (RDBMS) is one example of ‘structured’ data. It is easy to process structured data as it has a fixed schema. Structured Query Language (SQL) is often used to manage such kind of Data.

Semi-Structured

Semi-Structured Data is a type of data which does not have a formal structure of a data model, i.e. a table definition in a relational DBMS, but nevertheless, it has some organizational properties like tags and other markers to separate semantic elements that make it easier to analyze. XML files or JSON documents are examples of semi-structured data.

Unstructured

The data which have unknown form and cannot be stored in RDBMS and cannot be analyzed unless it is transformed into a structured format is called as unstructured data. Text Files and multimedia contents like images, audios, videos are examples of unstructured data. The unstructured data is growing quicker than others, experts say that 80 percent of the data in an organization is unstructured.

Big Data Driving Factors

The quantity of data on planet earth is growing exponentially for many reasons. Various sources and our day to day activities generate lots of data. With the intent of the web, the whole world has gone online, every single thing we do leaves a digital trace. With the smart objects going online, the data growth rate has increased rapidly. The major sources of Big Data are social media sites, sensor networks, digital images/videos, cell phones, purchase transaction records, weblogs, medical records, archives, military surveillance, eCommerce, complex scientific research and so on. All these information amounts to around some Quintillion bytes of data. By 2020, the data volumes will be around 40 Zettabytes which is equivalent to adding every single grain of sand on the planet multiplied by seventy-five.

Big Data Characteristics

The five characteristics that define Big Data are Volume, Velocity, Variety, Veracity, and Value.

VOLUME

Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The size of data generated by humans, machines and their interactions on social media itself is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005.

VELOCITY

Velocity is defined as the pace at which different sources generate the data every day. This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on Mobile as of now, which is an increase of 22% year-over-year. This shows how fast the number of users is growing on social media and how fast the data is getting generated daily. If you are able to handle the velocity, you will be able to generate insights and take decisions based on real-time data.

VARIETY

As there are many sources which are contributing to Big Data, the type of data they are generating is different. It can be structured, semi-structured or unstructured. Hence, there is a variety of data which is getting generated every day. Earlier, we used to get the data from excel and databases, now the data are coming in the form of images, audios, videos, sensor data etc. as shown in below image. Hence, this variety of unstructured data creates problems in capturing, storage, mining and analyzing the data.

VERACITY

Veracity refers to the data in doubt or uncertainty of data available due to data inconsistency and incompleteness. In the image below, you can see that few values are missing from the table. Also, a few values are hard to accept, for example – 15000 minimum value in the 3rd row, it is not possible. This inconsistency and incompleteness is Veracity.

VALUE

After discussing Volume, Velocity, Variety, and Veracity, there is another V that should be taken into account when looking at Big Data i.e. Value. It is all well and good to have access to big data but unless we can turn it into value it is useless. By turning it into value I mean, Is it adding to the benefits of the organizations who are analyzing big data? Is the organization working on Big Data achieving high ROI (Return On Investment)? Unless it adds to their profits by working on Big Data, it is useless.

Justin Ong

+ posts

Justin studied at Brock University and Humber College. He leads the team at websiteTOON Digital. He has been a featured speaker with the Toronto Board of Trade, Kitchener/Waterloo Board of Trade, Trade Accelerator Program, World Trade Centre Institute, and other educational bodies. He loves to read and plays Starcraft on the side.