Author : MD TAREQ HASSAN
What is Big Data?
Big data refers to data that has the following properties:
- Volume: huge data
- Velocity: comes at high speed
- Variety: different types of data
- Veracity (accuracy): how accurate and trustworthy the data is
Big Data is also a method to understand the patterns and behaviors of people, such as clicks on social media apps and corporate websites (among many others). Why would a business want to do that? So it can determine what customers want (based on their behavior) and provide them with a better “digital” experience, so that they will buy more over time. As advertising moves to apps and the web, this capability becomes more and more important to businesses as they sell to you and me. By the way, IT professionals consider this type of data to be “unstructured”, which is another way of saying the data sits in loose files that have to be gathered, integrated, and analyzed. Think about taking hundreds of thousands of handwritten notes and looking for trends in the information across them. Painful, right?
Big Data also covers the technologies and strategies that:
- Gather large datasets
- Organize the data
- Process the data
- Gather insights from the data
Big Data Analytics
- Sales and Marketing
- Risk Management
- Product Design
- Supply Chain Management
- Planning and Safety
Big Data Processing
Flow: Ingest > Persist > Analyze > Visualize
Ingest
- ETL: extract, transform, load
- Modifying
- Categorizing
- Filtering bad data
- Validating data
- Often stored as raw data
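The ingest steps above (modifying, categorizing, filtering bad data, validating) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the CSV sample, the field names (`user_id`, `page`, `duration_ms`), and the 10-second threshold for a "long" visit are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical raw input: click events exported as CSV text.
RAW_EVENTS = """user_id,page,duration_ms
u1,/home,1200
u2,/pricing,not-a-number
,/home,300
u3,/checkout,45000
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def is_valid(row):
    """Validate: require a user id and a numeric duration."""
    return bool(row["user_id"]) and row["duration_ms"].isdigit()

def transform(row):
    """Modify and categorize: convert types and label the visit length."""
    duration = int(row["duration_ms"])
    return {
        "user_id": row["user_id"],
        "page": row["page"],
        "duration_ms": duration,
        # Assumed threshold: 10 seconds or more counts as a "long" visit.
        "visit": "long" if duration >= 10_000 else "short",
    }

def run_etl(text):
    """Run extract, filter, and transform; the returned list stands in for the load step."""
    rows = extract(text)
    return [transform(r) for r in rows if is_valid(r)]

loaded = run_etl(RAW_EVENTS)
for record in loaded:
    print(record)
```

Rows with a missing user id or a non-numeric duration are dropped before the transform step. In a real system the raw rows would often be kept as well, so that they can be reprocessed later.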
Persist
- Data warehouse
- Distributed file systems (Hadoop)
Analyze
- Batch processing (splitting, mapping, reducing, assembling)
- Real-time processing
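The four batch steps (splitting, mapping, reducing, assembling) can be shown on a single machine with a small page-view count. This is only a sketch of the idea: the function names and the sample data are hypothetical, and a framework such as Hadoop would run the mapping step in parallel across many nodes rather than in a simple loop.

```python
from collections import Counter
from functools import reduce

def split(records, chunk_size):
    """Splitting: cut the dataset into independent chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """Mapping: count page views within a single chunk."""
    return Counter(chunk)

def merge(left, right):
    """Reducing: combine two partial counts into one."""
    return left + right

def batch_count(records, chunk_size=2):
    chunks = split(records, chunk_size)
    partials = [map_chunk(c) for c in chunks]  # on a cluster, this runs in parallel
    totals = reduce(merge, partials, Counter())
    # Assembling: produce the final result, most viewed page first.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

pages = ["/home", "/pricing", "/home", "/checkout", "/home"]
result = batch_count(pages)
print(result)
```

Because each chunk is counted independently, the chunks can be processed on different machines and only the small partial counts need to be combined at the end.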
Visualize
- Querying and reporting
- Visuals, dashboards
What is data science?
See: explain/data-science
Types of data
- Unstructured data
- Semi-structured data
- Structured data
Semi-structured data
It has structure, but that structure depends on the source. You work with semi-structured data all the time. Your email is semi-structured data. It has a pretty consistent structure: you always have a sender and a recipient, but the names and contents of the other fields might vary. Data science teams will typically work with more semi-structured data than structured data. This includes the large volumes of email, web logs, and social network data that can be analyzed.
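The email example can be made concrete with Python's standard `email` module. The message below is invented for illustration, and `X-Campaign` and `X-Priority` are hypothetical custom headers. The point is that `From` and `To` are always there, while other fields may or may not be present.

```python
from email import message_from_string

# Hypothetical message: standard headers plus one custom header.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report
X-Campaign: spring-sale

Hi Bob, the report is attached.
"""

msg = message_from_string(RAW)

# Fields that every message is expected to have.
sender = msg["From"]
recipient = msg["To"]

# Fields that vary by source: look them up defensively.
campaign = msg.get("X-Campaign")   # present in this message
priority = msg.get("X-Priority")   # absent, so this is None

# The body is free text, i.e. the unstructured part of the message.
body = msg.get_payload()

print(sender, recipient, campaign, priority)
print(body)
```

Code that analyzes semi-structured data usually has to handle missing or unexpected fields like this, which is not needed with a fixed table schema.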