Big data has been one of the most hyped buzzwords of the last two years. With all the hype around it, it is very hard to find a good definition when it comes to the simple question of what “big data” means for a specific case in your industry and your applications. Here is how Wikipedia describes big data:
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The web is an interesting place to dig for new sources of information. These days the web goes far beyond web pages and database-driven websites. It contains lots of structured information that businesses can use, and manufacturing companies are among them. Information about products, customers, interests, priorities: this is a new goldmine for web researchers.
I’ve been skimming the semanticweb.com website, and a publication about the Web Data Commons project caught my attention. Web Data Commons is about structured data on the internet. Here is an interesting snippet about what it does:
More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads and also in the form of CSV-tables for common entity types (e.g. product, organization, location, …). In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.
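To make the idea of embedded structured data more concrete, here is a minimal sketch of pulling Microdata properties out of a page using only Python's standard library. The HTML fragment, the class name MicrodataParser and the helper extract_microdata are made up for illustration. This is not the extraction pipeline that Web Data Commons uses, and it ignores nested items.

```python
from html.parser import HTMLParser

# Hypothetical product page fragment marked up with schema.org Microdata.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Cordless Drill</span>
  <span itemprop="description">18V drill with two batteries</span>
  <img itemprop="image" src="http://example.com/drill.jpg">
</div>
"""


class MicrodataParser(HTMLParser):
    """Collects itemtype values and itemprop name/value pairs (flat, no nesting)."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A valueless attribute such as itemscope shows up with the value None.
        if "itemscope" in attrs:
            if attrs.get("itemtype"):
                self.item_types.append(attrs["itemtype"])
            return  # nested items are out of scope for this sketch
        prop = attrs.get("itemprop")
        if prop is None:
            return
        # Elements such as img or a carry the value in an attribute.
        value = attrs.get("src") or attrs.get("href") or attrs.get("content")
        if value:
            self.properties[prop] = value
        else:
            # Otherwise the value is the text that follows the start tag.
            self._current_prop = prop

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


def extract_microdata(html):
    """Return (item types, properties) found in an HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    return parser.item_types, parser.properties


types, props = extract_microdata(SAMPLE_HTML)
print(types)
print(props)
```

A real crawl would need a full Microdata, RDFa and Microformats parser, but the sketch shows why this kind of markup is attractive: the product name, description and image come out as labeled fields rather than free text.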
Dig a bit deeper to learn about the statistics of structured data. You can find some information in Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus. According to these statistics, product-related information is the most popular in the data corpus researched. Look at the following passage:
Products in RDFa. We identified three RDFa classes, og:”product”, dv:Product, and gr:Offering, that are used each on at least 500 different websites for describing products. og:”product” is the most popular class, being used by more than 19,000 websites.
In addition, product data was found on websites using Microdata and Microformats.
Reviewing all Microdata classes that are used in more than 100 different websites, we could identify four classes, schema:Product, schema:Offer, datavoc:Product, and datavoc:Offer, that are frequently used to describe products or offers. The following table shows the co-occurrences of these classes with other product-related classes on the same website. For instance, 4,308 websites provide product data together with aggregate ratings for these products. In addition to the class co-occurrences, we analyzed which properties are frequently used to describe schema:Products. The table below shows that schema:Product/name, schema:Product/description, schema:Product/image, and schema:Product/offers are the most frequently used properties.
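The kind of statistic quoted above (how many websites use a given property) can be sketched in a few lines of Python. The sample quads, the hostnames and the function websites_per_property are hypothetical, and the line parsing is a deliberate simplification that assumes each line follows the subject, predicate, object, page URL layout of an N-Quads file. The real corpus is far larger and calls for a proper N-Quads parser.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical quads: subject, predicate, object, URL of the page they came from.
SAMPLE_QUADS = """\
_:a <http://schema.org/Product/name> "Cordless Drill" <http://shop-a.example/drill> .
_:a <http://schema.org/Product/image> <http://shop-a.example/drill.jpg> <http://shop-a.example/drill> .
_:b <http://schema.org/Product/name> "Table Saw" <http://shop-a.example/saw> .
_:c <http://schema.org/Product/name> "Bench Vise" <http://shop-b.example/vise> .
"""


def websites_per_property(quads_text):
    """Count the distinct hosts that use each predicate (simplified parsing)."""
    hosts = defaultdict(set)
    for line in quads_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Drop the terminating " ." of the statement.
        body = line.rstrip(".").rstrip()
        # The predicate is the second token; the page URL is the last token.
        _subject, predicate, _rest = body.split(None, 2)
        graph = body.rsplit(None, 1)[1]
        host = urlparse(graph.strip("<>")).netloc
        hosts[predicate.strip("<>")].add(host)
    return {prop: len(sites) for prop, sites in hosts.items()}


counts = websites_per_property(SAMPLE_QUADS)
for prop, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(prop, n)
```

Counting hosts rather than raw statements matters here: one large shop with thousands of product pages should count as a single website, which is how the quoted figures are expressed.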
What is my conclusion? Manufacturing companies are looking for ways to improve product-related decision making. Potential leverage can come from the analysis of web data about products and services. PLM vendors can think about non-traditional approaches to getting information about products and customers. Important. Just my thoughts…