"Unstructured data": meaning and origin - Open Encyclopedia

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form.^[1] This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some.^[2] Other sources have reported similar or higher percentages of unstructured data.^[3]^[4]^[5]

Background

The earliest research into business intelligence focused on unstructured textual data rather than numerical data.^[8] As early as 1958, computer science researchers such as H.P. Luhn were particularly concerned with the extraction and classification of unstructured text.^[8] However, only since the turn of the 21st century has the technology caught up with the research interest. In 2004, the SAS Institute developed the SAS Text Miner, which uses Singular Value Decomposition (SVD) to reduce a high-dimensional textual space to fewer dimensions for significantly more efficient machine analysis.^[9] The mathematical and technological advances sparked by machine textual analysis prompted a number of businesses to research applications, leading to the development of fields such as sentiment analysis, voice of the customer mining, and call center optimization.^[10] The emergence of Big Data in the late 2000s led to heightened interest in the applications of unstructured data analytics in contemporary fields such as predictive analytics and root cause analysis.^[11]

Issues with terminology

Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide different methods to find patterns in, or otherwise interpret, unstructured information. Common techniques for structuring text usually involve manual tagging with metadata, or part-of-speech tagging as a basis for further text mining-based structuring. The Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.^[12]
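As a rough illustration of tagging free text with metadata, the sketch below pulls dates, dollar amounts, and e-mail addresses out of a message body using regular expressions. The `ExtractedRecord` schema, the patterns, and the sample message are assumptions made for this example; production text analytics would rely on far more robust NLP tooling.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractedRecord:
    """Structured view of one free-text message (hypothetical schema)."""
    dates: list = field(default_factory=list)
    amounts: list = field(default_factory=list)
    emails: list = field(default_factory=list)


# Deliberately narrow patterns: ISO dates, dollar amounts, e-mail addresses.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_record(text: str) -> ExtractedRecord:
    """Collect the spans of free text that match the known patterns."""
    return ExtractedRecord(
        dates=DATE_RE.findall(text),
        amounts=AMOUNT_RE.findall(text),
        emails=EMAIL_RE.findall(text),
    )


# Hypothetical e-mail body used only for illustration.
body = (
    "Hi team, the invoice for $1,250.00 is due on 2024-03-15. "
    "Please send questions to billing@example.com before 2024-03-01."
)
record = extract_record(body)
print(record)
```

The extracted fields can then be stored in ordinary database columns, while the remaining prose stays unstructured.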

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exists in all forms of human communication.^[13] Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.

Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document.

While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, …) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data".^[14] For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
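The point about HTML mark-up can be made concrete with a small sketch that uses Python's standard `html.parser` module. The sample page and the `TextAndTags` class are hypothetical. The parser recovers the tag names and the visible text, but nothing in the mark-up states that "12%" is a revenue growth figure.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collect visible text and the names of the tags encountered."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tags_seen = []

    def handle_starttag(self, tag, attrs):
        self.tags_seen.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Hypothetical page: the tags describe layout, not meaning.
page = (
    "<html><body><h1>Quarterly update</h1>"
    "<p>Revenue rose <b>12%</b> in Q3.</p></body></html>"
)

parser = TextAndTags()
parser.feed(page)
parser.close()

print(parser.tags_seen)             # structural information: headings, bold, ...
print(" ".join(parser.text_parts))  # the content still has to be interpreted
```

The tag list says that some text is a heading and some is bold; turning the sentence into a fact such as "Q3 revenue growth = 12%" still requires text analysis.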

Since unstructured data commonly occurs in electronic documents, using a content or document management system that can categorize entire documents is often preferred over transferring and manipulating data from within the documents. Document management thus provides the means to impose structure on document collections.

Search engines have become popular tools for indexing and searching through such data, especially text.
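A minimal sketch of the indexing idea behind text search is an inverted index, which maps each token to the documents that contain it. The document ids, texts, and function names below are invented for illustration; real search engines add ranking, stemming, and many other refinements.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical document collection.
docs = {
    "memo-1": "Customer complaint about late delivery",
    "memo-2": "Delivery schedule for the next quarter",
    "memo-3": "Customer praise for fast support",
}
index = build_index(docs)
print(sorted(search(index, "customer delivery")))
```

Only "memo-1" contains both query terms, so it is the single hit for this query.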

Approaches in natural language processing

Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual approaches to annotation could handle. Several of these approaches are based upon the concept of online analytical processing, or OLAP, and may be supported by data models such as text cubes.^[15] Once document metadata is available through a data model, summaries of subsets of documents (i.e., cells within a text cube) may be generated with phrase-based approaches.^[16]
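The cube idea can be sketched in a few lines: documents are grouped into cells by their metadata dimensions, and each cell is summarized from the text it contains. This is a simplified stand-in, not the Text Cube or phrase-based summarization algorithms from the cited work. The documents, dimension names, and the use of raw term frequency as the summary are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical documents: two metadata dimensions plus free text.
documents = [
    {"year": 2017, "topic": "cardiology", "text": "heart failure risk heart rate"},
    {"year": 2017, "topic": "oncology", "text": "tumor growth tumor marker"},
    {"year": 2018, "topic": "cardiology", "text": "heart valve repair outcome"},
    {"year": 2018, "topic": "cardiology", "text": "valve repair heart surgery"},
]


def build_cube(docs, dimensions):
    """Group term counts into cells keyed by the chosen metadata dimensions."""
    cube = defaultdict(Counter)
    for doc in docs:
        cell = tuple(doc[dim] for dim in dimensions)
        cube[cell].update(doc["text"].split())
    return cube


def summarize(cube, cell, k=2):
    """Return the k most frequent terms in one cell (ties broken alphabetically)."""
    ranked = sorted(cube[cell].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


cube = build_cube(documents, ["year", "topic"])
print(summarize(cube, (2018, "cardiology")))

# Rolling up: dropping the "year" dimension aggregates across years.
by_topic = build_cube(documents, ["topic"])
print(summarize(by_topic, ("cardiology",), k=1))
```

Choosing a different list of dimensions corresponds to the OLAP roll-up and drill-down operations; the summarization step is where a phrase-mining method would replace the simple term count used here.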

Approaches in medicine and biomedical research

Biomedical research generates one major source of unstructured data, as researchers often publish their findings in scholarly journals. Though it is challenging to derive structural elements from the language in these documents (e.g., due to the complicated technical vocabulary contained within and the domain knowledge required to fully contextualize observations), the results of these activities may yield links between technical and medical studies^[17] and clues regarding new disease therapies.^[18] Recent efforts to impose structure upon biomedical documents include self-organizing map approaches for identifying topics among documents,^[19] general-purpose unsupervised algorithms,^[20] and an application of the CaseOLAP workflow^[16] to determine associations between protein names and cardiovascular disease topics in the literature.^[21] CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.^[21]
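To give a feel for what a phrase-category association looks like, the sketch below computes, for each disease category, the fraction of abstracts that mention each protein name. It is not the CaseOLAP scoring method, whose details are not given here. The abstracts, category labels, and the `mention_rates` function are hypothetical, and the substring matching is deliberately crude.

```python
from collections import defaultdict

# Hypothetical inputs: abstracts already labeled with a disease category,
# and a small dictionary of protein names to look for.
abstracts = [
    ("cardiomyopathy", "Fibronectin levels rise while collagen deposition increases."),
    ("cardiomyopathy", "Collagen remodeling was observed in the myocardium."),
    ("arrhythmia", "Fibronectin expression was unchanged in atrial tissue."),
    ("arrhythmia", "No extracellular matrix changes were detected."),
]
protein_names = ["collagen", "fibronectin"]


def mention_rates(labeled_texts, phrases):
    """For each category, compute the fraction of texts mentioning each phrase."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for category, text in labeled_texts:
        totals[category] += 1
        lowered = text.lower()
        for phrase in phrases:
            if phrase in lowered:  # simple case-insensitive substring match
                hits[category][phrase] += 1
    return {
        category: {p: hits[category][p] / totals[category] for p in phrases}
        for category in totals
    }


rates = mention_rates(abstracts, protein_names)
for category, scores in rates.items():
    print(category, scores)
```

The result is a small phrase-by-category table, which is the kind of structured output that can be compared across disease topics.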

References

1. Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals". Merrill Lynch. https://web.archive.org/web/20110724175845/http://ikt.hia.no/perep/eip_ind.pdf
2. Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis - Bridgepoints. Clarabridge. http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule
3. Gandomi, Amir; Haider, Murtaza (April 2015). "Beyond the hype: Big data concepts, methods, and analytics". International Journal of Information Management. 35 (2): 137–144. doi:10.1016/j.ijinfomgt.2014.10.007. ISSN 0268-4012.
4. "The biggest data challenges that you might not even know you have - Watson". Watson. 2016-05-25. https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/ Retrieved 2018-10-02.
5. "Structured vs. Unstructured Data". www.datamation.com. https://www.datamation.com/big-data/structured-vs-unstructured-data.html Retrieved 2018-10-02.
6. "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". www.emc.com. EMC Corporation. December 2012. http://www.emc.com/about/news/press/2012/20121211-01.htm
7. "Trends | Seagate US". Seagate.com. https://www.seagate.com/our-story/data-age-2025/ Retrieved 2018-10-01.
8. Grimes, Seth. "A Brief History of Text Analytics". B Eye Network. http://www.b-eye-network.com/view/6311 Retrieved June 24, 2016.
9. Albright, Russ. "Taming Text with the SVD". SAS. ftp://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf Retrieved June 24, 2016.
10. Desai, Manish (2009-08-09). "Applications of Text Analytics". My Business Analytics @ Blogspot. http://mybusinessanalytics.blogspot.com/2009/08/applications-of-text-analytics.html Retrieved June 24, 2016.
11. Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining". SAS. https://support.sas.com/resources/papers/proceedings14/1288-2014.pdf Retrieved June 24, 2016.
12. Holzinger, Andreas; Stocker, Christof; Ofner, Bernhard; Prohaska, Gottfried; Brabenetz, Alberto; Hofmann-Wellenhof, Rainer (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery – Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field". In Holzinger, Andreas; Pasi, Gabriella (eds.). Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24. doi:10.1007/978-3-642-39146-0_2. ISBN 978-3-642-39146-0.
13. "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?". InformationWeek. March 1, 2005. http://www.intelligententerprise.com/showArticle.jhtml?articleID=59301538
14. Malone, Robert (April 5, 2007). "Structuring Unstructured Data". Forbes. https://www.forbes.com/2007/04/04/teradata-solution-software-biz-logistics-cx_rm_0405data.html
15. Lin, Cindy Xide; Ding, Bolin; Han, Jiawei; Zhu, Feida; Zhao, Bo (December 2008). "Text Cube: Computing IR Measures for Multidimensional Text Database Analysis". 2008 Eighth IEEE International Conference on Data Mining. IEEE. doi:10.1109/icdm.2008.135. ISBN 9780769535029. CiteSeerX 10.1.1.215.3177. https://ieeexplore.ieee.org/document/4781199/?reload=true
16. Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes". http://sites.computer.org/debull/A16sept/p74.pdf
17. Collier, Nigel; Nazarenko, Adeline; Baud, Robert; Ruch, Patrick (June 2006). "Recent advances in natural language processing for biomedical applications". International Journal of Medical Informatics. 75 (6): 413–417. doi:10.1016/j.ijmedinf.2005.06.008. ISSN 1386-5056. PMID 16139564.
18. Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S. (January 2016). "Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery". Briefings in Bioinformatics. 17 (1): 33–42. doi:10.1093/bib/bbv087. ISSN 1477-4054. PMC 4719073. PMID 26420781.
19. Skupin, André; Biberstine, Joseph R.; Börner, Katy (2013). "Visualizing the topical structure of the medical sciences: a self-organizing map approach". PLOS One. 8 (3): e58779. doi:10.1371/journal.pone.0058779. ISSN 1932-6203. PMC 3595294. PMID 23554924.
20. Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna (2015-04-01). "Unsupervised discovery of information structure in biomedical documents". Bioinformatics. 31 (7): 1084–1092. doi:10.1093/bioinformatics/btu758. ISSN 1367-4811. PMID 25411329.
21. Liem, David A.; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard; Caufield, John H.; Wang, Wei; Ping, Peipei; Han, Jiawei (Oct 1, 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018. ISSN 1522-1539. PMC 6230912. PMID 29775406.