Entry | Data profiling |
Definition |
Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.[1] The purpose of these statistics may be to:
Introduction
Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data.[3] Profiling helps not only to understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata.[4][5] The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design.[3]

Data profiling takes place in a variety of settings. Most of the time, people are unaware that their data is being collected, for example when connecting to Wi-Fi in public locations, entering an email address in a contest, downloading an app, or filling out surveys. Data profiling gives researchers access to a comprehensive collection of data.

How data profiling is conducted
Data profiling uses descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation; aggregates such as count and sum; and additional metadata obtained during profiling, such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.[4][6][7] The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.

Different analyses are performed for different structural levels. For example, single columns can be profiled individually to understand the frequency distribution of different values and the type and use of each column. Embedded value dependencies can be exposed in a cross-column analysis.
Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.[4] Normally, purpose-built tools are used for data profiling to ease the process.[3][4][6][7][8][9] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.[5]

When data profiling is conducted
According to Kimball,[3] data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and DW/BI business requirements have been satisfied. The purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated.[3]

Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set.[3]

Data profiling may also be conducted in the data warehouse development process after data has been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that data cleaning and transformations have been done correctly and in compliance with requirements.
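
The three structural levels described under "How data profiling is conducted" (single column, cross-column, and inter-table) can be illustrated with a minimal Python sketch. It is not the method of any particular profiling tool: the function names (`profile_column`, `dependency_violations`, `value_overlap`), the use of `None` to represent a missing value, and the pattern notation (`9` for a digit, `A` for a letter) are all assumptions made for illustration.

```python
import re
import statistics
from collections import Counter


def string_pattern(value):
    """Map a string to a coarse pattern: each digit becomes 9, each letter A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Single-column profile; None is assumed to mark a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "is_unique": len(counts) == len(present),
        "mode": counts.most_common(1)[0][0] if counts else None,
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        # Numeric column: descriptive statistics.
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.mean(numeric)
        profile["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
    else:
        # Non-numeric column: length and typical string patterns.
        strings = [str(v) for v in present]
        profile["max_length"] = max((len(s) for s in strings), default=0)
        profile["patterns"] = Counter(string_pattern(s) for s in strings)
    return profile


def dependency_violations(rows, determinant, dependent):
    """Cross-column check: determinant values that map to several dependent values."""
    seen = {}
    for row in rows:
        seen.setdefault(row[determinant], set()).add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def value_overlap(child_values, parent_values):
    """Inter-table check: share of distinct child values found in the parent column.

    A result of 1.0 suggests, but does not prove, a foreign key relationship.
    """
    child = {v for v in child_values if v is not None}
    parent = {v for v in parent_values if v is not None}
    return len(child & parent) / len(child) if child else 0.0
```

For instance, `profile_column(["10115", "1011", "80331", None, "80331"])` reports one null and two string patterns (`99999` three times, `9999` once), which flags the four-digit value as a possible varying representation or typo. Note that the cost grows with the structural level: the single-column profile is one pass per column, whereas checking every column pair or every pair of tables multiplies the work, which is why the text treats performance as an evaluation criterion.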

Benefits and examples
The benefits of data profiling are to improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of data.[9] Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.[5] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.[9]

One example of data profiling is its use in health tracking. Data is collected from apps and other media outlets to build a general understanding of the health and well-being of the population. Apps collect data on topics such as fitness, menstrual cycles, mental health, and health conditions such as diabetes, cardiovascular failure, and obesity. The statistics gathered from these platforms are then used to draw on the perspectives and experiences of many users. Health care professionals can use this information to identify the most common health situations among users. It can also indicate whether using the app is improving the health of patients and what more can be done to assist them. It allows those in health care to tailor the app to the needs of patients and to see whether the app truly helps them. One concern is the tampering of information; however, assuming the majority of users enter correct information, the outcome will typically balance out.

Data profiling tools
Some data profiling tools are free software; many, but not all, of the free tools are open source projects. In general, their functionality is more limited than that of commercial products, and they may not offer free telephone or online support. Furthermore, their documentation is not always thorough.
However, some small companies still use these free tools instead of expensive commercial software, considering the benefits that free tools provide.[10]

References
1. Johnson, Theodore (2009). "Data Profiling". Encyclopedia of Database Systems. Heidelberg: Springer.
2. Woodall, Philip; Oberhofer, Martin; Borek, Alexander (2014). "A classification of data quality assessment and improvement methods". International Journal of Information Quality. 3 (4). doi:10.1504/ijiq.2014.068656. http://www.inderscience.com/link.php?id=68656
3. Kimball, Ralph; et al. (2008). The Data Warehouse Lifecycle Toolkit (Second ed.). Wiley. p. 376. ISBN 9780470149775.
4. Loshin, David (2009). Master Data Management. Morgan Kaufmann. pp. 94–96. ISBN 9780123742254.
5. Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Getting Onboard with Emerging IT. Morgan Kaufmann. pp. 110–111. ISBN 9781558609167.
6. Rahm, Erhard; Hai Do, Hong (December 2000). "Data Cleaning: Problems and Current Approaches". Bulletin of the Technical Committee on Data Engineering. IEEE Computer Society. 23 (4).
7. Singh, Ranjit; Singh, Kawaljeet; et al. (May 2010). "A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing". IJCSI International Journal of Computer Science Issue. Series 2. 7 (3).
8. Kimball, Ralph (2004). "Kimball Design Tip #59: Surprising Value of Data Profiling". Kimball Group. http://www.kimballgroup.com/wp-content/uploads/2012/05/DT59SurprisingValue.pdf
9. Olson, Jack E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann. pp. 140–142.
10. Dai, Wei; Wardlaw, Isaac. "Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking". Information Technology, New Generations. pp. 439–450. https://www.researchgate.net/publication/301632599_Data_Profiling_Technology_of_Data_Governance_Regarding_Big_Data_Review_and_Rethinking

Categories: Data analysis, Data management, Data quality |