This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company's textual description. A number of natural language processing strategies have been tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. This is a highly unbalanced dataset, where the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that make it possible to predict its area of activity, expressed by its corresponding categories, with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for the automatic classification of texts, not only company descriptions but also other texts, such as web pages, text blogs, and news pages.

We live in a digital society where data grows day by day, most of it consisting of unstructured textual data. This creates the need to process all this data in order to collect useful information from it. Text classification may be considered a relatively simple task, but it plays a fundamental role in a variety of systems that process textual data. E-mail spam detection is one of the best-known applications of text classification, where the main goal is to automatically assign one of two possible labels (spam or ham) to each message. Other well-known text classification tasks, nowadays receiving increasing importance, include sentiment analysis and emotion detection, which consist of assigning a positive/negative sentiment or an emotion to a text.

Crunchbase is the largest company database in the world, containing a large variety of up-to-date information about each company. Founded in 2007 by Michael Arrington, it originally served as the data store for its parent company, TechCrunch. Until 2015, TechCrunch was the owner of the Crunchbase data, but in that year Crunchbase decoupled itself from TechCrunch to focus on its own products. The Crunchbase database contains up-to-date details about over 600,000 companies, including a short description, a detailed description, number of employees, headquarters regions, contacts, market share, and the current areas of activity.

This paper compares different approaches for multilabel text classification, using recent information collected from Crunchbase. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the set of associated categories based solely on the company's textual description. In order to address the multilabel problem, two classification strategies have been tested using different classification methods: a) we have created 46 binary models, one for each of the categories, where the set of categories for a given description is obtained by combining the results of the 46 models; b) we have created a single model that gives the most probable categories for a given description. The resulting models may constitute an important asset for the automatic classification of texts that can be applied not only to company descriptions but also to other texts, such as web pages, text blogs, and news pages.
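
The two classification strategies described above, one binary model per category versus a single model that ranks categories, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy descriptions and the three category names are made up (they are not Crunchbase data), a small hand-written Naive Bayes classifier stands in for the methods actually tested (such as SVM and Fuzzy Fingerprints), and precision and recall are micro-averaged over label sets, which may differ from the averaging used in the experiments.

```python
import math
from collections import Counter, defaultdict

# Made-up toy data: (description, set of categories). Not Crunchbase data.
TRAIN = [
    ("mobile payments app for small merchants", {"Finance", "Mobile"}),
    ("online bank offering loans and payments", {"Finance"}),
    ("mobile game studio building puzzle games", {"Games", "Mobile"}),
    ("console and pc game publisher", {"Games"}),
]
CATEGORIES = sorted({c for _, cats in TRAIN for c in cats})


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def log_scores(self, doc):
        n_docs = sum(self.doc_counts.values())
        n_labels = len(self.doc_counts)
        scores = {}
        for label, n_label in self.doc_counts.items():
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log((n_label + 1) / (n_docs + n_labels))
            for tok in tokenize(doc):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / total)
            scores[label] = score
        return scores


# Strategy a): one binary model per category; combine the positive answers.
def train_binary_models(train):
    docs = [doc for doc, _ in train]
    return {
        cat: NaiveBayes().fit(docs, [cat in cats for _, cats in train])
        for cat in CATEGORIES
    }


def predict_set(models, doc):
    predicted = set()
    for cat, model in models.items():
        scores = model.log_scores(doc)
        if scores.get(True, -math.inf) > scores.get(False, -math.inf):
            predicted.add(cat)
    return predicted


# Strategy b): a single model that ranks all categories for a description.
def train_single_model(train):
    pairs = [(doc, cat) for doc, cats in train for cat in cats]
    return NaiveBayes().fit([d for d, _ in pairs], [c for _, c in pairs])


def predict_top(model, doc, k=1):
    scores = model.log_scores(doc)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_recall(gold_sets, predicted_sets):
    """Micro-averaged precision and recall over predicted label sets."""
    hits = sum(len(g & p) for g, p in zip(gold_sets, predicted_sets))
    n_pred = sum(len(p) for p in predicted_sets)
    n_gold = sum(len(g) for g in gold_sets)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    return precision, recall


binary_models = train_binary_models(TRAIN)
single_model = train_single_model(TRAIN)
print(predict_set(binary_models, "mobile game app"))
print(predict_top(single_model, "game publisher", k=1))
```

With strategy a) the number of predicted categories falls out of the individual binary decisions, whereas strategy b) needs a cutoff (here the hypothetical parameter `k`) to decide how many of the ranked categories to keep.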