Tullio Facchinetti
page 1 / 2
27 results tagged text_mining
Headliner — Easy training and deployment of seq2seq models https://medium.com/axel-springer-tech/headliner-easy-training-and-deployment-of-seq2seq-models-2a26508b4dae
Thu 06 Feb 2020 04:51:19 PM CET

At Axel Springer, Europe’s largest digital publishing house, we own a lot of news articles from various media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, and it is not surprising that journalists tend to spend a fair amount of their time coming up with a good one. For this reason, it was an interesting research question for us at Axel Springer AI whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). This could, for example, serve our journalists as inspiration for creating SEO titles, which they often don’t have time for (in fact we’re working together with our colleagues from SPRING and AWS on creating an SEO title generator).

article data_mining text-processing text_generation text_mining
GATE - General Architecture for Text Engineering https://gate.ac.uk/
Sun 02 Feb 2020 02:02:45 PM CET
  • GATE is an open source software toolkit capable of solving almost any text processing problem
  • It has a mature and extensive community of developers, users, educators, students and scientists
  • It is used by corporations, SMEs, research labs and Universities worldwide
  • It has a world-class team of language processing developers
homepage software text_mining
tokenizers - Fast State-of-the-Art Tokenizers https://github.com/huggingface/tokenizers
Tue 14 Jan 2020 06:48:48 AM CET

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
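
The alignment tracking and pre-processing steps listed above can be illustrated with a minimal standard-library sketch. It does not use the tokenizers library or its API: the function names, the whitespace splitting rule, and the `[CLS]`/`[SEP]`/`[PAD]` token strings are assumptions chosen for illustration only.

```python
import re
from typing import List, Tuple

# A token paired with its (start, end) character offsets in the original text.
Token = Tuple[str, Tuple[int, int]]


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split on whitespace, lowercase each token (a stand-in for
    normalization), and keep the offsets of the original span."""
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\S+", text)]


def prepare(tokens: List[str], max_len: int,
            cls: str = "[CLS]", sep: str = "[SEP]",
            pad: str = "[PAD]") -> List[str]:
    """Truncate, add special tokens, and pad to exactly max_len items."""
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")
    body = tokens[: max_len - 2]          # truncate, reserving two slots
    framed = [cls] + body + [sep]         # add the special tokens
    return framed + [pad] * (max_len - len(framed))  # pad on the right


text = "Hello  World"
tokens = tokenize_with_offsets(text)
# Offsets let us recover the original (un-normalized) text of a token.
start, end = tokens[1][1]
original_span = text[start:end]

short = prepare([t for t, _ in tokens], max_len=6)
long = prepare(["a", "b", "c", "d", "e", "f"], max_len=5)
```

Here `original_span` is the un-normalized substring `"World"` even though the stored token is lowercased, which is the idea behind keeping alignments during normalization.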
coding_lang:rust library opensource software text_manipulation text_mining
magnitude - A fast, efficient universal vector embedding utility package https://github.com/plasticityai/magnitude
Mon 06 Jan 2020 01:53:35 PM CET

A fast, efficient universal vector embedding utility package.

coding_lang:python library NLP opensource software text_mining
Language-games - Dead simple games made with word vectors. https://github.com/Hellisotherpeople/Language-games
Mon 06 Jan 2020 11:13:17 AM CET

Dead simple games made with word vectors.

#cli-app coding_lang:python games opensource software text_mining
Document Clustering with Python http://brandonrose.org/clustering_mobile
Sun 15 Dec 2019 04:27:24 PM CET

A guide to document clustering with Python

jupyter machine_learning python research text_mining
stopwords detector https://github.com/amarallab/stopwords
Mon 02 Dec 2019 09:28:27 PM CET
algorithm coding_lang:python source_code text_mining
Tutorial: Extracting Keywords with TF-IDF and Python's Scikit-Learn https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/
Sat 02 Nov 2019 02:00:22 AM CET

Learn how to use TF-IDF and scikit-learn to extract important keywords from documents. This is a full working example using the Stack Overflow dataset.
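
As a rough illustration of the idea, the following standard-library sketch ranks the terms of each document by a plain TF-IDF score. It is not the tutorial's code and does not use scikit-learn: the function name, the toy documents, and the unsmoothed weighting `tf * log(N / df)` are assumptions, and scikit-learn's own weighting differs (it applies smoothing and normalization by default).

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], top_k: int = 3) -> List[List[str]]:
    """Return the top_k terms of each tokenized document by TF-IDF score."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        if not doc:
            results.append([])
            continue
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Toy documents, invented for illustration.
docs = [
    "python error in python loop".split(),
    "java error in java class".split(),
    "sql error in sql query".split(),
]
keywords = tfidf_keywords(docs, top_k=1)
```

Terms that occur in every document ("error", "in") get a score of zero, so the term that is frequent in one document and absent from the others comes out on top.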

article python text_mining tutorial
Topical Keyphrase Extraction with Hierarchical Semantic Networks https://arxiv.org/abs/1910.07848
Sun 20 Oct 2019 07:24:43 PM CEST

Topical keyphrase extraction is used to summarize large collections of text documents. However, traditional methods cannot properly reflect the intrinsic semantics and relationships of keyphrases because they rely on a simple term-frequency-based process. Consequently, these methods are not effective in obtaining significant contextual knowledge. To resolve this, we propose a topical keyphrase extraction method based on a hierarchical semantic network and multiple centrality network measures that together reflect the hierarchical semantics of keyphrases. We conduct experiments on real data to examine the practicality of the proposed method and to compare its performance with that of existing topical keyphrase extraction methods. The results confirm that the proposed method outperforms state-of-the-art topical keyphrase extraction methods in terms of the representativeness of the selected keyphrases for each topic. The proposed method can effectively reflect intrinsic keyphrase semantics and interrelationships.

paper research text_mining
A comprehensive guide to extracting keywords from text https://monkeylearn.com/keyword-extraction/
Wed 09 Oct 2019 03:47:24 PM CEST

Keyword extraction (also known as keyword detection or keyword analysis) is a text analysis technique that consists of automatically extracting the most important words and expressions in a text.

It helps summarize the content of a text and recognize the main topics which are being discussed.

article research text_mining
TOM: A library for topic modeling and browsing https://github.com/AdrienGuille/TOM
Sun 04 Aug 2019 09:41:17 AM CEST

TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license.

Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus. It also offers a common interface for two topic models (namely LDA using either variational inference or Gibbs sampling, and NMF using alternating least-square with a projected gradient method), and implements three state-of-the-art methods for estimating the optimal number of topics to model a corpus. What is more, TOM constructs an interactive Web-based browser that makes it easy to explore a topic model and the related corpus.

coding_lang:python library machine_learning opensource software source_code text_mining
Automation in Systematic, Scoping and Rapid Reviews by an NLP Toolkit: A Case Study in Enhanced Living Environments https://link.springer.com/chapter/10.1007%2F978-3-030-10752-9_1
Fri 02 Aug 2019 06:34:56 PM CEST

With the increasing number of scientific publications, the analysis of the trends and the state-of-the-art in a certain scientific field is becoming a very time-consuming and tedious task. In response to urgent needs of information, for which the existing systematic review model is not well suited, several other review types have emerged, namely the rapid review and the scoping review.

The paper proposes an NLP powered tool that automates most of the review process by automatic analysis of articles indexed in the IEEE Xplore, PubMed, and Springer digital libraries. We demonstrate the applicability of the toolkit by analyzing articles related to Enhanced Living Environments and Ambient Assisted Living, in accordance with the PRISMA surveying methodology. The relevant articles were processed by the NLP toolkit to identify articles that contain up to 20 properties clustered into 4 logical groups.

The analysis showed increasing attention from the scientific communities towards Enhanced and Assisted living environments over the last 10 years and showed several trends in the specific research topics that fall into this scope. The case study demonstrates that the NLP toolkit can ease and speed up the review process and show valuable insights from the surveyed articles even without manually reading most of the articles. Moreover, it pinpoints the most relevant articles, which contain more properties, and therefore significantly reduces the manual work, while also generating informative tables, charts and graphs.

machine_learning NLP paper research systematic_literature_review text_mining
mat2vec: Supplementary Materials for "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019). https://github.com/materialsintelligence/mat2vec
Fri 02 Aug 2019 06:33:34 PM CEST

Supplementary Materials for the paper Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).

research source_code text_mining
Topic Modeling in Python: Latent Dirichlet Allocation (LDA) https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
Fri 02 Aug 2019 06:29:18 PM CEST

In a nutshell, a topic model is a type of statistical model used for tagging abstract “topics” that occur in a collection of documents and that best represent the information in them.

Many techniques are used to obtain topic models. This post aims to demonstrate the implementation of LDA: a widely used topic modeling technique.
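
To give a feel for what fitting LDA involves, here is a small standard-library sketch of collapsed Gibbs sampling, one common way to estimate an LDA model. It is not the linked post's implementation; the function name, the hyperparameter defaults, and the toy corpus are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[str]], n_topics: int, n_iters: int = 100,
              alpha: float = 0.1, beta: float = 0.01,
              seed: int = 0) -> Tuple[List[str], List[List[float]], List[List[int]]]:
    """Fit LDA by collapsed Gibbs sampling.

    Returns the vocabulary, per-document topic proportions, and the
    topic-by-word count matrix."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return vocab, theta, topic_word


# Toy corpus, invented for illustration.
corpus = [
    "cat dog cat pet dog".split(),
    "stock market stock price".split(),
    "dog pet cat".split(),
    "price market trade stock".split(),
]
vocab, theta, topic_word = lda_gibbs(corpus, n_topics=2, n_iters=50)
```

Each row of `theta` is a probability distribution over topics for one document, and each row of `topic_word` counts how often each vocabulary word was assigned to that topic.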

article machine_learning models python research science text_mining topic_modeling work
pyLDAvis - Interactive topic model visualization https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
Fri 02 Aug 2019 06:24:12 PM CEST

pyLDAvis is a Python library for interactive topic model visualization. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. pyLDAvis makes it easy to use the visualization from Python and, in particular, Jupyter notebooks.

To learn more about the method behind the visualization, it is possible to read the original paper explaining it.

This notebook provides a quick overview of how to use pyLDAvis.

coding_lang:python jupyter library machine_learning models notebook python source_code text_mining
Building a Text Analytics App in Python with Flask, Requests, BeautifulSoup, and TextBlob https://thecodinginterface.com/blog/text-analytics-app-with-flask-and-textblob/
Mon 29 Jul 2019 05:04:08 AM CEST

This article shows how to build a Python and Flask based web application for performing text analytics on internet resources such as blog pages. To perform the text analytics, I will use Requests to fetch web pages, BeautifulSoup to parse the HTML and extract the viewable text, and the TextBlob package to calculate a few sentiment scores.
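
The "extract the viewable text" step can be sketched with the standard library alone. This is not the article's code: it swaps BeautifulSoup for Python's built-in `html.parser`, and the class name, helper function, and sample page are assumptions for illustration. Fetching and sentiment scoring are left out.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-visible text of an HTML document as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Sample page, invented for illustration.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>great</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = visible_text(page)
```

The resulting string is what would then be handed to a sentiment scorer such as TextBlob.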

article machine_learning programming python text_mining
AI Trained on Old Scientific Papers Makes Discoveries Humans Missed https://www.vice.com/en_us/article/neagpb/ai-trained-on-old-scientific-papers-makes-discoveries-humans-missed
Wed 10 Jul 2019 07:07:29 PM CEST

Scientists used machine learning to reveal new scientific knowledge hidden in old research papers.

Using just the language in millions of old scientific papers, a machine learning algorithm was able to make completely new scientific discoveries.

In a study published in Nature on July 3, researchers from the Lawrence Berkeley National Laboratory used an algorithm called Word2Vec to sift through scientific papers for connections humans had missed. Their algorithm then spit out predictions for possible thermoelectric materials, which convert heat to energy and are used in many heating and cooling applications.

article research systematic_literature_review text_mining
Unsupervised word embeddings capture latent knowledge from materials science literature | Nature https://www.nature.com/articles/s41586-019-1335-8
Fri 05 Jul 2019 09:30:52 AM CEST

Natural language processing algorithms applied to three million materials science abstracts uncover relationships between words, material compositions and properties, and predict potential new thermoelectric materials.

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
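
The way word embeddings can be used to rank candidate materials against a property word can be sketched as a cosine-similarity lookup. This is only an illustration of the general idea, not the paper's method or data: the three-dimensional vectors below are made up, and real embeddings are learned from text and have hundreds of dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query: str,
                       embeddings: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word."""
    q = embeddings[query]
    scored = [
        (word, cosine(q, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: -pair[1])


# Toy vectors, invented for illustration (not learned from any corpus).
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "NaCl": [0.0, 0.1, 0.9],
    "steel": [0.1, 0.9, 0.2],
}
ranked = rank_by_similarity("thermoelectric", embeddings)
```

With these toy vectors, the word whose vector points in nearly the same direction as the query ends up first in `ranked`.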

data_science paper research science systematic_literature_review techniques text_mining
Automated Keyword Extraction from Articles using NLP https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
Sat 15 Jun 2019 04:34:18 PM CEST

In research & news articles, keywords form an important component since they provide a concise representation of the article’s content. Keywords also play a crucial role in locating the article from information retrieval systems, bibliographic databases and for search engine optimization. Keywords also help to categorize the article into the relevant subject or discipline.

Conventional approaches of extracting keywords involve manual assignment of keywords based on the article content and the authors’ judgment. This involves a lot of time & effort and also may not be accurate in terms of selecting the appropriate keywords. With the emergence of Natural Language Processing (NLP), keyword extraction has evolved into being effective as well as efficient.

And in this article, we will combine the two — we’ll be applying NLP on a collection of articles (more on this below) to extract keywords.

article bibliometry machine_learning NLP python research text_mining
GitHub - mpuig/textclassification: A brief overview of how to use fastText to train powerful text classifiers in a python notebook. https://github.com/mpuig/textclassification
Wed 26 Sep 2018 09:26:53 PM CEST

A brief overview of how to use fastText to train powerful text classifiers in a Python notebook.

coding_lang:python library machine_learning source_code text_mining tutorial