128 private links
At Axel Springer, Europe’s largest digital publishing house, we own a lot of news articles from various media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, and it is not surprising that journalists tend to spend a fair amount of their time to come up with a good one. For this reason, it was an interesting research question for us at Axel Springer AI whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). This could, for example, serve our journalists as inspiration for creating SEO titles, which our journalists often don’t have time for (in fact we’re working together with our colleagues from SPRING and AWS on creating a SEO title generator).
- GATE is an open source software toolkit capable of solving almost any text processing problem
- It has a mature and extensive community of developers, users, educators, students and scientists
- It is used by corporations, SMEs, research labs and Universities worldwide
- It has a world-class team of language processing developers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
A fast, efficient universal vector embedding utility package.
Dead simple games made with word vectors.
A guide to document clustering with Python
Learn how to use TF-IDF and scikit-learn to extract important keywords from documents. This is a full working example using the Stack Overflow dataset.
Topical keyphrase extraction is used to summarize large collections of text
documents. However, traditional methods cannot properly reflect the intrinsic
semantics and relationships of keyphrases because they rely on a simple
term-frequency-based process. Consequently, these methods are not effective in
obtaining significant contextual knowledge. To resolve this, we propose a
topical keyphrase extraction method based on a hierarchical semantic network
and multiple centrality network measures that together reflect the hierarchical
semantics of keyphrases. We conduct experiments on real data to examine the
practicality of the proposed method and to compare its performance with that of
existing topical keyphrase extraction methods. The results confirm that the
proposed method outperforms state-of-the-art topical keyphrase extraction
methods in terms of the representativeness of the selected keyphrases for each topic. The proposed method can effectively reflect intrinsic keyphrase semantics and interrelationships.
Keyword extraction (also known as keyword detection or keyword analysis) is a text analysis technique that consists of automatically extracting the most important words and expressions in a text.
It helps summarize the content of a text and recognize the main topics which are being discussed.
TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license.
Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus. It also offers a common interface for two topic models (namely LDA using either variational inference or Gibbs sampling, and NMF using alternating least-square with a projected gradient method), and implements three state-of-the-art methods for estimating the optimal number of topics to model a corpus. What is more, TOM constructs an interactive Web-based browser that makes it easy to explore a topic model and the related corpus.
With the increasing number of scientific publications, the analysis of the trends and the state-of-the-art in a certain scientific field is becoming very time-consuming and tedious task. In response to urgent needs of information, for which the existing systematic review model does not well, several other review types have emerged, namely the rapid review and scoping reviews.
The paper proposes an NLP powered tool that automates most of the review process by automatic analysis of articles indexed in the IEEE Xplore, PubMed, and Springer digital libraries. We demonstrate the applicability of the toolkit by analyzing articles related to Enhanced Living Environments and Ambient Assisted Living, in accordance with the PRISMA surveying methodology. The relevant articles were processed by the NLP toolkit to identify articles that contain up to 20 properties clustered into 4 logical groups.
The analysis showed increasing attention from the scientific communities towards Enhanced and Assisted living environments over the last 10 years and showed several trends in the specific research topics that fall into this scope. The case study demonstrates that the NLP toolkit can ease and speed up the review process and show valuable insights from the surveyed articles even without manually reading of most of the articles. Moreover, it pinpoints the most relevant articles which contain more properties and therefore, significantly reduces the manual work, while also generating informative tables, charts and graphs.
Supplementary Materials for the paper Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
In a nutshell, it is a type of statistical model used for tagging abstract “topics” that occur in a collection of documents that best represents the information in them.
Many techniques are used to obtain topic models. This post aims to demonstrate the implementation of LDA: a widely used topic modeling technique.
pyLDAvis
is a python library for interactive topic model visualization. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. pyLDAvis
makes it easy to use the visualiziation from Python and, in particular, Jupyter notebooks.
To learn more about the method behind the visualization, it is possible to read the original paper explaining it.
This notebook provides a quick overview of how to use pyLDAvis
.
This article introduces how to build a Python and Flask based web application for performing text analytics on internet resources such as blog pages. To perform text analytics I will utilizing Requests for fetching web pages, BeautifulSoup for parsing html and extracting the viewable text and, apply the TextBlob package to calculate a few sentiment scores.
Scientists used machine learning to reveal new scientific knowledge hidden in old research papers.
Using just the language in millions of old scientific papers, a machine learning algorithm was able to make completely new scientific discoveries.
In a study published in Nature on July 3, researchers from the Lawrence Berkeley National Laboratory used an algorithm called Word2Vec sift through scientific papers for connections humans had missed. Their algorithm then spit out predictions for possible thermoelectric materials, which convert heat to energy and are used in many heating and cooling applications.
Natural language processing algorithms applied to three million materials science abstracts uncover relationships between words, material compositions and properties, and predict potential new thermoelectric materials.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
In research & news articles, keywords form an important component since they provide a concise representation of the article’s content. Keywords also play a crucial role in locating the article from information retrieval systems, bibliographic databases and for search engine optimization. Keywords also help to categorize the article into the relevant subject or discipline.
Conventional approaches of extracting keywords involve manual assignment of keywords based on the article content and the authors’ judgment. This involves a lot of time & effort and also may not be accurate in terms of selecting the appropriate keywords. With the emergence of Natural Language Processing (NLP), keyword extraction has evolved into being effective as well as efficient.
And in this article, we will combine the two — we’ll be applying NLP on a collection of articles (more on this below) to extract keywords.
A brief overview of how to use fastText to train powerful text classifiers in a python notebook. - mpuig/textclassification