Tullio Facchinetti
14 results tagged text_manipulation
rapidfuzz - Rapid fuzzy string matching in Python and C++ using the Levenshtein Distance https://github.com/rhasspy/rapidfuzz
Mon 30 Mar 2020 10:59:26 PM CEST

library opensource software source_code text-processing text_manipulation
ydiff - View colored, incremental diff https://github.com/ymattw/ydiff
Mon 10 Feb 2020 04:44:52 PM CET

View colored, incremental diff in workspace or from stdin with side by side and auto pager support (was "cdiff").

#cli-app opensource software source_code text_manipulation tools utility
GNU Recutils https://www.gnu.org/software/recutils/manual/
Sun 26 Jan 2020 08:51:48 PM CET

GNU recutils is a set of tools and libraries to access human-editable, text-based databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields. Advanced capabilities usually found in other data storage systems are supported: data types, data integrity (keys, mandatory fields, etc.) as well as the ability of records to refer to other records (sort of foreign keys). Despite their simplicity, recfiles can be used to store medium-sized databases.
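
To make the "sequence of records with named fields" idea concrete, here is a minimal Python sketch that reads a simplified recfile. It is not part of recutils: `parse_recfile` and the sample data are hypothetical, and the sketch assumes only `Name: value` lines, `#` comments, and blank lines between records. It does not handle record descriptors, continuation lines, types, or keys.

```python
def parse_recfile(text):
    """Parse a simplified recfile into a list of {field name: [values]} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("#"):      # comment line, ignored
            continue
        if not line.strip():          # a blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        name, _, value = line.partition(":")
        # A field name may repeat within a record, so keep a list of values.
        current.setdefault(name.strip(), []).append(value.strip())
    if current:                       # last record may lack a trailing blank line
        records.append(current)
    return records


# Hypothetical sample data, for illustration only.
sample = """\
# A tiny book database
Title: GNU Emacs Manual
Author: Richard M. Stallman

Title: The Colour of Magic
Author: Terry Pratchett
Location: loaned
"""

books = parse_recfile(sample)
for book in books:
    print(book["Title"][0], "by", book["Author"][0])
```

For real data, the recutils command-line tools are the better choice, since they implement the full format.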

#cli-app data_structure database file_format text text_manipulation utility
Parsr - Transforms PDF, Documents and Images into Enriched Structured Data https://github.com/axa-group/Parsr
Tue 14 Jan 2020 06:49:38 AM CET

Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data for data scientists and developers.

It provides users with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.

Currently, Parsr can perform:

  • Document Hierarchy Regeneration - Words, Lines and Paragraphs
  • Headings Detection
  • Table Detection and Reconstruction
  • Lists Detection
  • Text Order Detection
  • Named Entity Recognition (Dates, Percentages, etc)
  • Key-Value Pair Detection (for the extraction of specific form-based entries)
  • Page Number Detection
  • Header-Footer Detection
  • Link Detection
  • Whitespace Removal
coding_lang:python framework opensource software source_code text-processing text_manipulation
tokenizers - Fast State-of-the-Art Tokenizers https://github.com/huggingface/tokenizers
Tue 14 Jan 2020 06:48:48 AM CET

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
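
The alignment-tracking and pre-processing features listed above can be illustrated with a small pure-Python sketch. This is not the tokenizers library API: `tokenize_with_offsets`, `prepare`, and the `[CLS]`, `[SEP]`, `[PAD]` token strings are assumptions chosen for illustration, and the "normalization" here is only lowercasing.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens.

    Each token is lowercased (a stand-in for normalization) and paired with
    the (start, end) span it came from in the original, unnormalized text.
    """
    return [(m.group().lower(), m.span())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def prepare(tokens, length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Truncate, add special tokens, and pad to exactly `length` items."""
    body = tokens[:max(0, length - 2)]      # leave room for the two specials
    framed = [cls] + body + [sep]
    return framed + [pad] * (length - len(framed))


sentence = "Hello, World!"
encoded = tokenize_with_offsets(sentence)
tokens = [tok for tok, _ in encoded]

# The offsets recover the original text behind a normalized token.
tok, (start, end) = encoded[2]
print(tok, "->", sentence[start:end])

print(prepare(tokens, 8))
```

Keeping the spans next to the tokens is what lets a caller map any token back to the original sentence, even after normalization has changed its characters.
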
coding_lang:rust library opensource software text_manipulation text_mining
Levenshtein distance - Wikipedia https://en.m.wikipedia.org/wiki/Levenshtein_distance
Fri 23 Aug 2019 08:33:22 PM CEST

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
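
The definition above translates directly into a dynamic-programming routine. The following is a minimal Python sketch that keeps only two rows of the edit table; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The routine takes time proportional to the product of the two lengths and memory proportional to the length of `b`.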

algorithm article text_manipulation
homer: a text analyser in Python, can help make your text more clear, simple and useful for your readers. https://github.com/wyounas/homer
Sat 17 Aug 2019 02:31:04 PM CEST

Homer, a text analyser in Python, can help make your text more clear, simple and useful for your readers.

analytics coding_lang:python opensource software source_code text_manipulation
peco - Interactive filtering tool https://www.linuxlinks.com/excellent-utilities-peco-interactive-filtering-tool/
Mon 12 Aug 2019 07:40:09 AM CEST

peco is a CLI utility that filters text interactively. The tool is written in the Go programming language. It's free and open source software.

#cli-app article coding_lang:go console interactive linux software terminal text_manipulation tutorial utility
Regex tutorial for Linux (Sed & AWK) examples - Like Geeks https://likegeeks.com/regex-tutorial-linux/amp/
Mon 03 Dec 2018 04:53:08 AM CET

To work successfully with the Linux sed editor and the awk command in your shell scripts, you have to understand regular expressions, or regex for short. Since there are many regex engines, we will use the shell regex and see the power of bash in working with regex.

First, we need to understand what regex is, then we will see how to use it.

The table of contents includes:

What is regex, Types of regex, Define BRE Patterns, Special Characters, Anchor Characters, The dot Character, Character Classes, Negating Character Classes, Using Ranges, Special Character Classes, The Asterisk, Extended Regular Expressions, The question mark, The Plus Sign, Curly Braces, Pipe Symbol, Grouping Expressions, Practical examples, Counting Directory Files, Validating E-mail Address.
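
Several of the listed topics (character classes, the plus sign, curly braces, e-mail validation) can be seen together in one short pattern. The sketch below uses Python's `re` module rather than sed or awk, and the pattern is a deliberately simplified assumption, not the one from the tutorial and not a complete e-mail address validator.

```python
import re

# Simplified pattern:
#   [A-Za-z0-9._%+-]+   character class, one or more times (the plus sign)
#   @                   a literal at sign
#   [A-Za-z0-9.-]+      domain labels
#   \.[A-Za-z]{2,}      an escaped dot, then at least two letters (curly braces)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(candidate):
    """Return True if the whole string matches the simplified pattern."""
    # fullmatch plays the role of the ^ and $ anchors around the pattern.
    return EMAIL.fullmatch(candidate) is not None


for candidate in ["user@example.com", "user@example", "no-at-sign.example.com"]:
    print(candidate, looks_like_email(candidate))
```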

5_stars regex text_manipulation tutorial
Turn Vim Into Excel: Tips for Editing Tabular Data http://alangrow.com/blog/turn-vim-into-excel-tips-for-tabular-data-editing
Mon 03 Dec 2018 04:07:04 AM CET

The author tried to edit data in spreadsheet programs.

This post illustrates how to use Vim to edit tabular data, along with a few things that make it more pleasant. It assumes the files being edited are in tab-separated values (TSV) format.

"But what about CSV files?" Just. Don't.

Do: convert your CSV to TSV and back for editing.
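
One way to do that round trip is with Python's standard `csv` module. This is a sketch, not the method from the post; the `convert` helper and the sample data are hypothetical.

```python
import csv
import io


def convert(text, src_delim, dst_delim):
    """Re-emit delimited text using a different field delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=src_delim)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=dst_delim, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical sample: the first data field contains a comma, so CSV quotes it.
csv_text = 'name,city\n"Doe, Jane",Pavia\n'

tsv_text = convert(csv_text, ",", "\t")    # edit this form in Vim
round_trip = convert(tsv_text, "\t", ",")  # convert back when done

print(tsv_text)
print(round_trip == csv_text)
```

Fields that themselves contain tabs or newlines are still quoted in the TSV output, so this helps most when the data has neither.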

article csv post text_manipulation tutorial vim
texar: Toolkit for Text Generation and Beyond https://github.com/asyml/texar
Mon 10 Sep 2018 09:39:33 PM CEST

Toolkit for Text Generation and Beyond.

coding_lang:python library machine_learning NLP opensource programming source_code tensorflow text_generation text_manipulation
TextBlob: Simplified Text Processing https://textblob.readthedocs.io/en/dev/index.html
Wed 22 Aug 2018 03:36:15 AM CEST

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
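
Two of the listed features, word frequencies and n-grams, are simple enough to sketch with the Python standard library. This is not TextBlob's API: `words` and `ngrams` are hypothetical helpers, and the tokenization is a naive assumption (lowercase letters and apostrophes only).

```python
import re
from collections import Counter


def words(text):
    """Naive tokenization: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = words(text)

frequencies = Counter(tokens)
print(frequencies["the"], frequencies["dog"])  # 3 2
print(ngrams(tokens[:4], 2))
```
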
coding_lang:python docs library machine_learning NLP text_manipulation text_mining
glitchcat - Cat-like program with glitch animation https://github.com/kuviman/glitchcat
Tue 21 Aug 2018 11:20:35 PM CEST

Cat-like program with glitch animation.

#cli-app coding_lang:rust funny opensource software source_code text_manipulation
glitchcat - Creating CLI apps in Rust is super easy https://blog.kuviman.com/2018/07/20/glitchcat.html
Thu 09 Aug 2018 10:23:32 AM CEST

glitchcat is a cat-like program with glitch animation.

#cli-app article funny opensource rust text_manipulation