A near drop-in replacement for cut that can use a regex delimiter instead of a fixed string. Additionally, this tool lets you specify the order of the output columns using the same column selection syntax as cut (see below for examples).
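To illustrate the idea of a regex delimiter combined with reordered output columns, here is a minimal Python sketch. The function name `regex_cut` and its parameters are hypothetical and are not taken from the tool itself, which has its own command-line syntax.

```python
import re

def regex_cut(line, delimiter, fields):
    """Split `line` on a regex and return the chosen 1-based fields, in the order given."""
    parts = re.split(delimiter, line)
    # Field numbers that do not exist in this line are skipped (an assumption of this sketch)
    return [parts[i - 1] for i in fields if 1 <= i <= len(parts)]

# Columns separated by runs of whitespace; output order is field 3, then field 1
print(regex_cut("alpha   beta\tgamma", r"\s+", [3, 1]))
```

Unlike a fixed-string delimiter, the pattern `\s+` treats any mix of spaces and tabs as a single separator.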
This tool serializes the output of popular GNU/Linux command-line tools and file types to structured JSON.
This allows piping the output to tools like jq.
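The general technique can be sketched in a few lines of Python: parse a whitespace-aligned table with a header row and emit JSON. The sample text and the function name `table_to_records` are made up for illustration; this is not the tool's own parser and does not reproduce any real command's output.

```python
import json

def table_to_records(text):
    """Turn a whitespace-separated table with a header row into a list of dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    headers = lines[0].lower().split()
    # Limiting the number of splits keeps spaces inside the last column intact
    return [dict(zip(headers, ln.split(None, len(headers) - 1))) for ln in lines[1:]]

# Hypothetical sample input, not real command output
SAMPLE = """\
NAME   SIZE  MOUNT
sda1   50G   /
sdb1   200G  /data
"""

print(json.dumps(table_to_records(SAMPLE), indent=2))
```

Once the data is a list of JSON objects, a downstream tool can select fields by name instead of counting columns.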
At Axel Springer, Europe’s largest digital publishing house, we own a lot of news articles from various media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, and it is not surprising that journalists tend to spend a fair amount of their time coming up with a good one. For this reason, it was an interesting research question for us at Axel Springer AI whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). This could, for example, serve our journalists as inspiration for creating SEO titles, which they often don’t have time for (in fact we’re working together with our colleagues from SPRING and AWS on creating an SEO title generator).
Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data for data scientists and developers.
It provides users with a clean, structured and label-enriched information set for ready-to-use applications, including data entry, document analysis automation, archival, and many others.
Currently, Parsr can perform:
- Document Hierarchy Regeneration - Words, Lines and Paragraphs
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Named Entity Recognition (Dates, Percentages, etc)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.
Features:
- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- Some algorithms have more than one implementation in one class.
- Optional numpy usage for maximum speed.
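As an example of what a sequence distance measures, the following pure-Python sketch computes the Levenshtein (edit) distance. It is written from scratch for illustration and is not the library's implementation or API.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The function only indexes and compares elements, so it works on lists and tuples as well as strings.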
rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types.
rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.
Executes SQL-like queries on CSV/TSV tabular data files; each tabular file is treated as a database table; supports all SQL constructs (WHERE, GROUP BY, JOIN).
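The "file as a table" idea can be approximated with the Python standard library by loading CSV text into an in-memory SQLite table. This is only a sketch of the concept: the helper `query_csv`, the table name and the sample data are hypothetical, and the tool itself may work quite differently.

```python
import csv
import io
import sqlite3

def query_csv(csv_text, table, sql):
    """Load CSV text into an in-memory SQLite table and run one SQL query on it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

# Hypothetical sample data; every CSV value is loaded as text, hence the CAST
SALES = "region,amount\nnorth,10\nsouth,5\nnorth,7\n"
print(query_csv(
    SALES, "sales",
    "SELECT region, SUM(CAST(amount AS INTEGER)) FROM sales "
    "GROUP BY region ORDER BY region",
))
```

Because the header row becomes the column names, WHERE and GROUP BY clauses can refer to columns by name.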
Utility that allows users to choose one option from a set of choices using an interface with fuzzy search functionality.
A Python script that
1) receives input lines from stdin or a file,
2) lists the input lines and waits for input that filters/selects the line(s),
3) outputs the selected line(s) to stdout.
Can be used to add interactivity to many regular shell commands.
(JSON Query?) is a sed-like processor for JSON data; can be used to process JSON files and data streams and perform operations such as those allowed by cat, sed, grep and awk on regular text files.
(Generic Colouriser) can be configured to parse a given text stream and colorize it according to regexps written in configuration files; different patterns can be associated with file types.
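Regex-driven colouring of a text stream can be sketched as follows. The rules and colours are invented for this example and live in a Python list rather than in the tool's configuration files, whose format is not reproduced here.

```python
import re

RESET = "\033[0m"
# Hypothetical rules: (regex, ANSI colour escape). Earlier rules win on overlap.
RULES = [(r"ERROR", "\033[31m"), (r"\d+", "\033[36m")]

# One combined pattern so the text is scanned once and inserted escape
# codes are never re-matched by a later rule
COMBINED = re.compile("|".join(f"(?P<r{i}>{p})" for i, (p, _) in enumerate(RULES)))

def colorize(line):
    """Wrap every rule match in its colour code followed by a reset."""
    def paint(match):
        color = RULES[int(match.lastgroup[1:])][1]
        return f"{color}{match.group(0)}{RESET}"
    return COMBINED.sub(paint, line)

print(colorize("ERROR 42 at startup"))
```

Text that matches no rule passes through unchanged, so the function can be applied to every line of a stream.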
ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern while respecting your gitignore rules.
ripgrep has first class support on Windows, macOS and Linux, with binary downloads available for every release.
ripgrep is similar to other popular search tools like The Silver Searcher, ack and grep.
(FuZzy Finder) is a general-purpose command-line finder with fuzzy search/filter capabilities; good integration with vim.
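A common way to think about fuzzy filtering is subsequence matching: a candidate matches if the typed characters appear in it in order, not necessarily adjacent. The sketch below shows that idea only; the function names are hypothetical, and real fuzzy finders use considerably more elaborate scoring.

```python
def fuzzy_match(query, candidate):
    """True if the characters of `query` appear in `candidate` in order (case-insensitive)."""
    remaining = iter(candidate.lower())
    # `ch in remaining` consumes the iterator, which enforces the ordering
    return all(ch in remaining for ch in query.lower())

def fuzzy_filter(query, candidates):
    """Keep matching candidates, shortest first as a crude relevance ranking."""
    return sorted((c for c in candidates if fuzzy_match(query, c)), key=len)

files = ["src/main.py", "setup.py", "docs/manual.md", "README.md"]
print(fuzzy_filter("sp", files))
```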