Search: [data_science] - Toolleeo's Links

Microsoft’s GraphRAG + AutoGen + Ollama + Chainlit = Local & Free Multi-Agent RAG Superbot

article · programming · RAG · AI · NLP · data_science

Sat Jan 4 22:46:41 2025 · permalink

https://ai.gopubby.com/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f

Noterat - Mapping ~400k speeches from the Swedish parliament

data_science · article · NLP

Sat Jan 4 22:31:17 2025 · permalink

https://noterat.github.io/posts/noteringar/202407301845.html

Miller - CSV/TSV and other formats toolkit

csv · file_management · software · opensource · homepage · data_science · tools

Mon Jul 20 16:00:23 2020 * · permalink

http://johnkerl.org/miller/doc/index.html

VisiData

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python in a lightweight utility that can handle millions of rows with ease.

spreadsheet · #cli-app · software · homepage · data_science

Sun May 31 19:46:48 2020 * · permalink

https://www.visidata.org/

OpenRefine

OpenRefine (previously Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

OpenRefine always keeps your data private on your own computer until YOU want to share or collaborate. Your private data never leaves your computer unless you want it to. (It works by running a small server on your computer, and you use your web browser to interact with it.)

homepage · data_science · software · opensource · dataset

Mon Mar 23 20:15:31 2020 * · permalink

https://openrefine.org/

MIDAS - Detecting Microcluster Anomalies in Edge Streams

Real-Time Streaming Anomaly Detection in Dynamic Graphs.

library · data_science · graph · opensource · source_code · coding_lang:c++

Sat Feb 8 13:04:55 2020 * · permalink

https://github.com/bhatiasiddharth/MIDAS

PandaPy - The speed of NumPy and the usability of Pandas

PandaPy has the speed of NumPy and the usability of Pandas (10x to 50x faster).

framework · opensource · source_code · math · data_science · coding_lang:python

Sat Jan 25 12:44:30 2020 * · permalink

https://github.com/firmai/pandapy

How to analyse 100 GB of data on your laptop with Python

Your laptop is way more powerful than you think. Unleash its full potential with the Vaex dataframe library.

python · data_science · article · big_data

Mon Dec 2 16:48:54 2019 · permalink

https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94

Scikit-learn’s Defaults are Wrong

This recent Tweet sparked a discussion about how logistic regression in Scikit-learn uses L2 penalization with a lambda of 1 as the default. If you don’t care about data science, this sou…

critics · python · data_science · machine_learning · article

Fri Nov 1 07:43:28 2019 · permalink

https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/

Principal Component Analysis

PCA is a linear dimensionality reduction technique. Many non-linear dimensionality reduction techniques exist, but linear methods are more mature, if more limited.
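
As a rough illustration of what "linear" means here, the sketch below finds the first principal component of a small dataset and projects the points onto it. It is not taken from the linked article: the function names `first_principal_component` and `project` are hypothetical, and power iteration on the covariance matrix is just one simple way to get the leading direction, chosen so the example needs only the standard library.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (means, unit_vector) for the direction of largest variance.

    Uses power iteration on the sample covariance matrix. The start
    vector is all ones, so this sketch assumes the leading direction
    is not exactly orthogonal to it.
    """
    n = len(points)
    dim = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]

    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]

    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along the current direction
            break
        v = [x / norm for x in w]
    return means, v


def project(points, means, component):
    """Reduce each point to one number: its coordinate along the component."""
    return [
        sum((p[j] - means[j]) * component[j] for j in range(len(component)))
        for p in points
    ]


# Points lying exactly on the line y = 2x.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
```

Because the reduction is a single dot product per point, it can only capture variation along straight directions, which is the limitation the description alludes to.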

article · statistics · algorithm · methodology · analytics · data_science

Wed Oct 9 17:46:37 2019 * · permalink

http://www.oranlooney.com/post/ml-from-scratch-part-6-pca/

Unsupervised word embeddings capture latent knowledge from materials science literature | Nature

Natural language processing algorithms applied to three million materials science abstracts uncover relationships between words, material compositions and properties, and predict potential new thermoelectric materials.

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.

paper · research · text_mining · systematic_literature_review · techniques · science · data_science

Fri Jul 5 07:30:52 2019 * · permalink

https://www.nature.com/articles/s41586-019-1335-8

Statistical forecasting: notes on regression and time series analysis

This web site contains notes and materials for an advanced elective course on statistical forecasting that is taught at the Fuqua School of Business, Duke University. It covers linear regression and time series forecasting models as well as general principles of thoughtful data analysis.

The time series material is illustrated with output produced by Statgraphics, a statistical software package that is highly interactive and has good features for testing and comparing models, including a parallel-model forecasting procedure that I designed many years ago.

The material on multivariate data analysis and linear regression is illustrated with output produced by RegressIt, a free Excel add-in which I also designed. However, these notes are platform-independent. Any statistical software package ought to provide the analytical capabilities needed for the various topics covered here.

statistics · time_series · forecasting · research · data_science · 5_stars

Tue Oct 16 16:59:54 2018 * · permalink

http://people.duke.edu/~rnau/411home.htm

Time Series Forecasting: Creating a seasonal ARIMA model using Python and statsmodels

python · time_series · forecasting · tutorial · data_science

Tue Oct 16 13:26:54 2018 * · permalink

http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/

ROC curves calculator

A receiver operating characteristic (ROC) curve is a graph that illustrates the performance of a binary classifier as its discrimination threshold (cutoff) is changed.

The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various cutoff settings. The true positive rate is also known as sensitivity; the false positive rate is also known as the fall-out and is calculated as (1 - specificity).

The ROC curve is thus a plot of the true positive rate (TPR) versus the false positive rate (FPR). It can be generated by plotting the cumulative distribution function (the area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
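
The construction described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the linked calculator: the names `roc_points` and `trapezoid_auc`, the toy scores, and the convention that an example is predicted positive when its score is at or above the cutoff are all assumptions made for the example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct cutoff.

    Cutoffs are swept from the strictest (highest score) to the
    loosest; an example counts as predicted positive when its
    score is >= the cutoff. Labels are 1 for positive, 0 for negative.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    points = [(0.0, 0.0)]  # cutoff above every score: nothing is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y != 1)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under the curve by the trapezoidal rule over (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy classifier scores and true labels (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = trapezoid_auc(curve)
```

Each cutoff contributes one point, so lowering the cutoff moves the curve from (0, 0) toward (1, 1); a classifier that ranks every positive above every negative would reach TPR = 1 while FPR is still 0.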

statistics · web · math · science · data_science

Sat Jul 7 10:43:07 2018 * · permalink

https://kennis-research.shinyapps.io/ROC-Curves/