A JSON query language CLI tool. A little like the venerable jq and a little not: it has a different selection (or 'query') syntax.
Basically, you pass in JSON to select specific JSON from whatever data you have.
Felt a little weird to me at first, but might actually be easier for intermediate/advanced use cases than the mind-bending mess jq sometimes turns into for me
(or rather, the constant trial and error with jiq that it turns into).
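To make the "pass in JSON to select JSON" idea concrete, here is a minimal Python sketch of selecting by shape. It is not this tool's actual syntax or implementation: the `select` function and the convention that `true` means "keep this value" are assumptions made up purely for illustration.

```python
import json


def select(data, query):
    """Return the parts of `data` that the JSON-shaped `query` asks for.

    Hypothetical convention (not the tool's real syntax): a query value of
    True keeps the matching value as-is, a dict descends into an object, and
    a one-element list applies its inner query to every item of an array.
    """
    if query is True:
        return data
    if isinstance(query, dict) and isinstance(data, dict):
        return {key: select(data[key], sub) for key, sub in query.items() if key in data}
    if isinstance(query, list) and len(query) == 1 and isinstance(data, list):
        return [select(item, query[0]) for item in data]
    raise ValueError("query shape does not match data shape")


# Made-up sample document and query, both written as JSON text.
doc = json.loads('{"user": {"name": "ada", "id": 7}, "tags": [{"t": "a", "n": 1}, {"t": "b", "n": 2}]}')
query = json.loads('{"user": {"name": true}, "tags": [{"t": true}]}')

result = select(doc, query)
print(json.dumps(result))
```

The query mirrors the structure of the data, so what you write looks like what you get back, which is the part that may feel easier than chaining filters.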
Enable rendering CriticMarkup in your Quarto output.
Could be really useful for an authoring pipeline that does not rely on MS Word.
BibTeX parser for Python 3. Parse BibTeX, then do whatever you want with it as a Python data structure.
One example of doing BibTeX -> pandas DataFrame is here
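As a rough, standard-library-only sketch of the "entries to table" step: this assumes the parser hands back a list of dicts with `ID` and `ENTRYTYPE` keys plus one key per field (which is, as far as I know, how bibtexparser 1.x exposes `.entries`). The sample entries and the `entries_to_rows` helper are made up for illustration.

```python
import csv
import io

# Placeholder entries shaped like the assumed parser output (not real references).
entries = [
    {"ID": "doe2020", "ENTRYTYPE": "article", "author": "Doe, Jane", "title": "An Example Title", "year": "2020"},
    {"ID": "roe2021", "ENTRYTYPE": "book", "author": "Roe, Richard", "title": "Another Example", "year": "2021", "publisher": "Example Press"},
]


def entries_to_rows(entries):
    """Give every entry the same columns, filling missing fields with ''."""
    fixed = ["ID", "ENTRYTYPE"]
    extra = sorted({key for entry in entries for key in entry} - set(fixed))
    columns = fixed + extra
    rows = [{col: entry.get(col, "") for col in columns} for entry in entries]
    return columns, rows


columns, rows = entries_to_rows(entries)

# Write the uniform rows out as CSV text to show the resulting table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With pandas installed, passing the same list of dicts to `pandas.DataFrame(entries)` should build the table directly, with missing fields showing up as NaN.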
A pretty flexible and interesting approach to organizing data science projects. Combined with https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/get-started-open-reproducible-science/best-practices-for-organizing-open-reproducible-science/ for more academic-oriented ideas,
it should give a rough guide to finding good organizational structures.
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
TSV data wrangling utilities from the command line.
Written in the D language.
A useful library for computing early warning signals: it can detect, calculate and notify you of bifurcations in time series.
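For a feel of what such indicators look like, here is a minimal sketch (not this library's API) of two commonly used early warning signals, rolling variance and rolling lag-1 autocorrelation, which tend to rise as a system approaches a bifurcation. The function names, window size and simulated series are all assumptions for illustration.

```python
import random
import statistics


def lag1_autocorrelation(window):
    """Lag-1 autocorrelation of one window (0.0 if the window is constant)."""
    mean = statistics.fmean(window)
    num = sum((a - mean) * (b - mean) for a, b in zip(window, window[1:]))
    den = sum((x - mean) ** 2 for x in window)
    return num / den if den else 0.0


def rolling_indicators(series, window_size):
    """Return (variance, lag-1 autocorrelation) for each full rolling window."""
    out = []
    for end in range(window_size, len(series) + 1):
        window = series[end - window_size:end]
        out.append((statistics.pvariance(window), lag1_autocorrelation(window)))
    return out


# Simulated AR(1) series whose "memory" grows over time, a simple stand-in
# for a system slowing down before a transition (made-up data).
random.seed(0)
x, series = 0.0, []
for step in range(400):
    phi = 0.2 + 0.75 * step / 399  # coefficient drifts from 0.2 to 0.95
    x = phi * x + random.gauss(0, 1)
    series.append(x)

indicators = rolling_indicators(series, window_size=100)
first_var, first_ac = indicators[0]
last_var, last_ac = indicators[-1]
print(f"variance {first_var:.2f} -> {last_var:.2f}, lag-1 AC {first_ac:.2f} -> {last_ac:.2f}")
```

A real library would add detrending, significance testing and notification on top of this; the sketch only shows the raw indicators.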
A fantastic command-line tool to quickly get a single view of a CSV file. Auto-spaces, auto-indents, and automatically tries to find the right numeric scale to display one CSV file. Quick, easy, sweet!
(Called tidy-viewer both in the Arch Linux AUR and as the command.)
Jupyter notebooks in the terminal. Run complete notebooks from your command line for exploratory data analysis, before you use something like Quarto for more permanent rendering. Seems very neat.
A framework for elegantly configuring complex applications.
Configuration management for Python projects; may be useful to store simple and repeatable configurations for data science projects as well.
Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.
A way to track data - even if it is in different locations - alongside code, mimicking its version control. Seems a little complicated but really useful, especially with additional features like data pipelines that are contained
Reproducible data science: you create data ingestors, then create small modules of transformation (the engineering), then do whatever you want with the data (the science).
Seems quite nice for larger projects, and its standardized nature could save you some time down the road (when forking another project off an existing one or returning to an old project).
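The ingest, transform, analyze split can be sketched in plain Python as follows. This does not use the framework's own API; the step functions, `run_pipeline` helper and sample data are hypothetical and only illustrate the pattern of small, composable transformation modules.

```python
import csv
import io
from functools import reduce

RAW_CSV = "city,temp_c\nOslo,4.0\nCairo,\nLima,19.5\n"  # made-up sample data


def ingest(text):
    """Ingestor: read raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_missing(rows):
    """Transformation: discard rows without a temperature."""
    return [row for row in rows if row["temp_c"] != ""]


def to_float(rows):
    """Transformation: convert the temperature column from text to float."""
    return [{**row, "temp_c": float(row["temp_c"])} for row in rows]


def add_fahrenheit(rows):
    """Transformation: derive a Fahrenheit column from Celsius."""
    return [{**row, "temp_f": row["temp_c"] * 9 / 5 + 32} for row in rows]


def run_pipeline(data, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), steps, data)


# The engineering: ingest, then run the small transformation modules.
clean = run_pipeline(ingest(RAW_CSV), [drop_missing, to_float, add_fahrenheit])

# The science: do whatever you want with the cleaned data.
mean_temp_f = sum(row["temp_f"] for row in clean) / len(clean)
print(clean, mean_temp_f)
```

Because each step takes rows and returns rows, steps can be reordered, tested on their own, or reused when forking a project.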
An exhaustive book, free and available online, on publishing workflow.
Getting, preparing, cleaning data. Exploratory analysis and modelling with regression. Creating reproducible documents with quarto. Seems really nice and good to delve into for data analysis.
Statistical inference, various Python plot types and correlation vs causation explained in a series of blog posts. Very beginner-friendly, with drawings etc.
Statistics concepts explained (with an attempt to do so in plain English). That means some nuance will be lost, but it might get you to understand results quicker.
Allows you to collaborate on RMarkdown writing through Google Docs. You will have to use RMarkdown syntax in Google Docs however, which seems even more cumbersome than plaintext integrations.
As far as I can see in the demonstration, it will also not do anything for better presentation while writing (since it isn't knitting or anything before you download from Google Docs again, of course). Don't know how well people would adopt this then.
RMarkdown for the Python world, built on pandoc. This seems like an amazing alternative to the R world (though it includes support for R) and all the bookdown and blogdown alternatives.
List of resources to delve deeper into data science and/or data engineering. Very interesting suggestions and enough overlap that it's not just a 'random list'
Goes over advanced concepts of scraping (with Python):
- asynchronous loading pages / client-side rendering (Selenium)
- authentication
- blacklisting
- header inspection
- request frequency
- pattern detection
- honeypots
- captchas, redirects
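Two of the points above, request frequency and header inspection, can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the article's code: the `Throttle` class, the `build_request` helper, the header values and the URLs are all hypothetical, and no request is actually sent.

```python
import random
import time
import urllib.request


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests."""

    def __init__(self, min_delay, jitter, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough so requests are at least `min_delay` apart."""
        now = self._clock()
        if self._last is not None:
            # Random jitter avoids a perfectly regular request pattern.
            gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def build_request(url):
    """Attach explicit headers, since servers may inspect them."""
    headers = {
        "User-Agent": "example-research-bot/0.1 (contact: you@example.com)",
        "Accept-Language": "en",
    }
    return urllib.request.Request(url, headers=headers)


# Short delays keep the demo quick; a real scraper would likely wait longer.
throttle = Throttle(min_delay=0.2, jitter=0.1)
for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    throttle.wait()
    request = build_request(url)
    # urllib.request.urlopen(request) would send it; omitted to stay offline.
    print(request.full_url, request.get_header("User-agent"))
```

Injecting `clock` and `sleep` is a design choice that lets the throttle be tested without real waiting.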
An 8-part series on understanding the Python pandas pipeline and its concepts.