A useful early-warning-signal computing library which can detect, calculate, and notify you of bifurcations in time series.
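The library's own API isn't shown here, so as a generic illustration of the idea (not this library's interface): rolling variance and lag-1 autocorrelation are two classic early-warning indicators that tend to rise as a system approaches a bifurcation.

import numpy as np
import pandas as pd

# Toy time series; in practice this would be your measured signal.
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=1000)))

# Classic early-warning indicators over a sliding window: both tend
# to increase ahead of a bifurcation.
window = 100
variance = series.rolling(window).var()
autocorr = series.rolling(window).apply(lambda w: w.autocorr(lag=1), raw=False)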
A fantastic commandline tool to quickly get a single view of a CSV file. Auto-spaces, auto-indents, and automatically tries to find the right numeric scale for displaying the file. Quick, easy, sweet!
(Called tidy-viewer both on the Arch Linux AUR and as a command.)
Jupyter notebooks in the terminal. Run complete notebooks from your commandline for exploratory data analysis, before you use something like quarto for more permanent rendering. Seems very neat.
A framework for elegantly configuring complex applications.
Configuration management for python projects; may be useful for storing simple and repeatable configurations for data science projects as well.
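That tagline matches Hydra's, so here is a hedged sketch assuming a Hydra-like setup (the config path and field names are hypothetical):

import hydra
from omegaconf import DictConfig

# Expects a conf/config.yaml next to the script; all names are hypothetical.
@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(cfg.model.lr)  # nested values come straight from the YAML

if __name__ == "__main__":
    main()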
Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.
A way to track data alongside code, even if the data lives in different locations, mimicking version control. Seems a little complicated but really useful, especially with additional features like self-contained data pipelines.
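The description matches DVC's tagline; assuming it is DVC, tracked data can also be read programmatically through its Python API (the repo URL, path, and revision below are placeholders):

import dvc.api

# Open a file tracked by DVC at a given Git revision without cloning
# the whole repo; all identifiers here are placeholders.
with dvc.api.open(
    "data/raw/measurements.csv",
    repo="https://github.com/example/project",
    rev="v1.0",  # any Git ref: tag, branch, or commit
) as f:
    header = f.readline()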
Reproducible data science: you create data ingestors, then small modules of transformation (the engineering), then do whatever you want with the data (the science); a generic sketch of this split follows.
Seems quite nice for larger projects, and its standardized nature could save you some time down the road (when forking a project off an existing one or returning to an old project).
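The linked framework's actual API isn't shown here; this is just a generic sketch of the ingest/engineer/science split:

import pandas as pd

def ingest() -> pd.DataFrame:
    # Data ingestor: the only place that knows where raw data lives.
    return pd.read_csv("raw_data.csv")  # hypothetical path

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    # Small, composable transformation module (the engineering).
    return df.dropna().rename(columns=str.lower)

def science(df: pd.DataFrame) -> None:
    # Whatever you want with the clean data (the science).
    print(df.describe())

science(engineer(ingest()))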
An exhaustive book, free and available online, on publishing workflow.
Getting, preparing, and cleaning data; exploratory analysis and modelling with regression; creating reproducible documents with quarto. Seems really nice and worth delving into for data analysis.
Statistical inference, various python plot types, and correlation vs. causation explained in a series of blog posts. Very beginner-friendly, with drawings etc.
Statistics concepts explained in (an attempt at) plain English. That means some nuance will be lost, but it might get you to understand results quicker.
Allows you to collaborate on RMarkdown writing through Google Docs. You will have to use RMarkdown syntax in Google Docs, however, which seems even more cumbersome than plaintext integrations.
As far as I can see from the demonstration, it also won't do anything for better presentation while writing (since nothing is knitted before you download from gdocs again, of course). I don't know how well people would adopt it, then.
RMarkdown for the python world, built on pandoc. This seems like an amazing alternative to the R world's bookdown and blogdown tools (though it includes support for R as well).
A list of resources to delve deeper into data science and/or data engineering. Very interesting suggestions, with enough overlap that it's not just a random list.
Goes over advanced concepts of scraping (with Python); a minimal requests-based sketch of two of the basics follows the list:
- asynchronously loaded pages / client-side rendering (Selenium)
- authentication
- blacklisting
- header inspection
- request frequency
- pattern detection
- honeypots
- captchas, redirects
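A minimal sketch of two of those basics (browser-like headers and throttled request frequency) using requests; the URLs and header value are placeholders:

import time
import requests

session = requests.Session()
# Header inspection: send a browser-like User-Agent instead of the default.
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(2)  # request frequency: stay polite, avoid blacklisting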
An 8-part series on understanding the python pandas pipeline and concepts.
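If "pipeline" here means the method-chaining style (my guess), the core idea is composing small functions with DataFrame.pipe():

import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Column names are made up for illustration.
    return df.assign(total=df["price"] * df["quantity"])

raw = pd.DataFrame({"price": [1.0, 2.0, None], "quantity": [3, 4, 5]})
clean = raw.pipe(drop_missing).pipe(add_total)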
Can convert (and revert) jupyter notebooks to markdown and script files (i.e. plaintext files instead of a single JSON file).
Could be useful for data tracking, or for converting between a jupyter-centric and a vim-centric data workflow.
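This sounds like Jupytext; assuming so, the round trip can also be scripted from Python (file names are hypothetical):

import jupytext

# Read an .ipynb and write it back out as a plaintext markdown file...
nb = jupytext.read("analysis.ipynb")
jupytext.write(nb, "analysis.md")

# ...and revert: read the markdown version and save it as a notebook again.
nb2 = jupytext.read("analysis.md")
jupytext.write(nb2, "analysis.ipynb")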
Third edition of the famous data analysis learning book for pandas (and numpy) by the pandas author.
karlicoss of the data liberation project HPI explains how best to store and access data moved from various points in the cloud/web/internet to your drives, and why databases might not always be the best choice.
TL;DR (a generic sketch follows the list):
- Save your grabbed data without any manipulation.
- Let the manipulation happen every time you access/interpret the data.
- If you have slices of data (mostly time frames), don't try to merge them on disk; save them as extra files and merge on access/interpretation as well.
- You can make use of databases for access caching, since the previous points add some overhead to each access.
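A generic sketch of that pattern (paths and field names are made up): the raw export slices stay untouched on disk, and normalization plus merging happen at read time.

import json
from pathlib import Path

RAW_DIR = Path("~/data/exports").expanduser()  # hypothetical location

def events():
    # Merge all raw slices on access; the files themselves never change.
    for path in sorted(RAW_DIR.glob("export-*.json")):
        for item in json.loads(path.read_text()):
            # Interpretation/normalization happens here, at read time.
            yield {"ts": item["timestamp"], "value": item["value"]}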
The easiest answer is with pandas as a library:

import pandas as pd

df = pd.read_json('inputfile.json')
df.to_csv('outputfile.csv', encoding='utf-8', index=False)
read_json converts a JSON string to a pandas object (either a series or dataframe).
to_csv can either return a string or write directly to a csv-file. See the docs for to_csv.
This works best when the JSON is an array of structured objects (for unstructured data, see the SO answer in the link).
For additional pandas-to-csv tips, see this SO thread.
Also, a really generic template you could use is something like this:
1. Find a data blob, an API, or web-scrape a site for raw data you're interested in.
2. Figure out how to store that data. Do you need a relational database, or maybe NoSQL? How will the records be stored, and what does your data model look like?
3. Use analytics packages like numpy or something else, draw conclusions or find interesting themes about your data.
4. Now do something with it! Maybe a front end to display it all. You can use Dash to build a quick and light visualization of your findings, or something more full stack like a Django application or even Flask. Totally up to you.
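For step 4, a minimal sketch with recent Dash (2.x) and made-up data might look like:

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Made-up findings to display.
df = pd.DataFrame({"year": [2020, 2021, 2022], "count": [10, 30, 25]})

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Findings"),
    dcc.Graph(figure=px.line(df, x="year", y="count")),
])

if __name__ == "__main__":
    app.run(debug=True)  # app.run superseded run_server in newer Dash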
An interesting use of Loki to grab shell history, store it centrally, and then re-use it from the commandline to replace the shell's traditional history functionality. Also includes a little tidbit on integrating your shell history with e.g. Grafana.