https://github.com/3b1b/manim/releases
Super awesome, and you can make it into an MCP for Cursor.
[1] https://stackoverflow.com/questions/73470828/ggplot2-is-slow...
The fact that they are using WGPU (wgpu-py, Python bindings to the WebGPU API) suggests there is an interesting possible extended use case. As a few other comments suggest, if one knows that the data is available on a machine in a cluster rather than on the user's local machine, it might make sense to start up a server, expose a port, and pass the data over HTTP to be rendered in a browser. That would make it shareable across the lab. The limit would be the data bandwidth over HTTP (e.g. for the 3 million point case), but it seems like for simpler cases it would be very useful.
That would lead to an interesting exercise of defining a protocol for transferring plot points over HTTP in such a way that they could be handed over to the browser's WebGPU interface efficiently. Perhaps an even more efficient representation is possible with some pre-processing on the server side?
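A rough sketch of what such a payload could look like, assuming plain float32 packing on the Python side (everything here, including the tiny header, is made up for illustration):

```python
import numpy as np

# Pack plot points into a compact binary payload that a browser could read
# straight into a Float32Array and upload as a WebGPU vertex buffer,
# instead of shipping JSON text.
points = np.random.rand(3_000_000, 2)                       # x, y pairs

payload = points.astype(np.float32, copy=False).tobytes()   # 8 bytes per point, ~24 MB
# A minimal header so the client knows how to interpret the buffer.
header = np.array(points.shape, dtype=np.uint32).tobytes()  # (count, dims)

message = header + payload                                  # serve this over HTTP/WebSocket
```

Compared to JSON text that is roughly a 5-10x smaller payload, and the client needs no parsing step before upload.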
jupyter-rfb lets you do remote rendering for this, render to a remote frame buffer and send over a jpeg byte stream. We and a number of our scientific users use it like this. https://fastplotlib.org/ver/dev/user_guide/faq.html#what-fra...
> defining a protocol for transferring plot points
This sounds more like GSP, which Cyrille Rossant (who's made some posts here) works on, it has a slightly different kind of use case.
Forgive me for doing this, but I used an LLM to find that. They’re exceptionally useful for disambiguation tasks like this. Knowing what an acronym refers to is very useful for next token prediction, so they’re quite good at it. It’s usually trivial to figure out if they’re hallucinating with a search engine.
But if I understand correctly it's a protocol for serializing graphical objects, pretty neat idea.
https://pygraphistry.readthedocs.io/en/latest/performance.ht...
I followed one of their online workshops, and it feels really powerful, although it is a bit confusing which part of it does what (it's basically 6 or 7 projects put together under an umbrella)
I will work on adding somewhere in our docs some metrics for this kind of thing (I think it could be helpful for many).
Certainly! A comparison of performance with specialized tools for large point clouds would be very interesting (like cloudcompare and potree).
Haven't had the time to get very far yet, but will gladly contribute an example once I figure something out. Some of the ideas I want to eventually get to are rendering shadertoys (interactively?) into an fpl subplot (haven't looked at the code at all, but might be doable), eventually running those interactively in the browser, and doing the network layout on the GPU with compute shaders (out of scope for fpl).
But it doesn't seem to answer how it works in Jupyter notebooks, or if it does at all. Is the GPU acceleration done "client-side" (JavaScript?) or "server-side" (in the kernel?) or is there an option for both?
Because I've used supposedly fast visualization libraries in Google Colab before, but instead of updating at 30 fps, it takes 2 seconds to update after a click, because after the new image is rendered it has to be transmitted via the Jupyter connector and network and that can turn out to be really slow.
I believe the performance is pretty decent, especially if you run the kernel locally
Their docs also cover this as mentioned by @clewis7 below: https://www.fastplotlib.org/ver/dev/user_guide/faq.html#what...
Just to add on, colab is weird and not performant, this PR outlines our attempts to get jupyter-rfb working on colab: https://github.com/vispy/jupyter_rfb/pull/77
I think a killer feature of these gpu-plotting libraries would be if they could take torch/jax cuda arrays directly and not require a (slow) transfer over cpu.
tinygrad which I haven't used seems torch-like and has a WGPU backend: https://github.com/tinygrad/tinygrad
I'm now working on a way for users to wrap a Datoviz GPU buffer as a CuPy array that directly references the Datoviz-managed GPU memory. This should, in principle, enable efficient GPU-based array operations on GPU data without any transfers.
[2] https://registry.khronos.org/vulkan/specs/latest/man/html/VK...
[3] https://docs.cupy.dev/en/latest/reference/generated/cupy.cud...
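To make the mechanism concrete, here is a minimal sketch using CuPy's UnownedMemory; in the real case the pointer would come from an exported Vulkan/Datoviz buffer rather than a CuPy allocation ([2], [3]):

```python
import cupy as cp

# Stand-in for a device pointer exported by the rendering library: we allocate
# with CuPy here just so the sketch runs end to end.
n = 1_000_000
backing = cp.cuda.alloc(n * 4)                 # raw device allocation
device_ptr, nbytes = backing.ptr, n * 4

# Wrap the externally-owned allocation as a CuPy array without copying.
mem = cp.cuda.UnownedMemory(device_ptr, nbytes, owner=backing)
arr = cp.ndarray((n,), dtype=cp.float32, memptr=cp.cuda.MemoryPointer(mem, 0))

arr[:] = 0.0        # in-place GPU ops touch the same memory the renderer sees
arr += 1.0
```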
WGPU has security protections since it's designed for the browser so I'm guessing it's impossible.
When would you reach for a different library instead of fastplotlib?
How does this deal with really large datasets? Are you doing any type of downsampling?
How does this work with pandas? I didn't see it as a requirement in setup.py
Does this work in Jupyter notebooks? What about marimo?
> When would you reach for a different library instead of fastplotlib?
Use the best tool for your use case; we're focused on GPU accelerated interactive visualization. Our use cases broadly are developing ML algorithms, user-end ML Ops tools, and looking at live data coming off of scientific instruments.
> How does this deal with really large datasets? Are you doing any type of downsampling?
Depends on your hardware, see https://fastplotlib.org/ver/dev/user_guide/faq.html#do-i-nee...
> How does this work with pandas? I didn't see it as a requirement in setup.py
If you pass in numpy-like types that use the buffer protocol it should work; we also want to support direct dataframe input in the future: https://github.com/fastplotlib/fastplotlib/issues/395
There are more low-level priorities in the meantime.
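In the meantime, something like this should work today, assuming the current Figure/add_line API (exact call names may differ between fastplotlib versions):

```python
import fastplotlib as fpl
import numpy as np
import pandas as pd

t = np.linspace(0, 10, 1_000)
df = pd.DataFrame({"t": t, "y": np.sin(t)})

fig = fpl.Figure()
# .to_numpy() hands fastplotlib a plain buffer-protocol array;
# passing the DataFrame itself is what issue #395 tracks.
fig[0, 0].add_line(df[["t", "y"]].to_numpy())
fig.show()

fpl.loop.run()   # fpl.run() in older versions
```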
> Does this work in Jupyter notebooks? What about marimo?
Jupyter yes, via jupyter-rfb, see our repo: https://github.com/fastplotlib/fastplotlib?tab=readme-ov-fil...
I’ve been using kst-plot for live streaming data from instruments and interactive plots. It’s fast and I haven’t found any limit to the amount of data it can plot. Development has basically stopped - the product is done, feature complete, and works perfectly! It is used by the European and Canadian space agencies. It might be interesting for you to see how they solved or approached some of the same problems you have solved or will solve!
A major reason why other plotting libraries don't take off is their use of complicated APIs. But data analysis doesn't need Application Programming Interfaces, it needs User Interfaces.
I would argue that the Matplotlib syntax is horribly broken (or rather, the Matlab syntax it historically tried to emulate, and had to stick with for better or worse..)
of which matplotlib is the embodiment. A terrible API with terrible, terrible performance.
For architecture astronauts there's also the OOP API over which the pylab API is a wrapper.
Of course there are also a lot of all sorts of declarative APIs, which are popular with people copy-pasting code from cookbooks. These become very painful very fast if you do something that's not in the cookbook.
Matplotlib does struggle with performance in some/many cases, but it has little to do with the API.
Not sure if that is the right tutorial, but many years ago, in the Guile 1.x days, I wrote a local visualizer for the data from a particle physics accelerator entirely in Guile and Gnuplot. It was very MVC, using Guile as the controller and Gnuplot as the viewer.
Was it stupid? Yes. Did it work better than all the other tools I had at the time? Also yes.
https://almarklein.org/triangletricks.html
https://almarklein.org/line_rendering.html
A big shader refactor was done in this PR: https://github.com/pygfx/pygfx/pull/628
> These days, having a GPU is practically a prerequisite to doing science, and visualization is no exception.
It becomes really funny when they go on to this, as if it was a big deal:
> Depicted below is an example of plotting 3 million points
Anybody who has ever used C or fortran knows that a modern CPU can easily churn through "3 million points" at more than 30 frames per second, using just one thread. It's not a particularly impressive feat, three million points is the size of a mid-resolution picture, and you can zoom-in and out those trivially in real-time using a CPU (and you could do that 20 years ago, as well). Maybe the stated slowness of fastplotlib comes from the unholy mix of rust and python?
Now, besides this rant, I think that fastplotlib is fantastic and, as an (unwilling) user of Python for data science, it's a godsend. It's just that the hype of that website sits wrong with me. All the demos show things that could be done much more easily and just as fast when I was a teenager. The big feat, and a really big one at that, is that you can access this sort of performance from Python. I love it, in a way, because it makes my life easier now; but it feels like a self-inflicted problem was solved in a very roundabout way.
> Anybody who has ever used C or fortran knows that a modern CPU can easily churn through "3 million points" at more than 30 frames per second, using just one thread. It's not a particularly impressive feat, three million points is the size of a mid-resolution picture, and you can zoom-in and out those trivially in real-time using a CPU (and you could do that 20 years ago, as well). Maybe the stated slowness of fastplotlib comes from the unholy mix of rust and python?
That's a misrepresentation though; it's 3 million points in sine waves, e.g. something like 1000 sine waves with e.g. 3000 points in each. If you look at the zoomed-in image, the sine waves are spaced significantly, so if you were to represent this as an image it would be at least a factor of 10 larger. In fact that is likely a significant underestimate, since you also need to connect the points inside the sine waves.
The comparison case would be to take a vector graphics (e.g. svg) with 1000 sine wave lines and open it in a viewer (written in C or Fortran if you want) and try zooming in and out quickly.
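For scale, here is roughly what the dataset being debated looks like (the 1000 x 3000 split is the guess made above, not the actual demo's numbers):

```python
import numpy as np

# ~1000 sine waves x ~3000 vertices each = ~3 million points that must be
# transformed and drawn as connected line segments on every pan/zoom.
n_lines, n_points = 1_000, 3_000
x = np.linspace(0, 2 * np.pi, n_points)
freqs = np.linspace(1, 10, n_lines)[:, None]

ys = np.sin(freqs * x[None, :]) + np.arange(n_lines)[:, None] * 3.0  # offset each line

# shape (1000, 3000, 2): one (x, y) polyline per row
data = np.stack([np.broadcast_to(x, ys.shape), ys], axis=-1)
```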
If you insist on fitting the entire thing in memory, it may seem better to do so in plain RAM, which nowadays is of humongous size even in "modest" systems.
But with 10 billion points, you need to consider more sophisticated approaches.
see: https://fastplotlib.org/ver/dev/user_guide/faq.html#what-fra...
People have used fastplotlib and jupyter-rfb in vscode, but it can be troublesome and we don't currently have the resources to figure out exactly why.
I especially like that there is a PyQt interface which might provide an alternative to another great package: pyqtgraph[0].
We are hoping for pyodide integration soon, which would allow fastplotlib to be run strictly in the browser!
As Caitlin pointed out below pyodide is a future goal.
The quickest install would be `pip install fastplotlib`. This would be if you were interested in just having the barebones (no imgui or notebook) for desktop viz using something like glfw.
We can think about adding in our docs some kind of import time metrics.
but we haven't benchmarked it yet
Hopefully both are implemented.
Fastplotlib / pygfx are primarily meant to run on desktop. When using it via the notebook the server does the rendering.
As Ivo said, we have plans to support running in the browser via Pyodide, which opens some interesting things, but is not the primary purpose.
https://github.com/pygfx/wgpu-py/issues/407
PRs welcome though :-)
I previously tried this with matplotlib and it took 20-30 minutes to make a single rendering, because matplotlib only uses a single CPU core and doesn't support GPU acceleration. I also tried Manim, but I couldn't get an actual video file, and OpenGL seems to be a bit complicated to work with (I went and worked on other things, though I should ask around about the video file output). Anyway, I'm excited about the prospect of a GPU accelerated dataviz tool that utilizes Vulkan, and I hope this library can cover my use case.
I never knew I needed this until now
https://fastplotlib.org/ver/dev/_gallery/line/line_colorslic...
https://fastplotlib.org/ver/dev/_gallery/line/line_cmap_more...
https://fastplotlib.org/ver/dev/_gallery/line/line_cmap.html...
And with collections if you want to go crazy: https://fastplotlib.org/ver/dev/_gallery/line_collection/lin...
For me, matplotlib still reigns supreme. Rather than a fancy new visualization framework, I'd love for matplotlib to just be improved (admittedly, fastplotlib covers a different set of needs than what matplotlib does... but the author named it what they named it, so they have invited comparison. ;-) ).
Two things for me at least that would go a long way:
1) Better 3D plotting. It sucks, it's slow, it's basically unusable, although I do like how it looks most of the time. I mainly use PyVista now but it sure would be nice to have the power of a PyVista in a matplotlib subplot with a style consistent with the rest of matplotlib.
2) Some kind of WYSIWYG editor that will let you propagate changes back into your plot easily. It's faster and easier to adjust your plot layout visually rather than in code. I'd love to be able to make a plot, open up a WYSIWYG editor, lay things out a bit, and have those changes propagate back to code so that I can save it for all time.
(If these features already exist I'll be ecstatic ;-) )
Every pixel has a covariance with every other pixel, so sliding through the rows of the covariance matrix generates as many faces on the right as there are pixels in a photograph of a face. However the pixels that strongly co-vary will produce very similar right-side "face" pictures. To get a sense of how many different behaviours there are, one would look for eigenvectors of this covariance matrix. And then 10 or so static eigenvectors of the covariance matrix (eigenfaces [1]) would be much more informative than thousands of animated faces displayed in the example.
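Roughly this, as a sketch (random data standing in for e.g. the Olivetti faces):

```python
import numpy as np

# X: n face images flattened to p pixels each.
rng = np.random.default_rng(0)
n, h, w = 400, 32, 32
X = rng.random((n, h * w))

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)   # p x p pixel covariance;
                                         # each row is one pixel's "animated face"

# A handful of leading eigenvectors summarises those thousands of rows.
eigvals, eigvecs = np.linalg.eigh(cov)             # ascending order
eigenfaces = eigvecs[:, ::-1][:, :10].T.reshape(10, h, w)
```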
Sometimes a big interactive visualisation can be a sign of not having a concrete goal or not knowing how to properly summarise. After all, that's the purpose of a figure - to highlight insights, not to look for ways to display the entire dataset. And pictures that try to display the whole dataset end up shifting the job of exploratory analysis to a visual space and leaving it for somebody else.
Though of course there are exceptions.
It's hard for me to imagine what you're doing that necessitates such fancy tools, but I'm definitely interested to learn! My failure of imagination is just that.
Your whole third paragraph seems to be criticizing the core purpose of exploratory data analysis as though one should always be able to skip directly to the next phase of having a standardized representation. When entering a new problem domain, somebody needs to actually look at the data in a somewhat raw form. Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
Yup this is a good summary of the intent, we also have to remember that the eigenfaces dataset is a very clean/toy data example. Real datasets never look this good, and just going straight to an eigendecomp or PCA isn't informative without first taking a look at things. Often you may want to do something other than an eigendecomp or PCA, get an idea of your data first and then think about what to do to it.
Edit: the point of that example was to show that visually we can judge what the covariance matrix is producing in the "image space". Sometimes a covariance matrix isn't even the right type of statistic to compute from your data and interactively looking at your data in different ways can help.
Imagine we have some big data - like an OMIC dataset about chromatin modification differences between smokers and non-smokers. Genomes are large, so one way to visualise might be to do a Manhattan plot (mentioned here in another comment). Let's (hypothetically) say the pattern in the data is that chromatin in the vicinity of genes related to membrane functioning has more open chromatin marks in smokers compared to non-smokers. A Manhattan plot will not tell us that. And in order to be able to detect that in our visualisation we had to already know what we were looking for in the first place.
My point in this example is the following: in order to detect that we would have to know what to visualise first (i.e. visualise the genes related to membrane function separately from the rest). But then when we are looking for these kinds of associations - the visualisation becomes unnecessary. We can capture the comparison of interest with a single number (i.e. average difference between smokers vs non-smokers within this group of genes). And then we can test all kinds of associations by running a script with a for-loop in order to check all possible groups of genes we care about and return a number for each. It's much faster than visualisation. And then after this type of EDA is done, the picture would be produced as a result, displaying the effect and highlighting the insights.
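Something like this, as a sketch (the data and gene sets are made up, and the "openness" score is a stand-in for whatever chromatin mark statistic you actually have):

```python
import numpy as np

rng = np.random.default_rng(1)
openness = rng.random((20_000, 60))                  # genes x samples
is_smoker = np.array([True] * 30 + [False] * 30)
gene_sets = {
    "membrane":   rng.choice(20_000, 300, replace=False),
    "metabolism": rng.choice(20_000, 500, replace=False),
}

# One number per gene set: mean smoker vs non-smoker difference within the set.
scores = {}
for name, idx in gene_sets.items():
    grp = openness[idx]
    scores[name] = grp[:, is_smoker].mean() - grp[:, ~is_smoker].mean()

print(scores)   # rank these, then make the figure for whatever stands out
```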
I understand your point about visualisation being an indistinguishable part of EDA. But the example I provided above is much closer to my lived experience.
Re: wtallis, I think my original complaint about EDA per se is indeed off the mark.
Certainly creating a 20x20 grid of live-updating GPU plots and visualizations is a form of EDA, but it seems to suggest a complete lack of intuition about the problem you're solving. Like you're just going spelunking in a data set to see what you can find... and that's all you've got; no hypothesis, no nothing. I think if you're able to form even the meagerest of hypotheses, you should be able to eliminate most of these visualizations and focus on something much, much simpler.
I guess this tool purports to eliminate some of this, but there is also a degree of time-wasting involved in setting up all these visualizations. If you do more thinking up front, you can zero in on a smaller and more targeted subset of experiments. Simpler EDA tools may suffice. If you can prove your point with a single line or scatter plot (or number?), that's really the best case scenario.
For a sufficiently narrow definition of "dataset", perhaps. I don't think it's the obvious step one when you want to start understanding a time series dataset, for example. (Fourier transform would be a more likely step two, after step one of actually look at some of your data.)
So this is not unheard of for time series analysis.
Matplotlib is okay, but there's definitely room for improvement, so why not go for that improvement?
I agree 100% that matplotlib is really slow and should be made to run as fast as humanly possible. I would add a (3) to my list above: optimize matplotlib!
OTOH, at least for what I'm doing, the code that runs to generate the data that gets plotted dominates the runtime 99% of the time.
For me, adjusting plots is usually the time waster. Hence point (2) above. I'd love to be able to make the tweaks using a WYSIWYG editor and have my plotting script dynamically updated. The bins, the log scale, the font, the dpi, etc, etc.
I think with your 8 slices examples above: my (2) and (3) would cover your bases. In your view, is the rest of matplotlib really so bad that it needs to be burnt to the ground for progress to be made?
edit: regarding runtime, I'm sure this varies a lot based on usecase, but for my usual usecase I store a mostly-processed dataset, so the additional processing before drawing the data is usually minimal.
What I want for EDA is a tool that lets me quickly toggle between common views of the dataset. I run through the same analysis over and over again; I don't want to type the same commands repeatedly. I have my own heuristics for which views I want, and I want a platform that lets me write functions that express those heuristics. I want to build the intelligence into the tool instead of having to remember a bunch of commands to type on each dataframe.
For manipulating the plot, I want a low-code UI that lets me point and click the operations I want to use to transform the dataframe. The low-code UI should also emit Python code to do the same operations (so you aren't tied to a low-code system, you just use it as a faster way to generate code than typing).
I have built the start of this for my open source datatable UX called Buckaroo. But it's for tables, not for plotting. The approach could be adapted to plotting. Happy to collaborate.
The differing approaches can probably be seen in some API choices, although the fastplotlib API is a lot more ergonomic than many others. Having to index the figure or prefixing plots with add_ are minor things, and probably preferable for application development, but for fast-iteration EDA they will start to irritate fast. The "pylab" API of matplotlib violates all sorts of software development principles, but it's very convenient for exploratory use.
Matplotlib's performance, especially with interaction and animation, and its clunky interaction APIs are definite pain points, and a faster library with better interaction support for EDA would be very welcome. Something like a pylab-type wrapper would probably be easy to implement for fastplotlib.
And to bikeshed a bit, I don't love the default black background. It's against usual conventions, difficult for publication, and a bit harder to read when you're used to white.
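For the pylab-type wrapper idea above, something as small as this might already go a long way (a sketch only; the fastplotlib calls are assumed from the current docs and may differ between versions):

```python
import numpy as np
import fastplotlib as fpl

_figure = None

def _gca():
    """Implicit current subplot, created lazily - the pyplot-style state machine."""
    global _figure
    if _figure is None:
        _figure = fpl.Figure()
    return _figure[0, 0]

def plot(y, **kwargs):
    return _gca().add_line(np.asarray(y), **kwargs)

def show():
    _figure.show()
    fpl.loop.run()   # fpl.run() in older versions

# usage:
# plot(np.sin(np.linspace(0, 10, 500)))
# show()
```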
As an example, I frequently want to run analytics on a dataframe. More complex summary stats. So you write a couple of functions, and have two for loops, iterating over columns and functions. This works for a bit. It's easy to add functions to the list. Then a function throws an error, and you're trying to figure out where you are in two nested for loops.
Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. You could pass the existing dict of computed measures so you can reuse that expensive calculation... Now you have to worry about the ordering of functions.
So you could put all of your measures into one big function, but that isn't reusable. So you write your big function over and over.
I built a small dag library that handles this, and lets you specify that your analysis requires keys and provides keys, then the DAG of functions is ordered for you.
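Illustrating the requires/provides idea with just the standard library (a sketch, not the library mentioned above):

```python
from graphlib import TopologicalSorter

def expensive_precalc(d):
    d["grouped"] = "groupby result"                 # stand-in for a costly groupby

def summary_stats(d):
    d["stats"] = f"stats from {d['grouped']}"

steps = {
    expensive_precalc: {"requires": set(),        "provides": {"grouped"}},
    summary_stats:     {"requires": {"grouped"},  "provides": {"stats"}},
}

# A step depends on whichever step provides each key it requires.
providers = {key: fn for fn, meta in steps.items() for key in meta["provides"]}
graph = {fn: {providers[key] for key in meta["requires"]} for fn, meta in steps.items()}

results = {}
for step in TopologicalSorter(graph).static_order():
    step(results)

print(results["stats"])
```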
How do other people approach these issues?
> [...] it fixes the problem at the time, then it gets thrown away with the notebook only to be written again soon.
Is one of the reasons I stopped using notebooks.
One solution to your problem might be to create a simple executable script that, when called on the file of your dataset in a shell, would produce the visualisation you need. If it's an interactive visualisation then I would create a library or otherwise a re-usable piece of code that can be sourced. It takes some time but ends up saving more time in the end.
If you have custom-made things you have to check on your data tables, then likely no library will solve your problem without you doing some additional work on top.
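For the "simple executable script" approach, even something this small pays for itself quickly (the file and column handling here is purely illustrative):

```python
#!/usr/bin/env python
# usage: ./quick_view.py measurements.csv  ->  measurements.csv.png
import sys
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(sys.argv[1])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df.plot(ax=ax1)                            # raw traces, all numeric columns
df.plot(kind="hist", ax=ax2, alpha=0.5)    # overlaid value distributions
fig.savefig(sys.argv[1] + ".png", dpi=150)
```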
And for these:
> Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. [...] Now you have to worry about the ordering of functions.
I save expensive outputs to intermediate files, and manage dependencies with a very simple build-system called redo [1][2].
[1]: http://www.goredo.cypherpunks.su
[2]: http://karolis.koncevicius.lt/posts/using_redo_to_manage_r_d...
For larger datasets, real scripts are a better idea. I expect my stuff to work with datasets up to about 1 GB; caching is easy to layer on and would speed up work for larger datasets, but my code assumes the data fits in memory. It would be easier to add caching than to make sure I don't load an entire dataset into memory. (I don't serialize the entire dataframe to the browser though.)
I agree with you regarding matplotlib, although I find a lot of faults/frustration in using it. Both your points on 3D plotting and a WYSIWYG editor would be extremely nice, and as far as I know nothing in Python ticks those boxes. For 3D I typically default to Matlab, as I've found it to be the most responsive/easy to use. I've not found anything directly like a WYSIWYG editor. Stata is the closest but I deplore it; R to some extent has it, but if I'm generating multiple plots it doesn't always work out.
I'm surprised by what you said about "EDA". I find the opposite: a shotgun approach, exploring a vast number of plots with various stratifications, gives me better insight. I've explored plotting across multiple languages (R, Python, Julia, Stata) and not found one that meets all my needs.
The biggest issue I often face is that I have 1000 plots I want to generate, all from separate data groups, which could all be plotted in parallel, but most plotting libraries have holds/issues with distribution/parallelization. The closest I've found is to build up a plot in Python using a Jupyter notebook. Once I'm done I'll create a function taking all the needed data and saving a plot out, then either manually or with the help of LLMs convert it to Julia, which I've found to be much faster at loading and processing large amounts of data. Then I can loop it using Julia's "Distributed" package. It's less than ideal - threaded access would be great, rather than having to distribute the data - but I've yet to find something that works. I'd love a simple 2D EDA plotting library that has basic plots like lines, histograms (1D/2D), scatter plots, etc., supports basic colorings and alpha values, and can handle large amounts (thousands to millions of points) of static data and save plots to disk in parallel. I've debated writing my own library but I have other priorities currently, maybe once I finish my PhD.
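One common workaround for the parallel-saving part (not from the comment above, just a pattern that tends to work): use matplotlib's non-interactive Agg backend and one process per plot, e.g.:

```python
import matplotlib
matplotlib.use("Agg")                      # headless backend: no GUI event loop to fight over
import matplotlib.pyplot as plt
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def save_plot(args):
    idx, ys = args
    fig, ax = plt.subplots()
    ax.plot(ys, alpha=0.7)
    fig.savefig(f"plot_{idx:04d}.png", dpi=150)
    plt.close(fig)                         # free figure memory inside the worker

if __name__ == "__main__":
    # stand-in for the separate data groups
    groups = [(i, np.random.rand(10_000)) for i in range(100)]
    with ProcessPoolExecutor() as pool:
        list(pool.map(save_plot, groups))
```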
Maybe VR will change that at some point. :shrug:
My focus is primarily on raw performance, visual quality, and scalability for large datasets—millions, tens of millions of points, or even more.
Implementing the API on top of GSP should be relatively straightforward, as the core graphics-related mechanisms are handled by GSP/Datoviz. We've created a Slack channel for discussions—contact me privately if you'd like to join.
[1] https://github.com/pygfx/pygfx [2] https://github.com/pygfx/wgpu-py
In the plans that we do have for running in the browser, Fastplotlib, Pygfx and wgpu-py will still be Python, running on CPython compiled to WASM (via Pyodide). But instead of wgpu-py cffi-ing into a C library, it would make JS calls to the WebGPU API.
This code gives you a fully interactive, and performant, histogram plot:
```python
import plotly.express as px

df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()
```