⏱️ Stuck Waiting on Imports? Free Up Hours with These Time-Saving Tips!
Learn how we play with a 10GB CSV demo dataset in less than 5 minutes
Hey friends,
I hope you had a fantastic weekend and are all recharged and ready to go! 🙂 I took some time off as well, so here we are with a new issue of Bitsy, off-schedule, talking about testing your theories on massive datasets.
Few things are more exciting than starting work with a new client and solving new problems, but new collaborations almost always bring new technical challenges.
This week, we wanted to find out whether our queries run faster on GraphDB, using a demo dataset:
10GB CSV file
~60 million rows
an import that takes ~30 hours
a POC needed ASAP
Let’s see how we tackled these challenges.
🦾 Use AI to Get Started Fast
Never heard of GraphDBs? Me neither. 😅 This incredible technology has existed for some time now, but the CRUD web apps most of us build don’t benefit from GraphDBs’ selling points.
Thanks to ChatGPT, I wrote advanced queries in Cypher (a GraphDB query language) in less than a day.
Here’s roughly how the week started:
Have you worked with GraphDBs before?
(me) No, sir
Great, me neither. Start rewriting our most business-critical queries in it and have a POC for a query by the end of the week.
That means one thing: I don’t have time to read through an O’Reilly book on GraphDBs and Cypher. 😃
Lucky me, ChatGPT already did that for me. As a bonus, it explains the queries it writes step by step, so you can learn the language while solving your actual problems.
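To make this concrete, here’s the kind of query I mean. Everything in it is made up for illustration (the Customer/Order data model, the credentials), and it assumes a Neo4j setup, where cypher-shell is the bundled command-line client:

# hypothetical Cypher query: top 10 customers by order count
# (labels, user, and password are invented for this sketch)
echo "MATCH (c:Customer)-[:PLACED]->(o:Order)
      RETURN c.name AS customer, count(o) AS orders
      ORDER BY orders DESC LIMIT 10;" | cypher-shell -u neo4j -p secret

Paste a query like this back into ChatGPT and ask it to explain each clause, and you get a Cypher lesson built around your own data model.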
⏳ Time
ChatGPT gives you something you can run and test in less than a second.
The problem?
It doesn’t always work. I mean, it literally messes up your dataset. This wouldn’t be an issue if importing the anonymized CSV data didn’t take approximately 30 hours.
But we found ways to skip the long wait times.
Cut the Data
For a POC, 10MB of anonymized data is as good as all 10GB of it.
But how do you extract that kind of sample when text editors and Excel can’t even open a 10GB CSV?
The widely available head command prints the first N lines of its input. Throw in some Linux basics, and you can produce a file containing only the first 1000 lines of your demo data, without ever opening the file in an editor and making your system unresponsive:
head -n 1000 demo_10GB.csv > demo_1000_lines.csv
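If the first line of your CSV is a header row, a small variation keeps it in the sample (a sketch using standard coreutils; demo_sample.csv is a made-up output name):

# keep the header row from the original file
head -n 1 demo_10GB.csv > demo_sample.csv
# then append the first 999 data rows (tail -n +2 skips the header)
tail -n +2 demo_10GB.csv | head -n 999 >> demo_sample.csv

And if you’d rather sample rows from across the whole file than take the top, GNU coreutils’ shuf -n picks random lines from the stream; it still reads all 10GB once, but never opens the file in an editor.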
Use DB dumps
Dumping the DB state right before you run a query that might cause harm is another way to cut the time needed to get back to a clean state. If you mess things up, you restore the DB from the dump.
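Here’s a sketch of what that can look like, assuming Neo4j (the best-known database speaking Cypher) and its 4.x neo4j-admin tool; the database name and paths are made up, and the database must be stopped while you dump or load:

# snapshot the current state before running a risky query
neo4j-admin dump --database=neo4j --to=/backups/before-risky-query.dump
# if the query messes things up, roll back to the snapshot
neo4j-admin load --from=/backups/before-risky-query.dump --database=neo4j --force

Restoring from a dump like this beats waiting ~30 hours for a fresh import.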
🧑‍🤝‍🧑 Share the work
Do you have more than one theory to test (query to implement, in our case)?
Ask for more resources from your manager.
There are several benefits to exploring a new tech as a group:
People learn differently. Everyone finds different things interesting and worth diving into. This can result in a fantastic knowledge-sharing session with lots of learning and a mix of experiences.
You might reach your goals faster since everyone works on a dedicated query.
There are different points of view. Maybe I’m biased because I’ve worked with RDBs all my life and want to work with GraphDBs because they are new and exciting. Another developer might enjoy proprietary binary formats and managing memory in a C++ program for maximum efficiency. A third one could point out some flaws in both approaches.
And that’s it for this week!
Have you ever worked with datasets of this size?
Please share your experiences below so we can become more efficient at working with big data. 👏
📰 Weekly shoutout
The Apprentice, The New Boss, The Successor and The Pioneer by
How Indexing Information Can Make You a Better Engineer by
📣 Share
There’s no easier way to help this newsletter grow than by sharing it with the world. If you liked it, found something helpful, or know someone who knows someone to whom this could be helpful, share it:
🏆 Subscribe
Actually, there’s one even easier thing you can do to help this newsletter grow: subscribe to it. I’ll keep putting in the work and distilling what I’ve learned as a software engineer and consultant. Simply sign up here: