Launch HN: Syndetic (YC W20) – Software for explaining datasets

dodata · on Feb 24, 2020

Neat! Congrats on the launch - the demo is very helpful to understand the product. Having consumed long, painful PDF data dictionaries in the past, this is a big breath of fresh air. Excited to see where Syndetic goes!

For me, the most painful part of working with 3rd party data was actually figuring out the "match rate" to internal data. For example, you might be a consumer-facing company who hopes to add more context to your internal data by pulling in 3rd party information for existing clients. To match your internal data to a 3rd party dataset, you usually match on some hashed email (or similar identifier) to see what percentage of your consumer records will be available in the 3rd party dataset. Have you thought about something like that with your tool? Maybe you can upload a sample of hashed emails and see how different match rates pan out.

stevepike · on Feb 24, 2020

Yes! This has come up across multiple industries and is probably the feature on our roadmap I'm most excited about. The implementation is tricky but customers definitely care about the intersection of a provider's data with their own. Some more sophisticated providers have internal tools for generating things like sample sets customized to a prospect.

We're going to be adding a feature where we can flag fields as identifying keys and index them. We'll start with a simple intersection count ("upload 100 stock tickers, see how many records match"). Then we'll add an interactive feature to let a prospective customer generate all of the stats in the dictionary scoped down to the subset of data they care about. It's important to be able to answer questions like "for the 100 tickers I care about, how many NULLs are there for this other column?".

Maybe someday we'll even get into the more general record linkage problem when there's no reliable matching key.

carterehsmith · on Feb 25, 2020

That sounds very useful.

I am also super impressed that you managed to present your product without mentioning "big data" or "machine learning" or AI - given that anyone that does anything these days crams those big words in.

Thats is good, good luck.

stevepike · on Feb 25, 2020

Thanks, I'm with you on the big data buzzwords and trying to avoid overengineering things (one of my favorite HN posts ever is https://news.ycombinator.com/item?id=8908462).

Right now the data scanning is just a fork of https://github.com/BurntSushi/xsv/ running in a container with plenty of ram, and we've handled files in the ~20GB range with no problem. I think we could actually scale up to ~100GB files with xsv, which seems to cover 99%+ of data providers we're running in to. Providers might be processing massive amounts of data but the eventual deliverable they share with their customers is rarely too big for one machine.

That said, we will probably move away from our super simple stack towards running a Spark cluster in the medium term. Not for "big data" (actually I expect Spark to have higher latency and possibly to be slower for a moderate sized dataset than the rust solution) but because we want to be able to run multiple parallel scans over the datasets for upcoming future features. Some of that will involve a DAG of dependencies (e.g., do type detection first to figure out fields that are categorical, then generate visualizations where the plots are grouped by whatever the values are of the categorical field). There are also a bunch of nice libraries in the spark world for more comprehensive stats.

loganfrederick · on Feb 24, 2020

This (or something like it) makes a lot of sense to me. I've been at multiple organizations where there have been efforts to create these "data dictionaries" explaining the meaning of the data, especially when the schemas or APIs are not well designed.

But then manually writing documentation is obviously tedious and can typically only be written by the data team that knows the underlying data well, which is not always the best use of their time.

I'll definitely be following Syndetic and hope they can help crack this problem.

aswihart · on Feb 24, 2020

(This is Allison) - thank you! It's interesting for us to see how the problem is handled at different types of organizations, because as you point out the data team knows the underlying data best but is not often customer-facing.

gvv · on Feb 25, 2020

Great concept, although worried about uploading datasets with sensitive data to your cloud.

I usually use this for Pandas Profiling to accomplish EDA tasks https://github.com/pandas-profiling/pandas-profiling

knes · on Feb 24, 2020

Super great idea. I've been talking to data scientist/eng in B2B SaaS space on how we should bring best practices like that to the Sales/Marketing/Business ops world too.

What would you say are the differences between syndetic and qri.io (not affiliated in anyway)

stevepike · on Feb 24, 2020

Thanks for the kind words! I hadn't seen qri.io before, but from a read of their website I think it's broadly similar to Dolt (https://www.liquidata.co/) which is git for datasets. Kaggle has a similar data hub at https://www.kaggle.com/datasets that's not open source but is in the same space.

Our approach is to scan the datasets wherever they currently live in production rather than being a new way to store the data. The industry seems to have settled on FTP and S3 for now, and we think it's important that we connect to the same exact thing a customer would access. That lets us keep the dictionary up to date automatically without the data providers needing to change their storage infrastructure.

SamuelAdams · on Feb 24, 2020

I wonder if you could combine your service with freely available datasets like Google Dataset Search [1] to demo what a large amount of various datasets would look like under your service.

[1]: https://datasetsearch.research.google.com/search?query=puppi...

aswihart · on Feb 24, 2020

We would love to do that at some point. There is tons of open data out there, but not a lot of it has useful descriptions at the field level (only the dataset level), so it would take some time to put a collection together that is robust. Also the demo on our splash page is a demo of the artifact we create (i.e. the published dictionary) only. The other side of the web app is a management layer to bundle datasets into collections, annotate fields, configure sample sets, and share the artifacts. We'll work on fleshing out our demo to show what the system looks like when there are hundreds or thousands of datasets.

bitprincess · on Feb 25, 2020

I have been doing this for my company, when there is a sector you have data on - there is also external factors like how much of the market does your db cover for prices etc. and NLP on different items and metrics for different types of data in your offerings. What do you do with this when the actual field names are very general (i.e. item metric region unit)

aswihart · on Feb 25, 2020

We actually built in the concept of "display names" for exactly this purpose. You want to keep the actual field name in the schema so that ingestion works properly, but you also want to describe the field as helpfully as possible. In another comment, Steve mentioned that we are trying to tackle the first problem you mention (how much overlap is there between my dataset and other external factors) with the concept of creating dataset intersection or relevance scores.

mason55 · on Feb 24, 2020

Any plans on a hosted version? Either regular on prem or as something that can be privately hosted on e.g. AWS like DataBricks or Snowflake?

I love the idea but we could never expose most of our data to a public SaaS. There are all kinds of restrictions we have on things like data privacy and data needing to stay in specific regions.

stevepike · on Feb 24, 2020

Yeah, on-premise or at least private cloud has come up a few times. Beyond the data privacy and licensing requirements it'd also just be plain faster in some cases. We haven't offered it yet just because we're a small company and are rapidly adding features. Our backend is mostly running in k8s so I don't think it'll be a _huge_ technical rewrite to get it running in private cloud. Frankly I just don't have experience supporting software running outside my control and want to make sure we take the time to do it correctly.

mason55 · on Feb 24, 2020

Cool. We're an enterprise B2B SaaS company that does a ton of data interchange with our customers. Each project burns a bunch of engineering hours and calendar time because customers send us data that doesn't match the spec or we just don't have anyone outside the engineering org who can confirm that the data looks how it's supposed to.

Something which simplified the process of analyzing sample data and provided a view of it to a non-technical user would be very valuable. But as I mentioned in my previous comment we could never expose any of our data outside of our private infrastructure, so we can't use this until there are other options for hosting.

aripickar · on Feb 24, 2020

This looks really cool! I've worked with large data sets before and one of the most annoying things was when they were split up into multiple files. Do you currently support statistics across multiple datasets?

Also, how did you come up with the pricing? 500 and call us seems like a lot per month

stevepike · on Feb 24, 2020

Thanks! We can definitely combine multiple files into one dataset so long as they share the same fields. We've got one customer that keeps their data in an S3 bucket as one JSON file per record, so for them we're scanning ~480K files to construct the stats. If you've got multiple different datasets we've got a concept of "collections" for organization.

We came up with the pricing strategy based on conversations with early customers. We want to be able to say yes to integrations with whatever system they're using to store their data so we need flexibility at the early stage.

tzm · on Feb 25, 2020

This reminds me of readme generator for Frictionless 'Data Package' https://frictionlessdata.io/

aswihart · on Feb 25, 2020

Interesting, I just came across Frictionless recently as part of the FISD group that's trying to implement a set of standards across the financial industry for documenting datasets.

pplonski86 · on Feb 24, 2020

What data types do you plan to support? Have you considered supporting datasets with images+labels used in computer vision? How would you like to handle them?

Are you going to support data labeling tasks?

stevepike · on Feb 24, 2020

We've talked with some image data providers who are creating datasets for use in machine learning. We're not running any of our own models on image data right now, so I think the place we can be most useful is in summarizing metadata about the images in cases where the dataset isn't just an image file. For example, if the dataset is images of intersections plus bounding box coordinates of street signs we can tell a prospective consumer what % of images have a street sign in them. If you have a little more metadata (e.g., what time of day the photo is) the stats get much more useful out of the box.

I don't understand the data labeling question. How would you imagine us getting involved there?

wefarrell · on Feb 28, 2020

Makes a ton of sense to me, I really wish the datasets that I work with had this or some kind of equivalent.

Would be really great if you could generate a fake dataset of similar size for testing purposes. It would take some thinking but it would be really useful for building a consumer before getting the full dataset.

aswihart · on March 4, 2020

I missed this comment earlier - we can pilot with sample/fake data to give the user a sense of how the system works and what workflows within it work best for their company. The versioning/diffs wouldn't make much sense, but that's probably ok.

coolsank · on Feb 24, 2020

very cool product! I've worked on much smaller datasets with pandas and their inbuilt profiling report method can slow things down to a crawl!. Hoping to see more from you guys :)

stevepike · on Feb 24, 2020

A very reductionist version of our company that Allison hates when I use is "csvstat but on the internet" :-). I think the problem of auto-summarizing datasets has hit kind of a local maximum in what pandas dataframe summaries (csvstat is a similar python tool) can do on one machine. We will be able to add much fancier things like sophisticated type classification (e.g., is this field a stock ticker) without burning your CPU.

coolsank · on Feb 24, 2020

hah! but this is a very interesting area. You're right on the auto-summarizing issue becoming a problem these days with the usage of larger datasets. Data versioning also is starting to become a larger problem and I saw that you guys already have addressed it in your enterprise product. Hoping to see some sort of API-like version for comparison of data troves from different timelines in the future.

shostack · on Feb 24, 2020

What industries do you see this being most useful to?

aswihart · on Feb 24, 2020

We think data-as-a-service is a new and growing category, and this includes startups we would consider "pure" data companies (e.g. a company that sells data on airfares across the web) and companies that have been around for decades selling csvs that they deliver over FTP (e.g. a giant company like ADP that sells payroll processing data). Based on our backgrounds we have a fair amount of experience in the alternative data space, which is basically any data that might have some signal to a hedge fund that isn't market data. I'm finding that the providers in that space are interested in expanding their customer base to corporates (e.g. Walmart, McDonalds) whereas the providers currently selling to corporates are interested in expanding into alternative data.

adampgreen · on Feb 24, 2020

Awesome! Congrats on the launch! Excited to check it out.

gkoberger · on Feb 24, 2020

Hey! I’m Greg of ReadMe... think Syndetic but for APIs rather than datasets!

Congrats on the launch :) I’ll find you at Alumni Demo Day, or feel free to reach out if I can help with anything! And welcome to the war on PDFs :)

aswihart · on Feb 24, 2020

This made our day : ) Thank you!

nocitrek · on Feb 24, 2020

Does this support more complex data structures - for example parquet files?

stevepike · on Feb 24, 2020

The manual data upload is restricted to well-formed CSV files w/ headers on the first row right now. For the "contact us" higher tier we'll handle any file format that we can extract columns from, so parquet would be fine.