Launch HN: Syndetic (YC W20) – Software for explaining datasets
124 points by stevepike on Feb 24, 2020 | 34 comments
Hi HN,

We're Allison and Steve of Syndetic (https://www.getsyndetic.com). Syndetic is a web app that data providers use to explain their datasets to their customers. Think ReadMe but for datasets instead of APIs.

Every exchange of data ultimately comes down to a person at one company explaining their data to a person at another. Data buyers need to understand what's in the dataset (what are the fields and what do they mean) as well as how valuable it can be to them (how complete is it? how relevant?). Data providers solve this problem today with a "data dictionary" which is a meta spreadsheet explaining a dataset. This gets shared alongside some sample data over email. These artifacts are constantly getting stale as the underlying data changes.

Syndetic replaces this with software connected directly to the data that's being exchanged. We scan the data and automatically summarize it through statistics (e.g., cardinality), coverage rates, frequency counts, and sample sets. We do this continuously to monitor data quality over time. If a field gets removed from the file or goes from 1% null to 20% null we automatically alert the provider so they can take a look. For an example of what we produce but on an open dataset check out the results of the NYC 2015 Tree census at https://www.getsyndetic.com/publish/datasets/f1691c5d-56a9-4....
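As a rough illustration of the kind of scan and alerting described above, here is a minimal Python sketch. It is not Syndetic's implementation (the post says the real scanner is a Rust fork of xsv); the function names `profile` and `drift_alerts` and the 10-point null-rate threshold are assumptions made for this example.

```python
import csv
import io


def profile(csv_text):
    """Return {field: {"null_rate": float, "cardinality": int}} for CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for field in (rows[0].keys() if rows else []):
        values = [row[field] for row in rows]
        # Treat empty strings and missing cells as NULL (an assumption).
        non_null = [v for v in values if v not in ("", None)]
        stats[field] = {
            "null_rate": 1 - len(non_null) / len(values),
            "cardinality": len(set(non_null)),
        }
    return stats


def drift_alerts(previous, current, max_null_increase=0.10):
    """Compare two profiles; report removed fields and large null-rate jumps."""
    alerts = []
    for field, old in previous.items():
        if field not in current:
            alerts.append(f"field removed: {field}")
        elif current[field]["null_rate"] - old["null_rate"] > max_null_increase:
            alerts.append(f"null rate jumped: {field}")
    return alerts
```

Running `profile` on each delivery and passing consecutive results to `drift_alerts` gives the "field removed" and "1% null to 20% null" style of check; a real system would also persist the profiles and notify someone.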

We met at SevenFifty, a tech startup connecting the three tiers of the beverage alcohol trade in the United States. SevenFifty integrates with the backend systems of 1,000+ beverage wholesalers to produce a complete dataset of what a restaurant can buy wholesale, at what price, in any zipcode in America. While the core business is a marketplace between buyers and sellers of alcohol, we built a side product providing data feeds back to beverage wholesalers about their own data. Syndetic grew out of the problems we experienced doing that. Allison kept a spreadsheet in dropbox of our data schema, which was very difficult to maintain, especially across a distributed team of data engineers and account managers. We pulled sample sets ad hoc, and ran stats over the samples to make sure the quality was good. We spent hours on the phone with our customers putting it all together to convey the meaning and the value of our data. We wondered why there was no software out there specifically built for data-as-a-service.

We also have backgrounds in quantitative finance (D. E. Shaw, Tower Research, BlackRock), large purchasers of external data, where we've seen the other side of this problem. Data purchasers spend a lot of time up-front evaluating the quality of a dataset, but they often don’t monitor how the quality changes over time. They also have a hard time assessing the intersection of external datasets with data they already have. We're focusing on data providers first but expect to expand to purchasers down the road.

Our tech stack is one monolithic repo split into the frontend web app and backend data scanning. The frontend is a Rails app and the data scanning is written in Rust (we forked the amazing library xsv). One quirk is that we want to run the scanning in the same region as our customers' data to keep bandwidth costs and transfer time down, so we're actually running across both GCP and AWS.

If you're interested in this field you might enjoy reading the paper "Datasheets for datasets" (https://arxiv.org/pdf/1803.09010.pdf) which proposes a standardized method for documenting datasets modeled after the spec sheets that come with electronics. The authors propose that “for dataset creators, the primary objective is to encourage careful reflection on the process of creating, distributing, and maintaining a dataset, including any underlying assumptions, potential risks or harms, and implications of use.” We agree with them that as more and more data is sold, the chance of misunderstanding what’s in the data increases. We think we can help here by building qualitative questions into Syndetic alongside automation.

We have lots of ideas of where we could go with this, like fancier type detection (e.g. is this a phone number), validations, visualizations, anomaly detection, stability scores, configurable sampling, and benchmarking. We'd love feedback and to hear about your challenges working with datasets!

Neat! Congrats on the launch - the demo is very helpful to understand the product. Having consumed long, painful PDF data dictionaries in the past, this is a big breath of fresh air. Excited to see where Syndetic goes!

For me, the most painful part of working with 3rd party data was actually figuring out the "match rate" to internal data. For example, you might be a consumer-facing company who hopes to add more context to your internal data by pulling in 3rd party information for existing clients. To match your internal data to a 3rd party dataset, you usually match on some hashed email (or similar identifier) to see what percentage of your consumer records will be available in the 3rd party dataset. Have you thought about something like that with your tool? Maybe you can upload a sample of hashed emails and see how different match rates pan out.
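The match-rate check described above can be sketched in a few lines of Python. This is only an illustration: SHA-256 and the trim-and-lowercase normalization are assumptions, and both parties would need to agree on the exact hashing scheme for the comparison to mean anything.

```python
import hashlib


def hash_email(email):
    """Normalize (trim, lowercase) and SHA-256 hash an email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def match_rate(internal_hashes, provider_hashes):
    """Fraction of distinct internal hashes that also appear in the provider's set."""
    internal = set(internal_hashes)
    if not internal:
        return 0.0
    return len(internal & set(provider_hashes)) / len(internal)
```

For example, hashing a sample of your own customer emails with `hash_email` and calling `match_rate(my_hashes, provider_hashes)` yields the share of your records the third-party dataset could enrich.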

Yes! This has come up across multiple industries and is probably the feature on our roadmap I'm most excited about. The implementation is tricky but customers definitely care about the intersection of a provider's data with their own. Some more sophisticated providers have internal tools for generating things like sample sets customized to a prospect.

We're going to be adding a feature where we can flag fields as identifying keys and index them. We'll start with a simple intersection count ("upload 100 stock tickers, see how many records match"). Then we'll add an interactive feature to let a prospective customer generate all of the stats in the dictionary scoped down to the subset of data they care about. It's important to be able to answer questions like "for the 100 tickers I care about, how many NULLs are there for this other column?".

Maybe someday we'll even get into the more general record linkage problem when there's no reliable matching key.
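A minimal Python sketch of the two questions above (how many records match an uploaded key list, and how many NULLs another column has within that subset). The record layout and the function names are hypothetical, chosen for illustration rather than taken from the product.

```python
def matching_record_count(records, key_field, wanted_keys):
    """Count records whose key is in the uploaded list."""
    wanted = set(wanted_keys)
    return sum(1 for r in records if r.get(key_field) in wanted)


def scoped_null_count(records, key_field, wanted_keys, column):
    """Count NULLs in `column`, restricted to records whose key was uploaded."""
    wanted = set(wanted_keys)
    return sum(
        1 for r in records
        if r.get(key_field) in wanted and r.get(column) is None
    )
```

With an index on the identifying key, as the comment proposes, the same counts could be answered without a full pass over the data; the linear scan here just keeps the sketch short.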

That sounds very useful.

I am also super impressed that you managed to present your product without mentioning "big data" or "machine learning" or AI - given that anyone that does anything these days crams those big words in.

That's good. Good luck.

Thanks, I'm with you on the big data buzzwords and trying to avoid overengineering things (one of my favorite HN posts ever is https://news.ycombinator.com/item?id=8908462).

Right now the data scanning is just a fork of https://github.com/BurntSushi/xsv/ running in a container with plenty of RAM, and we've handled files in the ~20GB range with no problem. I think we could actually scale up to ~100GB files with xsv, which seems to cover 99%+ of data providers we're running into. Providers might be processing massive amounts of data but the eventual deliverable they share with their customers is rarely too big for one machine.

That said, we will probably move away from our super simple stack towards running a Spark cluster in the medium term. Not for "big data" (actually I expect Spark to have higher latency and possibly to be slower for a moderate sized dataset than the Rust solution) but because we want to be able to run multiple parallel scans over the datasets for upcoming features. Some of that will involve a DAG of dependencies (e.g., do type detection first to figure out fields that are categorical, then generate visualizations where the plots are grouped by whatever the values are of the categorical field). There are also a bunch of nice libraries in the Spark world for more comprehensive stats.
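To make the "DAG of dependencies" idea concrete, here is a small single-machine sketch using Python's standard-library `graphlib` (Python 3.9+) rather than Spark. The step names, the "three or fewer distinct values means categorical" heuristic, and the `ctx` dictionary are all assumptions for illustration.

```python
from graphlib import TopologicalSorter


def detect_types(ctx):
    # Toy heuristic (an assumption): few distinct values means categorical.
    ctx["categorical"] = sorted(
        f for f, vals in ctx["columns"].items() if len(set(vals)) <= 3
    )


def grouped_counts(ctx):
    # Depends on detect_types: counts per value of each categorical field,
    # standing in for "plots grouped by the categorical field".
    ctx["groups"] = {
        f: {v: ctx["columns"][f].count(v) for v in sorted(set(ctx["columns"][f]))}
        for f in ctx["categorical"]
    }


# Each step maps to the set of steps that must run before it.
DEPENDENCIES = {"detect_types": set(), "grouped_counts": {"detect_types"}}
STEPS = {"detect_types": detect_types, "grouped_counts": grouped_counts}


def run_scan(columns):
    """Run every step in dependency order and return (order, results)."""
    ctx = {"columns": columns}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        STEPS[name](ctx)
    return order, ctx
```

Steps with no dependency between them could run in parallel; `TopologicalSorter` also offers `prepare()`, `get_ready()` and `done()` for that style of scheduling.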

This (or something like it) makes a lot of sense to me. I've been at multiple organizations where there have been efforts to create these "data dictionaries" explaining the meaning of the data, especially when the schemas or APIs are not well designed.

But then manually writing documentation is obviously tedious and can typically only be written by the data team that knows the underlying data well, which is not always the best use of their time.

I'll definitely be following Syndetic and hope they can help crack this problem.

(This is Allison) - thank you! It's interesting for us to see how the problem is handled at different types of organizations, because as you point out the data team knows the underlying data best but is not often customer-facing.

Great concept, although worried about uploading datasets with sensitive data to your cloud.

I usually use Pandas Profiling for EDA tasks: https://github.com/pandas-profiling/pandas-profiling

Super great idea. I've been talking to data scientists and engineers in the B2B SaaS space about how we should bring best practices like that to the Sales/Marketing/Business ops world too.

What would you say are the differences between Syndetic and qri.io (not affiliated in any way)?

Thanks for the kind words! I hadn't seen qri.io before, but from a read of their website I think it's broadly similar to Dolt (https://www.liquidata.co/) which is git for datasets. Kaggle has a similar data hub at https://www.kaggle.com/datasets that's not open source but is in the same space.

Our approach is to scan the datasets wherever they currently live in production rather than being a new way to store the data. The industry seems to have settled on FTP and S3 for now, and we think it's important that we connect to the same exact thing a customer would access. That lets us keep the dictionary up to date automatically without the data providers needing to change their storage infrastructure.

I wonder if you could combine your service with freely available datasets like Google Dataset Search [1] to demo what a large amount of various datasets would look like under your service.

[1]: https://datasetsearch.research.google.com/search?query=puppi...

We would love to do that at some point. There is tons of open data out there, but not a lot of it has useful descriptions at the field level (only the dataset level), so it would take some time to put a collection together that is robust. Also the demo on our splash page is a demo of the artifact we create (i.e. the published dictionary) only. The other side of the web app is a management layer to bundle datasets into collections, annotate fields, configure sample sets, and share the artifacts. We'll work on fleshing out our demo to show what the system looks like when there are hundreds or thousands of datasets.

I have been doing this for my company. When you have data on a sector, there are also external factors, like how much of the market your db covers for prices, plus NLP on different items and metrics for the different types of data in your offerings. What do you do when the actual field names are very general (e.g. item, metric, region, unit)?

We actually built in the concept of "display names" for exactly this purpose. You want to keep the actual field name in the schema so that ingestion works properly, but you also want to describe the field as helpfully as possible. In another comment, Steve mentioned that we are trying to tackle the first problem you mention (how much overlap is there between my dataset and other external factors) with the concept of creating dataset intersection or relevance scores.
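One way to picture the "display names" idea is a field record that keeps the schema name untouched and carries an optional human-friendly label next to it. This is a hypothetical data structure for illustration, not Syndetic's actual model; the example label and description in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str               # exact name in the schema; ingestion relies on it
    display_name: str = ""  # optional human-friendly label
    description: str = ""   # optional free-text annotation

    def label(self):
        """Label shown to readers: the display name if set, else the schema name."""
        return self.display_name or self.name


def dictionary_rows(fields):
    """Rows for a published dictionary: (schema name, label, description)."""
    return [(f.name, f.label(), f.description) for f in fields]
```

For example, `Field("metric", "Wholesale price (USD)", "Price per case")` would be published under its display name while ingestion code keeps reading the column `metric`.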

Any plans on a hosted version? Either regular on prem or as something that can be privately hosted on e.g. AWS like DataBricks or Snowflake?

I love the idea but we could never expose most of our data to a public SaaS. There are all kinds of restrictions we have on things like data privacy and data needing to stay in specific regions.

Yeah, on-premise or at least private cloud has come up a few times. Beyond the data privacy and licensing requirements it'd also just be plain faster in some cases. We haven't offered it yet just because we're a small company and are rapidly adding features. Our backend is mostly running in k8s so I don't think it'll be a _huge_ technical rewrite to get it running in private cloud. Frankly I just don't have experience supporting software running outside my control and want to make sure we take the time to do it correctly.

Cool. We're an enterprise B2B SaaS company that does a ton of data interchange with our customers. Each project burns a bunch of engineering hours and calendar time because customers send us data that doesn't match the spec or we just don't have anyone outside the engineering org who can confirm that the data looks how it's supposed to.

Something which simplified the process of analyzing sample data and provided a view of it to a non-technical user would be very valuable. But as I mentioned in my previous comment we could never expose any of our data outside of our private infrastructure, so we can't use this until there are other options for hosting.

This looks really cool! I've worked with large data sets before and one of the most annoying things was when they were split up into multiple files. Do you currently support statistics across multiple datasets?

Also, how did you come up with the pricing? 500 and "call us" seems like a lot per month.

Thanks! We can definitely combine multiple files into one dataset so long as they share the same fields. We've got one customer that keeps their data in an S3 bucket as one JSON file per record, so for them we're scanning ~480K files to construct the stats. If you've got multiple different datasets we've got a concept of "collections" for organization.
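A sketch of how stats could be accumulated across many one-record JSON files that share the same fields, without holding them all in memory. It takes an iterable of JSON strings so it stays independent of S3; the function name and the strict same-fields check are assumptions for illustration.

```python
import json
from collections import Counter


def scan_json_records(documents):
    """Return per-field coverage (share of non-null values) across JSON records."""
    total = 0
    non_null = Counter()
    expected_fields = None
    for text in documents:
        record = json.loads(text)
        fields = set(record)
        if expected_fields is None:
            expected_fields = fields
        elif fields != expected_fields:
            # Files only combine into one dataset if they share the same fields.
            raise ValueError(f"field mismatch: {sorted(fields ^ expected_fields)}")
        total += 1
        for field, value in record.items():
            if value is not None:
                non_null[field] += 1
    return {f: non_null[f] / total for f in sorted(expected_fields or [])}
```

Because only counters are kept, the same loop works whether the iterable yields two documents or several hundred thousand.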

We came up with the pricing strategy based on conversations with early customers. We want to be able to say yes to integrations with whatever system they're using to store their data so we need flexibility at the early stage.

This reminds me of the readme generator for Frictionless 'Data Package': https://frictionlessdata.io/

Interesting, I just came across Frictionless recently as part of the FISD group that's trying to implement a set of standards across the financial industry for documenting datasets.

What data types do you plan to support? Have you considered supporting datasets with images+labels used in computer vision? How would you like to handle them?

Are you going to support data labeling tasks?

We've talked with some image data providers who are creating datasets for use in machine learning. We're not running any of our own models on image data right now, so I think the place we can be most useful is in summarizing metadata about the images in cases where the dataset isn't just an image file. For example, if the dataset is images of intersections plus bounding box coordinates of street signs we can tell a prospective consumer what % of images have a street sign in them. If you have a little more metadata (e.g., what time of day the photo is) the stats get much more useful out of the box.
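The street-sign example above comes down to a coverage statistic over annotation metadata. Here is a minimal Python sketch; the annotation layout (a list of images, each with a `boxes` list whose entries carry a `label`) is an assumed format, not one from the thread.

```python
def share_with_label(annotations, label):
    """Fraction of images with at least one bounding box carrying `label`."""
    if not annotations:
        return 0.0
    hits = sum(
        1 for image in annotations
        if any(box["label"] == label for box in image["boxes"])
    )
    return hits / len(annotations)
```

Calling `share_with_label(annotations, "street_sign")` gives the share of images that contain a street sign; grouping the same computation by extra metadata such as time of day would follow the same pattern.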

I don't understand the data labeling question. How would you imagine us getting involved there?

Makes a ton of sense to me, I really wish the datasets that I work with had this or some kind of equivalent.

Would be really great if you could generate a fake dataset of similar size for testing purposes. It would take some thinking but it would be really useful for building a consumer before getting the full dataset.

I missed this comment earlier - we can pilot with sample/fake data to give the user a sense of how the system works and what workflows within it work best for their company. The versioning/diffs wouldn't make much sense, but that's probably ok.

Very cool product! I've worked on much smaller datasets with pandas, and its built-in profiling report method can slow things down to a crawl! Hoping to see more from you guys :)

A very reductionist version of our company that Allison hates when I use is "csvstat but on the internet" :-). I think the problem of auto-summarizing datasets has hit kind of a local maximum in what pandas dataframe summaries (csvstat is a similar python tool) can do on one machine. We will be able to add much fancier things like sophisticated type classification (e.g., is this field a stock ticker) without burning your CPU.

Hah! But this is a very interesting area. You're right that auto-summarizing is becoming a problem these days with the use of larger datasets. Data versioning is also starting to become a larger problem, and I saw that you guys have already addressed it in your enterprise product. Hoping to see some sort of API-like version for comparing data troves from different timelines in the future.

What industries do you see this being most useful to?

We think data-as-a-service is a new and growing category, and this includes startups we would consider "pure" data companies (e.g. a company that sells data on airfares across the web) and companies that have been around for decades selling CSVs that they deliver over FTP (e.g. a giant company like ADP that sells payroll processing data). Based on our backgrounds we have a fair amount of experience in the alternative data space, which is basically any data that might have some signal to a hedge fund that isn't market data. I'm finding that the providers in that space are interested in expanding their customer base to corporates (e.g. Walmart, McDonald's) whereas the providers currently selling to corporates are interested in expanding into alternative data.

Awesome! Congrats on the launch! Excited to check it out.

Hey! I’m Greg of ReadMe... think Syndetic but for APIs rather than datasets!

Congrats on the launch :) I’ll find you at Alumni Demo Day, or feel free to reach out if I can help with anything! And welcome to the war on PDFs :)

This made our day : ) Thank you!

Does this support more complex data structures - for example parquet files?

The manual data upload is restricted to well-formed CSV files w/ headers on the first row right now. For the "contact us" higher tier we'll handle any file format that we can extract columns from, so parquet would be fine.

