Now that we have figured out what data to collect, let’s briefly dwell on how to collect it. For many sources, you can systematically collect all the available data, and there are many ways to manage these data flows: you can use an application programming interface (API), collect files from an FTP server, or even scrape screen data and save what you need. A one-time task is easy to handle, but if you update or add data frequently, you need to decide how to work with this stream. For smaller tables or files, it may be easier to replace them entirely with a new, larger dataset. In my team, tables with up to 100,000 rows are considered small.
For larger datasets, a more complex process with change analysis needs to be established. In the simplest case, new data is always added as new rows (for example, transaction logs, where current data should never be updated or deleted). Here you can simply INSERT the new data into the current table.
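This append-only pattern can be sketched as follows, here using Python's built-in sqlite3 module with a hypothetical `transactions` table (the table and column names are illustrative, not from the original text):

```python
import sqlite3

# Hypothetical append-only log table: new rows are only ever INSERTed,
# existing rows are never updated or deleted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (tx_id INTEGER PRIMARY KEY, amount REAL, ts TEXT)"
)

# A fresh batch of transaction-log rows arriving from the source.
new_batch = [
    (1, 19.99, "2023-05-01T10:00:00"),
    (2, 5.49, "2023-05-01T10:05:00"),
]

# Append the new batch to the current table.
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", new_batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

Because nothing is ever modified in place, each load is just one INSERT batch, which keeps the pipeline simple and easy to rerun.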
In more complex cases, you need to decide whether you will insert (INSERT) a row with new data, delete (DELETE), or update (UPDATE).
For other data sources, you may need to take a sample. Collecting everything can sometimes be too costly: think of conducting surveys, running clinical trials, or analyzing all Twitter posts. How the sampling is done has a huge impact on data quality: a biased sample undermines both the quality and the usability of the data. The simplest approach is a “simple random sample,” where each record’s inclusion is decided by a flip of a coin. The bottom line is that the sample should truly represent the larger dataset from which it is drawn.
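The coin-flip idea translates directly into code. A minimal sketch, assuming hypothetical numeric record IDs and a 10% sampling rate:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

population = list(range(1, 1001))  # 1,000 hypothetical record IDs

# Simple random sample: each record is kept independently with
# probability 0.10 -- the "coin flip," using a coin that lands
# heads 10% of the time.
sample = [rec for rec in population if random.random() < 0.10]
```

Because every record gets the same independent chance of inclusion, the sample is unbiased with respect to the population it is drawn from.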
Careful attention should be paid to sampling data collected over a period of time. Say you want to sample site sessions each day: you select 10% of the sessions and load information about them into a database for further analysis. If you do this every day, you will get a set of random, independent sessions, but you may miss those same users’ return visits on the following days.
The sample may not capture users with multiple sessions: a user may be in Monday’s sample but absent when they return to the site on Wednesday. So if you are interested in repeat sessions and your site’s users return frequently, it may be more effective to randomly select visitors and track their sessions over time than to randomly sample sessions. That way you get higher-quality data to work with. (Though you might not be too pleased to see users who never return to the site.) The sampling mechanism should be determined by the business question you are trying to answer.
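A common way to sample visitors rather than sessions is to hash the user ID, so a user is either always in the sample or always out of it, regardless of the day. A sketch under that assumption (the `in_sample` helper and the ID format are hypothetical):

```python
import hashlib

def in_sample(user_id: str, rate: float = 0.10) -> bool:
    """Deterministically decide whether a user belongs to the sample.

    Hashing the user ID gives the same answer every day, so a sampled
    user's Monday and Wednesday sessions both land in the sample.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # map the hash to 10,000 buckets
    return bucket < rate * 10_000      # keep the first 10% of buckets

# The decision is stable across days: either both sessions are sampled
# or neither is.
monday = in_sample("user-123")
wednesday = in_sample("user-123")
```

Unlike a daily coin flip per session, this keeps each sampled user’s full session history together, which is exactly what repeat-visit analysis needs.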
Finally, should you collect raw or aggregated data? Some data providers offer dashboards in which the data is already aggregated into analysts’ key metrics, and this can be a great help. But if the data is really valuable, this approach will not be enough: analysts will want to go deeper into their study and examine the data from various angles, which dashboards do not allow.
Such reports and dashboards can still serve well as an archive. In other cases, in my experience, it is better to collect raw data whenever possible, since you can always aggregate it into the required metrics, but not vice versa. Once you have the raw data, you can work with it however the question demands. Of course, there are rare exceptions.
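The asymmetry is easy to see in code. From raw rows you can always roll up an aggregate, but the aggregate alone cannot be unrolled back into rows. A sketch with hypothetical raw session records:

```python
from collections import defaultdict

# Hypothetical raw session records: (user_id, date, pages_viewed).
raw_sessions = [
    ("u1", "2023-05-01", 5),
    ("u2", "2023-05-01", 3),
    ("u1", "2023-05-02", 7),
]

# Raw data can always be rolled up into a daily metric...
daily_pages = defaultdict(int)
for user_id, date, pages in raw_sessions:
    daily_pages[date] += pages

# ...but given only {"2023-05-01": 8, "2023-05-02": 7}, there is no way
# to recover which users contributed how many pages on which day.
```

This is why keeping the raw rows preserves every future aggregation, while storing only the dashboard-level totals locks you into one view.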