File size upper limit when collecting from File System?
I am seeing some irregularities with collecting large files from a filesystem.
We batch process files anywhere from 100MB to 100GB in size. I am currently noticing an issue with larger files.
To troubleshoot, I created a collector that reads the data in and writes it directly back to disk; no other ETL is done on the data.
[Screenshot: result of one collection]
[Screenshot: result of collecting the same file from the same location, using the same event breakers, to the same destination]
My current environment is 1 leader, 1 worker, and the file is being picked up from an NFS mount.
As you can see from the screenshots, only 3-4 million of the ~24 million events in the file are being collected. The destination is writing only about 5-6 GB to disk from the original ~38 GB.
I don't see any errors in the job log, and I can't find any setting for worker process limits or job limits that would affect this.
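For reference, a rough way to confirm the discrepancy outside of Stream is to count newline-delimited events in the source file and in the destination output and compare them. This is only a sketch and assumes one event per line; the source and destination paths below are placeholders, not my actual mount points.

    import glob

    def count_lines(path):
        # Count newline-delimited events without loading the whole file into memory.
        n = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                n += chunk.count(b"\n")
        return n

    # Placeholder paths: the NFS source file and the directory the destination writes to.
    source_total = count_lines("/mnt/nfs/source/bigfile.log")
    dest_total = sum(count_lines(p) for p in glob.glob("/data/out/**/*.log", recursive=True))
    print(f"source events: {source_total}, collected events: {dest_total}")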
Answers
-
What version of Stream are you on?
Can you show your Collector settings?
Turning on debug for the collector could provide more information.
-
Might be worth opening a ticket. I think we will need to take a closer look at your logs/configurations through a diag.
-
Sounds good. I'll open a case with a link to this thread.
-
After doing that, you can look at the logs for the job by going to Monitoring → Job Inspector.
-
Stream version: 3.4.1
Most of the collector settings are default.
I have added my event breakers.
I set a custom field for routing specifically back out to the filesystem. I am going to run a collection with debug on now.
-
The majority of the debug logs are…
"message: failed to pop task reason: no task in the queue"
and
"message: skipping metrics flush on pristine metrics"Nothing sticks out to me as bad or breaking. No tasks in queue because its one file… so only 1 task.
Also, this third run has again captured a different number of events.
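In case it helps, here is a rough sketch of how I could scan an exported job log for anything above info level instead of eyeballing the debug output. It assumes the log is newline-delimited JSON with "level" and "message" fields, and the path is a placeholder.

    import json
    from collections import Counter

    levels = Counter()
    # Placeholder path to an exported job log (assumed to be newline-delimited JSON).
    with open("/tmp/job_log.ndjson") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            level = str(entry.get("level", "unknown")).lower()
            levels[level] += 1
            if level in ("warn", "warning", "error", "fatal"):
                print(level, entry.get("message"))

    print("entries by level:", dict(levels))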