
Download file

Jeff G Posts: 4

Total Cribl noob. I've been asked to use Cribl to download a file at regular intervals, but I can't see how that's done. The source is not anything in our environment - it's an external entity. If I were to do it manually in a scheduled Python job, it would look like this.

from datetime import datetime, timedelta
from io import BytesIO
import requests

# apikey assumed defined elsewhere (e.g., loaded from config)
oldest = (datetime.utcnow() - timedelta(hours=2)).strftime("%Y%m%d%H")  # two hours ago, to the hour
file_name = f"{oldest}.tar.bz2"
link = f"https://site.com/feeds/files/hourly/{oldest}"
r = requests.get(link, headers={"x-apikey": apikey})
file = BytesIO(r.content)

I've been asked to download the file with Cribl, send it to S3, then extract it and ingest the contents into Elastic. The thinking is that they expect more data sources in the future; on its own I'd probably consider this overkill, but I said I would give it a try and see. I'm just not sure Cribl does that sort of thing. Thanks for any advice you can offer.

Best Answer

  • Brendan Dalpe Posts: 201 mod
    Answer ✓

    Hi @Jeff G, I think the REST Collector can accomplish this use case.

    For the URL, you can use JavaScript to format the required timestamp. An example configuration for the Collect URL:

    `https://site.com/files/hourly/${C.Time.strftime(earliest || (new Date().getTime() / 1000) - (60 * 60 * 2), "%Y%m%d%H")}`
    

    (Note: the backticks are important to copy!)
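    For example, if a scheduled run fires at 2023-06-15 14:05 UTC, the two-hour lookback resolves the URL to:

        https://site.com/files/hourly/2023061512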

    Since the contents are bzip2 compressed, you'll need to use the Custom Command option to pipe the contents through bunzip2 and get the decompressed output.
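    As a rough sketch (exact field names may vary by version; bunzip2 with no arguments reads the compressed stream on stdin and writes the decompressed data to stdout):

        Custom Command: enabled
        Command: bunzip2
        Arguments: (none)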

    I don't know what your data looks like, so you'll need to build an appropriate pipeline to process the data for the destination. You can configure this under the Result Routing tab on the left. You'll also need to configure your Elasticsearch destination.

    To run everything on a schedule, click the Schedule button on the collector configuration. You can change the Earliest time parameter to your liking (note that it feeds the `earliest` variable used in the URL).
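    For example, a schedule matching your Python job might look like this (assuming hourly runs; the scheduler takes a cron expression and a relative Earliest time):

        Cron Schedule: 0 * * * *   (top of every hour)
        Earliest: -2h              (feeds the earliest variable used in the URL)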

    Remember to Commit & Deploy your changes before trying to feed data to the destination!

    Let us know how it goes!

Answers

  • Jeff G Posts: 4

    How come "Collectors" does not show up as a "Source" on the sources page for Quick Connect?

  • Jeff G Posts: 4
    edited June 2023

    For the file format: it's a bzip2-compressed UTF-8 text file containing one JSON object per line. I'll want to save the compressed file to S3, then uncompress it and send each line in the file (a JSON object) to Elastic (bulk import?).
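    For reference, each line looks something like this (there's an `id` field per object; the other fields and values here are invented for illustration):

        {"id": "abc123", "timestamp": "2023-06-15T12:00:00Z", "event": "example"}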

  • Jeff G Posts: 4

    Thanks Brendan, I'm able to download the file and save it to S3. I haven't tried Elastic yet. I mentioned some questions above, but have some additional ones. Appreciate your time and help.

    The default folder partitioning for MinIO is `C.Time.strftime(_time ? _time : Date.now()/1000, '%Y/%m/%d')`, but I'm getting folders from all over the timeline, 2015 through 2023 and everything in between. Shouldn't it just be yesterday, today, tomorrow…?
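    Evaluating the expression against a single hypothetical event, it looks like it keys off each event's parsed `_time` and only falls back to now when `_time` is missing:

        // hypothetical event whose _time parsed to 2015-01-01 00:00:00 UTC
        C.Time.strftime(1420070400, '%Y/%m/%d')  // => '2015/01/01'

    So maybe the scattered folders come from old timestamps in the data itself?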

    The default file name formatting is `.${C.env["CRIBL_WORKER_ID"]}.${__format}${__compression === "gzip" ? ".gz" : ""}`. If my data file is JSON, can I use what is defined in the file as `id` for the filename?
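    For example, could something along these lines work, assuming the file name expression can reference event fields (`id` being the top-level field in my data)?

        `${id}.${__format}${__compression === "gzip" ? ".gz" : ""}`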

    The file formatting brings up a larger question. Right now I'm unzipping with bunzip2 as you show in your example, and with the event breaker on carriage returns, the source seems to break things down into events. In the case above, I'm trying to save those events individually based on the id. However, I'd be fine just downloading the file from the site and saving it to S3 (no unzip, event breakout, or file rename required); I only need to unzip it when I import it into Elastic. As long as it's not unduly inefficient on resources and disk space, I don't mind. Just trying to figure out how best to do this. Thanks again!