Post 52 - Is that a log or a big snake?

02 Dec 2023

Task 8, Day 02, Log analysis O Data, All Ye Faithful

Data Science

Often involves programming, statistics, and the use of AI to examine and understand trends and patterns in data. Data scientist roles:

Role - Description
Data Collection - Collecting raw data (e.g. a list of recent transactions)
Data Processing - Turning raw data from collection into a standard format
Data Mining (Clustering/Classification) - Create relationships between data, find patterns and correlations
Analysis (Exploratory/Confirmatory) - Bulk of the analysis. Look for answers and projections
Communication (Visualisation) - Visualise data as charts, tables, maps, etc.

Data Science in Cybersecurity

Analysing data like log events leads to an intelligent understanding of ongoing events in the org. Other uses include:

Jupyter Notebooks

Open-source documents that combine code, text, and terminal functionality; easily shared and executed across systems. Good way to demonstrate and explain proofs of concept in cybersecurity. Notebooks consist of cells that can be executed one at a time, step by step. Cells are indicated by “In [#]”, where the number increments each time a cell runs successfully. Cells can be rearranged by dragging the area around the cell number.

Lists are used to store a collection of values as a variable.
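A minimal sketch of a list, assuming made-up values (the notes only name the variable transportation, not its contents):

```python
# Hypothetical values; the notes only give the variable name.
transportation = ['Train', 'Plane', 'Car', 'Walking']

# Lists keep their order and are indexed from 0.
first_item = transportation[0]
item_count = len(transportation)

print(transportation)
print(first_item, item_count)
```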

Pandas

Series - Data structure of key-value pairs (an index label paired with each value).

Turn data set from list variable (transportation) into a series by:

  1. creating a new variable transportation_series
  2. invoke Pandas' Series function pandas.Series (written pd.Series when pandas is imported as pd)
  3. provide variable transportation to function
  4. print out series print(transportation_series)
import pandas as pd

transportation_series = pd.Series(transportation)

print(transportation_series)

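The printed output also ends with dtype:, the data type of the stored values. Common dtypes include object (text or mixed values), int64, float64, bool, and datetime64.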
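To see how the dtype follows from the data, here is a small sketch that builds two series and checks their dtypes. The list contents and the distances_km variable are made up for illustration.

```python
import pandas as pd

# Hypothetical values for illustration.
transportation = ['Train', 'Plane', 'Car', 'Walking']
distances_km = [120.5, 900.0, 15.2, 1.1]

transportation_series = pd.Series(transportation)
distance_series = pd.Series(distances_km)

# The default index (0, 1, 2, ...) acts as the key, the list item as the value.
print(transportation_series[0])

# .dtype reports the same type shown on the last line of print(series).
print(transportation_series.dtype)  # a text type; the exact name depends on the pandas version
print(distance_series.dtype)        # floating-point numbers
```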

DataFrames are a grouping of series similar to spreadsheet or database. Provide data similar to:

data = [['Ben', 24, 'UK'],
        ['Jacob', 32, 'US'],
        ['Alice', 19, 'Germany']]

Convert to dataframe variable df by:

df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

Return all data by calling variable (i.e. df).

Can also return a specific row with df.loc[#], where # is the row's index label (with the default index, the row number starting at 0).
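Putting the DataFrame steps together as a runnable sketch, using the same three rows from these notes:

```python
import pandas as pd

data = [['Ben', 24, 'UK'],
        ['Jacob', 32, 'US'],
        ['Alice', 19, 'Germany']]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

# Show the whole table.
print(df)

# Fetch a single row by its index label (1 is the second row here).
second_row = df.loc[1]
print(second_row)

# Fetch a single column, which comes back as a Series.
ages = df['Age']
print(ages)
```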

DataFrames can then be grouped with groupby, which collects rows that share a value in one or more columns so they can be compared or summarised.

Group by one column (Department) and sum another (Prize) similar to:

df.groupby(['Department'])['Prize'].sum()

Group by one column and get summary statistics for another (count, mean, standard deviation, min, max, and the 25th/50th/75th percentiles):

df.groupby(['Department'])['Prize'].describe()
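A small sketch of both groupby calls on made-up data. Only the column names Department and Prize come from the notes; the rows and the awards variable are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names appear in the notes.
awards = pd.DataFrame({
    'Department': ['Marketing', 'Marketing', 'Logistics', 'Logistics', 'Logistics'],
    'Prize': [100, 250, 50, 75, 125],
})

# Total prize value per department.
prize_totals = awards.groupby(['Department'])['Prize'].sum()
print(prize_totals)

# Count, mean, std, min, quartiles, and max per department.
prize_summary = awards.groupby(['Department'])['Prize'].describe()
print(prize_summary)
```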

Matplotlib

Allows creation of data visualisations such as bar charts, histograms, pie charts, waterfalls, etc. If using it within a Jupyter notebook, begin with %matplotlib inline so plots render inside the notebook; otherwise they may be generated elsewhere. Create plots with the plot() function, similar to:

import matplotlib.pyplot as plt

plt.xlabel('Months of the year')
plt.ylabel('Number of toys produced')
plt.title('A Line Graph Showing Number of Toys Produced Monthly')
plt.plot(['January', 'February', 'March', 'April'],[8,14,23,40])
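Outside a notebook there is no inline display, so one option is to write the figure to an image file instead. This is a sketch of the same line graph under that assumption; the filename toys.png and the choice of the non-interactive Agg backend are mine, not from the task.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to files, no window needed
import matplotlib.pyplot as plt

months = ['January', 'February', 'March', 'April']
toys_produced = [8, 14, 23, 40]

# Build the figure explicitly so it can be saved and closed cleanly.
fig, ax = plt.subplots()
ax.plot(months, toys_produced)
ax.set_xlabel('Months of the year')
ax.set_ylabel('Number of toys produced')
ax.set_title('A Line Graph Showing Number of Toys Produced Monthly')

# Hypothetical output filename.
fig.savefig('toys.png')
plt.close(fig)
```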

Combine Pandas and Matplotlib to create a bar graph based on a csv file similar to:

import pandas as pd
import matplotlib.pyplot as plt

spreadsheet = pd.read_csv('drinks.csv')

drinks = spreadsheet['Drink']
votes = spreadsheet['Vote']

plt.figure(figsize=(10, 6)) # adjust as necessary for readability

plt.barh(drinks, votes, color='skyblue') # barh for horizontal bars; use plt.bar for vertical.
plt.xlabel('Number of Votes')
plt.ylabel('Name of Drink')
plt.title('Bar Graph Showing Favorite Drinks')
plt.gca().invert_yaxis() # optionally invert the y-axis.

The Task

Analyse a packet capture saved as a csv. Answers gotten.
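The kind of analysis involved can be sketched with the Pandas calls above. Everything about the data here is an assumption: the column names (PacketNumber, Source, Destination, Protocol) and the sample rows are made up to stand in for the real capture, which would be loaded from its csv file with pd.read_csv instead.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real capture; column names are assumptions.
sample_csv = io.StringIO(
    "PacketNumber,Source,Destination,Protocol\n"
    "1,10.10.1.4,10.10.1.1,ICMP\n"
    "2,10.10.1.4,10.10.1.2,DNS\n"
    "3,10.10.1.7,10.10.1.1,HTTP\n"
    "4,10.10.1.4,10.10.1.3,ICMP\n"
    "5,10.10.1.9,10.10.1.1,DNS\n"
    "6,10.10.1.4,10.10.1.1,ICMP\n"
)

packets = pd.read_csv(sample_csv)

# How many packets were captured in total?
total_packets = len(packets)

# Which source address sent the most packets?
packets_per_source = packets.groupby(['Source']).size().sort_values(ascending=False)
top_source = packets_per_source.index[0]

# How often does each protocol appear?
protocol_counts = packets['Protocol'].value_counts()

print(total_packets)
print(packets_per_source)
print(protocol_counts)
```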