My Scientific Journey: From Idea to Insights with AI

6 minute read

Published:

Scientific research is an exciting but complex process. It requires structure, creativity, and careful attention to detail. In this article, I describe my own scientific “flow” — the sequence of steps I follow, and show how artificial intelligence (AI) tools can dramatically improve each stage.

1. My Personal Scientific Flow

My approach to research can be represented as five main stages. Each step logically follows from the previous one, creating a cohesive process — from the birth of an idea to final conclusions.

Here’s what this scheme looks like:

```mermaid
graph TD;
    A[Stage 1: Topic and Hypothesis Formulation] --> B[Stage 2: Deep Literature Review];
    B --> C[Stage 3: Research Planning and Data Collection];
    C --> D[Stage 4: Data Analysis];
    D --> E[Stage 5: Interpretation, Writing, and Publication];
```

2. AI Application Points in My Flow

AI is not a replacement for the researcher, but a powerful assistant that automates routine tasks, accelerates analysis, and helps identify non-obvious connections. Let’s examine its role at each stage.

Stage 1: Topic and Hypothesis Formulation

At this stage, the main task is to find a relevant problem that has not yet been fully explored.

  • Tool/Approach: Generative models (ChatGPT, Gemini), trend analysis tools (Scite, ResearchRabbit).
  • Tasks: Brainstorming, searching for gaps in existing research, formulating initial hypotheses. You can ask the model: “What are the open questions in the field of [your field] at the intersection with [another field]?”
  • Benefits: Acceleration. AI quickly generates dozens of ideas and summarizes the latest publications, which would take days to do manually.

Stage 2: Deep Literature Review

Traditional literature review is one of the most labor-intensive stages.

  • Tool/Approach: Retrieval-Augmented Generation (RAG) systems, such as Elicit, Consensus, or specialized AI tools for scientists.
  • Tasks: Quick search, summarization, and systematization of dozens or even hundreds of scientific articles. A RAG system can answer a specific question (“Which treatment methods for disease X showed the highest effectiveness in clinical trials?”), providing a summarized answer with source references.
  • Benefits: Depth of analysis and automation. Instead of reading 100 articles, you can analyze their key findings in an hour. This frees up time for critical thinking.
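To make the RAG idea concrete, here is a minimal, self-contained sketch of its retrieval step: documents and the question are turned into bag-of-words vectors, the closest documents are selected by cosine similarity, and only those would be passed to a generative model as context. The toy corpus and the word-count “embedding” are illustrative assumptions, not the internals of Elicit or Consensus, which use learned embeddings.

```python
import math
from collections import Counter

# Toy corpus standing in for abstracts a RAG system would index (illustrative only)
corpus = [
    "treatment A reduced symptoms of disease X in a randomized clinical trial",
    "treatment B showed no significant effect on disease X outcomes",
    "dietary factors are associated with the prevalence of disease Y",
]

def embed(text):
    """Bag-of-words 'embedding': a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

question = "Which treatments were effective against disease X in clinical trials"
context = retrieve(question, corpus)
print(context)  # the two disease-X abstracts rank above the disease-Y one
```

In a real system, the retrieved `context` is inserted into the model's prompt, which is precisely what lets the answer carry references back to its sources.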

Stage 3: Research Planning and Data Collection

Proper experimental or research design is the key to reliable results.

  • Tool/Approach: AI assistants (Copilot), statistical simulators.
  • Tasks: Help in choosing appropriate statistical methods, calculating the required sample size, and generating code for collecting data from open sources (web scraping).
  • Benefits: Accuracy and efficiency. AI helps avoid common methodological errors and automates data collection.
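As an example of the sample-size task mentioned above, the standard normal-approximation formula for comparing two group means, n ≈ 2·((z₁₋α/₂ + z₁₋β) / d)² per group, can be computed with the Python standard library alone. The effect size d = 0.5 (a “medium” effect in Cohen's convention) is an arbitrary illustrative choice.

```python
import math
from statistics import NormalDist

def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison of means
    (normal approximation; an exact t-based calculation adds a sample or two)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# Medium effect (Cohen's d = 0.5), 5% significance, 80% power
print(sample_size_two_groups(0.5))  # → 63 per group
```

Checking an AI assistant's suggested design against a formula like this is exactly the kind of methodological sanity check that helps avoid underpowered studies.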

Stage 4: Data Analysis

This is the area where AI and machine learning (ML) reveal their full potential.

  • Tool/Approach: ML models (regression, classification, clustering), Python libraries (Pandas, Scikit-learn), AI tools for visualization.
  • Tasks: Identifying complex, non-linear dependencies in data that are difficult to notice with traditional statistical methods. Building predictive models.
  • Benefits: Deeper analysis. ML can find hidden patterns in large datasets, leading to new scientific discoveries.
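A minimal sketch of the “building predictive models” step, run on synthetic data so it stays self-contained: fit a linear model by least squares and check that it recovers the planted coefficients. A real analysis would use Scikit-learn on the actual dataset; the weights and noise level here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic dataset: y depends on two features with known weights, plus noise
X = rng.normal(size=(500, 2))
true_coef = np.array([1.5, -2.0])
y = X @ true_coef + 0.5 + rng.normal(scale=0.1, size=500)

# Least-squares fit with an intercept column
# (Scikit-learn's LinearRegression solves the same problem)
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # close to the planted values [1.5, -2.0, 0.5]
```

The same few lines scale to many features, which is where ML starts surfacing dependencies that are hard to spot by eye.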

Stage 5: Interpretation, Writing, and Publication

Writing a scientific article requires clarity, logic, and adherence to academic style.

  • Tool/Approach: Generative models (GPT for drafts), grammar and style checking tools (Grammarly), AI assistants for translation and formatting.
  • Tasks: Creating drafts of individual sections (introduction, methods description), rephrasing sentences for clarity, checking text for plagiarism, preparing abstracts.
  • Benefits: Speed and quality. AI significantly accelerates the writing process, allowing focus on content rather than form.

3. & 4. Practical Example: Dataset Analysis in Google Colab

For demonstration, I took the classic “Wine Quality” dataset from the UCI repository. It contains chemical characteristics of white wine and its quality rating on a 10-point scale.

Task: Calculate basic descriptive statistics and visualize the distribution of wine quality to understand which ratings are most common.

Here’s the code that can be executed in Google Colab:

```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# URL to the white wine dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

# Load data into a DataFrame; the file uses a semicolon as separator
try:
    wine_df = pd.read_csv(url, sep=';')
    print("Data loaded successfully.")
except Exception as e:
    # Stop here: the rest of the script cannot run without the data
    raise SystemExit(f"Data loading error: {e}")

# Display the first 5 rows for familiarization
print("First 5 rows of data:")
print(wine_df.head())

# 1. Calculate descriptive statistics for all indicators
print("\nDescriptive statistics:")
# .T transposes the output, making it more readable
print(wine_df.describe().T)

# 2. Calculate the mode of the quality rating
quality_mode = wine_df['quality'].mode()[0]
print(f"\nMode of 'quality': {quality_mode}")

# 3. Build the visualization: a histogram of the quality distribution
plt.style.use('seaborn-v0_8-whitegrid')  # Use a modern style
fig, ax = plt.subplots(figsize=(10, 6))

# discrete=True centers one bar on each integer rating, so no manual bins are needed
sns.histplot(wine_df['quality'], discrete=True, kde=False, ax=ax, color='skyblue')

ax.set_title('Distribution of White Wine Quality Ratings', fontsize=16)
ax.set_xlabel('Quality Rating (from 3 to 9)', fontsize=12)
ax.set_ylabel('Number of Samples', fontsize=12)
ax.set_xticks(range(wine_df['quality'].min(), wine_df['quality'].max() + 1))  # Clear marks on the X axis

# Add a count label above each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.show()
```

Distribution of White Wine Quality Ratings

Download PDF with Jupyter notebook execution results

5. Brief Interpretation of Results

Analysis of the provided data allows us to draw several key conclusions:

  1. Central tendencies:

    • The mean quality rating is approximately 5.88.
    • The median (50%) equals 6, meaning half of the wines are rated 6 or lower and half 6 or higher.
    • The mode, the most frequent rating, is also 6.
  2. Data spread:

    • The standard deviation (std) of quality is 0.88, indicating that most ratings cluster close to the mean (mainly in the 5-7 range).
    • Quality ratings range from 3 (min) to 9 (max); the dataset contains no wines with the highest (10) or lowest (1 or 2) ratings.
  3. Conclusions from visualization:

    • The histogram clearly shows that the distribution of quality ratings is not uniform; it resembles a normal distribution with a slight left skew.
    • The vast majority of wines (2198 samples) are rated 6; second place goes to wines rated 5 (1457 samples).
    • Wines with very high (8, 9) or very low (3, 4) ratings are far less common.

General conclusion: Based on these statistics, we can assert that most white wines in this dataset are of “average” quality. This changes the initial intuition: without the analysis, one might assume that ratings are distributed uniformly, but we now see that extreme ratings (very good or very bad) are rare. This can become a starting point for the next step: identifying the chemical characteristics that most distinguish “outstanding” wines (rating 8+) from “average” ones.
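The follow-up step suggested above can be sketched in a few lines of pandas: flag wines as “outstanding” (quality ≥ 8), then compare mean feature values between the two groups. A small synthetic frame is used here so the sketch runs offline; the column names mirror the Wine Quality data, and substituting the real `wine_df` from the Colab example applies the same logic to the full dataset.

```python
import pandas as pd

# Synthetic stand-in with column names mirroring the Wine Quality data (toy values)
wine_df = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 12.5, 13.0, 12.8],
    'density': [1.0010, 1.0005, 1.0002, 0.9910, 0.9905, 0.9908],
    'quality': [5, 6, 6, 8, 8, 9],
})

# Flag outstanding wines and compare group means feature by feature
wine_df['outstanding'] = wine_df['quality'] >= 8
group_means = wine_df.drop(columns='quality').groupby('outstanding').mean()
print(group_means)

# The feature with the largest absolute gap between groups is a natural
# candidate for a closer statistical look
gap = (group_means.loc[True] - group_means.loc[False]).abs()
print(gap.idxmax())  # 'alcohol' in this toy example
```

On the real dataset, a comparison like this (followed by a proper significance test) is exactly the kind of routine exploration that AI assistants generate quickly, leaving the researcher to judge which differences are scientifically meaningful.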